Version Control for 77,000 Examples: Git at Scale with DVC
Read Time: 22 minutes | Level: Advanced | Production-Tested Strategy
The Scale Challenge Solved
Managing 77,000 examples without proper version control nearly killed the project. Here's what went wrong and how we fixed it.
The Nightmare: Pre-Version Control
Month 6 disaster:
- 23,000 examples in various folders
- 12 team members with conflicting copies
- No change tracking or rollback capability
- Manual merging taking 8 hours per integration
- Lost 2,400 examples due to accidental overwrites
The breaking point:
# This was our "version control"
datasets/
├── final_v1/
├── final_v2/
├── ACTUALLY_final/
├── final_FIXED/
├── training_data_march/
├── training_data_march_BACKUP/
├── training_data_march_john_edits/
└── DONT_DELETE_training_data_march_sarah/
The Solution: Git (https://git-scm.com/) + DVC (https://dvc.org/) Architecture
Git handles:
- Code and configuration files
- Metadata and annotation schemas
- Processing scripts and validation code
- Branching and collaboration workflows
DVC (Data Version Control) handles:
- Large dataset files (1.2TB total)
- Binary training examples
- Model artifacts and checkpoints
- Dataset splits and preprocessed data
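The split works because DVC never stores the data itself in Git. When you run dvc add, the file moves into DVC's cache and a small pointer file takes its place in the repository. A minimal sketch of what that looks like (the file name, hash, and size below are illustrative):
# Track one large file with DVC (hypothetical file name)
dvc add data/raw/examples_batch_01.jsonl
# DVC hashes the file into its cache, gitignores the original path,
# and writes a small .dvc pointer that Git versions instead:
cat data/raw/examples_batch_01.jsonl.dvc
# outs:
# - md5: 3f2a9c0e5b1d4e7f8a6b2c1d0e9f8a7b
#   size: 524288000
#   path: examples_batch_01.jsonl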
The Complete Architecture
Repository Structure
ai-training-dataset/
├── .git/                    # Git repository
├── .dvc/                    # DVC configuration
├── .dvcignore               # DVC ignore patterns
├── data/
│   ├── raw/                 # Raw examples (.dvc tracked)
│   ├── processed/           # Processed examples (.dvc tracked)
│   ├── splits/              # Train/val/test splits (.dvc tracked)
│   └── metadata/            # JSON metadata (git tracked)
├── scripts/
│   ├── preprocessing/       # Data processing scripts
│   ├── validation/          # Quality validation
│   └── augmentation/        # Data augmentation
├── configs/
│   ├── data_schema.yaml     # Data structure definitions
│   ├── quality_rules.yaml   # Quality validation rules
│   └── pipeline.yaml        # Processing pipeline config
├── docs/
│   ├── CHANGELOG.md         # Dataset version changes
│   ├── SCHEMA.md            # Data schema documentation
│   └── CONTRIBUTING.md      # Collaboration guidelines
└── dvc.yaml                 # DVC pipeline definition
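The .dvcignore file keeps DVC from hashing files that Git already versions. Here is one possible set of patterns for this layout (the exact entries are an assumption; adjust them to your repo):
# Illustrative .dvcignore for this layout
cat > .dvcignore <<'EOF'
# Keep DVC from hashing Git-managed files inside data/
data/metadata/
*.md
EOF
# Note: you don't maintain .gitignore for the data directories by hand;
# "dvc add data/raw/" writes the matching entry into data/.gitignore automatically.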
DVC Setup and Configuration
# Initialize Git (if needed) and DVC in the dataset repository
cd ai-training-dataset
git init
dvc init
# Configure remote storage (AWS S3)
dvc remote add -d s3remote s3://your-dataset-bucket/data
dvc remote modify s3remote region us-west-2
# Configure the DVC cache for large datasets
dvc cache dir /opt/dvc-cache                         # local cache on a large volume
dvc remote add s3cache s3://your-dataset-bucket/cache
dvc config cache.s3 s3cache                          # S3-backed cache for external outputs
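One detail worth getting right from day one: keep AWS credentials out of the repository. DVC's --local flag writes settings to .dvc/config.local, which is not committed. A sketch (the placeholder keys and profile name are obviously not real):
# Store credentials in .dvc/config.local (never committed), not .dvc/config
dvc remote modify --local s3remote access_key_id     'AKIA................'
dvc remote modify --local s3remote secret_access_key '........................'
# Or point the remote at a named AWS profile instead of raw keys
dvc remote modify s3remote profile dataset-team
# Verify what is configured
dvc remote list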
The Branching Strategy
Feature Branch Workflow
# Create feature branch for new data addition
git checkout -b feature/medical-domain-expansion
# Add new medical domain examples
dvc add data/raw/medical_examples.jsonl
git add data/raw/medical_examples.jsonl.dvc
git commit -m "Add 5,000 medical domain examples
- Covers cardiology, neurology, radiology
- Quality score: 8.7/10 average
- Source: Expert medical professionals"
# Push data to remote storage
dvc push
# Push code changes to Git
git push origin feature/medical-domain-expansion
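Because Git only sees the pointer file, reviewers inspect the actual data change with DVC before merging. A sketch of the review step (branch names follow the example above):
# Reviewer: compare DVC-tracked data between develop and the feature branch
git fetch origin feature/medical-domain-expansion
dvc diff develop origin/feature/medical-domain-expansion
# Lists added, modified, and deleted data files between the two revisions;
# run "dvc pull" if you also want to open the new examples locally.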
Data Pipeline Integration
DVC Pipeline Definition
# dvc.yaml
stages:
  prepare:
    cmd: python scripts/preprocessing/prepare_data.py
    deps:
      - data/raw/
      - scripts/preprocessing/prepare_data.py
    outs:
      - data/prepared/
  validate:
    cmd: python scripts/validation/validate_quality.py
    deps:
      - data/prepared/
      - configs/quality_rules.yaml
    metrics:
      - metrics/quality_scores.json
  augment:
    cmd: python scripts/augmentation/augment_dataset.py
    deps:
      - data/prepared/
      - configs/augmentation_config.yaml
    outs:
      - data/augmented/
  split:
    cmd: python scripts/preprocessing/create_splits.py
    deps:
      - data/augmented/
    outs:
      - data/splits/train.jsonl
      - data/splits/val.jsonl
      - data/splits/test.jsonl
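With the stages defined, DVC only reruns what actually changed. A typical run looks something like this (standard DVC commands; the metrics path matches the validate stage above):
# Rerun only the stages whose dependencies changed, then inspect the results
dvc repro
dvc dag                                      # visualize the stage graph
dvc metrics show metrics/quality_scores.json
# Commit the lock file so teammates reproduce identical outputs
git add dvc.yaml dvc.lock
git commit -m "Reproduce pipeline"
dvc push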
Collaboration Workflows
Team Member Onboarding
# New team member setup
git clone https://github.com/company/ai-training-dataset.git
cd ai-training-dataset
# Setup DVC and pull data
dvc install
dvc pull
# Verify setup
dvc status
python scripts/validation/verify_setup.py
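New contributors rarely need the full 1.2TB on day one. dvc pull accepts targets, so a lighter setup can look like this (the stage and file names are the ones used elsewhere in this article):
# Pull only what you need instead of the whole dataset
dvc pull split                                   # outputs of the "split" stage only
dvc pull data/raw/medical_examples.jsonl.dvc     # a single tracked file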
Daily Workflow for Contributors
# Start of day: sync with latest
git pull origin develop
dvc pull
# Create feature branch
git checkout -b feature/improve-quality-scores
# Make changes to dataset
# ... edit files, add examples, etc ...
# Track new data files
dvc add data/improved/new_examples.jsonl
# Commit changes
git add .
git commit -m "Improve quality scores for edge cases"
# Push data and code
dvc push
git push origin feature/improve-quality-scores
Storage Optimization
Cloud Storage Strategy
S3 Bucket Structure:
s3://ai-training-datasets/
├── datasets/
│   ├── v1.0/        # Immutable version snapshots
│   ├── v2.0/
│   └── current/     # Working versions
├── cache/           # DVC cache storage
├── backups/
│   ├── daily/
│   └── weekly/
└── exports/         # Dataset exports for clients
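One way to map that bucket layout onto DVC remotes, including the separate releases remote used in the release workflow below (bucket paths and remote names are assumptions; adjust to your setup):
dvc remote add -d s3remote          s3://ai-training-datasets/datasets/current
dvc remote add    s3remote-releases s3://ai-training-datasets/datasets    # immutable snapshots
dvc remote add    s3backup          s3://ai-training-datasets/backups/daily
dvc remote modify s3remote region us-west-2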
Version Management Strategies
Semantic Versioning for Datasets
Dataset Version Format: MAJOR.MINOR.PATCH
MAJOR: Breaking changes to schema or format
MINOR: New features, data additions, non-breaking changes
PATCH: Bug fixes, quality improvements, small corrections
Examples:
v1.0.0 - Initial 7,000 examples
v1.1.0 - Added augmentation (77,000 examples)
v1.1.1 - Fixed quality issues in medical domain
v2.0.0 - New schema with additional metadata fields
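Because every Git tag pins the exact set of .dvc pointer files, restoring any published dataset version takes two commands:
# Reproduce the exact dataset state of a published version
git checkout v1.1.0
dvc pull          # fetches and checks out the data referenced by that tag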
Release Management
# Create release branch
git checkout -b release/v2.1.0
# Finalize version
echo "2.1.0" > VERSION
git add VERSION
git commit -m "Bump version to 2.1.0"
# Create release tag
git tag -a v2.1.0 -m "Release v2.1.0
- Added 10,000 new examples
- Improved quality scores by 15%
- Enhanced metadata schema
- Better edge case coverage"
# Push release
git push origin release/v2.1.0
git push origin v2.1.0
# Create immutable dataset snapshot
dvc commit
dvc push -r s3remote-releases
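Downstream consumers (or the exports/ area for clients) don't need to clone the full repository to get a release. dvc get and dvc import can pull a pinned path straight from a Git URL (the repository URL matches the onboarding example; output paths are illustrative):
# One-off copy of the v2.1.0 splits, no repo clone required
dvc get https://github.com/company/ai-training-dataset data/splits --rev v2.1.0 -o exports/v2.1.0
# Or keep a tracked, updatable link to the release in another project
dvc import https://github.com/company/ai-training-dataset data/splits --rev v2.1.0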
Business Impact
Collaboration Efficiency
Metrics improvement:
- Integration time: 95% reduction (8 hours → 24 minutes)
- Merge conflicts: 89% reduction
- Data loss incidents: Zero (from 3 major losses)
- Team velocity: 340% increase
- Onboarding time: 78% reduction (2 days → 5.3 hours)
Cost Analysis
Infrastructure costs:
- S3 storage: $145/month (1.2TB)
- Transfer costs: $23/month
- GitHub LFS alternative cost: $450/month
- Savings: $282/month (63% cost reduction)
Development efficiency:
- Reduced debugging time: 15 hours/week saved
- Faster iteration cycles: 3x improvement
- Quality gate automation: 22 hours/week saved
- Total efficiency gain: 40 hours/week
Implementation Roadmap
Week 1: Foundation Setup
# Day 1: Repository setup
git init ai-training-dataset
cd ai-training-dataset
dvc init
# Day 2: Configure remotes
dvc remote add -d s3 s3://your-bucket/data
dvc remote add backup s3://your-backup-bucket/data
# Day 3: Initial data migration
dvc add data/raw/
git add data/raw.dvc data/.gitignore
git commit -m "Initial dataset commit"
# Day 4-5: Team setup and testing
# Train team members on workflow
# Test collaboration scenarios
Week 2: Pipeline Integration
# Setup DVC pipelines
dvc stage add -n prepare -d data/raw/ -o data/prepared/ \
python scripts/prepare.py
# Configure quality gates
# Setup automated validation
# Integrate with CI/CD
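For the CI/CD step, a minimal sketch of what a pipeline job can run on every pull request (script and metrics paths are the ones assumed earlier; quality thresholds live in your validation script):
# CI job: reproduce the pipeline and fail the build if outputs drift
dvc pull
dvc repro
dvc metrics show metrics/quality_scores.json
git diff --exit-code dvc.lock     # fails if contributors forgot to commit reproduced outputs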
The Git + DVC solution transformed our 77,000-example dataset from a collaboration nightmare into a streamlined, scalable system that supports 6 parallel teams and continuous integration.
Your next step: start with a small pilot. Version control 1,000 examples with DVC, then scale up gradually; the collaboration benefits appear immediately.
Ready to scale your dataset version control? Get the complete Git + DVC setup guide, automation scripts, and team collaboration templates that manage our 77,000-example dataset.