Building Your First Dataset - My 77,000 Example Journey

In 2023, I decided to create a PostgreSQL expert AI. Not because I had to, but because I was curious: Could I make an AI that knew PostgreSQL better than most developers?
Six months later, I had built 77,175 training examples. The result? An AI that could debug PostgreSQL like a senior DBA. Let me show you exactly how I did it.
📊 My Story: From Zero to 77,000 Training Examples
The Challenge
- ✗ No existing PostgreSQL-specific models
- ✗ Generic models gave incorrect SQL advice
- ✗ Documentation was scattered everywhere
The Journey
- ✓ Time invested: 6 months
- ✓ Examples created: 77,175
- ✓ Coffee consumed: ~500 cups
- ✓ Result: AI debugging like a senior DBA
What is Training Data, Really?
Training data is like a cookbook for AI: each example pairs a question (the dish someone wants) with a model answer (the recipe), and the model learns the patterns by studying thousands of these pairs.
Real Examples from My Dataset
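To make the cookbook analogy concrete, here is the kind of question-and-answer pair the dataset is built from (the wording below is illustrative rather than a verbatim entry):

```json
{
  "question": "Why does my query use a sequential scan even though I created an index?",
  "answer": "PostgreSQL skips an index when the planner estimates a sequential scan is cheaper. Common causes: stale statistics (run ANALYZE), a table small enough that scanning it is faster, a data type mismatch between the column and the query parameter, or a function wrapped around the indexed column. Check the actual plan with EXPLAIN (ANALYZE, BUFFERS)."
}
```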
The Data Collection Strategy
Here's exactly how I built my dataset:
Source 1: Stack Overflow Mining
18,000 examples
What I did:
- 1. Scraped PostgreSQL tagged questions
- 2. Filtered for answered questions with 5+ upvotes
- 3. Cleaned and formatted Q&A pairs
Quality tricks:
- • Only kept accepted answers
- • Removed outdated version-specific info
- • Combined multiple good answers into comprehensive responses
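If you want to try the same approach, here is a minimal sketch of that pipeline using the public Stack Exchange API. The endpoint and parameter names are from the v2.3 API, so double-check the current docs, mind the rate limits, and respect the CC BY-SA licence on Stack Overflow content:

```python
"""Minimal sketch of the Stack Overflow mining step (Stack Exchange API v2.3).
Verify parameter and filter names against the current docs before relying on them."""
import requests

API = "https://api.stackexchange.com/2.3"

def fetch_qa_pairs(tag="postgresql", min_score=5, pages=2):
    pairs = []
    for page in range(1, pages + 1):
        # Questions tagged `tag` with at least `min_score` upvotes, bodies included.
        questions = requests.get(f"{API}/questions", params={
            "site": "stackoverflow", "tagged": tag, "sort": "votes",
            "order": "desc", "min": min_score, "page": page,
            "pagesize": 50, "filter": "withbody",
        }).json().get("items", [])
        answered = [q for q in questions if q.get("accepted_answer_id")]
        if not answered:
            continue
        ids = ";".join(str(q["question_id"]) for q in answered)
        # Pull the answers for those questions and keep only the accepted ones.
        answers = requests.get(f"{API}/questions/{ids}/answers", params={
            "site": "stackoverflow", "filter": "withbody", "pagesize": 100,
        }).json().get("items", [])
        accepted = {a["question_id"]: a["body"] for a in answers if a["is_accepted"]}
        for q in answered:
            if q["question_id"] in accepted:
                # Bodies are HTML; strip tags and clean them up before training.
                pairs.append({"question": f"{q['title']}\n{q['body']}",
                              "answer": accepted[q["question_id"]]})
    return pairs
```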
Source 2: Mailing List Gold
32,385 examples
PostgreSQL mailing lists = 20+ years of expert discussions
My process:
- 1. Downloaded pgsql-general archives
- 2. Extracted problem-solution threads
- 3. Converted discussions to Q&A format
Why this was golden:
- • Real production problems
- • Solutions from PostgreSQL core developers
- • Edge cases you won't find anywhere else
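The archives are published as mbox files, so Python's standard mailbox module is enough for a first pass. The sketch below is a simplified version of the idea: the subject-based threading and the "first post is the question, replies are answer candidates" heuristic are assumptions, and real threads need far more cleanup (signatures, quoted text, MIME parts):

```python
"""Simplified sketch of converting a downloaded pgsql-general mbox archive
into Q&A candidates. Heuristics here are illustrative, not my exact pipeline."""
import mailbox
from collections import defaultdict

def extract_threads(mbox_path):
    threads = defaultdict(list)  # normalised subject -> message bodies, in order
    for msg in mailbox.mbox(mbox_path):
        subject = (msg["Subject"] or "").replace("Re:", "").strip()
        body = msg.get_payload(decode=True)  # None for multipart messages (skipped here)
        if subject and body:
            threads[subject].append(body.decode("utf-8", errors="replace"))
    # First post = problem statement, later posts = candidate solutions.
    return [{"question": posts[0], "answer_candidates": posts[1:]}
            for posts in threads.values() if len(posts) > 1]
```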
Source 3: Documentation Examples
5,985 examples
Official docs are great but dense, so I rewrote them as question-and-answer pairs.
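For instance, a dense paragraph about a configuration setting might become a focused pair like this (illustrative wording, not an actual entry from my dataset):

```json
{
  "question": "What does the work_mem setting control in PostgreSQL?",
  "answer": "work_mem is the amount of memory a single sort or hash operation may use before it spills to temporary files on disk. A complex query can run several such operations at once, so effective usage can be a multiple of work_mem. Raise it cautiously, ideally per session: SET work_mem = '64MB';"
}
```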
Source 4: GitHub Issues
3,039 examples
Popular PostgreSQL projects have issue trackers full of real problems:
- 1. Scraped issues from pgAdmin, Postgres, and major extensions
- 2. Extracted problem descriptions and solutions
- 3. Included error messages and stack traces
This gave me real-world debugging scenarios!
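A rough sketch of the harvesting step, using the GitHub REST API. The repository name and the "last comment on a closed issue is the resolution" heuristic are illustrative assumptions, and real runs need an auth token, pagination, and better filtering:

```python
"""Sketch of harvesting closed issues with the GitHub REST API.
Repo name and the last-comment heuristic are illustrative assumptions."""
import requests

def closed_issue_pairs(repo="pgadmin-org/pgadmin4", per_page=30):
    base = f"https://api.github.com/repos/{repo}"
    issues = requests.get(f"{base}/issues",
                          params={"state": "closed", "per_page": per_page}).json()
    pairs = []
    for issue in issues:
        if "pull_request" in issue:  # the issues endpoint also returns pull requests
            continue
        comments = requests.get(issue["comments_url"]).json()
        if comments:  # crude heuristic: treat the final comment as the resolution
            pairs.append({"question": f"{issue['title']}\n\n{issue.get('body') or ''}",
                          "answer": comments[-1]["body"]})
    return pairs
```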
Source 5: My Secret Weapon - Synthetic Data
10,005 examples
I generated variations using templates: take a question pattern, swap in different table names, settings, or error messages, and pair each variation with a verified answer.
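Here is a toy version of the idea. The templates and slot values below are illustrative placeholders, and in practice every generated question still needs a hand-written or carefully verified answer:

```python
"""Toy version of the template approach; templates and slot values are illustrative."""
import random

TEMPLATES = [
    "How do I find the {n} largest tables in my database?",
    "What does the {setting} setting do and when should I change it?",
    "How can I see which queries currently hold locks on {obj}?",
]
SLOTS = {
    "n": ["5", "10", "20"],
    "setting": ["shared_buffers", "work_mem", "maintenance_work_mem"],
    "obj": ["a table", "an index", "a sequence"],
}

def generate(n_examples=50):
    examples = []
    for _ in range(n_examples):
        template = random.choice(TEMPLATES)
        values = {k: random.choice(v) for k, v in SLOTS.items()}  # extra keys are ignored by format()
        examples.append({"question": template.format(**values),
                         "answer": "<verified answer for this variation>"})
    return examples
```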
Quality Control: The Make-or-Break Step
Bad data = Bad AI. Here's how I ensured quality:
The Validation Process
Step 1: Automated checks (a code sketch follows this list)
- • No empty responses
- • Minimum 50 characters per answer
- • No obvious errors (like MySQL syntax in PostgreSQL context)
Step 2: Random sampling
- • Manually reviewed 5% of examples
- • If error rate > 2%, reviewed entire batch
Step 3: Technical accuracy
- • Tested SQL examples in actual PostgreSQL
- • Verified version-specific features
- • Checked for deprecated syntax
Step 4: Diversity check
- • Ensured mix of difficulty levels
- • Covered all major PostgreSQL features
- • Included both common and edge cases
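To give a feel for Step 1, here is a minimal sketch of the automated checks: empty answers, the 50-character minimum, and a crude scan for MySQL syntax that has no place in PostgreSQL answers. The pattern list is illustrative, not exhaustive:

```python
"""Minimal sketch of the Step 1 automated checks; the MySQL-ism patterns are illustrative."""
import re

MYSQL_SMELLS = [r"\bAUTO_INCREMENT\b", r"ENGINE\s*=\s*(InnoDB|MyISAM)", r"\bLIMIT\s+\d+\s*,\s*\d+"]

def passes_automated_checks(answer: str) -> bool:
    answer = answer.strip()
    if not answer:                       # no empty responses
        return False
    if len(answer) < 50:                 # minimum 50 characters per answer
        return False
    for pattern in MYSQL_SMELLS:         # no obvious MySQL syntax in a PostgreSQL context
        if re.search(pattern, answer, flags=re.IGNORECASE):
            return False
    return True
```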
The Format That Works
After much experimentation, here's the winning format:
{ "conversations": [ { "role": "system", "content": "You are a PostgreSQL expert database administrator with 20 years of experience." }, { "role": "user", "content": "My database backup is taking 5 hours. How can I speed it up?" }, { "role": "assistant", "content": "Here are several ways to speed up PostgreSQL backups: 1. Use parallel backup with pg_dump -j option: pg_dump -j 4 -Fd -f backup_dir dbname 2. Consider using pg_basebackup for physical backups 3. Implement incremental backups with pgBackRest or WAL-G 4. Optimize your I/O: backup to different disk than database 5. Use compression carefully - sometimes CPU becomes bottleneck" } ] }
Common Mistakes I Made (So You Don't Have To)
Mistake 1: Too Similar Examples
Hundreds of near-duplicate questions teach the model one phrasing, not the underlying concept.
Mistake 2: Forgetting Context
Questions that leave out the PostgreSQL version, the schema, or the error message produce answers the model can never properly ground.
Mistake 3: One-Size-Fits-All Answers
Reusing the same boilerplate answer for different questions trains the model to be vague.
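For the first mistake, a cheap near-duplicate gate would have saved me a lot of cleanup. Something like the sketch below works fine at this scale (difflib from the standard library); embedding-based similarity scales better:

```python
"""Illustrative near-duplicate check: normalise the question text, then compare with difflib."""
from difflib import SequenceMatcher

def is_near_duplicate(question, existing_questions, threshold=0.9):
    q = " ".join(question.lower().split())           # normalise whitespace and case
    for other in existing_questions:
        o = " ".join(other.lower().split())
        if SequenceMatcher(None, q, o).ratio() >= threshold:
            return True
    return False
```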
Your Turn: Start with a Small Dataset (100 Examples)
Pick Your Domain
Something you know well:
- • Your job field
- • A hobby
- • A skill you have
Create 100 Examples Using This Framework:
Format:
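A reasonable starting point is the same conversation structure shown earlier, with placeholders for your own domain:

```json
{
  "conversations": [
    {"role": "system", "content": "You are an expert <your domain> assistant."},
    {"role": "user", "content": "<a realistic question someone in your domain would ask>"},
    {"role": "assistant", "content": "<the answer you would want the model to give>"}
  ]
}
```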
Tools You'll Need:
- • Spreadsheet or text editor
- • JSON formatter (free online)
- • Domain knowledge or research ability
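If you would rather not paste your data into an online formatter, Python's built-in json module does the same check locally. This assumes the whole dataset is one JSON array; adapt it if you use JSON Lines:

```python
# Validate a dataset file locally instead of pasting it into an online formatter.
import json, sys

with open(sys.argv[1], encoding="utf-8") as f:
    data = json.load(f)  # raises json.JSONDecodeError if the file is malformed
count = len(data) if isinstance(data, list) else 1
print(f"Valid JSON - {count} top-level item(s)")
```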
The Results: Was It Worth It?
After 6 months and 77,175 examples:
The Good
- ✓ Model knew PostgreSQL inside-out
- ✓ Could debug complex issues
- ✓ Knew version-specific features and quirks
- ✓ Query optimization suggestions were spot-on
The Investment
- Time: ~500 hours
- Cost: ~$50 (OpenAI API for validation)
- Learning: Priceless
- Satisfaction: Enormous
The Outcome
- → Model performs better than GPT-4 on PostgreSQL tasks
- → Being used by 1000+ developers
- → Saved countless debugging hours
- → Proved that individuals can create specialized AI
Lessons Learned
Quality > Quantity
1,000 excellent examples > 10,000 mediocre ones
Real Data > Synthetic
But synthetic fills gaps well
Diversity Matters
Cover edge cases, not just common cases
Test Everything
Bad data compounds during training
Document Sources
You'll need to update/improve later
🎓 Key Takeaways
- ✓ Training data is the foundation - quality datasets make quality AI
- ✓ Multiple sources are best - Stack Overflow, mailing lists, docs, GitHub issues, synthetic data
- ✓ Quality control is critical - automate checks, manually sample, test accuracy
- ✓ Consistency matters - use a standardized format for all examples
- ✓ Start small - 100 examples is enough to begin your journey
Ready to Learn How to Train AI?
In Chapter 8, discover pre-training vs fine-tuning, learning rates, and the complete training process with real code examples!