Collecting clean data for AI training
Garbage in, garbage out. Except in AI, where it's garbage in, confidently wrong garbage out, with a citation.
1. Humans are creative spellers
I asked field workers to enter an animal's condition. I received 'healthy', 'helthy', 'fine', 'ok', 'okayish', and one heartfelt 'idk looks tired.' A model cannot learn from vibes. It needs structure, dropdowns, and the gentle removal of free will.
Constrain the input or spend your weekends cleaning it. I learned this the way everyone learns things: the slow, regrettable way.
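Here is a minimal sketch of what "constrain the input" might look like in Python. The `Condition` categories and the `parse_condition` helper are hypothetical names made up for illustration; your own list of allowed values will differ. The idea is that anything outside the dropdown gets rejected at entry instead of guessed at later.

```python
from enum import Enum
from typing import Optional


class Condition(Enum):
    # Hypothetical categories; replace with whatever your dropdown offers.
    HEALTHY = "healthy"
    INJURED = "injured"
    SICK = "sick"
    UNKNOWN = "unknown"


def parse_condition(raw: str) -> Optional[Condition]:
    """Return the matching Condition, or None if the text is not an allowed value."""
    cleaned = raw.strip().lower()
    try:
        return Condition(cleaned)
    except ValueError:
        # Deliberately no fuzzy matching: 'helthy' is rejected, not guessed.
        return None


if __name__ == "__main__":
    for entry in ["healthy", " Healthy ", "helthy", "idk looks tired"]:
        print(repr(entry), "->", parse_condition(entry))
```

Returning `None` rather than guessing is a design choice: the form can ask the person to pick again while they are still standing next to the animal.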
2. Validate at the door, not at the model
The cheapest place to catch bad data is the exact moment someone types it. The most expensive place is three months later when your model insists every cow weighs 4 grams.
I validate on the device, again on the API, and one more time before training: belt, suspenders, and a backup pair of pants.
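One way to keep those three checks consistent is to write the rule once and call it from every layer. The sketch below assumes a hypothetical `AnimalRecord` with a weight in kilograms; the bounds are illustrative placeholders, not real livestock guidance.

```python
from dataclasses import dataclass
from typing import List

# Assumed plausibility bounds in kilograms, for illustration only.
MIN_WEIGHT_KG = 20.0
MAX_WEIGHT_KG = 1500.0


@dataclass
class AnimalRecord:
    # Hypothetical record shape; real fields will vary.
    animal_id: str
    weight_kg: float


def validate_record(record: AnimalRecord) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors = []
    if not record.animal_id.strip():
        errors.append("animal_id is empty")
    # Written as 'not (a <= x <= b)' so that NaN also fails the check.
    if not (MIN_WEIGHT_KG <= record.weight_kg <= MAX_WEIGHT_KG):
        errors.append(
            f"weight_kg {record.weight_kg} is outside "
            f"{MIN_WEIGHT_KG}-{MAX_WEIGHT_KG}"
        )
    return errors


if __name__ == "__main__":
    # A 4 gram cow is 0.004 kg and should be flagged.
    print(validate_record(AnimalRecord("cow-17", 0.004)))
    print(validate_record(AnimalRecord("cow-18", 550.0)))
```

The same function could run in the entry form, in the API handler, and in the pre-training pass, so a rule changed in one place changes everywhere.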
3. Boring data is beautiful data
Everyone wants the exciting model. Nobody wants the spreadsheet hygiene that makes it possible. But clean, boring, consistent data is the unglamorous secret behind every 'wow' demo.
If your dataset were a houseguest, it should be the one who takes their shoes off without being asked.
