Dataset is contaminated

#2
by LPN64 - opened

I downloaded the dataset available on Roboflow and it is full of contamination: the same images appear in both the train and test sets with only minor manual augmentation. It's horrible.

Simple example from train and test:

CarLongPlate307_jpg.rf.de23385fd41895fdb8f7fec44cd3eb9a.jpg

CarLongPlateGen3370_jpg.rf.bbe05d0c4eeccecce52bfc9afdf8d48b.jpg


Hi @LPN64, thank you for the careful audit and for flagging this. You're absolutely right, and I appreciate you taking the time to document it with concrete examples.
To be transparent: this model was fine-tuned directly on the Roboflow license-plate-recognition-rxg4e dataset without re-auditing the train/test split, so the contamination you found is inherited from the source. That means the reported metrics on the model card are very likely overestimated, because the test set isn't a true held-out evaluation. That's on me for not validating the split before publishing, and I'll add a clear disclaimer to the model card today.
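
For anyone who wants to check a split themselves, here is a minimal sketch of one possible approach, not the procedure used for this model: compute an average hash (aHash) per image and flag train/test pairs whose hashes differ by only a few bits. The function names, the 8x8 hash size, and the `max_distance=5` threshold are all assumptions chosen for illustration. The sketch takes images as 2D lists of grayscale values, so loading real files with an imaging library is left out. Average hashing tends to survive small brightness or noise changes but will likely miss crops, flips, and rotations, so treat it as a first pass only.

```python
def average_hash(pixels, hash_size=8):
    """Return an integer average hash for a 2D list of grayscale values."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Downsample to hash_size x hash_size by averaging rectangular blocks.
    for i in range(hash_size):
        r0 = i * height // hash_size
        r1 = max((i + 1) * height // hash_size, r0 + 1)
        for j in range(hash_size):
            c0 = j * width // hash_size
            c1 = max((j + 1) * width // hash_size, c0 + 1)
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def find_cross_split_duplicates(train, test, max_distance=5):
    """Return (train_name, test_name, distance) for likely near-duplicates.

    `train` and `test` map image names to 2D grayscale pixel lists.
    `max_distance` is an illustrative threshold, not a tuned value.
    """
    test_hashes = {name: average_hash(px) for name, px in test.items()}
    matches = []
    for train_name, train_px in train.items():
        train_hash = average_hash(train_px)
        for test_name, test_hash in test_hashes.items():
            distance = hamming_distance(train_hash, test_hash)
            if distance <= max_distance:
                matches.append((train_name, test_name, distance))
    return matches


# Synthetic demo: a gradient image, a slightly brightened copy, and an inverted copy.
base = [[r * 16 + c for c in range(16)] for r in range(16)]
brighter = [[min(255, v + 3) for v in row] for row in base]
inverted = [[255 - v for v in row] for row in base]

pairs = find_cross_split_duplicates(
    {"train_img": base},
    {"test_brighter": brighter, "test_inverted": inverted},
)
print(pairs)
```

Any pair this flags should still be reviewed by eye before it is removed from a split, since the comparison is quadratic in the number of images and the threshold is only a guess.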

Perfect!
