A Great Way to DeDupe Image Datasets
Updated: Mar 13
In terms of manual labor, preparing data for an ML pipeline more often than not takes the majority of the time. Furthermore, building or extending a dataset usually costs an astronomical amount of time, subtasks, and attention to detail. The latter led me to a great command-line tool for cleaning out duplicates and near-duplicates, especially when used with iTerm2 — namely imgdupes.
Note that the aim here is to introduce imgdupes. See the project's reference for the technical details of its specifications, algorithms, and options (or stay tuned for a future post on the details).
Problem Statement: De-Duplicating an Image Set
My situation while building a facial image database was as follows: a top-level directory containing multiple subdirectories, with each subdirectory holding the images for its respective class. This is a common scenario in ML tasks, as many renowned datasets follow the same convention: separate class samples by directory, both for convenience and as explicit labels. In my case, I was cleaning face data, and each subdirectory was named for the identity of the faces it contained.
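Concretely, the layout looked something like the following (the names here are hypothetical; this is the same class-per-folder convention used by, e.g., torchvision's `ImageFolder`):

```
dataset/
├── person_a/
│   ├── img_0001.jpg
│   └── img_0002.jpg
└── person_b/
    ├── img_0001.jpg
    └── img_0003.jpg
```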
Knowing there were several duplicates and near-duplicates (e.g., neighboring video frames), and that these were harmful for the problem I aimed to solve, I needed an algorithm or tool to find them. Precisely, I needed a tool to discover, display, and prompt to delete all duplicate images. I was fortunate to stumble upon a wonderful Python-based command-line tool called imgdupes.
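To make the idea concrete, here is a minimal sketch of the kind of perceptual hashing such tools rely on: a difference hash (dHash), where each bit records whether a pixel is brighter than its right-hand neighbor, and images whose hashes sit within a small Hamming distance are flagged as near-duplicates. Everything here (the toy pixel grids, the threshold, the helper names) is illustrative, not imgdupes' actual implementation:

```python
from itertools import combinations

def dhash_bits(gray, hash_size=8):
    # gray: 2D list of brightness values, hash_size rows x (hash_size + 1) cols.
    # Each bit records whether a pixel is brighter than its right neighbor,
    # so a uniform brightness shift leaves the hash unchanged.
    bits = []
    for row in gray:
        for x in range(hash_size):
            bits.append(1 if row[x] > row[x + 1] else 0)
    return bits

def hamming(a, b):
    # Number of bit positions where the two hashes disagree.
    return sum(x != y for x, y in zip(a, b))

# Toy 8x9 grayscale "images": b is a brightened copy of a (a near-duplicate),
# while c is unrelated noise.
a = [[(r * 9 + c) % 17 for c in range(9)] for r in range(8)]
b = [[v + 1 for v in row] for row in a]
c = [[(r * 7 + c * 3) % 5 for c in range(9)] for r in range(8)]

images = {"a.jpg": a, "b.jpg": b, "c.jpg": c}
hashes = {name: dhash_bits(img) for name, img in images.items()}

threshold = 4  # max Hamming distance at which two images count as duplicates
dupes = [(m, n) for m, n in combinations(hashes, 2)
         if hamming(hashes[m], hashes[n]) <= threshold]
print(dupes)  # only the (a.jpg, b.jpg) pair is flagged
```

A real run over an image directory would compute such hashes from decoded, downscaled grayscale pixels and group files by hash distance; imgdupes handles that pipeline (and the interactive delete prompt) for you.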