[TAG] JPEG de-duplication

Ben Okopnik ben at linuxgazette.net
Thu Jul 29 07:52:05 MSD 2010


On Thu, Jul 29, 2010 at 08:20:21AM +0530, Kapil Hari Paranjape wrote:
> 
> What _should_ be possible (but slow!) is to write something that uses
> the magick library to convert the image into a standard bitmap (like
> ppmraw) and _then_ match signatures (or just do a bit-by-bit
> comparison). This would work fine for lossless compression like png
> but will not be so great for lossy formats like jpeg. Moreover, there
> would be problems of comparison between vector and bitmap formats
> since the conversion to bitmap would be lossy in the former case.

Actually, for the real-world case of comparing camera-produced images, I
think we can reject any that aren't in the same format (comparing across
formats would be a much more complex task, I agree.) If we're just trying
to eliminate actual copies, then that would be pretty simple:

1st pass: build a hash with each distinct file size as a key and the
list of files of that size as its value

2nd pass: any lists with 2 or more files get checked for format and camera
make/model equivalence

(optional) 3rd pass: any lists that still have 2 or more entries get
checked for signature equivalence.
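
The three passes above could be sketched roughly as follows in Python. This
is only an illustration under some assumptions, not a finished tool: the
names group_by, sha256_of and find_copies are made up for the example,
SHA-256 is just one possible choice of "signature", and the 2nd pass is left
as an optional caller-supplied function (metadata_key, hypothetical) because
reading the format and camera make/model needs an EXIF reader of your
choosing.

```python
import hashlib
import os
from collections import defaultdict


def group_by(paths, key):
    """Group paths by key(path); keep only groups with 2 or more members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_copies(paths, metadata_key=None):
    """Return lists of files that appear to be exact copies of each other."""
    # 1st pass: file size -> list of files with that size
    candidates = group_by(paths, os.path.getsize)

    # 2nd pass (assumption: only runs if the caller supplies metadata_key,
    # e.g. a function returning a (format, make, model) tuple for a file)
    if metadata_key is not None:
        candidates = [sub for group in candidates
                      for sub in group_by(group, metadata_key)]

    # (optional) 3rd pass: signature equivalence among the survivors
    return [sub for group in candidates
            for sub in group_by(group, sha256_of)]
```

Calling find_copies() on a list of file paths returns a list of lists, each
holding files with identical size and checksum. Since only files that share
a size are ever hashed, most of a typical photo collection is never read at
all.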

The actual solution is left to the student. :)


-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


