>>As you run significant numbers (50 million), be sure to think about hash collisions - ALL hash functions, CRC included, lose information, which can result in false positives. Very early in computing (disc space being VERY costly then) I built a structure identifying duplicates via 3 different hashes taken together to form the key, and even then checked for exact duplication and incremented a trailing integer in case of collisions...
>
>You could explain more about "information loss" and "false positives"?
Wiki is short, but IMO descriptive:
http://en.wikipedia.org/wiki/Collision_%28computer_science%29
>It seemed that the hash method was a sufficient way to obtain an ID for the file. Are you saying that doing it twice might sometimes result in a different value?
CRC is aimed more at detecting minimal loss/change of data than at uniquely identifying it. Two wildly different JPEGs could still have the same CRC32 value, since all possible JPEGs must be mapped to a fixed-length identity string or integer - the pigeonhole principle. Having more holes means fewer pigeons per hole, provided the hash function is good and does not cluster pigeons.
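To make the pigeonhole point concrete, here is a small sketch (my own illustration, not anything from the thread) that brute-forces a CRC32 collision by the birthday paradox. The inputs are short counter strings rather than real JPEGs, but the principle is the same: 32 bits give only ~4 billion pigeonholes, so a collision is expected after roughly 2^16 distinct inputs.

```python
import zlib

def find_crc32_collision():
    """Hash counter strings until two different inputs share a CRC32."""
    seen = {}  # crc value -> first input that produced it
    i = 0
    while True:
        data = str(i).encode()
        crc = zlib.crc32(data)
        if crc in seen:
            return seen[crc], data, crc  # two distinct inputs, same CRC32
        seen[crc] = data
        i += 1

a, b, crc = find_crc32_collision()
print(f"{a!r} and {b!r} both hash to CRC32 {crc:#010x}")
```

On a typical run this terminates after tens of thousands of iterations (well under a second), which is exactly why a 32-bit hash alone is risky as a unique ID over 50 million files.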
Coding to check for false positives - two different files with the same hash result - will not slow operations down much, but IMO it is prudent if you are using hashes to identify files.
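A minimal sketch of what that check could look like (my own example; in-memory byte strings stand in for file contents): group items by hash, and only when two items share a hash fall back to a full byte comparison before declaring them duplicates.

```python
import zlib

def find_duplicates(items):
    """Return (earlier_index, later_index) pairs of verified duplicates."""
    by_hash = {}     # crc32 -> list of (index, data) seen with that hash
    duplicates = []
    for idx, data in enumerate(items):
        crc = zlib.crc32(data)
        for prev_idx, prev_data in by_hash.get(crc, []):
            if prev_data == data:  # exact comparison, not just the hash
                duplicates.append((prev_idx, idx))
                break
        else:
            # No exact match among same-hash items: could be a new item
            # or a CRC32 false positive; either way, keep it.
            by_hash.setdefault(crc, []).append((idx, data))
    return duplicates

items = [b"alpha", b"beta", b"alpha", b"gamma"]
print(find_duplicates(items))  # → [(0, 2)]
```

The exact comparison only runs when hashes already match, so the common case stays as fast as a pure hash lookup.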
HTH
thomas