What's wrong with CRC32
Message
From
09/01/2014 22:22:06
 
 
To
09/01/2014 10:51:02
General information
Forum:
ASP.NET
Category:
Other
Environment versions
Environment:
VB 9.0
OS:
Windows 7
Network:
Windows 2003 Server
Database:
MS SQL Server
Application:
Web
Miscellaneous
Thread ID:
01591437
Message ID:
01591613
Views:
34
This message has been marked as one that helped answer the initial question of the thread.
>>Since you are processing large numbers (50 million), be sure to think about hash collisions. ALL hash functions, CRC included, lose information, which can result in false positives. Very early in computing (disk space being VERY costly then) I built a structure that identified duplicates via 3 different hashes taken together to form the key, and even then checked for exact duplication and incremented a trailing integer in case of collisions...
>
>You could explain more about "information loss" and "false positives"?

The Wiki article is short, but IMO descriptive: http://en.wikipedia.org/wiki/Collision_%28computer_science%29
>
>It seemed that the hash method was a more sufficient method to obtain an ID on the file. Are you saying that doing it twice might result in a different value sometimes?

No, hashing the same file twice always gives the same value. The risk runs the other way: CRC is aimed at producing a different result after even a minimal loss or change of data, but 2 wildly different JPGs could still have the same CRC32 value. Every possible JPG has to be mapped to a fixed-length identity string or integer, which brings in the pigeonhole principle. Having more holes results in fewer pigeons per hole, provided the hash function is good and does not cluster pigeons.
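To put rough numbers on the pigeonhole argument, here is a minimal Python sketch (not from the original thread; the function names and the "file-N" inputs are made up for illustration). It assumes hash values are spread evenly, which is only an approximation for CRC32. The first function estimates how many colliding pairs to expect. The second shrinks the number of holes to 16 bits so that a real collision shows up quickly.

```python
import zlib


def expected_colliding_pairs(n_items: int, hash_bits: int) -> float:
    """Birthday-style estimate: number of pairs divided by number of hash values.

    Assumes the hash spreads inputs evenly over all 2**hash_bits values.
    """
    pairs = n_items * (n_items - 1) / 2
    return pairs / (2 ** hash_bits)


def first_collision(hash_bits: int = 16):
    """Hash distinct inputs until two share the same truncated CRC32 value.

    Truncating to hash_bits just makes the "holes" few enough to fill quickly.
    """
    mask = (1 << hash_bits) - 1
    seen = {}
    # 2**hash_bits + 1 distinct inputs cannot all land in different holes.
    for i in range(2 ** hash_bits + 1):
        data = f"file-{i}".encode()  # hypothetical stand-in for file contents
        h = zlib.crc32(data) & mask
        if h in seen:
            return seen[h], data, h
        seen[h] = data
    return None  # unreachable by the pigeonhole principle


if __name__ == "__main__":
    # 50 million items in a 32-bit hash: roughly 290,000 colliding pairs expected.
    print(round(expected_colliding_pairs(50_000_000, 32)))
    print(first_collision(16))
```

With 50 million files and only about 4.3 billion possible CRC32 values, collisions between different files are something to expect, not a rare accident.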

Adding a check for false positives (2 different files with the same hash result) will not slow down operations much, and IMO it is prudent if you are using hashes to identify files.
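One way such a check could look, sketched in Python and not the code from the structure described above: use the CRC32 only to find candidates, then confirm with a byte-for-byte comparison before calling two files duplicates. The names `crc32_of_file`, `group_duplicates` and the `hash_func` parameter are made up for this example.

```python
import filecmp
import zlib
from collections import defaultdict


def crc32_of_file(path, chunk_size=65536):
    """Compute the CRC32 of a file in chunks so large files need little memory."""
    crc = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def group_duplicates(paths, hash_func=crc32_of_file):
    """Return lists of paths whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[hash_func(p)].append(p)

    groups = []
    for candidates in by_hash.values():
        # Same hash only means "maybe equal"; confirm with a full comparison.
        confirmed = []  # each inner list holds files proven identical
        for p in candidates:
            for g in confirmed:
                if filecmp.cmp(g[0], p, shallow=False):
                    g.append(p)
                    break
            else:
                # False positive for every group so far: start a new one.
                confirmed.append([p])
        groups.extend(g for g in confirmed if len(g) > 1)
    return groups
```

The full comparison only runs for files that already share a hash, so the extra cost stays small unless the data really does contain many duplicates.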

HTH

thomas