>>As you run significant numbers (50 million), be sure to think about hash collisions - ALL hash functions, CRC included, lose information, which can result in false positives. Very early in computing (disc space being VERY costly then) I built a structure identifying duplicates via 3 different hashes taken together to form the key, and even then checked for exact duplication and incremented a trailing integer in case of collisions...
>
>You could explain more about "information loss" and "false positives"?
Wiki is short, but IMO descriptive:
http://en.wikipedia.org/wiki/Collision_%28computer_science%29
>It seemed that the hash method was a sufficient way to obtain an ID for the file. Are you saying that doing it twice might sometimes result in a different value?
CRC is aimed more at detecting minimal loss/change of data than at uniquely identifying it. Two wildly different JPEGs could still have the same CRC32 value, since all possible JPEGs must be mapped to a fixed-length identity string or integer - the pigeonhole principle. Having more holes means fewer pigeons per hole, provided the hash function is good and does not cluster pigeons.
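To make the pigeonhole point concrete, here is a small sketch (my own illustration, not anything from the thread) that brute-forces a CRC32 collision by the birthday paradox. The inputs are short counter strings rather than real JPEGs, but the principle is the same: 32 bits give only ~4 billion pigeonholes, so a collision is expected after roughly 2^16 distinct inputs.

```python
import zlib

def find_crc32_collision():
    """Hash counter strings until two different inputs share a CRC32."""
    seen = {}  # crc value -> first input that produced it
    i = 0
    while True:
        data = str(i).encode()
        crc = zlib.crc32(data)
        if crc in seen:
            return seen[crc], data, crc  # two distinct inputs, same CRC32
        seen[crc] = data
        i += 1

a, b, crc = find_crc32_collision()
print(f"{a!r} and {b!r} both hash to CRC32 {crc:#010x}")
```

On a typical run this terminates after tens of thousands of iterations (well under a second), which is exactly why a 32-bit hash alone is risky as a unique ID over 50 million files.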
Coding to check for false positives - two different files with the same hash result - will not slow operations down much, but IMO it is prudent if you are using hashes to identify files.
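A minimal sketch of what that check could look like (my own example; in-memory byte strings stand in for file contents): group items by hash, and only when two items share a hash fall back to a full byte comparison before declaring them duplicates.

```python
import zlib

def find_duplicates(items):
    """Return (earlier_index, later_index) pairs of verified duplicates."""
    by_hash = {}     # crc32 -> list of (index, data) seen with that hash
    duplicates = []
    for idx, data in enumerate(items):
        crc = zlib.crc32(data)
        for prev_idx, prev_data in by_hash.get(crc, []):
            if prev_data == data:  # exact comparison, not just the hash
                duplicates.append((prev_idx, idx))
                break
        else:
            # No exact match among same-hash items: could be a new item
            # or a CRC32 false positive; either way, keep it.
            by_hash.setdefault(crc, []).append((idx, data))
    return duplicates

items = [b"alpha", b"beta", b"alpha", b"gamma"]
print(find_duplicates(items))  # → [(0, 2)]
```

The exact comparison only runs when hashes already match, so the common case stays as fast as a pure hash lookup.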
HTH
thomas