Dotnetpro database performance contest

Plateforme Level Extreme

Abonnement

Profil corporatif

Produits & Services

Support

Légal

English

Dotnetpro database performance contest

Message

14/05/2007 11:03:26

Samuel David
Etecnologia
Miami, Floride, États-Unis

14/05/2007 07:13:34

Thomas Ganss
Main Trend
Frankfurt, Allemagne

Information générale

Forum:

Visual FoxPro

Catégorie:

VFP Compiler for .NET

Titre:

Re: Dotnetpro database performance contest

Divers

Thread ID:

01224231

Message ID:

01225256

Vues:

Hi Thomas:

Even several bytes (16 bytes) Hash Functions does not warrant collision free results, but they indeed make the collision detection time lower. The right approach should be one based in the record length, so the number of bytes of the hash function is dependant of the record length in order to keep the expected falses positives values small and at the same time the hash funciton not to be so time consuming. Someone could come with a length based formula that gives the most optimum hash length based in the time used to compute them and the savings in comparisons of false positives.

By the way if the hash function also is proportional to the byte values, i.e. if the hash value is lower for values as "abc" and greater for values "mbc", then it can be used to speed up the sorting and then just sort the few collisions. It is faster to sort just hashcodes and then perform the lenghty byte comparisons.

So this is another application for sorting, and I can see others as grouping.

>Hi Markus
>
>feeling lucky today ? <bg>
>

>*--------------------------------------------
>* Copy only the unique records to a new table
>* using a different table structure.
>*--------------------------------------------
>SET UNIQUE ON
>INDEX ON Hash( Vorname + Name + Strasse + HausNr + PLZ + Ort + eMail, 5 ) TO _FOLDER_ + "Table1.idx" COMPACT
>

is guaranteed to be faster than Two-Step-Dance-Routine, but what happens in case of hash collisions due to imperfect hash functions / not enough key size ?
>
>I spent some time checking out duplicate frequency cut off points to get the best strategy for hash collision treatment and you optimize it away<g>. To be honest, it might be a better strategy to aim for larger key values, increasing the possible hash bins for the 5*10**5 records than to implement time consuming collision straegies, as collision managment can be quite costly.
>I had planned to check different Hashfunctions (not down to 2Byte size, but perhaps 4 or 3 Bytes) - 3 bytes will give you 29% chace of accidental failure, but 4 bytes with a nearly perfectly distributing hash function will give you 0.012% failure chance - not bad for a contest but would be bad for production code and especially bad for select distinct Samuel was writing about <bg>.

Répondre

Fil

Voir

Click here to load this message in the networking platform