Data Cleansing Exercise
Message
From
24/02/2013 08:05:37
Thomas Ganss (Online)
Main Trend
Frankfurt, Germany

General information
Forum:
Visual FoxPro
Category:
Coding, syntax and commands
Miscellaneous
Thread ID:
01566820
Message ID:
01566823
Views:
74
We do this sort of thing rule-based quite often to identify duplicates or identical households,
but your example alone cannot work without further rules/lookup tables equalizing
Barrack with Barry ;-)
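Such an equalizing lookup table might be sketched as below (illustrative only; the nickname pairs and the `canonical` helper are assumptions for this example, not the author's actual tables):

```python
# Minimal sketch of a nickname-equalization lookup table.
# The mappings below are illustrative; a production table would be far larger.
NICKNAMES = {
    "barry": "barrack",
    "johnny": "john",
    "jon": "john",
    "paddy": "patrick",
    "padd": "patrick",
}

def canonical(name: str) -> str:
    """Map each token of a name to its canonical form, if one is known."""
    tokens = name.lower().split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

print(canonical("Barry O Bama"))    # barrack o bama
print(canonical("Barrack O Bama"))  # barrack o bama
```

With both spellings reduced to the same canonical string, a plain equality test then catches the pair.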

Furthermore there will be edge cases like

John Henry
Jon Henry
Henry John
John Henri
Henrick John

which usually can be classified only if you have further columns specifying address, ZIP, phone etc.
One pillar of that is utilizing Levenshtein distance - in the VFP Wiki you'll find the source of my speedup of the VFP routines -
which we use in a C-translated version for speed reasons (~8 times faster).
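For reference, a plain (unoptimized) Levenshtein edit distance looks like the sketch below; the sped-up VFP and C versions the author mentions compute the same recurrence, just faster:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute),
    keeping only two rows of the DP matrix."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("John", "Jon"))    # 1
print(levenshtein("Henri", "Henry")) # 1
```

Pairs from the edge-case list above ("John"/"Jon", "Henri"/"Henry") come out at distance 1, which is why a distance threshold alone cannot separate typos from genuinely different people.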

The other pillar is a rule engine, where we add weights for certain patterns, allowing
for things like switched first and last names, a wrong street #, an address misprinted by xx chars,
similar addresses like Hillborough Road vs Hillboro Street, and much more.
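A minimal sketch of such weighted pattern scoring is shown below; the specific rules, weights, and record fields are made-up assumptions for illustration - the author's actual engine is far richer:

```python
# Sketch of a weighted rule engine for duplicate scoring.
# Rules and weights are illustrative assumptions, not the author's real values.

def score_pair(rec_a: dict, rec_b: dict) -> int:
    """Sum the weights of every matching pattern; a higher score means
    the pair is more likely a duplicate."""
    score = 0
    a_name = rec_a["name"].lower().split()
    b_name = rec_b["name"].lower().split()
    if a_name == b_name:
        score += 50                        # exact name match
    elif sorted(a_name) == sorted(b_name):
        score += 35                        # same tokens, order swapped (first/last name)
    if rec_a.get("zip") == rec_b.get("zip"):
        score += 20                        # same ZIP code
    if rec_a.get("phone") and rec_a.get("phone") == rec_b.get("phone"):
        score += 30                        # same phone number
    return score

a = {"name": "John Henry", "zip": "60311", "phone": "069-123"}
b = {"name": "Henry John", "zip": "60311", "phone": "069-123"}
print(score_pair(a, b))  # 85 (swapped name 35 + zip 20 + phone 30)
```

Pairs above a chosen score threshold would then go into the manually checked result set described below.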

Such a weighted and patterned result set is manually checked again as a last step.

We looked at some of the packages offered in the €100 ~ €1800 range, but went with our own routines,
as we have a client with very specific needs. Writing and - more importantly -
filling the rule engine took an effort > €9999, but that was split over many runs of client work.

Sending/selling this stuff is clearly not possible.
If you have more information in other columns,
this would still be some work for us besides writing some setup/import code,
like setting up filters for your address format - without any changes to the engines.
Checking 3×10⁶ records also takes time, as there are a lot of byte comparisons on each field,
and the result sets can be significant as well.
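Comparing every record against every other at that scale is O(n²); a common way to keep the runtime manageable - a standard technique, not necessarily what the author's engine does - is blocking: only compare records that share a cheap key, such as the first letters of the surname. The key function below is an illustrative assumption:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key):
    """Group records by a cheap blocking key and yield pairs only
    within each group, instead of all n*(n-1)/2 pairs."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[key(rec)].append(rec)
    for group in buckets.values():
        yield from combinations(group, 2)

names = ["John Jones", "Johnny Jones", "Mary Black", "George Bush", "Georg Bush"]
# Block on the first three letters of the last token (illustrative key choice).
pairs = list(candidate_pairs(names, key=lambda n: n.split()[-1][:3].lower()))
print(pairs)  # only 2 pairs instead of 10
```

The expensive byte-level comparisons then run only on the surviving candidate pairs, at the cost of missing duplicates whose blocking keys differ.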

But first you need a clearer definition of your own rules before any serious consideration can start ;-)

HTH

thomas

>I have a table (3 million entries) and need to do some analysis / cleansing on it
>In its simplest form, it is a table with one column, called Name, which holds the following data
>John Jones
>Johnny Jones
>Mary Black
>Mary Ann Black
>Padd Reilly
>Paddy O'reilly
>Barrack O Bama
>Barrack O'Bama
>Barry O Bama
>George Bush
>Georg Bush
>
>etc
>
>I need to come up with a list of unique entries, using some 'Consolidation factor'
>The 'Consolidation Factor' can be arbitrary as long as it's transparent and everybody knows what it is
>So the above would translate to 5 entries
>John Jones
>Mary Black
>Paddy reilly
>Barrack O bama
>George Bush
>
>
>Anybody aware of any tools around that do this sort of thing? Can be FoxPro or C#.
>
>Tia
>Gerard