Data Cleansing Exercise
Message
From: Mike Yearwood, Toronto, Ontario, Canada
Date: 25/02/2013 14:30:49

General information
Forum: Visual FoxPro
Category: Coding, syntax and commands, Miscellaneous
Thread ID: 01566820
Message ID: 01566885
Views: 56
>Hi Thomas. Thanks for your reply.
>
>All we have is names, no additional details, so we need a way of 'approximating',
>e.g. John Henry == Jon Henry == John Henri.
>I mean, all these 'sound' the same phonetically, so if I use Levenshtein, does that automatically do stuff like this?
>Other examples come to mind:
>Mr. = Mister
>Dr. = Doctor
>etc.

I used the Double Metaphone technique. What I did was add a field to my table. After the user entered a new record, I Double Metaphone-encoded the entry and stored the encoded value in the new field. Then, when searching for a name, I encoded the search string and used Rushmore to optimize against the new field. It worked very fast.

http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone

http://fox.wikis.com/wc.dll?Wiki~DoubleMetaphone-SoundexAlternative~WIN_COM_API
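
A minimal VFP sketch of that approach might look like the following. The table and field names (names, cname, cdmeta) are placeholders, and DoubleMetaphone() is not a native VFP function - it stands in for the encoder from the links above.

* One-time setup: add an encoded column, fill it, and index it
USE names EXCLUSIVE
ALTER TABLE names ADD COLUMN cdmeta C(20)

* Encode every existing name once and store the code alongside it
UPDATE names SET cdmeta = DoubleMetaphone(ALLTRIM(cname))

* Index the encoded column so Rushmore can optimize lookups against it
SELECT names
INDEX ON cdmeta TAG cdmeta

* Searching: encode the search string and match on the indexed field
lcKey = DoubleMetaphone("Jon Henry")
SELECT cname FROM names WHERE cdmeta == lcKey INTO CURSOR crMatches

The point of keeping the code in its own indexed field is that the phonetic work happens once per record, at entry time, while every search is just an optimized equality match against the index.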

>
>Initially I am looking for some tool which will, say, cut the 2.5 million down somewhat.
>
>Regards,
>Gerard
>
>
>
>
>>We do such rule-based stuff rather often to identify duplicates or identical households,
>>but your example alone cannot work without further rules/lookup tables equalizing
>>Barrack with Barry ;-)
>>
>>Furthermore, there will be edge cases like
>>
>>John Henry
>>Jon Henry
>>Henry John
>>John Henri
>>Henrick John
>>
>>which usually can also be classified only if you have further columns specifying address, ZIP, phone, etc.
>>One pillar of that is utilizing Levenshtein - on the VFP Wiki you will find the source of my speedup of the VFP routines -
>>which we use in a C-translated version for speed reasons (~8 times faster).
>>
>>The other is a rule engine, where we have weights added for certain patterns, allowing
>>for things like switching first and last names around, a wrong street #, an address misprinted by xx chars,
>>similar addresses like Hillborough Road vs. Hillboro Street, and much more.
>>
>>Such a weighted and patterned result set is manually checked again as a last step.
>>
>>We looked at some of the 100 ~ 1800 € packages offered, but went with our own routines,
>>as we have a client with very specific needs, and writing and, more importantly,
>>filling the rule engine took effort > 9999 €, but that was split over many runs of client work.
>>
>>Sending/selling this stuff is clearly not possible.
>>If you have more information in other columns,
>>this would still be some work for us besides writing some setup/import code,
>>like setting up filters for your address format - without any changes in the engines.
>>Checking 3*10**6 rows also takes time, as there are a lot of byte comparisons on each field,
>>and the result sets can be significant as well.
>>
>>But first you need a clearer definition of your own rules to start any serious consideration ;-)
>>
>>HTH
>>
>>thomas
>>
>>>I have a table (it has 3 million entries) and need to do some analysis/cleansing on it.
>>>In its simplest form, it is a table with one column, called Name, which holds the following data:
>>>John Jones
>>>Johnny Jones
>>>Mary Black
>>>Mary Ann Black
>>>Padd Reilly
>>>Paddy O'reilly
>>>Barrack O Bama
>>>Barrack O'Bama
>>>Barry O Bama
>>>George Bush
>>>Georg Bush
>>>
>>>etc
>>>
>>>I need to come up with a list of unique entries, using some 'Consolidation factor'
>>>The 'Consolidation Factor' can be arbitrary as long as it's transparent and everybody knows what it is.
>>>So the above would translate to 5 entries:
>>>John Jones
>>>Mary Black
>>>Paddy reilly
>>>Barrack O bama
>>>George Bush
>>>
>>>
>>>Anybody aware of any tools around that do this sort of thing? Can be FoxPro or C#.
>>>
>>>TIA,
>>>Gerard
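
For the Levenshtein routine Thomas mentions in the quoted reply, here is a rough, unoptimized VFP version of the edit-distance calculation, just to make the idea concrete. It is only a sketch - his sped-up variant is on the VFP Wiki, and the function name is illustrative.

* Plain dynamic-programming Levenshtein (edit) distance.
* It does O(m*n) work per pair, which is why a sped-up / C-translated
* version matters when you run it against millions of rows.
FUNCTION Levenshtein(tcA, tcB)
    LOCAL lnLenA, lnLenB, i, j, lnCost
    LOCAL ARRAY laD[1]
    lnLenA = LEN(tcA)
    lnLenB = LEN(tcB)
    DIMENSION laD[lnLenA + 1, lnLenB + 1]
    FOR i = 0 TO lnLenA
        laD[i + 1, 1] = i      && i deletions reduce the prefix to ""
    ENDFOR
    FOR j = 0 TO lnLenB
        laD[1, j + 1] = j      && j insertions build the prefix from ""
    ENDFOR
    FOR i = 1 TO lnLenA
        FOR j = 1 TO lnLenB
            lnCost = IIF(SUBSTR(tcA, i, 1) == SUBSTR(tcB, j, 1), 0, 1)
            * cheapest of deletion, insertion, substitution
            laD[i + 1, j + 1] = MIN(laD[i, j + 1] + 1, ;
                                    laD[i + 1, j] + 1, ;
                                    laD[i, j] + lnCost)
        ENDFOR
    ENDFOR
    RETURN laD[lnLenA + 1, lnLenB + 1]
ENDFUNC

With this, "John Henry" vs "Jon Henry" and "John Henry" vs "John Henri" both score 1, which is the kind of near-match Gerard's list needs to catch. The weights and patterns Thomas describes (swapped first/last names, abbreviations like Mr./Mister, address quirks) then sit on top of raw scores like these.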