Data Cleansing Exercise - Level Extreme

Data Cleansing Exercise

Message

From

25/02/2013 10:44:47

Gerard O'Carroll
Kernel Software Ltd.
Dublin, Ireland

24/02/2013 08:05:37

Thomas Ganss
Main Trend
Frankfurt, Germany

General information

Forum: Visual FoxPro
Category: Coding, syntax & commands
Title: Re: Data Cleansing Exercise

Miscellaneous

Thread ID: 01566820
Message ID: 01566861

Hi Thomas. Thanks for your reply.

All we have is names, with no additional details, so we need a way of 'approximating',
e.g. John Henry = Jon Henry = John Henri
I mean all these 'sound' the same phonetically, so if I use Levenshtein, does that automatically handle cases like this?
Other examples come to mind:
Mr. = Mister
Dr. = Doctor
etc
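
To make the question concrete, here is a minimal Python sketch (Python is chosen only for illustration; the thread itself is about FoxPro or C#). It assumes plain Levenshtein edit distance, which counts character edits rather than sounds, so near-spellings such as Jon/John score as close, while abbreviations such as Mr./Mister do not and need a lookup table applied first. The TITLE_SYNONYMS table and the normalize() helper are hypothetical names invented for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts,
    deletes and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Hypothetical lookup table: abbreviations that edit distance cannot equate.
TITLE_SYNONYMS = {"mr": "mister", "dr": "doctor"}


def normalize(name: str) -> str:
    """Lowercase, drop dots, and expand known abbreviations."""
    tokens = name.lower().replace(".", " ").split()
    return " ".join(TITLE_SYNONYMS.get(token, token) for token in tokens)


print(levenshtein("John Henry", "Jon Henry"))   # 1
print(levenshtein("John Henry", "John Henri"))  # 1
print(normalize("Mr. Smith") == normalize("Mister Smith"))  # True
```

So the spelling variants fall out of a small distance threshold, but title equivalences (and nicknames) would have to come from rules or lookup tables applied before the comparison.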

Initially I am looking for some tool which will, say, cut the 2.5 million down somewhat.

Regards,
Gerard

>We often do this kind of thing rule-based, to identify duplicates or identical households,
>but your example alone cannot work without further rules/lookup tables equating
>Barrack with Barry ;-)
>
>Furthermore there will be edge cases like
>
>John Henry
>Jon Henry
>Henry John
>John Henri
>Henrick John
>
>which usually can only be classified if you have further columns specifying address, ZIP, phone etc.
>One pillar of that is Levenshtein distance - in the VFP Wiki you will find the source of my speedup of the VFP routines -
>which we use in a version translated to C for speed reasons (~8 times faster).
>
>The other is a rule engine, where we add weights for certain patterns, allowing
>for things like first and last name switched around, wrong street #, address misprinted by xx chars,
>similar addresses like Hillborough Road vs Hillboro Street and much more.
>
>Such a weighted and patterned result set is manually checked again as the last step.
>
>We looked at some of the packages offered for 100 ~ 1800€, but went with our own routines,
>as we have a client with very specific needs, and writing and, more importantly,
>filling the rule engine took effort > 9999€, but was split over many runs of client work.
>
>Sending/selling this stuff is clearly not possible.
>If you have more information in other columns,
>this would still be some work for us besides writing some setup/import code,
>like setting up filters for your address format - without any changes in the engines.
>Checking 3*10**6 also takes time as there are a lot of byte comparisons on each field,
>and the result sets can be significant as well.
>
>But first you need a clearer definition of your own rules to start any serious consideration ;-)
>
>HTH
>
>thomas
>
>>I have a table (with 3 million entries) and need to do some analysis / cleansing on it.
>>In its simplest form, it is a table with one column, called Name, which holds the following data:
>>John Jones
>>Johnny Jones
>>Mary Black
>>Mary Ann Black
>>Padd Reilly
>>Paddy O'reilly
>>Barrack O Bama
>>Barrack O'Bama
>>Barry O Bama
>>George Bush
>>Georg Bush
>>
>>etc
>>
>>I need to come up with a list of unique entries, using some 'consolidation factor'.
>>The 'consolidation factor' can be arbitrary as long as it's transparent and everybody knows what it is.
>>So the above would translate to 5 entries:
>>John Jones
>>Mary Black
>>Paddy reilly
>>Barrack O bama
>>George Bush
>>
>>
>>Is anybody aware of any tools around that do this sort of thing? Can be FoxPro or C#.
>>
>>Tia
>>Gerard
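
As a rough illustration of the 'consolidation factor' idea from the original question, the following Python sketch groups the sample names with a single greedy pass. It is only one possible reading of the requirement, not the rule engine described in the reply: the factor is assumed to be a maximum Levenshtein distance of 3 on a letters-only, lowercased key, and each name joins the first group whose first member is within that distance. The names key(), consolidate() and max_distance are hypothetical.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance (inserts, deletes, substitutions), row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def key(name: str) -> str:
    """Comparison key: lowercase letters only, so O'Bama and O Bama agree."""
    return re.sub(r"[^a-z]", "", name.lower())


def consolidate(names, max_distance=3):
    """Greedy pass: a name joins the first group whose representative
    (the group's first member) is within max_distance edits."""
    groups = []  # list of (representative key, [original names])
    for name in names:
        k = key(name)
        for rep_key, members in groups:
            if levenshtein(k, rep_key) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((k, [name]))
    return [members for _, members in groups]


NAMES = [
    "John Jones", "Johnny Jones", "Mary Black", "Mary Ann Black",
    "Padd Reilly", "Paddy O'reilly", "Barrack O Bama", "Barrack O'Bama",
    "Barry O Bama", "George Bush", "Georg Bush",
]

groups = consolidate(NAMES)
for members in groups:
    print(members[0], "<-", members[1:])
print(len(groups))  # 5
```

On this sample the threshold of 3 happens to give the five groups listed in the question, with the first spelling seen acting as the representative. The threshold is arbitrary but transparent; on real data it would likely also merge distinct short names, and comparing every row against every group would probably be too slow for millions of rows without first partitioning the data (for example by a phonetic code or a leading letter).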
