General information
Category:
Coding, syntax & commands
Hi Thomas. Thanks for your reply.
All we have is names, no additional details, so we need a way of 'approximating',
e.g. John Henry == Jon Henry == John Henri
I mean all these 'sound' the same phonetically, so if I use Levenshtein, does that automatically do stuff like this?
Other examples come to mind:
Mr. = Mister
Dr. = Doctor
etc
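
To make the question concrete, here is a minimal Python sketch (Python is only my choice for illustration, and the function names and the TITLES table are made up for this example). Levenshtein only counts character edits, so it knows nothing about sound or abbreviations. A phonetic code such as Soundex and a small lookup table for titles would have to be separate steps:

```python
# Sketch only: plain Levenshtein, a basic American Soundex, and a
# hypothetical title lookup table. Not taken from any existing tool.

def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

# Soundex digit for each consonant; vowels, H, W and Y have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word: str) -> str:
    """First letter plus up to three digits, padded with zeros."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two equal codes
        code = _CODES.get(ch, "")
        if code and code != last:
            digits.append(code)
        last = code
    return (letters[0] + "".join(digits) + "000")[:4]

# Hypothetical lookup table; a real one would need many more entries.
TITLES = {"mr": "mister", "dr": "doctor"}

def normalize(name: str) -> str:
    """Lowercase, drop dots, expand known title abbreviations."""
    tokens = name.lower().replace(".", " ").split()
    return " ".join(TITLES.get(t, t) for t in tokens)

def phonetic_key(name: str) -> str:
    """Soundex code of every word in the normalized name."""
    return " ".join(soundex(t) for t in normalize(name).split())

print(levenshtein("John Henry", "Jon Henry"))   # 1
print(levenshtein("Mr.", "Mister"))             # 5
print(phonetic_key("John Henry"))               # J500 H560
print(phonetic_key("John Henri"))               # J500 H560
```

So "Jon Henry" is only one edit away from "John Henry", but "Mr." and "Mister" are five edits apart even though they mean the same thing. The phonetic key treats the three spellings of John Henry as equal, while "Henry John" would still get a different key because the word order differs.
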
Initially I am looking for some tool which will, say, cut the 2.5 million down somewhat.
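
As a rough idea of what I mean by cutting the list down, here is a minimal Python sketch that groups names under a transparent key. The key function (lowercase, letters only) and the names simple_key and consolidate are assumptions made up for this example. Any other rule could be plugged in as long as everybody knows what it is:

```python
import re
from collections import defaultdict

def simple_key(name: str) -> str:
    """Hypothetical consolidation key: lowercase, keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", name.lower())

def consolidate(names, key=simple_key):
    """Group names that share the same key; returns {key: [names]}."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return dict(groups)

# Sample data from my original post.
names = [
    "John Jones", "Johnny Jones", "Mary Black", "Mary Ann Black",
    "Padd Reilly", "Paddy O'reilly", "Barrack O Bama", "Barrack O'Bama",
    "Barry O Bama", "George Bush", "Georg Bush",
]

groups = consolidate(names)
for k, members in groups.items():
    if len(members) > 1:
        print(k, "->", members)
print(len(names), "names ->", len(groups), "groups")
```

With this very strict key only the two spellings of "Barrack O Bama" collapse, so the 11 sample names become 10 groups. A looser key (a phonetic code, for instance) would merge more, at the price of more false matches.
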
Regards,
Gerard
>We do such stuff rule based rather often to identify doubles or identical housing,
>but your example alone cannot work without further rules/lookup tables equalizing
>Barrack with Barry ;-)
>
>Furthermore there will be edge cases like
>
>John Henry
>Jon Henry
>Henry John
>John Henri
>Henrick John
>
>which usually can also be classified only if you have further columns specifying address, ZIP, phone etc.
>One pillar of that is utilizing Levenshtein - in the vfp Wiki you find the source of my speedup of the vfp routines -
>which we use in a C translated version for speed reasons (~8 times faster).
>
>The other is a rule engine, where we have weights added for certain patterns, allowing
>for things like switching first and last name around, wrong street #, misprinted address by xx chars,
>similar addresses like Hillborough Road vs Hillboro Street and much more.
>
>Such a weighted and patterned result set is manually checked again as the last step.
>
>We looked at some of the 100 ~ 1800€ packages offered, but went with our own routines,
>as we have a client with very specific needs. Writing and, more important,
>filling the rule engine took effort > 9999€, but that was split over many runs of client work.
>
>Sending/selling this stuff is clearly not possible.
>If you have more information in other columns,
>this would still be some work for us besides writing some setup/import code,
>like setting up filters for your address format - without any changes in the engines.
>Checking 3*10**6 also takes time as there are a lot of byte comparisons on each field,
>and the result sets can be significant as well.
>
>But first you need a clearer definition of your own rules to start any serious consideration ;-)
>
>HTH
>
>thomas
>
>>I have a table (it has 3 million entries) and need to do some analysis / cleansing on it.
>>In its simplest form, it is a table with one column, called Name, which holds the following data:
>>John Jones
>>Johnny Jones
>>Mary Black
>>Mary Ann Black
>>Padd Reilly
>>Paddy O'reilly
>>Barrack O Bama
>>Barrack O'Bama
>>Barry O Bama
>>George Bush
>>Georg Bush
>>
>>etc
>>
>>I need to come up with a list of unique entries, using some 'Consolidation Factor'.
>>The 'Consolidation Factor' can be arbitrary as long as it's transparent and everybody knows what it is.
>>So the above would translate to 5 entries:
>>John Jones
>>Mary Black
>>Paddy reilly
>>Barrack O bama
>>George Bush
>>
>>
>>Is anybody aware of any tools around that do this sort of thing? Can be FoxPro or C#.
>>
>>Tia
>>Gerard