Data Cleansing Exercise - Level Extreme

Message

26/02/2013 08:27:05

Thomas Ganss
Main Trend
Frankfurt, Germany

25/02/2013 10:44:47

Gerard O'Carroll
Kernel Software Ltd.
Dublin, Ireland

General information

Forum: Visual FoxPro
Category: Coding, syntax and commands
Title: Re: Data Cleansing Exercise

Miscellaneous

Thread ID: 01566820
Message ID: 01566919

Hi Gerard,

I could load a small sample of your data into our program, say 10K records with some duplicates added on purpose: about 2/3 starting with the same 2 or 3 letters of the last name (to heighten the chance of hits) and the other 1/3 with random letters to get cross-letter results. But I am pretty sure it will not work really well, as our approach is geared to finding similar / identical people via name similarity plus supporting field info. With no supporting fields some rules won't fire (we have checks that suppress inflated scores on empty fields). So the distribution of scores will differ markedly from our usual results as well ;-)
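To illustrate why the score distribution shifts when only names are present, here is a minimal sketch in Python. The field names, the weights and the use of `difflib.SequenceMatcher` are assumptions made for the example, not a description of our actual rules.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the real rules and weights are not shown here.
WEIGHTS = {"name": 0.6, "city": 0.2, "birth_date": 0.2}

def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of field similarities. A field that is empty on either
    side contributes nothing, so two blanks never count as a match."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        a = (rec_a.get(field) or "").strip()
        b = (rec_b.get(field) or "").strip()
        if not a or not b:
            continue  # rule does not fire: no evidence either way
        score += weight * field_similarity(a, b)
    return score

full = {"name": "John Henry", "city": "Dublin", "birth_date": "1970-01-01"}
names_only = {"name": "John Henry"}

print(record_score(full, dict(full)))              # all three rules fire
print(record_score(names_only, dict(names_only)))  # only the name rule fires
```

With these assumed weights, two identical names-only records can never score above the name weight, so any threshold tuned on full records would have to be re-tuned.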

Furthermore, some people write their last name first, and that needs to be taken into account as well: John Henry = Henry, John, but what about Henry John? We do have some routines checking for reordered words, and we also have a routine that maps letters that are often misheard onto a few "normalized" base letters, which is much better than Soundex but is geared to the German language.
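A rough sketch of both ideas in Python, under loud assumptions: the substitution table below is invented for illustration (it is not the German-oriented table mentioned above), and `name_key` is a hypothetical helper name.

```python
import re

# Hypothetical substitution table, for illustration only.
NORMALIZE = [("ph", "f"), ("ck", "k"), ("th", "t"), ("y", "i"), ("z", "s")]

def normalize_word(word: str) -> str:
    """Lowercase a word and map easily confused letters to a base letter."""
    word = word.lower()
    for src, dst in NORMALIZE:
        word = word.replace(src, dst)
    return word

def name_key(name: str) -> tuple:
    """Order-insensitive key: split into words, normalize each, sort."""
    words = re.findall(r"[^\W\d_]+", name)  # runs of letters only
    return tuple(sorted(normalize_word(w) for w in words))

for candidate in ("John Henry", "Henry, John", "Henry John", "John Henri", "Jon Henry"):
    print(candidate, "->", name_key(candidate))
```

Note that "Henry John" gets the same key as "John Henry" although it may well be a different person, so equal keys should only mark candidates for a closer look. "Jon Henry" gets a different key with this table, which is where a similarity measure on top is still needed.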

But let me be frank: my guess is you still don't have a clear idea/description of your target. Needle the ones signing the cheque for a better description before starting work. Otherwise, even after several iterations, ***both*** sides will be unhappy: no wonderful result is found, and the money spent to get the current result is too high because of too many tries done with insufficient parameters.

Mr, Mister and all titles and academic degrees from my POV need to be eliminated in advance via word lists, but again I am unsure of what you want to accomplish. One guess might be an atomized/factorized word list, which can be used to describe similarity with each single record of your data, but I might be totally off base as well ;-) But such a beast will help a lot when one name is missing, as these cases tend to throw off measures like Levenshtein and Jaro-Winkler.
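As a small sketch of the word-list idea and of what plain Levenshtein does and does not do, assuming Python and a tiny made-up title list (a real list would be much longer and language-specific):

```python
# Hypothetical word list, for illustration only.
TITLES = {"mr", "mister", "mrs", "ms", "dr", "doctor", "prof", "professor"}

def strip_titles(name: str) -> str:
    """Lowercase, drop trailing dots/commas, remove words found in TITLES."""
    words = [w.strip(".,").lower() for w in name.split()]
    return " ".join(w for w in words if w and w not in TITLES)

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(strip_titles("Mr. John Henry") == strip_titles("Mister John Henry"))
print(levenshtein("john henry", "jon henry"))
print(levenshtein("john henry", "john henri"))
print(levenshtein("john henry", "henry"))
```

Levenshtein only counts character edits; it knows nothing about phonetics or about Mr. = Mister, which is why the titles are stripped first. A missing first name costs as many edits as it has characters plus the space, far more than a one-letter spelling variant.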

my 0.02€

thomas

>Hi Thomas. Thanks for your reply.
>
>All we have is names, no additional details, so we need a way of 'approximating',
>e.g. John Henry = Jon Henry = John Henri
>I mean all these 'sound' the same phonetically, so if I use Levenshtein, does that automatically do stuff like this?
>Other examples come to mind:
>Mr. = Mister
>Dr. = Doctor
>etc.
>
>Initially I am looking for some tool which will, say, cut the 2.5 million down somewhat.
>
