Data Cleansing Exercise
Forum:
Visual FoxPro
Category:
Coding, syntax and commands
Miscellaneous
Thread ID:
01566820
Message ID:
01566866
>All we have is names .. no additional details, so we need a way of 'approximating'
>e.g. John Henry = Jon Henry = John Henri
>I mean, all of these sound the same phonetically, so if I use Levenshtein, does that automatically handle cases like this?
>Other examples come to mind:
>Mr. = Mister
>Dr. = Doctor

Gerard,

What you ask for is a tall order. There are a large number of variables, especially when you factor in dirty data (for example, a user accidentally typing "j0hn" instead of "john" by hitting the wrong key).

You're just going to have to do it piecemeal. With millions of records, I would start by building candidate files, manually looking through your query results, and seeing whether there are hard-and-fast combinations you can explicitly convert (such as "john" and "jon"). But doing it properly requires a very comprehensive algorithm with full phonetic rules, adapted to your audience (people worldwide, or just those in Ireland), etc.
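
To make the piecemeal idea concrete, here is a rough sketch in Python (not VFP, purely for illustration). Levenshtein only counts character edits; it knows nothing about phonetics or abbreviations, so pairs like "Mr." and "Mister" need an explicit lookup before any distance is measured. The tables `TITLE_MAP` and `NAME_MAP` and all function names are hypothetical placeholders, and the distance threshold of 1 is an assumption you would tune against your own data.

```python
from itertools import combinations

# Hypothetical lookup tables; in practice these grow as you review query results.
TITLE_MAP = {"mr": "mister", "dr": "doctor"}
NAME_MAP = {"jon": "john", "henri": "henry"}


def normalize(name: str) -> str:
    """Lower-case, drop periods, and apply the explicit conversions."""
    tokens = name.lower().replace(".", " ").split()
    tokens = [TITLE_MAP.get(t, t) for t in tokens]
    tokens = [NAME_MAP.get(t, t) for t in tokens]
    return " ".join(tokens)


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def candidate_pairs(names, max_distance=1):
    """Yield (raw_a, raw_b, distance) for names that normalize to near-matches."""
    normalized = [(n, normalize(n)) for n in names]
    for (raw_a, norm_a), (raw_b, norm_b) in combinations(normalized, 2):
        d = levenshtein(norm_a, norm_b)
        if d <= max_distance:
            yield raw_a, raw_b, d
```

Comparing every pair is quadratic, so with millions of records you would first group names by some coarse key (the first letter, say) and only compare within each group. The pairs this yields are candidates for manual review, not confirmed duplicates.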

A tall order.

One thing I would make sure you do is keep a copy of the pared-down file. Then go back through the list of combined records manually, by sampling, and search for any obvious mistakes and undo them.
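
One way to make the sample-and-undo step workable is to log every merge as it happens. The following Python sketch assumes records are simple `(record_id, name)` tuples and that `key_func` is whatever matching rule you settled on; the function names and record layout are hypothetical, not taken from your tables.

```python
import random


def merge_with_audit(records, key_func):
    """Collapse records sharing a key; return the pared-down list and an audit trail."""
    survivors = {}  # key -> (record_id, name) that was kept
    audit = []      # (merged_id, merged_name, survivor_id) per collapsed record
    for rec_id, name in records:
        key = key_func(name)
        if key in survivors:
            audit.append((rec_id, name, survivors[key][0]))
        else:
            survivors[key] = (rec_id, name)
    return list(survivors.values()), audit


def sample_for_review(audit, k, seed=None):
    """Pick up to k merged records at random for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(audit, min(k, len(audit)))


def undo_merge(pared, audit, merged_id):
    """Restore one wrongly merged record to the pared-down list."""
    for entry in audit:
        if entry[0] == merged_id:
            pared.append((entry[0], entry[1]))
            audit.remove(entry)
            return True
    return False
```

Because the audit trail records which surviving record absorbed each merged one, a mistake spotted during sampling can be reversed without rerunning the whole job.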