General information
Category:
Coding, syntax and commands
>All we have is names .. no additional details, we need a way of 'Approximating'
>e.g. John Henry == Jon Henry = John Henri
>I mean all these 'Sound' the same phonetically, so if I use Levenshtein, does that automatically do stuff like this?
>other examples come to mind
> Mr.= Mister
>Dr. = Doctor
Gerard,
What you ask for is a tall order. There are a large number of variables, especially when you factor in dirty data (for example, a user accidentally typing j0hn instead of john by hitting the wrong key).
You're just going to have to do it piecemeal. With millions of records, I would start by building candidate files, manually reviewing your query results, and looking for hard-and-fast combinations you can explicitly convert (such as "john" and "jon"). But doing it properly requires a very comprehensive algorithm with full phonetic rules, adapted to your audience (people worldwide, or just those in Ireland).
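To make the piecemeal approach concrete, here is a rough Python sketch, not a finished matcher. Note that plain Levenshtein only counts character edits and knows nothing about how names sound: "jon henri" and "john henry" are simply 2 edits apart. The TITLES and VARIANTS tables, the function names, and the max_distance threshold are all made up for illustration; the real tables would come out of your own candidate files.

```python
import re

# Hypothetical lookup tables; build the real ones from your own candidate files.
TITLES = {"mr": "mister", "dr": "doctor"}
VARIANTS = {"jon": "john", "henri": "henry"}


def normalize(name):
    """Lowercase, drop punctuation, expand titles, apply explicit conversions."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [TITLES.get(t, t) for t in tokens]
    return " ".join(VARIANTS.get(t, t) for t in tokens)


def levenshtein(a, b):
    """Edit distance: single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def is_candidate(a, b, max_distance=1):
    """Flag a pair for manual review; max_distance is an assumed threshold."""
    return levenshtein(normalize(a), normalize(b)) <= max_distance
```

With these tables, is_candidate("John Henry", "Jon Henri") is true because both normalize to "john henry", and is_candidate("j0hn henry", "John Henry") is true because the typo is a single substitution. A pair flagged this way is only a candidate; it still needs the manual look described above.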
A tall order.
One thing I would make sure you do is keep a copy of the pared-down file. Then go back through the list of combined records manually, by sampling, search for any obvious mistakes, and undo them.
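One possible way to keep merges reviewable and reversible, sketched in Python under the assumption that the records fit in a list and that canonical_of is whatever normalizing function you settle on. All names here (merge_with_log, sample_for_review, undo_merge) are hypothetical. The original records are never modified; groups only hold indexes into them.

```python
import random


def merge_with_log(records, canonical_of):
    """Group record indexes by canonical key; the originals stay untouched."""
    merged = {}
    for idx, name in enumerate(records):
        merged.setdefault(canonical_of(name), []).append(idx)
    return merged


def sample_for_review(merged, records, k, seed=0):
    """Pick up to k combined groups (2+ records) to check by hand."""
    groups = [[records[i] for i in idxs]
              for idxs in merged.values() if len(idxs) > 1]
    return random.Random(seed).sample(groups, min(k, len(groups)))


def undo_merge(merged, records, key, idx):
    """Split record idx out of its group, keyed by its original spelling."""
    merged[key].remove(idx)
    merged.setdefault(records[idx], []).append(idx)
```

For example, with records = ["John Henry", "Jon Henry", "Mary Smith"] and a canonical_of that lowercases and maps "jon" to "john", the first two land in one group, sample_for_review returns that group for inspection, and undo_merge(merged, records, "john henry", 1) puts "Jon Henry" back on its own if the merge turns out to be a mistake.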