Matching customers whose information is not quite equal


Message

From

03/04/2001 16:18:19

Dragan Nedeljkovich (Online)
Now officially retired
Zrenjanin, Serbia

03/04/2001 14:08:24

Dave Nantais
Light speed database solutions
Ontario, Canada

General information

Forum:

Visual FoxPro

Category:

Databases, Tables, Views, Indexing and SQL syntax

Title:

Re: Matching customers whose information is not quite equal

Miscellaneous

Thread ID:

00491037

Message ID:

00491538

>>Consider the expense for a brute machine first, and then try some ideas on it. One thing that crosses my mind is to chop the complete address into words, or triplets of words, sort them alphabetically (starting with whatever's supposed to be the last name) and then try to match them; the more words for two records match, the more probably the addresses are the same...
>
> I was hoping SQL Server 2000 would have some kind of 'Fuzzy grouping' mechanism that would help me group 'similar' customers.
>
> However, your ideas have given me a workable solution. My biggest concern is grouping together all customers who have words like "PO BOX" in their address fields. Obviously, one cannot group on "PO BOX"; it is "meaningless". I am going to build a table of "meaningless" words: "PO BOX", "Rural Route", "Rural Delivery", etc.
> My concern now is: what are the word equivalents in India, Singapore and Malaysia for things like "Rural Route", "PO Box", "Post Office Box", etc.? I have come up with a way to "discover" what those "meaningless" words are.
>
> Then I will remove those "meaningless" words from every customer record and group on the remaining words.
> Basically, I have come up with a "uniqueness" index for each word in each address field in each record of the customer database. The more unique the word, the stronger the "grouping" should a matching word be found in another customer record.
>
> Next, the "customer service manager" will "verify" that these groups are in fact all the same customer.
>
> Thanks for the brainstorming help.
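
The weighting described in the quote can be sketched in a few lines of Python. This is only an illustration under assumptions: the post does not say how its "uniqueness" index is computed, so inverse document frequency (log of record count over the number of records containing the word) stands in for it, and the stop list and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; in practice it would be discovered from the data.
MEANINGLESS = {"PO", "BOX", "RURAL", "ROUTE", "DELIVERY", "POST", "OFFICE"}

def words(address):
    """Upper-case the address, split it into alphanumeric words, drop the meaningless ones."""
    return {w for w in re.findall(r"[A-Z0-9]+", address.upper()) if w not in MEANINGLESS}

def uniqueness_weights(addresses):
    """Assumed weighting: log(N / records containing the word), so rarer words weigh more."""
    n = len(addresses)
    doc_freq = Counter(w for a in addresses for w in words(a))
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def match_score(a, b, weights):
    """Sum the weights of the words that two addresses have in common."""
    return sum(weights.get(w, 0.0) for w in words(a) & words(b))
```

Pairs with a high score would be the candidate groups handed to the customer service manager for verification. Two records that share nothing but "PO BOX" score zero, and a word that occurs in every record gets weight zero as well.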

Reminds me of the 'search by nickname' I once did for a phone book app, which actually turned out to be 'search by keyword'. The initial selection of keywords was 'all the words in name, address, company name and comments'; then I had to filter out the meaningless stuff, which was easy in this case. I just disregarded the numbers and short words, and let the user add and delete from there.
One way to discover the meaningless ones is to find out which ones appear too often in the same pairs: you may come up with 2000 Main Streets but 40000 PO BOXes, or discover that there are many LPOs (last post office) or whatever else may appear. I.e. the more frequent the word (or pair), the more probable a candidate it is for manual marking as meaningless or meaningful.
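
A rough Python sketch of that frequency count, in case it helps. The function names and the `min_share` threshold are made up for illustration; the right cutoff would have to be tuned against the real customer table, and the output is only a list of candidates for manual marking.

```python
import re
from collections import Counter

def tokens(address):
    """Upper-case the address and split it into alphanumeric words, keeping their order."""
    return re.findall(r"[A-Z0-9]+", address.upper())

def frequent_terms(addresses, min_share=0.5):
    """Return (term, count) for words and adjacent word pairs found in at least
    min_share of the records, most frequent first. min_share is an assumed knob."""
    counts = Counter()
    for address in addresses:
        t = tokens(address)
        pairs = {" ".join(p) for p in zip(t, t[1:])}
        counts.update(set(t) | pairs)  # count each term once per record
    cutoff = min_share * len(addresses)
    return [(term, n) for term, n in counts.most_common() if n >= cutoff]
```

For example, calling `frequent_terms(addresses, min_share=0.6)` on a list where three of four addresses contain "PO Box" would return "PO", "BOX" and the pair "PO BOX" as candidates, leaving it to a person to decide which of them are really meaningless.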

Good luck (back home it is the traditional miners' greeting :)

back to same old
the first online autobiography, unfinished by design
What, me reckless? I'm full of recks!
Balkans, eh? Count them.
