Interface for de-duplicating customer records
Message
From: 21/02/2011 14:43:37
To: 19/02/2011 08:06:30
General information
Forum: Visual FoxPro
Category: Other
Environment versions
Visual FoxPro: VFP 9 SP1
Miscellaneous
Thread ID: 01159500
Message ID: 01501150
Views: 54
Frank, thanks. I looked at Netrics.com - looks good, and I'm waiting for info from them. dtSearch seems to be geared more toward full-text searching.
In the meantime I posted this message: 1501103
I posted my progress so far on parsing a string into words and spaces and prioritizing the search based on how many words match, the type of match, etc. The function I wrote is helping me do the matching. It looks like I will use Levenshtein as the LAST search, only if the other multi-word searches have failed. If you get a chance, check out that message and let me know if you have any more feedback.
Thanks!
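The tiered strategy described above (word-level matches first, edit distance only as a last resort) can be sketched in Python. This is an illustration only, not the poster's actual code; the function names and the two-tier split are hypothetical.

```python
# Illustrative sketch of a tiered matcher: try cheap word-level matches
# first and fall back to fuzzy/Levenshtein scoring only when they fail.
import re

def tokenize(name: str) -> list[str]:
    # Treat "/" and any other punctuation as word separators,
    # so "SL/BURGER KING" yields ["SL", "BURGER", "KING"].
    return [w for w in re.split(r"[^A-Za-z0-9]+", name.upper()) if w]

def tiered_match(query: str, candidates: list[str]) -> list[str]:
    q_words = set(tokenize(query))
    # Tier 1: every query word appears in the candidate.
    full = [c for c in candidates if q_words <= set(tokenize(c))]
    if full:
        return full
    # Tier 2: at least one query word appears in the candidate.
    partial = [c for c in candidates if q_words & set(tokenize(c))]
    if partial:
        return partial
    # Tier 3 (last resort): Levenshtein / fuzzy scoring would go here.
    return []
```

Against the sample data from this thread, "SL/BURGER KING" would match "BURGER KING CORP" at tier 2 (two shared words) while skipping "CHEESEBURGER IN PARADISE", since "CHEESEBURGER" is a single token that is not equal to "BURGER".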



>David, you can look at Netrics.com and dtsearch.com; both offer commercial solutions. I used the code I gave you many years ago and it served the purpose I needed then, but I never really examined it for improvements. This type of searching is very useful when the search term or the values in the data can be misspelt. So if someone had entered "BRGER" instead of "BURGER" in the data, you would still get possible matches, whereas a search on "%BURGER%" would find nothing.
>
>>I think that, at least for this particular data, "/" can be preprocessed and replaced with a space. Then I could parse the words separately and have a threshold for how many words need to match in a multi-word string. For speed, I think it would make sense to do the LIKE searches, or even exact searches on individual words, first?
>>
>>Surely though, these algorithms have been enhanced and put together in a far more sophisticated way than I could manage starting from scratch, and would be available in commercial packages or libraries? No?
>>
>>>Frank,
>>>
>>>Regarding the code you posted, I used the second function you listed below and tried it on real data. I was looking for the string:
>>>
>>>SL/BURGER KING
>>>
>>>With a threshold of 65, the function returned the following possible matches:
>>>
>>>AMERISOURCEBERGEN DR GOOD NEIGHBOR
>>>AMERISOURCEBERGEN DR MADISON PHARMACY
>>>AMERISOURCEBERGEN DR ROSELLE PARK RITA PHARM
>>>SL/SOMERSET DINING G
>>>WILF CAMPUS FOR SENIOR LIVING
>>>
>>>Looking at this as a "human", I queried the table with select luacctnm from acctnrlu where luacctnm like '%BURGER%'.
>>>The results are below:
>>>
>>>BURGER KING
>>>BURGER KING CORP
>>>BURGER KING/3153/REM
>>>BURGER KING/DMA/NYC
>>>CHEESEBURGER IN PARADISE
>>>
>>>select luacctnm from acctnrlu where luacctnm like '%KING%' also yields good results.
>>>
>>>Clearly my "human review of the data" led me to a much better result, one that actually matches what I was looking for.
>>>How can I improve on my matching? Would I be better off with something like PHDBase? (I've never used it.)
>>>I know that I can parse words based on spaces, and separately search the words - if 1 or 1+ words match - include it.
>>>But how can I parse "BURGER" out of SL/BURGER? Trim leading or trailing spaces in a loop, and check for a word in a dictionary after each iteration?
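One answer to the question above (a sketch, not from the thread itself): rather than trimming characters in a loop and checking a dictionary, split on any run of non-alphanumeric characters, so "/" behaves exactly like a space. The function name here is hypothetical.

```python
# Split an account name into words, treating "/" and any other
# punctuation as separators, so "BURGER" falls out of "SL/BURGER".
import re

def words(value: str) -> list[str]:
    return [w for w in re.split(r"[^A-Za-z0-9]+", value) if w]
```

Each resulting fragment can then be matched or looked up on its own; no dictionary check is needed just to separate the words.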
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>Hi David,
>>>>
>>>>Here are two pieces of code I put together from things I found on the internet:
>>>>
>>>>Fuzzy Match:
>>>>
>>>>*!*	According to this algorithm, the position of each character of the input value is compared with the 
>>>>*!*	position of the same characters found in the corresponding field from the file. It calculates differences 
>>>>*!*	between positions of the matching characters.
>>>>
>>>>*!*	I assign a matching rate based on the difference in the number of positions. If a certain character is 
>>>>*!*	found in the same position of both the search string and the database field, I assign a matching rate of 
>>>>*!*	100. If the two are one position away from each other, the rate is 90. Here is the complete scale:
>>>>
>>>>*!*	Absolute difference between character positions 	Matching rate
>>>>*!*	0 													100
>>>>*!*	1 													90
>>>>*!*	2 													80
>>>>*!*	3 													70
>>>>*!*	4 													60
>>>>*!*	5 													50
>>>>*!*	6 													40
>>>>*!*	7 													30
>>>>*!*	8 													20
>>>>*!*	9 													10
>>>>*!*	10 or more 											0
>>>>LPARAMETERS tcText1, tcText2
>>>>
>>>>m.tcText1 = ALLTRIM(m.tcText1)
>>>>m.tcText2 = ALLTRIM(m.tcText2)
>>>>
>>>>IF EMPTY(m.tcText1) OR EMPTY(m.tcText2)
>>>>	RETURN 0
>>>>ENDIF 
>>>>
>>>>*!*	force both terms to uppercase and put the shorter term in lcText1
>>>>m.lcText1 = UPPER(IIF(LEN(tcText1) < LEN(tcText2), m.tcText1, m.tcText2))
>>>>m.lcText2 = UPPER(IIF(LEN(tcText1) < LEN(tcText2), m.tcText2, m.tcText1))
>>>>
>>>>LOCAL ARRAY laMatchRate[10]
>>>>
>>>>laMatchRate[1] = 100
>>>>laMatchRate[2] = 90
>>>>laMatchRate[3] = 80
>>>>laMatchRate[4] = 70
>>>>laMatchRate[5] = 60
>>>>laMatchRate[6] = 50
>>>>laMatchRate[7] = 40
>>>>laMatchRate[8] = 30
>>>>laMatchRate[9] = 20
>>>>laMatchRate[10] = 10
>>>>
>>>>*!*	rather than go thru character by character if the terms match exactly
>>>>IF m.lcText1 == m.lcText2
>>>>	RETURN 100
>>>>ENDIF
>>>>
>>>>m.lnrate = -1
>>>>FOR i = 1 TO LEN(m.lcText1)
>>>>	IF SUBSTR(m.lcText1, i, 1) = SUBSTR(m.lcText2, i, 1)
>>>>		m.lnRate = m.lnRate + laMatchRate[1]
>>>>	ELSE
>>>>		lnPositionDiff = ABS(AT(SUBSTR(m.lcText1, i, 1), m.lcText2) - i) + 1
>>>>		IF BETWEEN(m.lnPositionDiff, 0, 9)
>>>>			m.lnRate = m.lnRate + laMatchrate[m.lnPositionDiff]
>>>>		ENDIF 
>>>>	ENDIF 
>>>>ENDFOR
>>>>
>>>>*!*	do checks from start of string and from end of string
>>>>FOR i = LEN(m.lcText1) TO 1 STEP -1
>>>>	IF SUBSTR(m.lcText1, i, 1) = SUBSTR(m.lcText2, i, 1)
>>>>		m.lnRate = m.lnRate + laMatchRate[1]
>>>>	ELSE
>>>>		lnPositionDiff = ABS(AT(SUBSTR(m.lcText1, i, 1), m.lcText2) - i) + 1
>>>>		IF BETWEEN(m.lnPositionDiff, 0, 9)
>>>>			m.lnRate = m.lnRate + laMatchrate[m.lnPositionDiff]
>>>>		ENDIF 
>>>>	ENDIF 
>>>>ENDFOR
>>>>
>>>>m.lnLengthDiff = LEN(m.lcText2) - LEN(m.lcText1)
>>>>
>>>>m.lnRank = ROUND((m.lnRate/(LEN(m.lcText1) * laMatchrate[1] * 2)) * 100, 2) - m.lnLengthDiff
>>>>*!*	STRTOFILE(CHR(13) + TRANSFORM(m.lnrank), "Duplicate.log",1)
>>>>RETURN m.lnRank
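The position-difference rating above can be restated in Python. This is a hedged sketch of the same idea, not a line-for-line port of the VFP function: each character scores 100 minus 10 per position of displacement, the total is normalised against a perfect score, and the length difference is subtracted. Edge cases the VFP code handles differently (the initial rate of -1, the second reverse-direction pass) are simplified here.

```python
# Sketch of the position-difference fuzzy rating: compare where each
# character of the shorter term appears in the longer term and score
# 100, 90, 80, ... down to 0 as the displacement grows.
def fuzzy_rate(a: str, b: str) -> float:
    a, b = a.upper().strip(), b.upper().strip()
    if not a or not b:
        return 0.0
    if len(a) > len(b):
        a, b = b, a                       # shorter term drives the loop
    if a == b:
        return 100.0                      # skip the per-character walk
    total = 0
    for i, ch in enumerate(a):
        pos = b.find(ch)                  # first occurrence in the other term
        if pos < 0:
            continue                      # character absent: contributes 0
        diff = abs(pos - i)
        total += max(0, 100 - 10 * diff)  # the 100/90/.../0 scale above
    score = 100.0 * total / (len(a) * 100)
    return round(score - (len(b) - len(a)), 2)  # penalise length mismatch
```

For example, rating "BURGER" against the misspelling "BRGER" scores each shared character by its displacement and lands at 89.0, comfortably above a threshold like 65.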
>>>>
>>>>Levenshtein Distance:
>>>>
>>>>*!*	function CalcLevenDistance
>>>>LPARAMETERS m.str1, m.str2
>>>>LOCAL m.i, m.j, m.diagonal, m.let, m.cost, lnLengthStr1, lnLengthStr2
>>>>
>>>>m.lnLengthStr1 = LEN(m.str1)
>>>>m.lnLengthStr2 = LEN(m.str2)
>>>>
>>>>DIMENSION m.arr(m.lnLengthStr1 + 1)
>>>>DIMENSION m.letters(m.lnLengthStr1)
>>>>
>>>>IF m.lnLengthStr1 * m.lnLengthStr2 == 0
>>>>	RETURN m.lnLengthStr1 + m.lnLengthStr2
>>>>ENDIF
>>>>
>>>>FOR m.i = 1 TO ALEN(m.arr)
>>>>	m.arr(m.i) = m.i
>>>>ENDFOR
>>>>
>>>>FOR m.i = 1 TO m.lnLengthStr1
>>>>	m.letters(m.i) = SUBSTR(m.str1, m.i, 1)
>>>>ENDFOR
>>>>
>>>>FOR m.i = 1 TO m.lnLengthStr2
>>>>	m.diagonal = m.arr(1) - 1
>>>>	m.arr(1) = m.i + 1
>>>>	m.let = SUBSTR(m.str2, m.i, 1)
>>>>	FOR m.j = 1 TO m.lnLengthStr1
>>>>		m.cost = m.diagonal
>>>>		IF m.letters(m.j) != m.let
>>>>			m.cost = m.cost + 1
>>>>		ENDIF
>>>>		IF m.cost > m.arr(m.j)
>>>>			m.cost = m.arr(m.j)
>>>>		ENDIF
>>>>		IF m.cost > m.arr(m.j + 1)
>>>>			m.cost = m.arr(m.j + 1)
>>>>		ENDIF
>>>>		m.diagonal = m.arr(m.j + 1) - 1
>>>>		m.arr(m.j + 1) = m.cost + 1
>>>>	ENDFOR
>>>>ENDFOR
>>>>
>>>>RETURN 100 - (m.arr(ALEN(m.arr)) - 1)
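For comparison, here is the same two-row Levenshtein computation in Python. This is an illustration, not part of the original post; note that the VFP function returns 100 minus the distance as a similarity-like score, which only behaves sensibly for short strings, whereas this sketch returns the raw edit distance.

```python
# Standard two-row (Wagner-Fischer) Levenshtein distance: the minimum
# number of single-character insertions, deletions, and substitutions
# needed to turn one string into the other.
def levenshtein(s1: str, s2: str) -> int:
    if len(s1) > len(s2):
        s1, s2 = s2, s1                       # keep the shorter string in s1
    previous = list(range(len(s1) + 1))       # row for the empty prefix of s2
    for i, c2 in enumerate(s2, start=1):
        current = [i]                         # distance from empty prefix of s1
        for j, c1 in enumerate(s1, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

On the thread's example, the misspelling "BRGER" is a single deletion away from "BURGER", so the distance is 1 (and the VFP routine's score would be 99).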
>>>>
>>>>
>>>>I can't remember at the moment where I got them. If I recall correctly, they were a bit slow when run on a lot of data.