Interface for de-duplicating customer records
Message
From
18/02/2011 16:59:28
To
18/02/2011 16:48:16
General information
Forum:
Visual FoxPro
Category:
Other
Environment versions
Visual FoxPro:
VFP 9 SP1
Miscellaneous
Thread ID:
01159500
Message ID:
01500787
Views:
49
I think that, at least for this particular data, "/" can be preprocessed and replaced with a space. Then I could parse the words separately, and have a threshold for how many words need to match in a multi-word string. For speed, I think it would make sense to do the LIKE searches, or even exact searches on individual words, first?
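That word-based idea can be sketched quickly outside VFP. Below is a small Python illustration (my own sketch, not code from this thread; the 50% threshold is an arbitrary assumption) that treats separators such as "/" as spaces, splits into words, and keeps candidates whose word overlap meets the threshold:

```python
import re

def word_match_score(search, candidate):
    """Score two strings by the fraction of the search string's words
    found in the candidate. Any non-alphanumeric run (including "/")
    acts as a separator, so "SL/BURGER KING" yields {SL, BURGER, KING}.
    """
    words1 = set(re.split(r"[^A-Za-z0-9]+", search.upper())) - {""}
    words2 = set(re.split(r"[^A-Za-z0-9]+", candidate.upper())) - {""}
    if not words1:
        return 0.0
    return 100.0 * len(words1 & words2) / len(words1)

def matches(search, candidates, threshold=50.0):
    """Return the candidates whose word-overlap score meets the threshold."""
    return [c for c in candidates if word_match_score(search, c) >= threshold]
```

Against the sample data in this thread, "SL/BURGER KING" would match "BURGER KING CORP" (2 of 3 words) and "BURGER KING/DMA/NYC", but not "AMERISOURCEBERGEN DR GOOD NEIGHBOR" (no whole-word overlap).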

Surely, though, these algorithms have been enhanced and combined in far more sophisticated ways than I could manage starting from scratch, and would be available in commercial packages or libraries? No?
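For what it's worth, the core of the second quoted function below (Levenshtein distance) is only a few lines in any language. Here is a plain-Python sketch of the same row-by-row dynamic programming, with the quoted code's 100-minus-distance scoring carried over (how that score should really be normalized for long strings is an open question):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, kept to one row at a time."""
    if not a or not b:
        return len(a) + len(b)
    prev = list(range(len(b) + 1))  # distances against the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Score on the same 100-minus-distance scale as the quoted VFP function."""
    return 100 - levenshtein(a, b)
```

For example, "BURGER KING" is distance 5 from "BURGER KING CORP" (five inserted characters), so its similarity on this scale is 95.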

>Frank,
>
>Regarding the code you posted, I used the second function you listed below and tried it on real data. I was looking for the string:
>
>SL/BURGER KING
>
>With a threshold of 65, the function returned the following possible matches:
>
>AMERISOURCEBERGEN DR GOOD NEIGHBOR
>AMERISOURCEBERGEN DR MADISON PHARMACY
>AMERISOURCEBERGEN DR ROSELLE PARK RITA PHARM
>SL/SOMERSET DINING G
>WILF CAMPUS FOR SENIOR LIVING
>
>Looking at this as a "human", I queried the table with: select luacctnm from acctnrlu where luacctnm like "%BURGER%".
>The results are below:
>
>BURGER KING
>BURGER KING CORP
>BURGER KING/3153/REM
>BURGER KING/DMA/NYC
>CHEESEBURGER IN PARADISE
>
>The query select luacctnm from acctnrlu where luacctnm like "%KING%" also yields good results.
>
>Clearly my "human" review of the data led me to a much better result; a result that matches what I was looking for is what's needed.
>How can I improve my matching? Would I be better off with something like PHDBase? (I've never used it.)
>I know that I can parse words based on spaces, and separately search the words - if 1 or 1+ words match - include it.
>But how can I parse "BURGER" out of SL/BURGER? Trim leading or trailing spaces in a loop, and check for a word in a dictionary after each iteration?
>
>
>
>
>
>>
>>Hi David,
>>
>>Here are two pieces of code I came up with from the internet:
>>
>>Fuzzy Match:
>>
>>*!*	According to this algorithm, the position of each character of the input value is to be compared with the 
>>*!*	position of the same characters found in the corresponding field from the file. It calculates differences 
>>*!*	between positions of the matching characters.
>>
>>*!*	I assign a matching rate based on the difference in the number of positions. If a certain character is 
>>*!*	found in the same position of both the search string and the database field, I assign a matching rate of 
>>*!*	100. If the two are one position away from each other, the rate is 90. Here is the complete scale:
>>
>>*!*	Absolute difference between character positions 	Matching rate
>>*!*	0 													100
>>*!*	1 													90
>>*!*	2 													80
>>*!*	3 													70
>>*!*	4 													60
>>*!*	5 													50
>>*!*	6 													40
>>*!*	7 													30
>>*!*	8 													20
>>*!*	9 													10
>>*!*	10 or more 											0
>>LPARAMETERS tcText1, tcText2
>>
>>m.tcText1 = ALLTRIM(m.tcText1)
>>m.tcText2 = ALLTRIM(m.tcText2)
>>
>>IF EMPTY(m.tcText1) OR EMPTY(m.tcText2)
>>	RETURN 0
>>ENDIF 
>>
>>*!*	force both terms to uppercase and put the shorter term in lcText1
>>m.lcText1 = UPPER(IIF(LEN(tcText1) < LEN(tcText2), m.tcText1, m.tcText2))
>>m.lcText2 = UPPER(IIF(LEN(tcText1) < LEN(tcText2), m.tcText2, m.tcText1))
>>
>>LOCAL ARRAY laMatchRate[10]
>>
>>laMatchRate[1] = 100
>>laMatchRate[2] = 90
>>laMatchRate[3] = 80
>>laMatchRate[4] = 70
>>laMatchRate[5] = 60
>>laMatchRate[6] = 50
>>laMatchRate[7] = 40
>>laMatchRate[8] = 30
>>laMatchRate[9] = 20
>>laMatchRate[10] = 10
>>
>>*!*	rather than go thru character by character if the terms match exactly
>>IF m.lcText1 == m.lcText2
>>	RETURN 100
>>ENDIF
>>
>>m.lnrate = -1
>>FOR i = 1 TO LEN(m.lcText1)
>>	IF SUBSTR(m.lcText1, i, 1) = SUBSTR(m.lcText2, i, 1)
>>		m.lnRate = m.lnRate + laMatchRate[1]
>>	ELSE
>>		m.lnFoundAt = AT(SUBSTR(m.lcText1, i, 1), m.lcText2)
>>		IF m.lnFoundAt > 0   && skip characters not found in the other string (AT() returns 0)
>>			m.lnPositionDiff = ABS(m.lnFoundAt - i) + 1
>>			IF BETWEEN(m.lnPositionDiff, 1, 10)   && valid indexes into laMatchRate
>>				m.lnRate = m.lnRate + laMatchRate[m.lnPositionDiff]
>>			ENDIF 
>>		ENDIF 
>>	ENDIF 
>>ENDFOR
>>
>>*!*	do checks from start of string and from end of string
>>FOR i = LEN(m.lcText1) TO 1 STEP -1
>>	IF SUBSTR(m.lcText1, i, 1) = SUBSTR(m.lcText2, i, 1)
>>		m.lnRate = m.lnRate + laMatchRate[1]
>>	ELSE
>>		m.lnFoundAt = AT(SUBSTR(m.lcText1, i, 1), m.lcText2)
>>		IF m.lnFoundAt > 0   && skip characters not found in the other string (AT() returns 0)
>>			m.lnPositionDiff = ABS(m.lnFoundAt - i) + 1
>>			IF BETWEEN(m.lnPositionDiff, 1, 10)   && valid indexes into laMatchRate
>>				m.lnRate = m.lnRate + laMatchRate[m.lnPositionDiff]
>>			ENDIF 
>>		ENDIF 
>>	ENDIF 
>>ENDFOR
>>
>>m.lnLengthDiff = LEN(m.lcText2) - LEN(m.lcText1)
>>
>>m.lnRank = ROUND((m.lnRate/(LEN(m.lcText1) * laMatchrate[1] * 2)) * 100, 2) - m.lnLengthDiff
>>*!*	STRTOFILE(CHR(13) + TRANSFORM(m.lnrank), "Duplicate.log",1)
>>RETURN m.lnRank
>>
>>Levenshtein Distance:
>>
>>*!*	function CalcLevenDistance
>>LPARAMETERS m.str1, m.str2
>>LOCAL m.i, m.j, m.diagonal, m.let, m.cost, lnLengthStr1, lnLengthStr2
>>
>>m.lnLengthStr1 = LEN(m.str1)
>>m.lnLengthStr2 = LEN(m.str2)
>>
>>DIMENSION m.arr(m.lnLengthStr1 + 1)
>>DIMENSION m.letters(m.lnLengthStr1)
>>
>>IF m.lnLengthStr1 * m.lnLengthStr2 == 0
>>	RETURN m.lnLengthStr1 + m.lnLengthStr2
>>ENDIF
>>
>>FOR m.i = 1 TO ALEN(m.arr)
>>	m.arr(m.i) = m.i
>>ENDFOR
>>
>>FOR m.i = 1 TO m.lnLengthStr1
>>	m.letters(m.i) = SUBSTR(m.str1, m.i, 1)
>>ENDFOR
>>
>>FOR m.i = 1 TO m.lnLengthStr2
>>	m.diagonal = m.arr(1) - 1
>>	m.arr(1) = m.i + 1
>>	m.let = SUBSTR(m.str2, m.i, 1)
>>	FOR m.j = 1 TO m.lnLengthStr1
>>		m.cost = m.diagonal
>>		IF m.letters(m.j) != m.let
>>			m.cost = m.cost + 1
>>		ENDIF
>>		IF m.cost > m.arr(m.j)
>>			m.cost = m.arr(m.j)
>>		ENDIF
>>		IF m.cost > m.arr(m.j + 1)
>>			m.cost = m.arr(m.j + 1)
>>		ENDIF
>>		m.diagonal = m.arr(m.j + 1) - 1
>>		m.arr(m.j + 1) = m.cost + 1
>>	ENDFOR
>>ENDFOR
>>
>>RETURN 100 - (m.arr(ALEN(m.arr)) - 1)
>>
>>
>>I can't remember where I got them at the moment. If I recall correctly, they were a bit slow when run on a lot of data.