How do I find duplicate records? - Level Extreme

Level Extreme platform

Subscription

Corporate profile

Products & Services

Support

Legal

Français

How do I find duplicate records?

Message

From

19/10/2000 13:05:48

Naomi Nosonovsky
Wisconsin, United States

To

19/10/2000 12:27:50

Vlad Grynchyshyn
Soft Serve
Lviv, Ukraine

General information

Forum:

Visual FoxPro

Category:

Databases,Tables, Views, Indexing and SQL syntax

Title:

Re: How do I find duplicate records?

Miscellaneous

Thread ID:

00431045

Message ID:

00431588

Views:

9

Vlad,

Bellow is my current program. See, if you can improve the logic (for Address+days it works very slow, it works quite fast for two other cases):

********************************************************************
*  Description.......: DEDUFLAG.PRG - De-duper UDF for flagging potential duplicates
*  Calling Samples...:
*  Parameter List....:
*  Created by........: M. Asherman 3/4/99
*  Modified by.......: Nadya Nosonovsky 10/05/2000 04:19:41 PM
********************************************************************
* 3/5/99: fill out logic.
* 3/6/99: add WAIT NOWAIT message for clearer in-progress indicator.
*	Also add a final WAIT NOWAIT message giving summary count of potential dupes flagged.
* 3/12/99: change return value to more useful numeric, instead of success flag.
*	Adjust logic to respect any current FILTER setting, so that only records in view can be flagged.
*	Suppress consideration of records that already have a non-empty DedupeFlag.
*	Introduce #include file deduincl.h, for cleaner localization of arbitrary constants.
*
* This function operates against the currently selected table, which is assumed
* to be the Build Master table to be deduped.  Any current FILTER is respected.

function deduflag
* takes 2 optional arguments and returns a numeric count of records flagged (< 0 for error cases).

lparameters deduby_arg, ndays_arg		&& 2 optional arguments
* deduby_arg:				name of the method of matching to be used:
*							"Address", "Address+Days", or "MapRef".
*							If omitted or empty, use "Address" as the default.
* ndays_arg:				(applies only for "Address+Days" matching)
*							Specifies the maximum difference in transaction
*							dates for records to be considered possible dupes.
*							Defaults to DDNO_DAYS (currently = 7) if omitted.

local deduby, dbfname, prevtalk, prevdel, prevselec, prevorder, prevpos, ;
	ndays, result

* include macro definitions of deduper-related constants
#include deduincl.h

* argument processing
deduby = iif(empty(m.deduby_arg), DDMM_SITUS, m.deduby_arg)	&& default Match Method

* argument processing
ndays = iif(empty(m.ndays_arg), DDNO_DAYS, m.ndays_arg)	&& default Match Method

* preliminary error checking
dbfname = dbf()				&& path/name or the bldmstr file to be deduped

if empty(m.dbfname)			&& no current Build Master table selected
	wait window "DEDUFLAG: " + DDEM_NOBLD + "."
	return DDEC_NOBLD		&& failure
endif

if used(DDAL_QUERY)			&& reserved alias is already in use
	wait window "DEDUFLAG: " + DDEM_INUSE + "."
	return DDEC_INUSE		&& failure
endif

* preliminary inits
prevtalk = set('talk')		&& save initial verbosity setting
prevdel = set('deleted')	&& save initial deletion hiding state
set deleted on				&& hide deleted records from SQL
prevselec = select(0)		&& save initial Build Master selection state
result = 0					&& init. counter of potential dupes flagged

* give an in-progress indication
wait window nowait "Flagging potential dupes.  Please wait..."
set talk on					&& enable query/replace progress indications in status bar

* invoke SQL query to identify candidate duplicates
do case								&& dispatch according to the method of matching

case m.deduby == DDMM_SITUS		&& match on Situs address fields

* Firstly select duplicate values
	select upper(ccode+town+street+str(stnum,4)+stnumext+unit) as situs, ;
		count(*)  as cntSitus;
		from (m.dbfname) ;
		group by 1 ;
		having cntSitus>1 ;
		order by 1 into cursor (DDAL_QUERY)

* Then update the BldMstr table

	update (dbfname) set dedupeflag= DDFC_POSS ;
		where upper(ccode+town+street+str(stnum,4)+stnumext+unit) ;
		in (select situs from (DDAL_QUERY))

	m.result=_tally && Number of updated records

case m.deduby == DDMM_SITDA		&& match on Situs fields + date within range

** Create temproral cursor myCursor
	select upper(ccode+town+street+str(stnum,4)+stnumext+unit) as situs, recnum, date ;
		from (m.dbfname) ;
		where dedupeflag=DDFC_NONE ;
		into cursor myCursor order by recnum, situs, date

** Prepare final cursor by self-join
	select myCursor.* ;
		from myCursor inner join myCursor my on myCursor.recnum<>my.recnum;
		and myCursor.situs=my.situs  ;
		and abs(myCursor.date - my.date) <= m.ndays ;
		group by myCursor.recnum ;
		order by 1 into cursor (DDAL_QUERY)

* check for null query result case - really just a minor optimization, but what the hell
	if m.result >= 0					&& no errors yet
		result = reccount(DDAL_QUERY)	&& count of records that satisfied the dupes query
	endif

* update the DedupeFlag values for all candidate dupes, respecting any current FILTER setting
	if m.result > 0									&& there are possible dupes to be flagged
		result = 0									&& re-init counter, to get the real filtered count

* establish RELATION from query into updatable cursor on Build Master table
		prevorder = order(m.prevselec)				&& save initial Build Master ordering state
		prevpos = possave(m.prevselec)				&& save initial Build Master record position
		set order to recnum in (m.prevselec)		&& switch to ordering for RELATION
		set relation to recnum into (m.prevselec)	&& join from query into updatable cursor

* look over records in query result (normally this shouldn't be very large)
		scan for not eof(m.prevselec)				&& loop over the query result set (joined to updatable cursor)
			replace dedupeflag with DDFC_POSS in (m.prevselec)	&& update possible dupe flag
			result = m.result + 1					&& count of possible dupes actually flagged in this call
		endscan

* post-loop cleanups
		set order to (m.prevorder) in (m.prevselec)	&& restore initial index ordering
		posrest(m.prevpos, m.prevselec)				&& restore initial position
	endif

	if used ('MyCursor')
		use in myCursor
	endif

case m.deduby == DDMM_MAPRF		&& match on MapRef field
* Firstly select duplicate values
	select upper(ccode+town+Mapref) as ccttmap, ;
		count(*)  as cntMapref;
		from (m.dbfname) ;
		group by 1 ;
		having cntMapref>1 ;
		order by 1 into cursor (DDAL_QUERY)

* Then update the BldMstr table
	update (dbfname) set dedupeflag= DDFC_POSS ;
		where upper(ccode+town+Mapref);
		in (select ccttmap from (DDAL_QUERY))

	m.result=_tally && Number of updated records

otherwise							&& unknown method of matching records
	wait window "DEDUFLAG: " + DDEM_UNKME + ": " + m.deduby
	result = DDEC_UNKME				&& set up for failure result
endcase

* clean up and return
set talk &prevtalk					&& restore initial verbosity

if used(DDAL_QUERY)
	if m.result >= 0
		wait window nowait "Flagged " + ltrim(str(m.result)) + " potential duplicates."
	else
		wait clear					&& clear any residual WAIT message
	endif
	use in (DDAL_QUERY)				&& close (and release) the temporary SQL query cursor
else
	wait clear						&& clear any residual WAIT message
endif
select (m.prevselec)				&& restore initial selection
set deleted &prevdel				&& restore initial deletion hiding state
return m.result						&& return count of dupes flagged, or < 0 for abnormal error cases

>Hi!
>
>>Nope, this would not work. Do you want me to explain why or can figure it by yourself? :)
>
> Well, expression should be (MT.ccode+MT.town+MT.street+str(MT.strnum,4)+...), right? I did not tried such king of queries, so will be glad to know exactly what is wrong ;)
>
>>
>>>Still the same idea ;))
>>>
>>>Update MyTable
>>> set flag='D'
>>> where MyTable.IDField IN
>>>(Select MT.IDField
>>> from MyTable MT
>>> where MT.MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>> group by (ccode+town+street+str(strnum,4)+...), MT.IDField
>>> having count(MT.IDField)>1)
>>>
>>>I hope this will work, I'm not sure if we can use exression in GROUP BY list.
>>>
>>>
>>>>>Hi!
>>>>>
>>>>>The same idea. 'Where' will filter out records for count() function.
>>>>>
>>>>>Select IDField, count(IDField) as nCount
>>>>> from MyTable
>>>>> where MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>>>> group by IDField
>>>>> having nCount>1
>>>>
>>>>Hi Vlad,
>>>>
>>>>Actually, I need to mark all duplicate records in myTable (set flag='D'), which have the same address (ccode+town+street+str(strnum,4)+...) and date within the range. How can I do this?
>>>>
>>>>My current solution is quite combersome, so I'm open for the suggestions.
>>>>
>>>>Thanks in advance.
>>>>>
>>>>>>>>I'm switching an index to a primary and am getting a duplicate index message. How can I find the duplicates amoung 1000's without
>>>>>>>>looking through them all?
>>>>>>>
>>>>>>>Select IDField, count(IDField) as nCount
>>>>>>> from MyTable
>>>>>>> group by IDField
>>>>>>> having nCount>1
>>>>>>
>>>>>>This question was asked so often, what it definitely should be in a FAQ.
>>>>>>
>>>>>>BTW, a little more complicated variation of this problem:
>>>>>>Find all duplicates, which have the same Id field and date within date specified +-10.

If it's not broken, fix it until it is.

My Blog

Click here to load this message in the networking platform