Finding all Duplicate Records

Plateforme Level Extreme

Abonnement

Profil corporatif

Produits & Services

Support

Légal

English

Finding all Duplicate Records

Message

26/02/1998 12:36:25

Cetin Basoz
Engineerica Inc.
Izmir, Turquie

25/02/1998 20:52:25

Tim Lawrence
PCC Airfoils, LLC
Broxton, Georgie, États-Unis

Information générale

Forum:

Visual FoxPro

Catégorie:

Codage, syntaxe et commandes

Titre:

Re: Finding all Duplicate Records

Divers

Thread ID:

00078994

Message ID:

00081437

Vues:

>A while back I asked for help in finding duplicate records. I've worked this job for a week and I have better information to give. This is an auditing application. First, there are several million records in the table. The auditors want to look for records where several of the fields either match or maybe values were transposed (like vendor number and transaction number or invoice number). They want to search the table an only return a set of records where there are not just unique records (one of everything), but where two or more records meet the criteria. Suggestions?

Hi Tim,
If I understood well, there is no key field you can check for uniqueness and values could be transposed in some 3 fields. Well then you could use this prg (a modified version of my similar trouble for 3 integer fields) :

*Dupfind.prg
clear all
public array aRearrange[3]
start = seconds()
*use dups
*index on mRearrange(x,y,z) tag DupCheck

select mRearrange(x,y,z) as uVal   ;
	from dups ;
	order by 1 ;
	where .t. ;
	into cursor pass1
? "Pass1 ended in :",(seconds() - start)/60, " minutes."

select uVal from pass1 ;
	having cnt(*) > 1 ;
	group by uVal ;
	where .t. ;
	into cursor pass2
use dbf("pass2") in 0 again alias duplicates
use in "pass2"
use in "pass1"
? "Pass2 ended in :",(seconds() - start)/60, " minutes."

*!*	select dups
*!*	set order to tag dupcheck
*!*	select duplicates	
*!*	set relation to uVal into dups
*!*	set proc to dupfind
*!*	select dups
*!*	brow
*!*	select duplicates
*!*	brow


function mRearrange
lparameters xx1, yy1, zz1
arearrange[1] = xx1
arearrange[2] = yy1
arearrange[3] = zz1
=asort(aRearrange)
return bintoc(aRearrange[1])+bintoc(aRearrange[2])+bintoc(aRearrange[3])

-If fields are not integers (char) then you could convert them to integer by val(sys(2007, myfield)).
-If you could find a mathematical expression instead of mRearrange() you could use it (ie: If you could say product of x,y,z and sum of x,y,z represent a unique value then you could do "select x*y*z as prod, x+y+z as sumof .." and then you could group by prod, sumof). I tried but couldn't find one.
-Beware when indexing and opening the "dups" (original table) you should "set procedure to dupfind" or keep mRearrange in stored procs.
-Well this is a time consuming method. Test results with my machine (K6-200MMX with 64 Mb RAM) :
Record count is 1 million, simple 3 integer fields, 12Mb
Indexing with mRearrange is about 2 Mins
First SQL completes in approx. 2-3 Mins
All end (from start) in about 3-3.5 Mins
So with a P133, 16 Mb RAM and 10 Million records I expect it to finish in about 2-4 Hours depending on the physical size of table. But I still write it here because I think I saw somewhere in your thread as 36 Hours.
Hope it helps
Cetin

Çetin Basöz

The way to Go
Flutter - For mobile, web and desktop.
World's most advanced open source relational database.
.Net for foxheads - Blog (main)
FoxSharp - Blog (mirror)
Welcome to FoxyClasses

LinqPad - C#,VB,F#,SQL,eSQL ... scratchpad

Répondre

Fil

Voir

Click here to load this message in the networking platform