Plateforme Level Extreme
Abonnement
Profil corporatif
Produits & Services
Support
Légal
English
Finding all Duplicate Records
Message
De
26/02/1998 12:36:25
Cetin Basoz
Engineerica Inc.
Izmir, Turquie
 
 
À
25/02/1998 20:52:25
Information générale
Forum:
Visual FoxPro
Catégorie:
Codage, syntaxe et commandes
Divers
Thread ID:
00078994
Message ID:
00081437
Vues:
46
>A while back I asked for help in finding duplicate records. I've worked this job for a week and I have better information to give. This is an auditing application. First, there are several million records in the table. The auditors want to look for records where several of the fields either match or maybe values were transposed (like vendor number and transaction number or invoice number). They want to search the table an only return a set of records where there are not just unique records (one of everything), but where two or more records meet the criteria. Suggestions?

Hi Tim,
If I understood well, there is no key field you can check for uniqueness and values could be transposed in some 3 fields. Well then you could use this prg (a modified version of my similar trouble for 3 integer fields) :
*Dupfind.prg
clear all
public array aRearrange[3]
start = seconds()
*use dups
*index on mRearrange(x,y,z) tag DupCheck

select mRearrange(x,y,z) as uVal   ;
	from dups ;
	order by 1 ;
	where .t. ;
	into cursor pass1
? "Pass1 ended in :",(seconds() - start)/60, " minutes."

select uVal from pass1 ;
	having cnt(*) > 1 ;
	group by uVal ;
	where .t. ;
	into cursor pass2
use dbf("pass2") in 0 again alias duplicates
use in "pass2"
use in "pass1"
? "Pass2 ended in :",(seconds() - start)/60, " minutes."

*!*	select dups
*!*	set order to tag dupcheck
*!*	select duplicates	
*!*	set relation to uVal into dups
*!*	set proc to dupfind
*!*	select dups
*!*	brow
*!*	select duplicates
*!*	brow


function mRearrange
lparameters xx1, yy1, zz1
arearrange[1] = xx1
arearrange[2] = yy1
arearrange[3] = zz1
=asort(aRearrange)
return bintoc(aRearrange[1])+bintoc(aRearrange[2])+bintoc(aRearrange[3])
-If fields are not integers (char) then you could convert them to integer by val(sys(2007, myfield)).
-If you could find a mathematical expression instead of mRearrange() you could use it (ie: If you could say product of x,y,z and sum of x,y,z represent a unique value then you could do "select x*y*z as prod, x+y+z as sumof .." and then you could group by prod, sumof). I tried but couldn't find one.
-Beware when indexing and opening the "dups" (original table) you should "set procedure to dupfind" or keep mRearrange in stored procs.
-Well this is a time consuming method. Test results with my machine (K6-200MMX with 64 Mb RAM) :
Record count is 1 million, simple 3 integer fields, 12Mb
Indexing with mRearrange is about 2 Mins
First SQL completes in approx. 2-3 Mins
All end (from start) in about 3-3.5 Mins
So with a P133, 16 Mb RAM and 10 Million records I expect it to finish in about 2-4 Hours depending on the physical size of table. But I still write it here because I think I saw somewhere in your thread as 36 Hours.
Hope it helps
Cetin
Çetin Basöz

The way to Go
Flutter - For mobile, web and desktop.
World's most advanced open source relational database.
.Net for foxheads - Blog (main)
FoxSharp - Blog (mirror)
Welcome to FoxyClasses

LinqPad - C#,VB,F#,SQL,eSQL ... scratchpad
Précédent
Répondre
Fil
Voir

Click here to load this message in the networking platform