>A while back I asked for help in finding duplicate records. I've worked this job for a week and I have better information to give. This is an auditing application. First, there are several million records in the table. The auditors want to look for records where several of the fields either match or maybe values were transposed (like vendor number and transaction number or invoice number). They want to search the table an only return a set of records where there are not just unique records (one of everything), but where two or more records meet the criteria. Suggestions?
Hi Tim,
If I understood well, there is no key field you can check for uniqueness and values could be transposed in some 3 fields. Well then you could use this prg (a modified version of my similar trouble for 3 integer fields) :
clear all
public array aRearrange[3]
start = seconds()
select mRearrange(x,y,z) as uVal ;
from dups ;
order by 1 ;
where .t. ;
into cursor pass1
? "Pass1 ended in :",(seconds() - start)/60, " minutes."
select uVal from pass1 ;
having cnt(*) > 1 ;
group by uVal ;
where .t. ;
into cursor pass2
use dbf("pass2") in 0 again alias duplicates
use in "pass2"
use in "pass1"
? "Pass2 ended in :",(seconds() - start)/60, " minutes."
function mRearrange
lparameters xx1, yy1, zz1
arearrange[1] = xx1
arearrange[2] = yy1
arearrange[3] = zz1
=asort(aRearrange)
return bintoc(aRearrange[1])+bintoc(aRearrange[2])+bintoc(aRearrange[3])
-If fields are not integers (char) then you could convert them to integer by val(sys(2007, myfield)).
-If you could find a mathematical expression instead of mRearrange() you could use it (ie: If you could say product of x,y,z and sum of x,y,z represent a unique value then you could do "select x*y*z as prod, x+y+z as sumof .." and then you could group by prod, sumof). I tried but couldn't find one.
-Beware when indexing and opening the "dups" (original table) you should "set procedure to dupfind" or keep mRearrange in stored procs.
-Well this is a time consuming method. Test results with my machine (K6-200MMX with 64 Mb RAM) :
Record count is 1 million, simple 3 integer fields, 12Mb
Indexing with mRearrange is about 2 Mins
First SQL completes in approx. 2-3 Mins
All end (from start) in about 3-3.5 Mins
So with a P133, 16 Mb RAM and 10 Million records I expect it to finish in about 2-4 Hours depending on the physical size of table. But I still write it here because I think I saw somewhere in your thread as 36 Hours.
Hope it helps
Cetin