Repeated records
From:
03/08/2007 17:53:14

General information
Forum: Visual FoxPro
Category: Databases, Tables, Views, Indexing and SQL syntax

Environment versions
Visual FoxPro: VFP 9 SP1
OS: Windows XP SP2
Network: Windows XP
Database: MS SQL Server

Miscellaneous
Thread ID: 01245292
Message ID: 01245910
Views: 16
Naomi,

>Actually, after re-reading the code there is nothing wrong with the numbers; in fact they tell us how many unique records we had. But maybe I should start with a clean state anyway, e.g. zap after the last method again.

If you want to restructure your benchmark:
a) make 2 or 3 measurements, one for each *significant* test step
(except for the current 3), to see how each test differs under other data distributions.
b) make your version 2 the first one to be tested. Time into variables to get exact
measurements, but before writing out the time for the test, save the information from
the dupe table as well: then you know afterwards the distribution of duplicates
as well as the number of records.

Something along the lines of
SELECT DupCount, COUNT(*) AS DupDist FROM Dupes GROUP BY 1
(see the sketch below, after point c)

c) make each benchmark a function, the whole benchmarking template one function,
and the logging a function; parameterize the table building as well, to get something easily
callable many times to create base tables: set it up once and let it run overnight.
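
Purely to illustrate b) and c) together, a minimal sketch in VFP; RunBenchmark, BenchLog, Dupes and the field names are hypothetical placeholders, not the code from your thread:

PROCEDURE RunBenchmark
    LPARAMETERS tcTestName
    LOCAL lnStart, lnElapsed
    lnStart = SECONDS()
    DO (tcTestName)              && run one benchmark variant by name
    lnElapsed = SECONDS() - lnStart

    * summarize the duplicate distribution before the next run zaps Dupes
    SELECT DupCount, COUNT(*) AS DupDist ;
        FROM Dupes ;
        GROUP BY 1 ;
        INTO CURSOR DupeStats

    IF _TALLY = 0
        * no dupes at all: still log the elapsed time
        INSERT INTO BenchLog (TestName, Elapsed, DupCount, DupDist) ;
            VALUES (tcTestName, m.lnElapsed, 0, 0)
    ELSE
        SELECT DupeStats
        SCAN
            INSERT INTO BenchLog (TestName, Elapsed, DupCount, DupDist) ;
                VALUES (tcTestName, m.lnElapsed, DupeStats.DupCount, DupeStats.DupDist)
        ENDSCAN
    ENDIF
    USE IN DupeStats
ENDPROC

Called for example as RunBenchmark("Version2") for each variant after each freshly built base table, the overnight driver is then just a couple of nested loops over variants and data distributions.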

I'd expect version 1 to always be the best. Since exclusive access is needed in the current version 4 as well,
a PACK might be added as an optional measurement, because the "no deleted recs" state happens only here.
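
If you do add the optional PACK measurement, it needs the table opened exclusively; a minimal sketch (MyTable is a placeholder name):

USE MyTable EXCLUSIVE        && PACK refuses to run without exclusive access
lnStart = SECONDS()
PACK                         && physically removes the deleted records
lnPackTime = SECONDS() - lnStart
USE IN MyTable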

Version 4 should be very fast as long as there are many duplicate records. If there are no dupes,
its relative position in timing against all the other approaches is the worst of all measurements, but
it might still be faster than 2 or 3. But it needs exclusive access as well.

Versions 2 and 3 work on the table in place. Version 2 will have the best relative position
if no duplicates are found: the dupe table is empty, no scan is needed, only the time to build
the distinct dupe table is spent. In your old benchmark EVERY record was duplicated, so HAVING COUNT(*) > 1 could not save
anything. No wonder it shows bad performance on such data. Perhaps version 2 is faster than 3
(maybe even than 4 on large data sets?) if very few dupes are found. I expect version 3 to show
better performance across many distributions, which is the reason to make it the recommendation
for those scenarios forbidding exclusive access for de-duping and where nothing is known about the distribution.
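
Just to make the HAVING point concrete: one possible shape of such an in-place de-dup step; cKeyField, MyTable and the index tag are assumptions for illustration, not your actual version 2/3 code:

* build the table of duplicated key values; if no key occurs more than once,
* Dupes stays empty and the delete scan below never runs
SELECT cKeyField, COUNT(*) AS DupCount ;
    FROM MyTable ;
    GROUP BY cKeyField ;
    HAVING COUNT(*) > 1 ;
    INTO CURSOR Dupes

IF _TALLY > 0
    * in-place de-dup: keep the first occurrence of each duplicated key and
    * mark the rest as deleted - no exclusive access and no PACK required
    SELECT MyTable
    SET ORDER TO TAG cKeyField       && assumes an index tag on the key
    SELECT Dupes
    SCAN
        SELECT MyTable
        SEEK Dupes.cKeyField
        SKIP                         && leave the first match alone
        DELETE REST WHILE cKeyField = Dupes.cKeyField
    ENDSCAN
ENDIF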

regards

thomas