>>That's a good idea. It makes sense to start from the table with the minimum number of required deletions. As I mentioned before, exact result, i.e. absolutely correct proportion is not reachable, so one should try to get the best randomization.
>
>I have the feeling there is a mathematically best solution. Because the way records will be deleted affects real business money, and because a solution like yours we programmed left us with differences from the optimal solution in the range of 3-7%, I'm afraid our customers will not be satisfied.
>
>My experience says I have to look in the direction of a recursive algorithm. That's fine, but my experience won't come with that algorithm. It reminds me slightly of the problem of 8 queens on a chess board, relatively easy to solve with recursion.
A recursive algorithm would go from one to the next, or from all to one, either adding one or subtracting one level at each step. I think Ed's solution (with some randomization) would do the trick.
So, in practical terms, you need a few cursors:
C1 counters per table, one record per table, to hold the number of candidates, number of records to delete and number of records to keep.
C2 with the keys of the duplicates, the table they come from, status (done deleting or not)
and one cursor per original table, with duplicates only.
Going through C2, you first look for something to keep from table1. If you've found it, delete it, mark as done in C2 for the whole set, decrease the numbers in C1 (for the tables from which it will be deleted) and increase the counter for kept records (for the table from which it will be accepted). Then go through C2 again, looking for a deletable record from table2 etc. Any table whose counter reaches a max (of records to keep or records to delete) is automatically triggering a "do the rest of the records for this table no matter what". See whether this gets you into a blind alley or not ;).