>I've been posed with a problem to solve and I would like to get some input on the solution - any ideas would be greatlu appreciated.
>
>Given the following sample table:
>
>ForiegnKey DataField
>========== ==========
>1 1
>1 2
>1 3
>2 2
>2 3
>2 4
>3 2
>3 4
>
>We assume that the Foreign Key refers to a table of customers, so the problem will refer to "customers,"
>and that the DataField is a product number. We wish to determine the number of "crossovers"
>between any particular value of the data field. In other words, how many customer bought both products 2
>and 4?. In this case, customers 2 and 3 each have records where the data field is 2 and 4, so the answer
>is two.
>
If the number of products is small enough so that it'd be legal to construct a table with each column representing a product and each row a customer, generating a cross-tab, if necessary using the GENXTAB.PRG that comes with older versions of VFP, or the pivot table wizard, could be used; you'd then pose your query against the resulting crosstab, selecting the customers with non-zero values for the products under consideration.
>The conventional approach is to do a SQL self-join, something like the following:
>
>SELE a.DataField DF1,b.DataField df2,COUNT(*) as "TheCount";
> FROM TheTable a JOIN TheTable b ON a.ForeignKey = b.ForeignKey;
> GROUP BY df1,df2
>
>Which yields:
>
>DF1 DF2 TheCount
>=== === ==========
>1 1 1
>1 2 1
>1 3 1
>2 1 1
>2 2 3
>2 3 2
>2 4 2
>3 1 1
>3 2 2
>3 3 2
>3 4 1
>4 2 2
>4 3 1
>4 4 2
>
>To further explain, we can now look at the fifth row and say, "3 customer bought product 2," or look at row
>6 and say "two customers who bought product 2 also bought product 3," or look at row three and say, "only
>one customer who bought product 1 also bought product 3."
>
>The problem is that, when the number of records in the table gets large, this can take days, even if it does
>not crash out Fox or NT.
>
>OBJECTIVE
>Find a way to get the same result as the SQL self-join with linear performance based on the number of records
>in the source table.
>
>ASSUMPTIONS
>1. The table is already sorted on Foreign Key + Data Field
>2. The values for Data Field are known in advance and are stored in another table, usually not more than
>100-200 values
>3. The table can have many millions of records, so creating interim tables by breaking up the SQL into
>two passes risks the 2G limit.
>4. For the doubters, yes the table is indexed and the self-join still takes > 24 hours when indexed.
>
>Thanks