Level Extreme platform
Subscription
Corporate profile
Products & Services
Support
Legal
Français
How do I find duplicate records?
Message
 
 
General information
Forum:
Visual FoxPro
Category:
Databases,Tables, Views, Indexing and SQL syntax
Miscellaneous
Thread ID:
00431045
Message ID:
00431694
Views:
16
Vlad,

I always used between function and up till now was sure, that it's optimizable :)
Ok, let's both check in a Help.

BTW, studing the code I found, that idea is much more complicated, than I thought. I don't have mdate, the records should be marked as duplicate, if they have dates within some range, IOW:
#     Address   Date
1     Ad1       Date()
2     Ad1       Date()-5
3     Ad1       Date()-20
4     Ad2       Date()-25
5     Ad2       Date()-30
6     Ad2       Date()-1000 
Records 1 and 2 are duplicates.
Records 4 and 5 are duplicates.

See the new logic?

Anyway, I just wrote this code and it seems to run quite fast.
* Firstly create address+date field
     select  ;
         upper(ccode+town+street+str(stnum,4)+stnumext+unit) as situs, date ;
          from (m.dbfname) ;          
          group by 1,2 ;
          order by 1,2 into cursor myCursor

** Prepare final cursor by self-join
     select myCursor.Situs+dtoc(myCursor.date) as SitusDate, myCursor.* ;
          from myCursor inner join myCursor my on myCursor.situs=my.situs ;
          and abs(myCursor.date - my.date) <= m.ndays ;
          order by 1 into cursor (DDAL_QUERY)

     if used ('MyCursor')
          use in myCursor
     endif

* Then update the BldMstr table
     update (dbfname) set dedupeflag= DDFC_POSS ;
          where upper(ccode+town+street+str(stnum,4)+stnumext+unit)+dtoc(date) ;
          in (select SitusDate from (DDAL_QUERY))

     m.result=_tally && Number of updated records
>Hi!
>
>This will work more quickly that with subquery, when you don't have indexes for optimisation.
>
>Little note:
>Use
>
>date between mdate-10 and mdate+10
>
>instead of
>
>between(date, mdate-10, mdate+10)
>
>
>in where clause. This because between() function will not be optimized, when between operator in where clause optimized.
>
>
>>This discussion gives me this idea:
>>
>>1) select upper(...+...) as Situs from myTable group by 1 where between(date, mdate-10, mdate+10) having count(*)>1 into cursor curDups
>>
>> update myTable set flag='D' where upper(...) in (select Situs from curDups) and between(date, mdate-10, mdate+10)
>>
>> Seems to work (at least written here :)), though I have to check it.
>>
>> Thanks a lot for your support.
>>
>>>Hi Vlad,
>>>
>>>Once again: this update would not do what I want. Can you see? :)
>>>1) You select correct duplicate addresses in your subquery.
>>>Not let's imagine this situation: Address1 is the duplicate by my definition (it's duplicate and date within 7)
>>>
>>>Record #  Address  Date
>>>1         Address1 date=mdate-1000
>>>2         Address1 date=mdate-6
>>>3         Address1 date=mdate-5
>>>
>>>Records 2 and 3 should be marked, while record 1 should not, right?
>>>You update would mark all of them as duplicates.
>>>
>>>BTW, I do have index on address. See my other post with the whole program.
>>>
>>>>>You group by ID field! :) You will have then all records from the table :)
>>>>
>>>>Sure ;)
>>>>Than you will not be able to use this query even when you will remove ID field from grouping because ID will be returned incorrectly from subquery. Need another approach:
>>>>
>>>>Update MyTable
>>>> set MyTable.flag='D'
>>>> where (MyTable.ccode+MyTable.town+MyTable.street+str(MyTable.strnum,4)+...) IN
>>>>(Select (MT.ccode+MT.town+MT.street+str(MT.strnum,4)+...) as cAddress
>>>> from MyTable MT
>>>> where MT.MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>>> group by cAddress
>>>> having count(*)>1)
>>>>
>>>>Will work quite slowly unless you have index by (ccode+town+street+str(strnum,4)+...)
>>>>
>>>>
>>>>
>>>>>
>>>>>>Hi!
>>>>>>
>>>>>>>Nope, this would not work. Do you want me to explain why or can figure it by yourself? :)
>>>>>>
>>>>>> Well, expression should be (MT.ccode+MT.town+MT.street+str(MT.strnum,4)+...), right? I did not tried such king of queries, so will be glad to know exactly what is wrong ;)
>>>>>>
>>>>>>>
>>>>>>>>Still the same idea ;))
>>>>>>>>
>>>>>>>>Update MyTable
>>>>>>>> set flag='D'
>>>>>>>> where MyTable.IDField IN
>>>>>>>>(Select MT.IDField
>>>>>>>> from MyTable MT
>>>>>>>> where MT.MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>>>>>>> group by (ccode+town+street+str(strnum,4)+...), MT.IDField
>>>>>>>> having count(MT.IDField)>1)
>>>>>>>>
>>>>>>>>I hope this will work, I'm not sure if we can use exression in GROUP BY list.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>Hi!
>>>>>>>>>>
>>>>>>>>>>The same idea. 'Where' will filter out records for count() function.
>>>>>>>>>>
>>>>>>>>>>Select IDField, count(IDField) as nCount
>>>>>>>>>> from MyTable
>>>>>>>>>> where MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>>>>>>>>> group by IDField
>>>>>>>>>> having nCount>1
>>>>>>>>>
>>>>>>>>>Hi Vlad,
>>>>>>>>>
>>>>>>>>>Actually, I need to mark all duplicate records in myTable (set flag='D'), which have the same address (ccode+town+street+str(strnum,4)+...) and date within the range. How can I do this?
>>>>>>>>>
>>>>>>>>>My current solution is quite combersome, so I'm open for the suggestions.
>>>>>>>>>
>>>>>>>>>Thanks in advance.
>>>>>>>>>>
>>>>>>>>>>>>>I'm switching an index to a primary and am getting a duplicate index message. How can I find the duplicates amoung 1000's without
>>>>>>>>>>>>>looking through them all?
>>>>>>>>>>>>
>>>>>>>>>>>>Select IDField, count(IDField) as nCount
>>>>>>>>>>>> from MyTable
>>>>>>>>>>>> group by IDField
>>>>>>>>>>>> having nCount>1
>>>>>>>>>>>
>>>>>>>>>>>This question was asked so often, what it definitely should be in a FAQ.
>>>>>>>>>>>
>>>>>>>>>>>BTW, a little more complicated variation of this problem:
>>>>>>>>>>>Find all duplicates, which have the same Id field and date within date specified +-10.
If it's not broken, fix it until it is.


My Blog
Previous
Next
Reply
Map
View

Click here to load this message in the networking platform