How do I find duplicate records? - Level Extreme

Level Extreme platform

Subscription

Corporate profile

Products & Services

Support

Legal

Français

How do I find duplicate records?

Message

From

19/10/2000 14:56:00

Vlad Grynchyshyn
Soft Serve
Lviv, Ukraine

To

19/10/2000 14:44:11

Naomi Nosonovsky
Wisconsin, United States

General information

Forum:

Visual FoxPro

Category:

Databases,Tables, Views, Indexing and SQL syntax

Title:

Re: How do I find duplicate records?

Miscellaneous

Thread ID:

00431045

Message ID:

00431704

Views:

11

>I always used between function and up till now was sure, that it's optimizable :)
>Ok, let's both check in a Help.

Help does not tells anything about optimization. But I know you should have valid expression that match to index expression to make query optimizable. So, in case of between() function optimizable expression is entire expression with between() function and all its content. In case of between operator optimizable expression is just left part of condition (to left from between operator). I was sure about this so far. Maybe things changed in VFP 6?

>BTW, studing the code I found, that idea is much more complicated, than I thought. I don't have mdate, the records should be marked as duplicate, if they have dates within some range, IOW:
>
>

>#     Address   Date
>1     Ad1       Date()
>2     Ad1       Date()-5
>3     Ad1       Date()-20
>4     Ad2       Date()-25
>5     Ad2       Date()-30
>6     Ad2       Date()-1000
>

>
>Records 1 and 2 are duplicates.
>Records 4 and 5 are duplicates.
>
>See the new logic?

Yep. But when you group by address and date in the first query, you eliminate all duplicates in the same date. Is it correct? I meant when you have 2 records with the same date and address, and these are only records with that address, these two records will not be marked as duplicated.

>
>Anyway, I just wrote this code and it seems to run quite fast.
>
>

>* Firstly create address+date field
>     select  ;
>         upper(ccode+town+street+str(stnum,4)+stnumext+unit) as situs, date ;
>          from (m.dbfname) ;
>          group by 1,2 ;
>          order by 1,2 into cursor myCursor
>
>** Prepare final cursor by self-join
>     select myCursor.Situs+dtoc(myCursor.date) as SitusDate, myCursor.* ;
>          from myCursor inner join myCursor my on myCursor.situs=my.situs ;
>          and abs(myCursor.date - my.date)           order by 1 into cursor (DDAL_QUERY)
>
>     if used ('MyCursor')
>          use in myCursor
>     endif
>
>* Then update the BldMstr table
>     update (dbfname) set dedupeflag= DDFC_POSS ;
>          where upper(ccode+town+street+str(stnum,4)+stnumext+unit)+dtoc(date) ;
>          in (select SitusDate from (DDAL_QUERY))
>
>     m.result=_tally && Number of updated records

>
>
>>Hi!
>>
>>This will work more quickly that with subquery, when you don't have indexes for optimisation.
>>
>>Little note:
>>Use
>>
>>date between mdate-10 and mdate+10
>>
>>instead of
>>
>>between(date, mdate-10, mdate+10)
>>
>>
>>in where clause. This because between() function will not be optimized, when between operator in where clause optimized.
>>
>>
>>>This discussion gives me this idea:
>>>
>>>1) select upper(...+...) as Situs from myTable group by 1 where between(date, mdate-10, mdate+10) having count(*)>1 into cursor curDups
>>>
>>> update myTable set flag='D' where upper(...) in (select Situs from curDups) and between(date, mdate-10, mdate+10)
>>>
>>> Seems to work (at least written here :)), though I have to check it.
>>>
>>> Thanks a lot for your support.
>>>
>>>>Hi Vlad,
>>>>
>>>>Once again: this update would not do what I want. Can you see? :)
>>>>1) You select correct duplicate addresses in your subquery.
>>>>Not let's imagine this situation: Address1 is the duplicate by my definition (it's duplicate and date within 7)
>>>>

>>>>Record #  Address  Date
>>>>1         Address1 date=mdate-1000
>>>>2         Address1 date=mdate-6
>>>>3         Address1 date=mdate-5
>>>>

>>>>Records 2 and 3 should be marked, while record 1 should not, right?
>>>>You update would mark all of them as duplicates.
>>>>
>>>>BTW, I do have index on address. See my other post with the whole program.
>>>>
>>>>>>You group by ID field! :) You will have then all records from the table :)
>>>>>
>>>>>Sure ;)
>>>>>Than you will not be able to use this query even when you will remove ID field from grouping because ID will be returned incorrectly from subquery. Need another approach:
>>>>>
>>>>>Update MyTable
>>>>> set MyTable.flag='D'
>>>>> where (MyTable.ccode+MyTable.town+MyTable.street+str(MyTable.strnum,4)+...) IN
>>>>>(Select (MT.ccode+MT.town+MT.street+str(MT.strnum,4)+...) as cAddress
>>>>> from MyTable MT
>>>>> where MT.MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>>>> group by cAddress
>>>>> having count(*)>1)
>>>>>
>>>>>Will work quite slowly unless you have index by (ccode+town+street+str(strnum,4)+...)
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>Hi!
>>>>>>>
>>>>>>>>Nope, this would not work. Do you want me to explain why or can figure it by yourself? :)
>>>>>>>
>>>>>>> Well, expression should be (MT.ccode+MT.town+MT.street+str(MT.strnum,4)+...), right? I did not tried such king of queries, so will be glad to know exactly what is wrong ;)
>>>>>>>
>>>>>>>>
>>>>>>>>>Still the same idea ;))
>>>>>>>>>
>>>>>>>>>Update MyTable
>>>>>>>>> set flag='D'
>>>>>>>>> where MyTable.IDField IN
>>>>>>>>>(Select MT.IDField
>>>>>>>>> from MyTable MT
>>>>>>>>> where MT.MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>>>>>>>> group by (ccode+town+street+str(strnum,4)+...), MT.IDField
>>>>>>>>> having count(MT.IDField)>1)
>>>>>>>>>
>>>>>>>>>I hope this will work, I'm not sure if we can use exression in GROUP BY list.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>Hi!
>>>>>>>>>>>
>>>>>>>>>>>The same idea. 'Where' will filter out records for count() function.
>>>>>>>>>>>
>>>>>>>>>>>Select IDField, count(IDField) as nCount
>>>>>>>>>>> from MyTable
>>>>>>>>>>> where MyDateField betwen m.ldDate - 10 AND m.ldDate + 10
>>>>>>>>>>> group by IDField
>>>>>>>>>>> having nCount>1
>>>>>>>>>>
>>>>>>>>>>Hi Vlad,
>>>>>>>>>>
>>>>>>>>>>Actually, I need to mark all duplicate records in myTable (set flag='D'), which have the same address (ccode+town+street+str(strnum,4)+...) and date within the range. How can I do this?
>>>>>>>>>>
>>>>>>>>>>My current solution is quite combersome, so I'm open for the suggestions.
>>>>>>>>>>
>>>>>>>>>>Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>>>>>I'm switching an index to a primary and am getting a duplicate index message. How can I find the duplicates amoung 1000's without
>>>>>>>>>>>>>>looking through them all?
>>>>>>>>>>>>>
>>>>>>>>>>>>>Select IDField, count(IDField) as nCount
>>>>>>>>>>>>> from MyTable
>>>>>>>>>>>>> group by IDField
>>>>>>>>>>>>> having nCount>1
>>>>>>>>>>>>
>>>>>>>>>>>>This question was asked so often, what it definitely should be in a FAQ.
>>>>>>>>>>>>
>>>>>>>>>>>>BTW, a little more complicated variation of this problem:
>>>>>>>>>>>>Find all duplicates, which have the same Id field and date within date specified +-10.

Vlad Grynchyshyn, Project Manager, MCP
vgryn@yahoo.com
ICQ #10709245
The professional level of programmer could be determined by level of stupidity of his/her bugs

It is not appropriate to say that question is "foolish". There could be only foolish answers. Everybody passed period of time when knows nothing about something.

Click here to load this message in the networking platform