Most strange corruption ever


28/08/2003 20:10:38

Stacy Violett
Black Mountain
Polson, Montana, United States

22/08/2002 07:59:53

Peter Stordiau
Heart Informatisering B.V.
Barneveld, Netherlands

General information

Forum:

Visual FoxPro

Category:

Databases, Tables, Views, Indexing and SQL syntax

Title:

Re: Most strange corruption ever

Miscellaneous

Thread ID:

00692378

Message ID:

00824423


This does sound similar to problems I have seen. I was at DevCon this year in Palm Springs, asking how to determine in software whether disk caching is enabled in the operating system. Long story short, I ended up at the Microsoft booth describing a problem very similar to what you are witnessing. They pointed me to KB article 817805, which I found very interesting.

If you know of this already, I guess there will be no $2,500.00 prize :(

>Hi there !
>
>Triggered by Pat Chrisco's thread "Missing data records", I'd like to put forward the most intriguing corruption I have ever seen. Over a year ago I posted it elsewhere too, with no useful response whatsoever. It has been on our tails since April 2000, and the day before yesterday our very latest attempt at solving it ... did not.
>
>I could plague you with the several hundred (!) pages of analysis of this problem, but by now it is far easier to describe the symptoms. If anyone recognizes this, please, please let me know.
>
>The following is distilled from several hundred corruption cases.
>
>1.
>The problem shows up as corruption at the end of (random) tables.
>Corruption in this case means: from some point on, the next-to-last block written contains nulls. The nulls start somewhere within a record, and across the hundreds of cases no rule can be found for the position where they start.
>I've done "millions" of calculations with the file offsets, the blocks, etc. No rules.
>So far this is still fairly normal corruption. Now add this to it:
>
>2.
>In 100% of cases it happens at the first attempt to write a new record to the file on a later day than the last time the file was written to. That last write could be several days earlier. But it never happens on the same day the file was last written to.
>This needs some additional explanation:
>Once a record has been written correctly to the (last) block on a given day, all subsequent writes on that day will never go wrong.
>Also, the workstation causing the whole thing won't notice it itself, unless it reads back its own written records.
>However strange it sounds: on day 1 a PC writes a new record to the (last) block, all is okay, and the file is closed; on day 2, at the first write of a new record (overflowing the last block), the next-to-last block gets corrupted. This is 100% certain to be related to the overflow of the block into a new one.
>Is this strange? No, still not that much. But the next point certainly is:
>
>3.
>In 100% of cases, once the corruption is there, a browse shows this:
>After the file is opened without (or with) an index attached, and we SET REFRESH TO 1,1, we can see the lost data appearing and disappearing at the bottom of the file. It appears in various (in)complete forms; IOW, portions of the original data are fed to whichever PC opens the file (this could also be several PCs at the same time).
>[If you recognize this, just stop reading and please let me know.
>If you don't recognize this, you could help me with your skills to approach the problem theoretically, and (please) read further.]
>As it always turns out, only the very last record in the corrupted block never shows its data again, from "some" position onward (read: this portion of the data always shows nulls and never anything else).
>Knowing that the browse is able to show the complete correct data once in a while for the portion before the last record in the block, we developed the skill to recover the data (never mind how). This always works, except for the last record, which is really corrupted.
>This too needs some additional info:
>By "the last record" I mean all of the records that have been written to in the corrupted area. In general that is only the very last one, because that is where the corruption occurs, namely at the overflow of the block into the new one. But if, since that time, another record in the visibly (!) corrupted area has been replaced, that record is definitely corrupted too. All records added later are okay, because it is always the previous block that is corrupted. Summarized: the really corrupted area is always within the very last record of the corrupted block. And because this record overflows into the next block, the beginning of that next block is useless too (but formally not corrupted).
>When the corruption is not discovered, the block ends up somewhere in the middle of the file, because further blocks are added after it.
>In general, loading the original data into the null area (except for the last record, which is really corrupted) can be forced by means of RLOCK(). IOW, the normal means of refreshing the PC's cache manage to get the original data from somewhere. But wait too long, and it's gone again.
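
Since the corruption is tied to a record overflowing from one block into the next, it may help to list which record numbers straddle a block boundary at all. The following is only a minimal Python sketch of that arithmetic, not anything taken from the original analysis: the function names are made up for illustration, and the 4096-byte block size is an assumption (substitute the block size of the volume in question). The header length and record length are the values from the table's own header.

```python
def record_span(index, header_len, record_len):
    """Return (first_byte, last_byte) file offsets of the 0-based record `index`."""
    start = header_len + index * record_len
    return start, start + record_len - 1


def straddles_block(index, header_len, record_len, block_size=4096):
    """True if the record begins in one block and ends in a later one.

    block_size is an assumed value, not something read from the table.
    """
    start, end = record_span(index, header_len, record_len)
    return start // block_size != end // block_size


def straddling_records(record_count, header_len, record_len, block_size=4096):
    """List the 0-based record numbers that cross a block boundary."""
    return [i for i in range(record_count)
            if straddles_block(i, header_len, record_len, block_size)]
```

With a 100-byte header, 300-byte records and 1000-byte blocks, for example, `straddling_records(10, 100, 300, 1000)` returns `[6, 9]` (0-based record numbers).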
>
>4.
>The ever-changing content of the block can be kept "alive" for ever and ever. As long as a browse is open on the file, it keeps doing that. Once we tried it for over a month.
>Note that we (can) do this analysis by renaming the production table and continuing with the renamed copy (ensuring no one else is accessing the file). When we close the file and reopen it several hours later, chances are high that the thing is still alive. But leave it closed and try again the day after, and it's dead: the corrupted situation is frozen in place.
>It's frozen at whatever the contents happened to be, as if looking at the browse and pressing "Save" at a random time.
>
>5.
>While the browse shows the live file changing its contents, no other tool shows this behaviour. I.e., any editor always shows one fixed corrupted state, identical across tools and identical over time.
>We've been at it with sniffers and all, and the only thing we learned from that was that the various Windows versions deal very differently with the BROWSE command. That by itself is something to think about!!
>The helpful thing about the other editors is that they show the real situation for normal analysis, since browse is useless for that (always showing something else). But note what the real situation is: nulls spreading over several records within the corrupted block, NOT over the last record only. So where the he.. could Browse on random PCs obtain the original data from ?????
>BTW, just as any editor shows the null situation as it really is, any type of copy command results in a file with the same contents as seen in the editor. One difference in the resulting file: viewed by means of browse, it is always fixed.
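
The editor view described above can also be produced mechanically by reading the raw bytes of a copy of the table and reporting which records contain NUL bytes. Below is a hedged Python sketch; the function names are hypothetical, and it relies only on the standard DBF header layout (record count at bytes 4-7, header length at bytes 8-9, record length at bytes 10-11, all little-endian). It assumes that a healthy record never contains a NUL byte, which should hold for FoxPro 2.x style tables where fields are stored as text, but not for VFP tables with binary field types (integer, double, currency, datetime), where zero bytes can be legitimate.

```python
import struct


def read_dbf_layout(raw):
    """Parse (record_count, header_len, record_len) from a DBF header."""
    # Bytes 4-7: record count (uint32), 8-9: header length (uint16),
    # 10-11: record length (uint16), all little-endian.
    record_count, header_len, record_len = struct.unpack_from("<IHH", raw, 4)
    return record_count, header_len, record_len


def records_with_nulls(raw):
    """Return [(record_no, offset_in_record)] for records containing a NUL byte.

    record_no is 1-based, like RECNO(); offset_in_record is the position of
    the first NUL byte inside that record. Assumes text-only field storage.
    """
    record_count, header_len, record_len = read_dbf_layout(raw)
    hits = []
    for i in range(record_count):
        start = header_len + i * record_len
        record = raw[start:start + record_len]
        pos = record.find(b"\x00")
        if pos != -1:
            hits.append((i + 1, pos))
    return hits
```

To try it, read a copy of the table with `open(path, "rb").read()` and pass the bytes in. Combined with the header and record lengths, the reported offsets give the exact file position where the nulls begin.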
>
>6.
>It is 100% certain that it occurs only with Novell 4 and 5. Not with 3 (about higher versions we don't know).
>There is no piece of software (except Novell) or hardware that wasn't replaced. Think of the network card, hard disk, motherboard, whole server, client software, backup software ... just everything. But:
>We have just one customer who has it all the time, two customers who had it once, and one customer who had it three times. And ... we have it ourselves all the time. "All the time" means: several times per week, sometimes staying away for a few weeks.
>Similarities between the sites that have it really cannot be found, even with routers etc. taken into consideration. There is one similarity only: Novell 4/5. Also note that the one customer who had it all the time switched to NT (server) about a year ago as the solution, and has never had it since. This was the customer who replaced all the hardware and software without anything helping. All in all it's definitely Novell causing it.
>
>7.
>It occurs in both FPD2.5 and VFP5.
>Please note that the software running the business code uses the same sources, though the VFP5 version loads a bunch of classes on top. Anyway, the DML commands used are exactly the same in both versions. BTW, no SQL is involved, except for an SQL INSERT that does most of the appends.
>
>8.
>An additional note: last year all our servers were stolen three times, which leads me to conclude that the configuration can hardly be what triggers it. IOW, after such a theft you can imagine ending up with all kinds of temporary configurations very different from the originals (with or without mail server, with or without WTS server, with or without web server, etc.). During those three periods we had numerous configurations, and all the time the thing persisted.
>But we have so many other customers with the very same software and similar configurations who never have the problem ...
>
>To wrap up for now: the app subject to the corruption is always the same. Though I suspect some conflict with Novell somewhere, really nothing can be found. IOW, numerous suspicious things were changed, but nothing helped. It can't be the Novell client software either, because one of the changes we made was switching to an MS client (and trying all the various client versions).
>
>While our customers may be regarded as high-transaction systems, this doesn't apply to us ourselves. Furthermore, the tables (i.e. modules) we use ourselves are very different from the ones the customers use.
>The problem really occurs in random tables, and no similarity can be found among the tables where it occurred. Not their record lengths, not their field names or types, nothing.
>
>I am about 100% sure the problem can occur in indexes too, knowing of situations where records could not be found and one minute later they could.
>
>Also note that we have a transaction module that registers exactly which record was written when and by whom, including the contents. I've been bent over this data for ages and can't draw any conclusions from it. Not from the originating PC or user (which is always known, but random), and not from the PC/user where the corruption shows up the next day (random as well).
>
>My own conclusions :
>
>Somewhere within Novell's cache blocks something is hanging around (note that Novell does not recognize this problem and cannot reproduce it at will, just as we cannot). Whether network packets "flying around" are involved? Not sure, but it certainly looks like it (though all the settings concerned were changed too). One thing is 100% certain: overnight something is happening, which makes me think it's Novell's compression facility. But we switched that off (which requires a complete reinstall ...) to no avail.
>This left the backup software (a similarity: ARCserve in all cases), and that is just what we changed the other day. To no avail again.
>But something is happening overnight.
>
>One year ago, when I put this forward elsewhere, I announced a "reward" of $1,000 for the first person to help me out on this. I won't withdraw that, and I'm adding another $1,500 now because it is really worth it.
>The person who receives three stars here can expect the $2,500. Even if I need to bring it personally ...

Thanks,

Stacy

Black Mountain Software, Inc.
