Comparing 2 tables; getting list of missing records
Message
From
07/08/2005 09:18:38
To
07/08/2005 07:57:36
General information
Forum:
Visual FoxPro
Category:
Databases, Tables, Views, Indexing and SQL syntax
Environment versions
Visual FoxPro:
VFP 9
Miscellaneous
Thread ID:
01037464
Message ID:
01039147
Views:
24
Hi Olaf!

Great stuff!!! My comments inline.

> Another issue I'm now trying to resolve is manually broken words, eg "re-ply". And I caught some abbreviations like "zB" (which corresponds to eg), that would be written "z.B." and I stripped off the points. That's also an issue when detecting sentence ends. An abbreviation could be wrongly interpreted as such.

Dealing with manually broken (hyphenated) words is a problem because, at least in English, people often join individual words into bigger compound words using the same dash character ("-") that they use to split words at line ends. However, I've found that most English-based electronic documents (except PDF) rarely include hyphenated (split) words, and that split words appear to be more a legacy of hard copy output than a standard practice of contemporary publishers. So, given that assumption, I think it might be reasonable to treat dashes as word separators rather than as markers of word segments that need to be recombined into a single word. Do you think this logic could apply to German text as well?
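To make the "dash as separator" idea concrete, here is a minimal sketch in Python (not VFP, just for illustration). The function name `split_words` and the letter-run pattern are my own assumptions, not part of any existing spellchecker. Note that it also splits on apostrophes and digits, which a real tokenizer would probably want to handle separately.

```python
import re

# Assumption: a "word" is any run of Unicode letters. Dashes, periods,
# digits and whitespace all act as separators.
LETTER_RUN = re.compile(r"[^\W\d_]+")

def split_words(text):
    """Return the words in text, treating "-" as a separator (hypothetical helper)."""
    return LETTER_RUN.findall(text)

# Compound and split words both come out as their individual parts.
print(split_words("a well-known re-ply"))
```

With this rule "well-known" and "re-ply" are handled identically: each half is checked on its own, so a split word like "re-ply" would still be flagged ("ply" passes, "re" may not), which is the cost of not trying to recombine segments.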

As for words with embedded periods: my suggestion is to treat these as abbreviations and not spellcheck them at all.

A tougher question raised by your example is the issue of mixed-case (vs. proper-case or upper-case) words like "zB". We have few mixed-case words in the English language (at least that I can think of). Most of our mixed-case words are product names like "TeX" rather than words native to our language.

UPDATE: I just checked how Word 2003 spellchecks text and for US English came up with the following observations:

1. Word treats words with embedded periods as spellcheckable content (it does not ignore them as I suggested)

2. Word correctly recognizes mixed-case words, differentiating TeX (no error) from MooN (error). So Word is storing full case information (lower, proper, upper???, mixed) in its word lists.

3. Word attempts to spellcheck split words (words split with a hyphen/dash) as individual words even if the word that follows the dash is separated by whitespace (indicating that the word was probably a split word vs. individual words).
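Observation 2 suggests one way we could handle case in our own word list. The following Python sketch is only my guess at a workable scheme, not a description of how Word actually does it. The names `case_class`, `is_accepted` and the sample `known` dictionary are hypothetical.

```python
def case_class(word):
    """Classify a word as 'lower', 'upper', 'proper' or 'mixed'."""
    if word.islower():
        return "lower"
    if word.isupper():
        return "upper"
    if word[:1].isupper() and word[1:].islower():
        return "proper"
    return "mixed"

def is_accepted(word, known):
    """Check a word against a list keyed by lowercase, storing exact spellings.

    Assumption: an entry stored in lower case may also appear in proper
    or upper case; any other stored spelling must match exactly.
    """
    forms = known.get(word.lower())
    if forms is None:
        return False
    if word in forms:
        return True
    return case_class(word) in ("proper", "upper") and word.lower() in forms

# Hypothetical sample word list.
known = {"tex": {"TeX"}, "moon": {"moon"}}
print(is_accepted("TeX", known), is_accepted("MooN", known))
```

Under these assumptions "TeX" passes because its exact spelling is stored, "Moon" and "MOON" pass because "moon" is stored in lower case, and "MooN" fails.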

> Now I could pass all those words to MS Word and spellcheck them there, but that would perhaps be a legal issue, although I don't take the words from MS Word then, I only check them.

IANAL (I am not a lawyer) but I can't see how spellchecking a word list violates a spellchecker's license. Test case: What if you wrote a utility that attempted to write articles by randomly selecting text from your word list and then spellchecked those articles? If your articles included every word from your word list, and you used your spellchecked articles as the source for your word list, would that also be a violation of MS's spellchecker license? BTW: My guess is that your word list will have many words that MS Word doesn't have in its spellchecking dictionary, so your spellchecked word list and MS Word's spellcheck word list will still be substantially different.

> Concerning the legal issue of this the question also is, if I hurt copyrights of the Spiegel. I'm not copying their articles, only the words.

Again, I can't see how deriving word lists from Spiegel content would be a violation of their copyright. Test case: What if you read each article and manually created a list of unique words from each article? Would such a word list be a violation of copyright? I don't think so. If it was, how could anyone build a word list without violating someone's copyright?

> And even the word "Spiegel" is a quite general word like windows is (Spiegel actually means mirror - by the way Spiegel compares more to the Time magazine than the Daily Mirror), nevertheless there are some trademark rights on (or to?) them. If it's not those general purpose names, then those trademark product names like Coca Cola or Mercedes Benz. They need to be on a black list and it's quite hard to find and filter them out.

I've never heard of any words being owned by another party and not available for use in general writing or electronic storage. The worst case scenario I could imagine would be the need to indicate trademark or copyright ownership of a word via a (tm) or (c) or country-specific notation. However, even that would be difficult (and perhaps unnecessary) because, using your examples of Coca Cola and Mercedes Benz, our word list would be storing these trademark phrases as 4 separate words (benz, coca, cola, mercedes) vs. 2 trademarked names.
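To illustrate that last point, here is a small Python sketch of how a unique-word list built from running text ends up holding the individual words rather than the trademarked phrases. The function name `unique_words` is hypothetical, and lowercasing every entry is an assumption made just for this example.

```python
import re

def unique_words(text):
    """Return the sorted set of lowercased words found in text (hypothetical helper)."""
    return sorted({w.lower() for w in re.findall(r"[^\W\d_]+", text)})

# The two product names dissolve into four ordinary list entries.
print(unique_words("Coca Cola, Mercedes Benz"))
```

Nothing in the resulting list records which words appeared next to each other, so the phrases themselves are not stored.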

Bottom line: I'm looking forward to hearing the thoughts of those wiser than me, but from a "gut check" perspective, I don't see any trademark or copyright violations to your approach.

Thanks for contributing all these ideas! Awesome!

Danke schön,

Malcolm
Malcolm Greene
Brooks-Durham
mgreene@bdurham.com