Get words and spaces from a string

Message

From

21/02/2011 20:44:11

Al Doman
M3 Enterprises Inc.
North Vancouver, British Columbia, Canada

21/02/2011 17:58:53

David Schlesinger
No Company
Elmwood Park, New Jersey, United States

General information

Forum:

Visual FoxPro

Category:

Coding, syntax & commands

Title:

Re: Get words and spaces from a string

Miscellaneous

Thread ID:

01501103

Message ID:

01501199

>Let me clarify. There are three categories of things I want to accomplish:
>
>1) I wrote the word-parsing function because VFP only has GETWORDNUM() and GETWORDCOUNT(). I think that my code is not efficient, and probably not fast. Also, I'm guessing that others have written generic enhancements to VFP's string parsing functions. If so, I was hoping someone could point me to a library or set of classes. But this part is actually lower priority; item #2 below is more important.
>
>2) I'm really looking for a set of routines, hopefully in VFP, that do advanced string matching. Frank Cazabon pointed me in the right direction with one commercial product in particular, netrics.com. It looks good, but I am waiting on info and I'm guessing it is very expensive. He also pointed me to the Levenshtein algorithm, but that turns out to be only one small part of what I'm trying to accomplish. It seems good for identifying misspelled or similar words. The bottom line is that I want to take one business name string and look for "closest matches" in a list of about 1500 business names and business name aliases.
>
>I'm inclined to think that Google-like searching or full-text searching is not what I want. What I want is much more structured and straightforward: find a close match to a multi-word string in one column of one table. Ideally I'd find someone who has already gone through the "pain" of figuring out an efficient algorithm for searching and re-searching, and who knows the heuristics regarding what types of searches are likely to return better results, e.g. the threshold for "closeness" below which you throw a result away. And ideally something very specialized that knows something about typical business data. For example, in my searches I'm operating under the assumption that matching on multiple words will be better than matching on a single word, so I try to search on as much of the string (as many words) as possible, but if I don't get a match, I don't know what to do next. For example, a lot of the businesses have the "word" & in them. When I match on the single word "&" I get a ton of irrelevant matches, so what should I do: just ignore the word &, or maybe do a specific search on "A & B"? I'm already stripping out apostrophes. There are lots of conditions like these to evaluate. So, ideally I'm looking for functions where somebody has already done the work of applying these common scenarios to business data to come up with good matches.

Thanks for the clarification. I don't have any code samples that could help you on your current path. However I have encountered a couple of alternate approaches. Both of them involve feedback from the user.

1. (specialized) Simple incremental search. If you can assume the users always know the start (the first few letters) of the business name, an incremental search with a grid, listbox or combo can work well. Fifteen hundred businesses is not a lot; on modern hardware an incremental search can be basically instantaneous. Once users are accustomed to how it works, they can zero in on the one they want very quickly.
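
To make the first approach concrete, here is a minimal sketch of the filtering step behind an incremental search. It is written in Python purely as a language-neutral illustration and would need porting to VFP. The function name incremental_matches and the sample entry "Harbor Cafe" are made up for the example. It assumes case-insensitive matching on the start of the name only, and simply rescans the whole list on every keystroke, which should be fine for about 1500 names.

```python
def incremental_matches(names, typed):
    """Return the names that start with what the user has typed so far.

    Matching is case-insensitive; an empty input returns the full sorted list.
    """
    prefix = typed.strip().casefold()
    hits = [n for n in names if n.casefold().startswith(prefix)]
    return sorted(hits, key=str.casefold)


# Hypothetical sample data, partly borrowed from the example further down.
names = [
    "University of Hackensack",
    "Hackensack University Hospital",
    "Harbor Cafe",
    "Hackensack University",
]

# Simulate the list narrowing as the user types more letters.
for typed in ("h", "hac", "hackensack university h"):
    print(repr(typed), "->", incremental_matches(names, typed))
```

In a form, the same function would be called from the text box's change event and its result used to refill the grid or listbox.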

2. (more general) Display detailed search results and let the user choose. Suppose the user searches on "Hackensack University". You might be able to present results like this:

Your search string: [Hackensack University]

Exact Matches: 1
  Hackensack University

Includes phrase [Hackensack University]: 2
  Hackensack University
  Hackensack University Hospital

Includes 2 words [Hackensack][University]: 4
  Hackensack University
  Hackensack University Hospital
  University of Hackensack
  University of Hackensack School of Medicine

Includes 1 word [Hackensack]: xxx
  ...

Includes 1 word [University]: yyy
  ...

It would probably be a good idea to filter each "section" above so that results already appearing in a "higher" section don't appear again in the lower ones.
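
The sectioned output, including the filtering of results that already appeared in a higher section, could be sketched roughly as follows. This is Python used as an illustration only, not tested VFP code. The names words and tiered_search, and the entries "Hackensack Diner" and "Rutgers University", are invented for the example. The word splitter is an assumption of mine: it lowercases, drops apostrophes, and keeps only runs of ASCII letters and digits, so a lone "&" never counts as a word.

```python
import re


def words(text):
    """Lowercase, drop apostrophes, keep only ASCII letter/digit runs (so '&' vanishes)."""
    return re.findall(r"[a-z0-9]+", text.casefold().replace("'", ""))


def tiered_search(query, names):
    """Return [(section label, [names])], best section first, each name listed once."""
    q_words = words(query)
    if not q_words:
        return []
    q_phrase = " ".join(q_words)
    seen = set()   # names already shown in a higher section
    tiers = []

    def add_tier(label, predicate):
        hits = [n for n in names if n not in seen and predicate(words(n))]
        seen.update(hits)
        tiers.append((label, hits))

    add_tier("Exact matches", lambda w: w == q_words)
    # Pad with spaces so the phrase only matches on whole-word boundaries.
    add_tier("Includes phrase [%s]" % q_phrase,
             lambda w: (" " + q_phrase + " ") in (" " + " ".join(w) + " "))
    add_tier("Includes all %d words" % len(q_words),
             lambda w: all(q in w for q in q_words))
    for q in dict.fromkeys(q_words):          # each distinct word, in query order
        add_tier("Includes word [%s]" % q, lambda w, q=q: q in w)
    return tiers


names = [
    "Hackensack University",
    "Hackensack University Hospital",
    "University of Hackensack",
    "University of Hackensack School of Medicine",
    "Hackensack Diner",       # hypothetical entry
    "Rutgers University",     # hypothetical entry
]

for label, hits in tiered_search("Hackensack University", names):
    print("%s: %d" % (label, len(hits)))
    for name in hits:
        print("  " + name)
```

Because each section skips names recorded in seen, "Hackensack University" appears only under the exact matches and "Hackensack University Hospital" only under the phrase section.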

Another way to present the output of the second approach is to assign a numeric "relevance" score to each result, then order the results by descending score. The simple example above, if filtered, might look something like this:

Your search string: [Hackensack University]

100: Hackensack University
 90: Hackensack University Hospital
 50: University of Hackensack
 50: University of Hackensack School of Medicine
 25: xxx entries that contain [Hackensack]
 25: yyy entries that contain [University]

An option for the above would be to give the user a field that says "Only show results with zz% or higher relevance".
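
A rough sketch of the scoring variant, again in Python as an illustration rather than VFP. The score values 100, 90, 50 and 25 are just the ones from the example above, not tuned weights. The function names relevance and ranked, the min_score parameter, and the two extra sample names are hypothetical. The same word-splitting assumption applies: lowercase, apostrophes dropped, only ASCII letters and digits count.

```python
import re


def words(text):
    """Lowercase, drop apostrophes, keep only ASCII letter/digit runs."""
    return re.findall(r"[a-z0-9]+", text.casefold().replace("'", ""))


def relevance(query, name):
    """Score one name against the query using the illustrative 100/90/50/25 scale."""
    q, n = words(query), words(name)
    if not q or not n:
        return 0
    if n == q:
        return 100                                   # exact match
    if (" " + " ".join(q) + " ") in (" " + " ".join(n) + " "):
        return 90                                    # contains the whole phrase
    distinct = set(q)
    matched = sum(1 for w in distinct if w in n)
    if matched == len(distinct):
        return 50                                    # all words, any order
    return 25 if matched else 0                      # only some of the words


def ranked(query, names, min_score=1):
    """Return (score, name) pairs at or above min_score, best first."""
    scored = [(relevance(query, n), n) for n in names]
    kept = [(s, n) for s, n in scored if s >= max(min_score, 1)]
    return sorted(kept, key=lambda pair: (-pair[0], pair[1].casefold()))


names = [
    "University of Hackensack School of Medicine",
    "Hackensack Diner",       # hypothetical entry
    "Hackensack University Hospital",
    "Rutgers University",     # hypothetical entry
    "University of Hackensack",
    "Hackensack University",
]

# "Only show results with 50% or higher relevance"
for score, name in ranked("Hackensack University", names, min_score=50):
    print("%3d: %s" % (score, name))
```

A real scheme would probably want finer gradations, for instance scaling the partial-match score by the fraction of query words found.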

Yet another idea is to look into regular expressions (RegEx). I don't know how to use them (I consider that a shortcoming in my development skills), but I understand they are very powerful for text parsing, and I believe the libraries are highly optimized on most operating systems. RegEx is callable from VFP; I seem to recall threads here from time to time showing how to use them. There are also some results on the Fox Wiki, e.g. http://fox.wikis.com/wc.dll?Wiki~RegExp~VFP .
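
As a small illustration of what a regular expression can do for the original question (getting both the words and the spaces out of a string), here is a sketch using Python's re module. It is not VFP code, and the function name words_and_spaces is made up. The pattern itself uses only \S and \s, which most regex dialects support, but I have not verified it against whichever regex library you would call from VFP.

```python
import re


def words_and_spaces(text):
    """Split text into alternating runs of non-whitespace and whitespace, losing nothing."""
    return re.findall(r"\S+|\s+", text)


sample = "A  & B Deli"          # hypothetical business name with a double space
tokens = words_and_spaces(sample)
print(tokens)

# Joining the pieces back together reproduces the original string exactly.
print("".join(tokens) == sample)
```

Since the whitespace runs are kept as tokens, the caller can decide later whether to collapse them, drop them, or skip tokens such as "&".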

Regards. Al

"Violence is the last refuge of the incompetent." -- Isaac Asimov
"Never let your sense of morals prevent you from doing what is right." -- Isaac Asimov

Neither a despot, nor a doormat, be

Every app wants to be a database app when it grows up
