Get words and spaces from a string
Message
From: 24/02/2011 19:32:06
To: 24/02/2011 09:42:51
General information
Forum: Visual FoxPro
Category: Coding, syntax & commands
Miscellaneous
Thread ID: 01501103
Message ID: 01501746
>Ideally, also, once the phrases and words are evaluated, a lower priority search would be to find all permutations of words or phrases with similar or misspelled words. However, the Levenshtein algorithm I used for this seems way too slow to use in a "real time" application. I am guessing that the next step to improve this would be to have an indexed dictionary of words and common misspellings / variations. Any feedback on this piece would be appreciated as well.

To me, implementing your proposed dictionary sounds like unappealing grunt work. That's why we pay Google minimum wage ;-)

I'm not sure about the scope of your task, so I don't really know what to recommend:

- You said you're working with a list of about 1,500 businesses - is it just the business names? If so, at 100 characters each you're only looking at about 150 KB of text data, which is not much.

- You seem to be spending a fair amount of time on this. What's your budget? Throwing hardware (e.g. fast computer/CPU) at a speed problem can be amazingly cheap. Also, I was quite serious about the Google Search Appliance earlier - if it's within your budget it's a quick way to get world-class search results out of almost anything you decide to feed it - and it may be useful in other ways to your organization, helping its justification. It could be a good investment if it frees up your time for other things.

- Can you explain further what you mean by "real-time"? I assume you don't mean in the robotics/control sense, in which case you probably wouldn't be using VFP on Windows. Do you mean you need it to respond quickly, and if so, how quickly?

Having said all that, I can think of a couple more things that might be interesting to implement:

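1. If the material to search is small (or maybe even not so small) you could look into trading off size for speed. One example of this is what I'll loosely call "rainbow tables" here, where you pre-compute and store values. (Strictly speaking, a rainbow table is a precomputed structure for reversing hashes; what I describe below is closer to a precomputed lookup table or inverted index, but the space-for-time idea is the same.) Used this way, they turn computations and unindexed/unoptimized string searches into indexed lookups.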

Here's a simple example of how it might work in your situation:

Suppose you have a list of 1,500 business names in Biz.dbf, and 20 of them contain the word "Hackensack". If you don't use rainbow tables, your query looks something like:

* Query1:
SELECT ;
  BizName ;
  FROM Biz ;
  WHERE "Hackensack" $ BizName

* I believe this really simple example can actually be optimized on some backends,
* but more complicated examples likely won't be

With a rainbow table approach, you would first create a lookup table, Rainbow.dbf, which (ignoring primary keys) has two columns: SearchString and BizName. The table is indexed on SearchString. You populate the table by:

- determining each unique word among all the BizNames in Biz.dbf;
- running Query1 for each of those words;
- adding the results of each query to Rainbow.dbf.
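
The population steps above could be sketched roughly as follows. This is only a sketch under several assumptions that are not in the original post: VFP 7 or later (for GETWORDCOUNT() and GETWORDNUM()), BizName defined as C(100), a guessed delimiter list, and an open database container (free tables limit field names to 10 characters, and "SearchString" has 12). It also splits each name into words directly rather than running Query1 once per word, so it matches whole words only, whereas the $ test in Query1 also matches substrings.

```foxpro
* Sketch only: populate Rainbow.dbf with one row per (word, business).
* Assumptions: VFP 7+, BizName is C(100), a database container is open
* (needed for the 12-character field name SearchString).
LOCAL lcDelims, lnI, lcWord, lcSeen
lcDelims = " ,.;:/()-" + CHR(9)    && assumed word delimiters

CREATE TABLE Rainbow (SearchString C(30), BizName C(100))
INDEX ON SearchString TAG SearchStr

IF NOT USED("Biz")
    USE Biz IN 0
ENDIF
SELECT Biz
SCAN
    lcSeen = "|"                   && words already stored for this name
    FOR lnI = 1 TO GETWORDCOUNT(Biz.BizName, lcDelims)
        lcWord = GETWORDNUM(Biz.BizName, lnI, lcDelims)
        * Skip a word that repeats within the same business name
        IF NOT (("|" + lcWord + "|") $ lcSeen)
            INSERT INTO Rainbow (SearchString, BizName) ;
                VALUES (lcWord, Biz.BizName)
            lcSeen = lcSeen + lcWord + "|"
        ENDIF
    ENDFOR
ENDSCAN
```

Words longer than 30 characters would be truncated by the assumed field width. Like Query1, this is case-sensitive; wrapping both the stored word and the search string in UPPER() would be one way to relax that. As far as I recall, with SET ANSI OFF a SQL comparison stops at the end of the shorter string, so padding the search value, e.g. WHERE SearchString = PADR("Hackensack", 30), should keep the lookup from also matching longer words that merely start with it.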

This will result in a table probably much larger than Biz.dbf. In Rainbow.dbf, there will be 20 rows with SearchString = "Hackensack". Your search query is reduced to:

SELECT ;
  BizName ;
  FROM Rainbow ;
  WHERE SearchString = "Hackensack"

* Indexed and fast

The next step would be to extend this to two-word phrases. You determine every possible two-word phrase for each business in Biz.dbf, run those phrases through Query1, and add the results to Rainbow.dbf. This technique can be extended to longer phrases, the presence of multiple words, two out of three words, and so on.
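
A rough sketch of the two-word extension, under the same kind of assumptions (VFP 7 or later, a guessed delimiter list, and a SearchString column wide enough to hold a phrase). It takes "two-word phrase" to mean two adjacent words, which is my interpretation rather than something the text above spells out, and it builds the phrases directly from each name instead of running them through Query1.

```foxpro
* Sketch only: add adjacent two-word phrases to Rainbow.dbf.
* Assumptions: VFP 7+, Rainbow.dbf already exists with the columns
* SearchString and BizName, and SearchString can hold a full phrase.
LOCAL lcDelims, lnI, lnWords, lcPhrase
lcDelims = " ,.;:/()-" + CHR(9)    && assumed word delimiters

IF NOT USED("Rainbow")
    USE Rainbow IN 0
ENDIF
IF NOT USED("Biz")
    USE Biz IN 0
ENDIF

SELECT Biz
SCAN
    lnWords = GETWORDCOUNT(Biz.BizName, lcDelims)
    * A name with fewer than two words yields no phrases (loop is skipped)
    FOR lnI = 1 TO lnWords - 1
        lcPhrase = GETWORDNUM(Biz.BizName, lnI, lcDelims) + " " + ;
                   GETWORDNUM(Biz.BizName, lnI + 1, lcDelims)
        INSERT INTO Rainbow (SearchString, BizName) ;
            VALUES (lcPhrase, Biz.BizName)
    ENDFOR
ENDSCAN
```

Because the phrase is rebuilt with a single space, a user's search string would need the same normalization (extra spaces and punctuation collapsed) before the lookup for the two to match.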

Two major disadvantages of rainbow tables:

- They can get very large, very quickly. Ideally they should never be so large that accessing them causes the system to page to disk. For maximum performance, they should fit within the CPU's L1, L2 or L3 cache (where available).

- They work best for static data. If the data being searched change, maintaining the associated rainbow tables can be challenging.

2. Record users' search sessions. Basically, what you do here is record the user's search strings, and the associated search result(s) they eventually choose. You might use this to build a list of misspelled words (where they might initially get zero results) associated with the business names they eventually found. There should be lots of ways you could use that kind of information.
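
One way this logging might look, purely as a sketch: the SearchLog table, its field names, the LogSearch procedure and the sample values are all hypothetical, and a real application would supply its own session identifier.

```foxpro
* Sketch only: SearchLog, its fields and LogSearch are hypothetical names.
IF NOT FILE("SearchLog.dbf")
    CREATE TABLE SearchLog FREE ( ;
        SessionId C(36), ;
        SearchStr C(60), ;
        HitCount  I, ;
        ChosenBiz C(100), ;
        LoggedAt  T)
    INDEX ON SessionId TAG SessionId
ENDIF

* Made-up example: a misspelled search with no hits, then a corrected
* search after which the user picks a result.
DO LogSearch WITH "S1", "Hakensack Plumbing", 0
DO LogSearch WITH "S1", "Hackensack Plumbing", 3, "Hackensack Plumbing Supply"

* Zero-hit search strings paired with what the same session chose later
SELECT DISTINCT Miss.SearchStr, Hit.ChosenBiz ;
    FROM SearchLog Miss ;
    INNER JOIN SearchLog Hit ;
        ON Hit.SessionId = Miss.SessionId ;
    WHERE Miss.HitCount = 0 ;
      AND NOT EMPTY(Hit.ChosenBiz) ;
      AND Hit.LoggedAt >= Miss.LoggedAt ;
    INTO CURSOR Misspellings

PROCEDURE LogSearch
    LPARAMETERS tcSession, tcSearch, tnHits, tcChosen
    * tcChosen is omitted until the user actually picks a result
    INSERT INTO SearchLog (SessionId, SearchStr, HitCount, ChosenBiz, LoggedAt) ;
        VALUES (tcSession, tcSearch, tnHits, ;
                IIF(VARTYPE(tcChosen) = "C", tcChosen, ""), DATETIME())
ENDPROC
```

The Misspellings cursor is only a starting point: a zero-hit string and a later choice in the same session are not necessarily related, so the pairs would want some review or a frequency threshold before being used for suggestions.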




BTW I just tried to access http://www.arete-erp.com/ and got an ASP.Net error page about not being able to connect to SQL Server.
Regards. Al

"Violence is the last refuge of the incompetent." -- Isaac Asimov
"Never let your sense of morals prevent you from doing what is right." -- Isaac Asimov

Neither a despot, nor a doormat, be

Every app wants to be a database app when it grows up