Get words and spaces from a string


Message

From

01/03/2011 19:44:35

David Schlesinger
No Company
Elmwood Park, New Jersey, United States

24/02/2011 19:32:06

Al Doman
M3 Enterprises Inc.
North Vancouver, British Columbia, Canada

General information

Forum: Visual FoxPro
Category: Coding, syntax & commands
Title: Re: Get words and spaces from a string

Miscellaneous

Thread ID: 01501103
Message ID: 01502316
Yes, I agree that implementing the dictionary would be unappealing grunt work. The link you sent me for the Google Search Appliance didn't work, but I finally got around to looking it up. It looks powerful, but it is expensive. I'll keep it in mind for the future; for the short term, though, I think I have to live without it.

Regarding the scope of the task, budget, and size of the data: my immediate project is "B to B" with a small universe of data. More generally, though, I am working on lots of VFP systems here, and most of the apps are "B to C" with millions of customer name and address records. So I need to look to the future for other systems while I work on this project.

Regarding "real time": yes, you are right, I don't mean real time in the formal sense of that term. I mean "respond quickly", which for this application means sub-second response time. My current matching performs just fine, even with matching on permutations, etc. The only thing that is not fast enough is the Levenshtein algorithm, which takes about 3 seconds to search for a match on one word in the table of 1,500 businesses. Even with fast hardware, I don't think this will be nearly good enough.
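One common way to cut the cost of a Levenshtein pass is to cap the distance you care about, so that most candidates are rejected before the full comparison finishes. The sketch below is in Python rather than VFP, purely for illustration; the function names, the `max_dist` threshold, and the sample words are all hypothetical, and whether this is enough to reach sub-second response on the real table is untested.

```python
def bounded_levenshtein(a, b, max_dist):
    """Return the edit distance between a and b, or None if it exceeds max_dist."""
    # The difference in length is a lower bound on the edit distance,
    # so clearly mismatched words are rejected without any table work.
    if abs(len(a) - len(b)) > max_dist:
        return None

    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        # The smallest value in a row never decreases in later rows,
        # so once it passes the cap the final distance must pass it too.
        if min(curr) > max_dist:
            return None
        prev = curr

    return prev[-1] if prev[-1] <= max_dist else None


def close_words(query, words, max_dist=2):
    """Return the words within max_dist edits of query, closest first."""
    query = query.lower()
    hits = []
    for word in words:
        dist = bounded_levenshtein(query, word.lower(), max_dist)
        if dist is not None:
            hits.append((dist, word))
    return [word for dist, word in sorted(hits)]


candidates = ["Hackensack", "Paramus", "Hackensac"]
print(close_words("Hakensack", candidates))
```

Running the comparison against the list of distinct words rather than against every business name would shrink the work further, since a word that appears in many names is only compared once.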

The rainbow tables thing: I had never heard of that. When I looked it up online, it looked like a specialized technique for cracking passwords, and I didn't see any indication of how it could be used for my purpose here. If you could point me to a resource on how I could implement it, that would be great.

As far as "recording users' sessions" goes, that is exactly what I am doing: building a table of aliases to the master business name, so that next time the system will be able to do an exact match.
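A minimal sketch of such an alias table, written in Python for illustration only. The class name, the normalization rule (lowercase, collapsed whitespace), and the sample strings are assumptions, not a description of the actual system; in VFP the same idea would presumably be a table indexed on the normalized search string.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different inputs match."""
    return " ".join(text.lower().split())


class AliasTable:
    """Maps what users typed to the master business name they finally chose."""

    def __init__(self):
        self._aliases = {}

    def record(self, search_string, chosen_name):
        # Store the alias under its normalized form; a later choice for the
        # same search string overwrites the earlier one.
        self._aliases[normalize(search_string)] = chosen_name

    def lookup(self, search_string):
        # Exact match on the normalized string; None if it was never recorded.
        return self._aliases.get(normalize(search_string))


aliases = AliasTable()
aliases.record("Hakensack  Auto", "Hackensack Auto Body")
print(aliases.lookup("hakensack auto"))
print(aliases.lookup("Paramus Auto"))
```

Overwriting on a repeated search string is the simplest policy; keeping a count per alias and chosen name would be an alternative if users pick different businesses for the same input.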

Regarding the website: I know about this, but I am so busy now that I haven't had time to address it. It's a DotNetNuke website. I set up a trial version of SQL Server 2008, then the trial expired and the database was no longer accessible.

Thanks for all your help! It is extremely useful!

>>Ideally, also, once the phrases and words are evaluated, a lower priority search would be to find all permutations of words or phrases with similar or misspelled words. However, the Levenshtein algorithm I used for this seems way too slow to use in a "real time" application. I am guessing that the next step to improve this would be to have an indexed dictionary of words and common misspellings / variations. Any feedback on this piece would be appreciated as well.
>
>To me, implementing your proposed dictionary sounds like unappealing grunt work. That's why we pay Google minimum wage ;-)
>
>I'm not sure about the scope of your task, so I don't really know what to recommend:
>
>- You said you're working with a list of about 1,500 businesses - is it just the business names? If so, at 100 characters each you're only looking at 150K of text data, which is not much.
>
>- You seem to be spending a fair amount of time on this. What's your budget? Throwing hardware (e.g. fast computer/CPU) at a speed problem can be amazingly cheap. Also, I was quite serious about the Google Search Appliance earlier - if it's within your budget it's a quick way to get world-class search results out of almost anything you decide to feed it - and it may be useful in other ways to your organization, helping its justification. It could be a good investment if it frees up your time for other things.
>
>- Can you explain further what you mean by "real-time"? I assume you don't mean in the robotics/control sense, in which case you probably wouldn't be using VFP on Windows. Do you mean you need it to respond quickly, and if so, how quickly?
>
>Having said all that, I can think of a couple more things that might be interesting to implement:
>
>1. If the material to search is small (or maybe even not so small) you could look into trading off size for speed. One example of this is so-called "rainbow tables", where you pre-compute and store values. Rainbow tables turn computations and unindexed/unoptimized string searches into indexed lookups.
>
>Here's a simple example of how it might work in your situation:
>
>Suppose you have a list of 1,500 business names in Biz.dbf, and 20 of them contain the word "Hackensack". If you don't use rainbow tables, your query looks something like
>

>* Query1:
>SELECT ;
>  BizName ;
>  FROM Biz ;
>  WHERE "Hackensack" $ BizName
>
>* I believe this really simple example can actually be optimized on some backends, but more complicated
>* examples likely won't be
>

>With a rainbow table approach, you would first create a lookup table, Rainbow.dbf, which (ignoring primary keys) has 2 columns: SearchString, and BizName. The table is indexed on SearchString. You populate the table by:
>
>- Determining each unique word amongst all the BizNames in Biz.dbf;
>- Running Query1 for each of those words;
>- and adding the results of each query to Rainbow.dbf.
>
>This will result in a table probably much larger than Biz.dbf. In Rainbow.dbf, there will be 20 rows with SearchString = "Hackensack". Your search query is reduced to
>

>SELECT ;
>  BizName ;
>  FROM Rainbow ;
>  WHERE SearchString = "Hackensack"
>
>* Indexed and fast
>

>The next step would be to extend this to two-word phrases. You determine every possible two-word phrase for each business in Biz.dbf, run those phrases through Query1, and add the results to Rainbow.dbf. This technique can be extended to longer phrases, presence of multiple words, 2 of 3 words etc.
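The population steps described above can be sketched as follows, in Python for illustration only. What is being built is essentially a precomputed lookup table (an inverted index) from words and phrases to business names. Two assumptions differ from the quoted description: "two-word phrase" is taken to mean two consecutive words, and matching is on whole words rather than the substring test that `$` performs in Query1. The function names and sample business names are hypothetical.

```python
from collections import defaultdict


def phrases(name, max_words=2):
    """Yield every run of 1..max_words consecutive words in name, lowercased."""
    words = name.lower().split()
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def build_lookup(biz_names, max_words=2):
    """Precompute phrase -> set of business names containing that phrase."""
    lookup = defaultdict(set)
    for name in biz_names:
        for phrase in phrases(name, max_words):
            lookup[phrase].add(name)
    return lookup


biz = ["Hackensack Auto Body", "Auto Glass of Hackensack", "Elmwood Park Diner"]
lookup = build_lookup(biz)

# Each search is now a single keyed lookup instead of a scan over every name.
print(sorted(lookup["hackensack"]))
print(sorted(lookup["auto body"]))
```

Each business name is stored once per word and once per two-word phrase it contains, which is where the growth in table size comes from.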
>
>Two major disadvantages of rainbow tables:
>
>- they can get very large, very quickly. Ideally they should never be so large that accessing them causes the system to page to disk. For maximum performance, ideally they should fit within the CPU's L1, L2 or L3 cache (where available).
>
>- they work best for static data. If the data being searched change, maintaining the associated rainbow tables can be challenging.
>
>2. Record users' search sessions. Basically, what you do here is record the user's search strings, and the associated search result(s) they eventually choose. You might use this to build a list of misspelled words (where they might initially get zero results) associated with the business names they eventually found. There should be lots of ways you could use that kind of information.
>
>
>
>
>BTW I just tried to access http://www.arete-erp.com/ and got an ASP.Net error page about not being able to connect to SQL Server.
