Level Extreme platform
From: 21/11/2010 14:48:44
To: 21/11/2010 10:10:46
General information
Forum: Visual FoxPro
Category: Coding, syntax & commands
Title: Miscellaneous
Thread ID: 01489932
Message ID: 01490043
Views: 50
>>It seems you want a crawling/extracting robot. If you are working from a home line, I recommend going back to automating IE, as many of your problems are more than halfway eliminated by the already parsed state of the page in the IE DOM. For this specific use case you should iterate the links collection, which can be handled as a zero-based array from vfp. As I constructed quite a few of those robots in IE4 times, I know what a dev-time saver the object model can be. The limiting factor at runtime will probably always be the internet connection or the server if you start parallel processes on multicore machines, even if parsing/rendering in IE takes more time than dedicated code.

>Thanks Thomas. Yes, I am writing my first crawler. I am trying to find a way to get to the areas of news sites where they allow people to leave comments. I want to mine the comments for snippets of the writing. The URLs for the comment areas have additional information appended.
>

>Example: Primary URL http://world-news.newsvine.com/
>Example: Secondary URL. http://world-news.newsvine.com/_news/2010/11/21/5502717-report-would-be-plane-bombers-post-attack-details#comments
>
>So I want to know if I can (with VFP and help from the UT) determine the added portion of the secondary URL.
>
>I could also use the crawler above to get a lot of data, but I also want to know what code is used to acquire all the text from a website. I don't want the HTML etc., just the text. With Tore's help I am getting to the primary URL and extracting text snippets. I just need to drill down a bit more.

Hmmmpfh!
This is an area where vfp would still be among the top choices (untyped access for easy COM is possible in .Net 4 as well, but fast turnaround and data integration speak in favor of vfp).
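To make the quoted link-iteration approach concrete, here is a minimal vfp sketch of the IE-automation idea. It assumes IE is installed and COM-registered; the newsvine URL comes from the question, but the "#comments" filter and the wait loop are only illustrative, not code from a tested crawler:

```foxpro
* Minimal IE-automation sketch (assumptions: IE/COM available;
* the "#comments" filter is illustrative for the newsvine example).
LOCAL oIE, oLink, i, cHref
oIE = CREATEOBJECT("InternetExplorer.Application")
oIE.Visible = .T.
oIE.Navigate("http://world-news.newsvine.com/")

* Wait until the page (and its DOM) is fully loaded.
DO WHILE oIE.Busy OR oIE.ReadyState # 4   && 4 = READYSTATE_COMPLETE
    * yield a bit; a real robot would also add a timeout here
    = INKEY(0.2)
ENDDO

* The links collection can be walked like a zero-based array.
FOR i = 0 TO oIE.Document.links.length - 1
    oLink = oIE.Document.links.item(i)
    cHref = oLink.href
    IF "#comments" $ LOWER(cHref)   && keep only comment-area links
        ? cHref
    ENDIF
ENDFOR

* For "just text, no HTML": oIE.Document.body.innerText
* returns the rendered text of the page without markup.
oIE.Quit()
```

The same oIE instance can be reused across pages, which matters because starting IE is far more expensive than navigating it.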

But my robot project was clearly the most-rewritten one of my career: it went from vfp5 with IE3, later IE4, to VB with IE4 (COM ok, rest yuck), to J++ with IE4 (then came the Sun/MS spat, exit J++), to IE5 and vfp6. When COM got better in vfp7, the project went stale...
Frequent heavy changes in page layout also meant fast action was often needed.

Go for an explicitly decoupled design, perhaps a template pattern with lots of hook methods. Delegate/fill those hooks with calls to different strategy/factory calls or factory-generated objects. For instance, when iterating over the links collection in your example, you will need a site-specific method to filter out unnecessary links - and such methods will be similar or sometimes identical except for some properties, but not in the same inheritance tree. Dynamically loading "factory methods" or similar techniques will allow you to keep most of the app running while exchanging small bits to adapt to changed HTML. Decide whether you want to keep your rules in a table-based schema or in an inheritance tree accessed by factory calls - and whether the decision you made 6 months ago is still the best for the current workload.
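A rough sketch of that decoupled shape in vfp, with the crawl loop as the template and the filter/action as hooks (all class and method names here are invented for illustration):

```foxpro
* Template pattern sketch: a base robot with hook methods,
* subclassed (or swapped at runtime) per site.
DEFINE CLASS RobotBase AS Custom
    PROCEDURE Crawl(oDocument)
        LOCAL i, oLink
        FOR i = 0 TO oDocument.links.length - 1
            oLink = oDocument.links.item(i)
            IF THIS.WantLink(oLink.href)     && hook: site-specific filter
                THIS.HandleLink(oLink.href)  && hook: site-specific action
            ENDIF
        ENDFOR
    ENDPROC
    PROCEDURE WantLink(cHref)
        RETURN .T.          && default: take every link
    ENDPROC
    PROCEDURE HandleLink(cHref)
        ? cHref             && default: just show it
    ENDPROC
ENDDEFINE

* One small subclass per site carries only the changed rules.
DEFINE CLASS NewsvineRobot AS RobotBase
    PROCEDURE WantLink(cHref)
        RETURN "#comments" $ LOWER(cHref)
    ENDPROC
ENDDEFINE
```

A factory (e.g. NEWOBJECT() with a class name read from a rules table) can then hand out the right subclass per site, so when a site's HTML changes, only the small filter class is rewritten while the rest of the robot keeps running.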

Go for IE helpers in a big way (I had to write some such helpers myself...), and create a helper form for logging actual manual tries on an IE instance (for instance to find patterns for "usable HREF"). EXPECT to have to partially redesign and rewrite often.
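The logging half of such a helper can be as small as a table plus one procedure; everything here (table and field names) is invented for illustration:

```foxpro
* Tiny logging-helper sketch: record every HREF seen during a
* manual IE session, so "usable HREF" patterns can be analyzed
* later with plain queries. Names are illustrative only.
IF !FILE("hrefs.dbf")
    CREATE TABLE hrefs (visited_at T, href C(254))
ENDIF
USE hrefs IN 0 SHARED

PROCEDURE LogHref(cHref)
    INSERT INTO hrefs (visited_at, href) VALUES (DATETIME(), cHref)
ENDPROC
```

Keeping the log in a table (rather than a text file) means the usual vfp toolset - SELECT, LOCATE, pattern matching with $ and LIKE() - is immediately available for finding the link patterns a site uses.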

Find the current IE-DOM group at MS. If you don't know whose advice to follow: when there is a current answer by Igor Tandetnik, believe him. He was a mixture of Sergey with the explorative urge of Rick thrown in - back when I was offered to write part of a book on automating IE, he knew more about it than I did, and he probably went on to publish something later...

have fun

thomas