How to Get Source Code From HTML Document ? - Level Extreme

How to Get Source Code From HTML Document ?

Message

From

18/02/2004 17:05:30

Greg Moss
Soul prop
Illinois, United States

To

18/02/2004 13:49:47

Thomas Ganss
Main Trend
Frankfurt, Germany

General information

Forum:

Visual FoxPro

Category:

Other

Title:

Re: How to Get Source Code From HTML Document ?

Miscellaneous

Thread ID:

00878010

Message ID:

00878526

Views:

13

Greetings,

I would agree that whether you use the DOM or just pull the source definitely depends on the case. The problem I have with the DOM is that you end up with lots of loops and code to move through the document. That means if the document changes in length, or content gets added to or removed from the page, it can throw off your logic pretty quickly when you're trying to extract a few pieces of data out of a full page.

With regular expressions it typically becomes a one-line solution to get the data you want. Regular expressions can also be more robust if other tables, rows, and extra data get put in the source document, because the regular expression logic is not based on position relative to the entire document. I also find it a lot easier to tweak a regular expression into getting the right result when something breaks. With typical string parsing and the DOM it gets a lot messier, IMO.
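To make the robustness point concrete, here is a minimal sketch in Python (not VFP, just to keep it short and runnable). The page fragments, the "Order total:" label, and the `extract_total` helper are all made up for illustration. The pattern anchors on the label text rather than on which table or row the value sits in, so the same expression still finds the value after an extra table and row are added ahead of it.

```python
import re

# Hypothetical page fragments. The second one has an extra table and an
# extra row inserted ahead of the value we care about.
PAGE_V1 = """
<html><body>
<table><tr><td>Order total:</td><td>$42.50</td></tr></table>
</body></html>
"""

PAGE_V2 = """
<html><body>
<table><tr><td>Banner</td><td>Spring sale</td></tr></table>
<table>
  <tr><td>Items:</td><td>3</td></tr>
  <tr><td>Order total:</td><td>$42.50</td></tr>
</table>
</body></html>
"""

# Anchor on the label text, not on table or row position.
TOTAL_RE = re.compile(r"Order total:</td>\s*<td>\$([\d.]+)</td>", re.IGNORECASE)


def extract_total(html):
    """Return the order total as a string, or None if the label is not found."""
    match = TOTAL_RE.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_total(PAGE_V1))
    print(extract_total(PAGE_V2))
```

The pattern should largely carry over to whatever regular expression library you call from VFP, though escaping and option names differ between engines. It still breaks if the label text or the cell markup itself changes, so it is only as stable as the bit of HTML you anchor on.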

I find the DOM great if I need to add content or do programmatic edits on the document. I find the DOM less effective when I simply need to get specific data from the document to use in a database or another process. There was a time when I used the DOM to get at the main segment I needed to work on and then used regular expressions to get the data into final form. But once my experience with regular expressions improved and I learned just how powerful they were, the DOM was just getting in the way and adding extra code to my parsing. Now I almost always just grab the source directly and write the regular expression needed to find the data I want.

IMO, regular expressions are vital to any serious parsing problem, particularly if you are working with multiple result sets in your data. Regular expressions take all the matches and put them nicely into an array for you with one line of code. With VFP alone there is no way to grab multiple result sets from a source without writing loops and getting into complex programming structures.
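As a rough illustration of the "all the matches in an array" point, here is a small Python sketch. The HTML, the class names, and the `extract_rows` helper are assumptions invented for the example, not anything from a real page. A single `findall` call returns every name/price pair as a list, with no hand-written scanning loop over the source.

```python
import re

# Hypothetical source with several repeating result rows.
HTML = """
<table>
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.00</td></tr>
  <tr><td class="name">Doohickey</td><td class="price">3.50</td></tr>
</table>
"""

# Two capture groups: the product name and the price.
ROW_RE = re.compile(
    r'<td class="name">(.*?)</td>\s*<td class="price">([\d.]+)</td>'
)


def extract_rows(html):
    """Return a list of (name, price) tuples, one per matching row."""
    # findall returns one tuple of captured groups per match.
    return [(name, float(price)) for name, price in ROW_RE.findall(html)]


if __name__ == "__main__":
    for name, price in extract_rows(HTML):
        print(name, price)
```

The list comprehension is only there to convert the prices to numbers. `ROW_RE.findall(HTML)` on its own already hands back the whole result set.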

For simple parsing that isn't integral to the application I don't bother with reg exps. But if the project is revolving around data extraction and parsing, reg exps are the only way to go, IMO.

The ease and power of regular expressions for finding and parsing text just can't be matched by traditional programming languages. That's why regular expressions were developed in the first place.

Greg

>Hi,
>>You can also use the Internet Transfer Control as well. I've used it before to pull source
>If you are already used to it, why not. But since it is only an ActiveX wrapper around WinInet, I think it is safer to leave that step out if nothing exists yet. There are some VFP wrappers around WinInet, which I recommend, because even with an ADSL or cable connection your PC will not be the bottleneck.
>
>> and then use regular expressions (there is a class library for this) to parse out the HTML.
>>I have found this to be faster and easier than trying to use the document model to get the HTML out that I want.
>
>If you need parsing, MSHTML is a great help: much is already done and you can walk the DOM tree.
>Yes it is slower, but if you anticipate changes on the web site, best performance may be secondary to ease of change. Depends on the actual case IMHO.
>
>>Once you get your brain around regular expressions you will be spoiled and never want to use VFP to parse strings.
>
>Possibly...
>
>regards
>
>thomas
