Get web page text
Message
From
10/06/2010 13:12:16
 
 
To
10/06/2010 11:19:07
General information
Forum:
Visual FoxPro
Category:
Coding, syntax & commands
Environment versions
Visual FoxPro:
VFP 9 SP2
Miscellaneous
Thread ID:
01468306
Message ID:
01468340
Views:
44
>Hi All,
>
>Given a web page which may contain various parts, is there a way to extract the text of what is most likely the page's main purpose? For example, take this page: http://news.bbc.co.uk/1/hi/business/10281079.stm . It contains an article on the oil spill but is surrounded by various other stuff: links, adverts, etc. Is there a technique for getting the main article's text? I understand this may not always be exact, but I'm looking for a "good enough" solution.

Many years ago [i.e. when there was less advertising and much simpler markup]
I used to automate IE as a spider. Even back then a fully automatic approach was not possible,
but I built a TreeView of the HTML with some additional info per node (tag name, text length,
current level) and filtering options, such as showing only links to get hints about where to go next.
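
To make the tree idea concrete, here is a minimal sketch in Python rather than VFP with IE automation, so it is not the original tool. It uses only the standard library's html.parser; the names Node, OutlineBuilder, walk and outline are made up for this illustration. Each node records tag name, nesting level and the amount of text below it, and the optional filter mimics the "show only links" view.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the outline: tag name, nesting level, text length."""
    def __init__(self, tag, depth, attrs=None):
        self.tag = tag
        self.depth = depth
        self.attrs = dict(attrs or [])
        self.text_len = 0
        self.children = []


class OutlineBuilder(HTMLParser):
    """Builds a simple element tree (hypothetical helper, not a library API)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document", 0)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, len(self.stack), attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack[-1].tag in ("script", "style"):
            return  # code and CSS are not readable text
        n = len(data.strip())
        for node in self.stack:
            node.text_len += n  # text counts toward every open ancestor


def walk(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def outline(html, only_tag=None):
    """Return (level, tag, text length) rows, optionally for one tag only."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return [(n.depth, n.tag, n.text_len)
            for n in walk(builder.root)
            if n is not builder.root and (only_tag is None or n.tag == only_tag)]
```

Calling outline(page) lists every element with its level and text length, so the branch holding most of the text stands out; outline(page, only_tag="a") shows only the links.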

That way I could work out extraction rules for each site, which were stored in just a few memo fields.
As sites changed subtly about 4 times per year, I needed a fast way to alter my scripts,
and that meant a tool helping me "read" the page; going at it fully automated was beyond my scope.
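
A rough sketch of what such per-site rules could look like, again in Python with only the standard library and not the original memo-field format. The host news.example.com, the rule layout (tag, attr, value) and the names SITE_RULES, RuleExtractor and extract_main_text are all assumptions made for this example. The point is that the rule is plain data, so when a site changes its markup only the rule entry needs editing.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical rules, one entry per site: which container holds the article.
SITE_RULES = {
    "news.example.com": {"tag": "div", "attr": "id", "value": "story"},
}


class RuleExtractor(HTMLParser):
    """Collects the text found inside the container named by the rule."""
    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.depth = 0          # > 0 while inside the matched container
        self.in_script = False  # True while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_script = True
        if self.depth:
            if tag == self.rule["tag"]:
                self.depth += 1  # nested container of the same tag
        elif (tag == self.rule["tag"]
              and dict(attrs).get(self.rule["attr"]) == self.rule["value"]):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_script = False
        if self.depth and tag == self.rule["tag"]:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and not self.in_script and text:
            self.chunks.append(text)


def extract_main_text(url, html):
    """Return the article text for a known site, or None if no rule exists."""
    rule = SITE_RULES.get(urlparse(url).hostname)
    if rule is None:
        return None
    parser = RuleExtractor(rule)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A site without a rule returns None, which is the cue to look at the page structure by hand and add an entry.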

my 0.22 EUR

thomas