>Hi All,
>
>Given a web page which may contain various parts, is there a way to extract the text of what is most likely the page's main purpose? For example, take this page:
>http://news.bbc.co.uk/1/hi/business/10281079.stm . It contains an article on the oil spill but is surrounded by various other stuff: links, adverts, etc. Is there a technique for getting the main article's text? I understand this may not always be exact, but I'm looking for a "good enough" solution.
Many years ago [i.e. when less advertising was used and markup was much simpler]
I used to automate IE as a spider. Even back then a fully automatic way was not possible,
but I built a tree view (TV) of the HTML with some additional info per node (tag name, text length,
current nesting level) and filtering possibilities, like showing only links to get hints where to go next.
That way I could work out the rules to implement for each site, which were just some memo fields...
As sites changed subtly about 4 times per year, I needed a fast way to alter my scripts,
and that meant a tool that helped me "read" the page. Going at it fully automated was beyond my scope.
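
For anyone wanting to try the same idea today, here is a rough sketch of such an outline. It is not my old IE automation; it swaps in Python's standard html.parser, and the names OutlineParser, outline and links_only are made up for illustration. It records tag name, nesting level and text length per element, with a links-only filter on top.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not kept on the open stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Text inside these elements is not visible page text, so it is not counted.
SKIP = {"script", "style"}


class OutlineParser(HTMLParser):
    """Hypothetical helper: collects one row per element, in document order."""

    def __init__(self):
        super().__init__()
        self.rows = []   # dicts with tag, level, text_len, href
        self.stack = []  # indices into self.rows for the currently open elements

    def handle_starttag(self, tag, attrs):
        self.rows.append({
            "tag": tag,
            "level": len(self.stack),
            "text_len": 0,
            "href": dict(attrs).get("href"),
        })
        if tag not in VOID:
            self.stack.append(len(self.rows) - 1)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.rows[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.rows[self.stack[-1]]["tag"] in SKIP:
            return
        n = len(data.strip())
        # Credit the text to every open ancestor, so containers show their total.
        for idx in self.stack:
            self.rows[idx]["text_len"] += n


def outline(html):
    """Return the per-element rows for an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def links_only(rows):
    """Filter the outline down to anchors that actually carry an href."""
    return [r for r in rows if r["tag"] == "a" and r["href"]]


if __name__ == "__main__":
    sample = ('<html><body><div><a href="/a">Home</a></div>'
              '<div><p>The main article text would sit here.</p></div></body></html>')
    for row in outline(sample):
        print("  " * row["level"] + f'{row["tag"]} (text_len={row["text_len"]})')
```

Reading the printed outline, the container with a large text length and few links is usually the candidate for the article body, but the rule for picking it still has to be written per site by hand.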
my 0.22 EUR
thomas