Get web page text
Message
From
10/06/2010 13:12:16
To
10/06/2010 11:19:07
General information
Forum: Visual FoxPro
Category: Coding, syntax and commands
Environment versions
Visual FoxPro: VFP 9 SP2
Miscellaneous
Thread ID: 01468306
Message ID: 01468340
Views: 45
>Hi All,
>
>Given a web page which may contain various parts, is there a way to extract the text of what is most likely the page's main purpose? For example, take this page: http://news.bbc.co.uk/1/hi/business/10281079.stm . It contains an article on the oil spill but is surrounded by various other stuff: links, adverts, etc. Is there a technique for getting the main article's text? I understand this may not always be exact, but I'm looking for a "good enough" solution.

Many years ago (when there was less advertising and much simpler markup)
I used to automate IE as a spider. Even back then a fully automatic approach was not possible,
but I built a TreeView of the HTML with some additional info per node (name, text length,
current level) and filtering options, such as showing only links to get hints on where to go next.

That way I could work out extraction rules for each site, which were stored as just a few memo fields.
As sites changed subtly about four times per year, what mattered was having a fast way to alter my scripts,
and that meant a tool that helped me "read" the page. Going at it fully automated was beyond my scope.
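
For anyone who wants to try the outline idea today, here is a rough sketch. It is written in Python with the standard `html.parser` module rather than VFP driving IE, so it is a stand-in for the approach and not the original tool. The names `OutlineBuilder`, `build_outline`, `links_only` and the `SAMPLE` markup are made up for illustration. It records the name, level and text length of each element and offers a "links only" filter, which is enough to eyeball where the article body probably sits.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not deepen the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Content of these elements is code, not readable text.
NON_TEXT_TAGS = {"script", "style"}


class OutlineBuilder(HTMLParser):
    """Collects one record per element: tag name, nesting level, text length."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # records in document order
        self._open = []   # stack of records for currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"name": tag, "level": len(self._open),
                "href": dict(attrs).get("href"), "text_len": 0}
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self._open.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["name"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        if self._open and self._open[-1]["name"] in NON_TEXT_TAGS:
            return
        # Credit the text to every open ancestor, so containers show totals.
        n = len(data.strip())
        for node in self._open:
            node["text_len"] += n


def build_outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes


def links_only(nodes):
    """The 'show only links' filter: anchors that carry an href."""
    return [n for n in nodes if n["name"] == "a" and n["href"]]


# Hypothetical page fragment, only here to exercise the outline.
SAMPLE = """<html><body>
<div id="nav"><a href="/home">Home</a><a href="/biz">Business</a></div>
<div id="story"><p>Oil spill article text goes here.</p><br></div>
</body></html>"""

for node in build_outline(SAMPLE):
    print("  " * node["level"] + node["name"], node["text_len"])
```

Scanning the printed outline for the deepest container with a large text length is the manual "reading" step. Turning that observation into a stored per-site rule is still left to the person, as described above.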

my 0.22 EUR

thomas