>A free tool you can try is the minetext command-line tool (http://text-mining-tool.com/TextMiningTool%201.1.42.zip)
>
>Usage:
> minetext <input file>
>
> minetext <input file> <output file>
>
>where:
>
> <input file> - any file with one of the following extensions: pdf, doc, rtf, chm, htm, html
> <output file> - file to which the text mined from the input file is written
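If you want to call the tool from a script rather than by hand, here is a minimal Python sketch based only on the usage text quoted above. It assumes the executable is named `minetext` and is on your PATH. The helper names `build_minetext_command` and `run_minetext` are mine, not part of the tool.

```python
import subprocess
from pathlib import Path

# Input extensions listed in the quoted usage text.
SUPPORTED = {".pdf", ".doc", ".rtf", ".chm", ".htm", ".html"}


def build_minetext_command(input_file, output_file=None, exe="minetext"):
    """Build the argument list for: minetext <input file> [<output file>]."""
    if Path(input_file).suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported input type: {input_file}")
    cmd = [exe, str(input_file)]
    if output_file is not None:
        cmd.append(str(output_file))
    return cmd


def run_minetext(input_file, output_file=None, exe="minetext"):
    # Assumption: the minetext executable is installed and reachable as `exe`.
    # check=True raises CalledProcessError if the tool exits with an error code.
    return subprocess.run(
        build_minetext_command(input_file, output_file, exe), check=True
    )
```

Checking the extension before launching the process gives a clear error in the script itself, instead of relying on whatever the tool reports for an unsupported file.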
As a reference, I would like to add the following tools:
PDF2TXT
Takes a PDF and creates a TXT file. It has been used quite intensively in a major application and it does the job. Parsing is still needed afterwards to manipulate the data.
Aspose
Very large but efficient, and a nice tool to use. Among other things, I have supported an application that used it to convert a PDF into JPG images, so that the images could be shown on the site instead of embedding the PDF.
EvoPdf
Grabs HTML pages and converts them into PDFs.
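To illustrate the parsing step mentioned under PDF2TXT: once you have the plain text, you still have to pull the data out of it yourself. Below is a rough Python sketch, assuming the extracted text contains simple "Label: value" lines. The sample text and the `parse_fields` helper are made up for illustration. Real output will depend on how your PDFs are laid out.

```python
import re


def parse_fields(text):
    """Collect 'Label: value' lines from extracted text into a dict."""
    fields = {}
    for line in text.splitlines():
        # Label is everything before the first colon, value is the rest.
        match = re.match(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# Hypothetical text, standing in for what a PDF-to-text tool might produce.
sample = "Invoice: 1042\nSome free text\nTotal: 99.50\n"
fields = parse_fields(sample)
```

Lines without a colon are skipped, so free-form paragraphs in the extracted text do not end up in the result.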