Environment versions
Network:
Windows 2008 Server
Hi Colin,
I don't know what the best solution is here. What I do know is that, if I were in your situation, I'd certainly try to extract the text content with my current copy of Quick PDF. This extensive API offers a lot in the area it calls "extraction", from GetPageText with its many parameters to more specific text extraction functions.
May I quote the documentation on their GetPageText:
Description - This function provides two different methods
"Using the standard text extraction algorithm:
0 = Extract text in human readable format
1 = Deprecated
2 = Return a CSV string including font, color, size and position of each piece
of text on the page
Using the more accurate but slower text extraction algorithm:
3 = Return a CSV string for each piece of text on the page with the following
format:
Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
The co-ordinates are the four points bounding the text, measured using the
units set with the SetMeasurementUnits function and the origin set with
the SetOrigin function. Co-ordinate order is anti-clockwise with the bottom
left corner first.
4 = Similar to option 3, but individual words are returned, making searching
for words easier
5 = Similar to option 3 but character widths are output after each block of
text
6 = Similar to option 4 but character widths are output after each line of text
7 = Extract text in human readable format with improved accuracy compared
to option 0
8 = Similar output format as option 0 but using the more accurate algorithm.
Returns unformatted lines."
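To give an idea of what you could do with the option 3 output once you have it, here is a minimal Python sketch that splits each CSV line into the fields listed in the documentation above. It does not call the library at all: it only parses a string. The sample line, the `TextBlock` class and the `parse_page_text_csv` function are my own hypothetical names, and I am assuming standard CSV quoting (text containing commas is quoted). I am not sure how the color field is written, so the sketch takes the font from the front, the text, coordinates and size from the back, and treats whatever is left in between as the color.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TextBlock:
    font: str
    color: str
    size: float
    corners: list  # four (x, y) tuples, bottom-left first, anti-clockwise
    text: str


def parse_page_text_csv(raw: str) -> list:
    """Parse option-3 style lines: font, color, size, 8 coordinates, text."""
    blocks = []
    reader = csv.reader(io.StringIO(raw, newline=""), skipinitialspace=True)
    for row in reader:
        if len(row) < 12:
            continue  # skip blank or incomplete lines
        font, text = row[0], row[-1]
        coords = [float(v) for v in row[-9:-1]]  # X1, Y1 ... X4, Y4
        size = float(row[-10])
        color = ",".join(row[1:-10])  # anything between font and size
        corners = list(zip(coords[0::2], coords[1::2]))
        blocks.append(TextBlock(font, color, size, corners, text))
    return blocks


# Hypothetical sample line, made up for illustration only.
sample = '"Helvetica",#000000,12,72,700,150,700,150,712,72,712,"Hello, world"\n'
for block in parse_page_text_csv(sample):
    print(block.font, block.size, block.corners[0], block.text)
```

With the corners in hand you can, for example, keep only the blocks whose bottom-left corner falls inside a given region of the page, which is usually what you need when pulling fields out of a form-like PDF.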
The issue? PDF is not exactly a nice format to extract data from :(
Just my 2 cents, from a satisfied user with no vested interest in this dev shop in Australia :)
Daniel