ITextSharp to get only the text
Message
From
11/11/2011 19:09:59
 
 
To
All
General information
Forum:
ASP.NET
Category:
Third party products
Title:
ITextSharp to get only the text
Environment versions
Environment:
VB 9.0
OS:
Windows 7
Network:
Windows 2003 Server
Database:
MS SQL Server
Application:
Web
Miscellaneous
Thread ID:
01528712
Message ID:
01528712
It appears that the PDF I am working with, despite being displayed as a form, has not been constructed with form fields. Thus, I cannot use the iTextSharp PdfReader.AcroFields.Fields.Keys approach to get the fields. I tried another document from the net, which is a real form, and I was able to get all 29 of its fields along with their names.
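
For reference, the field lookup described above can be sketched roughly as follows. This is a minimal sketch, not the exact code I ran: the module name and the "form.pdf" path are placeholders, and the key is typed as Object on the assumption that this keeps the loop valid whether Fields is a Hashtable or a generic dictionary in the iTextSharp version at hand.

```vb
Imports System
Imports iTextSharp.text.pdf

Module ListPdfFields
    Sub Main()
        ' "form.pdf" is a placeholder path, not a file from this thread.
        Dim loReader As New PdfReader("form.pdf")

        Try
            ' Each key is a fully qualified field name.
            ' A document that only looks like a form yields no keys here.
            For Each loName As Object In loReader.AcroFields.Fields.Keys
                Console.WriteLine(loName.ToString())
            Next
        Finally
            loReader.Close()
        End Try
    End Sub
End Module
```

With a real form, this prints one line per field. With the document I am working with, it prints nothing.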

So, this means I will have to keep parsing the document manually to look for specific strings, etc. Until now we were using a PDF2TXT utility, but it doesn't work well with French characters. It converts the document into a TXT file in which we can see the content pretty much as it appears in the PDF. However, because of a lack of support from the company behind it, we have to find an alternative. iTextSharp.dll is perfect if I want to manipulate the content of the PDF, such as extracting all the fields as described above. However, because I will have to parse, I need a plain-text version of that file.

I found the following code, which many people seem to be using, and adjusted it a little:
        ' Get the TXT from a PDF
        Public Function GetTXT() As Boolean
            Dim lcValue As String = ""
            Dim llSuccess As Boolean = False
            Dim lnCounter As Integer = 0
            Dim lnType As Integer = -1
            Dim loByte() As Byte = Nothing
            Dim loPRTokeniser As iTextSharp.text.pdf.PRTokeniser = Nothing
            Dim loStringBuilder As System.Text.StringBuilder = New System.Text.StringBuilder

            ' Reset the value
            cMessage = ""
            cTXT = ""

            Try

                ' For each page
                For lnCounter = 1 To nPage
                    loByte = oPdfReader.GetPageContent(lnCounter)

                    ' If we have something
                    If Not IsNothing(loByte) Then
                        loPRTokeniser = New iTextSharp.text.pdf.PRTokeniser(loByte)

                        While loPRTokeniser.NextToken()
                            lnType = loPRTokeniser.TokenType()
                            lcValue = loPRTokeniser.StringValue

                            ' If this is a string
                            If lnType = PRTokeniser.TK_STRING Then
                                loStringBuilder.Append(lcValue)
                                ' These additional tests add whitespace to the output string:
                                ' a "-600" number (presumably a wide kerning gap inside a TJ array)
                                ' and the TJ operator itself are both treated as a word break.
                            ElseIf lnType = 1 AndAlso lcValue = "-600" Then
                                loStringBuilder.Append(" ")
                            ElseIf lnType = 10 AndAlso lcValue = "TJ" Then
                                loStringBuilder.Append(" ")
                            End If

                        End While

                    End If

                Next

                ' Store the extracted text, otherwise cTXT stays empty
                cTXT = loStringBuilder.ToString()
                llSuccess = True
            Catch loError As Exception
                cMessage = loError.Message
            End Try

            Return llSuccess
        End Function
However, the designer complains about TK_STRING. Can someone point me to a reference for the equivalent value of TK_STRING?
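
For comparison, here is a minimal sketch of a different route to the same goal, one that does not touch token types at all. It is not the code above, and I have not verified it against this document. It assumes an iTextSharp 5.x release in which the iTextSharp.text.pdf.parser namespace exposes PdfTextExtractor.GetTextFromPage as a shared method. The module name, function name and "document.pdf" path are placeholders.

```vb
Imports System
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module ExtractPdfText
    ' Returns the text of every page, one page after the other.
    Function GetAllText(ByVal tcPath As String) As String
        Dim loReader As New PdfReader(tcPath)
        Dim loBuilder As New StringBuilder()

        Try
            ' Pages are numbered from 1
            For lnPage As Integer = 1 To loReader.NumberOfPages
                loBuilder.AppendLine(PdfTextExtractor.GetTextFromPage(loReader, lnPage))
            Next
        Finally
            loReader.Close()
        End Try

        Return loBuilder.ToString()
    End Function

    Sub Main()
        ' "document.pdf" is a placeholder path.
        Console.WriteLine(GetAllText("document.pdf"))
    End Sub
End Module
```

Whether the French characters come out correctly this way would still need to be checked against the actual files.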
Michel Fournier
Level Extreme Inc.
Designer, architect, owner of the Level Extreme Platform
Subscribe to the site at https://www.levelextreme.com/Home/DataEntry?Activator=55&NoStore=303
Subscription benefits https://www.levelextreme.com/Home/ViewPage?Activator=7&ID=52