ITextSharp to get only the text

Message
From: Michel Fournier, Level Extreme Inc., Petit-Rocher, New Brunswick, Canada
Date: 11/11/2011 19:09:59
To: All

General information
Forum: ASP.NET
Category: Third party products
Title: ITextSharp to get only the text

Environment versions
Environment: VB 9.0
OS: Windows 7
Network: Windows 2003 Server
Database: MS SQL Server
Application: Web

Miscellaneous
Thread ID: 01528712
Message ID: 01528712
Views: 141

It appears that the PDF I am working with, despite being displayed as a form, was not built with actual form fields. Thus, I cannot use the iTextSharp PdfReader.AcroFields.Fields.Keys approach to get the fields. I tried another document from the net, which is a real form, and I was able to get all 29 fields in it along with their names.
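
For reference, the field lookup that worked on the real form can be sketched roughly as follows. This is only an illustration: the module name PdfFieldSketch, the function name ListFieldNames and the tcPath parameter are placeholders, not code from the application. On a PDF that merely looks like a form, the returned list would be expected to come back empty.

```vbnet
Imports System.Collections.Generic
Imports iTextSharp.text.pdf

Public Module PdfFieldSketch

    ' Return the names of all AcroForm fields found in a PDF.
    ' tcPath is a hypothetical parameter: the full path of the PDF to inspect.
    Public Function ListFieldNames(ByVal tcPath As String) As List(Of String)
        Dim loNames As New List(Of String)
        Dim loReader As New PdfReader(tcPath)

        Try
            ' Each key of AcroFields.Fields is a field name
            For Each loKey As Object In loReader.AcroFields.Fields.Keys
                loNames.Add(loKey.ToString())
            Next
        Finally
            ' Release the reader even if the enumeration fails
            loReader.Close()
        End Try

        Return loNames
    End Function

End Module
```

Checking the Count of the returned list is a quick way to tell a real form from a document that only looks like one.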

So, this means I will have to continue parsing the document manually to look for specific strings, etc. So far we have been using a PDF2TXT utility, but it does not handle French characters well. It converts the document into a TXT file, in which we can see the content pretty much as it appears in the PDF. However, because of a lack of support from the company behind it, we have to find an alternative. iTextSharp.dll would be perfect if I wanted to manipulate the content of the PDF, such as extracting all the fields as described above. However, because I will have to parse, I need a plain-text version of that file.

I found some code like this, which many people seem to be using. I have adjusted it a little:

        ' Get the TXT from a PDF
        Public Function GetTXT() As Boolean
            Dim lcValue As String = ""
            Dim llSuccess As Boolean = False
            Dim lnCounter As Integer = 0
            Dim lnType As Integer = -1
            Dim loByte() As Byte = Nothing
            Dim loPRTokeniser As iTextSharp.text.pdf.PRTokeniser = Nothing
            Dim loStringBuilder As System.Text.StringBuilder = New System.Text.StringBuilder

            ' Reset the value
            cMessage = ""
            cTXT = ""

            Try

                ' For each page
                For lnCounter = 1 To nPage
                    loByte = oPdfReader.GetPageContent(lnCounter)

                    ' If we have something
                    If Not IsNothing(loByte) Then
                        loPRTokeniser = New iTextSharp.text.pdf.PRTokeniser(loByte)

                        While loPRTokeniser.NextToken()
                            lnType = loPRTokeniser.TokenType()
                            lcValue = loPRTokeniser.StringValue

                            ' If this is a string
                            If lnType = PRTokeniser.TK_STRING Then
                                loStringBuilder.Append(loPRTokeniser.StringValue)
                                'I need to add these additional tests to properly add whitespace to the output string
                            ElseIf lnType = 1 AndAlso lcValue = "-600" Then
                                loStringBuilder.Append(" ")
                            ElseIf lnType = 10 AndAlso lcValue = "TJ" Then
                                loStringBuilder.Append(" ")
                            End If

                        End While

                    End If

                Next

                ' Store the extracted text, which was otherwise never assigned
                cTXT = loStringBuilder.ToString()
                llSuccess = True
            Catch loError As Exception
                cMessage = loError.Message
            End Try

            Return llSuccess
        End Function

However, the designer complains about TK_STRING. Can someone point me to a reference for the equivalent of TK_STRING?

Michel Fournier
Level Extreme Inc.
Designer, architect, owner of the Level Extreme Platform
Subscribe to the site at https://www.levelextreme.com/Home/DataEntry?Activator=55&NoStore=303
Subscription benefits https://www.levelextreme.com/Home/ViewPage?Activator=7&ID=52
