Can anyone clarify PDF markup, please?

SBSP

I'm using PDFsharp to read a PDF document created in Crystal Reports.

I referenced the PDFsharp libraries from VB.NET.

As you can see below, I load the text from the page's content stream into a textbox.
Code:
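        ' Needs these at the top of the file:
        '   Imports PdfSharp.Pdf
        '   Imports PdfSharp.Pdf.IO
        Dim SourceFileName As String = "D:\vbpROJECTS\firstpagetest.pdf"
        Dim Inputdocument As PdfDocument = PdfReader.Open(SourceFileName, PdfDocumentOpenMode.Import)
        Dim TextExtract As String

        'MsgBox(Inputdocument.Pages(0).Contents.Elements.Count)
        ' Pages is zero-based, so Pages(1) is the second page of the document.
        ' This reads the first content stream of that page as raw text.
        TextExtract = Inputdocument.Pages(1).Contents.Elements.GetDictionary(0).Stream.ToString
        TextBox1.Clear()
        TextBox1.Text = TextExtract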

The textbox is then populated with the text below:
Td
(.*/$%%&) Tj
0 -220 Td
(.8*.%%&) Tj
0 -220 Td
(.8*7%%&) Tj
0 -220 Td
(.8*7%%') Tj
0 -220 Td
(.8*;%%&) Tj
0 -220 Td
(.8*-%%&) Tj
0 -220 Td
(.8*-%%') Tj
0 -220 Td
(.8*-%%@) Tj
0 -220 Td
(.8*-%%B) Tj

The actual text is the string in brackets between each Td and Tj operator.
Tj means "show text" as per Adobe's PDF specification (see chapter 9); Td moves the text position.
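
To make that concrete, here is a minimal sketch that pulls the bracketed strings out of a content stream like the one above. It does not use PDFsharp at all; it just runs a regular expression over the raw text. `ExtractTjStrings` is a made-up helper name, and the pattern assumes the strings contain no nested or escaped brackets, which holds for the sample but not for PDFs in general.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TjExtractor
    ' Hypothetical helper: returns the string operand of every "(...) Tj".
    ' Assumption: no nested or escaped brackets inside the strings.
    Function ExtractTjStrings(content As String) As List(Of String)
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(content, "\(([^()]*)\)\s*Tj")
            results.Add(m.Groups(1).Value)
        Next
        Return results
    End Function

    Sub Main()
        ' Two entries taken from the textbox output in the post.
        Dim sample As String = "(.*/$%%&) Tj 0 -220 Td (.8*.%%&) Tj"
        For Each s As String In ExtractTjStrings(sample)
            Console.WriteLine(s)
        Next
        ' Prints:
        ' .*/$%%&
        ' .8*.%%&
    End Sub
End Module
```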

If I copy the markup into Excel, column 1 is the actual text you see when the PDF is open in Adobe Reader, and next to it is the strange encoding (strange to me) :)



[Attachment: pdf.png]

There's a pattern!
D = .
A = *
U = /
T = $
0 = %
1 = &

The double %% gave it away :wtf:
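
As a rough illustration of that pattern, the six pairs can go into a lookup table and be applied to the strings from the stream. This is only a sketch: the table holds just the pairs listed above, it was read off this one document, and `Decode` is a hypothetical helper name. The real mapping belongs to the embedded font (its encoding, or a ToUnicode map if the font has one), so another PDF, or another font in the same PDF, will most likely use different codes.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module GlyphDecoder
    ' Pairs observed in this one document only (encoded character -> visible character).
    Private ReadOnly Observed As New Dictionary(Of Char, Char) From {
        {"."c, "D"c}, {"*"c, "A"c}, {"/"c, "U"c},
        {"$"c, "T"c}, {"%"c, "0"c}, {"&"c, "1"c}
    }

    ' Hypothetical helper: replaces each known code, marks unknown ones with "?".
    Function Decode(encoded As String) As String
        Dim sb As New StringBuilder()
        For Each ch As Char In encoded
            Dim plain As Char
            If Observed.TryGetValue(ch, plain) Then
                sb.Append(plain)
            Else
                sb.Append("?"c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(Decode(".*/$%%&"))   ' DAUT001
        Console.WriteLine(Decode(".8*.%%&"))   ' D?AD001 (the code "8" is not in the table yet)
    End Sub
End Module
```

The "?" output shows where the table is still incomplete, which makes it easy to spot the codes that still need to be matched against the visible text.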

The reason I'm asking is that I need to find specific text in a PDF document and then split the document up. I'm familiar with how to do all the splitting, creating new documents and so on.

This only happens when I read a PDF created in Crystal Reports. If I create a Word document and save it as PDF, the text in the brackets between Td and Tj is clear text.

Is anyone familiar with this?

I don't want to use iTextSharp.
 
https://stackoverflow.com/a/31831124/5225984

Seemingly, CID fonts are a PDF-internal construct, and they are not really fonts in that sense; they seem to be more like graphics subroutines that can be invoked by addressing them (with 16-bit addresses).

For example, "H€llo World!" might become <01020303040506040703080905>, and now you can just put that string into the PDF and have it printed, using the Tj operator as usual...
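
A small sketch of how such a hex string could be broken into codes, in case it helps. `SplitCodes` is a made-up helper, and the number of bytes per code is an assumption the caller has to supply, because it depends on the font's encoding. The example string above reads naturally as one byte per code, so that is what the demo uses; with true 16-bit codes you would pass 2. A trailing odd hex digit is simply ignored here.

```vb
Imports System
Imports System.Collections.Generic

Module HexStringCodes
    ' Hypothetical helper: splits the text between "<" and ">" into integer codes.
    ' bytesPerCode is an assumption; it depends on the font's encoding.
    Function SplitCodes(hex As String, bytesPerCode As Integer) As List(Of Integer)
        Dim codes As New List(Of Integer)
        Dim width As Integer = bytesPerCode * 2   ' two hex digits per byte
        For i As Integer = 0 To hex.Length - width Step width
            codes.Add(Convert.ToInt32(hex.Substring(i, width), 16))
        Next
        Return codes
    End Function

    Sub Main()
        Dim codes = SplitCodes("01020303040506040703080905", 1)
        Console.WriteLine(String.Join(" ", codes))
        ' Prints: 1 2 3 3 4 5 6 4 7 3 8 9 5
    End Sub
End Module
```

Repeated codes (3 for "l", 4 for "o") line up with repeated letters in the text, which is the same kind of pattern as the double %% in the original post.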
 