Can anyone clarify PDF markup please.

SBSP · Dec 6, 2017

I'm using PDFsharp to read a PDF document created in Crystal Reports.

I've referenced the PDFsharp libraries in a VB.NET project.

Below you can see that I load the text from the text elements into a textbox.

Code:

        ' At the top of the file:
        Imports PdfSharp.Pdf
        Imports PdfSharp.Pdf.IO

        Dim SourceFileName As String = "D:\vbpROJECTS\firstpagetest.pdf"
        Dim Inputdocument As PdfDocument = PdfReader.Open(SourceFileName, PdfDocumentOpenMode.Import)
        Dim TextExtract As String

        'MsgBox(Inputdocument.Pages(0).Contents.Elements.Count)
        ' Pages is zero-based, so Pages(1) is the second page; read its first content stream as text.
        TextExtract = Inputdocument.Pages(1).Contents.Elements.GetDictionary(0).Stream.ToString
        TextBox1.Clear()
        TextBox1.Text = TextExtract

The textbox is then populated with the text below:

Td
(.*/$%%&) Tj
0 -220 Td
(.8*.%%&) Tj
0 -220 Td
(.8*7%%&) Tj
0 -220 Td
(.8*7%%') Tj
0 -220 Td
(.8*;%%&) Tj
0 -220 Td
(.8*-%%&) Tj
0 -220 Td
(.8*-%%') Tj
0 -220 Td
(.8*-%%@) Tj
0 -220 Td
(.8*-%%B) Tj

The actual text sits in the brackets between each Td and Tj operator.
Td moves the text position and Tj means "show text", as per Adobe's PDF specification (see chapter 9).
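To make the Td/Tj structure concrete, here is a minimal sketch that pulls the bracketed operand out of every Tj in a content stream. It assumes the stream has already been read into a string (as in the code above) and only handles literal strings in brackets, not hex strings or TJ arrays. The names ExtractTjOperands and TjExtractSketch are made up for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TjExtractSketch
    ' Returns the raw string operand of every "(...) Tj" found in a content stream.
    ' Assumption: operands are literal strings; hex strings and TJ arrays are ignored.
    Function ExtractTjOperands(content As String) As List(Of String)
        Dim operands As New List(Of String)
        ' An opening bracket, any run of escaped characters or plain characters,
        ' a closing bracket, optional whitespace, then the Tj operator.
        Dim pattern As String = "\(((?:\\.|[^\\()])*)\)\s*Tj"
        For Each m As Match In Regex.Matches(content, pattern)
            operands.Add(m.Groups(1).Value)
        Next
        Return operands
    End Function

    Sub Main()
        ' Sample lines copied from the textbox output above.
        Dim sample As String = String.Join(Environment.NewLine,
            New String() {"(.*/$%%&) Tj", "0 -220 Td", "(.8*.%%&) Tj"})
        For Each operand As String In ExtractTjOperands(sample)
            Console.WriteLine(operand)
        Next
    End Sub
End Module
```

The operands come back still encoded, so this only isolates the strings; it does not turn them into readable text.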

I copied the markup into Excel: column 1 is the actual text you see when the file is open in Adobe Reader, and next to it is the strange encoding (strange to me).

There's a pattern!

D = .
A = *
U = /
T = $
0 = %
1 = &

The double %% gave it away :wtf:
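The pattern above can be tried out with a small lookup table. This is only a sketch: the table holds just the six pairs listed in the post, every other character is shown as "?", and the mapping is presumably specific to the font in this one document, so it should not be treated as a general decoder. If the font carries a ToUnicode CMap, that would be the more reliable source for the mapping. The names Observed and DecodeOperand are made up for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module GlyphMapSketch
    ' Pairs observed in the post: encoded character -> visible character.
    ' Assumption: valid only for the font used in this particular PDF.
    Private ReadOnly Observed As New Dictionary(Of Char, Char) From {
        {"."c, "D"c}, {"*"c, "A"c}, {"/"c, "U"c},
        {"$"c, "T"c}, {"%"c, "0"c}, {"&"c, "1"c}
    }

    ' Replaces each known encoded character; unknown ones become "?".
    Function DecodeOperand(encoded As String) As String
        Dim sb As New StringBuilder()
        For Each ch As Char In encoded
            Dim plain As Char
            If Observed.TryGetValue(ch, plain) Then
                sb.Append(plain)
            Else
                sb.Append("?"c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(DecodeOperand(".*/$%%&"))   ' DAUT001
        Console.WriteLine(DecodeOperand(".8*.%%&"))   ' D?AD001 (8 is not in the table yet)
    End Sub
End Module
```

Filling in the remaining characters means comparing more encoded strings with what Adobe Reader shows, exactly as was done in Excel.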

The reason I'm asking is that I need to find specific text in a PDF document and then split the document up. I'm familiar with how to do all the splitting and creating new documents and stuff.

This only happens when I read a PDF created in Crystal Reports. If I create a Word document and then save it as PDF, the text in the brackets between Td and Tj is clear text.

Is anyone familiar with this?

I don't want to use iTextSharp.

r4nd0m · Dec 6, 2017

https://stackoverflow.com/a/31831124/5225984

Seemingly, CID fonts are a PDF-internal construct and not really fonts in the usual sense; they seem to be more like graphics subroutines that can be invoked by addressing them (with 16-bit addresses).

For example, "H€llo World!" might become <01020303040506040703080905>, and then you can just put that string into the PDF and have it printed, using the Tj operator as usual...
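As a rough illustration of that idea, the sketch below splits a hex string into fixed-width codes and looks each one up in a table. The table is hypothetical and was built by hand from the example string, which uses two hex digits per glyph; a font with 16-bit addresses would use four digits per code. The names Glyphs and DecodeHex are made up for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text

Module HexStringSketch
    ' Hypothetical code -> character table, built by hand from the example string.
    Private ReadOnly Glyphs As New Dictionary(Of Integer, String) From {
        {1, "H"}, {2, "€"}, {3, "l"}, {4, "o"}, {5, " "},
        {6, "W"}, {7, "r"}, {8, "d"}, {9, "!"}
    }

    ' Splits the hex digits into codes of digitsPerCode digits each and looks every
    ' code up in the table; unknown codes become "?". Leftover digits are ignored.
    Function DecodeHex(hexDigits As String, digitsPerCode As Integer) As String
        Dim sb As New StringBuilder()
        For i As Integer = 0 To hexDigits.Length - digitsPerCode Step digitsPerCode
            Dim code As Integer = Integer.Parse(hexDigits.Substring(i, digitsPerCode), NumberStyles.HexNumber)
            Dim glyph As String = Nothing
            If Glyphs.TryGetValue(code, glyph) Then sb.Append(glyph) Else sb.Append("?")
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Two hex digits per code, matching the example string.
        Console.WriteLine(DecodeHex("01020303040506040703080905", 2))
    End Sub
End Module
```

In a real PDF the table would not be typed in by hand; it would have to come from the font itself, which is why the same bytes can mean different characters from one document to the next.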
