How to convert PDF to Excel using Python, Java or PL/SQL?

Acid0 · Jul 15, 2016

Hi there

I hope one of you can help with the following.

The clients want to take a PDF file they have (a report) and convert it into Excel.

Is there a way to do this in Python, Java or PL/SQL (Oracle), producing a valid CSV or XLS file?
Without purchasing any third-party software?

Regards

MagicDude4Eva · Jul 15, 2016

PDFBox (or iText) and Apache POI

Acid0 · Jul 15, 2016

Thanks Magic for the direction, I will definitely look into these options.

We have iText 1.3 (the open source version), but I can't see anything related to extraction in the jar file. I will investigate. Apache is also already installed by default.

MagicDude4Eva · Jul 15, 2016

Acid0 said:
Thanks Magic for the direction, I will definitely look into these options.

We have iText 1.3 (the open source version), but I can't see anything related to extraction in the jar file. I will investigate. Apache is also already installed by default.

Just wanted to mention that there is no quick way of turning a PDF into an Excel file. It really depends on the type of PDF, but in all cases it would be a two-step process:
1) You use PDFBox to extract content from the PDF
2) You use Apache POI to write the Excel

PDFBox helps you with the extraction of content, and the project website has many examples. There is also a GitHub project which helps with extracting table content from PDFs, which is perhaps your use case anyway: https://github.com/thoqbk/traprange
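To show the shape of that two-step process, here is a rough Python sketch. It does not use PDFBox or POI (those are Java libraries). It assumes step 1 has already produced plain text from the PDF, with columns separated by runs of two or more spaces. The sample text and the names `rows_from_text` and `to_csv` are made up for illustration, and step 2 writes CSV rather than a real XLS file.

```python
import csv
import io
import re

# Hypothetical text, as it might come out of a PDF text extractor (step 1).
EXTRACTED = """\
Item          Qty    Price
Widget A       10     2.50
Widget B        3    12.00
"""

def rows_from_text(text):
    """Split each non-blank line into cells on runs of two or more spaces."""
    return [
        re.split(r"\s{2,}", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]

def to_csv(rows):
    """Step 2: serialise the rows as CSV text that Excel can open."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

rows = rows_from_text(EXTRACTED)
csv_text = to_csv(rows)
print(csv_text)
```

Splitting on two or more spaces keeps a cell like "Widget A" intact, but it will break on reports where columns are separated by a single space, so the splitting rule has to be adapted to the actual PDF.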

GreGorGy · Jul 15, 2016

If you print to a text file in Windows and then import it into Excel as fixed width, you may get away with it.
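If the import has to be scripted rather than done by hand in Excel, the fixed-width idea could look something like this in Python. This is only a sketch: the column offsets in `COLUMNS` and the sample lines are assumptions, and in practice they would have to be measured from the real text dump.

```python
import csv
import io

# Hypothetical (start, end) character offsets for each column.
COLUMNS = [(0, 12), (12, 18), (18, 26)]

def parse_fixed_width(lines, columns=COLUMNS):
    """Slice each non-blank line at fixed offsets and trim the padding."""
    return [
        [line[start:end].strip() for start, end in columns]
        for line in lines
        if line.strip()
    ]

# Made-up lines standing in for the printed text file.
sample = [
    "Name        Qty   Amount",
    "Paper clips 100   15.00",
    "Stapler     2     89.90",
]

table = parse_fixed_width(sample)

# Write the rows as CSV, which Excel can open directly.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(table)
print(out.getvalue())
```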

zippy · Jul 16, 2016

Data exchange using PDF seems about the silliest thing anyone can do. There are reasons someone goes to the trouble of publishing a document in PDF format, and data exchange is not one of them. You should rather contact the owners and ask them to provide the data in a more civilised data exchange format. Maybe that isn't possible? Do you have the owner's permission? Or maybe you don't care about permission or IP. In that case, do what the rest of the hackers do: write your own tools.

Because of the lack of structure in PDF, it's probably easier to write your own tool based on the structure of the document you intend to import. It's a binary format, but you can figure it out by reading the PDF API docs.

Or employ cheap data capture clerks

If you can't find these tools easily with Google, it's probably not because there isn't such a tool.

If you look at the Oracle community sites, you will notice that there are native Oracle tools to write PDF docs, but not to read them. Again, there is a reason for that.

Good luck anyway

SYNERGY · Jul 16, 2016

Right click, open with Excel

bekdik · Jul 16, 2016

SYNERGY said:
Right click, open with Excel

You've done this?

SYNERGY · Jul 16, 2016

bekdik said:
You've done this?

I have actually, with Word.

Adobe DC and online converters were messing up the format.

Worked surprisingly well.

Pho3nix · Jul 16, 2016

Open it in Word 2013 or later, then copy the contents to Excel.

Photorer · Jul 21, 2016

Try PDFtk. It does all sorts of PDF manipulation on the server.

If you have a Windows server, then use something from the ActivePDF stable.

Or use a desktop version such as NitroPdf Pro.

krycor · Jul 21, 2016

Apache PDFBox® - A Java PDF Library and Apache POI - the Java API for Microsoft Documents

Both of these were mentioned above (see the post by MagicDude4Eva); I just thought I'd put links to them. I can't really say much about PDFBox as I've not used it before, but I have used POI for both CSV and XLSX (Excel's XML-based format) files, and you can even do graphs etc. I did this 4+ years ago.
