How to convert PDF to Excel using Python, Java or PL/SQL?

Acid0

Executive Member
Joined
Feb 10, 2009
Messages
5,262
Hi there

I hope one of you can help with the following.

The clients want to take a PDF file they have (a report) and convert it into Excel.

Is there a way to do it in either Python, Java or PL/SQL (Oracle), producing a valid CSV or XLS file?
Without purchasing any third-party software?

Regards
 

Acid0

Executive Member
Joined
Feb 10, 2009
Messages
5,262
Thanks Magic for the direction, will definitely look into these options.

We have iText 1.3 (the open source version) but I can't see anything related to extraction in the jar file. I will investigate, and Apache is also already installed by default.
 

MagicDude4Eva

Banned
Joined
Apr 2, 2008
Messages
6,479
Just wanted to mention that there is no quick way of turning a PDF into an Excel file. It really depends on the type of PDF, but in all cases it would be a two-step process:
1) Use PDFBox to extract the content from the PDF
2) Use Apache POI to write the Excel file

PDFBox helps you with the extraction of content, and the project website has many examples. There is also a GitHub project which helps with extracting table content from PDFs, perhaps this is your use case anyway: https://github.com/thoqbk/traprange
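Once the text is out of the PDF, the "write" half can be quite small if a CSV file is acceptable (Excel opens CSV directly). The following is only a rough sketch in Python using the standard library, not PDFBox or POI code. It assumes the extractor produced one report row per line with columns separated by runs of two or more spaces; the sample text and the function names are made up for illustration.

```python
import csv
import io
import re

# Hypothetical sample of text as it might come out of a PDF text extractor.
SAMPLE_TEXT = """\
Invoice   Customer      Amount
1001      Acme Ltd      250.00
1002      Widget Co     1,499.50
"""


def text_to_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Return the rows as CSV text; the csv module quotes cells where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = text_to_rows(SAMPLE_TEXT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Splitting on two or more spaces keeps single-space values such as "Acme Ltd" in one cell, but it will break on reports where columns are separated by a single space, so real documents will likely need layout-specific rules.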
 

GreGorGy

BULLSFAN
Joined
Jan 18, 2005
Messages
15,289
If you print to a text file in Windows and then import it into Excel as fixed width, you may get away with it.
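The same fixed-width idea can be scripted instead of going through Excel's import wizard. This is a minimal Python sketch under stated assumptions: the column boundaries in COLUMNS and the sample lines are hypothetical, and you would have to measure the real positions in your own printed text file.

```python
import csv
import io

# Hypothetical (start, end) character positions of each column.
COLUMNS = [(0, 8), (8, 24), (24, 34)]

# Hypothetical fixed-width lines, built with padding so the widths are explicit.
SAMPLE_LINES = [
    "1001".ljust(8) + "Acme Ltd".ljust(16) + "250.00".rjust(10),
    "1002".ljust(8) + "Widget Co".ljust(16) + "1499.50".rjust(10),
]


def split_fixed_width(line, columns):
    """Slice one line at the given positions and trim the padding."""
    return [line[start:end].strip() for start, end in columns]


fixed_rows = [split_fixed_width(line, COLUMNS) for line in SAMPLE_LINES]

# Write the rows as CSV text, which Excel can open directly.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(fixed_rows)
fixed_csv = buffer.getvalue()
print(fixed_csv)
```

This only works while every page of the report keeps the same column positions; wrapped lines, page headers and totals would need to be filtered out first.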
 

zippy

Honorary Master
Joined
May 31, 2005
Messages
10,321
Data exchange using PDF seems about the silliest thing anyone can do. There are reasons someone goes to the trouble of publishing a document in PDF format, and data exchange is not one of them. You should rather contact the owners and ask them to provide the data in a more civilised data exchange format. Maybe that isn't possible? Do you have the owner's permission? Or maybe you don't care about permission or IP. In that case do what the rest of the hackers do: write your own tools :)

Because of the lack of structure in PDF, it's probably easier to write your own tool based on the structure of the document you intend to import. It's a binary format, but you can figure it out by reading the PDF API docs.

Or employ cheap data capture clerks :)

If you can't find these tools easily with Google, it's probably not because there isn't such a tool :)

If you look at the Oracle community sites you will notice that there are native Oracle tools to write PDF docs, but not to read them. Again, there is a reason for that :)

Good luck anyway :)
 

Photorer

Member
Joined
Dec 15, 2008
Messages
23
Try PDFtk. It does all sorts of PDF manipulation on the server.

If you have a Windows server, then use something from the ActivePDF stable.

Or a desktop version using NitroPdf Pro.
 

krycor

Honorary Master
Joined
Aug 4, 2005
Messages
18,546
Apache PDFBox® - A Java PDF Library and Apache POI - the Java API for Microsoft Documents

Both of which were mentioned above (see the post by MagicDude4Eva). I just thought I'd put links to them. I can't really speak much about PDFBox as I've not used it before, but I have used POI for both CSV and XLSX (Excel formatted with XML) files, in which you can even do graphs etc. I did this 4+ years ago :D
 