How to convert PDF to Excel using Python, Java or PL/SQL?

Acid0

Hi there

I hope one of you can help with the following.

The client wants to take a PDF file they have (a report) and convert it into Excel.

Is there a way to do it in either Python, Java or PL/SQL (Oracle), producing a valid CSV or XLS file, without purchasing any third-party software?

Regards
 
Thanks Magic for the direction, will definitely look into these options.

We have iText 1.3 (the open source version) but I can't see anything related to extraction in the jar file. I will investigate. Apache is also already installed by default.
 
Just wanted to mention that there is no quick way of turning a PDF into an Excel file. It really depends on the type of PDF, but in all cases it would be a two-step process:
1) You use PDFBox to extract content from the PDF
2) You use Apache POI to write the Excel

PDFBox helps you with the extraction of content and the project website has many examples. There is also a GitHub project which helps with extracting table content from PDFs, which is perhaps your use case anyway: https://github.com/thoqbk/traprange
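
To make the two steps above a bit more concrete, here is a minimal Python sketch of the second half only. It assumes the text has already been pulled out of the PDF (with PDFBox or similar) and that columns in that text are separated by two or more spaces, which will not hold for every report. The function names and the sample data are made up for illustration. It writes CSV with the standard library rather than a real Excel workbook; Excel opens CSV files directly.

```python
import csv
import io
import re

# Assumption: columns in the extracted text are separated by two or more spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_rows(extracted_text):
    """Split already-extracted report text into rows of cell strings."""
    rows = []
    for line in extracted_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text; the csv module takes care of quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample standing in for text pulled out of a PDF report.
sample = """
Item          Qty   Price
Widget, large   2    10.50
Gadget          1     3.00
"""

csv_text = rows_to_csv(text_to_rows(sample))
print(csv_text)
```

Cells that contain a comma, like "Widget, large", come out quoted, so the file stays valid CSV. A report whose columns are not cleanly separated would need a layout-aware approach instead.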
 
Data exchange using PDF seems about the silliest thing anyone can do. There are reasons someone goes to the trouble of publishing a document in PDF format, and data exchange is not one of them. You should rather contact the owners and ask them to provide the data in a more civilised data exchange format. Maybe that isn't possible? Do you have the owner's permission? Or maybe you don't care about permission or IP. In that case do what the rest of the hackers do: write your own tools :)

Because of the lack of structure in PDF, it's probably easier to write your own tool based on the layout of the document you intend to import. It's a binary format, but you can figure it out by reading the PDF format docs.
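
As a rough idea of what "write your own" involves, here is a minimal Python sketch using only the standard library. It assumes the page content sits in Flate-compressed streams (much of why the file looks binary) and that text is drawn as plain literal strings with the `Tj` operator. Real PDFs often use string escapes, `TJ` arrays, hex strings or custom font encodings, none of which this handles. The function name and the sample fragment are made up for illustration.

```python
import re
import zlib

# Everything between the "stream" and "endstream" keywords of an object.
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string followed by the Tj (show text) operator.
SHOW_TEXT = re.compile(rb"\((.*?)\)\s*Tj")


def extract_literal_text(pdf_bytes):
    """Return strings shown with Tj inside Flate-compressed streams."""
    found = []
    for match in STREAM.finditer(pdf_bytes):
        try:
            # decompressobj ignores the newline left before "endstream".
            content = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (image, font, other filter)
        for text in SHOW_TEXT.findall(content):
            found.append(text.decode("latin-1"))
    return found


# Hypothetical fragment: one compressed content stream, not a complete PDF.
page = b"BT /F1 12 Tf 72 720 Td (Invoice 1001) Tj 0 -14 Td (Total 250.00) Tj ET"
data = zlib.compress(page)
fragment = (
    b"4 0 obj\n<< /Length " + str(len(data)).encode() + b" /Filter /FlateDecode >>\n"
    b"stream\n" + data + b"\nendstream\nendobj\n"
)

print(extract_literal_text(fragment))
```

What comes out is loose strings in drawing order, so rebuilding rows and columns still depends on knowing the layout of the specific document.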

Or employ cheap data capture clerks :)

If you can't find these tools easily with Google, it's probably not because there isn't such a tool :)

If you look at the Oracle community sites you will notice that there are native Oracle tools to write PDF docs, but not to read them. Again, there is a reason for that :)

Good luck anyway :)
 
Try PDFtk. It does all sorts of PDF manipulation on the server.

If you have a Windows server then use something from the ActivePDF stable.

Or a desktop version using NitroPdf Pro.
 
Apache PDFBox® - A Java PDF Library and Apache POI - the Java API for Microsoft Documents

Both of which were mentioned above (see the post by MagicDude4Eva). I just thought I'd put links to them. I can't really speak much about PDFBox as I've not used it before, but I have used POI for both CSV and XLSX (Excel formatted with XML) files, in which you can even do graphs etc. I did this 4+ years ago :D
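
To show what "Excel formatted with XML" means in practice, here is a minimal Python sketch that builds a one-sheet XLSX by hand with the standard library. An XLSX file is a ZIP archive of XML parts. This is an illustration of the container only, not a replacement for POI: every value is stored as an inline string, there is no styling, and the function name `build_xlsx` is made up for this example.

```python
import io
import zipfile
from xml.sax.saxutils import escape

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

CONTENT_TYPES = (
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/xl/workbook.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>'
    '<Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>'
    "</Types>"
)
ROOT_RELS = (
    '<Relationships xmlns="%s"><Relationship Id="rId1" '
    'Type="%s/officeDocument" Target="xl/workbook.xml"/></Relationships>'
    % (REL_NS, DOC_REL)
)
WORKBOOK = (
    '<workbook xmlns="%s" xmlns:r="%s"><sheets>'
    '<sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets></workbook>'
    % (MAIN_NS, DOC_REL)
)
WORKBOOK_RELS = (
    '<Relationships xmlns="%s"><Relationship Id="rId1" '
    'Type="%s/worksheet" Target="worksheets/sheet1.xml"/></Relationships>'
    % (REL_NS, DOC_REL)
)


def build_xlsx(rows):
    """Return the bytes of a one-sheet workbook; all cells are inline strings."""
    sheet_rows = []
    for row in rows:
        cells = "".join(
            '<c t="inlineStr"><is><t>%s</t></is></c>' % escape(str(value))
            for value in row
        )
        sheet_rows.append("<row>%s</row>" % cells)
    sheet = '<worksheet xmlns="%s"><sheetData>%s</sheetData></worksheet>' % (
        MAIN_NS,
        "".join(sheet_rows),
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", ROOT_RELS)
        archive.writestr("xl/workbook.xml", WORKBOOK)
        archive.writestr("xl/_rels/workbook.xml.rels", WORKBOOK_RELS)
        archive.writestr("xl/worksheets/sheet1.xml", sheet)
    return buffer.getvalue()


workbook = build_xlsx([["Item", "Qty"], ["Widget & Co", "2"]])
with zipfile.ZipFile(io.BytesIO(workbook)) as archive:
    print(archive.namelist())
```

Because numbers are written as text here, a library such as POI is the better choice once you need typed cells, formulas or graphs.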
 