PDF mining with metadata

You may love PDF or hate it, depending on your needs. PDFs are great for sharing files because the layout doesn't change no matter how or where you view them, but they are a big pain when you have to extract data.
Thanks to the team behind pdfminer, we can easily extract metadata such as font, location, and height from a PDF. The pdfminer documentation is lacking, so you may have to read the code to understand certain attributes or functions.
Here is the layout of a PDF document:

source: http://bit.ly/2lVMxu2

Here is a sample script to get LTTextBox and LTTextLine data with position and other metadata.
I have also made a web form where you can upload a PDF to get the parsed information.
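
A minimal sketch of such a script, assuming the pdfminer.six fork is installed (it provides `extract_pages`; older pdfminer releases need a longer setup). The names `LineInfo`, `describe_line`, and `iter_text_lines` are my own, chosen for illustration, and are not part of pdfminer. It walks each page, keeps only the text boxes, and records the text, bounding box, height, and font of every line.

```python
from collections import namedtuple

# One record per text line: the text plus where it sits on the page.
LineInfo = namedtuple("LineInfo", "page text bbox height fontname size")


def describe_line(page_id, line):
    """Collect text, position and font metadata for one LTTextLine-like object."""
    # A pdfminer text line iterates over LTChar objects (which carry fontname
    # and size) and LTAnno objects (inserted spaces/newlines, no font info).
    # Use the first character that has font information.
    first = next((ch for ch in line if hasattr(ch, "fontname")), None)
    return LineInfo(
        page=page_id,
        text=line.get_text().strip(),
        # (x0, y0, x1, y1) in PDF points, origin at the bottom-left of the page
        bbox=tuple(round(v, 2) for v in line.bbox),
        height=round(line.height, 2),
        fontname=first.fontname if first else None,
        size=round(first.size, 2) if first else None,
    )


def iter_text_lines(pdf_path):
    """Yield a LineInfo for every text line in the PDF at pdf_path."""
    # Imported here so describe_line can be tried without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox, LTTextLine

    for page in extract_pages(pdf_path):
        for box in page:
            if not isinstance(box, LTTextBox):
                continue  # skip figures, images, rules and other non-text items
            for line in box:
                if isinstance(line, LTTextLine):
                    yield describe_line(page.pageid, line)
```

To try it, loop over `iter_text_lines("sample.pdf")` and print each record; `sample.pdf` is a placeholder for your own file. Taking the font from the first character is a simplification: a line that mixes fonts would need every character inspected.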

A quote from the book I am reading:
“No man should judge unless he asks himself in absolute honesty whether in a similar situation he might not have done the same.”
― Viktor E. Frankl, Man's Search for Meaning


I'm a Data-holic. I'm passionate about learning new things and playing with data. This blog is a place where I want to share my data analysis tips and tricks, data visualization and automation tips and tricks.
