PDF mining with metadata

You may love PDF or hate it, depending on your needs. PDFs are great for sharing files because the layout doesn't change no matter how or where you view it, but they are a big pain when you have to extract data from them.
Thanks to the team behind pdfminer, we can easily extract metadata such as font, location, and height from a PDF. The pdfminer documentation is sparse, so you may have to read the code to understand certain attributes or functions.
Here is the layout of a PDF document:

source: http://bit.ly/2lVMxu2

Here is a sample script to get LTTextBox and LTTextLine data with position and other metadata.
I have also made a web form where you can upload a PDF to get the parsed information.
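
For readers who want to try this locally, the following is a minimal sketch, not the original script. It assumes the pdfminer.six fork is installed and uses its `extract_pages` helper. The function names `describe`, `dominant_font` and `parse_pdf` are made up for this illustration. It walks each page, emits one record per LTTextBox and per LTTextLine with its bounding box and height, and adds the most common font name and size found among a line's characters.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator


def describe(kind: str, page_no: int, obj: Any) -> Dict[str, Any]:
    """Build a flat record from any layout object that has .bbox and .get_text()."""
    # bbox is (x0, y0, x1, y1) in PDF points; the origin is the bottom-left corner.
    x0, y0, x1, y1 = obj.bbox
    return {
        "type": kind,
        "page": page_no,
        "x0": round(x0, 2),
        "y0": round(y0, 2),
        "x1": round(x1, 2),
        "y1": round(y1, 2),
        "height": round(y1 - y0, 2),
        "text": obj.get_text().strip(),
    }


def dominant_font(line: Iterable[Any]) -> Dict[str, Any]:
    """Return the most common (fontname, size) among the characters of a line."""
    counts: Counter = Counter()
    for ch in line:
        # LTChar objects carry fontname and size; LTAnno objects (the spaces and
        # newlines pdfminer inserts) do not, so they are skipped here.
        if hasattr(ch, "fontname") and hasattr(ch, "size"):
            counts[(ch.fontname, round(ch.size, 2))] += 1
    if not counts:
        return {"font": None, "size": None}
    (font, size), _ = counts.most_common(1)[0]
    return {"font": font, "size": size}


def parse_pdf(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per text box and per text line in the PDF at `path`."""
    # Imported lazily so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox, LTTextLine

    for page_no, page in enumerate(extract_pages(path), start=1):
        for box in page:
            if not isinstance(box, LTTextBox):
                continue  # skip figures, images, rectangles, curves
            yield describe("LTTextBox", page_no, box)
            for line in box:
                if isinstance(line, LTTextLine):
                    record = describe("LTTextLine", page_no, line)
                    record.update(dominant_font(line))
                    yield record
```

Calling `list(parse_pdf("sample.pdf"))`, where `sample.pdf` is a placeholder path, should give a list of dictionaries that can be loaded straight into a spreadsheet or data frame. Text nested inside figures is ignored in this sketch.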

A quote from a book I am reading:

“No man should judge unless he asks himself in absolute honesty whether in a similar situation he might not have done the same.”

― Viktor E. Frankl, Man's Search for Meaning
