python PDF data extraction

Asked 2 years ago, Updated 2 years ago, 97 views

I would like to extract the data from the pdf in Python 3 as shown below picture.
I was able to retrieve the pdf data by using the code on the Internet, but I couldn't read the data horizontally, so I ended up using the following lumps.
In the end, I would like to make a list related to the horizontal direction, so please tell me what to do.
Thank you for your cooperation.

[PDF file "TEST.pdf"]
https://files.fm/u/7r6rn3eu

image

Code

 from pdfminer.pdfparser import PDFParser
from pdfminer.pdfparser import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.layout import LTTextBoxHorizontal

# Open a PDF file.
fp = open ('TEST.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
document=PDFDocument()
parser.set_document(document)
# Create a PDF document object that stores the document structure.
document.set_parser(parser)
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Set parameters for analysis.
laparams=LAParams()
# Create a PDF page aggregator object.
device=PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter=PDFPageInterpreter(rsrcmgr, device)

pages=list( document.get_pages())
pagecontents=[ ]
for i in range (len(pages)):
    page_1 = pages[i] 
    interpreter.process_page(page_1)
    layout=device.get_result()
    For lin layout:
        if isinstance(l,LTTextBoxHorizontal):
            pagecontents.append(l.get_text())

print(pagecontents)

[Results]

'1.1\n',
'1.2\n',
'1.3\n',
'1.4\n',
'1\n2\n3\n',
'1\n2\n3\n4\n5\n',
'General\nFirst Name\nLast Name\nGender\n',
'Registry\n',
'Number\n',
'Contacts\nTEL number\nFax number\nMobile number\n',
'International number\n',
'Email address\n',
'ABCDE\nXXXX\n',
'JAPAN\n',
'C6S2076\n',
'123456789/123456789\n123456789\n123456789XXXX\n123456789\[email protected]\n']

Final Output Image

List [0] = ["1.1", "General" ]
list[1] = ["1", "First Name", "ABCDE" ]
....
List[11] = ["5", "Email address", "[email protected]" ]

Reference Code
http://cartman0.hatenablog.com/entry/2017/08/26/022957

python python3 pdf

2022-09-30 19:55

1 Answers

I tried debugging with PyCharm, and it appears that the variable layout is listed in the LTTextBoxHorizontal type, and this object in the LTTextBoxHorizontal type holds the placement in the PDF in terms of the xy coordinates.Therefore, I believe that using this value can achieve the desired "final output image."

Enter a description of the image here

However, there may be logic to get objects that better represent the internal configuration of PDFs (this is the first time I've touched on this library, so I don't know the detailed specifications...).Check the following to find out the correct implementation.

  • PDFMiner Documentation
  • PDF Standards (ISO 32000-1?)
  • Structure of PDFMiner objects visible in IDE such as PyCharm


2022-09-30 19:55

If you have any answers or tips


© 2024 OneMinuteCode. All rights reserved.