I would like to extract the data from the pdf in Python 3 as shown below picture.
I was able to retrieve the pdf data by using the code on the Internet, but I couldn't read the data horizontally, so I ended up using the following lumps.
In the end, I would like to make a list related to the horizontal direction, so please tell me what to do.
Thank you for your cooperation.
[PDF file "TEST.pdf"]
https://files.fm/u/7r6rn3eu
Code
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfparser import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.layout import LTTextBoxHorizontal
# Open a PDF file.
fp = open ('TEST.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
document=PDFDocument()
parser.set_document(document)
# Create a PDF document object that stores the document structure.
document.set_parser(parser)
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Set parameters for analysis.
laparams=LAParams()
# Create a PDF page aggregator object.
device=PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter=PDFPageInterpreter(rsrcmgr, device)
pages=list( document.get_pages())
pagecontents=[ ]
for i in range (len(pages)):
page_1 = pages[i]
interpreter.process_page(page_1)
layout=device.get_result()
For lin layout:
if isinstance(l,LTTextBoxHorizontal):
pagecontents.append(l.get_text())
print(pagecontents)
[Results]
'1.1\n',
'1.2\n',
'1.3\n',
'1.4\n',
'1\n2\n3\n',
'1\n2\n3\n4\n5\n',
'General\nFirst Name\nLast Name\nGender\n',
'Registry\n',
'Number\n',
'Contacts\nTEL number\nFax number\nMobile number\n',
'International number\n',
'Email address\n',
'ABCDE\nXXXX\n',
'JAPAN\n',
'C6S2076\n',
'123456789/123456789\n123456789\n123456789XXXX\n123456789\[email protected]\n']
Final Output Image
List [0] = ["1.1", "General" ]
list[1] = ["1", "First Name", "ABCDE" ]
....
List[11] = ["5", "Email address", "[email protected]" ]
Reference Code
http://cartman0.hatenablog.com/entry/2017/08/26/022957
I tried debugging with PyCharm, and it appears that the variable layout
is listed in the LTTextBoxHorizontal
type, and this object in the LTTextBoxHorizontal
type holds the placement in the PDF in terms of the xy coordinates.Therefore, I believe that using this value can achieve the desired "final output image."
However, there may be logic to get objects that better represent the internal configuration of PDFs (this is the first time I've touched on this library, so I don't know the detailed specifications...).Check the following to find out the correct implementation.
© 2024 OneMinuteCode. All rights reserved.