Background / What I want to achieve
I want to extract text from a PDF file and tell which text is underlined and which is not.
I don't know whether the PDF was edited after it was created. The line looks like a ruled line, but I don't know how it was drawn.
I want to analyze the characters of the PDF in Python and build an app that checks whether each piece of data is underlined.
I did the analysis with the Python module pdfminer3k.
Relevant source code
Anaconda Prompt
Scripts> pdf2txt.py data.pdf > text2.csv
Scripts> py
>>> import csv
>>> example_file = open('text2.csv')
>>> example_reader = csv.reader(example_file)
>>> example_data = list(example_reader)
>>> example_data[5]
Problem I am experiencing
The address on the fifth line of the CSV is underlined in the PDF, but the extracted text is just the following, with no underline information:
\u3000 Osaka City \u3000\u3000\u3000\u3000\u3000\u3000\u3000
What I tried
I drew ruled lines in Excel, exported the sheet to PDF, and ran the same analysis on it.
I was able to extract '\x0c', which I took to be the ruled line information, but I am at a loss because a ruled line is different from the underline shown in the image at the top.
Additional Information (e.g. FW/Tool Version)
I would appreciate it if you could post any good advice.
Thank you for your cooperation.
The lines under the characters in the question are probably underlines applied as a font style. You can easily tell whether a line is a font underline, a drawn straight line, or part of an image by opening the PDF in MS Word.
If it turns out to be a font underline, the easy route is to convert the PDF to Word format and analyze the result with python-docx. The sample code is as follows:
from docx import Document

document = Document('sample.docx')
result = []
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        # bool() turns None (inherited) and False into False
        result.append([run.text, bool(run.underline)])
print(result)
The code above does not pick up text inside tables. If the document has tables, the sample code is as follows:
from docx import Document

document = Document('sample.docx')
result = []
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            # each cell holds its own list of paragraphs
            for paragraph in cell.paragraphs:
                for run in paragraph.runs:
                    result.append([run.text, bool(run.underline)])
print(result)
If you use pdfminer, running
pdf2txt.py -t xml data.pdf
gives you the position and font size of each character. I could not get underline information from that output, though, and I have not looked into it further.
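As a rough illustration of reading that XML output, here is a minimal sketch using only the standard library. The `SAMPLE_XML` string is hand-written for this example and only assumes that each character appears as a `<text>` element with `bbox` and `size` attributes; the actual output for a given PDF and pdfminer version may differ. `extract_chars` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like pdfminer's XML output (an assumption, not real output).
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,595.000,842.000" rotate="0">
<textbox id="0" bbox="72.000,700.000,96.000,712.000">
<textline bbox="72.000,700.000,96.000,712.000">
<text font="ABCDEF+MS-Mincho" bbox="72.000,700.000,84.000,712.000" size="12.000">大</text>
<text font="ABCDEF+MS-Mincho" bbox="84.000,700.000,96.000,712.000" size="12.000">阪</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def extract_chars(xml_text):
    """Return (char, (x0, y0, x1, y1), size) for every character that has a bbox."""
    chars = []
    root = ET.fromstring(xml_text)
    for node in root.iter("text"):
        bbox = node.get("bbox")
        if bbox is None:
            # Separator elements such as line breaks carry no position.
            continue
        coords = tuple(float(v) for v in bbox.split(","))
        size = node.get("size")
        chars.append((node.text, coords, float(size) if size is not None else None))
    return chars


chars = extract_chars(SAMPLE_XML)
for char, bbox, size in chars:
    print(char, bbox, size)
```

For a real file you would pass the text saved from the command above instead of `SAMPLE_XML`.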
If the line under the characters is instead a drawn straight line or part of an image, you would have to write a program based on an understanding of the PDF file structure. It is not impossible, but it will take a lot of time.
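If the underline really is a separately drawn line, one possible approach is purely geometric: once you have bounding boxes for the characters and for the line segments, check whether a thin horizontal line sits just below each character. The sketch below assumes the boxes are already available as `(x0, y0, x1, y1)` tuples in PDF points with the origin at the bottom-left. The function name `is_underlined` and the tolerance values are hypothetical and would need tuning for a real document.

```python
def is_underlined(char_bbox, lines, max_gap=3.0, max_thickness=2.0):
    """Return True if a thin horizontal line lies just below char_bbox.

    char_bbox and each entry of lines are (x0, y0, x1, y1) tuples in PDF
    points, origin at the bottom-left of the page. The tolerances are
    illustrative guesses, not values taken from any library.
    """
    cx0, cy0, cx1, _ = char_bbox
    for lx0, ly0, lx1, ly1 in lines:
        is_thin = (ly1 - ly0) <= max_thickness
        # The line must share some horizontal extent with the character.
        overlaps = lx0 <= cx1 and lx1 >= cx0
        # Distance from the bottom of the character down to the top of the line;
        # a small negative value allows a line that touches the glyph box.
        gap = cy0 - ly1
        if is_thin and overlaps and -max_thickness <= gap <= max_gap:
            return True
    return False


# Made-up coordinates: one line under "A", nothing under "B".
chars = [
    ("A", (72.0, 700.0, 80.0, 712.0)),
    ("B", (200.0, 700.0, 208.0, 712.0)),
]
lines = [(70.0, 698.0, 120.0, 698.5)]

flags = [(c, is_underlined(bbox, lines)) for c, bbox in chars]
print(flags)
```

This only covers horizontal text and straight lines; a line embedded in an image would not be found this way.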