Extracting underline / ruled-line information when analyzing PDF characters

Background / What I want to achieve

I want to extract the text that is underlined in a PDF file, and to distinguish underlined text from text that is not underlined.
I don't know whether the PDF was edited after it was created. The line looks like a ruled line, and I don't know how it was drawn.

I am analyzing the PDF text in Python and building an app that checks whether each piece of data is underlined.
The analysis was done with the Python module pdfminer3k.

Relevant source code

Anaconda prompt
Scripts> pdf2txt.py data.pdf > text2.csv
Scripts> py
>>> import csv
>>> example_file = open('text2.csv')
>>> example_reader = csv.reader(example_file)
>>> example_data = list(example_reader)
>>> example_data[5]

The address returned by example_data[5] is underlined in the PDF, but the extracted value was only the following text, with no underline information.

Problems you are experiencing

\u3000 Osaka City \u3000\u3000\u3000\u3000\u3000\u3000\u3000

What I tried

I drew ruled lines in Excel, exported the sheet as a PDF, and ran the same analysis on it.

The output contained '\x0c', which I took to be ruled-line information, but I am at a loss because it is different from the underline-like ruled line in the actual PDF (the image at the top).

Additional Information (e.g. FW/Tool Version)

win10
python 3.6.0
Anaconda3
anaconda-script.py Command line client (version 1.6.0)

I would appreciate it if you could post any good advice.
Thank you for your cooperation.

python

2022-09-30 17:16

1 Answer

The lines under the characters in the question are probably underlines applied as a font style. You can easily tell whether a line is a font underline, a drawn straight line, or an image by opening the PDF in MS Word.

If it turns out to be a font underline, the easy route is to convert the PDF to Word and analyze the result with python-docx. The sample code is as follows:

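from docx import Document

# Load the Word file converted from the PDF
document = Document('sample.docx')
result = []
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        # run.underline is None when the setting is inherited from the style,
        # so bool() reports such runs as not underlined
        result.append([run.text, bool(run.underline)])

print(result)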

The code above does not cover text inside tables. If the document has tables, the sample code is as follows:

from docx import Document

document = Document('sample.docx')
result = []

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            # Each cell holds its own paragraphs, separate from document.paragraphs
            for paragraph in cell.paragraphs:
                for run in paragraph.runs:
                    result.append([run.text, bool(run.underline)])

print(result)
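
Word often splits one visible phrase into several runs, so the result list from either sample can contain many short fragments. The following is a minimal sketch, not part of python-docx, that merges adjacent entries with the same underline flag. The function name merge_runs and the sample data are hypothetical, and the input is assumed to have the same [text, underlined] shape as result above.

```python
from itertools import groupby


def merge_runs(result):
    """Merge adjacent [text, underlined] pairs that share the same flag."""
    # Drop empty runs first so they cannot split a group in two.
    non_empty = [pair for pair in result if pair[0]]
    merged = []
    for underlined, group in groupby(non_empty, key=lambda pair: pair[1]):
        merged.append(["".join(pair[0] for pair in group), underlined])
    return merged


# Hypothetical sample in the same shape as `result`.
sample = [["Osaka", True], [" City", True], ["", False], [" is here", False]]
print(merge_runs(sample))
```

Because result is a flat list, this also joins text across paragraph boundaries. If that matters, call it once per paragraph instead.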

If you use pdfminer, pdf2txt.py -t xml data.pdf outputs the position and font size of each character, but I could not find any underline information in that output. I have not looked into it further.
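
As a rough sketch of how the character positions and font sizes could be read back from that XML output, the snippet below uses only the standard library. The sample XML is hand-written and simplified, and the element and attribute names (text, bbox, size) are an assumption about the output format, so check them against what your pdfminer version actually writes. The function name read_characters is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample. Real pdf2txt.py XML output contains more
# elements and attributes; the layout shown here is an assumption.
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,595.000,842.000">
<textbox id="0" bbox="72.000,700.000,96.000,712.000">
<textline bbox="72.000,700.000,96.000,712.000">
<text font="MS-Mincho" bbox="72.000,700.000,84.000,712.000" size="12.000">A</text>
<text font="MS-Mincho" bbox="84.000,700.000,96.000,712.000" size="12.000">B</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def read_characters(xml_text):
    """Return (char, (x0, y0, x1, y1), size) for each positioned character."""
    chars = []
    for node in ET.fromstring(xml_text).iter("text"):
        bbox = node.get("bbox")
        if bbox is None:
            # Entries without a bbox (for example line breaks) have no position.
            continue
        coords = tuple(float(value) for value in bbox.split(","))
        size = float(node.get("size", "0"))
        chars.append((node.text, coords, size))
    return chars


for char, coords, size in read_characters(SAMPLE_XML):
    print(char, coords, size)
```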

If the line under the characters is instead a drawn straight line or an image, you would need to understand the PDF file structure and write your own program. It is not impossible, but it will take a lot of time.
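
If you go down that route, one possible approach is purely geometric: collect the bounding boxes of the text and of the thin horizontal lines on the page, then check whether a line sits just below a text box. The sketch below shows only that matching step. It assumes the boxes have already been obtained from some PDF layout parser as (x0, y0, x1, y1) tuples in PDF points with the origin at the bottom left. The function name is_underlined, the sample coordinates, and the thresholds gap, min_cover, and max_thickness are all hypothetical and would need tuning.

```python
def is_underlined(text_box, lines, gap=3.0, min_cover=0.8, max_thickness=1.5):
    """Return True if a thin horizontal line lies just below text_box.

    text_box and each entry of lines are (x0, y0, x1, y1) tuples in PDF
    points, with y increasing upward (assumed coordinate system).
    """
    tx0, ty0, tx1, ty1 = text_box
    width = tx1 - tx0
    if width <= 0:
        return False
    for lx0, ly0, lx1, ly1 in lines:
        lx0, lx1 = sorted((lx0, lx1))
        # Only nearly flat shapes count as candidate underlines.
        horizontal = abs(ly1 - ly0) <= max_thickness
        # The line must be close to the bottom edge of the text box.
        line_y = (ly0 + ly1) / 2
        near_bottom = ty0 - gap <= line_y <= ty0 + gap
        # The line must cover most of the text horizontally.
        overlap = min(tx1, lx1) - max(tx0, lx0)
        if horizontal and near_bottom and overlap / width >= min_cover:
            return True
    return False


# Hypothetical coordinates for illustration only.
lines = [(70.0, 698.5, 152.0, 699.0)]
address_box = (72.0, 700.0, 150.0, 712.0)
other_box = (72.0, 650.0, 110.0, 662.0)
print(is_underlined(address_box, lines))
print(is_underlined(other_box, lines))
```

A table border directly under a cell would also pass this check, so the result is a hint rather than proof that the text is underlined.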

2022-09-30 17:16
