I want to read a PDF in Python and write its text to an Excel file.
I followed this site (https://fastclassinfo.com/entry/python_pdf_to_excel/), but got the following error:
AttributeError: 'Page' object has no attribute 'getText'
Thank you for your help.
import fitz
import openpyxl as px
from openpyxl.styles import Alignment
# Program 2 | Create a list to store PDF text
item_list = [ ]
# Program 3 | Open PDF File
filename = '20180319001_1.pdf'
doc=fitz.open(filename)
# Program 4 | Get PDF text one page at a time
for page in range(len(doc)):
    textblocks = doc[page].getText('blocks')
    for textblock in textblocks:
        if not textblock[4].isspace():
            item_list.append([page, textblock[4]])
# Program 5 | Create a new Excel
wb=px.Workbook()
ws = wb.active
# Program 6 | Excel Formatting
myalignment=Alignment(wrap_text=True, shrink_to_fit=False)
ws.column_dimensions['C'].width=100
# Program 7 | Output Excel Header
headers = ['No', 'Page', 'Content']
for i, header in enumerate(headers):
    ws.cell(row=1, column=1+i, value=header)
# Program 8 | Output PDF text data to Excel
for y, row in enumerate(item_list):
    ws.cell(row=y+2, column=1, value=y+1)
    for x, cell in enumerate(row):
        ws.cell(row=y+2, column=x+2, value=cell)
        ws.cell(row=y+2, column=x+2).alignment = myalignment
# Program 9 | Save Excel Files
excelname=f'{filename}_excel_convert.xlsx'
wb.save(excelname)
Additional information:
Python 3.9.7
openpyxl Version: 3.0.4
PyMuPDF 1.21.1
With PyMuPDF 1.20.0, it seems to work if you change getText to get_text.
I don't know whether it behaves exactly the same as getText, but I was able to get the text information. Please check on your side whether you get the same information.
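As a rough sketch of that change, the page loop from the question could be written so that it works with either method name. The helper names get_blocks and collect_items are made up for this example and are not part of PyMuPDF. It assumes each block is a tuple whose index 4 holds the block's text, as the question's code already does.

```python
def get_blocks(page):
    """Return the text blocks of a page, using whichever method name exists.

    Newer PyMuPDF releases expose get_text; older ones exposed getText.
    (Hypothetical helper for illustration, not a PyMuPDF function.)
    """
    method = getattr(page, "get_text", None) or getattr(page, "getText", None)
    if method is None:
        raise AttributeError("page has neither get_text nor getText")
    return method("blocks")


def collect_items(doc):
    """Build [page_number, text] rows, skipping whitespace-only blocks."""
    items = []
    for page_number, page in enumerate(doc):
        for block in get_blocks(page):
            text = block[4]  # assumed: index 4 is the block's text
            if not text.isspace():
                items.append([page_number, text])
    return items


# Intended use with PyMuPDF installed (not run here):
#   import fitz
#   item_list = collect_items(fitz.open('20180319001_1.pdf'))
```

The resulting item_list has the same shape as in the question, so the Excel part of the script (Program 5 onward) should not need to change. If you only ever run a recent PyMuPDF, simply replacing getText('blocks') with get_text('blocks') is enough.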