Extracting text from docx file on python-docx module.
Therefore, I have one question: Is it possible to check if there is a kurigana (rubi) in the text?
Also, if I could look into it, what would I do to get the contents of the rubi?
I would appreciate it if you could let me know if you know anything.
The environment is Windows 10, Python 3.7.
Thank you for your cooperation.
In Use Python docx to create phonetic guide/'Ruby text' in Word?, the python-docx
does not support loading rubies
After reading the current python-docx
document, no way to get rubi was found.
However, you can use the python-docx
feature to directly analyze and extract the OOXML from the notes around the comment (OOXML) rubi.
The sample code below uses three methods to extract furigana.
If you want to investigate and extract kana characters for each Run, use the 3.
method in the code.
sample code
from docx import Document
with open('sample.docx', 'rb') as f:
document=Document(f)
print("#1.View document rubies")
for in document.element.xpath("//w:ruby/w:rt/w:r/w:t"):
print(t.text)
print("#2. View Rubi and Body of Document")
ns = document.element.nsmap
for ruby in document.element.xpath("//w:ruby"):
r=ruby.xpath("w:rt/w:r/w:t", namespaces=ns) [0].text#namespaces or "lxml.etree.XPathEvalError: Undefined namespace prefix" will occur
t=ruby.xpath("w:rubyBase/w:r/w:t", namespace=ns) [0].text#namespace or "lxml.etree.XPathEvalError: Undefined namespace prefix" will appear
print(f"{t}({r}")")
print("#3.View the rubies and body of runs")
for pin document.paragraphs:
for run in p.runs:
ruby=run._r.xpath("w:ruby")
# check the existence of furigana in the run
if ruby:
r=run._r.xpath("w:ruby/w:rt/w:r/w:t")[0].text
t=run._r.xpath("w:ruby/w:rubyBase/w:r/w:t")[0].text
print(f"{t}({r}")")
Reference:
© 2024 OneMinuteCode. All rights reserved.