I wrote a program that, for each document in an XML file, extracts the text of a passage when it is a paragraph containing the word "first".
The paragraph extraction works using contains, but I don't know how to limit it to documents whose article-id_pmid is one of the numbers in a text file (1.txt), so please let me know. (Could XPath be used to extract the pmid and match it against the numbers in the text file?)
I thought it might be difficult because the article-id_pmid and the paragraph are in different passage elements.
take.py
from bs4 import BeautifulSoup

# Load the XML file
with open('1.xml', 'r', encoding='utf-8') as xml:
    soup = BeautifulSoup(xml, 'xml')

# Select <text> elements of "parograph" passages that contain "first"
texts = soup.select('''
    passage >
    infon[key="type"]:-soup-contains("parograph") ~ text:-soup-contains("first")
''')
texts = [t.text for t in texts]

# Save the results to the output file
with open('take.txt', 'w') as txt:
    print(texts, file=txt)
1.txt
1111
2222
1.xml
<collection>
<document>
<id>32691</id>
<passage>
<infon key="article-id_pmid">1111</infon>
</passage>
<passage>
<infon key="section_type">INTRO</infon>
<infon key="type">parograph</infon>
<text>which was first diagnosed in Wuhan.</text>
</passage>
</document>
<document>
<id>31435</id>
<passage>
<infon key="article-id_pmid">2222</infon>
</passage>
<passage>
<infon key="section_type">INTRO</infon>
<infon key="type">parograph</infon>
<text>Challenges for Vaccinologists in the first.</text>
</passage>
</document>
<document>
<id>35643</id>
<passage>
<infon key="article-id_pmid">3333</infon>
</passage>
<passage>
<infon key="section_type">INTRO</infon>
<infon key="type">parograph</infon>
<text>decreased trade, high unemployment.</text>
</passage>
</document>
</collection>
To restrict the results to documents whose article-id_pmid is one of the numbers in the text file, build the CSS selector from the contents of 1.txt.
Note that :-soup-contains(...) combines multiple arguments with OR:
XML element:-soup-contains("1111", "2222")
=> ("1111" in "XML element text") OR ("2222" in "XML element text")
Also, :-soup-contains() is a substring (contains) test, not an exact match, so be careful: "1111" would also match a pmid such as "11112".
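The OR and substring behavior of :-soup-contains() can be checked with a small sketch like the one below. The inline sample data (including the pmid 11112) is made up for illustration, and the built-in html.parser is used here only so the snippet runs without lxml; the selector is evaluated the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: one pmid (11112) merely contains "1111"
xml = '''
<collection>
  <infon key="article-id_pmid">1111</infon>
  <infon key="article-id_pmid">2222</infon>
  <infon key="article-id_pmid">11112</infon>
  <infon key="article-id_pmid">3333</infon>
</collection>
'''
soup = BeautifulSoup(xml, 'html.parser')

# Multiple arguments are ORed together, and each one is a substring test
matched = [e.text for e in soup.select('infon:-soup-contains("1111", "2222")')]
print(matched)  # ['1111', '2222', '11112']
```

The element with 11112 is selected even though it is not in the list, which is the pitfall to watch for.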
from bs4 import BeautifulSoup

# Load the XML file
with open('1.xml', 'r', encoding='utf-8') as xml:
    soup = BeautifulSoup(xml, 'xml')

# Load the pmid numbers from the text file, skipping blank lines
with open('1.txt') as f:
    nums = [n.strip() for n in f if n.strip()]

# Build the quoted argument list for :-soup-contains(), e.g. "1111", "2222"
nums = ', '.join(f'"{n}"' for n in nums)

# Select "parograph" texts containing "first", only in documents with a listed pmid
texts = soup.select(f'''
    document:has(> passage > infon[key="article-id_pmid"]:-soup-contains({nums})) >
    passage >
    infon[key="type"]:-soup-contains("parograph") ~ text:-soup-contains("first")
''')
texts = [t.text for t in texts]

# Save the results to the output file, one text per line
with open('take.txt', 'w') as txt:
    print('\n'.join(texts), file=txt)
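Because :-soup-contains() is a substring test, a pmid such as 11112 would also be picked up by "1111". If exact matching matters, one option is to compare the pmid in Python instead. The following is only a minimal sketch of that idea: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree and its limited XPath support, and the function name extract_first_paragraphs and the inline sample data are illustrative assumptions, not part of the original program.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline stand-ins for 1.xml and 1.txt so the sketch is self-contained
XML = '''<collection>
<document>
<passage><infon key="article-id_pmid">1111</infon></passage>
<passage><infon key="type">parograph</infon><text>which was first diagnosed in Wuhan.</text></passage>
</document>
<document>
<passage><infon key="article-id_pmid">11112</infon></passage>
<passage><infon key="type">parograph</infon><text>not listed, but contains first.</text></passage>
</document>
</collection>'''
wanted = {'1111', '2222'}  # pmids as they would be read from 1.txt


def extract_first_paragraphs(xml_text, pmids, word='first'):
    """Return paragraph texts containing `word` from documents whose pmid is in `pmids`."""
    results = []
    root = ET.fromstring(xml_text)
    for doc in root.iter('document'):
        # Exact comparison of the pmid, not a substring test
        pmid = doc.findtext("passage/infon[@key='article-id_pmid']", default='').strip()
        if pmid not in pmids:
            continue
        for passage in doc.findall('passage'):
            # "parograph" is spelled as in the sample XML
            is_paragraph = any(
                (i.text or '').strip() == 'parograph'
                for i in passage.findall("infon[@key='type']")
            )
            text = passage.findtext('text', default='')
            if is_paragraph and word in text:
                results.append(text)
    return results


print(extract_first_paragraphs(XML, wanted))
```

With this sample data only the document with pmid 1111 contributes a result; the one with 11112 is skipped because the comparison is exact.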
© 2024 OneMinuteCode. All rights reserved.