I am writing a program that, for each document in an XML file, extracts all the paragraph text when the year is 2021 and the text of some paragraph contains "third". The string search condition works on its own, but once I add the year condition the string ABC is no longer extracted, and I am stuck.
Source Code 1 (String Search Criteria)
texts = soup.select(f'''
document:has(>passage>infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("third"))>
passage>
infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("")
''')
Source Code 2 (string search and year criteria)
texts = soup.select(f'''
document:has(>passage>infon[key="year"]:-soup-contains("2021"))>
passage:has(>infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("third"))>
infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("")
''')
XML
<collection>
  <document>
    <passage>
      <infon key="year">2021</infon>
    </passage>
    <passage>
      <infon key="type">paragraph</infon>
      <text>third five</text>
    </passage>
    <passage>
      <infon key="type">paragraph</infon>
      <text>ABC</text>
    </passage>
  </document>
  <document>
    <passage>
      <infon key="year">2021</infon>
    </passage>
    <passage>
      <infon key="type">paragraph</infon>
      <text>third mix</text>
    </passage>
  </document>
</collection>
Source Code 2 Results
['third five']
['third mix']
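For reference, the requirement described in the question can also be written as a plain loop, without CSS selectors. The following is only a sketch using the standard library's xml.etree.ElementTree rather than BeautifulSoup; the function name paragraph_texts and its parameters are hypothetical, and the sample data is inlined instead of being read from a file.

```python
import xml.etree.ElementTree as ET

# Sample data mirroring the XML in the question (inlined for a self-contained run).
XML = """
<collection>
  <document>
    <passage><infon key="year">2021</infon></passage>
    <passage><infon key="type">paragraph</infon><text>third five</text></passage>
    <passage><infon key="type">paragraph</infon><text>ABC</text></passage>
  </document>
  <document>
    <passage><infon key="year">2021</infon></passage>
    <passage><infon key="type">paragraph</infon><text>third mix</text></passage>
  </document>
</collection>
"""


def paragraph_texts(root, year="2021", word="third"):
    """Hypothetical helper: collect every passage text of matching documents."""
    found = []
    for doc in root.iter("document"):
        # Condition 1: the document carries an infon key="year" equal to `year`.
        years = [i.text for i in doc.findall("passage/infon[@key='year']")]
        # Condition 2: at least one passage text in the document contains `word`.
        texts = [t.text or "" for t in doc.findall("passage/text")]
        if year in years and any(word in t for t in texts):
            # Both conditions are checked per document, so ABC is kept too.
            found.extend(texts)
    return found


root = ET.fromstring(XML)
print(paragraph_texts(root))
```

Because both conditions are tested on the document as a whole, every passage text of a matching document is returned, including passages that do not contain the search word themselves.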
Chain :has in two stages, one for each condition.
from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

# Both :has() conditions apply to the document, so all of its passages are kept.
texts = soup.select(
    'document:has(>passage>infon[key="year"]:-soup-contains("2021"))'
    ':has(>passage>text:-soup-contains("third"))>passage>text')
print(texts)
# [<text>third five</text>, <text>ABC</text>, <text>third mix</text>]
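To see why the placement of the "third" condition matters, the two variants can be compared side by side. This is a minimal sketch, assuming bs4 with lxml installed (needed for the 'xml' parser) and a soupsieve version that supports :-soup-contains; the XML is inlined rather than read from result.xml, and the names YEAR, per_passage and per_document are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample data mirroring the XML in the question (inlined for a self-contained run).
XML = """
<collection>
  <document>
    <passage><infon key="year">2021</infon></passage>
    <passage><infon key="type">paragraph</infon><text>third five</text></passage>
    <passage><infon key="type">paragraph</infon><text>ABC</text></passage>
  </document>
  <document>
    <passage><infon key="year">2021</infon></passage>
    <passage><infon key="type">paragraph</infon><text>third mix</text></passage>
  </document>
</collection>
"""

soup = BeautifulSoup(XML, "xml")  # assumption: lxml is available

# Shared first stage: documents whose year infon contains "2021".
YEAR = 'document:has(> passage > infon[key="year"]:-soup-contains("2021"))'

# "third" tested on each passage: passages without "third" are filtered out.
per_passage = soup.select(
    YEAR + ' > passage:has(> text:-soup-contains("third")) > text')

# "third" tested on the document: every passage of a matching document is kept.
per_document = soup.select(
    YEAR + ':has(> passage > text:-soup-contains("third")) > passage > text')

print([t.get_text() for t in per_passage])
print([t.get_text() for t in per_document])
```

With the condition on passage, ABC is dropped because its own passage does not contain "third". With both :has conditions on document, the match is decided once per document and all of its passage texts are returned.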