Extracting elements with CSS selectors in bs4

I wrote a program that works on the documents in an XML file: if a document's year is 2021 and the text of one of its paragraphs contains "third", it should extract the text of every paragraph in that document. The string search condition on its own worked, but once I added the year condition the string ABC was no longer extracted, and I am having trouble fixing it.

Source Code 1 (String Search Criteria)

texts=soup.select(f'''
    document:has(>passage>infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("third"))>
        passage>
            infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("")
''')

Source Code 2 (String Search and Year Criteria)

texts=soup.select(f'''
    document:has(>passage>infon[key="year"]:-soup-contains("2021"))>
        passage:has(>infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("third"))>
            infon[key="type"]:-soup-contains("paragraph")~text:-soup-contains("")
''')

xml

<collection>
    <document>
        <passage>
            <infon key="year">2021</infon>
        </passage>
        <passage>
            <infon key="type">paragraph</infon>
            <text>third five</text>
        </passage>
        <passage>
            <infon key="type">paragraph</infon>
            <text>ABC</text>
        </passage>
    </document>
    <document>
        <passage>
            <infon key="year">2021</infon>
        </passage>
        <passage>
            <infon key="type">paragraph</infon>
            <text>third mix</text>
        </passage>
    </document>
</collection>

Source Code 2 Results

['third five']
['third mix']

python python3 beautifulsoup

2022-12-15 03:58

1 Answer

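Chain two :has() conditions on document, one for the year and one for the search string, then select every text under the documents that match.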

from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

# Both :has() conditions apply to the same document element,
# so every passage > text of a matching document is returned.
texts = soup.select(
    'document:has(>passage>infon[key="year"]:-soup-contains("2021"))'
    ':has(>passage>text:-soup-contains("third"))>passage>text')

print(texts)

# [<text>third five</text>, <text>ABC</text>, <text>third mix</text>]
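To see why the position of :has() matters, here is a minimal self-contained sketch that runs both variants side by side. It inlines sample data adapted from the question and, as an assumption made only so it runs without lxml, uses the built-in html.parser instead of the 'xml' parser. The names XML, YEAR, per_passage and per_document are made up for illustration, and the passage-level selector is a simplified form of the condition in Source Code 2.

```python
from bs4 import BeautifulSoup

# Sample data adapted from the question, inlined to keep the sketch self-contained.
XML = """
<collection>
    <document>
        <passage>
            <infon key="year">2021</infon>
        </passage>
        <passage>
            <infon key="type">paragraph</infon>
            <text>third five</text>
        </passage>
        <passage>
            <infon key="type">paragraph</infon>
            <text>ABC</text>
        </passage>
    </document>
    <document>
        <passage>
            <infon key="year">2021</infon>
        </passage>
        <passage>
            <infon key="type">paragraph</infon>
            <text>third mix</text>
        </passage>
    </document>
</collection>
"""

# Assumption: html.parser is used here only to avoid depending on lxml.
soup = BeautifulSoup(XML, "html.parser")

# Year condition shared by both variants.
YEAR = 'document:has(> passage > infon[key="year"]:-soup-contains("2021"))'

# Variant A: the "third" condition sits on each passage,
# so only passages that themselves contain "third" are kept.
per_passage = YEAR + ' > passage:has(> text:-soup-contains("third")) > text'

# Variant B: the "third" condition sits on the document,
# so every passage of a matching document is kept.
per_document = YEAR + ':has(> passage > text:-soup-contains("third")) > passage > text'

per_passage_texts = [t.get_text() for t in soup.select(per_passage)]
per_document_texts = [t.get_text() for t in soup.select(per_document)]

print(per_passage_texts)   # ['third five', 'third mix']
print(per_document_texts)  # ['third five', 'ABC', 'third mix']
```

Variant A drops ABC because its passage does not contain "third" itself. Variant B keeps it because the condition only has to hold somewhere in the enclosing document.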

2022-12-15 06:37
