Extracting elements using contains

Asked 2 years ago, Updated 2 years ago, 384 views

For each document in the XML, I am writing a program that extracts the paragraph in a passage if it contains "first", but only when a number in the text file (1.txt) is that document's article-id_pmid.

I used contains for the paragraph extraction, but I don't know how to check whether a number in the text file is the article-id_pmid, so please let me know. (Could XPath be used to extract the pmid and match it against the numbers in the text file?)
I thought it might be difficult because the article-id_pmid and the paragraph are in different passages.

take.py

from bs4 import BeautifulSoup
import csv

# Load xml files
with open('1.xml', 'r', encoding='utf-8') as xml:      
    soup = BeautifulSoup(xml, 'xml')

# Extract the text if the paragraph in the passage contains "first"
texts = soup.select('''
passage>
   infon[key="type"]:-soup-contains("parograph") ~ text:-soup-contains("first")
''')
text = [t.text for t in texts]

# Save results to specified file
with open('take.txt', 'w') as txt:
  print(text, file=txt)

1.txt

1111
2222

1.xml

<collection>
    <document>
        <id>32691</id>
        <passage>
            <infon key="article-id_pmid">1111</infon>
        </passage>
        <passage>
            <infon key="section_type">INTRO</infon>
            <infon key="type">parograph</infon>
            <text>which was first diagnosed in Wuhan.</text>
        </passage>
    </document>
    <document>
        <id>31435</id>
        <passage>
            <infon key="article-id_pmid">2222</infon>
        </passage>
        <passage>
            <infon key="section_type">INTRO</infon>
            <infon key="type">parograph</infon>
            <text>Challenges for Vaccinologists in the first.</text>
        </passage>
    </document>
    <document>
        <id>35643</id>
        <passage>
            <infon key="article-id_pmid">3333</infon>
        </passage>
        <passage>
            <infon key="section_type">INTRO</infon>
            <infon key="type">parograph</infon>
            <text>decreased trade, high unemployment.</text>
        </passage>
    </document>
</collection>

python python3 xml

2022-11-12 08:38

1 Answer

Regarding "if the number in the text is article-id_pmid":

I take "the number in the text" to mean the contents of 1.txt, and let the CSS selector do the filtering.

Note that :-soup-contains(...) combines its arguments with OR.

XML element:-soup-contains("1111", "2222")
=>
("1111" in "XML element text") OR ("2222" in "XML element text")

Also, :-soup-contains() is a substring (contains) match, not an exact match, so be careful: "1111" would also match an element whose text is "11112".
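
As a quick check of the two points above, here is a minimal sketch run against a small inline sample. The sample markup and the value 11112 are made up for illustration and are not part of 1.xml. It also assumes a soupsieve version that supports :-soup-contains() (2.1 or later), and it uses html.parser only so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample, not taken from 1.xml
sample = (
    '<r>'
    '<infon>1111</infon>'
    '<infon>2222</infon>'
    '<infon>11112</infon>'
    '<infon>3333</infon>'
    '</r>'
)
soup = BeautifulSoup(sample, 'html.parser')

# Two arguments are combined with OR, and each one is a substring test
matched = [e.text for e in soup.select('infon:-soup-contains("1111", "2222")')]
print(matched)  # ['1111', '2222', '11112']
```

The element with text 11112 is selected as well, because "1111" is contained in it.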

from bs4 import BeautifulSoup

# Load xml files
with open('1.xml', 'r', encoding='utf-8') as xml:
  soup = BeautifulSoup(xml, 'xml')

# Load the text file (skip blank lines, since an empty string matches every element)
with open('1.txt') as f:
  nums = [n.strip() for n in f if n.strip()]

# Extract if the paragraph in the passage contains first
nums = ', '.join(f'"{n}"' for n in nums)
text = soup.select(f'''
  document:has(>passage>infon[key="article-id_pmid"]:-soup-contains({nums}))>
    passage>
      infon[key="type"]:-soup-contains("parograph") ~ text:-soup-contains("first")
''')
text = [t.text for t in text]

# Save results to specified file
with open('take.txt', 'w') as txt:
  print('\n'.join(text), file=txt)
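
Since the question also asks about XPath, and :-soup-contains() cannot do an exact match, one alternative is to compare the pmid in Python instead. The following is only a sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The function name first_paragraphs and the inline sample (including the pmid 11112) are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def first_paragraphs(xml_text, pmids, keyword="first"):
    """Return paragraph texts containing `keyword`, taken only from
    documents whose article-id_pmid is exactly one of `pmids`."""
    wanted = set(pmids)
    root = ET.fromstring(xml_text)
    results = []
    for doc in root.iter("document"):
        # Attribute predicates like [@key='...'] are part of ElementTree's XPath subset
        pmid = doc.findtext("passage/infon[@key='article-id_pmid']")
        if pmid is None or pmid.strip() not in wanted:
            continue  # exact comparison, so "11112" does not match "1111"
        for passage in doc.findall("passage"):
            kind = passage.findtext("infon[@key='type']")
            text = passage.findtext("text") or ""
            if kind == "parograph" and keyword in text:
                results.append(text)
    return results


# Hypothetical sample in the same shape as 1.xml
sample = """<collection>
  <document>
    <passage><infon key="article-id_pmid">1111</infon></passage>
    <passage>
      <infon key="type">parograph</infon>
      <text>which was first diagnosed in Wuhan.</text>
    </passage>
  </document>
  <document>
    <passage><infon key="article-id_pmid">11112</infon></passage>
    <passage>
      <infon key="type">parograph</infon>
      <text>first, but a different pmid.</text>
    </passage>
  </document>
</collection>"""

print(first_paragraphs(sample, ["1111", "2222"]))
# ['which was first diagnosed in Wuhan.']
```

To use it with the files from the question, the XML string could be read from 1.xml and the pmid list from 1.txt, then the result joined with newlines and written to take.txt as in the code above.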


2022-11-13 00:10
