Creating a Program to Extract Partial Information from an XML File in Python

Asked 1 year ago, Updated 1 year ago, 414 views

What I want to do

I want to extract the abstract <text> from the <passage> elements in an XML file named result.xml.
Specifically, I want to extract the two sentences "Neurological ... encephalopathy." and "The rapid evolution ... replication."
The search criterion is that the <text> contains either COVID-19 or SARS-CoV-2.

File: result.xml

<collection>
  <document>
    <passage>
      <infon key="authors">Gupta NA,Lien C,IvM,</infon>
      <offset>0</offset>
      <text>Critical illness-associated cerebral microbleeds in severe COVID-19 infection</text>
      <annotation id="5">
        <location offset="68" length="9"/>
        <text>infection</text>
      </annotation>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract</infon>
      <offset>81</offset>
      <text>Neurological complications of COVID-19 infection have been frequently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.</text>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract_title_1</infon>
      <offset>584</offset>
      <text>Highlights</text>
    </passage>
  </document>
  <document>
    <passage>
      <infon key="name_4">surname:Ansari;given-names:M.Azim</infon>
      <offset>0</offset>
      <text>Extensive C->U transition biases in the genomes of a wide range of mammalian RNA viruses; potential associations with transcriptional mutations, damage- or host-mediated editing of viral RNA</text>
      <annotation id="1">
        <infon key="identifier">9606</infon>
        <infon key="type">Species</infon>
        <location offset="67" length="9"/>
        <text>mammalian</text>
      </annotation>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>191</offset>
      <text>The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.</text>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract_title_1</infon>
      <offset>2033</offset>
      <text>Author summary</text>
    </passage>
  </document>
</collection>

Script: 1.py

from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

result = []

for passage in soup.find_all('passage'):
    text = passage.text
    if text and ('COVID-19' in text or 'SARS-CoV-2' in text):
        for line in text.splitlines():
            if line.strip().endswith('.'):
                result.append(line.strip())

print(*result, sep='\n')

python python3

2022-10-03 01:00

2 Answers

Use BeautifulSoup.

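from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

result = []

for passage in soup.find_all('passage'):
    # .text is the concatenated text of everything inside the passage
    text = passage.text
    if text and ('COVID-19' in text or 'SARS-CoV-2' in text):
        for line in text.splitlines():
            # keep only lines that look like sentences (ending with a period)
            if line.strip().endswith('.'):
                result.append(line.strip())

print(*result, sep='\n')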

# Neurological complications of COVID-19 infection have been frequently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.
# The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.

Supplementary notes

  • A line is treated as a sentence if it ends with a period after leading and trailing whitespace is stripped.
  • In if text and (..., short-circuit evaluation skips the following in tests when text is empty or None (for example, when the passage has no text content), so they cannot raise an error.
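The two notes above can be tried in isolation on a plain string, without parsing any XML. This is only a minimal sketch: the helper name sentence_lines and the sample string are hypothetical and not part of the original answer.

```python
def sentence_lines(text, keywords=("COVID-19", "SARS-CoV-2")):
    """Return the stripped lines of `text` that end with a period,
    but only if at least one keyword occurs somewhere in `text`."""
    # `text and ...` short-circuits: for None or "" the `in` tests never run,
    # so a missing value cannot raise a TypeError.
    if not (text and any(k in text for k in keywords)):
        return []
    return [line.strip() for line in text.splitlines()
            if line.strip().endswith(".")]


# Hypothetical passage text, shaped like the concatenated text of one <passage>
sample = """
      ABSTRACT
      abstract
      81
      Neurological findings in COVID-19 have been described.
    """

print(sentence_lines(sample))  # the one line that ends with a period
print(sentence_lines(None))    # empty list, no exception
```

Because the filter works line by line, a sentence that spans several lines in the XML would be missed. That is a limitation of this approach rather than of the sketch.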


2022-10-03 01:00

Alternatively, you can use a CSS selector with BeautifulSoup.

from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

# <text> siblings that follow an infon of type "abstract" (but not "..._title...")
# and that contain either keyword
texts = soup.select('''
  passage >
  infon[key="type"]:-soup-contains("abstract"):not(:-soup-contains("_title")) ~
  text:-soup-contains("COVID-19", "SARS-CoV-2")
''')
texts = [t.text for t in texts]

print('\n'.join(texts))

# Neurological complications of COVID-19 infection have been frequently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.
# The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.
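For comparison, roughly the same filtering can be written with only the standard library. The sketch below swaps BeautifulSoup for xml.etree.ElementTree; it is not the answerer's method. The function name abstract_texts and the miniature SAMPLE document are hypothetical. One simplification: it checks the direct children of each passage without requiring that <text> comes after the <infon>, which the ~ combinator in the selector does require.

```python
import xml.etree.ElementTree as ET

KEYWORDS = ("COVID-19", "SARS-CoV-2")


def abstract_texts(root, keywords=KEYWORDS):
    """Collect the <text> of abstract (non-title) passages that mention a keyword."""
    found = []
    for passage in root.iter("passage"):
        # Values of the direct <infon key="type"> children of this passage.
        types = [i.text or "" for i in passage.findall("infon[@key='type']")]
        if not any("abstract" in t and "_title" not in t for t in types):
            continue
        # findall("text") matches direct children only, so a <text> nested
        # inside <annotation> is ignored.
        for node in passage.findall("text"):
            body = node.text or ""
            if any(k in body for k in keywords):
                found.append(body.strip())
    return found


# Hypothetical miniature input with the same element names as result.xml
SAMPLE = """<collection><document>
  <passage>
    <infon key="type">title</infon>
    <text>A COVID-19 title</text>
  </passage>
  <passage>
    <infon key="type">abstract</infon>
    <text>SARS-CoV-2 is discussed here.</text>
  </passage>
  <passage>
    <infon key="type">abstract_title_1</infon>
    <text>COVID-19 highlights</text>
  </passage>
</document></collection>"""

print(abstract_texts(ET.fromstring(SAMPLE)))
```

For the real file, ET.parse('result.xml').getroot() would take the place of ET.fromstring(SAMPLE), assuming the file is well-formed XML.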


2022-10-03 01:00
