Creating a Program to Extract Partial Information from an XML File in Python

Asked 2 years ago, Updated 2 years ago, 441 views

What I want to do

I want to extract the abstract <text> elements inside each <passage> of an XML file named result.xml.
From those <text> elements, I want to extract the two sentences "Neurological complications of COVID-19 ... encephalopathy." and "The rapid evolution ... replication."
The search condition is that the <text> content contains either COVID-19 or SARS-CoV-2.


      <infon key="authors">Gupta NA,Lien C,Iv M,</infon>
      <text>Critical illness-associated cerebral microbleeds in severe COVID-19 infection</text>
      <annotation id="5">
        <location offset="68" length="9"/>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract</infon>
      <text>Neurological complications of COVID-19 infection have been recently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.</text>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract_title_1</infon>
      <text>Extensive C->U transition bias in the genomes of a wide range of mammalian RNA viruses; potential associations with transcriptional mutations, damage- or host-mediated editing of viral RNA</text>
      <annotation id="1">
        <infon key="identifier">9606</infon>
        <infon key="type">Species</infon>
        <location offset="67" length="9"/>
      <infon key="type">abstract</infon>
      <text>The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.</text>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract_title_1</infon>
      <text>Author summary</text>


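from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

result = []

for passage in soup.find_all('passage'):
    # find() returns None when the passage has no <text> tag
    tag = passage.find('text')
    text = tag.string if tag is not None else None
    if text and ('COVID-19' in text or 'SARS-CoV-2' in text):
        for line in text.splitlines():
            line = line.strip()
            if line.endswith('.'):
                result.append(line)

print(*result, sep='\n')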

python python3

2022-10-03 01:00

2 Answers

Use BeautifulSoup.

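from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

result = []

for passage in soup.find_all('passage'):
    # find() returns None when the passage has no <text> tag
    tag = passage.find('text')
    text = tag.string if tag is not None else None
    if text and ('COVID-19' in text or 'SARS-CoV-2' in text):
        for line in text.splitlines():
            # treat a stripped line that ends with a period as a sentence
            line = line.strip()
            if line.endswith('.'):
                result.append(line)

print(*result, sep='\n')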

# Neurological complications of COVID-19 infection have been recently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.
# The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.


  • A sentence is taken to be a line that ends with a period (after stripping leading and trailing whitespace).
  • In if text and (..., short-circuit evaluation keeps the following in tests from raising an error when text is None (that is, when the passage does not contain a text tag).
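The two points above can be tried in isolation. The following is a minimal sketch on plain strings, with no XML parsing involved; the helper names mentions_covid and sentences are made up for illustration and are not part of BeautifulSoup.

```python
def mentions_covid(text):
    """Return True if text is a non-empty string containing either keyword."""
    # `text and (...)` short-circuits: when text is None (or empty), the
    # `in` tests on the right are never evaluated, so no TypeError occurs.
    return bool(text and ('COVID-19' in text or 'SARS-CoV-2' in text))


def sentences(text):
    """Return the stripped lines of text that end with a period."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line.endswith('.')]


print(mentions_covid(None))                          # False, no exception
print(mentions_covid('Severe COVID-19 infection'))   # True
print(sentences('  A title without a period\n  A full sentence.  '))
# ['A full sentence.']
```

Without the leading text and, the expression 'COVID-19' in None would raise a TypeError, which is why the order of the two operands matters.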

2022-10-03 01:00

Here is a version that uses a CSS selector with BeautifulSoup.

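from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

# <text> siblings that follow an <infon key="type"> whose content contains
# "abstract" but not "_title", and that mention either keyword
texts = soup.select('''
  infon[key="type"]:-soup-contains("abstract"):not(:-soup-contains("_title")) ~
  text:-soup-contains("COVID-19", "SARS-CoV-2")
''')
texts = [t.text for t in texts]
print(*texts, sep='\n')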


Neurological complications of COVID-19 infection have been recently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.
The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.
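
To see what the selector is testing, the same conditions can be spelled out step by step. The sketch below swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs without extra packages. The inline SAMPLE data is invented, its layout (each <passage> holding <infon> and <text> children) is an assumption based on the excerpt in the question, and the function name abstract_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; the passage layout is assumed from the question's excerpt.
SAMPLE = """<document>
  <passage>
    <infon key="type">title</infon>
    <text>Cerebral microbleeds in severe COVID-19 infection</text>
  </passage>
  <passage>
    <infon key="type">abstract</infon>
    <text>Neurological findings in COVID-19 include stroke.</text>
  </passage>
  <passage>
    <infon key="type">abstract_title_1</infon>
    <text>SARS-CoV-2 background</text>
  </passage>
  <passage>
    <infon key="type">abstract</infon>
    <text>An unrelated abstract sentence.</text>
  </passage>
</document>"""

KEYWORDS = ('COVID-19', 'SARS-CoV-2')


def abstract_texts(root):
    """Collect <text> of abstract (non-title) passages mentioning a keyword."""
    found = []
    for passage in root.iter('passage'):
        # Content of <infon key="type">, or '' when the passage has none
        kind = passage.findtext("infon[@key='type']", default='')
        if 'abstract' not in kind or '_title' in kind:
            continue
        # findtext() returns None when there is no <text> child
        text = passage.findtext('text')
        if text and any(keyword in text for keyword in KEYWORDS):
            found.append(text.strip())
    return found


print(*abstract_texts(ET.fromstring(SAMPLE)), sep='\n')
```

Unlike the ~ combinator in the selector, this sketch does not check that <text> comes after the <infon> inside the passage; it only checks that both are children of the same <passage>.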

2022-10-03 01:00
