Creating a Program to Extract Partial Information from an XML File in Python

What I want to do

I want to extract the abstract <text> elements inside <passage> from an XML file named result.xml. Specifically, I want to extract the two sentences "Neurological ... encephalopathy." and "The rapid evolution ... replication." The search criterion is that the <text> contains either COVID-19 or SARS-CoV-2.

File: result.xml

<collection>
  <document>
    <passage>
      <infon key="authors">Gupta NA,Lien C,IvM,</infon>
      <offset>0</offset>
      <text>Critical illness-associated cerebral microbleeds in severe COVID-19 infection</text>
      <annotation id="5">
        <location offset="68" length="9"/>
        <text>infection</text>
      </annotation>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract</infon>
      <offset>81</offset>
      <text>Neurological complications of COVID-19 infection have been frequently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.</text>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract_title_1</infon>
      <offset>584</offset>
      <text>Highlights</text>
    </passage>
  </document>
  <document>
    <passage>
      <infon key="name_4">surname:Ansari;given-names:M.Azim</infon>
      <offset>0</offset>
      <text>Extensive C->U transition biases in the genomes of a wide range of mammalian RNA viruses; potential associations with transcriptional mutations, damage- or host-mediated editing of viral RNA</text>
      <annotation id="1">
        <infon key="identifier">9606</infon>
        <infon key="type">Species</infon>
        <location offset="67" length="9"/>
        <text>mammalian</text>
      </annotation>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>191</offset>
      <text>The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.</text>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract_title_1</infon>
      <offset>2033</offset>
      <text>Author summary</text>
    </passage>
  </document>
</collection>

Script: 1.py

from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

result = []

for passage in soup.find_all('passage'):
    text = passage.text
    if text and ('COVID-19' in text or 'SARS-CoV-2' in text):
        for line in text.splitlines():
            if line.strip().endswith('.'):
                result.append(line)

print(*result, sep='\n')


Tags: python, python3

2022-10-03 01:00

2 Answers

Use BeautifulSoup.

from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

result = []

for passage in soup.find_all('passage'):
    # passage.text is the concatenated text of everything inside the passage
    text = passage.text
    if text and ('COVID-19' in text or 'SARS-CoV-2' in text):
        for line in text.splitlines():
            # Keep only the lines that look like sentences (ending with a period)
            if line.strip().endswith('.'):
                result.append(line.strip())

print(*result, sep='\n')

# Neurological complications of COVID-19 infection have been frequently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.
# The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.
Supplementary

A sentence is taken to be a line that ends with a period (after stripping the whitespace before and after it).
In if text and (..., short-circuit evaluation skips the following ... in text tests when text is empty. Note that passage.text does not look up the <text> tag; it returns all the text inside the passage as one string, which is why the lines are filtered afterwards.
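
The line filter described above can also be tried without BeautifulSoup. The following is only a minimal sketch using the standard library's xml.etree.ElementTree in place of bs4; the function name sentences_with_keywords and the small inline sample are assumptions made for illustration and are not part of the answer. Joining itertext() plays roughly the role that passage.text plays in BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of result.xml, used only for this sketch
SAMPLE = """<collection>
  <document>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>A title mentioning COVID-19 without a final period</text>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>81</offset>
      <text>Complications of COVID-19 include headache.</text>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>191</offset>
      <text>This sentence matches no keyword.</text>
    </passage>
  </document>
</collection>"""

KEYWORDS = ("COVID-19", "SARS-CoV-2")


def sentences_with_keywords(xml_string, keywords=KEYWORDS):
    """Return the lines ending with '.' from passages that mention a keyword."""
    root = ET.fromstring(xml_string)
    result = []
    for passage in root.iter("passage"):
        # All text inside the passage, including whitespace between the tags
        text = "".join(passage.itertext())
        # Short-circuit: the keyword test is skipped when text is empty
        if text and any(keyword in text for keyword in keywords):
            for line in text.splitlines():
                line = line.strip()
                if line.endswith("."):
                    result.append(line)
    return result


print(*sentences_with_keywords(SAMPLE), sep="\n")
```

The title passage mentions a keyword but none of its lines ends with a period, so only the abstract sentence is printed.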

2022-10-03 01:00

Using a CSS selector with BeautifulSoup:

from bs4 import BeautifulSoup

with open('result.xml') as xml:
    soup = BeautifulSoup(xml, 'xml')

# <text> siblings that follow a type infon containing "abstract" but not "_title"
texts = soup.select('''
  passage >
  infon[key="type"]:-soup-contains("abstract"):not(:-soup-contains("_title")) ~
  text:-soup-contains("COVID-19", "SARS-CoV-2")
''')
texts = [t.text for t in texts]

print('\n'.join(texts))

# Neurological complications of COVID-19 infection have been frequently described and include dizziness, headache, loss of taste and smell, stroke, and encephalopathy.
# The rapid evolution of RNA viruses SARS-CoV-2 has been long considered to result from a combination of high copying error frequencies during RNA replication.
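
To make the conditions packed into that selector easier to read, here is a rough sketch that spells them out step by step with the standard library's xml.etree.ElementTree. It is not the answer's code: the function name abstract_texts and the inline sample are assumptions for illustration, and it only mirrors the selector for direct children of <passage>.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of result.xml, used only for this sketch
SAMPLE = """<collection>
  <document>
    <passage>
      <infon key="name_0">surname:Doe;given-names:J</infon>
      <offset>0</offset>
      <text>A title about SARS-CoV-2</text>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract</infon>
      <offset>25</offset>
      <text>COVID-19 can cause headache.</text>
    </passage>
    <passage>
      <infon key="section_type">ABSTRACT</infon>
      <infon key="type">abstract_title_1</infon>
      <offset>54</offset>
      <text>COVID-19 highlights</text>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>74</offset>
      <text>Unrelated sentence.</text>
    </passage>
  </document>
</collection>"""

KEYWORDS = ("COVID-19", "SARS-CoV-2")


def abstract_texts(xml_string, keywords=KEYWORDS):
    """Return <text> contents of abstract passages that mention a keyword."""
    root = ET.fromstring(xml_string)
    found = []
    for passage in root.iter("passage"):
        seen_abstract_infon = False
        for child in passage:  # direct children only, like "passage >"
            value = child.text or ""
            if child.tag == "infon" and child.get("key") == "type":
                # "abstract" matches, "abstract_title_1" is excluded
                if "abstract" in value and "_title" not in value:
                    seen_abstract_infon = True
            elif child.tag == "text" and seen_abstract_infon:
                # Only a <text> that follows the matching infon, like "~"
                if any(keyword in value for keyword in keywords):
                    found.append(value)
    return found


print("\n".join(abstract_texts(SAMPLE)))
```

The title passage has no type infon and the "abstract_title_1" passage is rejected by the "_title" test, so only the abstract sentence that mentions a keyword remains.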

2022-10-03 01:00