I want to write a program that extracts the text at a specific location in an XML file, saves it to a text file, finds the words that appear in both a CSV file and the full text of that text file, and outputs how many times each of those words occurs together with the id assigned to it.
Example execution results
(common term, number of occurrences, id assigned to the term)
acute, 2, 0000
distress, 1, 0000
coronavirus, 1, 1111
China, 2, 1111
(id, total number of occurrences for that id)
0000, 3
1111, 3
Source Code
from bs4 import BeautifulSoup
import csv

# Load the XML file
with open('ab36_37.xml', 'r', encoding='utf-8') as xml:
    soup = BeautifulSoup(xml, 'xml')

# Extract the text of paragraph passages that contain the string SARS-CoV-2
texts = soup.select('''
    passage >
    infon[key="type"]:-soup-contains("paragraph") ~ text:-soup-contains("SARS-CoV-2")
''')
text = [t.text for t in texts]

# Save the results to the specified file
# (the with blocks close the files, so no explicit close() is needed)
with open('re_ab3637.txt', 'w') as txt:
    print('\n'.join(text), file=txt)
Example csv file
0000,acute
0000, distress
1111, coronavirus
1111, China
Text File Example
Severe acute response distress syndrome due to acute coronavirus (SARS-CoV-2), which was first diagnosed in China, China in December 2019.
So you just want to count how many times each word in the CSV appears in the sample text, right?
Calling it "co-occurrence" only makes it sound harder than it is.
Python strings have a method, str.count, that counts how many times a substring occurs in a string, so this is fairly easy to do.
https://hibiki-press.tech/python/count/103
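As a quick illustration of the method mentioned above, here is a minimal sketch that applies str.count to the sample sentence from the question. The variable names are only for illustration. Note that the match is case-sensitive and works on substrings, not whole words.

```python
text = ("Severe acute response distress syndrome due to acute coronavirus "
        "(SARS-CoV-2), which was first diagnosed in China, China in December 2019.")

# str.count returns the number of non-overlapping occurrences of a substring
acute_count = text.count("acute")
china_count = text.count("China")

# The comparison is case-sensitive, so the lowercase spelling is not found
lower_china_count = text.count("china")

print(acute_count, china_count, lower_china_count)  # prints: 2 2 0
```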
Please change the file names in the code below as appropriate.
def main():
    # Read the CSV of ids and words, one line per entry
    with open('words.csv', 'r') as f:
        rows = f.readlines()
    # Read the entire text to be searched
    with open('re_ab3637.txt', 'r') as f:
        text = f.read()
    # Map of id => count
    id_count = {}
    with open('result1.csv', 'w') as f:
        for row in rows:
            # Split the "id,word" string
            tmp = row.split(',')
            id = tmp[0]
            # Strip the trailing newline (and any surrounding spaces)
            word = tmp[1].strip()
            # Count how many times the word occurs in the text
            count = text.count(word)
            f.write('%s, %d, %s\n' % (word, count, id))
            # If the id already exists, add to its count
            if id in id_count:
                id_count[id] += count
            else:  # Otherwise, create an entry
                id_count[id] = count
    # Output id => count
    with open('result2.csv', 'w') as f:
        for id, count in id_count.items():
            f.write('%s, %d\n' % (id, count))

if __name__ == '__main__':
    main()
result1.csv
acute, 2, 0000
distress, 1, 0000
coronavirus, 1, 1111
China, 2, 1111
result2.csv
0000, 3
1111, 3
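One caveat: counting with text.count(word) matches substrings, so a term such as "acute" would also be counted inside "acutely". If only whole words should count, the following is a minimal sketch of one possible variant, not part of the answer above. It assumes the same "id,word" CSV layout; the function name count_terms and the names per_word and per_id are hypothetical. It uses the standard csv and re modules and takes the CSV as any iterable of lines, so an open file object works as well as a list.

```python
import csv
import re
from collections import Counter

def count_terms(csv_lines, text):
    """Return ([(word, count, id), ...], Counter mapping id -> total count)."""
    per_word = []
    per_id = Counter()
    # skipinitialspace drops the blank after the comma in rows like "0000, distress"
    for row in csv.reader(csv_lines, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        term_id, word = row[0].strip(), row[1].strip()
        # \b anchors the match at word boundaries, so only whole words count
        pattern = r'\b' + re.escape(word) + r'\b'
        count = len(re.findall(pattern, text))
        per_word.append((word, count, term_id))
        per_id[term_id] += count
    return per_word, per_id

# Sample data taken from the question
csv_lines = ["0000,acute", "0000, distress", "1111, coronavirus", "1111, China"]
text = ("Severe acute response distress syndrome due to acute coronavirus "
        "(SARS-CoV-2), which was first diagnosed in China, China in December 2019.")

per_word, per_id = count_terms(csv_lines, text)
for word, count, term_id in per_word:
    print('%s, %d, %s' % (word, count, term_id))
for term_id, total in per_id.items():
    print('%s, %d' % (term_id, total))
```

For the sample data both approaches give the same counts. They differ only when a term also appears as part of a longer word.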