<h1>Title</h1>
<h2>Subtitle 1<h2>
<ul>
<li> Body </li>
<li> Body </li>
...
</ul>
<h3>Subtitle<h3>
<ul>
<li> Body </li>
<li> Body </li>
...
</ul>
<h2>Subtitle 2<h2>
...
<h2>Subtitle 3<h2>
...
How can I extract the text from HTML with a title -> subtitle -> sub-subtitle structure like the one above, limited to a specified range?
import sys

# `group` is assumed to be a BeautifulSoup object (or tag) created earlier
title_lv1 = group.findAll('h1')
for title1 in title_lv1:
    if title1.text == sys.argv[1]:
        print(title1.findAll('h2'))
This will result in an empty list.
The HTML in the question contains unclosed tags, but this answer assumes HTML with properly closed tags, such as <h2>...</h2>.
The HTML provided is not nested, so findAll cannot retrieve the h2 from the h1: findAll only searches the descendants of the tag it is called on.
title1.findAll('h2') returns the h2 only with a nested structure like the following:
<h1>
<h2>Nested!</h2>
</h1>
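As a minimal sketch of the difference between the two layouts: the sample strings and variable names below are illustrative and not from the question, and the built-in html.parser backend is assumed. Stricter parsers such as html5lib may restructure a heading nested inside another heading, so the nested case may behave differently there.

```python
from bs4 import BeautifulSoup

# Illustrative markup: flat (sibling) headings vs. an h2 nested inside an h1
flat_html = "<h1>Title</h1><h2>Subtitle 1</h2><h2>Subtitle 2</h2>"
nested_html = "<h1><h2>Nested!</h2></h1>"

flat_h1 = BeautifulSoup(flat_html, "html.parser").find("h1")
nested_h1 = BeautifulSoup(nested_html, "html.parser").find("h1")

# find_all only searches the descendants of the tag it is called on
flat_result = flat_h1.find_all("h2")      # the h2 tags are siblings of h1
nested_result = nested_h1.find_all("h2")  # the h2 is a child of h1

print(flat_result)    # []
print(nested_result)  # [<h2>Nested!</h2>]
```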
For flat structures like the one in the question, where the headings are siblings, use find_next_sibling to retrieve a following sibling tag that matches a condition.
I think the following code gets the desired body text.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('test.html'), 'html.parser')
# Replace each title passed as text= with a variable or command-line argument
h3 = soup.find('h1', text='Title').find_next_sibling('h2', text='Subtitle').find_next_sibling('h3', text='Subsubtitle')
ul = h3.find_next_sibling('ul')
for li in ul.findAll('li'):
    print(li.text)
However, find_next_sibling searches every sibling that follows the tag it starts from, not only those inside the current section, which can cause the problem illustrated in the HTML below.
This is fine if the target tag is guaranteed to exist. If it may be missing, do not pass a text condition to find_next_sibling; instead, iterate over the siblings with a for loop to determine where the section ends.
<h1>Title</h1>
<h2>Subtitle</h2>
<!-- Subsubtitle is missing here -->
<h2>Subtitle 2</h2>
<h3>Subsubtitle</h3><!-- ← find_next_sibling in the sample code skips ahead to this h3 under Subtitle 2 -->
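The for-loop approach described above could be sketched roughly as follows. This is only one possible way to do it, not a definitive implementation: the helpers section_siblings and find_heading are hypothetical names introduced here, the sample HTML (including the ul) is made up for illustration, and a section is assumed to end at the next heading of the same or a higher level.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative HTML: the first h2 has no h3, the second one does
html = """
<h1>Title</h1>
<h2>Subtitle</h2>
<h2>Subtitle 2</h2>
<h3>Subsubtitle</h3>
<ul><li>Body A</li><li>Body B</li></ul>
"""

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

def section_siblings(heading):
    """Yield the sibling tags after `heading`, stopping at the next
    heading of the same or a higher level (hypothetical helper)."""
    level = HEADINGS.index(heading.name)
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace strings and comments
        if sibling.name in HEADINGS and HEADINGS.index(sibling.name) <= level:
            return  # the section of `heading` ends here
        yield sibling

def find_heading(start, name, title):
    """Return the `name` tag titled `title` inside the section of `start`,
    or None if the section has no such tag (hypothetical helper)."""
    for tag in section_siblings(start):
        if tag.name == name and tag.get_text(strip=True) == title:
            return tag
    return None

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", string="Title")

# "Subtitle" has no h3 of its own, so the search stops at "Subtitle 2"
missing = find_heading(find_heading(h1, "h2", "Subtitle"), "h3", "Subsubtitle")
print(missing)  # None

# "Subtitle 2" does contain the h3, so its list items can be collected
h3 = find_heading(find_heading(h1, "h2", "Subtitle 2"), "h3", "Subsubtitle")
ul = next((t for t in section_siblings(h3) if t.name == "ul"), None)
items = [li.get_text(strip=True) for li in ul.find_all("li")]
print(items)  # ['Body A', 'Body B']
```

Because each lookup returns None when the heading is absent from its section, the caller can check for None at each level instead of silently picking up a heading from a later section.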