Parsing HTML Using Python 3 and Beautiful Soup 4

Asked 2 years ago, Updated 2 years ago, 147 views

<h1>Title</h1>
<h2>Subtitle 1<h2>
<ul>
<li> Body </li>
<li> Body </li>
...
</ul>
<h3>Subtitle<h3>
<ul>
<li> Body </li>
<li> Body </li>
...
</ul>
<h2>Subtitle 2<h2>
...
<h2>Subtitle 3<h2>
...

What should I do if I want to extract the text within a specified range (title -> subtitle -> sub-subtitle) from HTML with a structure like this?

title_lv1 = soup.findAll('h1')
for title1 in title_lv1:
    if title1.text == sys.argv[1]:
        print(title1.findAll('h2'))

This will result in an empty list.

python html python3 web-scraping

2022-09-30 21:25

1 Answer

The HTML in the question contains unclosed tags (for example, <h2>Subtitle 1<h2>); this answer assumes the tags are properly closed, as in <h2>...</h2>.

The headings in that HTML are siblings, not nested, so calling findAll on the h1 cannot return the h2 elements.
title1.findAll('h2') would find an h2 only if the markup were nested like this:

<h1>
    <h2>Nested!</h2>
</h1>
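
The difference can be checked with a small sketch. The two HTML strings and the variable names below are made up for illustration, and the built-in html.parser is assumed, as in the rest of this answer.

```python
from bs4 import BeautifulSoup

# Flat markup: h2 follows h1 as a sibling (the shape used in the question).
flat = BeautifulSoup("<h1>Title</h1><h2>Subtitle</h2>", "html.parser")
# Nested markup: h2 is a descendant of h1.
nested = BeautifulSoup("<h1><h2>Nested!</h2></h1>", "html.parser")

# find_all (alias: findAll) only searches the descendants of the tag it is called on.
flat_result = flat.find("h1").find_all("h2")
nested_result = nested.find("h1").find_all("h2")

print(flat_result)    # []
print(nested_result)  # [<h2>Nested!</h2>]
```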

For a flat structure like the one in the question, where the headings are siblings, use find_next_sibling to fetch the following tag that matches a condition.
I think the following code gets the desired body text.

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('test.html'), 'html.parser')
# Replace each title passed to text= with a variable or a command-line argument
h3 = soup.find('h1', text='Title').find_next_sibling('h2', text='Subtitle').find_next_sibling('h3', text='Subsubtitle')
ul = h3.find_next_sibling('ul')
for li in ul.findAll('li'):
    print(li.text)

However, find_next_sibling searches every sibling that follows the tag it starts from, not just the ones in the current section, which can cause the problem shown below.
This is fine if the tag is guaranteed to exist. If it may be missing, do not pass a text condition to find_next_sibling; instead, walk the siblings in a for loop so you can tell where the range ends.

<h1>Title</h1>
<h2>Subtitle</h2>
<!-- Subsubtitle missing -->
<h2>Subtitle 2</h2>
<h3>Subsubtitle</h3><!-- ← the sample code's find_next_sibling skips ahead to this tag under Subtitle 2 -->
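
One possible way to write the for-loop version is sketched below. It is not the only approach: the helper names next_heading_in_section and list_items_under, the HEADINGS list, and the sample HTML are all hypothetical, and it assumes that a section ends at the next heading of the same or a higher level.

```python
from bs4 import BeautifulSoup, Tag

HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def next_heading_in_section(start, name, text):
    """Return the `name` heading with `text` inside the section opened by `start`, or None."""
    start_level = HEADINGS.index(start.name)
    for sib in start.next_siblings:
        # Skip whitespace strings and non-heading tags.
        if not isinstance(sib, Tag) or sib.name not in HEADINGS:
            continue
        if HEADINGS.index(sib.name) <= start_level:
            return None  # a heading of the same or a higher level ends the section
        if sib.name == name and sib.get_text(strip=True) == text:
            return sib
    return None


def list_items_under(soup, titles):
    """titles = [h1 text, h2 text, h3 text]; return the <li> texts below the last heading."""
    current = soup.find("h1", string=titles[0])
    for name, text in zip(HEADINGS[1:], titles[1:]):
        if current is None:
            return []
        current = next_heading_in_section(current, name, text)
    if current is None:
        return []
    items = []
    for sib in current.next_siblings:
        if isinstance(sib, Tag):
            if sib.name in HEADINGS:
                break  # the next heading closes the range
            if sib.name == "ul":
                items += [li.get_text(strip=True) for li in sib.find_all("li")]
    return items


# Hypothetical sample: "Subtitle" has no sub-subtitle, "Subtitle 2" has one.
html = """
<h1>Title</h1>
<h2>Subtitle</h2>
<h2>Subtitle 2</h2>
<h3>Subsubtitle</h3>
<ul><li>Body A</li><li>Body B</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")
missing = list_items_under(soup, ["Title", "Subtitle", "Subsubtitle"])
present = list_items_under(soup, ["Title", "Subtitle 2", "Subsubtitle"])
print(missing)  # []
print(present)  # ['Body A', 'Body B']
```

Because the loop stops at the next h2, the lookup under "Subtitle" returns an empty list instead of picking up the list that belongs to "Subtitle 2".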


2022-09-30 21:25

© 2024 OneMinuteCode. All rights reserved.