How to collect only the tags that have no class attribute in Python web scraping

Asked 2 years ago, Updated 2 years ago, 348 views

Code:

blok[0].find_all('span')

Output:

[<span>(11/20 11:46)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 11:04)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 10:49)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 10:45)</span>,
 <span>(11/20 08:58)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 07:43)</span>]

python web-scraping beautifulsoup

2022-11-21 05:36

5 Answers

If the input looks like this sample, you can simply search the string with a regular expression, without using an HTML parser.

import re

# One-element list of HTML text, so blok[0] is indexed as in the question
blok = ['''
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:49)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:45)</span>
<span>(11/20 08:58)</span>
<span class="PhotoIcon"></span>
<span>(11/20 07:43)</span>
''']
# '.' does not match newlines, so each attribute-less <span> line matches on its own
print(re.findall('<span>.*</span>', blok[0]))

Run results:

['<span>(11/20 11:46)</span>', '<span>(11/20 11:04)</span>', '<span>(11/20 10:49)</span>', '<span>(11/20 10:45)</span>', '<span>(11/20 08:58)</span>', '<span>(11/20 07:43)</span>']
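
If only the text inside each span is wanted, a capture group is one way to get it. This is just a minimal sketch: the `html` string is a made-up sample shaped like the question's output, and the pattern assumes the wanted spans carry no attributes at all and are never nested.

```python
import re

# Hypothetical sample shaped like the spans in the question
html = '''
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span> (11/20 11:04)</span>
'''

# Literal '<span>' only matches spans without attributes;
# the non-greedy group captures the text, \s* trims surrounding whitespace.
timestamps = re.findall(r'<span>\s*(.*?)\s*</span>', html)
print(timestamps)
```

For anything beyond a simple sample like this, an HTML parser is probably the safer choice.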


2022-11-21 08:05

Collecting tags that have no class attribute

This can be done with a CSS selector.

from bs4 import BeautifulSoup
from pprint import pprint

html = '''
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:49)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:45)</span>
<span>(11/20 08:58)</span>
<span class="PhotoIcon"></span>
<span>(11/20 07:43)</span>
'''
soup = BeautifulSoup(html, 'html.parser')
# :not([class]) keeps only the <span> tags that have no class attribute
elms = soup.select('span:not([class])')
pprint(elms)

#
# [<span>(11/20 11:46)</span>,
#  <span>(11/20 11:04)</span>,
#  <span>(11/20 10:49)</span>,
#  <span>(11/20 10:45)</span>,
#  <span>(11/20 08:58)</span>,
#  <span>(11/20 07:43)</span>]
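
As a small follow-on sketch, the same selector can feed a list comprehension when only the text is needed, and `:not(.PhotoIcon)` is an alternative if you want to drop one specific class rather than every classed span. The `html` sample and variable names here are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the spans in the question
html = '''
<span> (11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
'''

soup = BeautifulSoup(html, 'html.parser')

# Spans with no class attribute at all, reduced to their stripped text
texts = [s.get_text(strip=True) for s in soup.select('span:not([class])')]

# Spans that do not have the PhotoIcon class (other classes would still pass)
no_icon = soup.select('span:not(.PhotoIcon)')

print(texts)
print(len(no_icon))
```

With this sample both selectors pick the same two spans; they differ only when spans with other classes are present.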


2022-11-21 08:28

I assume you are using BeautifulSoup, so:

In [1]: from bs4 import BeautifulSoup
   ...:
   ...: html = '''<block>
   ...:  <span>(11/20 11:46)</span>,
   ...:  <span class="PhotoIcon"></span>,
   ...:  <span>(11/20 11:04)</span>,
   ...:  <span class="PhotoIcon"></span>,
   ...:  <span>(11/20 10:49)</span>,
   ...:  <span class="PhotoIcon"></span>,
   ...:  <span>(11/20 10:45)</span>,
   ...:  <span>(11/20 08:58)</span>,
   ...:  <span class="PhotoIcon"></span>,
   ...:  <span>(11/20 07:43)</span>,
   ...: </block>'''
   ...:

In [2]: soup = BeautifulSoup(html, 'html.parser')
   ...: block = soup.block
   ...: block.find_all('span')
   ...:
Out[2]:
[<span>(11/20 11:46)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 11:04)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 10:49)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 10:45)</span>,
 <span>(11/20 08:58)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 07:43)</span>]

In [3]: block('span', class_=True)  # spans that have a class attribute
Out[3]:
[<span class="PhotoIcon"></span>,
 <span class="PhotoIcon"></span>,
 <span class="PhotoIcon"></span>,
 <span class="PhotoIcon"></span>]

In [4]: block('span', class_=False)  # spans that have no class attribute
Out[4]:
[<span>(11/20 11:46)</span>,
 <span>(11/20 11:04)</span>,
 <span>(11/20 10:49)</span>,
 <span>(11/20 10:45)</span>,
 <span>(11/20 08:58)</span>,
 <span>(11/20 07:43)</span>]

In [5]:

Addendum:
By the way, block('span', class_=False) is just shorthand for block.find_all('span', class_=False). Calling a tag object directly is the same as calling its find_all().
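
To make the equivalence concrete, here is a minimal sketch on a cut-down sample (the `html` string and variable names are made up). It also shows a function filter using `has_attr`, which I believe expresses the same condition more explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down sample
html = '<block><span>(11/20 11:46)</span><span class="PhotoIcon"></span></block>'
soup = BeautifulSoup(html, 'html.parser')
block = soup.block

short_form = block('span', class_=False)          # calling the tag directly
long_form = block.find_all('span', class_=False)  # explicit find_all

# Same condition written as a function filter
by_function = block.find_all(
    lambda tag: tag.name == 'span' and not tag.has_attr('class')
)

print(short_form == long_form == by_function)
```

The function form is longer, but it is easier to extend if you later need a more specific condition than "has no class".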


2022-11-21 11:45

Thank you for the easy-to-understand answer.

Using soup.find_all('div', True) worked very nicely.


2022-11-21 14:29

How do I exclude unnecessary tags in Python web scraping?
Example:
In: blok.find_all('li')
Out: <a href="/articles/ASQCP5DZ2QCPULFA00L.html">Second Supplementary Budget Fund, a record 8.9 trillion yen waste hotbed and criticism
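
The post does not show the page's real markup, so the following is only a guess at what might help: instead of keeping whole `li` tags, select the links inside them and take just the text and `href`. The `html` sample, its URLs, and the headlines are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is not shown in the post
html = '''
<ul>
  <li><a href="/articles/example-1.html">First headline</a> <span>(11/20 11:46)</span></li>
  <li><a href="/articles/example-2.html">Second headline</a> <span class="PhotoIcon"></span></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keep only the link text and URL; sibling tags such as <span> are ignored
links = [(a.get_text(strip=True), a['href']) for a in soup.select('li > a[href]')]
print(links)
```

If the unwanted tags sit inside the `a` element itself, a different approach would be needed.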


2022-11-21 17:44



© 2024 OneMinuteCode. All rights reserved.