Code:
blok[0].find_all('span')
Output:
[<span>(11/20 11:46)</span>,
<span class="PhotoIcon"></span>,
<span>(11/20 11:04)</span>,
<span class="PhotoIcon"></span>,
<span>(11/20 10:49)</span>,
<span class="PhotoIcon"></span>,
<span>(11/20 10:45)</span>,
<span>(11/20 08:58)</span>,
<span class="PhotoIcon"></span>,
<span>(11/20 07:43)</span>,
If the real data looks like this sample, you can simply search the string with a regular expression, without using an HTML parser.
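import re

# Sample data: a list holding one block of HTML text
block = ['''
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:49)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:45)</span>
<span>(11/20 08:58)</span>
<span class="PhotoIcon"></span>
<span>(11/20 07:43)</span>
''']

# '.' does not match a newline, so each match stays on one line;
# '<span>' without attributes skips the PhotoIcon tags
print(re.findall('<span>.*</span>', block[0]))
Run results:
['<span>(11/20 11:46)</span>', '<span>(11/20 11:04)</span>', '<span>(11/20 10:49)</span>', '<span>(11/20 10:45)</span>', '<span>(11/20 08:58)</span>', '<span>(11/20 07:43)</span>']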
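If you only need the timestamps rather than the whole tag, a capture group can pull them out directly. This is just a minimal sketch with shortened sample data, assuming every timestamp is wrapped in parentheses inside an attribute-less span; the names sample and times are made up for illustration.

```python
import re

# Shortened sample in the same shape as the question's output (assumed data)
sample = '''
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
'''

# Match only <span> tags without attributes and capture the text between
# the parentheses; the non-greedy .*? keeps each match inside one tag
times = re.findall(r'<span>\s*\((.*?)\)\s*</span>', sample)
print(times)
```

As with the pattern above, this only holds up while the markup stays this regular; for anything messier an HTML parser is the safer choice.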
To collect only the tags that have no class attribute, you can use a CSS selector.
from bs4 import BeautifulSoup
from pprint import pprint

html = '''
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:49)</span>
<span class="PhotoIcon"></span>
<span>(11/20 10:45)</span>
<span>(11/20 08:58)</span>
<span class="PhotoIcon"></span>
<span>(11/20 07:43)</span>
'''

soup = BeautifulSoup(html, 'html.parser')
# span:not([class]) selects <span> tags that have no class attribute
elms = soup.select('span:not([class])')
pprint(elms)

#
# [<span>(11/20 11:46)</span>,
#  <span>(11/20 11:04)</span>,
#  <span>(11/20 10:49)</span>,
#  <span>(11/20 10:45)</span>,
#  <span>(11/20 08:58)</span>,
#  <span>(11/20 07:43)</span>]
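If you want the text rather than the tag objects, get_text() can be called on each selected element. The following is only a small sketch with shortened sample data; the variable name texts is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the data above (assumed data)
html = '''
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
'''

soup = BeautifulSoup(html, 'html.parser')

# Select the class-less <span> tags, then keep only their stripped text
texts = [elm.get_text(strip=True) for elm in soup.select('span:not([class])')]
print(texts)
```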
I assume you are using BeautifulSoup, so:
In [1]: from bs4 import BeautifulSoup
   ...:
   ...: html = '''<block>
   ...: <span>(11/20 11:46)</span>,
   ...: <span class="PhotoIcon"></span>,
   ...: <span>(11/20 11:04)</span>,
   ...: <span class="PhotoIcon"></span>,
   ...: <span>(11/20 10:49)</span>,
   ...: <span class="PhotoIcon"></span>,
   ...: <span>(11/20 10:45)</span>,
   ...: <span>(11/20 08:58)</span>,
   ...: <span class="PhotoIcon"></span>,
   ...: <span>(11/20 07:43)</span>,
   ...: '''
   ...:
In [2]: soup = BeautifulSoup(html, 'html.parser')
   ...: block = soup.block
   ...: block.find_all('span')
   ...:
Out[2]:
[<span>(11/20 11:46)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 11:04)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 10:49)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 10:45)</span>,
 <span>(11/20 08:58)</span>,
 <span class="PhotoIcon"></span>,
 <span>(11/20 07:43)</span>]
In [3]: block('span', class_=True)
Out[3]:
[<span class="PhotoIcon"></span>,
<span class="PhotoIcon"></span>,
<span class="PhotoIcon"></span>,
<span class="PhotoIcon"></span>]
In [4]: block('span', class_=False)
Out[4]:
[<span>(11/20 11:46)</span>,
 <span>(11/20 11:04)</span>,
 <span>(11/20 10:49)</span>,
 <span>(11/20 10:45)</span>,
 <span>(11/20 08:58)</span>,
 <span>(11/20 07:43)</span>]
Addendum
By the way, block('span', class_=False) is simply a shorter way of writing block.find_all('span', class_=False).
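If you want to confirm the equivalence yourself, something like the following should do. It is only a sketch with shortened sample data; the function filter at the end is an extra illustration of another way to express "no class attribute" and is not part of the session above.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the session above (assumed data)
html = '''<block>
<span>(11/20 11:46)</span>
<span class="PhotoIcon"></span>
<span>(11/20 11:04)</span>
</block>'''

block = BeautifulSoup(html, 'html.parser').block

# Calling a tag like a function is shorthand for find_all()
short = block('span', class_=False)
full = block.find_all('span', class_=False)
print(short == full)

# A function filter that keeps <span> tags without a class attribute
by_func = block.find_all(lambda tag: tag.name == 'span' and not tag.has_attr('class'))
print([tag.get_text() for tag in by_func])
```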
Thank you for the easy-to-follow answer.
Using soup.find_all('div', True) worked very nicely.
How do I avoid unnecessary tags in python web scraping?
Example
IN: blok.find_all('li')
out:<a href="/articles/ASQCP5DZ2QCPULFA00L.html">Second Supplementary Budget Fund, a record 8.9 trillion yen waste hotbed and criticism