I would like to know how to further narrow down the result returned by the select method of BeautifulSoup.
After retrieving the element with class="A", I would like to narrow the result down to the class="B" and class="C" elements beneath it, as in the markup below.
<div class="A">
<div class="B">...</div>
<div class="C">...</div>
</div>
The following shows how to do it with bs4.BeautifulSoup.select.
from bs4 import BeautifulSoup

data = '''
<div class='A'>
<div class='B'></div>
<div class='C'></div>
<div class='D'></div>
<div class='E'></div>
</div>
'''
soup = BeautifulSoup(data, features="html.parser")
# div.B directly under div.A, or any div.C (the comma separates two independent selectors)
selection = soup.select('div.A>div.B,div.C')
print(selection)
=>
[<div class="B"></div>, <div class="C"></div>]
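One thing to watch in the selector 'div.A>div.B,div.C': the comma separates two independent selectors, so div.C is matched anywhere in the document, not only under div.A. A minimal sketch of narrowing in two steps instead, by calling select again on the Tag that the first select returned. The extra div.C outside div.A and the text contents are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="A">
  <div class="B">b</div>
  <div class="C">c-inside</div>
  <div class="D">d</div>
</div>
<div class="C">c-outside</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One step: "div.C" is its own selector, so it is not limited to div.A
loose = [t.get_text() for t in soup.select("div.A>div.B,div.C")]

# Two steps: pick div.A first, then call select() again on that Tag;
# the second call only searches inside div.A
a = soup.select_one("div.A")
scoped = [t.get_text() for t in a.select("div.B, div.C")]

print(loose)
print(scoped)
```

Writing the one-step selector as 'div.A>div.B, div.A>div.C' should give the same scoped result.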
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://azure.microsoft.com/ja-jp/services/'
html = urlopen(URL)
bs = BeautifulSoup(html, "html.parser")

# selection
selected = bs.select(
    '.product-category, .product-category+p, h3.text-heading5 span, h3.text-heading5+p'
)

# filtering
n, tbl = 1, []
for item in selected:
    val = item.text
    if item.has_attr('class'):
        cls = item.attrs['class'][0]
        if cls == 'product-category':
            category = val
            continue
        elif cls == 'text-body4':
            service_abstract = val
            tbl.append([n, category, category_abstract, service, service_abstract])
            n += 1
    elif item.name == 'span':
        service = val
    elif item.name == 'p':
        category_abstract = val

# dataframe
df = pd.DataFrame(
    tbl, columns=[
        'Item Number', 'Category Name', 'Category Summary', 'Service Name', 'Service Description'
    ]
)

# output results (to_markdown requires the tabulate package)
print(df.to_markdown(index=False))
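The selector list in that script relies on two combinators: a space (descendant) and + (the element immediately following). A small offline sketch of what each part picks up. The HTML here is invented to mimic the class names used in the script and is not the real Azure page.

```python
from bs4 import BeautifulSoup

# Invented markup that only imitates the class names used in the script
html = """
<h2 class="product-category">Compute</h2>
<p>Run workloads</p>
<p>not adjacent to the heading</p>
<h3 class="text-heading5"><span>VM</span></h3>
<p class="text-body4">Virtual machines</p>
"""
soup = BeautifulSoup(html, "html.parser")

picked = soup.select(
    ".product-category, .product-category+p, h3.text-heading5 span, h3.text-heading5+p"
)
# Results come back in document order, which the filtering loop depends on
pairs = [(t.name, t.get_text()) for t in picked]
print(pairs)
```

The second p is skipped because + only matches the sibling element that comes immediately after.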
How to further narrow down what you get from the BeautifulSoup select method
After retrieving the element with class="A":
(If it is the first div element to appear, and it is known to be class="A", I might write it the second way.)
clsa = soup.select('div.A')[0]
# clsa = soup.div  # Personally, I might use this one
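As a rough comparison of the ways to get that first element, here is a sketch using select_one as a third option. The sample markup is an assumption based on the question.

```python
from bs4 import BeautifulSoup

data = """
<div class="A">
<div class="B"></div>
<div class="C"></div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")

first_by_index = soup.select("div.A")[0]  # raises IndexError if nothing matches
first_by_one = soup.select_one("div.A")   # returns None if nothing matches
first_div = soup.div                      # first <div> in the document, whatever its class

# All three refer to the same Tag object in this document
same = first_by_index is first_by_one is first_div
missing = soup.select_one("div.Z")
print(same, missing)
```

soup.div is the shortest, but it only works when the wanted element really is the first div.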
Narrowing down to the class="B" and class="C" elements beneath it:
For elements directly under clsa, children can be used.
for elm in clsa.children:
    print(elm)
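Note that children also yields the text nodes (such as newlines) between the tags. A minimal sketch of keeping only the tags, either by type check or with find_all(recursive=False); the sample markup is an assumption based on the question.

```python
from bs4 import BeautifulSoup, Tag

data = """<div class="A">
<div class="B">b</div>
<div class="C">c</div>
</div>"""
soup = BeautifulSoup(data, "html.parser")
clsa = soup.select_one("div.A")

all_children = list(clsa.children)  # includes the newline text nodes
tag_children = [c for c in clsa.children if isinstance(c, Tag)]
direct_divs = clsa.find_all("div", recursive=False)  # direct child divs only

print(len(all_children), len(tag_children))
print(direct_divs)
```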
If they are not directly under clsa, call find_all (calling the tag itself is shorthand for find_all), filtering either loosely or strictly.
cls_bc = clsa('div')  # every div underneath
cls_bc = clsa(class_=True)  # everything underneath that has a class attribute
cls_bc = clsa('div', class_=True)  # divs underneath that have a class attribute
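To request exactly class B or C, class_ also accepts a list, which matches any of the listed values. A sketch under the assumption that div.B may be nested more deeply; the section wrapper and div.D are made up for illustration.

```python
from bs4 import BeautifulSoup

data = """
<div class="A">
  <section>
    <div class="B">b</div>
  </section>
  <div class="C">c</div>
  <div class="D">d</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
clsa = soup.select_one("div.A")

# Loose: every div at any depth under clsa
loose = [d["class"][0] for d in clsa("div")]
# Strict: only divs whose class is B or C
strict = [d["class"][0] for d in clsa.find_all("div", class_=["B", "C"])]

print(loose)
print(strict)
```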
For reasonably complex conditions, prepare a function (a lambda is also acceptable).
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

cls_bc = clsa(has_class_but_no_id)
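Applied to the B and C case from the question, such a filter function might look like the sketch below. The wanted set, the function name, and the id on one element are hypothetical, chosen only to show a named function next to a lambda.

```python
from bs4 import BeautifulSoup

data = """
<div class="A">
<div class="B" id="x">b-with-id</div>
<div class="B">b</div>
<div class="C">c</div>
<div class="D">d</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
clsa = soup.select_one("div.A")

wanted = {"B", "C"}  # hypothetical set of class names to keep

def is_wanted_without_id(tag):
    # The class attribute is returned as a list of names, or is absent
    classes = set(tag.get("class", []))
    return bool(classes & wanted) and not tag.has_attr("id")

by_function = [t.get_text() for t in clsa(is_wanted_without_id)]
by_lambda = [t.get_text() for t in clsa(lambda t: "C" in t.get("class", []))]

print(by_function)
print(by_lambda)
```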
Note:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Get anchors (links) from a table-of-contents page
url = 'https://kakuyomu.jp/works/1122334455667788__'  # table-of-contents page of some suitable novel
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# When following elements (tags), you don't have to write the lookup out in full;
# a shorthand such as (elem).li is allowed
links = {li.a['href']: li.a.text
         for li in soup.find(id='table-of-contents')('li') if li.a is not None}
for p in links:
    title = links[p].strip().split('\n')
    print(f'{urljoin(url, p)}:\n\t{title}')
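The same shorthand can be tried without network access. This is only a sketch: the base URL and the list markup are placeholders, not the structure of the real site.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "https://example.com/works/1/"  # placeholder URL
html = """
<ul id="table-of-contents">
  <li><a href="episodes/1">
    Chapter 1
  </a></li>
  <li>no link here</li>
  <li><a href="/works/1/episodes/2">Chapter 2</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
toc = soup.find(id="table-of-contents")

# toc("li") is shorthand for toc.find_all("li"); li.a is shorthand for li.find("a")
links = {urljoin(base, li.a["href"]): li.a.text.strip()
         for li in toc("li") if li.a is not None}

for link, title in links.items():
    print(f"{link}: {title}")
```

The `if li.a is not None` guard matters because li.a is None for list items that contain no link.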
You can chain find or find_all.
class is a reserved word in Python, so use class_ instead:
a = soup.select("div.A")[0]
a.find_all("div", class_="B")  # => [<div class="B"></div>]
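A short sketch of that chaining, with attrs={"class": ...} shown as an alternative that avoids the trailing underscore. The div.B outside div.A is made up to show that the chained call stays inside div.A.

```python
from bs4 import BeautifulSoup

data = """
<div class="A">
<div class="B">inside</div>
</div>
<div class="B">outside</div>
"""
soup = BeautifulSoup(data, "html.parser")

# find returns one Tag, so find_all can be called on its result directly
chained = soup.find("div", class_="A").find_all("div", class_="B")
# Same search, passing the class through the attrs dictionary
via_attrs = soup.find("div", attrs={"class": "A"}).find_all("div", attrs={"class": "B"})

texts = [t.get_text() for t in chained]
print(texts)
```

Since find returns None when nothing matches, the chain raises AttributeError if div.A is absent.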