Understanding How to Use BeautifulSoup

Asked 2 years ago, Updated 2 years ago, 153 views

I would like to know how to further narrow down the results returned by BeautifulSoup's select method.
After retrieving the element with class="A", I would like to narrow the results down to the class="B" and class="C" elements inside it.

<div class="A">
  <div class="B">...</div>
  <div class="C">...</div>
</div>

beautifulsoup

2022-09-30 20:10

3 Answers

The following is how to use bs4.BeautifulSoup.select.

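from bs4 import BeautifulSoup

data = '''
<div class='A'>
  <div class='B'></div>
  <div class='C'></div>
  <div class='D'></div>
  <div class='E'></div>
</div>
'''

soup = BeautifulSoup(data, features="html.parser")
# Each selector in a comma-separated group is matched independently,
# so the "div.A >" prefix is repeated to restrict both B and C to children of A.
selection = soup.select('div.A > div.B, div.A > div.C')

print(selection)

=>
[<div class="B"></div>, <div class="C"></div>]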
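Since the question is about narrowing down something that select has already returned, it may help to note that each element in the result is a Tag with its own select method. A minimal sketch, using a variation of the sample HTML (the variable names and the extra div outside A are illustrative assumptions):

```python
from bs4 import BeautifulSoup

data = '''
<div class="A">
  <div class="B">b</div>
  <div class="C">c</div>
  <div class="D">d</div>
</div>
<div class="C">outside A</div>
'''

soup = BeautifulSoup(data, "html.parser")

# Step 1: get the class="A" element (select_one returns the first match or None).
parent = soup.select_one("div.A")

# Step 2: call select again on that element; only its descendants are searched.
narrowed = parent.select("div.B, div.C")

texts = [tag.get_text() for tag in narrowed]
print(texts)
```

Because the second select is scoped to the class="A" element, the div with class="C" outside of it is not picked up.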
A longer example that applies select to a live page:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://azure.microsoft.com/ja-jp/services/'
html = urlopen(URL)
bs = BeautifulSoup(html, "html.parser")

# selection
selected = bs.select(
  '.product-category, .product-category+p, h3.text-heading5 span, h3.text-heading5+p'
)

# filtering (select returns matches in document order, so state carries forward)
category = category_abstract = service = None
n, tbl = 1, []
for item in selected:
  val = item.text
  if item.has_attr('class'):
    cls = item.attrs['class'][0]
    if cls == 'product-category':
      category = val
      continue
    elif cls == 'text-body4':
      service_abstract = val
      tbl.append([n, category, category_abstract, service, service_abstract])
      n += 1
  elif item.name == 'span':
    service = val
  elif item.name == 'p':
    category_abstract = val

# dataframe
df = pd.DataFrame(
  tbl, columns=[
    'Item Number', 'Category Name', 'Category Summary', 'Service Name', 'Service Description'
  ]
)

# output results (to_markdown requires the tabulate package)
print(df.to_markdown(index=False))
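The selector above relies on the adjacent sibling combinator (+), which matches an element that immediately follows another element under the same parent. A small offline sketch of that behavior; the HTML here is made up for illustration and is not the actual markup of the Azure page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the class names used above.
html = '''
<section>
  <h2 class="product-category">Compute</h2>
  <p>Run workloads in the cloud.</p>
  <h3 class="text-heading5"><span>Virtual Machines</span></h3>
  <p class="text-body4">Provision Windows and Linux VMs.</p>
</section>
'''

bs = BeautifulSoup(html, "html.parser")

# "X + p" matches a <p> that is the very next sibling element of X.
category_summary = bs.select_one(".product-category + p").get_text()
service_summary = bs.select_one("h3.text-heading5 + p").get_text()

# "X span" (with a space) matches a <span> anywhere inside X.
service_name = bs.select_one("h3.text-heading5 span").get_text()

print(category_summary)
print(service_name, "-", service_summary)
```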


2022-09-30 20:10

How to further narrow down what you get from the BeautifulSoup select method
After retrieving the element with class="A",


If it is the first div element in the document (and it is known to be class="A"), I might write it the second way below.

clsa = soup.select('div.A')[0]

# clsa = soup.div  # Personally, I might use this one

Narrow down to the class="B" and class="C" elements under it

For elements directly under clsa, children can be used.

for elm in clsa.children:
    print(elm)
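Note that children also yields the text nodes (such as whitespace and newlines) that sit between tags. A minimal sketch of two ways to keep only the tags, assuming sample HTML like the one in the question:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

data = '''
<div class="A">
  <div class="B">b</div>
  <div class="C">c</div>
</div>
'''

soup = BeautifulSoup(data, "html.parser")
clsa = soup.select_one("div.A")

# children includes NavigableString items (the whitespace), so filter by type.
child_tags = [elm for elm in clsa.children if isinstance(elm, Tag)]

# find_all with recursive=False also looks only at direct children.
direct_divs = clsa.find_all("div", recursive=False)

print([tag.get_text() for tag in child_tags])
print([tag.get_text() for tag in direct_divs])
```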

If they are not directly under clsa, call find_all (calling a tag like a function is shorthand for it), with conditions as loose or as strict as needed.

cls_bc = clsa('div')  # divs under clsa
cls_bc = clsa(class_=True)  # elements under clsa that have a class attribute
cls_bc = clsa('div', class_=True)  # divs under clsa that have a class attribute
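To pick out only class="B" and class="C" this way, one option is to pass a list to class_, which matches any of the listed classes. A sketch with illustrative sample HTML (the wrapper div is an assumption added to show that all descendants are searched):

```python
from bs4 import BeautifulSoup

data = '''
<div class="A">
  <div class="B">b</div>
  <div class="wrapper"><div class="C">c (nested)</div></div>
  <div class="D">d</div>
</div>
'''

soup = BeautifulSoup(data, "html.parser")
clsa = soup.find("div", class_="A")

# clsa(...) is equivalent to clsa.find_all(...); it searches all descendants.
cls_bc = clsa("div", class_=["B", "C"])

found = [tag.get_text() for tag in cls_bc]
print(found)
```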

For reasonably complex conditions, prepare a function (a lambda also works).

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

cls_bc = clsa(has_class_but_no_id)
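The same kind of filter can be written inline as a lambda. A sketch, assuming the goal is tags whose class list contains B or C (wanted is an illustrative name, and the sample HTML is made up):

```python
from bs4 import BeautifulSoup

data = '''
<div class="A">
  <div class="B" id="x">b</div>
  <div class="C extra">c</div>
  <div class="D">d</div>
  <p>no class</p>
</div>
'''

soup = BeautifulSoup(data, "html.parser")
clsa = soup.find("div", class_="A")

wanted = {"B", "C"}

# tag.get("class", []) is the list of class names, or [] without a class attribute.
cls_bc = clsa(lambda tag: bool(wanted & set(tag.get("class", []))))

matched = [tag.get_text() for tag in cls_bc]
print(matched)
```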

Note:

# Get anchors (links) from a table of contents page
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://kakuyomu.jp/works/1122334455667788__'  # Table of contents page of some suitable novel
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# When following an element (tag), you don't have to spell out the full path;
# a shortened form such as (elem).li also works.

links = {li.a['href']: li.a.text
         for li in soup.find(id='table-of-contents')('li') if li.a is not None}
for p in links:
    title = links[p].strip().split('\n')
    print(f'{urljoin(url, p)}:\n\t{title}')
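The snippet above needs network access. The following offline sketch walks through the same steps on made-up HTML; the markup and the base URL are assumptions for illustration, not the structure of any real page:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/works/123"  # hypothetical base URL
html = '''
<div id="table-of-contents">
  <ul>
    <li><a href="/works/123/episodes/1"> Chapter 1 </a></li>
    <li>No link here</li>
    <li><a href="/works/123/episodes/2"> Chapter 2 </a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
toc = soup.find(id="table-of-contents")

# toc("li") is shorthand for toc.find_all("li").
# li.a is the first <a> inside each <li>, or None when there is none.
links = {urljoin(base_url, li.a["href"]): li.a.get_text(strip=True)
         for li in toc("li") if li.a is not None}

for link, title in links.items():
    print(f"{link}: {title}")
```

The check for li.a is not None matters here because the second list item has no link.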


2022-09-30 20:10



© 2024 OneMinuteCode. All rights reserved.