How can I fix Beautiful Soup returning no data when I crawl a web page?

Asked 2 years ago, Updated 2 years ago, 106 views

I'm going to use Python Beautiful Soup to crawl the web.

import requests
from bs4 import BeautifulSoup

def scrapy():
    url = 'http://cu.bgfretail.com/product/product.do?category=product&depth2=4&sf=N'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')

    # Look up the product list container by its class attribute
    prodList = soup.find_all("div", {"class": "prodListWrap"})
    print(prodList)

I'm testing whether I can get convenience store products this way, but the output is only []. Maybe the prodListWrap class is created dynamically, so it is not in the response. How can I solve this?

python beautifulsoup

2022-09-22 13:51

1 Answer

If you disable JavaScript in your browser and open the URL you posted, you can check whether the content is loaded dynamically with JavaScript. The product list does not appear, so the reason you suspected is correct.
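The same check can be done without a browser, because requests never runs JavaScript: count the target elements in the raw HTML and compare with what the browser shows. Below is a minimal sketch. The helper name count_elements and the sample HTML string are made up for illustration and are not from the real page.

```python
from bs4 import BeautifulSoup


def count_elements(html, tag, css_class):
    """Count <tag> elements carrying css_class in an HTML string (hypothetical helper)."""
    # html.parser ships with Python, so no extra parser is required here.
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, class_=css_class))


# Made-up static HTML standing in for a page whose list is filled in by JavaScript.
sample_html = """
<div id="content">
  <div class="prodListWrap"></div>
  <script>/* the product list would be injected here at runtime */</script>
</div>
"""

static_count = count_elements(sample_html, "p", "prodPrice")
print(static_count)
```

For the real page, pass requests.get(url).text instead of sample_html. If the count is far lower than what the browser displays, the missing elements are most likely added by JavaScript.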

Once selenium is installed (pip install selenium), it can be used as follows:

from bs4 import BeautifulSoup
from selenium import webdriver

# Open the page in a real browser so its JavaScript runs
driver = webdriver.Firefox()
driver.get('http://cu.bgfretail.com/product/product.do?category=product&depth2=4&sf=N')

# Parse the HTML as rendered by the browser
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
prodList = soup.find_all("p", {"class": "prodPrice"})
print(len(prodList))

This way, Firefox opens the URL and the HTML is read after the JavaScript has run, so a total of 47 products can be read (there were only 4 with JavaScript disabled).

Instead of the product price ("class": "prodPrice"), you can search for whichever class you want.

Also, on the product list page you posted, you have to keep pressing 'More' to retrieve the entire list. 'More' is implemented as a JavaScript function called nextPage.

from time import sleep

driver.execute_script("nextPage(1);")
sleep(3)

The code above simulates pressing 'More' and then waiting 3 seconds. Repeat it until no more items are loaded, and then read the HTML.
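The "repeat until nothing new loads" step could be sketched roughly as follows. The helper load_all_items, its count_items callback, and the max_clicks cap are hypothetical names introduced for illustration. The sketch also assumes that calling nextPage(1) again after the last page is harmless, which has not been verified against the site.

```python
from time import sleep


def load_all_items(driver, count_items, pause=3, max_clicks=50):
    """Call nextPage repeatedly until the item count stops growing.

    driver      -- a Selenium WebDriver (any object with execute_script works)
    count_items -- callable taking the driver and returning the items loaded so far
    max_clicks  -- safety cap (an assumption) so the loop cannot run forever
    """
    previous = count_items(driver)
    for _ in range(max_clicks):
        driver.execute_script("nextPage(1);")
        sleep(pause)  # give the page time to append the new items
        current = count_items(driver)
        if current == previous:
            break  # nothing new arrived, so treat the list as complete
        previous = current
    return previous
```

Here count_items could be a small function that parses driver.page_source with BeautifulSoup and returns the number of "prodPrice" elements. After load_all_items returns, read driver.page_source once more to get the complete list.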


2022-09-22 13:51



© 2024 OneMinuteCode. All rights reserved.