Cannot retrieve data when crawling with Python Beautiful Soup. How do I crawl a JavaScript-rendered web page using Selenium?

https://mensaar.de/#/menu/sb

I want to get the menu data from the web page above and print it out, but it's not working well. The code I wrote is as follows.

# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://mensaar.de/#/menu/sb/"
html = urlopen(url).read()  # returns only the static HTML; no JavaScript is executed
soup = BeautifulSoup(html, 'html.parser')

meal_list = soup.find_all("div")  # every <div> present in the static HTML
print(meal_list)

I tried printing all the div tags as a test, but much of the content is missing from the result, as shown below.

[<div class="navbar-header">
<button class="navbar-toggle" data-target="#mensaar-navbar-collapse" data-toggle="collapse" type="button">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>, <div class="container-fluid">
<div class="collapse navbar-collapse" id="mensaar-navbar-collapse">
<ul class="nav navbar-nav">
<li data-match-route="^/$"><a class="mensaar-brand" data-ng-click="collapseNavbar()" href="#/">MenSaar.de</a></li>
<li data-match-route="^/menu(/\w+)?$" $" data-ng-cloak=""><a data-ng-click="collapseNavbar()" href="#/menu">Speiseplan</a></li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li><a data-match-route="^/privacy$" data-ng-click="collapseNavbar()" href="#/privacy">Datenschutz</a></li>
<li><a href="http://www.studentenwerk-saarland.de/de/Impressum-(2)/Impressum">Impressum</a></li>
</ul>
</div>
</div>, <div class="collapse navbar-collapse" id="mensaar-navbar-collapse">
<ul class="nav navbar-nav">
<li data-match-route="^/$"><a class="mensaar-brand" data-ng-click="collapseNavbar()" href="#/">MenSaar.de</a></li>
<li data-match-route="^/menu(/\w+)?$" $" data-ng-cloak=""><a data-ng-click="collapseNavbar()" href="#/menu">Speiseplan</a></li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li><a data-match-route="^/privacy$" data-ng-click="collapseNavbar()" href="#/privacy">Datenschutz</a></li>
<li><a href="http://www.studentenwerk-saarland.de/de/Impressum-(2)/Impressum">Impressum</a></li>
</ul>
</div>, <div data-ng-view="" id="view"></div>]

Process finished with exit code 0

From searching, I learned that pages that load their content with JavaScript need to be handled separately to get the complete data. If anyone knows a solution, please answer!

python crawling beautifulsoup

2022-09-22 15:45

1 Answer

The menu data is loaded via AJAX after the page itself has finished loading, so that part has to be handled separately.
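One way to see why the first request returns only the page shell: the `#/menu/sb` part of the address is a URL fragment, which the browser keeps to itself and never sends to the server. The small sketch below uses only the standard library to split the URL from the question; the variable names are just for illustration.

```python
from urllib.parse import urldefrag

# Split the address from the question into the part that is requested
# from the server and the fragment that stays in the browser.
page_url, fragment = urldefrag("https://mensaar.de/#/menu/sb")

print(page_url)   # the URL that is actually fetched
print(fragment)   # the client-side route, interpreted by the page's JavaScript
```

So whichever menu route is in the address, a plain HTTP client receives the same static page, and the route is only acted on once JavaScript runs.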

In other words, you request one URL, but the page then requests other URLs on its own.

To reproduce the page yourself, you would need to call each of those URLs and combine the values they return.
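As a rough sketch of that manual route, assuming the page's data request returns JSON: you could look up the request in the browser developer tools (Network tab) and call it directly. The helper name `fetch_json` and its parameters are made up for illustration, and the actual data URL of this site is not known here, so it is left as an argument.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(data_url, timeout=10):
    """Fetch a URL that returns JSON and decode it into Python objects.

    data_url is an assumption: it must be the AJAX address you found
    yourself in the browser's Network tab, not the page address.
    """
    request = Request(data_url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

If the site uses several such requests, each result would have to be fetched this way and merged by hand, which is why the browser-based approach is usually less work.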

That is cumbersome, though, so you can instead let a real web browser load the URL and use the HTML it produces.

A typical way to do this is Selenium WebDriver.

The principle is simple: it automatically launches a web browser, loads the URL, and renders the page. You can then read the rendered HTML.

You can parse that HTML with Beautiful Soup.

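from selenium import webdriver
from bs4 import BeautifulSoup

# PhantomJS is no longer supported by current Selenium releases; headless Chrome is used here (Firefox also works)
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get('https://mensaar.de/#/menu/sb')
# page_source is read right away; AJAX content may need an explicit wait before it appears
bs = BeautifulSoup(driver.page_source, 'html5lib')  # requires the html5lib package
print(bs.find_all("div"))
driver.quit()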
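Because the menu arrives via AJAX, reading `page_source` immediately after `get()` may still return the empty view. A hedged sketch of waiting for the content first is shown below. It assumes the menu is rendered inside the `div` with `id="view"` seen in the question's output, so the default selector `#view *` is a guess rather than something confirmed for this site; `fetch_rendered_html` is a hypothetical helper name.

```python
def fetch_rendered_html(url, css_selector="#view *", timeout=10):
    """Load url in headless Chrome and return the HTML once css_selector matches."""
    # Imported here so the function can be defined even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one element matches the selector;
        # a TimeoutException is raised if nothing appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` in the same way as `driver.page_source` above. If the wait times out, the selector probably needs adjusting to whatever the rendered page actually contains.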

2022-09-22 15:45
