I want to scrape the following page in Python and extract all of the elements of the university ranking.
https://www.topuniversities.com/university-rankings/university-subject-rankings/2020/arts-humanities
I wrote the code below using Selenium and BeautifulSoup. I can reach the tab, but I am having trouble extracting the text.
from selenium import webdriver
from bs4 import BeautifulSoup
driver=webdriver.Chrome()
url='https://www.topuniversities.com/university-rankings/university-subject-rankings/2020/arts-humanities'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
found=soup.find('div', class_='tab-content')
s=found.find('div', id='ranking-data-load_ind')
a=s.findAll('div', class_='row ind-row')
print(a)
-->[ ]
I look forward to your kind cooperation.
python html web-scraping
If you look at the HTML source of the web page, the ranking part is generated dynamically by JavaScript.
<script id="ranking-row-html_ind" type="text/html">
<div class="row ind-row">
:
Therefore, you may want to wait until the div element is rendered.
The following uses Selenium 4.0.0 and BeautifulSoup4 4.10.0; note that some of the APIs changed in Selenium 4. It also specifies headless mode.
from selenium import webdriver
from selenium.webdriver.chrome import service as cs
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
chrome_service=cs.Service(executable_path='/usr/local/bin/chromedriver')
opts=webdriver.ChromeOptions()
# run Chrome without a visible window
opts.headless=True
driver=webdriver.Chrome(service=chrome_service, options=opts)
url='https://www.topuniversities.com/university-rankings/university-subject-rankings/2020/arts-humanities'
driver.get(url)
# wait (up to 5 seconds) until the div tag is rendered; the locator must be passed as a single tuple
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'ind-row')))
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
found=soup.find('div', class_='tab-content')
s=found.find('div', id='ranking-data-load_ind')
a=s.findAll('div', class_='row ind-row')
print(a)
# output result
[<div class="row ind-row">
<div class="col-lg-12">
<div class="_qs-ranking-data-header-new-white">
<div class="row">
<div class="col-lg-5_right_background">
:
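Since the goal was to extract the text rather than the tags, here is a minimal sketch of that last step. It assumes the rows have already rendered and that you only need the visible text of each row. SAMPLE_HTML is a hypothetical stand-in for driver.page_source, and the inner markup, the university names and the helper extract_row_texts are invented for illustration; the real page's row structure may differ. It uses a CSS selector so that the two classes are matched regardless of their order in the class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source after the rows have rendered.
# The inner markup and the names below are invented for illustration only.
SAMPLE_HTML = """
<div class="tab-content">
  <div id="ranking-data-load_ind">
    <div class="row ind-row"><span>1</span> <span>Example University A</span></div>
    <div class="row ind-row"><span>2</span> <span>Example University B</span></div>
  </div>
</div>
"""


def extract_row_texts(html):
    """Return the visible text of every ranking row as a list of strings."""
    soup = BeautifulSoup(html, 'html.parser')
    # div.row.ind-row matches divs that carry both classes, in any order
    rows = soup.select('#ranking-data-load_ind div.row.ind-row')
    # join the text fragments of each row with a single space
    return [row.get_text(' ', strip=True) for row in rows]


row_texts = extract_row_texts(SAMPLE_HTML)
for text in row_texts:
    print(text)
```

In the Selenium script above, you would pass driver.page_source to the helper after the WebDriverWait call has returned.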