Scraping University Rankings in Python

I want to scrape the following page in Python to extract the university ranking and its related elements.

https://www.topuniversities.com/university-rankings/university-subject-rankings/2020/arts-humanities

I wrote the code below using Selenium and BeautifulSoup. I can access the tab, but I am having trouble extracting the text.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

url = 'https://www.topuniversities.com/university-rankings/university-subject-rankings/2020/arts-humanities'

driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
found = soup.find('div', class_='tab-content')
s = found.find('div', id='ranking-data-load_ind')
a = s.findAll('div', class_='row ind-row')
print(a)

--> []

Any help would be appreciated.

python html web-scraping

2022-09-30 21:59

1 Answer

If you look at the HTML source of the web page, you can see that the ranking rows are generated dynamically by JavaScript from a template.

<script id="ranking-row-html_ind" type="text/html">
   <div class="row ind-row">
              :

Therefore, you need to wait until the div elements have been rendered before reading the page source.
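To see why the code in the question printed an empty list, here is a minimal sketch on a made-up HTML snippet. The markup is a simplified stand-in for the page before JavaScript has run, not the real page source, and html.parser is used only so the sketch needs no extra install. Markup inside a script tag with type="text/html" is kept as plain text by the parser, so the row divs do not exist as elements yet.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the page before JavaScript has run.
# Only the script-template idea from the real page is kept.
STATIC_HTML = """
<div class="tab-content">
  <div id="ranking-data-load_ind"></div>
</div>
<script id="ranking-row-html_ind" type="text/html">
  <div class="row ind-row">template row</div>
</script>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The container exists, but nothing has been rendered into it yet.
container = soup.find("div", id="ranking-data-load_ind")
rows = container.find_all("div", class_="row ind-row")
print(rows)

# The template's content is a single text node, not parsed elements.
template = soup.find("script", id="ranking-row-html_ind")
print(template.find("div"))
print("ind-row" in template.string)
```

The search inside the container finds nothing, and the div inside the script tag is only visible as text, which is consistent with the empty result in the question.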

The following uses Selenium 4.0.0 and BeautifulSoup4 4.10.0. Note that some APIs changed in Selenium 4. It also runs Chrome in headless mode.

from selenium import webdriver
from selenium.webdriver.chrome import service as cs
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

chrome_service = cs.Service(executable_path='/usr/local/bin/chromedriver')
opts = webdriver.ChromeOptions()
# Selenium 4.0 API; later releases use opts.add_argument('--headless') instead
opts.headless = True
driver = webdriver.Chrome(service=chrome_service, options=opts)

url = 'https://www.topuniversities.com/university-rankings/university-subject-rankings/2020/arts-humanities'
driver.get(url)

# wait up to 5 seconds until a div with class "ind-row" is present in the DOM
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'ind-row')))

html = driver.page_source
driver.quit()  # the browser is no longer needed once the HTML is captured
soup = BeautifulSoup(html, 'lxml')

found = soup.find('div', class_='tab-content')
s = found.find('div', id='ranking-data-load_ind')
a = s.find_all('div', class_='row ind-row')
print(a)

# output result
[<div class="row ind-row">
<div class="col-lg-12">
<div class="_qs-ranking-data-header-new-white">
<div class="row">
<div class="col-lg-5_right_background">
              :
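Since the question was about getting the text out, here is a minimal sketch of that last step. The sample markup and its values are invented placeholders standing in for the rendered page source, so the real rows will contain different and more deeply nested content. It relies only on stripped_strings, which does not depend on the inner structure of each row.

```python
from bs4 import BeautifulSoup

# Invented sample standing in for driver.page_source after rendering.
# The inner markup and values are placeholders, not real ranking data.
RENDERED_HTML = """
<div id="ranking-data-load_ind">
  <div class="row ind-row"><span>1</span> <a href="#">Example University A</a></div>
  <div class="row ind-row"><span>2</span> <a href="#">Example University B</a></div>
</div>
"""

soup = BeautifulSoup(RENDERED_HTML, "html.parser")
rows = soup.find_all("div", class_="row ind-row")

# stripped_strings yields every text fragment inside a tag,
# with surrounding whitespace removed and empty fragments skipped.
records = [list(row.stripped_strings) for row in rows]
for record in records:
    print(record)
```

Each row becomes a list of its text fragments. On the real page you would then pick the fragments you need by position, or narrow the search to specific child tags once you have inspected the rendered markup.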

2022-09-30 21:59
