I have corrected the points from your earlier answer and run the following code, but the program ends with no output displayed... I would appreciate any ideas about the cause.
# https://review-of-my-life.blogspot.com/2017/10/python-web-scraping-data-collection-analysis.html Practice
# trendAnalytics.py
from selenium import webdriver
import pandas
import time

# Access the page
# WebDriver with PATH
browser = webdriver.PhantomJS(executable_path="/content/phantomjs-2.1.1-linux-x86_64/bin/phantomjs")  # PhantomJS support seems to be over...? Should I use headless Chrome...?
# Try https://qiita.com/orangain/items/db4594113c04e8801aad in the cell below.
# DO NOT FORGET to set path
url = "http://b.hatena.ne.jp/search/text?safe=on&q=Python&users=50"
browser.get(url)
# !touch trend.csv
# df = pandas.read_csv('trend.csv')  # error factor
df = pandas.DataFrame()  # reflects previous answers

# Insert title, date, bookmarks into the CSV file
page = 1  # number of the current page
while True:  # continue until the last page
    if len(browser.find_elements_by_css_selector(".pager-next")) > 0:
        print("####################### page: {} #########################".format(page))
        print("Starting to get posts...")
        posts = browser.find_elements_by_css_selector(".search-result")  # retrieve the posts on this page
        for post in posts:
            title = post.find_element_by_css_selector("h3").text
            date = post.find_element_by_css_selector(".created").text
            bookmarks = post.find_element_by_css_selector(".users span").text
            se = pandas.Series([title, date, bookmarks], ['title', 'date', 'bookmarks'])
            df = df.append(se, ignore_index=True)
            print(df)
        # after getting all posts on a page, follow the pager-next link and fetch the next page
        btn = browser.find_element_by_css_selector("a.pager-next").get_attribute("href")  # the next page to retrieve is a URL
        print("next url: {}".format(btn))
        browser.get(btn)  # go to the next page
        page += 1
        browser.implicitly_wait(10)  # sleep()?
        print("Moving to next page...")
        time.sleep(10)  # Do you need this...?
    else:  # if no next pager exists, stop
        print("no pager exist anymore")
        break
# end of while
df.to_csv("trend1.csv")
print("DONE")
Output
/usr/local/lib/python3.6/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless'
no pager exist anymore
DONE
You use CSS selectors to find the link to the next page, but the structure of the site has changed since the blog article you are following was written, so the same selectors no longer match. Unless you update the two places that call browser.find_elements_by_css_selector, the script will not work. Use your browser's developer tools (inspect element) to examine the target site and update the code to the new selectors.
While debugging, please be careful not to put load on the server: leave enough time between repeated scraping runs.
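For example, the two calls would change as shown below. Note that the selector strings here are hypothetical placeholders, not the site's real class names; verify the actual ones in the developer tools before using them.

```python
# Hypothetical replacement selectors -- check the real class names
# in your browser's developer tools; these are only placeholders.
NEXT_PAGE_SELECTOR = "a.js-pager-next"     # was "a.pager-next" / ".pager-next"
RESULT_SELECTOR = "li.search-result-item"  # was ".search-result"

def has_next_page(browser):
    """True if the current page contains a next-page link."""
    return len(browser.find_elements_by_css_selector(NEXT_PAGE_SELECTOR)) > 0
```

Keeping the selectors in named constants at the top of the script also means you only have one place to edit the next time the site changes.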
Open http://b.hatena.ne.jp/search/text?safe=on&q=Python&users=50
in your browser and check the source.
If the specified CSS selector .pager-next
matches nothing, the while loop takes the else branch immediately and the program ends there, which is exactly what your output shows.
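To confirm that this is the failure mode, you can count the matches for each selector right after browser.get(url), before the loop. A small sketch (the helper below is mine, not part of your script):

```python
def count_selector_hits(browser, selectors):
    """Return {selector: number of matching elements} for quick debugging."""
    return {sel: len(browser.find_elements_by_css_selector(sel))
            for sel in selectors}

# Usage with the selectors from the script:
# print(count_selector_hits(browser, [".pager-next", ".search-result"]))
# A count of 0 for ".pager-next" means the else branch runs immediately.
```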
By the way, if you search for pager-next
, you will find hits in the following places:
the HTML/CSS (check with the browser developer tools)
and the results displayed in the actual browser.
© 2024 OneMinuteCode. All rights reserved.