Output of scraping results

Asked 2 years ago, Updated 2 years ago, 105 views

I fixed the points raised in the earlier answer and ran the code below, but the program ends without displaying any results. If you have any idea what the cause might be, I would appreciate your help.

# Practice based on https://review-of-my-life.blogspot.com/2017/10/python-web-scraping-data-collection-analysis.html
# trendAnalytics.py
from selenium import webdriver
import pandas
import time

# Access the page
# WebDriver with PATH
browser=webdriver.PhantomJS(executable_path="/content/phantomjs-2.1.1-linux-x86_64/bin/phantomjs") # PhantomJS support seems to have ended...? Should I use headless Chrome instead?
# Try https://qiita.com/orangain/items/db4594113c04e8801aad in the cell below.
# DO NOT FORGET to set the path
url="http://b.hatena.ne.jp/search/text?safe=on&q=Python&users=50"
browser.get(url)
#!touch trend.csv
# df = pandas.read_csv('trend.csv') # This was the cause of the earlier error
df=pandas.DataFrame() # Changed as suggested in the previous answer
# Insert title, date, bookmarks into CSV file

page=1 # Number of the current page

while True: # continue until getting the last page
  if len(browser.find_elements_by_css_selector(".pager-next"))>0:
    print("####################### page:{}#########################".format(page))
    print("Starting to get posts...")
    posts=browser.find_elements_by_css_selector(".search-result")# Retrieving something...

    for post in posts:
      title=post.find_element_by_css_selector("h3").text
      date=post.find_element_by_css_selector(".created").text
      bookmarks=post.find_element_by_css_selector(".users span").text
      se=pandas.Series([title, date, bookmarks], ['title', 'date', 'bookmarks'])
      df=df.append(se,ignore_index=True)
      print(df)

    # after getting all posts in a page, click pager next and then get next all posts again
    btn=browser.find_element_by_css_selector("a.pager-next").get_attribute("href") # URL of the next page to retrieve posts from
    print("next url:{}".format(btn))
    browser.get(btn) # Go to the next page
    page+=1
    browser.implicitly_wait(10) # Is this like sleep()?
    print("Moving to next page...")
    time.sleep(10) # Is this needed...?

  else: # if no(next)pager exist, stop.
    print("no pager exist anymore")
    break
# end of while loop

df.to_csv("trend1.csv")
print ("DONE")

Output

/usr/local/lib/python3.6/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless'
no pager exist anymore
DONE

python web-scraping google-colaboratory

2022-09-30 17:26

2 Answers

Your code uses a CSS selector to get the link to the next page, but the structure of the page has changed since the blog article you are following was written, so the same selector no longer matches.

So the script will not work until you correct the two places that call browser.find_elements_by_css_selector. Use your browser's Inspect feature (developer tools) to analyze the target site and update the code with the new selectors.
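
One way to see which selectors have stopped matching is to count the hits for each one before entering the loop. The helper below is only a sketch: count_matches is a hypothetical name, and it accepts any callable that takes a CSS selector and returns a list, so it does not depend on a particular Selenium version.

```python
from typing import Callable, Dict, Iterable, Sequence


def count_matches(find_elements: Callable[[str], Sequence],
                  selectors: Iterable[str]) -> Dict[str, int]:
    """Return how many elements each CSS selector matches.

    find_elements is any callable that maps a selector string to a
    sequence of matched elements (an assumption of this sketch).
    """
    return {selector: len(find_elements(selector)) for selector in selectors}
```

With the code in the question it could be called as count_matches(browser.find_elements_by_css_selector, [".pager-next", ".search-result"]). A count of 0 for a selector means that selector needs updating.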

While debugging you will run the scraper many times, so leave enough time between runs to avoid putting load on the server.
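
If it helps, the waiting can be made explicit with a small helper that enforces a minimum interval between requests. This is a minimal sketch using only the standard library; the Throttle class and the 10 second interval in the usage note are assumptions for illustration, not something the target site specifies.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

For example, create throttle = Throttle(10) once and call throttle.wait() immediately before each browser.get(...). The first call returns at once and later calls pause only as long as needed.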


2022-09-30 17:26

Open the target URL, http://b.hatena.ne.jp/search/text?safe=on&q=Python&users=50, in your browser and check the source.
If nothing matches the specified CSS selector .pager-next, the program takes the else branch and ends there, so it is simply running exactly as written.
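
The same check can be done offline on a saved copy of the page source. The sketch below uses only the standard library html.parser module; ClassFinder and tags_with_class are hypothetical names, and the sample HTML is made up for illustration and is not the real markup of the site.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect the names of tags whose class attribute contains class_name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.matches.append(tag)


def tags_with_class(html, class_name):
    """Return the tag names in html that carry the given class."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.matches


# Made-up sample markup, NOT the real page source
sample = '<div class="pager"><a class="pager-next" href="?page=2">Next</a></div>'
print(tags_with_class(sample, "pager-next"))     # ['a']
print(tags_with_class(sample, "search-result"))  # []
```

An empty list for .pager-next corresponds to the case where the script prints "no pager exist anymore" and stops.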

Incidentally, searching the page for pager-next gives the following hits:

HTML/CSS (checked with the browser developer tools)
[image]

Result displayed in the actual browser
[image]


2022-09-30 17:26
