Get Google Search Screen Information with Python Scraping

Asked 1 year ago, Updated 1 year ago, 64 views

https://torahack.com/python-scraping-for-seo/
As described on this site, I want to:

① Obtain the URLs of the top sites in the search results
② Extract the title, description, etc. from each page
③ Output to CSV and download it

I'd like to do that, but it doesn't work.
When I created and ran the Google Colab notebook below,
the CSV file was generated, but its contents were empty.

Does anyone know how to make it work?

https://colab.research.google.com/drive/1mW3E74mPd_cIAtkbEn7sudOT4n1s3aT4

The code is as follows.

import requests
import bs4
from time import sleep
import pandas as pd
from google.colab import files
import re

# Enter the keywords to search for
listKeyword=['Dog', 'Cat']

# Change this number to match how many results you want
searchNum=str(2)

response=requests.get('https://www.google.co.jp/search?num='+searchNum+'&q='+'+'.join(listKeyword))
response.raise_for_status()

# Parse the HTML obtained from the response
soup=bs4.BeautifulSoup(response.content, "html.parser")

file_prefix=""
for word in listKeyword:
  if file_prefix=="":
    file_prefix+=str(word)
  else:
    file_prefix+="-"+str(word)

fileName=file_prefix+'_Top'+searchNum+'.csv'

# Set the header for the csv file
df=pd.DataFrame(columns=['URL', 'Title', 'Description', 'metakey'])

sleepCounter = 0

# Retrieve the URLs of the top sites in the search results
for a in soup.select('div#search h3.r a'):
  sleepCounter+=1
  url=re.sub(r'/url\?q=|&sa.*', '', a.get('href'))

  try:
    # load the retrieved URL
    search=requests.get(url)
    searchSoup=bs4.BeautifulSoup(search.content, "html.parser")

    # Capture the title
    titleList=[]
    for a in searchSoup.select('title'):
      titleList.append(a.text)
    title='No data'

    for index, item in enumerate(titleList):
      if index == 0:
        title=item
      else:
        title=title+', '+item

    # Retrieve the description
    descriptionList=[]
    for a in searchSoup.select('meta[name="description"]'):
      descriptionList.append(a.get('content'))
    description='No data'

    for index, item in enumerate(descriptionList):
      if index == 0:
        description=item
      else:
        description=description+', '+item

    # Obtain the keywords
    keywordList=[]
    for a in searchSoup.select('meta[name="keywords"]'):
      keywordList.append(a.get('content'))
    keywords='No data'

    for index, item in enumerate(keywordList):
      if index == 0:
        keywords=item
      else:
        keywords=keywords+', '+item

  except Exception:  # Exception handling: what to do when a site cannot be loaded
    print('Failed to read website.')
    continue  # Skip this result, since title/description/keywords were not set

  # Add the retrieved URL, title, description, and keywords as a row
  outputRow=[url, title, description, keywords]
  s=pd.Series(outputRow, index=['URL', 'Title', 'Description', 'metakey'])
  # DataFrame.append was removed in pandas 2.x, so use pd.concat instead
  df=pd.concat([df, s.to_frame().T], ignore_index=True)

  # Wait 10 seconds after every 10 requests to keep the request rate low
  if sleepCounter>10:
    sleep(10)
    sleepCounter = 0

# output to csv
df.to_csv(fileName, index=False)

# Download csv
files.download(fileName)  
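
By the way, a quick diagnostic (a minimal sketch, assuming the same soup and response objects as in the code above) is to print how many elements the selector matches; if it prints 0, the loop body never runs and the CSV comes out empty:

# Diagnostic sketch: count how many elements the result-link selector matches
print(len(soup.select('div#search h3.r a')))

# Also inspect the beginning of the returned HTML, since Google may serve
# different markup depending on the request headers
print(response.text[:500])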

python3 web-scraping beautifulsoup

2022-09-30 14:54

1 Answer

The HTML returned by requests does not seem to match the selector used in the for statement.
I think the reason is that Google's HTML format has changed, but I was able to get the URLs with the following code, so you may be able to achieve your goal by adapting it.
However, if the format changes again, this code will stop working too, so I recommend using the method suggested in the comments.

import requests
import bs4
import re

# Enter the keywords to search for
listKeyword=['Dog', 'Cat']

# Change this number to match how many results you want
searchNum=str(2)

response=requests.get('https://www.google.co.jp/search?num='+searchNum+'&q='+'+'.join(listKeyword))
response.raise_for_status()
soup=bs4.BeautifulSoup(response.content, "html.parser")

# Search results: get the URLs of the top sites (no longer matches anything)
for a in soup.select('div#search h3.r a'):
    print('if an element matches, this line is reached')

# Obtain the URLs starting with '/url' in the search results
for link in [a.get('href') for a in soup.select('a') if a.get('href', '').startswith('/url')]:
    url=re.sub(r'/url\?q=|&sa.*', '', link)
    print(url)
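
If scraping keeps breaking like this, one more stable option is Google's official Custom Search JSON API. Below is a minimal sketch assuming you have created an API key in the Google Cloud Console and a Programmable Search Engine ID; YOUR_API_KEY and YOUR_CX are placeholders:

import requests

# Minimal sketch of the Custom Search JSON API; YOUR_API_KEY and YOUR_CX
# are placeholders you must create yourself
params={
    'key': 'YOUR_API_KEY',
    'cx': 'YOUR_CX',
    'q': 'Dog Cat',
    'num': 2,
}
response=requests.get('https://www.googleapis.com/customsearch/v1', params=params)
response.raise_for_status()

# Each result item already carries the URL, title, and snippet,
# so no HTML parsing is needed
for item in response.json().get('items', []):
    print(item['link'], item['title'], item['snippet'])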


2022-09-30 14:54
