https://torahack.com/python-scraping-for-seo/
As shown on this site, I would like to:
① Get the URLs of the top sites from the search results
② Extract the title, description, etc.
③ Output to CSV and download it
but it doesn't work.
When I created and ran the Google Colab notebook below, the CSV file was generated, but its contents were empty.
Does anyone know how to get this to work?
https://colab.research.google.com/drive/1mW3E74mPd_cIAtkbEn7sudOT4n1s3aT4
The code is as follows.
import requests
import bs4
from time import sleep
import pandas as pd
from google.colab import files
import re
# enter the keyword one wants to search for
listKeyword=['Dog', 'Cat']
# change the number to match the number of items one wants to acquire
searchNum=str(2)
response = requests.get('https://www.google.co.jp/search?num=' + searchNum + '&q=' + '+'.join(listKeyword))
response.raise_for_status()
# parse the HTML obtained from the response
soup=bs4.BeautifulSoup(response.content, "html.parser")
file_prefix=""
for word in listKeyword:
    if file_prefix == "":
        file_prefix += str(word)
    else:
        file_prefix += "-" + str(word)
fileName=file_prefix+'_Top'+searchNum+'.csv'
# Set the header for the csv file
df = pd.DataFrame(columns=['URL', 'Title', 'Description', 'metakey'])
sleepCounter = 0
# retrieve the URL of the top site of the search results
for a in soup.select('div#search h3.r a'):
    sleepCounter += 1
    url = re.sub(r'/url\?q=|&sa.*', '', a.get('href'))
    try:
        # load the retrieved URL
        search = requests.get(url)
        searchSoup = bs4.BeautifulSoup(search.content, "html.parser")
        # Retrieve the title
        titleList = []
        for a in searchSoup.select('title'):
            titleList.append(a.text)
        title = 'No data'
        for index, item in enumerate(titleList):
            if index == 0:
                title = item
            else:
                title = title + ', ' + item
        # Retrieve the description
        descriptionList = []
        for a in searchSoup.select('meta[name="description"]'):
            descriptionList.append(a.get('content'))
        description = 'No data'
        for index, item in enumerate(descriptionList):
            if index == 0:
                description = item
            else:
                description = description + ', ' + item
        # Retrieve the keywords
        keywordList = []
        for a in searchSoup.select('meta[name="keywords"]'):
            keywordList.append(a.get('content'))
        keywords = 'No data'
        for index, item in enumerate(keywordList):
            if index == 0:
                keywords = item
            else:
                keywords = keywords + ', ' + item
    except:  # exception handling: what to do when a site cannot be loaded
        print('Failed to read website.')
    # Add the retrieved URL, title, description, and keywords
    outputRow = [url, title, description, keywords]
    s = pd.Series(outputRow, index=['URL', 'Title', 'Description', 'metakey'])
    df = df.append(s, ignore_index=True)
    # Wait 10 seconds after every 10 requests to stay within the request-rate limit
    if sleepCounter > 10:
        sleep(10)
        sleepCounter = 0

# Output to CSV
df.to_csv(fileName, index=False)
# Download the CSV
files.download(fileName)
The HTML returned by requests does not appear to contain anything that matches the selector used in the for statement.
I believe the reason is that Google's HTML format has changed. I was able to extract URLs with the code below, so you may be able to achieve your goal by adapting it.
However, if the format changes again, the code will stop working, so I recommend using the method mentioned in the comments.
import requests
import bs4
import re
# enter the keyword one wants to search for
listKeyword=['Dog', 'Cat']
# change the number to match the number of items one wants to acquire
searchNum=str(2)
response = requests.get('https://www.google.co.jp/search?num=' + searchNum + '&q=' + '+'.join(listKeyword))
response.raise_for_status()
soup=bs4.BeautifulSoup(response.content, "html.parser")
# Get the URLs of the top sites in the search results (no longer matches anything)
for a in soup.select('div#search h3.r a'):
    print('If a matching element exists, this line is reached')
# Get the URLs that start with '/url' in the search results
for link in [a.get('href') for a in soup.select('a') if a.get('href').startswith('/url')]:
    url = re.sub(r'/url\?q=|&sa.*', '', link)
    print(url)
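
To tie this back to the original goal, here is a minimal sketch (not the original author's code) of how the '/url' extraction above could replace the broken selector in the CSV workflow. The helper name extract_result_urls is hypothetical, the '/url?q=' pattern is an assumption that may break if Google changes its markup again, and the DataFrame is built from a list of rows instead of df.append.

import re
import requests
import bs4
import pandas as pd

def extract_result_urls(keywords, num):
    # Hypothetical helper: fetch a Google results page and return candidate result URLs.
    query = '+'.join(keywords)
    response = requests.get('https://www.google.co.jp/search?num=' + str(num) + '&q=' + query)
    response.raise_for_status()
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    urls = []
    for a in soup.select('a'):
        href = a.get('href')
        # Assumption: result links still start with '/url?q=...&sa=...'
        if href and href.startswith('/url'):
            urls.append(re.sub(r'/url\?q=|&sa.*', '', href))
    return urls

rows = []
for url in extract_result_urls(['Dog', 'Cat'], 2):
    try:
        page = bs4.BeautifulSoup(requests.get(url).content, 'html.parser')
        title = page.title.text if page.title else 'No data'
        desc_tag = page.select_one('meta[name="description"]')
        description = desc_tag.get('content') if desc_tag else 'No data'
        key_tag = page.select_one('meta[name="keywords"]')
        keywords = key_tag.get('content') if key_tag else 'No data'
        rows.append([url, title, description, keywords])
    except requests.RequestException:
        print('Failed to read website:', url)

# Build the DataFrame in one step; df.append was removed in newer pandas versions
df = pd.DataFrame(rows, columns=['URL', 'Title', 'Description', 'metakey'])
df.to_csv('Dog-Cat_Top2.csv', index=False)

In Colab you would still call files.download('Dog-Cat_Top2.csv') afterwards, as in the question.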