I have code that builds a URL list by scraping a site, and I have code that scrapes only certain attributes from an individual site.
The URL list contains the URLs of the individual sites, but how do I connect to each URL in the list in order and scrape it? Please lend me your wisdom. Thank you for your cooperation.
·Code to obtain the URLs
import requests, bs4
import re

# Fetch the listing page and select the elements that hold the thread URLs
res = requests.get('https://****')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
elems = soup.select('.threadUrlInMetrics')

# Write the selected elements to a text file
with open("abcd.txt", "w") as f:
    print(elems, file=f)

# Read the file back and extract every URL with a regular expression
file = r'abcd.txt'
with open(file) as f:
    url_list = f.read()
pattern = r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+"
text = url_list
url_list = re.findall(pattern, text)
print(url_list)
·Code for scraping individual sites
import requests, bs4

res = requests.get('https://***')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
elems = soup.select('.container')
for elem in elems:
    print(elem)
As @kunif commented, wouldn't it be enough to loop over url_list with a for loop?
# Code to get the URLs
# ...
url_list = re.findall(pattern, text)

# Code for scraping individual sites
import requests, bs4

for url in url_list:
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    elems = soup.select('.container')
    for elem in elems:
        print(elem)
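One thing to watch with a plain loop is that raise_for_status() stops the whole run at the first page that fails. If you would rather skip bad pages and keep going, something like the sketch below might help. The names fetch_containers and scrape_all, the 10-second timeout, and the 1-second pause are my own assumptions and do not come from the question, so adjust them to your case.

```python
import time


def fetch_containers(url):
    """Download one page and return the text of its '.container' elements."""
    # Imported here so that scrape_all() can be tried with a stand-in fetcher
    # even where requests and bs4 are not installed.
    import requests
    import bs4

    res = requests.get(url, timeout=10)  # timeout value is an assumption
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    return [elem.get_text(strip=True) for elem in soup.select(".container")]


def scrape_all(url_list, fetch=fetch_containers, delay=1.0):
    """Visit each URL in order and return (results, failures) as two dicts."""
    results = {}
    failures = {}
    for i, url in enumerate(url_list):
        if i > 0 and delay > 0:
            time.sleep(delay)  # pause between requests to go easy on the server
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad page should not stop the run
            failures[url] = str(exc)
    return results, failures
```

With the url_list from the question you would call results, failures = scrape_all(url_list) and then look at failures to see which URLs need a retry.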