Following up on yesterday's question, I have another one.
import urllib.request
import bs4

x = informationurls[4]  # informationurls is a list of more than 1000 URLs as strings
html = urllib.request.urlopen(x)
bsObj = bs4.BeautifulSoup(html, "html.parser")
contents = bsObj.find("div", {"class": "user_content"})
print(contents.text)
If I do this, I get exactly the result I want, e.g.:
Working area: Suwon, Gyeonggi Province
Salary: Decided after the interview
Selection procedure and required documents
Screening procedure: Document screening -> Practical interview -> Executive interview
Required documents: Korean resume and cover letter
Application period and method
Application period: ASAP
Resume form: Company form
Application method: Job application for people
Other precautions: If false information is found in the application documents, the job offer may be canceled even after hiring is confirmed
That gives me the value I want. The problem comes when I loop over all the URLs:
import urllib.request
import bs4

i = ""
for x in informationurls:
    html = urllib.request.urlopen(x)
    bsObj = bs4.BeautifulSoup(html, "html.parser")
    contents = bsObj.find("div", {"class": "user_content"})
    i += contents.text  # every page's text is appended to the same string
print(i)
When I do this, the text is supposed to come out separately for each URL, but it all ends up concatenated in one string, so it is far too much to split apart afterwards. So I tried collecting it in a list instead, with i = [] and i += contents.text, but this time each character was added to the list as a separate element (since += on a list extends it one character at a time), so that was not right either.
I need to save the results to Excel, one entry per URL, and I'm at a loss. I really need help.
python html crawler crawling
Build a dictionary as shown below, using each URL as the key and the extracted contents as the value.
import urllib.request
import bs4

def getContents(x):
    html = urllib.request.urlopen(x)
    bsObj = bs4.BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"class": "user_content"}).text

contentsHolder = {x: getContents(x) for x in informationurls}
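With this dictionary, each URL keeps its own entry, so nothing runs together and you can process the pages one at a time, for example:

# Each URL's text stays separate and can be handled per URL.
for url, text in contentsHolder.items():
    print(url)
    print(text)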
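And since you need to save the results to Excel by URL, here is a minimal sketch using the standard-library csv module; the file name contents.csv is an assumption, and the utf-8-sig encoding is there so Excel detects the encoding correctly when you open the file:

import csv

# "contents.csv" is a hypothetical file name; one row per URL.
# utf-8-sig writes a BOM so Excel opens the file with the right encoding.
with open("contents.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "contents"])  # header row
    for url, text in contentsHolder.items():
        writer.writerow([url, text])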