I have a question about Python crawling. (Second...)

Asked 2 years ago, Updated 2 years ago, 111 views

This is a follow-up to the question I asked yesterday.

import urllib.request
import bs4

x = informationurls[4]  # informationurls is a list of more than 1,000 URLs, each a str
html = urllib.request.urlopen(x)

bsObj = bs4.BeautifulSoup(html, "html.parser")
contents = bsObj.find("div", {"class":"user_content"})
print(contents.text)

When I run this, I get exactly the result I want. For example:

Working area

Suwon, Gyeonggi Province

Salary Decision after interview

Selection procedures and submission documents

screening procedure Document screening-> Practical interview-> Executive interview

Submitted documents Korean resume and cover letter

Application period and method

Reception period ASAP

Resume form Company Form

Reception method Job application for people

Other precautions If false information is found in the job application document, the recruitment may be canceled even after the recruitment is confirmed

That is the value I want. The actual problem is the following code:

import urllib.request
import bs4

for x in informationurls:
    i=""
    html = urllib.request.urlopen(x)

    bsObj = bs4.BeautifulSoup(html, "html.parser")
    contents = bsObj.find("div", {"class":"user_content"})
    i+=contents.text
    print(i)

This is the part that comes after. The text should come out separately for each URL, but everything ends up in a single string (""), and that is too much to split apart afterwards. So I tried putting it in a list instead, but then each character was treated as a separate object and went into the list one at a time, which did not seem right either.
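The character-by-character result described above is most likely what `+=` does when the left side is a list and the right side is a string: the list is extended with each character, whereas `append` adds the whole string as one element. A minimal sketch, assuming two hypothetical strings stand in for the scraped page text:

```python
# Hypothetical stand-ins for the text scraped from two pages.
page_texts = ["Suwon", "Seoul"]

# `list += str` extends the list with each character of the string.
by_character = []
for text in page_texts:
    by_character += text

# `list.append(str)` adds the whole string as a single element.
by_page = []
for text in page_texts:
    by_page.append(text)

print(by_character)  # ['S', 'u', 'w', 'o', 'n', 'S', 'e', 'o', 'u', 'l']
print(by_page)       # ['Suwon', 'Seoul']
```

With `append`, each list element corresponds to one URL, so the entries stay separate.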

After that I need to save the results to Excel, one entry per URL, but I am at a loss. I would really appreciate some help.

python html crawler crawling

2022-09-21 23:16

1 Answer

Create a dictionary as shown below, using each URL as the key and the text extracted from its HTML as the value.

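import urllib.request
import bs4

def getContents(x):
    # Download the page and return the text of its "user_content" div.
    html = urllib.request.urlopen(x)
    bsObj = bs4.BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"class":"user_content"}).text

# Map each URL to its extracted text so the results stay separate per URL.
contentsHolder = {x:getContents(x) for x in informationurls}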
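For the Excel step mentioned in the question, one possible approach is to write the dictionary to a CSV file, which Excel can open. This is only a sketch using the standard `csv` module. The function name `save_contents_to_csv`, the file name `contents.csv`, and the sample URLs are assumptions made for illustration, and the sample dictionary stands in for `contentsHolder`.

```python
import csv

def save_contents_to_csv(contents_by_url, path):
    """Write one row per URL: the URL in column 1, its text in column 2."""
    # utf-8-sig adds a byte-order mark, which helps Excel detect UTF-8 text.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "content"])
        for url, text in contents_by_url.items():
            writer.writerow([url, text])

# Hypothetical sample data standing in for contentsHolder.
sample = {
    "https://example.com/job/1": "Working area\nSuwon, Gyeonggi Province",
    "https://example.com/job/2": "Salary\nDecision after interview",
}
save_contents_to_csv(sample, "contents.csv")
```

In real use, `save_contents_to_csv(contentsHolder, "contents.csv")` would replace the sample call. Multi-line text is kept inside a single cell because the `csv` writer quotes fields that contain line breaks.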


2022-09-21 23:16

© 2024 OneMinuteCode. All rights reserved.