Crawling fails when reading URLs from a text file in Python

Asked 2 years ago, Updated 2 years ago, 55 views

I wrote code that stores Naver ratings and one-line reviews with Python. Strangely, only the URLs on the first and last lines of the text file seem to be processed normally. What could the problem be? I'm using Python for the first time, so please point out what I did wrong.

from urllib.error import HTTPError

import requests
import time
from bs4 import BeautifulSoup

def getData(url):

    req = requests.get(url)
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')

    sentence = soup.select('div.reporter_line > dl > dd')
    score = soup.select('div.re_score_grp > div > div > em')

    for score_value, value in zip(score, sentence):
        print(value.text, score_value.text)

    small_sc = soup.select('li > div.star_score > em')
    small_se = soup.select('div.score_reple > p')

    for score_value, value in zip(small_sc, small_se):
        print(value.text, score_value.text)


file = open('url.txt','r')  # url.txt contains the URLs to crawl, one per line

for url in file:
    print(url)
    getData(url)


Execution Results...

https://movie.na...//url

The best masterpiece of the year, 9.75
Such a simple and heavy impression 9
It's frighteningly wonderful and beautiful 9.5
I'll bet you, IMAX, 8.75 in 3D
The advent of real space movies that will make Earth movies a thing of the past 7
Women in space struggle for existence, even if it's simple to dry, loneliness is the power! 7
Some movies are experienced, not watched. It's amazing. 10
Bluffing is appropriate. Space Circus 8
A life of re-living with overwhelming visuals and urgent distress plays! 9
https://movie.na...//url 

(Output like the block above should appear here, but nothing is printed.)

https://movie.na..//url

.

.

.
This is the last line.

https://movie.na..//url


A fun, three-dimensional
 7.75


[Kung Fu Panda] Wow, a double carriage
 7.75
If I were you, I'd ride Toothless instead of Ikran. 8
Ahhhhhhhhhhhhhh let me fly more 8
People who have cats can't help but go wild 7
DreamWorks' brilliant performance
OMG, I never thought I'd fall in love with a dragon
Let the whole family fly. 7
DreamWorks is making a piece of art, too
<There are more exciting 3D than Avatar> 7
I mean, sometimes there's a movie that you have to watch in 3D. 8

python crawling

2022-09-22 19:46

1 Answer

I switched the library from requests to urllib.request (from urllib.request import urlopen) and the problem was resolved. I tried it just in case, and fortunately it works well. The environment was Python 3.7, run in PyCharm. Have a great day!

from urllib.request import urlopen
#import requests
from bs4 import BeautifulSoup

def getData(url):

    webpage = urlopen(url)
    soup = BeautifulSoup(webpage, 'html.parser')
    file = open('data.txt','a')  # append mode, so results from every URL accumulate

    # first group: reviews under div.reporter_line and their scores
    sentence = soup.select('div.reporter_line > dl > dd')
    score = soup.select('div.re_score_grp > div > div > em')

    for score_value, value in zip(score, sentence):
        print(value.text.strip(), score_value.text.strip())
        # commas inside the review are replaced so each line stays "review,score"
        file.write(value.text.strip().replace(',',' ') +','+ score_value.text.strip() + '\n')

    # second group: reviews under div.score_reple and their scores
    small_sc = soup.select('li > div.star_score > em')
    small_se = soup.select('div.score_reple > p')

    for score_value, value in zip(small_sc, small_se):
        print(value.text.strip(), score_value.text.strip())
        file.write(value.text.strip().replace(',',' ') +','+ score_value.text.strip()+ '\n')

    file.close()

file = open('url.txt','r')

for url in file:
    print(url)
    getData(url)
file.close()
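
A possible explanation, which the thread itself does not confirm: iterating over a file object yields each line with its trailing newline still attached, so every URL except the last one would reach requests.get with '\n' on the end. That would fit the last line working, and urlopen may tolerate it because urllib appears to strip surrounding whitespace from the URL. The sketch below only shows the line-reading part; read_urls is a hypothetical helper and the example.com URLs are placeholders, not the ones from the question.

```python
import tempfile
from pathlib import Path


def read_urls(path):
    """Return the non-empty lines of a text file, with surrounding whitespace removed."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Build a throwaway url.txt with three placeholder URLs (no newline after the last one).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "url.txt"
    sample.write_text(
        "https://example.com/a\nhttps://example.com/b\nhttps://example.com/c",
        encoding="utf-8",
    )

    # What a plain "for url in file" loop sees: the newline is part of each line.
    with open(sample, "r", encoding="utf-8") as f:
        raw_lines = list(f)

    # What the helper returns: clean URLs that are safe to pass to a request.
    cleaned = read_urls(sample)

print([repr(u) for u in raw_lines])
print(cleaned)
```

If this is the cause, the original requests version should also work when the loop calls getData(url.strip()) instead of getData(url).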


2022-09-22 19:46
