Python Crawling Questions

Asked 2 years ago, Updated 2 years ago, 47 views

There is a site (http://poong.today/chart/day) that shows the star balloons received by AfreecaTV BJs.

I want to crawl it. Pressing F12 on the site opens the developer tools, as shown in the picture below.

If you look at the day request there, the data is listed by ranking.

I wanted to fetch this, so I sent the form data in a POST request, as below.

import requests

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'} 
url = 'http://poong.today/chart/day'

param = {
    "year" : '2020',
    "month" : '05',
    "day" : '20',
    "ks" : 'false'
}

req = requests.post(url,data=param,headers=header)

print(req.text)

When I print the response, it does not contain the data I want. Instead, I get the following.

<!doctype html>
<html lang="en">
    <head>
        <title>Page Expired</title>

        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

        <!-- Fonts -->
        <link href="https://fonts.googleapis.com/css?family=Nunito" rel="stylesheet" type="text/css">

 (The rest is omitted because it is long)

Could someone advise me on what is wrong and what I need to do to get the data I am requesting?

Below is a picture of the form data

python crawling

2022-09-20 22:23

1 Answer

import requests

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
          'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
          'X-CSRF-TOKEN': '# Deliberately redacted. Use the value copied from DevTools',
          'X-Requested-With': 'XMLHttpRequest',
          'Cookie': '# Deliberately redacted. Use the value copied from DevTools'
}
url = 'http://poong.today/chart/day'

param = {
    "year" : '2020',
    "month" : '05',
    "day" : '24',
    "ks" : 'false'
}

req = requests.post(url,data=param,headers=header)

print(req.text) 

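The site seems to have added some measures to prevent unauthorized crawling.

We need to add more information to the header. Specifically, the code above adds four headers that the original request did not send: Content-Type, X-CSRF-TOKEN, X-Requested-With, and Cookie.

After adding these four, I got the following response: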

[{"i":"skaosdk7","n":"BJ\ud558\ub298\uc774\u2665","b":23357,"r":1,"h":[3268,3097,9989,2950,4053,"","","","","","","","","","","","","","","","","","",""],"c":208,"v":131},{"i":"gksk4998","n":"\uc140\ub9ac+\u2665","b":22718,"r":2,"h":[1374,1534,1758,5558,408,4858,7228]] ... (rest omitted)
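A response of this shape can be decoded with the standard json module, which also turns the \uXXXX escapes back into readable characters. The sketch below works on an abbreviated copy of the first entry (the "h" list is shortened). The field meanings in the comments are guesses from the values, not something the site documents; the only thing that can be checked from the sample itself is that the numbers in "h" add up to "b".

```python
import json

# Abbreviated copy of the first entry in the response above.
# The "h" list is shortened; the real one has more empty slots.
SAMPLE = (
    r'[{"i":"skaosdk7","n":"BJ\ud558\ub298\uc774\u2665","b":23357,"r":1,'
    r'"h":[3268,3097,9989,2950,4053,"",""],"c":208,"v":131}]'
)


def parse_chart(text):
    """Decode the JSON text into a list of simplified dicts."""
    rows = []
    for entry in json.loads(text):
        # Empty strings in "h" mark slots without data; keep only the numbers.
        filled = [x for x in entry["h"] if isinstance(x, int)]
        rows.append({
            "rank": entry["r"],    # assumed meaning: ranking
            "id": entry["i"],      # assumed meaning: broadcaster ID
            "name": entry["n"],    # assumed meaning: display name
            "total": entry["b"],   # assumed meaning: star balloon total
            "filled_sum": sum(filled),
        })
    return rows


rows = parse_chart(SAMPLE)
for row in rows:
    print(row["rank"], row["id"], row["name"], row["total"])
```

With the real response you would pass req.text to parse_chart, or call req.json() and loop over the result directly.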

Note that the X-CSRF-TOKEN and cookie values appear to be generated when the site is accessed with a browser, so hard-coded values will eventually stop working. If the crawler has to run periodically, you would need to combine Selenium with headless Chrome so that the requests look as if a real user were visiting through a browser.
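Before reaching for Selenium, a lighter alternative may be worth trying: let a requests.Session load the page first, so it collects the cookies, read the token out of the HTML, and send both with the POST. This is only a sketch under assumptions that have not been checked against the site: that the token is embedded in a meta tag named csrf-token (a common convention in Laravel, whose default "Page Expired" page resembles the one in the question), and that no JavaScript is needed to obtain it. The function names are made up for illustration.

```python
from html.parser import HTMLParser


class CsrfMetaParser(HTMLParser):
    """Remembers the content of <meta name="csrf-token" ...> if present."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'meta' and attributes.get('name') == 'csrf-token':
            self.token = attributes.get('content')


def extract_csrf_token(html):
    """Return the token found in the page, or None if there is none."""
    parser = CsrfMetaParser()
    parser.feed(html)
    return parser.token


def fetch_day_chart(year, month, day):
    """GET the page for cookies and a token, then POST the form data."""
    # Third-party package; imported here so the parsing helper above
    # can be used without it.
    import requests

    url = 'http://poong.today/chart/day'
    with requests.Session() as session:
        session.headers['User-Agent'] = 'Mozilla/5.0'
        page = session.get(url, timeout=10)  # the session keeps the cookies
        token = extract_csrf_token(page.text)
        if token is None:
            # The meta-tag assumption does not hold for this page.
            raise RuntimeError('no csrf-token meta tag found')
        response = session.post(
            url,
            data={'year': year, 'month': month, 'day': day, 'ks': 'false'},
            headers={'X-CSRF-TOKEN': token,
                     'X-Requested-With': 'XMLHttpRequest'},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()
```

Calling fetch_day_chart('2020', '05', '24') would then replace the hard-coded token and cookie. If the token turns out to be produced by JavaScript instead, this approach will not work and a real browser is needed after all.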

If the site's developer has put blocks like this in place, it means they do not want abnormal traffic such as crawling to interfere with the operation of the site. All of that traffic is ultimately a cost borne by the site administrator. You have to decide for yourself whether it is acceptable to keep crawling a site like this.


2022-09-20 22:23
