There is a site with data on the star balloons received by AfreecaTV BJs (http://poong.today/chart/day).
I want to crawl it. If you press F12 on the site to open the developer tools, as shown in the picture below,
and look at the "day" request, you can see the data by ranking.
I wanted to fetch this data, so I sent the form data in a POST request as below.
import requests
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
url = 'http://poong.today/chart/day'
param = {
"year" : '2020',
"month" : '05',
"day" : '20',
"ks" : 'false'
}
req = requests.post(url,data=param,headers=header)
print(req.text)
When I print the response, it does not contain the data I want. Instead, the following is printed.
<!doctype html>
<html lang="en">
<head>
<title>Page Expired</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!-- Fonts -->
<link href="https://fonts.googleapis.com/css?family=Nunito" rel="stylesheet" type="text/css">
(The rest is omitted because it is long)
I would appreciate advice on what is wrong and what I need to do to get the data I want.
Below is a picture of the form data
import requests
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'X-CSRF-TOKEN': '# Deliberately hidden. Use the value copied from the browser dev tools',
'X-Requested-With': 'XMLHttpRequest',
'Cookie': '# Deliberately hidden. Use the value copied from the browser dev tools'
}
url = 'http://poong.today/chart/day'
param = {
"year" : '2020',
"month" : '05',
"day" : '24',
"ks" : 'false'
}
req = requests.post(url,data=param,headers=header)
print(req.text)
The site seems to have added some mechanisms to prevent unauthorized crawling.
You need to add more information to the request headers. Specifically, the code above adds Content-Type, X-CSRF-TOKEN, X-Requested-With, and Cookie.
After adding these four, I got the following response:
[{"i":"skaosdk7","n":"BJ\ud558\ub298\uc774\u2665","b":23357,"r":1,"h":[3268,3097,9989,2950,4053,"","","","","","","","","","","","","","","","","","",""],"c":208,"v":131},{"i":"gksk4998","n":"\uc140\ub9ac+\u2665","b":22718,"r":2,"h":[1374,1534,1758,5558,408,4858,7228]] ... (omitted)
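The response is JSON, so it can be decoded with the standard json module instead of being read as raw text. The sketch below parses a sample built from the first record of that response, with the run of empty "h" slots shortened. The site does not document what the keys mean, so the output names "rank", "total" and "hourly" are guesses: "r" counts up 1, 2, ... and the five numbers in "h" add up to "b" (23357), which suggests a per-hour breakdown of a daily total.

```python
import json

# First record of the response, with the empty "h" slots shortened.
# The raw string keeps the \uXXXX escapes exactly as the server sent them.
sample = r'[{"i":"skaosdk7","n":"BJ\ud558\ub298\uc774\u2665","b":23357,"r":1,"h":[3268,3097,9989,2950,4053,"",""],"c":208,"v":131}]'


def parse_chart(text):
    """Decode the JSON body and return one simplified dict per entry."""
    rows = []
    for entry in json.loads(text):
        # Empty strings in "h" appear to be slots with no data; keep numbers only.
        filled = [value for value in entry["h"] if value != ""]
        rows.append({
            "id": entry["i"],
            "name": entry["n"],    # json.loads turns the \uXXXX escapes into Korean text
            "total": entry["b"],   # assumed meaning, see the note above
            "rank": entry["r"],    # assumed meaning
            "hourly": filled,      # assumed meaning
        })
    return rows


rows = parse_chart(sample)
print(rows[0]["rank"], rows[0]["name"], rows[0]["total"])
```

If the token or cookie has expired, the body will be the "Page Expired" HTML rather than JSON, and json.loads will raise json.JSONDecodeError. Catching that error is a simple way to notice that the values need refreshing.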
However, the X-CSRF-TOKEN and cookie values are presumably generated when the site is accessed with a browser. If your crawler needs to run periodically, you will have to combine Selenium with headless Chrome so that it looks as if a real user accessed the site through a browser.
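Before reaching for Selenium, it may be worth trying a lighter approach with requests.Session, which is a different technique from the one suggested above. The sketch below loads the page with a GET so the session receives fresh cookies, reads the token out of the HTML, and then sends the POST. It assumes the token is exposed in a `<meta name="csrf-token" content="...">` tag with the attributes in that order. That is a common convention but has not been confirmed for this site. The names extract_csrf_token and fetch_chart are made up for this example.

```python
import re

CHART_URL = "http://poong.today/chart/day"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36")

# Assumption: the page carries the token in a meta tag of this exact shape.
_TOKEN_RE = re.compile(r'<meta\s+name="csrf-token"\s+content="([^"]+)"', re.IGNORECASE)


def extract_csrf_token(html):
    """Return the token found in the page HTML, or None if the tag is absent."""
    match = _TOKEN_RE.search(html)
    return match.group(1) if match else None


def fetch_chart(year, month, day):
    """GET the page for a fresh token and cookies, then POST the form data."""
    import requests  # third-party; imported here so the helper above works without it

    with requests.Session() as session:  # the session carries cookies between calls
        session.headers["User-Agent"] = USER_AGENT
        page = session.get(CHART_URL, timeout=10)
        page.raise_for_status()
        token = extract_csrf_token(page.text)
        if token is None:
            raise RuntimeError("no csrf-token meta tag found; the assumption does not hold")
        response = session.post(
            CHART_URL,
            data={"year": year, "month": month, "day": day, "ks": "false"},
            headers={"X-CSRF-TOKEN": token, "X-Requested-With": "XMLHttpRequest"},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()
```

Calling fetch_chart("2020", "05", "24") would then correspond to the request above, without pasting values from the dev tools. If the site sets the token or cookies with JavaScript, this will not work and a real browser driven by Selenium is still needed.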
If the site developer has put up blocks like this, it means they do not want abnormal behavior such as crawling to interfere with the operation of the site. All of that traffic is ultimately a cost borne by the site administrator. You have to decide for yourself whether it is acceptable to keep crawling such a site.