When scraping the web with the requests module, is it meaningful to manipulate the header?

Asked 1 years ago, Updated 1 years ago, 108 views

I want to send and receive data through Python's requests module.

If you use the request module, it is revealed that it is a scripted connection to a user-agent or a header like this, but if you arbitrarily modify the host, user-agent, origin, referrer, etc.,

Can I hide it like it's a normal web browser request?

Especially, host, origin, and referrer seem to come out only when you enter the menu in the browser.Is that correct?

python web-crawling http requests

2022-09-22 18:16

1 Answers

Tell the server which browser to use. In Python's requests module, the default values are as follows:

python-requests/2.22.0

Anyone can tell you're connecting to Python. Some servers may block this form of user agent string if it is detected. If you replace this user agent string with a Chrome browser, for example, it looks like this.

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36

In this case, the user agent string alone cannot distinguish whether the server is a Chrome browser or a Python requests module.

The domain name of the server. For example, if you send an http request to this site, it will be set as follows:

Host: hashcode.co.kr

An error occurs if this header is missing, multiple, or set to an invalid value.

Indicates from which address the request came from when sending a POST request. If the address you receive the request with origin is different, you may have a problem. Anyway, it's one of the two things that works the same as when you don't set it up, or there's an error, so there's no reason to change it.

Header that tells you which page you have passed and reached. You can use it if you want to fool the server, but user-agent would normally be enough to change.

In addition to what we've covered here, http has a variety of headers. It's good to look up each header or post a question, but the best way is to try it yourself. Please change the header and send the request. You can use the following code.

import requests

headers = requests.utils.default_headers()
headers.update ({'Header Name to Replace': 'Changed Header Value'})

requests.get('url', headers=headers)
requests.post('url', ..., heaers=headers)
# And so on...


2022-09-22 18:16

If you have any answers or tips


© 2024 OneMinuteCode. All rights reserved.