I am trying to write code that downloads PDF files from Google Scholar using the download addresses found in the HTML. However, I ran into some problems during the download. First, here is the code I ran:
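import os
from urllib import request
from urllib.error import HTTPError

def get_download(url, fname, directory):
    try:
        os.chdir(directory)  # save into the target directory
        print(url)
        request.urlretrieve(url, fname)  # fetch url and write it to fname
        print('Download Complete')
    except HTTPError as e:
        print(e)  # report the HTTP error and give up on this file
        return None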
There are three cases when downloading files with request.urlretrieve:
The first is when the download succeeds.
The second is when an HTTP error occurs.
The third is when the download runs, but the resulting PDF file cannot be opened.
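The three cases can be told apart in code. The following is only a minimal sketch: the helper names `looks_like_pdf` and `download_and_classify` are hypothetical, and it assumes that a valid PDF starts with the `%PDF-` signature, which matches the body shown later in this thread.

```python
from urllib import request
from urllib.error import HTTPError

PDF_MAGIC = b"%PDF-"  # a PDF file normally begins with these bytes


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def download_and_classify(url, path):
    """Return 'ok', 'http_error', or 'not_pdf' for one download attempt.

    Hypothetical helper, not part of urllib.
    """
    try:
        request.urlretrieve(url, path)
    except HTTPError as e:
        print(e)  # case 2: the server answered with an error status
        return "http_error"
    # case 1 if the saved file is a PDF, case 3 if it is something else
    return "ok" if looks_like_pdf(path) else "not_pdf"
```

If the result is 'not_pdf', opening the saved file in a text editor may show what the server really sent, for example an HTML page.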
In the second case, for example,
url: http://www.academia.edu/download/35716149/leach.pdf
(Actual url:
When you click the PDF button on Google Scholar, the URL that actually opens is different from the URL written in the HTML file. I couldn't find a way to handle this, so for now I just catch the exception.
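One thing that may be worth trying for the second case is to send the request with a browser-like User-Agent header and to print the status code when it fails. This is only a sketch under an assumption: nothing in this thread confirms that the server treats urllib's default User-Agent differently from a browser. The helper names `build_browser_like_request` and `fetch` are hypothetical, and the User-Agent string is the one from the browser request shown later in this thread.

```python
import shutil
from urllib import request
from urllib.error import HTTPError

# User-Agent copied from the browser request quoted in this thread.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
)


def build_browser_like_request(url):
    """Build a Request that carries a browser-like User-Agent (assumption)."""
    return request.Request(url, headers={"User-Agent": BROWSER_UA})


def fetch(url, path):
    """Save `url` to `path`; return the final URL, or None on an HTTP error."""
    req = build_browser_like_request(url)
    try:
        with request.urlopen(req) as resp, open(path, "wb") as f:
            shutil.copyfileobj(resp, f)
            return resp.geturl()  # differs from `url` if the server redirected
    except HTTPError as e:
        print(e.code, e.reason)  # e.g. the status code and its reason phrase
        return None
```

The returned URL shows where the request ended up after any redirects, which can be compared with the address seen in the browser.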
In the third case,
url:
Maybe it is because the URL does not point directly at the PDF file, but when I download the file from this URL, the viewer says it is an unreadable file. I would like to know what I should study to solve this problem.
I can't tell what the problem is.
As you can see below, nothing in the exchange looks unusual.
request header
GET http://journals.sagepub.com/doi/pdf/10.1038/jcbfm.1993.48 HTTP/1.1
Host: journals.sagepub.com
Connection: keep-alive
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: ko,en-US;q=0.9,en;q=0.8,ja;q=0.7
Cookie: timezone=540; I2KBRCK=1; SERVER=WZ6myaEXBLGzzb+3qD0SOQ==; SERVER=WZ6myaEXBLGzzb+3qD0SOQ==; MAID=lPQzpL4VA36/hXTPEnQenQ==; MAID=lPQzpL4VA36/hXTPEnQenQ==; MACHINE_LAST_SEEN=2017-12-28T00%3A18%3A08.075-08%3A00; MACHINE_LAST_SEEN=2017-12-28T00%3A18%3A08.075-08%3A00; JSESSIONID=aaahTMzn7gPiC5lBZTUbw; JSESSIONID=aaahTMzn7gPiC5lBZTUbw; _ga=GA1.2.826585137.1514449083; _gid=GA1.2.907697027.1514449083
response header
HTTP/1.1 200 OK
Server: AtyponWS/7.1
Cache-Control: max-age=3600, private, must-revalidate
Pragma:
X-Webstats-RespID: ad685cea9103f80c95659addb3e4e5a1
Content-Disposition: inline; filename=jcbfm.1993.48.pdf
Set-Cookie: JSESSIONID=aaahTMzn7gPiC5lBZTUbw; domain=.journals.sagepub.com; path=/
Content-Type: application/pdf; charset=UTF-8
Date: Thu, 28 Dec 2017 08:35:26 GMT
Content-Length: 236399
Part of body
%PDF-1.5
%
1 0 obj
<</Subtype/XML/Type/Metadata/Length 3356>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04 " xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe Acrobat 9.32 Paper Capture Plug-in with ClearScan; modified using iText 4.2.0 by 1T3XT</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:ModifyDate>2017-12-28T00:35:26-08:00</xmp:ModifyDate>
<xmp:CreateDate>2010-08-07T16:12:40+05:30</xmp:CreateDate>
<xmp:MetadataDate>2017-12-28T00:35:26-08:00</xmp:MetadataDate>
<xmp:CreatorTool>Acrobat 5.0 Image Conversion Plug-in for Windows</xmp:CreatorTool>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:d5bdf982-82e0-4b0a-b3a8-da3bc7602a85</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:896cd839-1aab-41fb-a5f6-6cd7d16f5e06</xmpMM:InstanceID>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
HTTP is a protocol, so the exchange simply follows the protocol.
Judging from the header information above, there is no problem.
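The same header check can be done in code rather than by eye. This is a minimal sketch using the standard library's `email.message.Message` to parse the header values. The helper name `summarize_pdf_headers` is hypothetical, and the sample values are copied from the response header above.

```python
from email.message import Message


def summarize_pdf_headers(headers):
    """Summarize the headers that matter for a PDF download.

    `headers` is any mapping of header name to value (hypothetical helper).
    """
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    length = msg.get("Content-Length")
    return {
        # get_content_type() drops parameters such as "; charset=UTF-8"
        "is_pdf": msg.get_content_type() == "application/pdf",
        # get_filename() reads the filename parameter of Content-Disposition
        "filename": msg.get_filename(),
        "length": int(length) if length is not None else None,
    }


# Values copied from the response header above.
print(summarize_pdf_headers({
    "Content-Type": "application/pdf; charset=UTF-8",
    "Content-Disposition": "inline; filename=jcbfm.1993.48.pdf",
    "Content-Length": "236399",
}))
```

A response whose Content-Type is text/html would come back with "is_pdf" set to False, which is one way the third case in the question could show up.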
In fact, running the code below downloads a normal PDF file.
import requests
import shutil

res = requests.get('http://journals.sagepub.com/doi/pdf/10.1038/jcbfm.1993.48', stream=True)
with open('/home/allinux/abcd.pdf', 'wb') as f:
    res.raw.decode_content = True  # undo gzip/deflate encoding while reading
    shutil.copyfileobj(res.raw, f)  # stream the response body straight to disk