How to download a PDF file from a web server with Python

I'm trying to write code that downloads PDF files from Google Academic Information magazine, using the download address given in the HTML. However, I ran into some problems during the download. First of all, here is the code I ran:

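import os
from urllib import request
from urllib.error import HTTPError

def get_download(url, fname, directory):
    try:
        # Save into the target directory.
        os.chdir(directory)
        print(url)
        # urlretrieve writes whatever the server returns, PDF or not.
        request.urlretrieve(url, fname)
        print('Download Complete')
    except HTTPError as e:
        # Raised when the server answers with an HTTP error status.
        print(e)
        return None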

There are three cases when downloading files with request.urlretrieve.

The first is when the download succeeds.

The second is when an HTTP error occurs.

The third is when the download itself completes, but the saved file is not a usable PDF.

In the second case,

For example, url:http://www.academia.edu/download/35716149/leach.pdf

(Actual url:https://s3.amazonaws.com/academia.edu.documents/35716149/leach.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1514444614&Signature=A6SdIuGn4hxxEcZjQTWsZmxg%2Fx0%3D&response-content-disposition=inline%3B%20filename%3DEnergy-Efficient_Communication_Protocol.pdf)

The URL written in the HTML file differs from the one that actually opens when you click the PDF button on Google Academic Information magazine. I couldn't find a way to handle this case, so for now I just catch the exception.
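One thing that may be worth trying for the second case is sending a browser-like User-Agent, since some servers reject the default one that urllib sends. Whether that is the cause here is only an assumption, because the error code is not shown in the question. In this minimal sketch the helper name build_request is hypothetical, and the User-Agent string is the one from the browser request quoted in the first answer.

```python
from urllib import request

# User-Agent copied from the browser request quoted in the first answer.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request carrying a browser-like User-Agent (hypothetical helper)."""
    return request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.academia.edu/download/35716149/leach.pdf")
# urllib stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))
```

The request object can then be passed to request.urlopen(req). urlopen follows redirects by default, and geturl() on the response reports the address that was finally reached, which helps when the URL in the HTML differs from the actual one.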

In the third case,

url:This is the case for such as

It is probably because the URL does not point directly at the PDF file: the download itself runs, but the saved file is reported as unreadable when I open it.
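To tell the third case apart from a real success, one option is to look at the first bytes of the saved file. A PDF starts with the signature %PDF-, while a landing or login page usually starts with HTML markup. This is a rough sketch under that assumption; the function name sniff_download is hypothetical.

```python
def sniff_download(path):
    """Classify a saved file by its first bytes: 'pdf', 'html', or 'unknown'."""
    with open(path, "rb") as f:
        # Ignore leading whitespace and letter case before comparing.
        head = f.read(1024).lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

If sniff_download(fname) returns 'html' after a download, the server most likely sent a web page instead of the document, so the page itself has to be inspected for the real PDF link.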

I want to know what part I should study to solve this problem.

python urllib

2022-09-22 08:32

2 Answers

I can't tell what the problem is.

As the request and response below show, nothing looks unusual.

request header

GET http://journals.sagepub.com/doi/pdf/10.1038/jcbfm.1993.48 HTTP/1.1
Host: journals.sagepub.com
Connection: keep-alive
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: ko,en-US;q=0.9,en;q=0.8,ja;q=0.7
Cookie: timezone=540; I2KBRCK=1; SERVER=WZ6myaEXBLGzzb+3qD0SOQ==; SERVER=WZ6myaEXBLGzzb+3qD0SOQ==; MAID=lPQzpL4VA36/hXTPEnQenQ==; MAID=lPQzpL4VA36/hXTPEnQenQ==; MACHINE_LAST_SEEN=2017-12-28T00%3A18%3A08.075-08%3A00; MACHINE_LAST_SEEN=2017-12-28T00%3A18%3A08.075-08%3A00; JSESSIONID=aaahTMzn7gPiC5lBZTUbw; JSESSIONID=aaahTMzn7gPiC5lBZTUbw; _ga=GA1.2.826585137.1514449083; _gid=GA1.2.907697027.1514449083

response header

HTTP/1.1 200 OK
Server: AtyponWS/7.1
Cache-Control: max-age=3600, private, must-revalidate
Pragma: 
X-Webstats-RespID: ad685cea9103f80c95659addb3e4e5a1
Content-Disposition: inline; filename=jcbfm.1993.48.pdf
Set-Cookie: JSESSIONID=aaahTMzn7gPiC5lBZTUbw; domain=.journals.sagepub.com; path=/
Content-Type: application/pdf; charset=UTF-8
Date: Thu, 28 Dec 2017 08:35:26 GMT
Content-Length: 236399
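The two response headers that say most about the payload are Content-Type and Content-Disposition. As a small sketch, they can be parsed with the standard library's email.message.Message; the helper name describe_response is hypothetical, and the values are the ones from the response header above.

```python
from email.message import Message


def describe_response(content_type, content_disposition):
    """Return (media_type, filename) parsed from two raw header values."""
    msg = Message()
    msg["Content-Type"] = content_type
    msg["Content-Disposition"] = content_disposition
    return msg.get_content_type(), msg.get_filename()


print(describe_response(
    "application/pdf; charset=UTF-8",
    "inline; filename=jcbfm.1993.48.pdf",
))
# prints ('application/pdf', 'jcbfm.1993.48.pdf')
```

A media type other than application/pdf would be a hint that the URL returned something else, such as an HTML page.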

Part of body

%PDF-1.5
%    
1 0 obj
<</Subtype/XML/Type/Metadata/Length 3356>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04        " xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
     <pdf:Producer>Adobe Acrobat 9.32 Paper Capture Plug-in with ClearScan; modified using iText 4.2.0 by 1T3XT</pdf:Producer>
  </rdf:Description>
  <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
     <xmp:ModifyDate>2017-12-28T00:35:26-08:00</xmp:ModifyDate>
     <xmp:CreateDate>2010-08-07T16:12:40+05:30</xmp:CreateDate>
     <xmp:MetadataDate>2017-12-28T00:35:26-08:00</xmp:MetadataDate>
     <xmp:CreatorTool>Acrobat 5.0 Image Conversion Plug-in for Windows</xmp:CreatorTool>
  </rdf:Description>
  <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
     <xmpMM:DocumentID>uuid:d5bdf982-82e0-4b0a-b3a8-da3bc7602a85</xmpMM:DocumentID>
     <xmpMM:InstanceID>uuid:896cd839-1aab-41fb-a5f6-6cd7d16f5e06</xmpMM:InstanceID>
  </rdf:Description>
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:format>application/pdf</dc:format>
  </rdf:Description>
</rdf:RDF>
</x:xmpmeta>

2022-09-22 08:32

HTTP is a protocol.

That is, the server responds according to that protocol.

The header information shows no problem.

In fact, running the code below downloads a normal PDF file.

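import requests
import shutil

# stream=True defers the body so it can be copied to disk in chunks.
res = requests.get('http://journals.sagepub.com/doi/pdf/10.1038/jcbfm.1993.48', stream=True)

with open('/home/allinux/abcd.pdf', 'wb') as f:
    # Let urllib3 undo gzip/deflate so the bytes written are the actual PDF.
    res.raw.decode_content = True
    shutil.copyfileobj(res.raw, f)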
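Since the question is tagged urllib, the same streaming copy can probably be written with the standard library alone. This is only a sketch, not a drop-in replacement for the code above; the function name download_to is hypothetical.

```python
import shutil
from urllib import request


def download_to(url, dest_path):
    """Stream the response body for url into dest_path and return dest_path."""
    # urlopen raises urllib.error.HTTPError when the server answers with an error status.
    with request.urlopen(url) as res, open(dest_path, "wb") as f:
        shutil.copyfileobj(res, f)
    return dest_path
```

A call such as download_to(url, 'abcd.pdf') writes the body to that file; wrapping it in try/except HTTPError keeps the error handling from the question.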

2022-09-22 08:32
