Python code to automatically download files from the download links (URLs) recorded in a CSV file

Asked 1 year ago, Updated 1 year ago, 114 views

I currently have a CSV file with download links recorded as URLs. I want to write code that reads the URLs from the file and automatically downloads the linked files into a folder. I don't know how to do this, and the few things I've tried all end in errors, so I'm completely stuck.

import urllib.request

url='http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001212856&fileDetailSn=1&publicDataDetailPk=uddi:07b44140-4ded-40e6-946e-c03b317b833e'

urllib.request.urlretrieve(url,1)

This first attempt was just to see whether the urllib.request library worked at all.

OSError: [WinError 6] Handle is invalid

This error occurs.
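
As far as I can tell, the second argument of urlretrieve is supposed to be the destination file name; passing the integer 1 makes Python treat it as file descriptor 1 (stdout), which is probably what triggers the invalid-handle error. Something like the following is presumably the intended usage (the file name here is just an arbitrary example):

import urllib.request

url = 'http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001212856&fileDetailSn=1&publicDataDetailPk=uddi:07b44140-4ded-40e6-946e-c03b317b833e'

# The second argument must be a destination path, not an integer;
# 'downloaded_file.csv' is only an example name.
urllib.request.urlretrieve(url, 'downloaded_file.csv')

But this still leaves me choosing the file name by hand.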

import requests
url='http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001375425&fileDetailSn=1&publicDataDetailPk=uddi:fff6f608-f3b8-464f-be97-d58c4944e477'
r=requests.get(url,allow_redirects=True)
open('urldata.csv','wb').write(r.content)
r = requests.get(url, allow_redirects=True)
print (r.headers.get('content-type'))

There was also a way to use the requests library, but it seemed that I had to enter the file name and URL one by one.

I want to write code that reads the CSV file containing the URLs from Python (with open(), 'r', and so on) and then downloads each recorded URL automatically. Any help would be much appreciated. Here is the snippet I have so far:

import cgi
import requests

SAVE_DIR = 'C:/'  # destination folder

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        content_disposition = r.headers.get('content-disposition')
        if content_disposition is not None:
            targetFileName = requests.utils.unquote(cgi.parse_header(content_disposition)[1]['filename'])
            with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024):
                    f.write(chunk)
            return targetFileName
        else:
            print('url {} had no content-disposition header'.format(url))
    elif r.status_code == 404:
        print('{} returned a 404, no file was downloaded'.format(url))
    else:
        print('something else went wrong with {}'.format(url))
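
What I am aiming for is roughly a driver loop like this, assuming the CSV simply has one URL per line ('urls.csv' is a placeholder path):

with open('urls.csv', encoding='utf-8') as f:  # placeholder path: one URL per line
    for line in f:
        url = line.strip()
        if url:  # skip blank lines
            downloadURLResource(url)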

python urllib url csv

2022-09-22 08:36

5 Answers

It's lunch time... I coded this up quickly, so please treat it just as a reference for learning.

import cgi
import requests


SAVE_DIR = 'C:/'

def downloadURLResource(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName


with open('h:/url.csv') as f:  # url.csv is a file containing the URL list, one per line
    print(list(map(downloadURLResource, (line.rstrip() for line in f))))  # strip trailing newlines before downloading
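
Side note: cgi.parse_header comes from the cgi module, which was deprecated in Python 3.11 and removed in 3.13. If it is not available, a rough sketch of extracting the filename with the standard email.message API instead would be:

from email.message import Message
from urllib.parse import unquote

def filename_from_header(content_disposition):
    # Feed the raw header value into an email Message so that its
    # built-in parameter parsing can pull out the filename for us.
    msg = Message()
    msg['content-disposition'] = content_disposition
    return unquote(msg.get_filename())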


2022-09-22 08:36

This isn't really a question so much as a wish.

First, read the CSV file, then parse out the URL from each line of what you read. If you don't know how to do either of those, you need to study how to handle file I/O in Python.
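
In other words, a minimal sketch of those two steps (assuming one URL per line; 'urls.csv' is a placeholder path):

with open('urls.csv', encoding='utf-8') as f:             # step 1: read the file
    urls = [line.strip() for line in f if line.strip()]   # step 2: one URL per line, newlines stripped

for url in urls:
    print(url)  # each URL can now be handed to a download function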

The response headers for http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001375425&fileDetailSn=1&publicDataDetailPk=uddi:fff6f608-f3b8-464f-be97-d58c4944e477 are as follows.

HTTP/1.1 200 OK
Date: Wed, 19 Jul 2017 02:57:47 GMT
Server: Apache
Content-Disposition: attachment;filename=%EC%B6%A9%EC%B2%AD%EB%B6%81%EB%8F%84+%EC%A6%9D%ED%8F%89%EA%B5%B0_%EB%AF%BC%EB%B0%95%ED%8E%9C%EC%85%98%EC%97%85%EC%86%8C_20170511.csv
Content-Language: ko
Content-Length: 378
Keep-Alive: timeout=5, max=10000
Connection: Keep-Alive
Content-Type: application/octet;charset=utf-8

Look at the Content-Disposition header to get the file name.
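
For example, decoding the file name from the header value shown above (using cgi.parse_header as in the other answers; note that the cgi module is only available in Python versions before 3.13):

import cgi
from urllib.parse import unquote

header = 'attachment;filename=%EC%B6%A9%EC%B2%AD%EB%B6%81%EB%8F%84+%EC%A6%9D%ED%8F%89%EA%B5%B0_%EB%AF%BC%EB%B0%95%ED%8E%9C%EC%85%98%EC%97%85%EC%86%8C_20170511.csv'

# parse_header splits the value into ('attachment', {'filename': ...});
# unquote then decodes the percent-encoded UTF-8 bytes into readable text.
value, params = cgi.parse_header(header)
print(unquote(params['filename']))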

Please study this, and ask again about the specific parts you don't understand.


2022-09-22 08:36

http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001210727&fileDetailSn=1&publicDataDetailPk=uddi:4cf4dc4c-e0e9-4aee-929e-b2a0431bf03e

The response from the above address is as follows.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Insert title here</title>
</head>
<body>

    <input type="hidden" name="exception" value="The requested file was not found" style="display:none"/>

</body>
</html>

As you can see, the URL is most likely invalid (the requested file was not found).
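
One way to guard against such URLs before writing anything to disk (a sketch along the same lines as the other answers) is to require the Content-Disposition header:

import requests

url = 'http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001210727&fileDetailSn=1&publicDataDetailPk=uddi:4cf4dc4c-e0e9-4aee-929e-b2a0431bf03e'

r = requests.get(url, stream=True)
# A real file download comes back with a Content-Disposition header;
# the error page above is plain HTML without one, so it can be skipped.
if r.status_code == 200 and 'content-disposition' in r.headers:
    print('looks like a downloadable file')
else:
    print('probably an invalid URL or an error page:', url)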


2022-09-22 08:36

import cgi
import requests


TARGET_DIR = 'C:/'

def downloadURLResource(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200 and 'content-disposition' in r.headers:  # status is OK and the content-disposition header exists
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(TARGET_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return True  # content-disposition existed, so the download succeeded
    else:
        return False  # content-disposition missing, 404, or some other error


with open('h:/urlcsv.csv') as f:
    failItems = filter(lambda i: i[1] == False, {url.rstrip(): downloadURLResource(url.rstrip()) for url in f.readlines()}.items())  # build a url -> result dict and keep only the failed ones
    list(map(print, failItems)) #Output

Test it out.
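
If the dict-comprehension-plus-filter line is hard to read, the same idea spelled out as a plain loop would be roughly:

with open('h:/urlcsv.csv') as f:
    failed = []
    for line in f:
        url = line.rstrip()
        if url and not downloadURLResource(url):
            failed.append(url)  # collect URLs whose download returned False
    for url in failed:
        print(url)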


2022-09-22 08:36

Add logging as shown below.

That way you can see exactly where the problem occurred.

import cgi
import requests
import logging
logging.basicConfig(level=logging.DEBUG)


TARGET_DIR = 'C:/'

def downloadURLResource(index, url):
    logging.debug('index {}'.format(index))
    r = requests.get(url, stream=True)
    if r.status_code == 200 and 'content-disposition' in r.headers:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        logging.debug('index {} File downloading {}. '.format(index, targetFileName))
        with open("{}/{}".format(TARGET_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return True
    else:
        logging.debug('index {} URL: {} failed.'.format(index, url))
        return False


with open('h:/data.csv') as f:
    failItems = filter(lambda i:i[1] == False, {url.rstrip():downloadURLResource(index, url.rstrip()) for index, url in enumerate(f)}.items())
    list(map(print, failItems))


DEBUG:root:index 0
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.data.go.kr
DEBUG:requests.packages.urllib3.connectionpool:http://www.data.go.kr:80 "GET /dataset/fileDownload.do?atchFileId=FILE_000000001212856&fileDetailSn=1&publicDataDetailPk=uddi:07b44140-4ded-40e6-946e-c03b317b833e HTTP/1.1" 200 10139648
DEBUG:root:index 0 File downloading December_Industrial Trends.hwp. 
DEBUG:root:index 1
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.data.go.kr
DEBUG:requests.packages.urllib3.connectionpool:http://www.data.go.kr:80 "GET /dataset/fileDownload.do?atchFileId=FILE_000000001215215&fileDetailSn=1&publicDataDetailPk=uddi:07c50b21-6f66-4943-b405-8fdf4adec661 HTTP/1.1" 200 31232
DEBUG:root:index 1 File downloading 2013-09.xls. 
DEBUG:root:index 2
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.data.go.kr
DEBUG:requests.packages.urllib3.connectionpool:http://www.data.go.kr:80 "GET /dataset/fileDownload.do?atchFileId=FILE_000000001215878&fileDetailSn=1&publicDataDetailPk=uddi:07c9308d-3382-48ca-bafe-ad3990342e77 HTTP/1.1" 200 17920
DEBUG:root:index 2 File downloading Status of designation of public office (agent) (2014).hwp. 


2022-09-22 08:36


