I have a CSV file that records download links as URLs. I want to write code that reads the URLs from the file and automatically downloads each one into a folder. I don't know how to do this, so I've tried a few things, but I keep hitting an error and am completely stuck.
import urllib.request
url='http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001212856&fileDetailSn=1&publicDataDetailPk=uddi:07b44140-4ded-40e6-946e-c03b317b833e'
urllib.request.urlretrieve(url,1)
This was a first attempt just to check whether the urllib.request library works, but it fails with the following error:
OSError: [WinError 6] Handle is invalid
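The error comes from the second argument: `urlretrieve(url, filename)` expects a destination file *path* (a string), not `1`. A minimal sketch of the correct call, using a local `file://` URL as a stand-in so it runs without network access:

```python
import pathlib
import tempfile
import urllib.request

# Create a small local file to act as the "remote" resource.
src = pathlib.Path(tempfile.gettempdir()) / 'src.csv'
src.write_text('hello,world\n', encoding='utf-8')

# Correct usage: urlretrieve(url, filename) — the second argument
# is the path to save to, not an arbitrary value like 1.
dst = str(pathlib.Path(tempfile.gettempdir()) / 'dst.csv')
urllib.request.urlretrieve(src.as_uri(), dst)
print(open(dst, encoding='utf-8').read())
```

With a real HTTP URL the call is the same: pass the URL string first and the target file name second.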
import requests
url='http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001375425&fileDetailSn=1&publicDataDetailPk=uddi:fff6f608-f3b8-464f-be97-d58c4944e477'
r=requests.get(url,allow_redirects=True)
open('urldata.csv','wb').write(r.content)
r = requests.get(url, allow_redirects=True)
print (r.headers.get('content-type'))
There is also the requests library approach above, but it seemed I would have to enter the file name and URL one by one.
I want code that opens and reads the CSV file of URLs in Python (with open(), read(), etc.) and then automatically downloads each recorded URL one by one. Any help would be appreciated.
r = requests.get(url.rstrip(), stream=True)
if r.status_code == 200:
    content_disposition = r.headers.get('content-disposition')
    if content_disposition is not None:
        targetFileName = requests.utils.unquote(cgi.parse_header(content_disposition)[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName
    else:
        print('url {} had no content-disposition header'.format(url))
elif r.status_code == 404:
    print('{} returned a 404, no file was downloaded'.format(url))
else:
    print('something else went wrong with {}'.format(url))
It's lunch time... I coded this up quickly, so just use it as a reference for learning.
import cgi
import requests

SAVE_DIR = 'C:/'

def downloadURLResource(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName

with open('h:/url.csv') as f:  # url.csv is a file with the URL list.
    print(list(map(downloadURLResource, f.readlines())))
There are no questions here, only hints: (1) read the CSV file; (2) parse each line of the result of (1) to obtain a URL. If you don't know how to do (1) and (2), you should study how to handle I/O in Python.
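Steps (1) and (2) can be sketched like this, using a temporary file as a stand-in for the asker's url.csv:

```python
import os
import tempfile

# Stand-in for the asker's url.csv: one URL per line.
path = os.path.join(tempfile.gettempdir(), 'url.csv')
with open(path, 'w', encoding='utf-8') as f:
    f.write('http://example.com/a\nhttp://example.com/b\n')

# (1) open and read the file, (2) parse one URL per line,
# stripping the trailing newline and skipping blank lines.
with open(path, encoding='utf-8') as f:
    urls = [line.rstrip() for line in f if line.strip()]
print(urls)
```

Each entry in `urls` can then be handed to whatever download function you write.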
The response headers for http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001375425&fileDetailSn=1&publicDataDetailPk=uddi:fff6f608-f3b8-464f-be97-d58c4944e477 are as follows.
HTTP/1.1 200 OK
Date: Wed, 19 Jul 2017 02:57:47 GMT
Server: Apache
Content-Disposition: attachment;filename=%EC%B6%A9%EC%B2%AD%EB%B6%81%EB%8F%84+%EC%A6%9D%ED%8F%89%EA%B5%B0_%EB%AF%BC%EB%B0%95%ED%8E%9C%EC%85%98%EC%97%85%EC%86%8C_20170511.csv
Content-Language: ko
Content-Length: 378
Keep-Alive: timeout=5, max=10000
Connection: Keep-Alive
Content-Type: application/octet;charset=utf-8
Look at the Content-Disposition header to get the file name.
Please study this, and ask about anything you don't understand.
Fetching the address above returns the following:
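For illustration, the file name can be recovered by percent-decoding the `filename=` parameter of that header. A minimal stdlib-only sketch (the answers below use `cgi.parse_header` for the same parsing, but note the `cgi` module was removed in Python 3.13):

```python
from urllib.parse import unquote

# Header value copied from the response above.
cd = ('attachment;filename=%EC%B6%A9%EC%B2%AD%EB%B6%81%EB%8F%84+'
      '%EC%A6%9D%ED%8F%89%EA%B5%B0_%EB%AF%BC%EB%B0%95%ED%8E%9C%EC%85%98'
      '%EC%97%85%EC%86%8C_20170511.csv')

raw = cd.split('filename=', 1)[1]  # naive parse: everything after filename=
filename = unquote(raw)            # percent-decode the UTF-8 bytes
print(filename)                    # Korean file name; note '+' is left as-is
```

`unquote` only decodes `%xx` escapes; it does not turn `+` into a space, which matches what `requests.utils.unquote` does in the answers below.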
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Insert title here</title>
</head>
<body>
<body>
<input type="hidden" name="exception" value="The requested file was not found" style="display:none"/>
</body>
</html>
As you can see, the URL is most likely invalid.
import cgi
import requests

TARGET_DIR = 'C:/'

def downloadURLResource(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200 and 'content-disposition' in r.headers:  # status is OK and a content-disposition value exists
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(TARGET_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return True  # content-disposition exists, so the download succeeded
    else:
        return False  # content-disposition missing, or a 404 or other error

with open('h:/urlcsv.csv') as f:
    # Build a {url: result} dict from the function's return values, then filter out the failures.
    failItems = filter(lambda i: i[1] == False, {url.rstrip(): downloadURLResource(url.rstrip()) for url in f.readlines()}.items())
    list(map(print, failItems))  # output the failed items
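The dict-comprehension-plus-filter pattern above can be seen offline with placeholder results instead of real downloads (the URLs here are made up for illustration):

```python
# {url: download succeeded?} — dummy results standing in for downloadURLResource.
results = {'http://example.com/a': True,
           'http://example.com/b': False,
           'http://example.com/c': True}

# Keep only the (url, result) pairs whose result is False.
failItems = list(filter(lambda i: i[1] == False, results.items()))
print(failItems)
```

Only the failed URL(s) remain, which is what gets printed at the end of the script.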
Test it out.
Leave a log as shown below.
That way you can see exactly where a problem occurred.
import cgi
import requests
import logging

logging.basicConfig(level=logging.DEBUG)

TARGET_DIR = 'C:/'

def downloadURLResource(index, url):
    logging.debug('index {}'.format(index))
    r = requests.get(url, stream=True)
    if r.status_code == 200 and 'content-disposition' in r.headers:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        logging.debug('index {} File downloading {}.'.format(index, targetFileName))
        with open("{}/{}".format(TARGET_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return True
    else:
        logging.debug('index {} URL: {} failed.'.format(index, url))
        return False

with open('h:/data.csv') as f:
    failItems = filter(lambda i: i[1] == False, {url.rstrip(): downloadURLResource(index, url.rstrip()) for index, url in enumerate(f)}.items())
    list(map(print, failItems))
DEBUG:root:index 0
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.data.go.kr
DEBUG:requests.packages.urllib3.connectionpool:http://www.data.go.kr:80 "GET /dataset/fileDownload.do?atchFileId=FILE_000000001212856&fileDetailSn=1&publicDataDetailPk=uddi:07b44140-4ded-40e6-946e-c03b317b833e HTTP/1.1" 200 10139648
DEBUG:root:index 0 File downloading December_Industrial Trends.hwp.
DEBUG:root:index 1
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.data.go.kr
DEBUG:requests.packages.urllib3.connectionpool:http://www.data.go.kr:80 "GET /dataset/fileDownload.do?atchFileId=FILE_000000001215215&fileDetailSn=1&publicDataDetailPk=uddi:07c50b21-6f66-4943-b405-8fdf4adec661 HTTP/1.1" 200 31232
DEBUG:root:index 1 File downloading 2013-09.xls.
DEBUG:root:index 2
DEBUG:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.data.go.kr
DEBUG:requests.packages.urllib3.connectionpool:http://www.data.go.kr:80 "GET /dataset/fileDownload.do?atchFileId=FILE_000000001215878&fileDetailSn=1&publicDataDetailPk=uddi:07c9308d-3382-48ca-bafe-ad3990342e77 HTTP/1.1" 200 17920
DEBUG:root:index 2 File downloading Status of designation of public office (agent) (2014).hwp.