Identifying File Names Retrieved and Saved by Selenium

Asked 2 years ago, Updated 2 years ago, 387 views

Using Ubuntu + Python + Selenium, I automated Chrome to click the Save button and save a file locally. However, the file name cannot be controlled from my side; the site I download from picks a name based on the search word.
What would be a smart way for Python to retrieve the name of the file that has just been saved?
My current approach is to scan the download directory, collect the files whose names match a regular expression, and assume the newest one is the file just saved. I imagine this is a common problem, so please let me know if there is a better way.
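Concretely, what I am doing now looks roughly like this (the directory and pattern here are placeholders):

import os
import re

download_dir = os.path.expanduser('~/Downloads')  # placeholder directory
pattern = re.compile(r'foo_.*\.txt$')  # placeholder pattern

# Collect the files whose names match the regular expression
candidates = [os.path.join(download_dir, f)
              for f in os.listdir(download_dir) if pattern.search(f)]

# Assume the most recently modified match is the file just saved
newest = max(candidates, key=os.path.getmtime) if candidates else None
print(newest)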

python ubuntu google-chrome selenium selenium-webdriver

2022-09-30 21:57

2 Answers

Ubuntu + Python + Selenium

This does not match the premise of your question, but if you use Puppeteer instead of Selenium, you can directly specify the path and file name for a download.
Saving downloaded files with an arbitrary path and name in Puppeteer | AquaWare tweet blog

The article above is a Node.js + Puppeteer example, but it can be rewritten in Python using pyppeteer as follows:

import os
import asyncio

# Specify the Chromium revision so that Fetch.enable is available
os.environ['PYPPETEER_CHROMIUM_REVISION'] = '884014'
import pyppeteer

event_loop = asyncio.get_event_loop()

async def main(file_name, headless=True, wait_time=5.0):
    b = await pyppeteer.launch({'headless': headless})
    p = await b.newPage()
    await p.goto('https://github.com/pyppeteer/pyppeteer')
    e = await p.querySelector('get-repo')
    await e.click()

    client = await p.target.createCDPSession()
    await client.send('Page.setDownloadBehavior', {'behavior': 'allow', 'downloadPath': os.getcwd()})
    await client.send('Fetch.enable', {'patterns': [{'urlPattern': '*', 'requestStage': 'Response'}]})

    async def onRequestPaused(requestEvent):
        # Drop any existing content-disposition header
        responseHeaders = [v for v in requestEvent['responseHeaders'] if v['name'] != 'content-disposition']
        requestId = requestEvent['requestId']
        if requestEvent['responseStatusCode'] == 200:
            # Re-add content-disposition with the file name we want
            responseHeaders.append({'name': 'content-disposition', 'value': f'attachment; filename="{file_name}"'})
            response = await client.send('Fetch.getResponseBody', {'requestId': requestId})
            await client.send('Fetch.fulfillRequest', {'requestId': requestId, 'responseCode': 200, 'responseHeaders': responseHeaders, 'body': response['body']})
        else:
            await client.send('Fetch.continueRequest', {'requestId': requestId})
    client.on('Fetch.requestPaused', lambda e: asyncio.ensure_future(onRequestPaused(e), loop=event_loop))

    # Click the "Download ZIP" button on GitHub
    e = await p.querySelector('a[href$=".zip"]')
    await e.click()
    await asyncio.sleep(wait_time)
    await client.send('Fetch.disable')
    await b.close()

event_loop.run_until_complete(main(file_name='specified_name.zip', headless=False))

Verified on Ubuntu on Windows (WSL).

Since pyppeteer is a port of the Node.js Puppeteer, its API is fundamentally asynchronous, so it may be hard to handle if you are used to Selenium's synchronous API.
For your reference.
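As a minimal illustration of that difference (a generic sketch, separate from the download example; the URL is arbitrary), even fetching a page title requires coroutines and an event loop:

import asyncio
import pyppeteer

# Every pyppeteer call is a coroutine and must be awaited
async def fetch_title():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('https://example.com')
    title = await page.title()
    await browser.close()
    return title

# The whole thing must run on an event loop; in Selenium these would be
# plain sequential calls
print(asyncio.get_event_loop().run_until_complete(fetch_title()))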


2022-09-30 21:57

I resolved the issue by referring to the following article, which was mentioned in a comment:

Python Selenium Dynamic Download Completion Waiting

In fact, besides the question itself, I also had the problem of not being able to detect when a download had finished, and I am glad that was solved at the same time.
Here is my source code.

For confidentiality reasons, the details below are anonymized:

  • The site searched is https://foo.example.com/
  • The search words are barbarbar and bazbazbaz
  • Search results are saved under the name foo_search_*.txt (where * appears to be a name shortened from the search word, such as barbarb-1?)

Key points of the program are as follows:

  • Use glob.glob to loop, waiting 1 second per iteration, as long as a file named foo_search_*.txt.crdownload exists
  • Inside the loop, derive the final file name by stripping .crdownload from foo_search_*.txt.crdownload
  • Once the loop exits, either the download has finished or it never started in the first place, so handle both cases
#!/usr/bin/python3
import glob
import time
from selenium import webdriver
from selenium.webdriver.common.by import By  # for find_element(By.ID, ...) lookups
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver, 10)  # maximum number of seconds for wait.until
driver.get('https://foo.example.com/')  # the search site
print(driver.title)

def search_and_save(search_term):
    # Wait for the search box to appear
    element = wait.until(expected_conditions.visibility_of_element_located((By.ID, "search_term_id")))
    element.send_keys(Keys.CONTROL + "a")
    element.send_keys(Keys.DELETE)
    element.send_keys(search_term)  # enter the search word
    element.submit()
    driver.find_element(By.CLASS_NAME, "search-submit").click()
    new_file = 'ERROR' + search_term  # file name to be downloaded; starts as an error marker
    for i in range(30):  # wait at most 30 seconds
        download_fileName = glob.glob('foo_search_*.txt.crdownload')  # look for a .crdownload file
        if download_fileName:  # a .crdownload file exists, so the download is still running
            new_file = download_fileName[0].replace('.crdownload', '')  # strip .crdownload to get the final name
            time.sleep(1)  # wait 1 second
        else:  # no .crdownload: the download has finished, or it never started in the first place
            break
    print('new_file:' + new_file)  # the file name if the download succeeded, the ERROR marker if it never started

search_and_save('barbarbar')
search_and_save('bazbazbaz')
driver.quit()
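One design note: the glob pattern above is checked relative to the current working directory, so this only works if Chrome downloads into a known place. If you want that to be deterministic, pinning the default download directory through ChromeOptions is one option (a sketch under that assumption; the path is a placeholder):

import os
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# Pin Chrome's download directory so the .crdownload files appear in a known place
download_dir = os.path.join(os.getcwd(), 'downloads')  # placeholder path
os.makedirs(download_dir, exist_ok=True)
options.add_experimental_option('prefs', {'download.default_directory': download_dir})
driver = webdriver.Chrome(options=options)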


2022-09-30 21:57


