How can I download only the files under a specified directory with wget?

Asked 2 years ago, Updated 2 years ago, 110 views

The site I want to download from has the following structure:

https://files.example/works/section_a
https://files.example/works/section_b
https://files.example/works/section_c
...

Of these, I want to download only the files under section_a.
However, the page in section_a contains links to section_b and section_c.

So I ran the following wget command:

wget -p -E -nH -np -k -r -l1 https://files.example/works/section_a

However, the result is not what I expected: files from section_b and section_c, which sit at the same level as section_a, are downloaded as well.
Files from the parent directory are not retrieved, which is what I expected.

Why are files from directories at the same level downloaded even though I specified https://files.example/works/section_a with the -np option?

Also, is there a way to download only the files under section_a?

wget -p -E -nH -np -k -r -l1 https://files.example/works/section_a/

When I added a / to the end of the URL and ran wget, the result was 404 Not Found.

...

HTTP request sent, awaiting response... 404 Not Found
2021-01-05 19:16:16 ERROR 404: Not Found.

Accessing the URL directly with a trailing slash also results in a page-not-found error.

bash wget

2022-09-30 13:53

2 Answers

If the specified URL is a directory, try running the command with a / at the end.

reference:
no-parent does not work on wget

However, this did not work as I expected: wget went up into the parent directory, and I looked into why.
It turned out that you have to add the trailing /.

# wget --recursive --no-remove-listing --no-parent http://www.example.com/foo/baa/


2022-09-30 13:53

Why are files in the same hierarchy downloaded even though I specify https://files.example/works/section_a with the -np option?

-np is --no-parent, so it only prevents wget from retrieving the parent hierarchy.
Directories in the same hierarchy are still subject to retrieval.

Also, is there a way to download only section_a files?

section_b and section_c are retrieved because -r requests recursive retrieval, so you could leave out -r -l1 as follows:

wget -p -E -nH -np -k https://files.example/works/section_a

I thought section_a was a file, but it is a directory.

Given the URL https://files.example/works/section_a, the base directory is /works/, and section_a is interpreted as a file.

If section_a is a directory, the web server usually responds with a redirect to https://files.example/works/section_a/ to tell the client that it is a directory.
With https://files.example/works/section_a/, the base directory becomes /works/section_a/, so the -np option works as expected.
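
The base-directory rule described above can be sketched with plain shell string handling. This only illustrates the idea and is not wget's actual implementation; `base_dir` is a hypothetical helper name, and the sketch assumes a URL with a scheme, a host, and no query string.

```shell
#!/bin/sh
# Hypothetical helper: approximate the directory that --no-parent treats as
# the starting point by cutting the URL path after its last slash.
base_dir() {
    path="/${1#*://*/}"            # drop scheme and host, keep the path
    printf '%s\n' "${path%/*}/"    # drop whatever follows the last slash
}

base_dir "https://files.example/works/section_a"    # prints /works/
base_dir "https://files.example/works/section_a/"   # prints /works/section_a/
```

Without the trailing slash the starting point is /works/, which is why section_b and section_c do not count as "parent" directories.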

However, specifying https://files.example/works/section_a/ in wget results in Not Found.
The server may be redirecting straight to a file instead.

Do you see the following redirect message when you run wget?

HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://files.example/works/section_a/index.html [following]

If the server redirects to a file inside section_a, you can pass that redirect URL to wget instead.
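
One way to check where the server redirects is sketched below. `--spider` and `-S` (`--server-response`) are standard wget options that print the response headers without saving anything. The helper `show_locations` is a hypothetical name, and the headers fed to it here are made-up sample text, not real output from files.example.

```shell
#!/bin/sh
# Hypothetical helper: print the target of every Location header found in
# wget's header output, read from standard input.
show_locations() {
    awk 'tolower($1) == "location:" { print $2 }'
}

# Intended use against the real server (needs network access):
#   wget --spider -S https://files.example/works/section_a 2>&1 | show_locations
# wget may also echo a "Location: ... [following]" line of its own, so the
# same target can show up twice.

# Offline try-out with made-up sample headers:
show_locations <<'EOF'
  HTTP/1.1 301 Moved Permanently
  Location: https://files.example/works/section_a/index.html
  HTTP/1.1 200 OK
EOF
```

If nothing is printed for the real URL, the server did not redirect at all.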

The problem arises if the server redirects to a file outside section_a, or does not redirect at all. In that case, try specifying /works/section_a in the --accept-regex option as follows:

wget -p -E -nH -np -k -r -l1 --accept-regex '/works/section_a' https://files.example/works/section_a

However, if files needed to display the HTML under section_a are located outside section_a, those files will not be retrieved.
In that case, you might instead exclude section_b and the other sibling directories with the --reject-regex option as follows:

wget -p -E -nH -np -k -r -l1 --reject-regex '/works/section_[b-z]' https://files.example/works/section_a
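
To see roughly which URLs each pattern would let through, the two expressions can be tried against a list of candidate URLs with grep -E. This is only an approximation, assuming wget applies the pattern as an unanchored POSIX regex search over the complete URL. The file names (index.html, style.css) are made up for illustration.

```shell
#!/bin/sh
# Made-up candidate URLs, one per line.
urls='https://files.example/works/section_a/index.html
https://files.example/works/section_b/index.html
https://files.example/works/section_c/index.html
https://files.example/css/style.css'

# Approximates --accept-regex '/works/section_a': keep matching URLs only.
accepted() { printf '%s\n' "$urls" | grep -E '/works/section_a'; }

# Approximates --reject-regex '/works/section_[b-z]': drop matching URLs.
not_rejected() { printf '%s\n' "$urls" | grep -Ev '/works/section_[b-z]'; }

echo "accept-regex keeps:"; accepted
echo "reject-regex keeps:"; not_rejected
```

In this sketch the stylesheet outside /works/ survives the reject filter but not the accept filter, which is the trade-off between the two commands.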


2022-09-30 13:53

© 2024 OneMinuteCode. All rights reserved.