The site I want to download from is laid out as follows:
https://files.example/works/section_a
https://files.example/works/section_b
https://files.example/works/section_c
...
Of these, I want to download only the files under section_a.
However, the page at section_a contains links to section_b and section_c.
So I ran the following wget command:
wget -p -E -nH -np -k -r -l1 https://files.example/works/section_a
However, the results are not what I expected: files under section_b and section_c are downloaded in addition to those under section_a.
Files in the parent directory are not retrieved, which is what I expected.
Why are files from directories at the same level as section_a still downloaded even though I specified https://files.example/works/section_a with the -np option?
Also, is there a way to download only the files under section_a?
When I added / to the end of the URL and ran wget as follows, the result was 404 Not Found:
wget -p -E -nH -np -k -r -l1 https://files.example/works/section_a/
...
HTTP request sent, awaiting response... 404 Not Found
2021-01-05 19:16:16 ERROR 404: Not Found.
Also, when I try to access the URL with a trailing slash directly, the page cannot be found.
bash wget
If the specified URL is a directory, try running the command with / at the end.
Reference:
no-parent does not work on wget
However, this did not behave as expected: wget went up into the parent directory, so I looked into why.
In the end, you have to add a trailing /.
# wget --recursive --no-remove-listing --no-parent http://www.example.com/foo/baa/
Why are files in the same hierarchy downloaded even though I specify https://files.example/works/section_a with the -np option?
-np is --no-parent, so it only prevents wget from retrieving the parent hierarchy.
Directories at the same level are still subject to retrieval.
Also, is there a way to download only section_a files?
section_b and section_c are retrieved because -r requests recursive retrieval, so leave out -r -l1 as follows:
wget -p -E -nH -np -k https://files.example/works/section_a
I had assumed section_a was a file, but it is a directory.
For the URL https://files.example/works/section_a, the base directory is /works/, and section_a is interpreted as a file.
If section_a is a directory, the web server usually responds with a redirect to https://files.example/works/section_a/ to tell the client that it is a directory.
For https://files.example/works/section_a/, the base directory is /works/section_a/, so the -np option works as expected.
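The base-directory rule described above can be modeled in a few lines of shell. This is a simplified sketch, not wget's actual implementation: base_dir is a hypothetical helper that keeps everything up to and including the last / of the URL, and it ignores query strings and fragments.

```shell
# Rough model of how --no-parent picks its base directory:
# everything up to and including the last "/" of the start URL.
# base_dir is a hypothetical helper, not part of wget.
base_dir() {
  printf '%s\n' "${1%/*}/"
}

# Without a trailing slash, section_a is treated like a file name.
base_dir 'https://files.example/works/section_a'
# prints https://files.example/works/

# With a trailing slash, section_a itself becomes the base directory.
base_dir 'https://files.example/works/section_a/'
# prints https://files.example/works/section_a/
```

In the first case section_b and section_c sit below the base directory, which is why -np does not exclude them.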
However, you say that specifying https://files.example/works/section_a/ in wget results in Not Found.
The server may be redirecting straight to a file instead.
Do you see a redirect message like the following when you run wget?
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://files.example/works/section_a/index.html [following]
If the server redirects to a file inside section_a, you can specify that redirect URL in wget.
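One way to check where the redirect points is to pull the Location line out of wget's log and see whether it falls under section_a. The following is a minimal sketch that works on a sample log rather than a live request; redirect_target and inside_section_a are hypothetical helpers, and the sample text only mirrors the redirect message quoted above.

```shell
# redirect_target (hypothetical helper) reads wget log text on stdin
# and prints the URL from the first "Location:" line, if any.
redirect_target() {
  sed -n 's/^ *Location: *\([^ ]*\).*/\1/p' | head -n 1
}

# inside_section_a (hypothetical helper) reports whether a URL
# lies under /works/section_a/.
inside_section_a() {
  case "$1" in
    https://files.example/works/section_a/*) echo inside ;;
    *) echo outside ;;
  esac
}

# Sample log text, assumed for illustration only.
sample_log='HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://files.example/works/section_a/index.html [following]'

target=$(printf '%s\n' "$sample_log" | redirect_target)
inside_section_a "$target"
# prints inside
```

Against the real site, the same filter could be fed from something like wget --spider -S URL 2>&1, since wget writes its log to standard error.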
The problem arises if the server redirects to a file outside section_a, or does not redirect at all. In that case, try specifying /works/section_a in the --accept-regex option as follows:
wget -p -E -nH -np -k -r -l1 --accept-regex '/works/section_a' https://files.example/works/section_a
However, if files required to display the HTML under section_a are located outside section_a, you will not be able to retrieve those files.
In that case, you might want to exclude section_b and the other sibling sections with the --reject-regex option as follows:
wget -p -E -nH -np -k -r -l1 --reject-regex '/works/section_[b-z]' https://files.example/works/section_a
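To see the difference between the two patterns without touching the network, they can be tried against a list of URLs with grep -E. This only approximates wget's matching, and the URL list below is made up for illustration (in particular, /assets/style.css is an assumed page requisite outside section_a).

```shell
# Candidate URLs, invented for illustration.
urls='https://files.example/works/section_a
https://files.example/works/section_a/page1.html
https://files.example/works/section_b
https://files.example/works/section_c/index.html
https://files.example/assets/style.css'

# Approximates --accept-regex '/works/section_a': keep only matching URLs.
accepted=$(printf '%s\n' "$urls" | grep -E '/works/section_a')

# Approximates --reject-regex '/works/section_[b-z]': drop matching URLs.
not_rejected=$(printf '%s\n' "$urls" | grep -Ev '/works/section_[b-z]')

printf 'accept-regex keeps:\n%s\n\n' "$accepted"
printf 'reject-regex keeps:\n%s\n' "$not_rejected"
```

The accept pattern keeps only the two section_a URLs, while the reject pattern also lets the stylesheet through, which is the trade-off between the two options.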