Working around open-uri's 400 Bad Request in Ruby (accessing the Internet Archive, archive.org)

Asked 2 years ago, Updated 2 years ago, 35 views

Using Ruby's open-uri, I am fetching the Internet Archive copies of http://www.google.com/ and https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/ (the latter via https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/). The archived Google page is fetched correctly, but the other URL fails with "400 Bad Request".
More generally, whether a request succeeds or fails depends on the URL being accessed.

The site below says that "400 Bad Request is mostly caused by a problem on the user side," so I am looking for countermeasures. Do you have any tips for using open-uri?
If you have any information, I would appreciate it if you could let me know.

400 Bad Request
http://www.bmoo.net/archives/2012/02/312554.html

The Ruby I am using is 2.2.
ruby 2.2.0 preview1 (2014-09-17 trunk47616) [x86_64-darwin14]
=====/Source Code

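require 'open-uri'

# Open url with open-uri, retrying up to 5 times on any error.
# Returns the opened resource, or nil if every attempt fails.
def resque_open(url)
  rescue_num = 0
  begin
    res = open(url)
  rescue => e
    print "error raise in rescue:"
    p e
    print "url=#{url}\n"
    if rescue_num < 5
      sleep 1
      rescue_num = rescue_num + 1
      retry
    else
      res = nil
    end
  end
  puts "open OK url=#{url}\n\n" unless res == nil
  res
end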

f=resque_open('https://web.archive.org/web/20150421015448/http://www.google.com/')

f=resque_open('https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/')

========/Execution results
open OK url=https://web.archive.org/web/20150421015448/http://www.google.com/

error raise in rescue:#<OpenURI::HTTPError: 400 Bad Request>
url=https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/
error raise in rescue:#<OpenURI::HTTPError: 400 Bad Request>
url=https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/
error raise in rescue:#<OpenURI::HTTPError: 400 Bad Request>
url=https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/
error raise in rescue:#<OpenURI::HTTPError: 400 Bad Request>
url=https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/
error raise in rescue:#<OpenURI::HTTPError: 400 Bad Request>
url=https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/
error raise in rescue:#<OpenURI::HTTPError: 400 Bad Request>
url=https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/

ruby

2022-09-29 22:53

2 Answers

I was curious, so I read open-uri.rb. (I read the copy that ships with 2.2.0, but it should be the same in 2.2.0 preview1.)

The Bad Request seems to come about through the following sequence.

When a URL is converted to an absolute URL, consecutive / characters are collapsed into a single / (this is the behavior of Ruby's "uri" library; I suspect consecutive / is not really proper in the first place...)

Example:
URI("http://example.jp/") + URI("/foo//bar")
=> #<URI::HTTP:0x007f94dc0b5c70 URL:http://example.jp/foo/bar>

As a result, web.archive.org receives a request for a URL that differs from the one it expects (https://suumo.jp becomes https:/suumo.jp).
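
To check whether the Ruby you are running collapses slashes in this way, a small sketch like the following may help. The path is a shortened stand-in for the archive path in the question, and slashes_preserved? is a hypothetical helper name, not part of "uri". The result can differ between Ruby versions, so no expected output is shown.

```ruby
require 'uri'

# Hypothetical helper: resolve a host-relative path against a base URL with
# URI#+ and report whether the path came through unchanged.
def slashes_preserved?(base, relative_path)
  merged = URI(base) + URI(relative_path)
  merged.path == relative_path
end

base = 'https://web.archive.org/'
# Shortened stand-in for an archive path that embeds another URL.
relative = '/web/20150408183138/https://suumo.jp/'

puts "merged URL: #{URI(base) + URI(relative)}"
puts "slashes preserved: #{slashes_preserved?(base, relative)}"
```

If the second line prints false, any code that builds the request URL with URI#+ will send web.archive.org a path it does not expect.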

In principle, open-uri's implementation should work without any problems, but if you need to talk to a server that does not follow the standard (which is common in practice), you should use a different library such as net/http.
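
As a rough sketch of the net/http route, the code below follows redirects by hand and joins a host-relative Location header to the current origin as plain strings, so the path is never passed through URI#+. The method names redirect_target and fetch_following_redirects are made up for this example, and it assumes the server sends either an absolute URL or a path starting with /. I have not verified what web.archive.org actually returns.

```ruby
require 'net/http'
require 'uri'

# Build the next URL from a redirect's Location header. A host-relative
# Location ("/...") is appended to the current origin as a plain string,
# so consecutive slashes in the path are passed through untouched.
def redirect_target(current_url, location)
  return location if location =~ %r{\A[a-z][a-z0-9+.\-]*://}i
  current = URI(current_url)
  return "#{current.scheme}:#{location}" if location.start_with?('//')
  raise ArgumentError, "unsupported Location: #{location}" unless location.start_with?('/')
  origin = "#{current.scheme}://#{current.host}"
  origin += ":#{current.port}" unless current.port == current.default_port
  origin + location
end

# Fetch url with Net::HTTP, following at most `limit` redirects.
# Returns the body as a String; raises on a 4xx/5xx response.
def fetch_following_redirects(url, limit = 5)
  raise 'too many redirects' if limit < 0
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri)
  end
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    fetch_following_redirects(redirect_target(url, response['location']), limit - 1)
  else
    response.value # raises an exception describing the HTTP error
  end
end

# Example (needs network access):
# body = fetch_following_redirects('https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/')
```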

*Please note that open-uri behaves a little unusually around authentication and redirects. (Reference: https://stackoverflow.com/questions/13763399/openuri-causing-401-unauthorized-error-with-https-url/13765887#13765887)


2022-09-29 22:53

When I access the above URL with Firefox, the URL is rewritten, so it seems that open-uri cannot cope with the 3xx response here.
After rewriting the open step with Net::HTTP, with reference to here, the URL above also works well.
Also, if you want to crawl, I think it would be easier to use Mechanize. (Reference)
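
To see the 3xx response mentioned above for yourself, something like the following sketch might be useful. Net::HTTP does not follow redirects on its own, so a single GET shows the status code and the raw Location header as the server sent them. The names redirect_info and probe are hypothetical helpers for this example.

```ruby
require 'net/http'
require 'uri'

# Summarize a response: the status code, plus the Location header when the
# response is a redirect (nil otherwise).
def redirect_info(response)
  location = response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
  [response.code, location]
end

# Send a single GET request without following redirects and return
# [status_code, location].
def probe(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri)
  end
  redirect_info(response)
end

# Example (needs network access):
# code, location = probe('https://web.archive.org/web/20150408183138/https://suumo.jp/tochi/tokyo/sc_nishitokyo/nc_84783830/')
# puts "status=#{code} location=#{location.inspect}"
```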

By the way, 2.2.2 has been released in the Ruby 2.2 series, so unless you have a particular reason not to, I think it would be better to update to the latest version.


2022-09-29 22:53



© 2024 OneMinuteCode. All rights reserved.