I'm practicing web crawling with Nokogiri.
http://media.daum.net/digital/mobile/#page=1&type=tit_cont
For this URL, I wrote code that outputs the URL of each article (the part highlighted in red in my screenshot).
However, that part of the page is generated by JavaScript, so it does not appear in the page source, and the desired result is not obtained with Nokogiri.
The generated markup is only visible through the Chrome developer tools.
The code I made is as follows.
url = "http://media.daum.net/digital/mobile/#page=1&type=tit_cont"
page1 = Nokogiri::HTML(open(url))
cl =page1.css("#listWrap")
child1=cl.children
child1.each do |c|
if c.name == "li"
html_a= c.css('a').attr("href")
strH = html_a.to_s
puts strH
end
end
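To double-check, you can inspect what open-uri actually fetched; this is just a quick sketch of that check:

# Sketch: count the list items in the raw HTML that open-uri fetched.
# If the items are inserted by JavaScript in the browser, this should be 0.
raw = Nokogiri::HTML(open(url))
puts raw.css("#listWrap li").size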
How can I crawl the parts of a page that Nokogiri can't see because they are generated by JavaScript?
nokogiri ruby crawling javascript
You can use PhantomJS, Selenium, or Watir instead of open-uri. open-uri only fetches the URL's raw response as-is, so it cannot read content that the browser creates dynamically with JavaScript.
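For example, here is a minimal sketch using the selenium-webdriver gem (headless Chrome is just one possible setup, an assumption on my part): the browser renders the page, and the finished DOM is handed back to Nokogiri.

require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')   # assumption: headless Chrome is available
driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get("http://media.daum.net/digital/mobile/#page=1&type=tit_cont")
sleep 2  # crude wait for the JavaScript to finish; a real script should poll

# page_source is the DOM *after* JavaScript ran, so Nokogiri can see the list
page = Nokogiri::HTML(driver.page_source)
page.css("#listWrap li a").each { |a| puts a["href"] }

driver.quit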
With Capybara and Poltergeist (PhantomJS), it can be implemented like this:
visit url
within("div#listWrap") do
  all("li").map do |base|
    img_url = base.find("img")["src"]
    title   = base.find("a.tit").text
    content = base.find("a.txt").text
    [img_url, title, content]
  end
end
If you add exception handling around find and condense the code:
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

include Capybara::DSL
Capybara.default_driver = :poltergeist

visit("http://media.daum.net/digital/mobile/#page=1&type=tit_cont")

# rescue falls back to "no_img" when an <li> has no <img>
type_daum_news1 = ->(b) { [(b.find("img")["src"] rescue "no_img"),
                           b.find("a.tit").text, b.find("a.txt").text] }

contents = within("div#listWrap") { all('li').map(&type_daum_news1) }
p contents #=> [[img_url, title, content], ..., [img_url, title, content]]
The result is an array of [img_url, title, content] entries, one per article in the list.
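Since the question asked for the article URLs, the same pattern can collect hrefs too. A sketch, assuming (from the markup above) that each item's link is the a.tit anchor:

article_urls = within("div#listWrap") do
  all("li").map { |b| (b.find("a.tit")["href"] rescue nil) }
end
p article_urls.compact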