I'm practicing web crawling with Nokogiri.
http://media.daum.net/digital/mobile/#page=1&type=tit_cont
For this URL, I wrote code that outputs the URL of each article (the part highlighted in red in my screenshot).
However, that part of the page is generated by JavaScript, so it does not appear in the page source, and the desired result is not obtained with Nokogiri.
The generated markup is only visible through the Chrome developer tools.
The code I made is as follows.
url = "http://media.daum.net/digital/mobile/#page=1&type=tit_cont"
page1 = Nokogiri::HTML(open(url))
cl =page1.css("#listWrap")
child1=cl.children
child1.each do |c|
if c.name == "li"
html_a= c.css('a').attr("href")
strH = html_a.to_s
puts strH
end
end
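To double-check, you can inspect what open-uri actually fetched; this is just a quick sketch of that check:

# Sketch: count the list items in the raw HTML that open-uri fetched.
# If the items are inserted by JavaScript in the browser, this should be 0.
raw = Nokogiri::HTML(open(url))
puts raw.css("#listWrap li").size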
How can I crawl the parts of a page that Nokogiri can't see because they are generated by JavaScript?
nokogiri ruby crawling javascript
You can use PhantomJS, Selenium, or Watir instead of open-uri. open-uri only fetches the URL's raw response as-is, so it cannot read content that the browser creates dynamically with JavaScript.
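For example, here is a minimal sketch using the selenium-webdriver gem (headless Chrome is just one possible setup, an assumption on my part): the browser renders the page, and the finished DOM is handed back to Nokogiri.

require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')   # assumption: headless Chrome is available
driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get("http://media.daum.net/digital/mobile/#page=1&type=tit_cont")
sleep 2  # crude wait for the JavaScript to finish; a real script should poll

# page_source is the DOM *after* JavaScript ran, so Nokogiri can see the list
page = Nokogiri::HTML(driver.page_source)
page.css("#listWrap li a").each { |a| puts a["href"] }

driver.quit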
With Capybara and Poltergeist (PhantomJS), it can be implemented like this:
visit url
within("div#listWrap") do
  all("li").map do |base|
    img_url = base.find("img")["src"]
    title   = base.find("a.tit").text
    content = base.find("a.txt").text
    [img_url, title, content]
  end
end
If you add exception handling around find and condense the code:
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

include Capybara::DSL
Capybara.default_driver = :poltergeist

visit("http://media.daum.net/digital/mobile/#page=1&type=tit_cont")

# rescue falls back to "no_img" when an <li> has no <img>
type_daum_news1 = ->(b) { [(b.find("img")["src"] rescue "no_img"),
                           b.find("a.tit").text, b.find("a.txt").text] }

contents = within("div#listWrap") { all('li').map(&type_daum_news1) }
p contents #=> [[img_url, title, content], ..., [img_url, title, content]]
The result is an array of [img_url, title, content] entries, one per article in the list.
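Since the question asked for the article URLs, the same pattern can collect hrefs too. A sketch, assuming (from the markup above) that each item's link is the a.tit anchor:

article_urls = within("div#listWrap") do
  all("li").map { |b| (b.find("a.tit")["href"] rescue nil) }
end
p article_urls.compact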