I'm a student studying web development, and I always get a lot of help here. I've managed to extract images and text separately, but I have a question. When crawling a newspaper article whose body is laid out as text, image, text, how can I extract it without losing that order?
The HTML I want to crawl is shown below. As you can see, it runs text, image, text, image, text.
<div id="articleBodyContents">
<!-- Body content -->
Google has announced a plan to release a developer version of its prefabricated smartphone 'Ara' that can add functions in a modular manner like Lego this fall and sell it from next year. Following the G5 introduced by LG Electronics, 'Ara' is expected to expand the assembly smartphone ecosystem.Blaze Bertrand, head of Google's Advanced Technology and Product (ATAP) division, said on Tuesday that a new smartphone for project Ara developers will be available in the fourth quarter of this year, the last day of the annual developer conference "Google I/O 2016." It added that it will also be sold to consumers in 2017.<br /><br />
<span class="end_photo_org"><img src="http://imgnews.naver.net/image/030/2016/05/22/804126_20160522135644_339_0001_99_20160522184806.jpg?type=w540" /><em class="img_desc">
Google's assembled smartphone Ara/photos = Yonhap News</em></span><br/>5.3-inch Ara smartphone has six slots to attach and remove various modules. While LG Electronics' G5 can only replace the lower module, it is a method that allows you to insert parts such as speakers and high-performance cameras as you like as an assembly PC. It can be compatible even if the next generation Ara frame comes out.<br /><br />
<span class="end_photo_org"><img src="http://imgnews.naver.net/image/030/2016/05/22/804126_20160522135644_339_0002_99_20160522184806.jpg?type=w540" /><em class="img_desc"> Google has released a module smartphone project called 'Ara' Image</em></span><br /> Google's software project at the forefront. It allows various developers to participate in module development. The project 'Ara' began as a secret project in 2012 and was introduced in 2013. Some parts were introduced last year, but no actual product has been released, but only this year, a detailed blueprint will be presented.<br /><br /> Reporter Ham Ji-hyun [email protected] <span style="display: block; font-size:14px;"> [Copyright 전자 electronic newspapers & Internet, unauthorized reproduction and redistribution prohibited]</span>
<!-- // Body content -->
</div>
Below is the code I wrote.
Controller
require 'open-uri'
require 'nokogiri'

def index
  @img_urls = []
  @table_hsh = []
  @url = "http://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=105&oid=030&aid=0002480868"
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
  @page = Nokogiri::HTML(URI.open(@url), nil, 'EUC-KR')
  @title = @page.search("title").text
  @title_edit = @title.split(':')[0]
  # .text flattens the body to a single string, so the images drop out here
  @content = @page.css('#articleBodyContents').text
  @img_urls = @page.css('.end_photo_org img').map { |i| i['src'] }
  @table_hsh << { :imgs => @img_urls }
end
View
<h1>Title</h1>
<%= @title_edit %>
<h1>Article Contents</h1>
<%= @content %>
<h1>Image</h1>
<table>
<tr>
<% @table_hsh.each do |row| %>
<% row[:imgs].each do |img_url| %>
<td><img src="<%= img_url %>" width="500" height="400"></td>
<% end %>
<% end %>
</tr>
</table>
What I want is to crawl the article while keeping the text and images in the same order as in the site's HTML shown above. Thank you for all the help.
If @content holds the node itself (that is, without calling .text on it), its child nodes can be read one by one through children.
The markup you posted has 23 children, and printing each name (@content.children.each { |c| puts c.name }) gives the list below. You can pick out only the children you need.
text
comment
text
br
br
text
br
br
span
br
text
br
br
span
br
text
br
br
text
span
text
comment
text