This is a question about web crawling in Ruby.

Asked 2 years ago, Updated 2 years ago, 109 views

I'm a student studying web development. I'm writing Ruby code that extracts only the body of online newspaper articles. Running the code below extracts the text of an article:


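    require 'open-uri'
    require 'nokogiri'

    def index
        @url = "http://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=105&oid=030&aid=0002480868"
        # The page is served as EUC-KR, so the encoding is passed explicitly.
        # URI.open needs Ruby 2.5+; on older Rubies open(@url) does the same job,
        # but a bare open(url) no longer fetches URLs on Ruby 3.0+.
        @page = Nokogiri::HTML(URI.open(@url), nil, 'EUC-KR')

        # Full <title> text, then only the part before the first ':'.
        @title = @page.search("title").text
        @title_edit = @title.split(':')[0]

        # Text of the article body; .text drops the <img> tags.
        @content = @page.css('#articleBodyContents').text
    end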

The HTML of the newspaper article is as follows. I also want to extract the article's images, but I don't know how. There are two images between the paragraphs of text. How do I pull them out?

    <div id="articleBodyContents">
    <!-- Main content -->

    Google plans to release a developer version of Ara, a smartphone assembled from modules like Lego, this fall and to sell it next year. With Ara joining the LG G5, the modular smartphone ecosystem is expected to expand.<br /><br />Blaise Bertrand, Head of Creative at Google's Advanced Technology and Products (ATAP) group, said on the 20th (local time), the last day of the annual developer conference `Google I/O 2016`, that a new Project Ara smartphone for developers will be released in the fourth quarter of this year. It will be sold to consumers in 2017, he added.<br /><br />
    <span class="end_photo_org"><img src="http://imgnews.naver.net/image/030/2016/05/22/804126_20160522135644_339_0001_99_20160522184806.jpg?type=w540" /><em class="img_desc">Google's modular smartphone Ara / Photo: Yonhap News</em></span><br />The 5.3-inch Ara smartphone has six slots where various modules can be attached and detached. Whereas the LG G5 lets you replace only the module at the bottom, Ara lets you assemble components such as speakers and high-performance cameras to taste, like a PC. Modules are also compatible with the next generation of Ara frames.<br /><br />
    <span class="end_photo_org"><img src="http://imgnews.naver.net/image/030/2016/05/22/804126_20160522135644_339_0002_99_20160522184806.jpg?type=w540" /><em class="img_desc">Image of Project Ara smartphone modules released by Google</em></span><br />Google has released a software development kit (SDK) on the Project Ara site, so that many developers can take part in module development. Project Ara began as a secret project in 2012 and was made public in 2013. A real product was expected last year but did not appear, and a concrete blueprint emerged only this year.<br /><br /> Reporter Ham Ji-hyun [email protected] <span style="display: block; font-size:14px;">[Copyright Electronic Times & Electronic Times Internet, unauthorized reproduction and redistribution prohibited]</span>
    <!-- // Main content -->
    </div>


ruby ruby-on-rails-4 web crawling nokogiri

2022-09-22 21:56

1 Answer

You can extract just the src attribute of each img element with .css, as shown below.

    img_urls = @page.css('.end_photo_org img').map { |i| i['src'] }


2022-09-22 21:56



© 2024 OneMinuteCode. All rights reserved.