This is a web crawling question using Ruby-on-Rails.

Asked 2 years ago, Updated 2 years ago, 127 views

I'm studying Ruby on Rails. I'm trying to scrape the titles of four Naver news articles, but it didn't work as I expected, so I'm posting a question. In the code below, if I request 0002713773, 0002713772, 0002713771, and 0002713770 one at a time without using #{c}, it works fine, but when I turn this part into a loop, the titles don't come out. I wonder why.


        @titles = Array.new
        0002713773.downto(0002713770) do |c| 
            @url ="http://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=105&oid=032&aid=#{c}"
            @page = Nokogiri::HTML(open(@url), nil, 'EUC-KR')

            #@title = @page.search("title").text
            @title = @page.css("#articleTitle")
            @titles << @title.inner_text
        end

I also have one more question. I know that curl lets me download a page's HTML and save it locally, but I want to select only certain parts (the article title and the article content) and save those locally as HTML. I wrote code to pull out a specific part, but I don't know how to save it locally.

For example, I want to save article 1 as 1.html by selecting only the title and content, and article 2 as 2.html.
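One possible approach, sketched with the standard library only: once the title and content have been extracted, build a small HTML document as a string and write it with File.write. The helper name save_article, the "articles" directory, and the #articleBodyContents selector in the usage comment are assumptions for illustration and do not come from the original code.

```ruby
require "cgi"
require "fileutils"

# Hypothetical helper: writes one article to "<dir>/<index>.html"
# and returns the path of the file it created.
# title is plain text; body_html is an HTML fragment
# (for example, the string returned by Nokogiri's to_html).
def save_article(dir, index, title, body_html)
  FileUtils.mkdir_p(dir)                      # create the directory if missing
  path = File.join(dir, "#{index}.html")
  html = <<~HTML
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="UTF-8">
      <title>#{CGI.escapeHTML(title)}</title>
    </head>
    <body>
      <h1>#{CGI.escapeHTML(title)}</h1>
      #{body_html}
    </body>
    </html>
  HTML
  File.write(path, html, encoding: "UTF-8")
  path
end

# Inside the crawling loop this might look like the following
# (the body selector is an assumption about the page markup):
#   title = @page.css("#articleTitle").inner_text
#   body  = @page.css("#articleBodyContents").to_html
#   save_article("articles", 1, title, body)   # => "articles/1.html"
```

The title is escaped because it is plain text, while the body is written as-is because it is already an HTML fragment.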

ruby ruby-on-rails crawling html nokogiri

2022-09-22 13:55

1 Answer

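I figured out why. In Ruby, integer literals that start with zero are interpreted as octal numbers, so 0002713773 is not the decimal number 2713773, and #{c} interpolates a different value with no leading zeros. I'll leave the question up just in case it helps someone.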
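A minimal sketch of the problem and one possible fix, assuming the article IDs are always 10 digits wide: iterate over plain decimal integers and add the zero padding only when building the string. The helper name article_ids is hypothetical and is not part of the original code.

```ruby
# A leading zero makes Ruby read an integer literal as octal.
puts 0002713773   # prints 759803, not 2713773

# Hypothetical helper: count down over decimal integers and
# zero-pad each one to a fixed width when converting to a string.
def article_ids(first, last, width: 10)
  first.downto(last).map { |n| format("%0#{width}d", n) }
end

article_ids(2713773, 2713770).each do |aid|
  url = "http://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=105&oid=032&aid=#{aid}"
  puts url   # the aid parameter keeps its leading zeros, e.g. aid=0002713773
end
```

Keeping the IDs as strings after padding avoids the octal interpretation, because the zeros never appear in an integer literal.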


2022-09-22 13:55
