Web crawling: Korean text comes out broken

Asked 2 years ago, Updated 2 years ago, 174 views

class HomeController < ApplicationController
  require 'open-uri'
  require 'nokogiri'

  def index
    @url = "http://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=105&oid=030&aid=0002480868"
    # The article is served as EUC-KR, so pass that encoding to the parser
    @page = Nokogiri::HTML(open(@url), nil, 'EUC-KR')

    @title = @page.search("title").inner_html
    @title_edit = @title.split(':')[0]
    @content = @page.at_css('div.article_body').inner_html
  end
end

I wrote the code above, but all the Korean words come out broken. I don't know what to do.

ruby-on-rails-4 crawling scraping utf-8

2022-09-22 15:15

1 Answer

Try text instead of inner_html.

The reason the Korean text is broken is that the encoding of the source document is EUC-KR, while the page that renders the crawled data uses UTF-8.

According to the Nokogiri documentation (parsing_an_html_xml_document):

"Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document."

Therefore, using the text method instead of inner_html solves the problem, because the parsed data and the encoding used to build the page are both unified to UTF-8.
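The behavior described above can be sketched with plain Ruby strings, no Nokogiri needed. The sample Korean word "뉴스" ("news") is just an illustration, not taken from the article:

```ruby
# A minimal sketch of the encoding mismatch, using only the Ruby
# standard library.

# What the source document effectively contains: EUC-KR bytes.
euc_kr_text = "뉴스".encode('EUC-KR')
puts euc_kr_text.encoding   # => EUC-KR

# Embedding an EUC-KR string in a UTF-8 page is what produces the
# broken characters. Converting it explicitly, which is what
# Nokogiri's text method does for you internally, fixes the display:
utf8_text = euc_kr_text.encode('UTF-8')
puts utf8_text.encoding     # => UTF-8
puts utf8_text              # => 뉴스
```

If you actually need the HTML markup (as for `@content`, where `text` would strip the tags), one alternative, assuming the source really is EUC-KR, is to re-encode the `inner_html` result yourself, e.g. `@page.at_css('div.article_body').inner_html.encode('UTF-8')`.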


2022-09-22 15:15



© 2024 OneMinuteCode. All rights reserved.