Understanding How to Join Files Using Ruby's Each

Asked 2 years ago, Updated 2 years ago, 88 views

I am studying web scraping using Ruby.

The goal is to search Google and create a CSV file like the image below:

★ CSV output image (columns from left: Title, URL, Text)

Google, www.google.co.jp, a tool for searching all the information in the world.

I wrote a program that outputs each piece of data to its own file using each, and then tried to combine the three files it produced into a single file.
I then found that the order of the search results did not match across the three files.

[Image of output results]

■ Title.txt

Title 1: a
Title 2: b
Title 3: c

■ URL.txt

URL 1: a
URL 3: c
URL 2: b

■ Text.txt

Text 1: a
Text 2: b
Text 3: c

So I tried to aggregate the search results into a single output file.
Running the source below produced the following result.

[Image of output results]

■Summary.txt

Title 1: a
Title 2: b
Title 3: c
URL 1: a
URL 2: b
URL 3: c
Text 1: a
Text 2: b
Text 3: c

What I would like the finished output to look like is the following:

■ Summary (finished form).txt

Title 1: a, URL 1: a, Text 1: a
Title 2: b, URL 2: b, Text 2: b
Title 3: c, URL 3: c, Text 3: c

No matter what I try, I cannot get it to work, so I am asking here.
I am sorry for the beginner question, but could someone give me some advice?

The source I wrote is below.
Thank you in advance for your guidance.
*The Ruby version is 2.1.5p273.

★ Source

require 'nokogiri'
require 'open-uri'

rec = 'https://www.google.com/search?q=google&oe=utf-8&hl=ja'
count = 0
ST = "&start="

# Clear the output file
File.open("/src/out/all", "w")

# Search processing
for i in 1..2

        # Convert to String and concatenate
        search = rec.to_s + ST.to_s + count.to_s

        # URL-escape the search string for Google
        escaped_url = URI.escape(search)
        count += 10
        doc = Nokogiri::HTML(open(escaped_url))

        # Title acquisition
        doc.xpath('//h3/a').each do |link|
                $cont = []
                $cont.push(link.content)
                $stdout = File.open("/src/out/all", "a")
                puts $cont[0]
                $stdout = STDOUT
        end

        # URL acquisition
        doc.xpath('//div[1]/cite').each do |url|
                $ul = []
                $ul.push(url.content)
                $stdout = File.open("/src/out/all", "a")
                puts $ul[0]
                $stdout = STDOUT
        end

        # Text acquisition
        doc.xpath('//div/span').each do |link|
                $body = []
                $body.push(link.content)
                $stdout = File.open("/src/out/all", "a")
                puts $body[0]
                $stdout = STDOUT
        end

end

Thank you for your cooperation.

ruby web-scraping

2022-09-30 11:46

2 Answers

First, a note: Google's search result HTML is quite complicated, so handling the exceptional cases took some work.

Next, some comments on the script you wrote:

  • There is a for statement, but the count is kept in a separate variable, which is redundant. The position where count is incremented also feels wrong: it should happen at the end of the loop (and count should not be used before that).
  • I am concerned about the heavy use of global variables. Ruby style is to avoid them. If you are writing them the way you would in Perl, drop the $, because Ruby scalar variables do not need it. (1/18 1:00 postscript: PHP variables also start with $; Ruby's do not.)
  • Swapping $stdout out for a File.open handle and then restoring it is very awkward. File.open should be used with a block instead.
  • The URI.escape method is deprecated. Use the CGI.escape method instead. (Both points are illustrated in the sketch after this list.)
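
To make the last two points concrete, here is a minimal sketch (the keyword and file name are placeholders, not from the original script):

require 'cgi/util'

keyword = "web scraping"       # placeholder keyword
escaped = CGI.escape(keyword)  # => "web+scraping"; use this instead of the deprecated URI.escape

# Block form of File.open: the file is closed automatically when the
# block exits, so there is no need to swap $stdout back and forth.
File.open("output.txt", "a") do |f|
  f.puts escaped
end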

Now, a brief description of the approach my script takes:

  • As in your script, the HTML is cut up with Nokogiri and XPath, but I first extract each whole result item in one batch and then pull the individual elements (title, link, text) out of it with further XPath queries.
  • The output is built up in a buffer first and written to the file only at the end of the script.
  • You want CSV output, so each line is formatted as shown below. The fields may contain spaces, so I wrap them in quotation marks:

    "Title 1", "URL 1", "Sentence 1"
    "Title 2", "URL 2", "Sentence 2"
    "Title 3", "URL 3", "Sentence 3"


Now let's look at the script. I rewrote it completely.

require 'nokogiri'
require 'open-uri'
require 'cgi/util'

# Search keyword
keyword = "google"
base_url = "https://www.google.com/search?q=#{CGI.escape(keyword)}&oe=utf-8&hl=ja&start="
# Output destination
file_path = "./output.csv"
# Output buffer
output = ""

2.times do |i|
  search_url = base_url + (i * 10).to_s
  doc = Nokogiri::HTML.parse(open(search_url).read)
  doc.xpath('//li[@class="g"]').each do |item|
    # Title acquisition
    title = item.xpath('.//h3/a').first.content
    # Skip image-search and news-search result items
    next if title[/Image Search Results/] || title[/News Search Results/]
    # URL acquisition (map links are a special case)
    # The cite tag would be item.xpath('.//cite').first.content, but its
    # value is not always the URL that is actually linked.
    anchor_include_map = item.xpath('.//h3/a').first["href"]
    anchor = if anchor_include_map[%r!^/url!]
      anchor_include_map[%r!(?<=/url\?q=)[^&]+!]
    else
      anchor_include_map
    end
    link = CGI.unescape(anchor)
    # Text acquisition
    source_text = item.xpath('.//span[@class="st"]')
    # Map links have no text, hence the empty? guard.
    # The leading date (e.g. "2015年1月18日...") is stripped because it gets
    # in the way, and line breaks are removed as well.
    text = source_text.empty? ? "" : source_text.first.content.tr("\n", "").gsub(/\d{4}年\d{1,2}月\d{1,2}日.../, "")
    # Temporarily store in the output buffer
    output << [title, link, text].map { |a| '"' + a + '"' }.join(", ") + "\n"
  end
end

# Output to file (overwrite)
File.open(file_path, "w") { |f| f.write(output) }

Both XPath and regular expressions are very complicated, so please let me know if you have any questions.


2022-09-30 11:46

I wrote it as follows.

require 'open-uri'
require 'nokogiri'
require 'csv'

search_word = 'google'
search_url = 'https://www.google.com/search?'
search_url += 'q=' + search_word
search_url += '&oe=utf-8&hl=ja&start=0&num=20'
escaped_url = URI.escape(search_url)
output_csv = 'search.csv'

nth = 1
doc = Nokogiri::HTML(open(escaped_url))
CSV.open(output_csv, "wb") do |csv|
  doc.xpath('//li[@class="g"]').each do |li|
    title = li.xpath('h3/a').text
    url = li.xpath('div/div/cite').text
    exp = li.xpath('div/span').text.gsub(/\r?\n/, '')

    if title != ''
      title = sprintf("Title %d: %s", nth, title)
      url = sprintf("URL %d: %s", nth, url)
      exp = sprintf("Text %d: %s", nth, exp)
      csv << [title, url, exp]
      nth += 1
    end
  end
end

I think the details differ from what you want, so please adjust accordingly.
Each search result item is wrapped in a <li class="g"> tag (*), so the script extracts each item first and then extracts the inner nodes (cite and span) from it. Ruby's csv package is then used to save the extracted results to a file in CSV format.

*The search results also include advertisements and items such as news topics, but these are excluded.
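
To illustrate the extract-per-item idea without actually hitting Google, here is a minimal self-contained sketch; the HTML fragment is made up, but it has the same <li class="g"> structure described above:

require 'nokogiri'
require 'csv'

# A made-up fragment shaped like the search results described above.
html = <<-HTML
<ul>
  <li class="g"><h3><a href="http://a.example">Title a</a></h3>
    <div><div><cite>a.example</cite></div><span>text a</span></div></li>
  <li class="g"><h3><a href="http://b.example">Title b</a></h3>
    <div><div><cite>b.example</cite></div><span>text b</span></div></li>
</ul>
HTML

doc = Nokogiri::HTML(html)
CSV.open('demo.csv', 'wb') do |csv|
  # One <li class="g"> per result item: pull the inner nodes out of each.
  doc.xpath('//li[@class="g"]').each do |li|
    csv << [li.xpath('.//h3/a').text, li.xpath('.//cite').text, li.xpath('.//span').text]
  end
end
# demo.csv then contains one row per item, e.g.: Title a,a.example,text a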


2022-09-30 11:46


