I am studying web scraping with Ruby.
The goal is to search Google and create a CSV file like the following:
★ CSV output example (*columns from the left: title, URL, text)
Google, www.google.co.jp, provides tools to search for all the information in the world.
I first created an output file for each item using each,
then tried to combine the three files into one with another program.
I then found that the order of the results in the three files did not match.
[Example of the output results]
■ Title.txt
Title 1: a
Title 2: b
Title 3: c
■ URL.txt
URL 1: a
URL 3: c
URL 2: b
■ Text.txt
Text 1: a
Text 2: b
Text 3: c
So I tried to aggregate the search results into a single output file.
Running the source below produced the following results.
[Example of the output results]
■ Summary.txt
Title 1: a
Title 2: b
Title 3: c
URL 1: a
URL 2: b
URL 3: c
Text 1: a
Text 2: b
Text 3: c
The finished form I would like is the following.
■ Summary (finished form).txt
Title 1: a, URL 1: a, Text 1: a
Title 2: b, URL 2: b, Text 2: b
Title 3: c, URL 3: c, Text 3: c
No matter what I try, I can't get this to work, so I am asking here.
Sorry for the beginner question, but could someone give me some advice?
My source is below.
Thank you in advance.
*The Ruby version is 2.1.5p273.
★ Source
require 'nokogiri'
require 'open-uri'

rec = 'https://www.google.com/search?q=google&oe=utf-8&hl=ja'
count = 0
ST = "&start="
# Clear the file
File.open("/src/out/all", "w")
# Search processing
for i in 1..2
  # Convert to String and concatenate
  search = rec.to_s + ST.to_s + count.to_s
  # Escape the URL for Google
  escaped_url = URI.escape(search)
  count += 10
  doc = Nokogiri::HTML(open(escaped_url))
  # title acquisition
  doc.xpath('//h3/a').each do |link|
    $cont = []
    $cont.push(link.content)
    $stdout = File.open("/src/out/all", "a")
    puts $cont[0]
    $stdout = STDOUT
  end
  # URL acquisition
  doc.xpath('//div[1]/cite').each do |url|
    $ul = []
    $ul.push(url.content)
    $stdout = File.open("/src/out/all", "a")
    puts $ul[0]
    $stdout = STDOUT
  end
  # text acquisition
  doc.xpath('//div/span').each do |link|
    $body = []
    $body.push(link.content)
    $stdout = File.open("/src/out/all", "a")
    puts $body[0]
    $stdout = STDOUT
  end
end
Thank you for your cooperation.
ruby web-scraping
First.
Google's search results are quite complicated, so handling the exceptions was a little hard.
Next, about the script you posted.
Variables starting with $ are global variables in Ruby; avoid them here. (1/18 1:00 postscript: variables start with $ in PHP, but that is not required in Ruby.)
Reassigning $stdout to a File.open handle just to write to a file is very awkward; File.open should be given a block instead.
The URI.escape method is deprecated. Use the CGI.escape method.
Now, a brief description of the policy behind the script I created.
You want CSV output, so I made it produce output like the following. The fields may contain spaces, so I enclosed them in quotation marks:
"Title 1", "URL 1", "Sentence 1"
"Title 2", "URL 2", "Sentence 2"
"Title 3", "URL 3", "Sentence 3"
Now let's look at the script. I rewrote it completely.
require 'nokogiri'
require 'open-uri'
require 'cgi/util'

# search keyword
keyword = "google"
base_url = "https://www.google.com/search?q=#{CGI.escape(keyword)}&oe=utf-8&hl=ja&start="
# output destination
file_path = "./output.csv"
# output buffer
output = ""
2.times.each do |i|
  search_url = base_url + (i * 10).to_s
  doc = Nokogiri::HTML.parse(open(search_url).read)
  doc.xpath('//li[@class="g"]').each do |item|
    # title acquisition
    title = item.xpath('.//h3/a').first.content
    # Skip image-search and news-search result items
    next if title[/Image Search Results/] || title[/News Search Results/]
    # Get the URL (but map links are special)
    # The cite tag would be item.xpath('.//cite').first.content, but its value is not always the linked URL.
    anchor_include_map = item.xpath('.//h3/a').first["href"]
    anchor = if anchor_include_map[%r!^/url!]
      anchor_include_map[%r!(?<=/url\?q=)[^&]+!]
    else
      anchor_include_map
    end
    link = CGI.unescape(anchor)
    # text acquisition
    source_text = item.xpath('.//span[@class="st"]')
    # Map links have no text, so guard against an empty node set.
    # Also strip the leading date (it just clutters the output) and remove line breaks.
    text = source_text.empty? ? "" : source_text.first.content.tr("\n", "").gsub(/\d{4}年\d{1,2}月\d{1,2}日.../, "")
    # temporarily store in the output buffer
    output << [title, link, text].map { |a| '"' + a + '"' }.join(", ") + "\n"
  end
end
# Output to file (overwrite)
open(file_path, "w") { |f| f.write(output) }
Both the XPath and the regular expressions are quite complicated, so please let me know if you have any questions.
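For reference, the /url?q= unwrapping in the script above can be tried on its own; the href value here is a made-up example of the redirect form Google sometimes uses:

```ruby
require 'cgi/util'

# Google sometimes wraps result links as /url?q=<real URL>&...;
# the lookbehind regex extracts just the q= value.
href = "/url?q=https%3A%2F%2Fexample.com%2Fpage&sa=U&ei=abc"
anchor = if href[%r!^/url!]
  href[%r!(?<=/url\?q=)[^&]+!]
else
  href
end
link = CGI.unescape(anchor)
# link is now "https://example.com/page"
```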
I wrote it as follows.
require 'open-uri'
require 'nokogiri'
require 'csv'

search_word = 'google'
search_url = 'https://www.google.com/search?'
search_url += 'q=' + search_word
search_url += '&oe=utf-8&hl=ja&start=0&num=20'
escaped_url = URI.escape(search_url)
output_csv = 'search.csv'
nth = 1
doc = Nokogiri::HTML(open(escaped_url))
CSV.open(output_csv, "wb") do |csv|
  doc.xpath('//li[@class="g"]').each do |li|
    title = li.xpath('h3/a').text
    url = li.xpath('div/div/cite').text
    exp = li.xpath('div/span').text.gsub(/\r?\n/, '')
    if title != ''
      title = sprintf("Title %d: %s", nth, title)
      url = sprintf("URL %d: %s", nth, url)
      exp = sprintf("Text %d: %s", nth, exp)
      csv << [title, url, exp]
      nth += 1
    end
  end
end
The details may differ from what you want, so please adjust it as needed.
Each search result is enclosed in a <li class="g"> tag (*), so the script extracts each item and then reads the inner nodes (cite and span). The extraction results are then saved to a file in CSV format using Ruby's CSV library.
*The search results also include extras such as News Topics, but those are excluded.
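As a minimal, standalone sketch of just that CSV step (with dummy rows in place of the scraped values):

```ruby
require 'csv'

rows = [
  ["Title 1: a", "URL 1: a", "Text 1: a"],
  ["Title 2: b", "URL 2: b", "Text 2: b"],
]
# CSV.open quotes fields as needed and closes the file when the block ends.
CSV.open("demo.csv", "wb") do |csv|
  rows.each { |row| csv << row }
end
```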
© 2024 OneMinuteCode. All rights reserved.