I want a uniform way to extract information from different web pages.


I am currently learning Python.
My plan is to query the Google Custom Search API for "Company Overview + Tokyo", obtain the URL of each page in the search results, and
scrape a company profile from each of those URLs.
Naturally, every web page writes its HTML differently, so I can't extract the information I'm looking for just by reading table and li tags.
Please let me know if you have any ideas.

# Code 1. Google API search and output of the results to a JSON file
import json
import urllib.request
import urllib.parse

QUERY = 'Company Overview + Tokyo'
key = 'KEY'
cx = 'CX'
NUM = 3
cseurl = 'https://www.googleapis.com/customsearch/v1?'
params = {
  'key': key,
  'q': QUERY,
  'cx': cx,
  'alt': 'json',
  'lr': 'lang_ja',
}
start = 1
f = open('result/GoogleResult.json', 'w')

for i in range(0, NUM):
  params['start'] = start
  req_url = cseurl + urllib.parse.urlencode(params)
  search_response = urllib.request.urlopen(req_url)
  search_results = search_response.read().decode('utf8')
  dump = json.loads(search_results)
  f.write(json.dumps(dump) + '\n')  # one JSON document per line
  start = int(dump['queries']['nextPage'][0]['startIndex'])
f.close()
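
One caveat, assuming the Custom Search JSON API behaves as documented: on the last page of results the response carries no queries.nextPage entry, so the final line of the loop will raise a KeyError once the results run out. A small guard, sketched with the same variable names:

  # Sketch: stop paging gracefully when the API reports no further page.
  next_page = dump['queries'].get('nextPage')
  if not next_page:
    break
  start = int(next_page[0]['startIndex'])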


# Code 2. Extract the URLs from the JSON file of Google search results
import re

read_file = open('result/GoogleResult.json', 'r')
# Put each JSON field on its own line so the regex matches one link at a time
resultFileData = read_file.read().replace(',', ',\n')
read_file.close()
# Regular expression pattern for URL extraction
pattern = re.compile(r'"link":\s*"http[^"]+"')
link_urls = pattern.findall(resultFileData)
write_file = open('result/UrlList.txt', 'w')
for link_url in link_urls:
  geturl = link_url.replace('"link":', '').replace('"', '').strip()
  write_file.write(geturl + '\n')
write_file.close()
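
Incidentally, since Code 1 writes one complete JSON document per line, the links can also be pulled out with json.loads instead of a regular expression, which sidesteps the quoting pitfalls above. A minimal sketch, assuming the same file layout:

# Sketch: read the saved responses back as JSON and take each item's link.
import json

with open('result/GoogleResult.json', 'r') as read_file, \
    open('result/UrlList.txt', 'w') as write_file:
  for line in read_file:
    data = json.loads(line)
    for item in data.get('items', []):
      write_file.write(item['link'] + '\n')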

# Code 3. Get the table-tag contents of each URL obtained above
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

urlfile = open('result/UrlList.txt', 'r')
urlrows = urlfile.readlines()
urlfile.close()

csvFile = open('result/url_file.csv', 'wt', newline='', encoding='utf-8')
writer = csv.writer(csvFile)
for urlrow in urlrows:
  html = urlopen(urlrow.strip())  # strip the trailing newline from the file
  bsObj = BeautifulSoup(html, 'html.parser')
  tables = bsObj.findAll('table')
  for table in tables:
    rows = table.findAll('tr')
    for row in rows:
      csvRow = []
      for cell in row.findAll(['td', 'th']):
        csvRow.append(cell.get_text())
      if len(csvRow) == 2:  # keep only header/value pairs
        writer.writerow(csvRow)
  writer.writerow(['-----------'])  # separator between pages
csvFile.close()
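
As one idea for the "every site writes its HTML differently" problem: once the th/td pairs are collected as above, they can be normalized by keeping only the rows whose header matches a list of profile labels. This is a sketch, and the label list below is an illustrative assumption, not a fixed set:

# Sketch: filter scraped (header, value) pairs down to known profile fields.
# The label list is an assumption; extend it for the sites you target.
LABELS = ('会社名', '商号', '所在地', '本社', '設立', '資本金', '代表者')

def pick_profile_rows(rows):
  """rows: iterable of (header, value) pairs taken from the scraped tables."""
  profile = {}
  for header, value in rows:
    key = header.strip()
    if any(label in key for label in LABELS):
      profile[key] = value.strip()
  return profile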

At the moment I run the three scripts above in order; the second one gets me as far as the URL list, and the third one does pull in the table-tag information.
Thank you for your cooperation.

python

2022-09-30 14:01

1 Answer

Since Wikipedia content can be downloaded in XML format, I have parsed that to obtain company profiles. I think it is easier to work with, because it is more structured than corporate websites. It is not scraping as such, but just for your information.
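
A minimal sketch of that approach, assuming the MediaWiki Special:Export endpoint (which returns a page as XML) and a page name chosen purely for illustration; the infobox field format is the usual wikitext convention, not a guaranteed schema:

# Sketch: fetch one Wikipedia page as XML and list its infobox fields.
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

page = 'トヨタ自動車'  # hypothetical example page name
url = 'https://ja.wikipedia.org/wiki/Special:Export/' + urllib.parse.quote(page)
# Wikimedia asks clients to send a descriptive User-Agent.
req = urllib.request.Request(url, headers={'User-Agent': 'profile-example/0.1'})
with urllib.request.urlopen(req) as resp:
  tree = ET.parse(resp)

# The wikitext lives in a <text> element; '{*}' ignores the XML namespace
# (Python 3.8+).
node = tree.find('.//{*}text')
text = node.text if node is not None else ''

# Infobox fields typically look like "| 本社所在地 = 東京都..." in the wikitext.
for m in re.finditer(r'^\|\s*([^=|]+?)\s*=\s*(.+)$', text, re.M):
  print(m.group(1), ':', m.group(2))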


2022-09-30 14:01


