Scrape text information after the "View More" button when searching in the Yahoo! News search window

Asked 1 year ago, Updated 1 year ago, 518 views

This is almost the same as the earlier question "Scrape text information after the "View More" button on the web page", but I couldn't solve it by myself, so I would like to ask a similar question.

I'm trying to scrape text information from the results of a keyword search in the Yahoo! News search window.
There is a "View More" button partway down the page, and I would like to get all the information that follows it as well.
My problem is that I can only get the information that appears before "View More".

The answerer in the question linked above used the Firefox developer tools ("Ctrl" + "Shift" + "I") and said that
if you look at the console, you can find the "page" and "limit" information.
However, I was unable to reproduce it on the PR TIMES page.

It may be that I don't understand the basics of using Firefox.
I'm a beginner, so I can't tell. Even a hint would be fine, so if you know how to do this,
I would appreciate your help.

I'm not even sure that this is the right approach.

The results of the Firefox developer tool are as shown in the image below.

javascript html

2023-03-04 09:39

1 Answer

The API URL is https://news.yahoo.co.jp/api/searchFeed. Its query parameters are query (the search string), results (the number of articles to fetch), start (the index of the first article to fetch), and token (the API token). The API token is embedded in the top page as a JavaScript variable (a hash), so you need to extract it first.
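The token-extraction step can be tried in isolation before touching the network. This is only a minimal sketch: the key name apiAccessToken comes from the answer's code, while the sample markup, the token value, and the helper name extract_token are made up for illustration, and the real page may be laid out differently or change over time.

```python
import re

# Hypothetical fragment standing in for the HTML of the top page;
# the real markup and token value will differ.
sample_html = '<script>window.__STATE__={"apiAccessToken":"abc123","other":"x"}</script>'


def extract_token(html):
    """Return the value of "apiAccessToken" found in html, or None if absent."""
    m = re.search(r'"apiAccessToken":"(.+?)"', html)
    return m[1] if m else None


token = extract_token(sample_html)
print(token)
```

Returning None instead of exiting makes it easier to notice when the page no longer contains the key, for example after a site redesign.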

Searching for Toyota returns more than 10,000 results, so the code below fetches only the first 300.

import requests
import re
import sys
import time
from pprint import pprint

yahoo_news_url = 'https://news.yahoo.co.jp/'
search_url = 'https://news.yahoo.co.jp/api/searchFeed'

# get API token (embedded in the JavaScript of the top page)
r = requests.get(yahoo_news_url)
r.raise_for_status()
token = re.search(r'"apiAccessToken":"(.+?)"', r.text)
if token is None:
    sys.exit(1)
token = token[1]

# get news feed
query = 'Toyota'
start, num_feeds, num_repeats = 1, 50, 6  # 50 articles * 6 repeats = 300 articles
headers = {'content-type': 'application/json'}
articles = []
for idx in range(start, num_feeds * num_repeats, num_feeds):
    params = {'query': query, 'start': idx, 'results': num_feeds, 'token': token}
    r = requests.get(search_url, params=params, headers=headers)
    r.raise_for_status()

    js = r.json()
    for c in js['contents']:
        if 'contentId' in c:
            # remove the \x02/\x03 control characters (presumably highlight markers)
            headline = re.sub(r'[\x02\x03]', '', c['highlightSearchText']['headline'])
            body_text = '…' + re.sub(r'[\x02\x03]', '', c['highlightSearchText']['body']) + '…'
            articles.append({
                'headline': headline, 'body': body_text,
                'publishTime': c['publishTime'], 'permalink': c['permalink']})

    time.sleep(10)  # wait between requests

total_results = js['totalResults']

# show results
print(f'{query=}')
print(f'{total_results=}')
pprint(articles, sort_dicts=False, width=150)

Output:

query='Toyota'
total_results=10471
[{'headline': '[Valley] Saitama Ueo and Yamagishi Akane have participated in 230 V-League games. The second female college graduate in history',
  'body': "… became the second great achievement in history. The team is scheduled to hold an award ceremony after the second game on March 5 (Saitama Ueo vs. Toyota Motor Corporation). \u3000 Yamagishi's profile is as follows…",
  'publishTime': {'date': 'Sunday, March 5', 'time': '1:01'},
  'permalink': 'https://news.yahoo.co.jp/articles/bbee4c9c54db4861de8189f914cdb864bc272f2a'},
 {'headline': "[Highlight video included] Yokohama Canon Eagles vs. Shizuoka Blue Leaves, Friday night's final draw is after a fierce battle. Japan Rugby League One Section 10",
  'body': '…Shizuoka Blue Leaves challenged. The Eagles, aiming to reach the top four, beat the Toyota Verbritz 39-7 in the previous section and won their third consecutive victory. Director Keisuke Sawaki has been FW since the previous section (Fo…',
  'publishTime': {'date': '3/4 (Sat)', 'time': '21:42'},
  'permalink': 'https://news.yahoo.co.jp/articles/93246093f24965722a153a86cbbd81c60966317a'},

                                        :
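To fetch a different number of articles, the start values can be derived from the desired count. The following is only a sketch. It assumes, as in the answer's code, that start is 1-based and that each request returns at most results articles. page_starts is a hypothetical helper name, not part of the Yahoo! News API.

```python
def page_starts(wanted, per_page=50, total_results=None):
    """Return the 1-based start index of each page needed to cover `wanted` articles.

    If total_results is given, never request beyond that many articles.
    """
    if total_results is not None:
        wanted = min(wanted, total_results)
    return list(range(1, wanted + 1, per_page))


print(page_starts(300))                        # [1, 51, 101, 151, 201, 251]
print(page_starts(120, total_results=10471))   # [1, 51, 101]
```

The last page may return more articles than were asked for in total (the third page in the second example covers articles 101 to 150), so the collected list may need to be truncated afterwards.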


2023-03-04 21:29



© 2024 OneMinuteCode. All rights reserved.