Naver Securities Web Crawling

Asked 1 year ago, Updated 1 year ago, 124 views

Hello! I'm a beginner at coding. I've been trying to scrape stock data from the Naver Securities site for the past few days, but I keep running into errors. I'd appreciate it if you could help me.

First, the URL I want to crawl is https://finance.naver.com/item/sise_day.nhn?code=068270&page=1. I open it with urlopen and parse it with BeautifulSoup. When I print the result at this point, I get a "page not found" page containing error_content, and no stock information is printed. I suspected an encoding problem, but I couldn't find the answer. Please help!

Code:

url = 'https://finance.naver.com/item/sise_day.nhn?code=068270&page=1'

with urlopen(url) as doc:
    html = BeautifulSoup(doc, 'lxml') 
    print(html)
    pgrr = html.find('td', class_='pgRR')
    s = str(pgrr.a['href']).split('=')
    last_page = s[-1]  

Output:

AttributeError: 'NoneType' object has no attribute 'a'

crawling web-crawling scraping beautifulsoup

2022-09-20 18:02

1 Answer

Next time, please upload the full code including the module you are using.

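When html is printed, the output is Naver's "page not found" error page, shown below, rather than the stock data. Because that page has no td with class pgRR, html.find('td', class_='pgRR') returns None, and pgrr.a raises the AttributeError.

In this case, you need to use selenium or find another way.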

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Naver:: All knowledge in the world, Naver</title>
<style type="text/css">
.error_content * {margin:0;padding:0;}
.error_content img{border:none;}
.error_content em {font-style:normal;}
.error_content {width:410px; margin:80px auto 0; padding:57px 0 0; font-size:12px; font-family:"나눔고딕", "Nanum Gothic", "돋움", Dotum, AppleGothic, sans-serif; background:url(https://ssl.pstatic.net/static/common/error/090610/bg_thumb.gif) no-repeat center top; white-space:nowrap;}
.error_content p{margin:0;}
.error_content .error_desc {margin-bottom:21px; overflow:hidden; text-align:center;}
.error_content .error_desc2 {margin-bottom:11px; padding-bottom:7px; color:#888; line-height:18px; border-bottom:1px solid #eee;}
.error_content .error_desc3 {clear:both; color:#888;}
.error_content .error_desc3 a {color:#004790; text-decoration:underline;}
.error_content .error_list_type {clear:both; float:left; width:410px; _width:428px; margin:0 0 18px 0; *margin:0 0 7px 0; padding-bottom:13px; font-size:13px; color:#000; line-height:18px; border-bottom:1px solid #eee;}
.error_content .error_list_type dt {float:left; width:60px; _width /**/:70px; padding-left:10px; background:url(https://ssl.pstatic.net/static/common/error/090610/bg_dot.gif) no-repeat 2px 8px;}
.error_content .error_list_type dd {float:left; width:336px; _width /**/:340px; padding:0 0 0 4px;}
.error_content .error_list_type dd span {color:#339900; letter-spacing:0;}
.error_content .error_list_type dd a{color:#339900;}
.error_content p.btn{margin:29px 0 100px; text-align:center;}
</style>
</head>
<!-- ERROR -->
<body>
<div class="error_content">
<p class="error_desc"><img alt="page not found" height="30" src="https://ssl.pstatic.net/static/common/error/090610/txt_desc5.gif" width="319"/></p>
<p class="error_desc2">The address of the page you are trying to visit is entered incorrectly, or <br/>
                The page you requested could not be found because the address of the page has been changed or deleted.<br/>
                Please check again if the address you entered is correct.
        </p>
<p class="error_desc3">For other inquiries, please contact the <a href="https://help.naver.com/" target="_blank">Customer Center</a> and we will kindly guide you. Thank you.</p>
<p class="btn">
<a href="javascript:history.back()"><img alt="previous page" height="35" src="https://ssl.pstatic.net/static/common/error/090610/btn_prevpage.gif" width="115"/></a>
<a href="https://finance.naver.com"><img alt="to financial home" height="35" src="https://ssl.pstatic.net/static/nfinance/btn_home.gif" width="115"/></a>
</p>
</div>
</body>
</html>

I don't know how to crawl this page with urllib.

Using the requests module, you can do it like this:

import requests
from bs4 import BeautifulSoup

url = 'https://finance.naver.com/item/sise_day.nhn?code=068270&page=1'
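# Send a browser-like User-Agent; replace this placeholder with your own browser's string.
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
# The link to the last page is the <a> inside <td class="pgRR">.
href = soup.select_one('td.pgRR > a')['href']
sp = href.split('=')
print(sp)
# ['/item/sise_day.nhn?code', '068270&page', '386']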
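
Splitting on '=' works for this href, but it relies on page being the last query parameter. Below is a small sketch of a more defensive variant that uses only the standard library's urllib.parse. The helper name last_page is hypothetical (not part of the original answer). It returns None when no link was found, which is the case that raised the AttributeError in the question.

```python
from urllib.parse import parse_qs, urlsplit


def last_page(href):
    """Return the page number from an href such as
    '/item/sise_day.nhn?code=068270&page=386', or None if it is missing."""
    if not href:
        return None
    # parse_qs maps each query parameter to a list of its values.
    query = parse_qs(urlsplit(href).query)
    pages = query.get('page')
    return int(pages[0]) if pages else None


print(last_page('/item/sise_day.nhn?code=068270&page=386'))  # 386
print(last_page(None))  # None
```

With a parsed page in soup, this could be called as link = soup.select_one('td.pgRR > a') followed by last_page(link['href'] if link else None), so a missing link yields None instead of an exception.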

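For the urllib approach the question started with, the same header can probably be attached through urllib.request.Request. This is only a sketch, assuming (as the requests example suggests) that the error page is caused by the default User-Agent that urlopen sends. The names build_request and fetch_html and the 'Mozilla/5.0' value are placeholders, not something from the original post.

```python
from urllib.request import Request, urlopen

# Placeholder value (an assumption); substitute your own browser's User-Agent string.
USER_AGENT = 'Mozilla/5.0'


def build_request(url, user_agent=USER_AGENT):
    """Build a urllib Request that carries a custom User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url):
    """Download the page and return the raw response bytes."""
    with urlopen(build_request(url)) as doc:
        return doc.read()


req = build_request('https://finance.naver.com/item/sise_day.nhn?code=068270&page=1')
# urllib capitalizes header names, so the key is stored as 'User-agent'.
print(req.get_header('User-agent'))
```

The bytes returned by fetch_html(url) can be passed to BeautifulSoup in place of doc in the question's code. Whether the server accepts the request still depends on the site's current rules.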

2022-09-20 18:02
