pd.read_html does not work well

Asked 2 years ago, Updated 2 years ago, 45 views

I'm trying to retrieve the table data from the following URLs, but only one of them (aa1) doesn't work.
Is there something wrong with the data on that page? Also, after researching on various sites, I tried encoding="cp932", but it did not help.
For example,

import pandas as pd
aa0="https://db.netkeiba.com/race/201901010101"
pd.read_html(aa0)

Output is

 [   Finish  Frame  No.   Horse Name  Sex/Age  Wt.          Jockey    Time  ...  Horse Wt.
  0       1      1    1     Golconda       M2   54         Lemaire  1:48.3  ...  518 (-16)
  1       2      3    3    Punt Fire       M2   54  Yasumasa Iwata  1:50.1  ...   496 (-8)
  ...
  8       9      6    6  Seiun Julia       F2   54  Kenichi Ikezoe  1:52.6  ...    470 (0)

                   Trainer
  0  [East] Tetsuya Kimura
  1  [East] Takahisa Tezuka
  ...
  8   [West] Hideichi Asami,

  (...plus several more DataFrames: the payout tables (Win, Place,
  Bracket Quinella, Quinella, Wide, Exacta, Trio, Trifecta), two
  premium-only rows ("Track Index" and "Track Comments"), the corner
  passing order, and the lap/pace times.)]

However, if I read the URL assigned to aa1 in the same way,

aa1="https://db.netkeiba.com/race/202204010202"
pd.read_html(aa1)[0]

Error

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-3b6afaad857c> in <module>
      2 aa0 = "https://db.netkeiba.com/race/201901010101"
      3 aa1 = "https://db.netkeiba.com/race/202204010202"
----> 4 pd.read_html(aa1)

~/opt/anaconda3/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298             warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299         return func(*args, **kwargs)
    300 
    301     return wrapper

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1083     io = stringify_path(io)
   1084 
-> 1085     return _parse(
   1086         flavor=flavor,
   1087         io=io,

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    916     for table in tables:
    917         try:
--> 918             ret.append(_data_to_frame(data=table, **kwargs))
    919         except EmptyDataError:  # empty table
    920             continue

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/html.py in _data_to_frame(**kwargs)
    795     # fill out elements of body that are "ragged"
    796     _expand_elements(body)
--> 797     with TextParser(body, header=header, **kwargs) as tp:
    798         return tp.read()
    799 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in TextParser(*args, **kwds)
   2227     """
   2228     kwds["engine"] = "python"
-> 2229     return TextFileReader(*args, **kwds)
   2230 
   2231 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    817             self.options["has_index_names"] = kwds["has_index_names"]
    818 
--> 819         self._engine = self._make_engine(self.engine)
    820 
    821     def close(self):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1048             )
   1049         # error: Too many arguments for "ParserBase"
-> 1050         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1051 
   1052     def _failover_to_python(self):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, **kwds)
   2308                 self.num_original_columns,
   2309                 self.unnamed_cols,
-> 2310             ) = self._infer_columns()
   2311         except (TypeError, ValueError):
   2312             self.close()

~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _infer_columns(self)
   2721                 columns = [names]
   2722             else:
-> 2723                 columns = self._handle_usecols(columns, columns[0])
   2724         else:
   2725             try:

IndexError: list index out of range

is the case. As you can see if you open the aa0 and aa1 URLs directly in a browser, both pages lay out the race result data in the same way. However, the aa1 URL does not scrape correctly. I don't know why, so could you please point out the cause?

python3 pandas

2022-09-30 19:46

1 Answer

pandas.read_html reads every table on the page at once, so you may want to try the tables one at a time to isolate the problem.

Note that accessing the site too frequently amounts to an attack on it, so do not fetch the page over and over.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# url = 'https://db.netkeiba.com/race/201901010101'  # this one works
url = 'https://db.netkeiba.com/race/202204010202'    # this one fails

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
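To keep debugging without hitting the site repeatedly, one option is to download the page once and reuse the saved copy. A minimal sketch; the `cached_get` helper and its cache layout are my own illustration, not part of the answer above:

```python
import pathlib

import requests


def cached_get(url: str, cache_dir: str = "cache") -> bytes:
    """Fetch `url` once and reuse the saved copy on later runs,
    so the site is not requested again while debugging.
    The cache file is simply named after the last path segment."""
    path = pathlib.Path(cache_dir) / url.rstrip("/").rsplit("/", 1)[-1]
    if path.exists():
        return path.read_bytes()          # cache hit: no network access
    path.parent.mkdir(parents=True, exist_ok=True)
    data = requests.get(url).content      # cache miss: fetch exactly once
    path.write_bytes(data)
    return data
```

Later `pd.read_html` or BeautifulSoup calls can then parse the cached bytes instead of re-fetching the page.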

Then it is a good idea to check the tables one by one:

lst = soup('table')
display(f'number of tables: {len(lst)}')
for tbl in lst:
    try:
        dfs = pd.read_html(str(tbl))
        # if len(dfs) == 1:
        #     display(dfs[0])
    except Exception:
        display(tbl)  # show the tables that fail to parse

The problem is that the 4th and 6th of these tables appear to have no content.
(Just in case, I will not show their contents here.)
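Building on that diagnosis, the empty tables can simply be skipped. A minimal sketch; the `read_tables_safely` helper is my own wrapper around the loop above, assuming the pandas 1.x failure mode shown in the traceback:

```python
import pandas as pd
from bs4 import BeautifulSoup


def read_tables_safely(html: str) -> list:
    """Parse each <table> element separately and keep only the
    ones pandas can turn into a DataFrame, skipping empty or
    header-only tables instead of letting them abort the run."""
    frames = []
    for tbl in BeautifulSoup(html, "html.parser")("table"):
        try:
            frames.extend(pd.read_html(str(tbl)))
        except ValueError:
            continue  # "No tables found": the table had no rows
        except IndexError:
            continue  # the failure mode seen in the traceback above
    return frames
```

Used as `frames = read_tables_safely(r.text)`, this returns only the DataFrames that parsed cleanly.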


2022-09-30 19:46



© 2024 OneMinuteCode. All rights reserved.