How can I use a regular expression to extract a 12-character ID that may contain letters as well as digits, taken from the part of a string that follows a particular prefix?

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

response=requests.get("https://db.netkeiba.com/horse/2012104511/")
response.encoding=response.apparent_encoding
html=response.text
soup = BeautifulSoup(html, "lxml")

race_id_list = []
race_a_list = soup.find("table", attrs={"class": "db_h_race_results nk_tb_common"}).find_all(
    "a", attrs={"href": re.compile("^/race/20")})

for a in race_a_list:
    race_id = re.findall(r"\d+", a["href"])
    race_id_list.append(race_id[0])

print(race_a_list)
print(race_id_list)

Output of print(race_a_list) when I run this code:

[<a href="/race/list/20160626/">2016/06/26</a>, <a href="/race/sum/09/20160626/">3阪神8</a>, <a href="/race/201609030811/" title="宝塚記念(G1)">宝塚記念(G1)</a>, <a href="/race/movie/201609030811" target="_blank" title="宝塚記念(G1)の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>, <a href="/race/list/20160326/">2016/03/26</a>, <a href="/race/sum/C7/20160326/">アラブ首</a>, <a href="/race/2016C7a00708/" title="ドバイシーマクラシッ(G1)">ドバイシーマクラシッ(G1)</a>, <a href="/race/list/20160228/">2016/02/28</a>, <a href="/race/sum/06/20160228/">2中山2</a>, <a href="/race/201606020211/" title="中山記念(G2)">中山記念(G2)</a>, <a href="/race/movie/201606020211" target="_blank" title="中山記念(G2)の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>, <a href="/race/list/20150531/">2015/05/31</a>, <a href="/race/sum/05/20150531/">2東京12</a>, <a href="/race/201505021210/" title="東京優駿(G1)">東京優駿(G1)</a>, <a href="/race/movie/201505021210" target="_blank" title="東京優駿(G1)の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>, <a href="/race/list/20150419/">2015/04/19</a>, <a href="/race/sum/06/20150419/">3中山8</a>, <a href="/race/201506030811/" title="皐月賞(G1)">皐月賞(G1)</a>, <a href="/race/movie/201506030811" target="_blank" title="皐月賞(G1)の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>, <a href="/race/list/20150215/">2015/02/15</a>, <a href="/race/sum/05/20150215/">1東京6</a>, <a href="/race/201505010611/" title="共同通信杯(G3)">共同通信杯(G3)</a>, <a href="/race/movie/201505010611" target="_blank" title="共同通信杯(G3)の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>, <a href="/race/list/20150201/">2015/02/01</a>, <a href="/race/sum/05/20150201/">1東京2</a>, <a href="/race/201505010209/" title="セントポーリア賞(500万下)">セントポーリア賞(500万下)</a>, <a href="/race/movie/201505010209" target="_blank" title="セントポーリア賞(500万下)の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>, <a href="/race/list/20141108/">2014/11/08</a>, <a href="/race/sum/05/20141108/">5東京1</a>, <a href="/race/201405050104/" title="2歳未勝利">2歳未勝利</a>, <a href="/race/movie/201405050104" target="_blank" title="2歳未勝利の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>, <a href="/race/list/20141012/">2014/10/12</a>, <a href="/race/sum/05/20141012/">4東京2</a>, <a href="/race/201405040205/" title="2歳新馬">2歳新馬</a>, <a href="/race/movie/201405040205" target="_blank" title="2歳新馬の映像"><img border="0" src="/style/netkeiba.ja/image/icon_douga.png"/></a>]

Output of print(race_id_list) when I run this code:

['201609030811', '2016', '201606020211', '201505021210', '201506030811', '201505010611', '201505010209', '201405050104', '201405040205']

The output I want from print(race_id_list) is:

['201609030811', '2016C7a00708', '201606020211', '201505021210', '201506030811', '201505010611', '201505010209', '201405050104', '201405040205']

I don't know how to change the "\d+" part of race_id = re.findall(r"\d+", a["href"]) so that '2016C7a00708' in the example above is extracted in full.
I would appreciate your advice.

Python 3.8.8

python regular-expression

2022-09-30 15:03

1 Answer

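Use [\da-zA-Z]+ to match digits as well as uppercase and lowercase letters, or [\da-zA-Z]{12} if you want to limit the match to exactly 12 characters.

Note that with [\da-zA-Z]+ the leading "race" in the href also matches, so race_id[0] would be 'race'. The {12} form skips it, because "race" is only 4 characters long.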
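
As a rough illustration of the patterns above, here is a minimal sketch that runs on a few href strings copied from the question instead of the live page. The RACE_ID pattern and the extract_race_ids helper are hypothetical names introduced for this example only; anchoring the pattern on the "/race/" prefix is one possible variation, not the only way to do it.

```python
import re

# Sample hrefs taken from the question's output (no network access needed).
hrefs = [
    "/race/201609030811/",
    "/race/2016C7a00708/",
    "/race/201606020211/",
    "/race/list/20160326/",
]

# \d+ stops at the first non-digit, so the ID is split into pieces.
print(re.findall(r"\d+", hrefs[1]))          # ['2016', '7', '00708']

# [\da-zA-Z]+ also picks up the path segment 'race' as the first match.
print(re.findall(r"[\da-zA-Z]+", hrefs[1]))  # ['race', '2016C7a00708']

# Hypothetical pattern: capture exactly 12 alphanumeric characters
# that directly follow the "/race/" prefix.
RACE_ID = re.compile(r"^/race/([\da-zA-Z]{12})/?$")


def extract_race_ids(hrefs):
    """Return the 12-character race IDs found in the given hrefs."""
    ids = []
    for href in hrefs:
        m = RACE_ID.match(href)
        if m:  # hrefs such as /race/list/... do not match and are skipped
            ids.append(m.group(1))
    return ids


race_id_list = extract_race_ids(hrefs)
print(race_id_list)
```

In the question's loop, the same idea would mean replacing r"\d+" with r"[\da-zA-Z]{12}" and keeping race_id[0].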

2022-09-30 15:03
