Defining a Function to Extract a Specific String in Python Regular Expressions

I have a data frame containing URLs, as shown below.
I would like to extract the values of certain parameters from each URL and store them in new columns.

import pandas as pd

df=pd.DataFrame({'ulr':['https://www.shop.jp/shop/pro_source=google&pro_md=list_1&pro_cp=sweets_&gclid=?skejsurieksoduuuuis23028', 'https://www.shop.jp/shop/pro_source=yahoo&pro_cp=sweets_&pro_md=list_2&pro_cm=gclid=?skejsurieksoduuuuis23028',
'None', 'https://www.shop.jp/shop/pro_source=&pro_md=&pro_cp=sweets_&gclid =?skejsurieksoduuuuis23028', 'https://www.shop.jp/shop/pro_source=google&pro_md=list_1&pro_cp=sweets_&gclid=?skejsurieksoduuuuis23028']})

Specifically, I would like to do the following.

For the pro_source parameter, extract the value and store it in a new column "source".
For the pro_md parameter, extract the value and store it in a new column "md".

The resulting data frame should look like this:

ans=pd.DataFrame({'ulr':['https://www.shop.jp/shop/pro_source=google&pro_md=list_1&pro_cp=sweets_&gclid=?skejsurieksoduuuuis23028', 'https://www.shop.jp/shop/pro_source=yahoo&pro_cp=sweets_&pro_md=list_2&pro_cm=gclid=?skejsurieksoduuuuis23028',
'None', 'https://www.shop.jp/shop/pro_source=&pro_md=&pro_cp=sweets_&gclid =?skejsurieksoduuuuis23028', 'https://www.shop.jp/shop/pro_source=google&pro_md=list_1&pro_cp=sweets_&gclid=?skejsurieksoduuuuis23028'],
                    'source': ['google', 'yahoo', 'No input', '', 'google'],
                    'md': ['list_1', 'list_2', 'No input', '', 'list_1']})

With the sample data frame above, the code below produced the expected results, but when I applied it to the real data I wanted to flag (about 10 million records), I got the error "Error: list index out of range".

import re

def get_source(x):
    if x == 'None':
        return 'No input'
    elif 'pro_source=' in x:
        # Takes the first match; the value must be followed by "&pro_"
        return re.findall('pro_source=(.*?)&pro_', x)[0]
    else:
        return 'Other'

def get_md(x):
    if x == 'None':
        return 'No input'
    elif 'pro_md=' in x:
        # Takes the first match; the value must be followed by "&pro_"
        return re.findall('pro_md=(.*?)&pro_', x)[0]
    else:
        return 'Other'

df['source'] = df['ulr'].apply(get_source)
df['md'] = df['ulr'].apply(get_md)

The sample code above runs without errors, so I believe the program itself is fine. I would appreciate advice on why the error occurs (what kind of data might be causing it) and how to avoid it.

python pandas regular-expression

2022-09-30 21:51

1 Answer

I can't be sure without seeing the data, but I suspect some of the URLs do not have &pro_ after the parameter value. In that case, even though the in check confirms that the parameter is present, findall cannot find the trailing &pro_ and returns an empty list, so taking element [0] of that empty list raises the error.
Why not use just & to mark the end of the extraction range?

return re.findall('pro_source=(.*?)&',x)[0]
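
To see the difference concretely, here is a minimal sketch. The URL is made up for illustration (it is not taken from the real data); it assumes a record in which pro_source is not followed by another pro_ parameter.

```python
import re

# Hypothetical URL: pro_source is not followed by another "&pro_" parameter.
url = 'https://www.shop.jp/shop/pro_md=list_1&pro_source=google&gclid=abc'

# Original pattern: requires "&pro_" after the value, so nothing matches here.
strict = re.findall('pro_source=(.*?)&pro_', url)

# Relaxed pattern: stops at the next "&", whatever comes after it.
relaxed = re.findall('pro_source=(.*?)&', url)

print(strict)   # an empty list, so strict[0] would raise IndexError
print(relaxed)
```

Note that the relaxed pattern still needs an & somewhere after the value, so it returns an empty list when the parameter is the very last thing in the string.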

Alternatively, you could append the terminator to the end of the string so that it can always be found.

return re.findall('pro_source=(.*?)&pro_',x+'&pro_')[0]

With this method, a match is always found even when no & follows the value (that is, even when the value sits at the end of the string).
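
If you would rather not index into a list at all, a defensive variant is also possible. The following is only a sketch: get_param is a hypothetical helper name that does not appear in the question, and it assumes that a value ends at the next & or at the end of the string.

```python
import re

# Hypothetical helper (not from the question): works for any parameter name.
def get_param(x, name):
    if x == 'None':
        return 'No input'
    # Capture everything after "<name>=" up to the next "&" or the end of the string.
    m = re.search(re.escape(name) + r'=([^&]*)', x)
    # re.search returns None when the parameter is absent, so no list is indexed.
    return m.group(1) if m else 'Other'
```

It could then be applied with df['ulr'].apply(get_param, name='pro_source') and df['ulr'].apply(get_param, name='pro_md'). Because a missing parameter yields 'Other' instead of raising, counting the 'Other' rows afterwards shows how many records did not fit the expected shape.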

2022-09-30 21:51
