How to Delete a Specific Row in a Pandas Data Frame

(Questions regarding preprocessing of purchasing data)
I have the following data frames that are numbered consecutively for each id, purchase date, store, product, and sales type.

df=pd.DataFrame({'id':['111', '111', '111', '111', '111', '111', '111', '222', '222', '333', '333', '333', '333', '333', '333', '333', '333', '333', '3' ],
                   'Purchase Date': ['1/5', '1/5', '1/5', '1/5', '1/5', '2/3', '2/3', '2/3', '3/5', '3/5', '4/1', '4/1', '4/1', '4/1', '4/1'],
                   Stores: ['Tokyo', 'Tokyo', 'Tokyo', 'Tokyo', 'Tokyo', 'Chiba', 'Chiba', 'Chiba', 'Tokyo', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba', 'Chiba',
                   'Products': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'D'],
                   Sales Type: ['Sales', 'Sales', 'Return', 'Return', 'Return', 'Return', 'Return', 'Return', 'Return', 'Return', 'Return', 'Sales', 'Return',
                   'cnt': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1 ]})

Of these,
"I would like to delete the ""Return"" record corresponding to ""Sales"" in the same ""id", ""Purchase Date", ""Store"", and ""Product""." is when there is cnt "1" for "Sales" and cnt "1" for "Return".I would like to delete the records of cnt "1" and cnt "1" of "Sales" together (...Sales cnt "2" ·Return cnt "2" are deleted, sales cnt "3" ·Return cnt "3" are deleted...).Leave unsupported records)

If you do this for the df above, the image will return the following data frame:

ans=pd.DataFrame({'id':['111', '222', '333'],
                    Date of purchase: ['1/5', '2/3', '4/1',
                    'Store': ['Tokyo', 'Chiba', 'Chiba', 'Chiba',
                    Product: ['A', 'B', 'D',
                    'Sales Type': ['Sales', 'Return', 'Sales',
                    'cnt': [3,2,2]})

I have about 25 million data, so I would like to write while considering the processing speed. (The basic for statement is slow...?)) If you have any advice, please let me know.

python pandas

2022-09-30 21:53

2 Answers

After grouping, use agg() to determine the number of "sales types" in each.If it is only Sales or Return, it will be 1, so we will extract records with Sales Type number 1.

df.iloc[
  sum(((
    df.assign(idx=df.index)
      .groupby(['id', 'Purchase Date', 'Store', 'Product', 'cnt'], as_index=False)
      .agg({
        'idx': lambdax:list(x)iflen(list(x)) == 1 else[]
      })
  .idx,[])].reset_index(drop=True)

# result of processing

    id Purchase Date Store Product Sales Type cnt
0111 1/5 Tokyo A Sales 3
122 2/3 Chiba B Return 2
2333 4/1 Chiba D Sales 2

add

df.iloc [sum(●●●).idx,[]]]

I understand that iloc specifies the position of rows and columns by numerical values (df.iloc [row, column]), but please supplement the code in this part

I think it will be easy to understand if you look at the progress of the process.

 (df.assign(idx=df.index)
  .groupby(['id', 'Purchase Date', 'Store', 'Product', 'cnt'], as_index=False)
  .agg({
    'idx': lambdax:list(x)iflen(list(x)) == 1 else[]
  }))
=>
    id Date of purchase store item cnt idx
0 111 1/5 Tokyo A1 [ ]
111 1/5 Tokyo A2 [ ]
2111 1/5 Tokyo A3 [2]
3222 2/3 Chiba B1 [ ]
4222 2/3 Chiba B2 [7]
5333 3/5 Tokyo C1 [ ]
63334/1 Chiba D1 [ ]
73334/1 Chiba D2 [11]

where the idx column contains the index value of the original data frame (df).If the line (record) does not meet the requirements, it should be an empty list ([]), because of the sum() later.

Data frame above.idx
=>[[],[],[2],[],[7],[],[],[11]]

sum (data frame .idx, [ ] above)
=> [2,7,11]

sum(lst,[]) has flattened (flatten) the above data frame .idx.

df.iloc [Sum() results above].reset_index(drop=True)

The result of sum() is a list, which is the index value (line number of the original data frame) of the record that meets the extraction criteria.By passing this list to df.iloc[], we select and extract the desired row.

2022-09-30 21:53

Conceptually, this should work.
However, performance is not considered.

drops=[]
grouped=df.groupby(['id', 'Purchase Date', 'Store', 'Product', 'cnt'])
for i, dfw in grouped:
    r=len(dfw.index)
    if r>=2:## The above conditions are the same and there are more than one record
        ## Below is a confirmation that sales and returns are paired and how to deal with them.
        sale = [ ]
        void = [ ]
        subg=dfw.groupby('Sales Type')
        fort, dfs in subg:
            if t == 'Sales':
                sale.extend(dfs.index)
            else:
                void.extend(dfs.index)
        
        slen=len(sale)
        vlen=len(void)
        blen = min(slen, vlen)
        ifslen==vlen:## Same number of sales and returns
            drops.extend(dfw.index)
        elif blen>0:## The number of records is different, but the pair of sales and returns is not zero.
            ## Delete it in the order in which it appears on the list first list.leave the unmatched portion of
            drops.extend(sale[:blen])
            drops.extend (void[:blen])
        ## else —If either one is zero, it is not paired and should not be added to the delete list.

droprows.sort()## keep the delete list sorted in ascending order

ans=df.drop (index=drops)
an.reset_index(drop=True, replace=True)

Adding comments:

pandas.core.groupby.DataFrameGroupBy I don't have any information about the original structure of pandas.core.groupby.DataFrameGroupBy, but <

Accepted by a single variable, the value used to select the group (multiple conditions are its tuple) and the selected portion of the corresponding data frame are tuple. When you separate the variables you receive, they are stored in separate variables.

The selected part of the data frame can be changed as a data frame or retrieved by itself.(I may not have been able to change it, but I have not confirmed it because of the following.) However, if you do something to rewrite the data frame itself, it is only in the loop and cannot be taken out of the loop. Therefore, we have prepared a separate variable (drops here) to save the information.



		
		
			

				

					
				

				
					2022-09-30 21:53

			
			If you have any answers or tips



		

	
		Popular Tags
	
	python x 4647
android x 1593
java x 1494
javascript x 1427
c x 927
c++ x 878
ruby-on-rails x 696
php x 692
python3 x 685
html x 656
	


	
		Popular Questions
	
	
	705 I'm a beginner at Flask. The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.

	1026 /usr/bin/google-chrome:symbol lookup error:/usr/bin/google-chrome: undefined symbol:gbm_bo_get_modifier

	639 Scrap text information after the "View More" button when searching in the Yahoo! News search window

	1253 When building Fast API+Uvicorn environment with PyInstaller, console=False results in an error

	771 Error in x, y, and format string must not be None