Windows 10
Python 3.X
pandas
We would like to process the following df.
It consists of the columns id, number, and classification.
Each id appears on multiple rows, and every id belongs to exactly one of three categories (type1, type2, type3).
The category is written in the classification column, but only on one of the duplicate rows for that id (which row differs from id to id).
The number column holds an integer only on the last row of each group of duplicate ids.
We would like to obtain the number for each classification and eventually find the average, maximum, and minimum of the numbers in each of the three categories. So I would like to look up which classification aaa has, which classification bbb has, and so on, and put their numbers into lists for type1, type2, and type3. However, I am not used to pandas, so I don't know exactly how to write the code. Which pandas functions should I use to collect the numbers?
python pandas
Added rows for testing.
import pandas as pd
import io

csv_data = '''
id,number,classification
aaa,,
aaa,,
aaa,111,type2
bbb,,
bbb,,type1
bbb,222,
ccc,,type3
ccc,,
ccc,333,
ddd,,
ddd,,
ddd,1234,type2
'''
df = pd.read_csv(io.StringIO(csv_data))

# Per id, take the last non-null number and the first non-null classification,
# then collect the numbers into one list per classification.
dic = df.groupby('id').agg({'number': 'last', 'classification': 'first'})\
        .groupby('classification')['number'].agg(list).to_dict()
print(dic)
# {'type1': [222.0], 'type2': [111.0, 1234.0], 'type3': [333.0]}
pandas.core.groupby.GroupBy.first - pandas 1.5.2 documentation
final GroupBy.first(numeric_only=False, min_count=-1)
Compute the first non-null entry of each column.
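To illustrate the documentation quote above, here is a minimal sketch of how first and last skip missing values within each group. The three-row sample and the variable names are made up for illustration, and the positional lookup via apply is only there for contrast.

```python
import io
import pandas as pd

# Hypothetical sample: the label sits on the middle row, the number on the last row.
sample = io.StringIO("id,number,classification\nbbb,,\nbbb,,type1\nbbb,222,\n")
df = pd.read_csv(sample)

g = df.groupby('id')

# first()/last() return the first/last NON-NULL entry per group.
first_label = g['classification'].first()
last_number = g['number'].last()

# For contrast: the value on the first row of each group, which is NaN here.
positional = g['classification'].apply(lambda s: s.iloc[0])

print(first_label['bbb'], last_number['bbb'], positional['bbb'])
```

This is why it does not matter on which of the duplicate rows the classification happens to be written.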
"I'm not used to pandas, so I don't know exactly how to write the code."
I don't know if this is easy to follow, but I use a separate DataFrame for each step, one at a time.
Since the final result needs to be lists, I collect them in a dictionary.
Update: changed the agg specification to last or first.
df = pd.read_csv(tsvf)  # , keep_default_na=False)
# Group by id: keep the last non-null number and the first non-null classification
df2 = df.groupby('id', as_index=False).agg({'number': 'last', 'classification': 'first'})
# Pivot so that ids become rows and classifications become columns
df3 = df2.pivot(index='classification', columns='id', values='number').T
# Collect the numbers into one list per classification (dictionary)
dct = {t: [int(n) for n in df3[t].to_list() if not pd.isnull(n)]
       for t in ('type1', 'type2', 'type3')}
dct
# {'type1': [222], 'type2': [111], 'type3': [333]}
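The question ultimately asks for the average, maximum, and minimum per category. Once there is one row per id, those can probably be computed directly with a second groupby instead of going through lists. The following is only a sketch: per_id is a hypothetical stand-in for the per-id table, filled by hand with the values from the sample data including the added ddd rows.

```python
import pandas as pd

# Hypothetical stand-in for the per-id result (one row per id).
per_id = pd.DataFrame({
    'id': ['aaa', 'bbb', 'ccc', 'ddd'],
    'number': [111, 222, 333, 1234],
    'classification': ['type2', 'type1', 'type3', 'type2'],
})

# One row per classification, one column per statistic.
stats = per_id.groupby('classification')['number'].agg(['mean', 'max', 'min'])
print(stats)
```

With this data, type2 contains 111 and 1234, so its mean is 672.5, its max is 1234, and its min is 111.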
If there is also an item ['dddd', 444, 'type3'], it should end up added to the 'type3' list.
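As a quick check of that case, here is a sketch that feeds the pivot-based steps a per-id table containing the extra dddd item. The table is typed in by hand (an assumption standing in for the grouped result), so this only exercises the pivot and list-building steps.

```python
import pandas as pd

# Hand-written per-id table including the extra item ['dddd', 444, 'type3'].
df2 = pd.DataFrame({
    'id': ['aaa', 'bbb', 'ccc', 'dddd'],
    'number': [111.0, 222.0, 333.0, 444.0],
    'classification': ['type2', 'type1', 'type3', 'type3'],
})

# ids become rows, classifications become columns; missing pairs are NaN.
df3 = df2.pivot(index='classification', columns='id', values='number').T

# Drop the NaN cells and keep the remaining numbers per classification.
dct = {t: [int(n) for n in df3[t].to_list() if not pd.isnull(n)]
       for t in ('type1', 'type2', 'type3')}
print(dct)
# {'type1': [222], 'type2': [111], 'type3': [333, 444]}
```

Because each id gets its own row after the transpose, two ids sharing a classification simply contribute two non-null cells to the same column.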
© 2024 OneMinuteCode. All rights reserved.