ID_11
ID_23
ID_25
ID_21
ID_32
ID_33
…
ID_11
ID_2 1 3 5
ID_323
…
How do I horizontally combine duplicates from a dataset with duplicate IDs like the one above?
I would like to arrange duplicate IDs in ascending order.
There are many ways to remove duplicates, but I can't find a way to combine them, so I'm worried about how to deal with them.
Thank you for your cooperation.
for R
Create a list of overlapping IDs in tapply and apply the functions you like in lapply to make sorting easier.
x<-tapply(c(1,3,5,1,2,3), c('ID_1', 'ID_2', 'ID_2', 'ID_2', 'ID_3', 'ID_3', list)
lapply (x, sort)
I don't know the exact format of the data, so it's an abstract story, but how about using Python's dictionary to write the following:
data=[(1,1),(2,3),(2,5),(2,1),(3,2),(3)]
id_dict={}
For dind data:
id_dict.setdefault(d[0],[]).append(d[1])
for i,lin id_dict.items():
print(i, ":", sorted(l))
Run Results
1: [1]
2 : [1, 3, 5]
3 : [2, 3]
They are aligned by print
to arrange them in ascending order.
I don't know the range of ID values, so I use a dictionary, but I think I might use a list if I know the range (and it's small).
Data format is not specified, but
pandas makes it easy to write using groupby()
.
import pandas as pd
df = pd.DataFrame({
'id': ['ID_1', 'ID_2', 'ID_2', 'ID_2', 'ID_2', 'ID_3', 'ID_3', 'ID_3',
'value': [1,3,5,1,2,3]})
result=df.groupby('id')['value'].apply(list)
You can also use itertools.groupby
.
I use OrderDict
to keep the order in out
, but if you don't need the order in order, a dictionary is OK.
data
must be sorted by key items
https://docs.python.jp/3/library/itertools.html#itertools.groupby
>>>from ittertools import groupby
>> from collections import OrderedDict
>>
>>> data = [
... ("ID_1", 1),
... ("ID_2", 3),
... ("ID_2", 5),
... ("ID_2", 1),
... ("ID_3", 2),
... ("ID_3", 3),
... ]
>> data
[('ID_1', 1', ('ID_2', 3), ('ID_2', 5), ('ID_2', 1), ('ID_3', 2), ('ID_3', 3)]
>>>out = OrderedDict()
>>>fork,gin groupby(data,key=lambda v:v[0]):
... value = [v[1] for ving ]
... out[k] = sorted(value)
...
>>>out
OrderedDict(('ID_1', [1], ('ID_2', [1,3,5]), ('ID_3', [2,3])])
>>dict(out)
{'ID_1':[1], 'ID_2':[1,3,5], 'ID_3':[2,3]}
I think using the dplyr
, tidyr
package is also a good idea for R.
library(dplyr)
library(tidyr)
df = data.frame(
id = c('ID_1', 'ID_2', 'ID_2', 'ID_2', 'ID_2', 'ID_3', 'ID_3'),
x = c (1, 3, 5, 1, 2, 3)
)
df_spread=df%>%
range(id,x)%>%
group_by(id)%>%
mute(no=row_number())%>%
spread(key=no,value=x,sep='_')
print(df_spread)
# > # Atible—4 x 3
#>#Groups:id[3]
#>id no_1 no_2 no_3
#>*<fctr><dbl><dbl>
# > 1 ID_11 NANA
# > 2 ID_2 1 3 5
# > 3 ID_323 NA
© 2024 OneMinuteCode. All rights reserved.