How to Combine Elements with Duplicate IDs

Asked 2 years ago, Updated 2 years ago, 31 views

ID_11
ID_23
ID_25
ID_21
ID_32
ID_33
…
ID_11
ID_2 1 3 5
ID_323
…

How do I horizontally combine duplicates from a dataset with duplicate IDs like the one above?
I would like to arrange duplicate IDs in ascending order.
There are many ways to remove duplicates, but I can't find a way to combine them, so I'm worried about how to deal with them.
Thank you for your cooperation.

python r

2022-09-30 16:37

5 Answers


for R Create a list of overlapping IDs in tapply and apply the functions you like in lapply to make sorting easier.

x<-tapply(c(1,3,5,1,2,3), c('ID_1', 'ID_2', 'ID_2', 'ID_2', 'ID_3', 'ID_3', list)

lapply (x, sort)


2022-09-30 16:37

I don't know the exact format of the data, so it's an abstract story, but how about using Python's dictionary to write the following:

data=[(1,1),(2,3),(2,5),(2,1),(3,2),(3)]

id_dict={}

For dind data:
    id_dict.setdefault(d[0],[]).append(d[1])

for i,lin id_dict.items():
    print(i, ":", sorted(l))

Run Results

1: [1]
2  :  [1, 3, 5]
3  :  [2, 3]

They are aligned by print to arrange them in ascending order.

I don't know the range of ID values, so I use a dictionary, but I think I might use a list if I know the range (and it's small).


2022-09-30 16:37

Data format is not specified, but

pandas makes it easy to write using groupby().

import pandas as pd

df = pd.DataFrame({
    'id': ['ID_1', 'ID_2', 'ID_2', 'ID_2', 'ID_2', 'ID_3', 'ID_3', 'ID_3',
    'value': [1,3,5,1,2,3]})

result=df.groupby('id')['value'].apply(list)


2022-09-30 16:37

You can also use itertools.groupby.
I use OrderDict to keep the order in out, but if you don't need the order in order, a dictionary is OK.
data must be sorted by key items

https://docs.python.jp/3/library/itertools.html#itertools.groupby

>>>from ittertools import groupby
>> from collections import OrderedDict
>>
>>> data = [
...     ("ID_1", 1),
...     ("ID_2", 3),
...     ("ID_2", 5),
...     ("ID_2", 1),
...     ("ID_3", 2),
...     ("ID_3", 3),
... ]
>> data
[('ID_1', 1', ('ID_2', 3), ('ID_2', 5), ('ID_2', 1), ('ID_3', 2), ('ID_3', 3)]
>>>out = OrderedDict()
>>>fork,gin groupby(data,key=lambda v:v[0]):
...     value = [v[1] for ving ]
...     out[k] = sorted(value)
...
>>>out
OrderedDict(('ID_1', [1], ('ID_2', [1,3,5]), ('ID_3', [2,3])])
>>dict(out)
{'ID_1':[1], 'ID_2':[1,3,5], 'ID_3':[2,3]}


2022-09-30 16:37

I think using the dplyr, tidyr package is also a good idea for R.

library(dplyr)
library(tidyr)

df = data.frame(
  id = c('ID_1', 'ID_2', 'ID_2', 'ID_2', 'ID_2', 'ID_3', 'ID_3'),
  x = c (1, 3, 5, 1, 2, 3)
)

df_spread=df%>%
  range(id,x)%>% 
  group_by(id)%>% 
  mute(no=row_number())%>%
  spread(key=no,value=x,sep='_')

print(df_spread)
# > # Atible—4 x 3
#>#Groups:id[3]
#>id no_1 no_2 no_3
#>*<fctr><dbl><dbl>
# > 1 ID_11 NANA
# > 2 ID_2 1 3 5
# > 3 ID_323 NA


2022-09-30 16:37

If you have any answers or tips


© 2024 OneMinuteCode. All rights reserved.