Code to remove all duplicated rows from a data frame in R

Asked 2 years ago, Updated 2 years ago, 41 views

A.sub <- A %>%
  dplyr::filter(kegg_compound != dup_u.df[1, ]) %>%
  dplyr::filter(kegg_compound != dup_u.df[2, ]) %>%
  dplyr::filter(kegg_compound != dup_u.df[3, ]) %>%
  dplyr::filter(kegg_compound != dup_u.df[4, ]) %>%
  dplyr::filter(kegg_compound != dup_u.df[5, ])

A is a data frame whose kegg_compound column contains duplicated gene names. I would like to write code that removes the duplicated gene names from this column, but I can't get it right.
dup_u.df is a data frame holding the gene names that are duplicated in the kegg_compound column of A. I built it as follows:
A1<-A[duplicated(A$kegg_compound),] # A1 holds the second and later occurrences; the first occurrence of each value is left out.
dup<-A1
dup_u<-unique(dup$kegg_compound)
dup_u.df<-data.frame(dup_u)
In this example there were only five duplicated genes, so writing the filters out one by one was enough to reach the goal.
I would like code that can handle more duplicated genes, even when the number of duplicated genes is not known in advance.
A is a data frame with a mix of character, factor, and int columns.
If I write the following for loop instead, the factor columns are coerced to int.
for (i in seq_len(nrow(dup_u.df))) {
  B[i, ] <- A[A$kegg_compound == dup_u.df[i, 1], ]
}

Thank you in advance for your help.

r

2022-09-30 18:22

1 Answer

I am assuming that "take out duplicate lines" means "remove every row whose value is duplicated (without keeping even one copy)". Is that correct?

You seem to be using dplyr, so how about the following? For the sake of explanation, I have made up some sample data:

library(dplyr)
df<-data.frame(
  id = 1:6,
  kegg_compound=c("a", "b", "b", NA, "c", NA)
)
df%>% 
  dplyr::group_by(kegg_compound)%>% 
  dplyr::filter(n()==1)%>% 
  dplyr::ungroup()
#> # A tibble: 2 x 2
#>      id kegg_compound
#>   <int> <fctr>
#> 1     1 a
#> 2     5 c
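
To tie this back to the code in the question: the chain of != filters can probably be collapsed into a single %in% test, which works for any number of duplicated values. This is only a minimal sketch. The data frame A below is a hypothetical stand-in for yours, and dup_u is built the same way as in the question.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data frame A
A <- data.frame(
  id = 1:6,
  kegg_compound = c("a", "b", "b", "c", "d", "d"),
  stringsAsFactors = TRUE
)

# Values that occur more than once, built as in the question
dup_u <- unique(A$kegg_compound[duplicated(A$kegg_compound)])

# One filter replaces the chain of != comparisons,
# however many duplicated values there are
A.sub <- A %>%
  dplyr::filter(!kegg_compound %in% dup_u)

# Rows with id 1 and 4 remain; kegg_compound is still a factor
print(A.sub)
```

Because the rows are only subset and never assigned into another data frame one at a time, the column types of A should be carried over unchanged.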

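If you would rather not depend on dplyr, base R's duplicated() can do the same thing when it is applied from both ends of the column. The sketch below reuses the sample data from the dplyr code; the names is_dup, A_unique and A_keep_na are only illustrative. Note that duplicated() treats NA values as equal to each other, so the two NA rows are dropped here just as they are in the grouped version.

```r
# Same sample data as in the dplyr example, with a factor column
A <- data.frame(
  id = 1:6,
  kegg_compound = c("a", "b", "b", NA, "c", NA),
  stringsAsFactors = TRUE
)

x <- A$kegg_compound

# duplicated(x) flags the second and later occurrences;
# fromLast = TRUE flags the first and earlier ones.
# Combined, every occurrence of a repeated value is TRUE.
is_dup <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Keep only rows whose value appears exactly once (id 1 and 5)
A_unique <- A[!is_dup, ]

# Variant, in case rows with a missing value should be kept (id 1, 4, 5, 6)
A_keep_na <- A[!is_dup | is.na(x), ]

print(A_unique)
print(A_keep_na)
```

Logical indexing of rows leaves each column's type alone, so a factor column stays a factor.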

2022-09-30 18:22
