I want to extract the IDs whose periods overlap between two DataFrames.

Asked 2 years ago, Updated 2 years ago, 49 views

I use python3 and pandas.
I'd like to compare the dates of entries with the same ID across the two DataFrames, df1 and df2, and extract the IDs of the rows whose periods overlap.
The real data has about 100,000 rows.

data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
        [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
        [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-06-20"]]

data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
        [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
        [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]

columns1 = ["ID", "start_date", "end_date"]

df1 = pd.DataFrame(data=data1, columns=columns)
df2 = pd.DataFrame(data=data2, columns=columns)

I would like to compare this by ID and get a list like the one below.

result_list = [1, 3]


result_list = []
for index, row1 in df1.iterrows():
    if not row1["ID"] in result_list:
        for index, row2 in df2.iterrows():
            if row1["ID"] == row2["ID"] and row2["start_date"] <= row1["end_date"] and row2["end_date"] >= row1["start_date"]:
                result_list.append(row1["ID"])
                break
result_list = list(set(result_list))
result_list

I got the result I wanted through trial and error,
but I can't help feeling this is inefficient… Is there a way to avoid the nested for loops?
Thanks in advance for any advice.

python pandas

2022-09-29 21:47

4 Answers

Since for loops are best avoided with pandas: if all you need is overlap detection via DataFrame operations, it looks like the following. The idea is that if you pool all period boundaries per ID and sort them by date, two consecutive boundaries of the same kind (two starts or two ends) can only occur when two periods overlap.

In [1]: import pandas as pd

In [2]: data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
      :         [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
      :         [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-07-20"]]
      :
      : data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
      :         [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
      :         [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]
      :
      : columns = ["ID", "start_date", "end_date"]
      :
      : df1 = pd.DataFrame(data=data1, columns=columns)
      : df2 = pd.DataFrame(data=data2, columns=columns)

In [3]: df = pd.concat([df1, df2], keys=["df1", "df2"])
      : df
Out[3]:
        ID  start_date    end_date
df1 0    1  2010-01-01  2010-01-20
    1    1  2010-03-20  2010-03-30
    2    2  2010-02-01  2010-04-20
    3    3  2010-06-10  2010-06-15
    4    3  2010-06-20  2010-06-30
    5    3  2010-07-10  2010-07-20
df2 0    1  2010-01-15  2010-01-30
    1    1  2010-04-20  2010-04-30
    2    2  2010-05-01  2010-05-10
    3    2  2010-05-20  2010-05-25
    4    3  2010-02-01  2010-02-15
    5    3  2010-06-15  2010-06-25

In [4]: stacked = df.set_index("ID", append=True).stack().to_frame()
      : stacked
Out[4]:
                                 0
       ID
df1 0  1  start_date  2010-01-01
          end_date    2010-01-20
    1  1  start_date  2010-03-20
          end_date    2010-03-30
    2  2  start_date  2010-02-01
          end_date    2010-04-20
    3  3  start_date  2010-06-10
          end_date    2010-06-15
    4  3  start_date  2010-06-20
          end_date    2010-06-30
    5  3  start_date  2010-07-10
          end_date    2010-07-20
df2 0  1  start_date  2010-01-15
          end_date    2010-01-30
    1  1  start_date  2010-04-20
          end_date    2010-04-30
    2  2  start_date  2010-05-01
          end_date    2010-05-10
    3  2  start_date  2010-05-20
          end_date    2010-05-25
    4  3  start_date  2010-02-01
          end_date    2010-02-15
    5  3  start_date  2010-06-15
          end_date    2010-06-25

In [5]: ordered = stacked.reset_index(level=2).sort_values(["ID", 0])
      : ordered
Out[5]:
                     ID           0
df1 0 start_date      1  2010-01-01
df2 0 start_date      1  2010-01-15
df1 0 end_date        1  2010-01-20
df2 0 end_date        1  2010-01-30
df1 1 start_date      1  2010-03-20
      end_date        1  2010-03-30
df2 1 start_date      1  2010-04-20
      end_date        1  2010-04-30
df1 2 start_date      2  2010-02-01
      end_date        2  2010-04-20
df2 2 start_date      2  2010-05-01
      end_date        2  2010-05-10
    3 start_date      2  2010-05-20
      end_date        2  2010-05-25
    4 start_date      3  2010-02-01
      end_date        3  2010-02-15
df1 3 start_date      3  2010-06-10
      end_date        3  2010-06-15
df2 5 start_date      3  2010-06-15
df1 4 start_date      3  2010-06-20
df2 5 end_date        3  2010-06-25
df1 4 end_date        3  2010-06-30
    5 start_date      3  2010-07-10
      end_date        3  2010-07-20

In [14]: seq_df = ordered[["ID"]].reset_index()[["level_2", "ID"]].pipe(
       :     lambda df:
       :     df.assign(
       :         prev_level_2=df.groupby("ID")["level_2"].shift()
       :     )
       : )
       : seq_df
Out[14]:
       level_2  ID prev_level_2
0   start_date   1          NaN
1   start_date   1   start_date
2     end_date   1   start_date
3     end_date   1     end_date
4   start_date   1     end_date
5     end_date   1   start_date
6   start_date   1     end_date
7     end_date   1   start_date
8   start_date   2          NaN
9     end_date   2   start_date
10  start_date   2     end_date
11    end_date   2   start_date
12  start_date   2     end_date
13    end_date   2   start_date
14  start_date   3          NaN
15    end_date   3   start_date
16  start_date   3     end_date
17    end_date   3   start_date
18  start_date   3     end_date
19  start_date   3   start_date
20    end_date   3   start_date
21    end_date   3     end_date
22  start_date   3     end_date
23    end_date   3   start_date
In [15]: overlap_df = seq_df.pipe(
       :     lambda df:
       :     df[df["level_2"] == df["prev_level_2"]]
       : )
       : overlap_df
Out[15]:
       level_2  ID prev_level_2
1   start_date   1   start_date
3     end_date   1     end_date
19  start_date   3   start_date
21    end_date   3     end_date

In [16]: overlap_df["ID"].unique()
Out[16]: array([1, 3])
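If it helps, the steps above condense into one short pipeline. This is a sketch with the same logic, assuming df1 and df2 as defined in In [2]; the grouped shift is taken on the "level_2" column explicitly:

import pandas as pd

df = pd.concat([df1, df2], keys=["df1", "df2"])
stacked = df.set_index("ID", append=True).stack().to_frame()
ordered = stacked.reset_index(level=2).sort_values(["ID", 0])
# Pair every boundary with the previous boundary of the same ID; two
# consecutive boundaries of the same kind mean two periods overlap.
seq_df = ordered[["ID"]].reset_index()[["level_2", "ID"]].assign(
    prev_level_2=lambda d: d.groupby("ID")["level_2"].shift()
)
overlap_ids = seq_df.loc[seq_df["level_2"] == seq_df["prev_level_2"], "ID"].unique()
print(overlap_ids)  # [1 3]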


2022-09-29 21:47

Note:
Afterwards, I read an article about speeding things up and, setting aside the original premise of avoiding nested for loops ("writing no loops"), tried to make this faster as a kind of mental exercise.

The idea is as follows:

·Assume the data is sorted with ID as the first key and start_date as the second key.
·Assume there is no overlap within a single DataFrame.
·Use plain Python lists instead of pandas for the main processing.
·Track the ID and the start_date/end_date ranges so the duplicate-detection loops never scan more rows than necessary.

Sorting the data in advance of the check (in the CSV, etc.) should be acceptable.
It can also be done by calling the following two lines, but that costs time and memory:
the profiler reported about 4.5 ms and 0.406 MiB.

df1 = df1.sort_values(['ID', 'start_date'])
df2 = df2.sort_values(['ID', 'start_date'])

The following is the duplicate detection process. Everything below together took just under 1 ms and 0.078 MiB.

# Convert the pandas DataFrames to plain Python lists
#
df1limit = len(df1)
df2limit = len(df2)
df1ID = df1['ID'].tolist()
df1start = df1['start_date'].tolist()
df1end = df1['end_date'].tolist()
df2ID = df2['ID'].tolist()
df2start = df2['start_date'].tolist()
df2end = df2['end_date'].tolist()

# Main duplicate detection processing
#
result_list = []
df1index = 0
df2index = 0
while df1index < df1limit:
  CurrentID = df1ID[df1index]

  # Adjust the list indexes so that df1 and df2 point at the same 'ID' at the start of the comparison
  #
  df2IDFound = False
  while df2index < df2limit:
    if CurrentID > df2ID[df2index]:  # df2 has a smaller 'ID' and should be skipped
      df2index += 1
    else:
      if CurrentID == df2ID[df2index]:
        df2IDFound = True
      break
  else:
    break  # df2 is exhausted, so the detection process is over

  if not df2IDFound:  # If df2 has no row with the same 'ID' as df1, skip this 'ID' in df1
    while df1index < df1limit and CurrentID == df1ID[df1index]:
      df1index += 1
    continue

  # Duplicate detection by date comparison
  #
  df2compareTop = df2index  # Save the index of the first df2 row with the same 'ID'
  DuplicateFound = False    # Clear the duplicate detection flag

  while df1index < df1limit and CurrentID == df1ID[df1index]:  # Loop while the 'ID' in df1 stays the same
    StartDate1 = df1start[df1index]  # Fetch the df1 comparison dates from the lists (to save a little time)
    EndDate1 = df1end[df1index]      # ditto
    df2index = df2compareTop         # Restore the index of the first df2 row with the same 'ID'

    while df2index < df2limit and CurrentID == df2ID[df2index]:  # Loop while the 'ID' in df2 stays the same
      if df2start[df2index] <= EndDate1 and df2end[df2index] >= StartDate1:  # Overlap comparison
        result_list.append(CurrentID)  # Overlap detected: add to the list
        DuplicateFound = True          # Set the flag to end the loop
        break
      elif df2start[df2index] > EndDate1:  # No further overlap is possible, so end the loop
        break
      else:
        df2index += 1  # Move to the next df2 row with the same 'ID'
    #
    # Bottom of the duplicate detection loop over df2

    df1index += 1
    if DuplicateFound:  # Overlap detected, so end the loop and move on to the next 'ID'
      break
  #
  # Bottom of the duplicate detection loop over df1 rows with the same 'ID'

  while df1index < df1limit and CurrentID == df1ID[df1index]:  # Skip df1 ahead to the next 'ID'
    df1index += 1
  while df2index < df2limit and CurrentID == df2ID[df2index]:  # Skip df2 ahead to the next 'ID'
    df2index += 1

# Bottom of the whole duplicate detection processing loop

print(result_list)  # Display the list of overlapping 'ID's

I read the comments; since there was no need to hold off, here are both approaches I came up with,
even though they may actually be less efficient and the for nesting is deep.

Incidentally, the end date in [3, "2010-07-10", "2010-06-20"] of data1 in the question's source code is probably a typo for "2010-07-20".
Also, the variable name in columns1 = ["ID", "start_date", "end_date"] must be a typo for columns, without the number.

Method 1: consider duplicates within one DataFrame
Reference articles:
How to use pandas Timestamp and date_range
Data concatenation/joining in Python pandas, illustrated
Extracting and removing duplicate rows in pandas.DataFrame and Series

It's a brute-force method that doesn't look very sophisticated.
However, if needed, it can also produce the data for all of the overlapping periods.

  • Concatenate both DataFrames
  • Group by 'ID' and do the following per ID
  • Expand each period into one row per day and concatenate them into a DataFrame
  • Check for duplicates and get the duplicate count in one line using pandas functionality
  • If there are duplicates, the processing continues and the ID is added to the result list
  • If there are no duplicates, an exception is raised; exploit that by skipping the ID and ignoring the exception

Concretely, it works like this:

df3 = pd.concat([df1, df2], ignore_index=True)

result_list = []
for ID, grp in df3.groupby('ID'):
  arr = []
  for index, item in grp.iterrows():
    arr.append(pd.DataFrame(pd.date_range(start=item['start_date'], end=item['end_date'])))

  arr = pd.concat(arr, ignore_index=True)
  try:
    dup = arr.duplicated().value_counts()[True]
    result_list.append(ID)
  except KeyError:  # value_counts()[True] raises KeyError when there are no duplicates
    pass

print(result_list)
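Two small variants as a side note (my additions, not part of the answer above): the try/except can be replaced by a boolean check, and the overlapping days themselves can be extracted with duplicated(keep=False):

# Inside the loop above, instead of the try/except:
if arr.duplicated().any():  # True if any day occurs more than once
  result_list.append(ID)

# And if the actual overlapping dates for this ID are needed:
overlap_days = arr[arr.duplicated(keep=False)].drop_duplicates()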

"Duplicate Between Two DataFrames" Method
Reference Articles
Duplicate Date Duration
Determine if the two periods overlap.
Break from multiple loops (nested for loops) in Python

The for nesting becomes deeper, but the per-row ID comparison is eliminated, so it may be slightly more efficient. The overlap test is the standard condition from the referenced article: two closed periods [s1, e1] and [s2, e2] overlap iff s2 <= e1 and s1 <= e2.

  • Group DataFrame1 by 'ID' and do the following per ID
  • Extract the rows with the same ID from DataFrame2
  • Iterate over the DataFrame1 rows in a middle loop and the extracted DataFrame2 rows in an inner loop, checking for overlap with the method from the reference article
  • If there is an overlap, add the ID to the result list, break out of the middle loop, and move on to the next ID

Concretely:

result_list = []
for ID, grp1 in df1.groupby('ID'):
  grp2 = df2[df2['ID'] == ID]
  for index1, item1 in grp1.iterrows():
    start1 = item1['start_date']
    end1 = item1['end_date']
    for index2, item2 in grp2.iterrows():
      if item2['start_date'] <= end1 and start1 <= item2['end_date']:
        result_list.append(ID)
        break
    else:
      continue
    break

print(result_list)

Prompted by the opening of @Yuki Inoue's answer, I searched for "pandas for slow"; there seem to be many articles, all making the same point.
I'll leave them here as a memo for reference:
A loose and casual for loop over pandas.DataFrame, improved to run 300x faster
"Hmm... is my pandas too slow...?" and things to try when it happens (wisdom from predecessors)
A collection of tips for speeding up the preprocessing of 10 million records with pandas
Notes on quickly adjusting values row by row while looking at multiple columns in pandas
Speeding up apply on a pandas DataFrame
Pandas: slow date conversion / How to speed up read_csv in pandas (disk)
Pandas acceleration techniques / How to handle for loops in pandas
Summary of Python speed-ups Ver.1 / Mistakes developers often make when using Python for big data analysis
Why pandas.to_datetime() can be 500x slower on data that does not conform to ISO format, and what to do about it

Addendum & correction:
Actually, the questioner's processing was the fastest.
That holds only for this number of rows and this particular data, though; a hundred thousand rows would be a completely different story.
I have corrected the numbers below because the original measurement method was wrong.

Time required:
Question: around 4.7 ms, @kunif-1: 20-21 ms, @kunif-2: 8 ms, @Yuki Inoue: 17.7 ms

Measured over the processing part after df1 and df2 are created, following "Measuring the execution performance of Python programs".

Increase in memory usage:
Question: 0.110 MiB, @kunif-1: 0.773 MiB, @kunif-2: 0.370-0.430 MiB, @Yuki Inoue: 0.797-0.812 MiB

Measured over the same processing part using the memory_profiler from the "Check memory usage" section of the same article.
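As a reference for reproducing this kind of measurement, a minimal sketch (my sketch, assuming the standard timeit module and the memory_profiler package; the articles referenced above may use a different setup such as IPython magics):

import pandas as pd
from timeit import timeit
from memory_profiler import memory_usage  # pip install memory_profiler

def process(df1, df2):
    # The part under test: the questioner's nested-loop overlap check.
    found = set()
    for _, r1 in df1.iterrows():
        if r1["ID"] in found:
            continue
        for _, r2 in df2.iterrows():
            if (r1["ID"] == r2["ID"]
                    and r2["start_date"] <= r1["end_date"]
                    and r2["end_date"] >= r1["start_date"]):
                found.add(r1["ID"])
                break
    return sorted(found)

cols = ["ID", "start_date", "end_date"]
df1 = pd.DataFrame([[1, "2010-01-01", "2010-01-20"]], columns=cols)
df2 = pd.DataFrame([[1, "2010-01-15", "2010-01-30"]], columns=cols)

# Time only the processing, after df1/df2 have been created:
print(timeit(lambda: process(df1, df2), number=100) / 100 * 1000, "ms per run")
# Peak memory while running the same processing:
print(max(memory_usage((process, (df1, df2)))), "MiB")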


2022-09-29 21:47

I managed to do it relatively loop-free.

Points to note:

  • The loop over the list of IDs could probably still be eliminated

  • With pandas.Interval, closed='right' determines which interval a shared boundary belongs to: with (1, 3] and (3, 7], for example, where does 3 fall? (This can sometimes be a problem)

  • The result is not a list (for now) but only printed (True means the periods overlap)


import pandas as pd
data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
        [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
        [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-07-20"]]

data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
        [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
        [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]

columns1 = ["ID", "start_date", "end_date"]

df1 = pd.DataFrame(data=data1, columns=columns1)
df2 = pd.DataFrame(data=data2, columns=columns1)

for df in (df1, df2):
    df['interval'] = df.astype({'start_date': 'M8', 'end_date': 'M8'}).apply(
        lambda v: pd.Interval(v.start_date, v.end_date), axis=1)

for num in pd.unique(pd.concat([df1, df2]).ID):
    iv1, iv2 = [df.loc[df.ID == num, 'interval'] for df in (df1, df2)]
    chk = iv2.map(pd.arrays.IntervalArray(iv1).overlaps)
    print(num, chk.map(any).any())

# 1 True
# 2 False
# 3 True
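The closed='right' caveat from the notes above can be checked directly; it only matters when one period ends exactly where another starts (a small demonstration, plain pandas):

import pandas as pd

a = pd.Interval(1, 3)  # default closed='right', i.e. (1, 3]
b = pd.Interval(3, 7)  # (3, 7]
print(a.overlaps(b))   # False: 3 belongs to a but not to b

# closed='both' counts the shared endpoint, matching the questioner's
# inclusive start/end comparison:
c = pd.Interval(1, 3, closed='both')
d = pd.Interval(3, 7, closed='both')
print(c.overlaps(d))   # True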

Alternatively, instead of adding an extra column to df1 and df2, prepare Series.
The processing steps are almost the same as above.

(Selecting with iv2.index == iv.name is fine whether an ID exists on both sides or on only one side.)

iv1, iv2 = [df.astype({'start_date': 'M8', 'end_date': 'M8'})
              .apply(lambda v: [v.ID, pd.Interval(v.start_date, v.end_date)], axis=1, result_type='expand')
              .set_index(0)[1]  # columns 0 and 1 are the ID and the pd.Interval, respectively
                    for df in (df1, df2)]

res = iv1.groupby(level=0).apply(lambda iv:
    iv2[iv2.index == iv.name].map(pd.arrays.IntervalArray(iv).overlaps)
                             .map(any).any())
res.index[res].to_list()
# [1, 3]


2022-09-29 21:47

Compare all combinations for each ID by brute force.

Note:
[Python] Evaluate all combinations from two arrays

import numpy as np

data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
        [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
        [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-06-20"]]

data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
        [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
        [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]

arr1 = np.array(data1)
arr2 = np.array(data2)

result_list = []

for id in np.unique(arr1[:, 0]):
    _arr2 = arr2[arr2[:, 0] == id]
    if len(_arr2) == 0:
        continue
    _arr1 = arr1[arr1[:, 0] == id]
    xx = np.array([[x for x in _arr1] for _ in range(len(_arr2))]).reshape(-1, 3)
    yy = np.array([[y for _ in range(len(_arr1))] for y in _arr2]).reshape(-1, 3)
    if np.any((xx[:, 1] <= yy[:, 2]) & (yy[:, 1] <= xx[:, 2])):
        result_list.append(id)
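The same all-pairs test can also be written with NumPy broadcasting instead of building the tiled xx/yy arrays. A sketch of that alternative (my code, not the author's; ISO date strings compare correctly as plain strings):

import numpy as np

data1 = [[1, "2010-01-01", "2010-01-20"], [3, "2010-06-10", "2010-06-15"]]
data2 = [[1, "2010-01-15", "2010-01-30"], [3, "2010-02-01", "2010-02-15"]]
arr1, arr2 = np.array(data1), np.array(data2)

result_list = []
for id_ in np.unique(arr1[:, 0]):
    a = arr1[arr1[:, 0] == id_]
    b = arr2[arr2[:, 0] == id_]
    if len(b) == 0:
        continue
    # Broadcast starts against ends to get a (len(a), len(b)) matrix of
    # pairwise overlap tests for this ID:
    overlap = (a[:, None, 1] <= b[None, :, 2]) & (b[None, :, 1] <= a[:, None, 2])
    if overlap.any():
        result_list.append(id_)

print(result_list)  # ['1'] for this reduced sample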


2022-09-29 21:47


