I use python3 and pandas.
I'd like to compare the dates of people with the same ID across two DataFrames, df1 and df2, and extract the IDs of the rows whose periods overlap.
The real data has about 100,000 rows.
data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
         [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
         [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-06-20"]]
data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
         [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
         [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]
columns1 = ["ID", "start_date", "end_date"]
df1 = pd.DataFrame(data=data1, columns=columns)
df2 = pd.DataFrame(data=data2, columns=columns)
I would like to compare these by ID and get a list like the one below:
result_list = [1, 3]
result_list = []
for index, row1 in df1.iterrows():
    if not row1["ID"] in result_list:
        for index, row2 in df2.iterrows():
            if row1["ID"] == row2["ID"] and row2["start_date"] <= row1["end_date"] and row2["end_date"] >= row1["start_date"]:
                result_list.append(row1["ID"])
                break
result_list = list(set(result_list))
result_list
I was able to get the result I wanted through trial and error, but I can't help feeling it's inefficient. Is there a way to avoid the nested for statements?
Thank you for your advice.
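As an editorial aside (not from any of the answers below): one common loop-free pattern is to inner-join the two frames on ID and then test the overlap condition, start2 <= end1 and start1 <= end2, in one vectorized step. A minimal sketch on a subset of the question's data; note that merge materializes every same-ID row pair, which can get large if one ID has many rows:

```python
import pandas as pd

cols = ["ID", "start_date", "end_date"]
df1 = pd.DataFrame([[1, "2010-01-01", "2010-01-20"],
                    [2, "2010-02-01", "2010-04-20"],
                    [3, "2010-06-10", "2010-06-15"]], columns=cols)
df2 = pd.DataFrame([[1, "2010-01-15", "2010-01-30"],
                    [2, "2010-05-01", "2010-05-10"],
                    [3, "2010-06-15", "2010-06-25"]], columns=cols)

# Pair every df1 row with every df2 row sharing the same ID
merged = df1.merge(df2, on="ID", suffixes=("_1", "_2"))
# Two periods overlap iff each one starts no later than the other ends
# (ISO date strings compare correctly as plain strings)
overlap = ((merged["start_date_2"] <= merged["end_date_1"])
           & (merged["end_date_2"] >= merged["start_date_1"]))
result_list = sorted(merged.loc[overlap, "ID"].unique())
print(result_list)  # → [1, 3]
```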
Loops are best avoided in pandas, so if all you need is the overlap detection done as DataFrame processing, it can be written as follows:
In[1]: import pandas as pd
In[2]: data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
     : [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
     : [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-07-20"]]
     :
     : data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
     : [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
     : [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]
     :
     : columns = ["ID", "start_date", "end_date"]
     :
     : df1 = pd.DataFrame(data=data1, columns=columns)
     : df2 = pd.DataFrame(data=data2, columns=columns)
In[3]: df = pd.concat([df1, df2], keys=["df1", "df2"])
     : df
Out[3]:
        ID  start_date    end_date
df1 0    1  2010-01-01  2010-01-20
    1    1  2010-03-20  2010-03-30
    2    2  2010-02-01  2010-04-20
    3    3  2010-06-10  2010-06-15
    4    3  2010-06-20  2010-06-30
    5    3  2010-07-10  2010-07-20
df2 0    1  2010-01-15  2010-01-30
    1    1  2010-04-20  2010-04-30
    2    2  2010-05-01  2010-05-10
    3    2  2010-05-20  2010-05-25
    4    3  2010-02-01  2010-02-15
    5    3  2010-06-15  2010-06-25
In[4]: stacked = df.set_index("ID", append=True).stack().to_frame()
     : stacked
Out[4]:
                                  0
       ID
df1 0  1  start_date  2010-01-01
          end_date    2010-01-20
    1  1  start_date  2010-03-20
          end_date    2010-03-30
    2  2  start_date  2010-02-01
          end_date    2010-04-20
    3  3  start_date  2010-06-10
          end_date    2010-06-15
    4  3  start_date  2010-06-20
          end_date    2010-06-30
    5  3  start_date  2010-07-10
          end_date    2010-07-20
df2 0  1  start_date  2010-01-15
          end_date    2010-01-30
    1  1  start_date  2010-04-20
          end_date    2010-04-30
    2  2  start_date  2010-05-01
          end_date    2010-05-10
    3  2  start_date  2010-05-20
          end_date    2010-05-25
    4  3  start_date  2010-02-01
          end_date    2010-02-15
    5  3  start_date  2010-06-15
          end_date    2010-06-25
In[5]: ordered = stacked.reset_index(level=2).sort_values(['ID', 0])
     : ordered
Out[5]:
                  ID           0
df1 0 start_date   1  2010-01-01
df2 0 start_date   1  2010-01-15
df1 0 end_date     1  2010-01-20
df2 0 end_date     1  2010-01-30
df1 1 start_date   1  2010-03-20
      end_date     1  2010-03-30
df2 1 start_date   1  2010-04-20
      end_date     1  2010-04-30
df1 2 start_date   2  2010-02-01
      end_date     2  2010-04-20
df2 2 start_date   2  2010-05-01
      end_date     2  2010-05-10
    3 start_date   2  2010-05-20
      end_date     2  2010-05-25
    4 start_date   3  2010-02-01
      end_date     3  2010-02-15
df1 3 start_date   3  2010-06-10
      end_date     3  2010-06-15
df2 5 start_date   3  2010-06-15
df1 4 start_date   3  2010-06-20
df2 5 end_date     3  2010-06-25
df1 4 end_date     3  2010-06-30
    5 start_date   3  2010-07-10
      end_date     3  2010-07-20
In[14]: seq_df = ordered[["ID"]].reset_index()[["level_2", "ID"]].pipe(
      :     lambda df:
      :         df.assign(
      :             prev_level_2=df.groupby("ID")["level_2"].shift()
      :         )
      : )
      : seq_df
Out[14]:
       level_2  ID prev_level_2
0   start_date   1          NaN
1   start_date   1   start_date
2     end_date   1   start_date
3     end_date   1     end_date
4   start_date   1     end_date
5     end_date   1   start_date
6   start_date   1     end_date
7     end_date   1   start_date
8   start_date   2          NaN
9     end_date   2   start_date
10  start_date   2     end_date
11    end_date   2   start_date
12  start_date   2     end_date
13    end_date   2   start_date
14  start_date   3          NaN
15    end_date   3   start_date
16  start_date   3     end_date
17    end_date   3   start_date
18  start_date   3     end_date
19  start_date   3   start_date
20    end_date   3   start_date
21    end_date   3     end_date
22  start_date   3     end_date
23    end_date   3   start_date
In[15]: overlap_df = seq_df.pipe(
      :     lambda df:
      :         df[df["level_2"] == df["prev_level_2"]]
      : )
      : overlap_df
Out[15]:
       level_2  ID prev_level_2
1   start_date   1   start_date
3     end_date   1     end_date
19  start_date   3   start_date
21    end_date   3     end_date
In[16]: overlap_df["ID"].unique()
Out[16]: array([1, 3])
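The session above sorts the pooled start/end events per ID and flags an overlap whenever two consecutive events have the same type (start,start or end,end). As an editorial aside, the same idea can be condensed with melt; a sketch on hypothetical two-ID toy data, not the answer's exact code:

```python
import pandas as pd

# Toy frames: ID 1 overlaps across the frames, ID 2 does not
df1 = pd.DataFrame([[1, "2010-01-01", "2010-01-20"], [2, "2010-02-01", "2010-04-20"]],
                   columns=["ID", "start_date", "end_date"])
df2 = pd.DataFrame([[1, "2010-01-15", "2010-01-30"], [2, "2010-05-01", "2010-05-10"]],
                   columns=["ID", "start_date", "end_date"])

# Pool all start/end events and sort them by (ID, date)
events = (pd.concat([df1, df2])
            .melt(id_vars="ID", var_name="kind", value_name="date")
            .sort_values(["ID", "date"]))
# If two consecutive events within an ID have the same kind, the intervals interleave
events["prev_kind"] = events.groupby("ID")["kind"].shift()
ids = sorted(events.loc[events["kind"] == events["prev_kind"], "ID"].unique())
print(ids)  # → [1]
```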
Review:
After this, I read an article on speeding pandas up and, setting aside the original premise of avoiding nested for statements ("write no loops"), tried to make it faster as a kind of mental exercise. My thinking was as follows:
·Assume the data has been sorted with ID as the first key and start_date as the second key.
·Assume there is no overlap within a single DataFrame.
·Use plain Python lists instead of pandas for the main processing.
·Check the ID, start_date, and end_date ranges so that the duplicate-detection loop never runs over unnecessary ranges.
Sorting the data in advance of the check (in the CSV, etc.) should be acceptable, shouldn't it?
The sorting can be done by calling the following two lines, but it costs time and memory; the profiler reported about 4.5 ms and 0.406 MiB.
df1 = df1.sort_values(['ID', 'start_date'])
df2 = df2.sort_values(['ID', 'start_date'])
The duplicate-detection process follows. All of it together took just under 1 ms and 0.078 MiB.
# Convert the pandas DataFrames to plain Python lists
#
df1limit = len(df1)
df2limit = len(df2)
df1ID = df1['ID'].tolist()
df1start = df1['start_date'].tolist()
df1end = df1['end_date'].tolist()
df2ID = df2['ID'].tolist()
df2start = df2['start_date'].tolist()
df2end = df2['end_date'].tolist()
# Main duplicate-detection processing
#
result_list = []
df1index = 0
df2index = 0
while df1index < df1limit:
    CurrentID = df1ID[df1index]
    # Align the list indices so that df1 and df2 point at the same 'ID' before comparing
    #
    df2IDFound = False
    while df2index < df2limit:
        if CurrentID > df2ID[df2index]:  # df2's 'ID' is smaller and should be skipped
            df2index += 1
        else:
            if CurrentID == df2ID[df2index]:
                df2IDFound = True
            break
    else:
        break  # df2 is exhausted, so the detection process is over
    if not df2IDFound:  # df2 has no row with df1's current 'ID', so skip that 'ID' in df1
        while df1index < df1limit and CurrentID == df1ID[df1index]:
            df1index += 1
        continue
    # Duplicate detection by date comparison
    #
    df2compareTop = df2index  # Save the first index of this 'ID' in df2
    DuplicateFound = False  # Clear the duplicate-detection flag
    while df1index < df1limit and CurrentID == df1ID[df1index]:  # Loop while df1's 'ID' stays the same
        StartDate1 = df1start[df1index]  # Fetch df1's comparison dates from the lists once (saves a little time)
        EndDate1 = df1end[df1index]
        df2index = df2compareTop  # Restore the first index of this 'ID' in df2
        while df2index < df2limit and CurrentID == df2ID[df2index]:  # Loop while df2's 'ID' stays the same
            if df2start[df2index] <= EndDate1 and df2end[df2index] >= StartDate1:  # Overlap comparison
                result_list.append(CurrentID)  # Overlap detected: add the 'ID' to the list
                DuplicateFound = True  # Set the flag to end the outer loop
                break
            elif df2start[df2index] > EndDate1:  # No further overlap is possible, so stop
                break
            else:
                df2index += 1  # Advance df2 to the next row with the same 'ID'
        #
        # Bottom of the duplicate-detection loop over df2
        df1index += 1
        if DuplicateFound:  # An overlap was found, so move on to the next 'ID'
            break
    #
    # Bottom of the duplicate-detection loop over rows of df1 with the same 'ID'
    while df1index < df1limit and CurrentID == df1ID[df1index]:  # Skip df1 ahead to the next 'ID'
        df1index += 1
    while df2index < df2limit and CurrentID == df2ID[df2index]:  # Skip df2 ahead to the next 'ID'
        df2index += 1
# Bottom of all duplicate-detection loops
print(result_list)  # Print the list of overlapping 'ID's
I saw the comments, but rather than wait for an answer I'll post what I came up with. Both approaches may actually be inefficient, and the for nesting may be deep, but here they are.
Incidentally, the end date "2010-06-20" in [3, "2010-07-10", "2010-06-20"] of data1 in the question's source code is probably a typo for "2010-07-20". Also, the variable columns1 = ["ID", "start_date", "end_date"] appears to be a typo: the DataFrames are created with columns, without the number.
"Duplicates Within One DataFrame" Method
Reference Articles
How to use pandas Timestamp and date_range
Concatenation and joining of data in Python pandas, explained with diagrams
Extracting and deleting duplicate rows from pandas.DataFrame and Series
It's a brute-force method that doesn't look very sophisticated. However, if needed, it can also return the data for all of the overlapping periods. It works as follows:
df3 = pd.concat([df1, df2], ignore_index=True)
result_list = []
for ID, grp in df3.groupby('ID'):
    arr = []
    for index, item in grp.iterrows():
        # Expand each period into its individual dates
        arr.append(pd.DataFrame(pd.date_range(start=item['start_date'], end=item['end_date'])))
    arr = pd.concat(arr, ignore_index=True)
    try:
        # If any date appears more than once, the periods overlap
        dup = arr.duplicated().value_counts()[True]
        result_list.append(ID)
    except KeyError:
        pass
print(result_list)
"Duplicates Between Two DataFrames" Method
Reference Articles
Overlapping date periods
Determining whether two periods overlap
Breaking out of multiple (nested) for loops in Python
The for nesting gets deeper, but the per-row ID comparison is eliminated, so it may be slightly more efficient. It works as follows:
result_list = []
for ID, grp1 in df1.groupby('ID'):
    grp2 = df2[df2['ID'] == ID]
    for index1, item1 in grp1.iterrows():
        start1 = item1['start_date']
        end1 = item1['end_date']
        for index2, item2 in grp2.iterrows():
            if item2['start_date'] <= end1 and start1 <= item2['end_date']:
                result_list.append(ID)
                break
        else:
            continue
        break
print(result_list)
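The break-out-of-nested-loops idiom used above (inner break skips the for/while else clause, so the outer break runs; a completed inner loop hits else: continue instead) can be confusing. A tiny standalone illustration with hypothetical numbers:

```python
# Find the first (outer, inner) pair whose combined index equals 4.
# The inner `break` bypasses `else: continue`, so the outer `break` fires;
# if the inner loop completes normally, `else: continue` moves the outer
# loop to its next iteration.
found = None
for outer in range(3):
    for inner in range(3):
        if outer * 3 + inner == 4:
            found = (outer, inner)
            break
    else:
        continue
    break
print(found)  # → (1, 1)
```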
@Yuki Inoue's opening remark led me to search for "pandas for slow", and there turn out to be many articles on the subject; that seems to be the general principle. I'll leave them here as a memo for reference:
A loose and fluffy for loop over pandas.DataFrame, improved to 300x faster
"Huh... is my pandas this slow...?" and things to try when it happens (wisdom of predecessors)
A collection of tips for speeding up the preprocessing of 10 million rows of data with pandas
Notes on quickly adjusting values row by row while looking at multiple columns in pandas
Speeding up the application of functions over a pandas DataFrame
Pandas: slow date conversion / How to speed up read_csv in pandas (disk)
Pandas acceleration techniques / How to handle for loops in pandas
A summary of Python speed-ups, Ver. 1 / Mistakes developers often make when using Python for big-data analysis
Why pandas.to_datetime() can be up to 500x slower on data that isn't ISO-formatted, and what to do about it
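The articles above all circle one point: row-wise iteration in pandas is slow compared with column-wise (vectorized) operations. A small, assumption-laden demonstration on synthetic data (sizes and columns are arbitrary):

```python
import time
import pandas as pd

# Synthetic frame: sum two columns row by row vs. vectorized
n = 10_000
df = pd.DataFrame({"a": range(n), "b": range(n)})

t0 = time.perf_counter()
loop_sum = sum(row["a"] + row["b"] for _, row in df.iterrows())
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
vec_sum = int((df["a"] + df["b"]).sum())
t_vec = time.perf_counter() - t0

# Same result either way; only the time differs
print(f"iterrows: {t_loop:.3f} s  vectorized: {t_vec:.5f} s")
```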
Addition & correction:
Actually, the question's own processing was the fastest. That holds only under these conditions (this data size and content); with a hundred thousand rows the picture would be totally different.
The numbers below are corrected, because the original measurement method was wrong.
Time required:
Question: around 4.7 ms, @kunif-1: 20-21 ms, @kunif-2: 8 ms, @Yuki Inoue: 17.7 ms
(Measured only the processing part, after df1 and df2 are created, following "Measuring the execution performance of a Python program".)
Memory usage:
Question: 0.110 MiB, @kunif-1: 0.773 MiB, @kunif-2: 0.370-0.430 MiB, @Yuki Inoue: 0.797-0.812 MiB
(Measured the same processing part with the memory_profiler from "Checking memory usage" in the same article.)
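The timing procedure described above (build df1/df2 once, then measure only the processing step) can be reproduced with the standard-library timeit module. A minimal sketch with toy data; the memory_profiler used for the MiB figures is omitted here:

```python
import timeit
import pandas as pd

# Toy frames standing in for the question's df1/df2; built once, outside the timing
cols = ["ID", "start_date", "end_date"]
df1 = pd.DataFrame([[1, "2010-01-01", "2010-01-20"],
                    [3, "2010-06-10", "2010-06-15"]], columns=cols)
df2 = pd.DataFrame([[1, "2010-01-15", "2010-01-30"],
                    [2, "2010-05-01", "2010-05-10"]], columns=cols)

def process():
    # The question's double loop, timed in isolation from DataFrame creation
    result = []
    for _, r1 in df1.iterrows():
        if r1["ID"] not in result:
            for _, r2 in df2.iterrows():
                if (r1["ID"] == r2["ID"]
                        and r2["start_date"] <= r1["end_date"]
                        and r2["end_date"] >= r1["start_date"]):
                    result.append(r1["ID"])
                    break
    return result

elapsed = timeit.timeit(process, number=100)  # setup cost is excluded
print(f"100 runs: {elapsed:.4f} s; result: {process()}")
```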
I managed it with relatively few loops.
Points to note:
·The ID list really should not need a loop.
·With pandas.Interval and closed='right', which interval does a shared endpoint belong to? For example, do (1, 3) and (3, 7) overlap? (This can sometimes be a problem.)
·The result is not collected into a list (for now) but only printed (True means the periods overlap).
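The closed='right' question raised in the notes can be checked directly. With the default closed='right', (1, 3] and (3, 7] share no point (3 belongs to the first interval but not the second), whereas closing both endpoints makes the shared 3 count:

```python
import pandas as pd

a = pd.Interval(1, 3)  # default closed='right' → (1, 3]
b = pd.Interval(3, 7)  # (3, 7] does not contain 3
print(a.overlaps(b))   # → False: only an open endpoint is shared

c = pd.Interval(1, 3, closed='both')  # [1, 3]
d = pd.Interval(3, 7, closed='both')  # [3, 7]
print(c.overlaps(d))   # → True: the closed endpoint 3 is shared
```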
import pandas as pd

data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
         [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
         [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-07-20"]]
data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
         [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
         [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]
columns1 = ["ID", "start_date", "end_date"]
df1 = pd.DataFrame(data=data1, columns=columns1)
df2 = pd.DataFrame(data=data2, columns=columns1)

for df in (df1, df2):
    # Convert the date strings to datetime64 and build an Interval per row
    df['interval'] = df.astype({'start_date': 'M8[ns]', 'end_date': 'M8[ns]'}).apply(
        lambda v: pd.Interval(v.start_date, v.end_date), axis=1)

for num in pd.unique(pd.concat([df1, df2]).ID):
    iv1, iv2 = [df.loc[df.ID == num, 'interval'] for df in (df1, df2)]
    # For each df2 interval, check overlap against all df1 intervals of this ID
    chk = iv2.map(pd.arrays.IntervalArray(iv1).overlaps)
    print(num, chk.map(any).any())
# 1 True
# 2 False
# 3 True
How about preparing Series instead of adding an item to df1 and df2? The processing steps are almost the same as before. (Even when an ID exists on only one side rather than on both, filtering with iv2[iv2.index == iv.name] is still OK.)
iv1, iv2 = [df.astype({'start_date': 'M8[ns]', 'end_date': 'M8[ns]'})
              .apply(lambda v: [v.ID, pd.Interval(v.start_date, v.end_date)],
                     axis=1, result_type='expand')
              .set_index(0)[1]  # columns 0 and 1 are the ID and the pd.Interval, respectively
            for df in (df1, df2)]
res = iv1.groupby(level=0).apply(lambda iv:
        iv2[iv2.index == iv.name].map(pd.arrays.IntervalArray(iv).overlaps)
           .map(any).any())
res.index[res].to_list()
# [1, 3]
Compare all combinations for each ID (round-robin).
Reference:
[Python] Evaluating all combinations of two arrays
import numpy as np

data1 = [[1, "2010-01-01", "2010-01-20"], [1, "2010-03-20", "2010-03-30"],
         [2, "2010-02-01", "2010-04-20"], [3, "2010-06-10", "2010-06-15"],
         [3, "2010-06-20", "2010-06-30"], [3, "2010-07-10", "2010-06-20"]]
data2 = [[1, "2010-01-15", "2010-01-30"], [1, "2010-04-20", "2010-04-30"],
         [2, "2010-05-01", "2010-05-10"], [2, "2010-05-20", "2010-05-25"],
         [3, "2010-02-01", "2010-02-15"], [3, "2010-06-15", "2010-06-25"]]

arr1 = np.array(data1)
arr2 = np.array(data2)
result_list = []
for id in np.unique(arr1[:, 0]):
    _arr2 = arr2[arr2[:, 0] == id]
    if len(_arr2) == 0:
        continue
    _arr1 = arr1[arr1[:, 0] == id]
    # Expand to every (arr1 row, arr2 row) pair for this ID
    xx = np.array([[x for x in _arr1] for _ in range(len(_arr2))]).reshape(-1, 3)
    yy = np.array([[y for _ in range(len(_arr1))] for y in _arr2]).reshape(-1, 3)
    # String comparison works because the dates are ISO-formatted
    if np.any((xx[:, 1] <= yy[:, 2]) & (yy[:, 1] <= xx[:, 2])):
        result_list.append(id)
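As an aside, the pairwise expansion above, built with nested list comprehensions, can also be written with np.tile and np.repeat, which avoids the Python-level loops. A sketch on a hypothetical two-row example (same ISO date-string layout is assumed):

```python
import numpy as np

_arr1 = np.array([["1", "2010-01-01", "2010-01-20"],
                  ["1", "2010-03-20", "2010-03-30"]])
_arr2 = np.array([["1", "2010-01-15", "2010-01-30"]])

xx = np.tile(_arr1, (len(_arr2), 1))       # _arr1 repeated as a block for each _arr2 row
yy = np.repeat(_arr2, len(_arr1), axis=0)  # each _arr2 row repeated for each _arr1 row

# Same element-wise overlap test as the loop version; ISO strings compare correctly
mask = (xx[:, 1] <= yy[:, 2]) & (yy[:, 1] <= xx[:, 2])
print(bool(mask.any()))  # → True
```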
© 2024 OneMinuteCode. All rights reserved.