This is my 4th year as an engineer (I haven't experienced python).
I had to develop a script using python in my business, and I looked it up myself and wrote the following script, but the performance was so bad that I wanted to tune it, but I just started studying and it's hard to decide on a tuning policy...
As the delivery date is close, we have discussed it with you.
I would appreciate it if someone could let me know about tuning, even if it's just a small matter.
Especially, I want to tune the double for statement part of the script.
import pandas as pd
import numpy as np
from pyspark import SparkConf, SparkContext
from pyspark.sqlimport SparkSession
from pyspark.shell import SQLContext
import time
spark=SparkSession.builder.master("yarn").config(conf=SparkConf()) .getOrCreate()
sc=SparkContext.getOrCreate()
sqlContext=SQLContext(sc)
spark.conf.set("park.sql.execution.arrow.enabled", "true")
spark=SparkSession.builder\
.appName('Spark SQL and DataFrame')\
.getOrCreate()
# File Storage Path for S3
filepass_aaa='s3://******/****/test/aaa.csv'
filepass_bb='s3://******/****/test/bbb.csv'
# Read CSV file from S3 to generate Spark DataFreame
sdf_aaa=sqlContext.read.format("com.databricks.park.csv").option("header", "true").load(filepass_aaa)
sdf_bb=sqlContext.read.format("com.databricks.park.csv").option("header", "true").load(filepass_bb)
# Change to DataFreame type in pandas
pdf_aaa=sdf_aaa.toPandas()
pdf_bbb=sdf_bbb.toPandas()
Replace the corresponding value of aaa with the value of bbb if the conditions are met.
def mapping (row, row2):
pdf_aaa.at [(pdf_aaa['ID']==row[0]), '○○ID']=row2[2]
pdf_aaa.at [(pdf_aaa['ID'] == row[0], '××'] = row2[1]
pdf_aaa.at [(pdf_aaa['ID']==row[0]), '△△△']=row2[3]
pdf_aaa.at [(pdf_aaa['ID']==row[0], '□□□']=row2[0]
pdf_aaa.at [(pdf_aaa['ID']==row[0]), '■■■']=row2[4]
pdf_aaa.at [(pdf_aaa['ID']==row[0]), '◇◇◇']=row2[5]
for index, row in pdf_aaa.iterrows():
for index2, row2inpdf_bbb.iterrows():
# Conditions: Name + Address + TEL
if(row[7]==row2[6])&(row[9]==row2[8])&(row[8]==row2[7]):
mapping(row,row2)
break
# Conditions: Name + TEL
elif(row[7]==row2[6])&(row[8]==row2[7]):
mapping(row,row2)
break
# Conditions: Name + Address
elif(row[7]==row2[6])&(row[9]==row2[8]):
mapping(row,row2)
break
# Conditions: TEL+ Address
elif(row[8]==row2[7])&(row[9]==row2[8]):
mapping(row,row2)
break
# Conditions: Address + URL
elif(row[9]==row2[8])&(row[10]==row2[9]):
mapping(row,row2)
break
# Conditions: TEL+URL
elif(row[8]==row2[7])&(row[10]==row2[9]):
mapping(row,row2)
break
# Conditions: TEL
elif(row[8]==row2[7]):
mapping(row,row2)
break
else:
continue
break
pdf_out=pdf_aaa [~pdf_aaa['○○ID'].isnull()]
pdf_out_null = pdf_aaa [pdf_aaa['○○ID'].isnull()]
pdf_out.to_csv('result.csv', header=True, index=False)
pdf_out_null.to_csv('result_null.csv', header=True, index=False)
】What I want to realize したい
Compare DataFrame aaa data with DataFrame bbb data and replace specific aaa data with bbb data only if certain conditions are met
There are about 1 million data for both aaa and bbb
In order
(1)
First of all, with this kind of question, you don't need to retrieve the data that occupies the first half of the code at all, so you can omit it.
However, be sure to indicate what kind of data each DataFrame has instead.If possible, dummy is fine, so I would appreciate it if you could show me the data that can actually operate the code.
This time, for now,
[pdf_aaa configuration]
No | Column Name | Contents
0 | ID | ID Number
1 | □□□ | Some kind of information
2 | x x | some information
3 | ○○ ID | some information
4 | △△△ | Some kind of information
5 | ■■■ | Some kind of information
6 | ◇◇◇ | Some kind of information
7 | Name | Name
8 | Tel | Phone Number
9 | Addr | Address
10 | URL | URL
[pdf_bbb]
No | Column Name | Contents
Overwrite information for 0 | □□□ | □□□
1|××|××××××××××××××××××××××
2 | ○○ ID | ○○ ID Overwrite Information
3 | △△△ | △△△ Overwrite Information
4 | ■■■ | ■■■ Overwrite Information
5 | ◇◇◇ | ◇◇◇ Overwrite Information
6 | Name | Name
7 | Tel | Phone Number
8 | Addr | Address
9 | URL | URL
Assume and equal times
(2)
Current Code
elif(row[8]==row2[7]):
It seems that there are many places where the column number is written as shown in , but if possible, set the appropriate column name and
elif(row['Tel']==row2['Tel']):
I think it would be easier to understand what is being done and serviceability would be better if you write it like this.
Below is the column name described in (1).
(3)
mapping()
in the function
def mapping(row,row2):
pdf_aaa.at [(pdf_aaa['ID']==row[0]), '○○ID']=row2[2]
...
As shown in , I have re-selected the line from pdf_aaa in the ID column, but the Index value of pdf_aaa has already been obtained in the loop, so
def mapping (idx, row2):
pdf_aaa.at [idx, '○○ID'] = row2[2]
...
I think it would be good to
At this point, the caller is mapping(index,row2)
.
Also, I will update 6 places together
def mapping (idx, row2):
pdf_aaa.loc [idx, '□□□': '◇◇◇'] = row2 ['□□□': '◇◇◇'].values
I think you can also write it like this.
(4)
Loops are very slow in Pandas, so it is safe to avoid them if possible.
For now, we will consider how to eliminate the inner loop.There is no point in comparing lines one by one using Pandas Filtering, so
for index, row in pdf_aaa.iterrows():
# Conditions: Name + Address + TEL
# Filter pdf_bbb under the above conditions
row2 = pdf_bb.loc [(pdf_bbb['Name'] == row['Name'])&
(pdf_bbb['Addr'] == row['Addr']) &
(pdf_bbb['Tel'] == row['Tel']]
iflen(row2)>0:
# pass only the first line considering multiple lines are selected
mapping (index, row2.iloc[0])
break
I think I can write like this
I think you can eventually take out the outer loop in Pandas.merge(), but the configuration is completely different, so that's it for now.
916 When building Fast API+Uvicorn environment with PyInstaller, console=False results in an error
613 GDB gets version error when attempting to debug with the Presense SDK (IDE)
573 rails db:create error: Could not find mysql2-0.5.4 in any of the sources
© 2024 OneMinuteCode. All rights reserved.