performance tuning in python,pandas

This is my 4th year as an engineer (I haven't experienced python).
I had to develop a script using python in my business, and I looked it up myself and wrote the following script, but the performance was so bad that I wanted to tune it, but I just started studying and it's hard to decide on a tuning policy...
As the delivery date is close, we have discussed it with you.
I would appreciate it if someone could let me know about tuning, even if it's just a small matter.
Especially, I want to tune the double for statement part of the script.

import pandas as pd
import numpy as np

from pyspark import SparkConf, SparkContext

from pyspark.sqlimport SparkSession
from pyspark.shell import SQLContext
import time

spark=SparkSession.builder.master("yarn").config(conf=SparkConf()) .getOrCreate()
sc=SparkContext.getOrCreate()
sqlContext=SQLContext(sc)

spark.conf.set("park.sql.execution.arrow.enabled", "true")


spark=SparkSession.builder\
        .appName('Spark SQL and DataFrame')\
        .getOrCreate()

# File Storage Path for S3
filepass_aaa='s3://******/****/test/aaa.csv'
filepass_bb='s3://******/****/test/bbb.csv'

# Read CSV file from S3 to generate Spark DataFreame
sdf_aaa=sqlContext.read.format("com.databricks.park.csv").option("header", "true").load(filepass_aaa)
sdf_bb=sqlContext.read.format("com.databricks.park.csv").option("header", "true").load(filepass_bb)

# Change to DataFreame type in pandas
pdf_aaa=sdf_aaa.toPandas()
pdf_bbb=sdf_bbb.toPandas()

Replace the corresponding value of aaa with the value of bbb if the conditions are met.
def mapping (row, row2):
    pdf_aaa.at [(pdf_aaa['ID']==row[0]), '○○ID']=row2[2]
    pdf_aaa.at [(pdf_aaa['ID'] == row[0], '××'] = row2[1]
    pdf_aaa.at [(pdf_aaa['ID']==row[0]), '△△△']=row2[3]
    pdf_aaa.at [(pdf_aaa['ID']==row[0], '□□□']=row2[0]
    pdf_aaa.at [(pdf_aaa['ID']==row[0]), '■■■']=row2[4]
    pdf_aaa.at [(pdf_aaa['ID']==row[0]), '◇◇◇']=row2[5]


for index, row in pdf_aaa.iterrows():

    for index2, row2inpdf_bbb.iterrows():

        # Conditions: Name + Address + TEL
        if(row[7]==row2[6])&(row[9]==row2[8])&(row[8]==row2[7]):
            mapping(row,row2)
            break
        # Conditions: Name + TEL
        elif(row[7]==row2[6])&(row[8]==row2[7]):
            mapping(row,row2)
            break
        # Conditions: Name + Address
        elif(row[7]==row2[6])&(row[9]==row2[8]):
            mapping(row,row2)
            break
        # Conditions: TEL+ Address
        elif(row[8]==row2[7])&(row[9]==row2[8]):
            mapping(row,row2)
            break
        # Conditions: Address + URL
        elif(row[9]==row2[8])&(row[10]==row2[9]):
            mapping(row,row2)
            break
        # Conditions: TEL+URL
        elif(row[8]==row2[7])&(row[10]==row2[9]):
            mapping(row,row2)
            break
        # Conditions: TEL
        elif(row[8]==row2[7]):
            mapping(row,row2)
            break
        else:
            continue
            break

pdf_out=pdf_aaa [~pdf_aaa['○○ID'].isnull()]
pdf_out_null = pdf_aaa [pdf_aaa['○○ID'].isnull()]

pdf_out.to_csv('result.csv', header=True, index=False)
pdf_out_null.to_csv('result_null.csv', header=True, index=False)

】What I want to realize したい
Compare DataFrame aaa data with DataFrame bbb data and replace specific aaa data with bbb data only if certain conditions are met
There are about 1 million data for both aaa and bbb

python3 pandas spark

2022-09-30 19:51

1 Answers

In order

(1)
First of all, with this kind of question, you don't need to retrieve the data that occupies the first half of the code at all, so you can omit it.
However, be sure to indicate what kind of data each DataFrame has instead.If possible, dummy is fine, so I would appreciate it if you could show me the data that can actually operate the code.
This time, for now,

[pdf_aaa configuration]

 No | Column Name | Contents
0 | ID | ID Number
1 | □□□ | Some kind of information
2 | x x | some information
3 | ○○ ID | some information
4 | △△△ | Some kind of information
5 | ■■■ | Some kind of information
6 | ◇◇◇ | Some kind of information
7 | Name | Name
8 | Tel | Phone Number
9 | Addr | Address
10 | URL | URL

[pdf_bbb]

 No | Column Name | Contents
Overwrite information for 0 | □□□ | □□□
1|××|××××××××××××××××××××××
2 | ○○ ID | ○○ ID Overwrite Information
3 | △△△ | △△△ Overwrite Information
4 | ■■■ | ■■■ Overwrite Information
5 | ◇◇◇ | ◇◇◇ Overwrite Information
6 | Name | Name
7 | Tel | Phone Number
8 | Addr | Address
9 | URL | URL

Assume and equal times

(2)
Current Code

elif(row[8]==row2[7]):

It seems that there are many places where the column number is written as shown in , but if possible, set the appropriate column name and

elif(row['Tel']==row2['Tel']):

I think it would be easier to understand what is being done and serviceability would be better if you write it like this.

Below is the column name described in (1).

(3)
mapping() in the function

def mapping(row,row2):
    pdf_aaa.at [(pdf_aaa['ID']==row[0]), '○○ID']=row2[2]
    ...

As shown in , I have re-selected the line from pdf_aaa in the ID column, but the Index value of pdf_aaa has already been obtained in the loop, so

def mapping (idx, row2):
    pdf_aaa.at [idx, '○○ID'] = row2[2]
    ...

I think it would be good to

At this point, the caller is mapping(index,row2).

Also, I will update 6 places together

def mapping (idx, row2):
    pdf_aaa.loc [idx, '□□□': '◇◇◇'] = row2 ['□□□': '◇◇◇'].values

I think you can also write it like this.

(4)
Loops are very slow in Pandas, so it is safe to avoid them if possible.
For now, we will consider how to eliminate the inner loop.There is no point in comparing lines one by one using Pandas Filtering, so

for index, row in pdf_aaa.iterrows():

    # Conditions: Name + Address + TEL
    # Filter pdf_bbb under the above conditions
    row2 = pdf_bb.loc [(pdf_bbb['Name'] == row['Name'])&
                       (pdf_bbb['Addr'] == row['Addr']) &
                       (pdf_bbb['Tel'] == row['Tel']]
    iflen(row2)>0:
        # pass only the first line considering multiple lines are selected
        mapping (index, row2.iloc[0])
        break

I think I can write like this

I think you can eventually take out the outer loop in Pandas.merge(), but the configuration is completely different, so that's it for now.

2022-09-30 19:51

If you have any answers or tips

Popular Tags

python x 4647

android x 1593

java x 1494

javascript x 1427

c x 927

c++ x 878

ruby-on-rails x 696

php x 692

python3 x 685

html x 656