Python data scaling question: StandardScaler

Asked 2 years ago, Updated 2 years ago, 46 views

Hello, it's been a while. (__)

I learned that before fitting a model, the data needs to be rescaled, i.e., the mean and variance need to be adjusted by scaling.

Usually, we split the data into a train set and a test set with train_test_split and then scale them.

But when I scaled the entire data frame first, split it with train_test_split, and then evaluated accuracy_score,

the score changed, which made me curious.

Is it correct to split into train and test sets first and then scale, or to scale first and then split?

And when fitting the scaler, comparing this:

scaler1 = StandardScaler()
scaler1.fit(X_train)
X_train_scale = scaler1.transform(X_train)
X_test_scale = scaler1.transform(X_test) 

with the following:

scaler1 = StandardScaler()
X_train_scale = scaler1.fit_transform(X_train)
X_test_scale = scaler1.fit_transform(X_test) 

Is there a difference???

python python3

2022-09-20 08:54

2 Answers

In principle, you should split first and then fit the scaler only on the training data. Reason: if the test data is used to fit the scaler, that constitutes data leakage.

However, in a setting like Kaggle, where a separate test set is held out and the train/validation split only happens during cross-validation on the training data, you can scale the entire training data to squeeze out as much score as possible.

(The analysis proceeds on the assumption that the training data is fully known while the test data is unseen.)
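
Below is a minimal sketch of the leakage-free approach, assuming scikit-learn, a synthetic make_classification dataset, and LogisticRegression as a stand-in for whatever model you actually use. Keeping the scaler inside a Pipeline means each CV fold re-fits it on that fold's training portion only.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data; replace with your own X, y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler lives inside the pipeline, so each CV training fold
# gets its own fit and the validation fold never leaks into the scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("CV accuracy:", cross_val_score(pipe, X_train, y_train, cv=5).mean())

# Final fit on the full train set, evaluation on the untouched test set.
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))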


2022-09-20 08:54

You shouldn't do it the second way: each call to fit_transform re-fits the scaler, so the train set and the test set end up being scaled with different statistics (different means and standard deviations).
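
Here is a minimal sketch of why that matters, assuming NumPy and scikit-learn; the shifted test distribution is made up purely for illustration. fit_transform on the test set forces it to mean 0 / std 1 using its own statistics, which is not the scale the model was trained on.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 1))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 1))  # deliberately shifted

scaler = StandardScaler().fit(X_train)
test_via_train_stats = scaler.transform(X_test)               # uses train mean/std
test_via_test_stats = StandardScaler().fit_transform(X_test)  # uses test mean/std

print("train mean/std:", scaler.mean_[0], scaler.scale_[0])
print("transform(X_test) mean:", test_via_train_stats.mean())     # stays far from 0
print("fit_transform(X_test) mean:", test_via_test_stats.mean())  # forced to ~0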


2022-09-20 08:54


