Differences in evaluation between training and test data

Asked 2 years ago, Updated 2 years ago, 199 views

I'm working on a competition problem in Python where the task is to predict a certain numeric value.

I extracted only the usable features from the given training data, and extracted the same features from the test data.

https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard/notebook
Following the notebook above, I used the extracted data to make predictions with LASSO Regression, Elastic Net Regression, Kernel Ridge Regression, Gradient Boosting Regression, XGBoost, and LightGBM, and added those predictions as new features.
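For reference, a minimal sketch of what "adding base-model predictions as features" can look like, assuming out-of-fold predictions are used on the training side so that each new column is not produced by a model that has already seen that row's target. The data, model choices, and hyperparameters here are placeholders, not the notebook's actual code.

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-ins for the extracted train/test feature tables
X_all, y_all = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr = X_all[:800], X_all[800:], y_all[:800]
train_feats = pd.DataFrame(X_tr)
test_feats = pd.DataFrame(X_te)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
base_models = {'lasso': Lasso(alpha=0.01), 'enet': ElasticNet(alpha=0.01)}

for name, model in base_models.items():
    # Out-of-fold predictions become a new feature column for the training rows
    train_feats[name] = cross_val_predict(model, X_tr, y_tr, cv=kf)
    # For the test rows, fit on all training data and predict
    test_feats[name] = model.fit(X_tr, y_tr).predict(X_te)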

Using these features, I split the data 7:3 into training and validation sets and trained a model; the R2 score was 0.85, the training loss was 0.1378, and the validation loss was 0.1248.
When I predicted the test data with this model, however, the R2 score was only 0.55.

I checked the features of both the training data and the test data with stats.shapiro(); the p-value is 0 or extremely close to 0, so I think they follow a normal distribution.
The same was true of the target value in the training data.
There was also little difference in the maximum and minimum values.
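For reference, a small sketch of the normality check with stats.shapiro() on synthetic data. Note that the test's null hypothesis is that the data are normally distributed, so a p-value near 0 is evidence against normality.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=500)            # stand-in for one feature column

stat, p_value = stats.shapiro(sample)
print(f'W={stat:.4f}, p={p_value:.4f}')  # a large p-value gives no evidence against normality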

I would like to know why the evaluation results differ between the training (validation) data and the test data.
I would also like to know how to improve generalization performance, other than cross-validation.
I'm not sure whether the following is correct, but I did run cross-validation with the code below.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X = train[cat_vars + cont_vars + ['xgb', 'lgb', 'stacked', 'emblemable']]
y = train[['Score']]

# 7:3 hold-out split (X_train/X_test are created but not used below)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=0)

lr = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=1)
lr.fit(X, y)  # not strictly needed: cross_val_score refits a clone on each fold
splitter = kf.split(X, y)
print(cross_val_score(lr, X, y, cv=splitter, scoring='r2'))

Results

[0.888343 0.885379 0.891729 0.881329 0.899762]
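For reference, a minimal sketch of also scoring the 7:3 hold-out split created above, reusing the variable names from that code (the hold-out split is currently created but never evaluated):

from sklearn.metrics import r2_score

lr_holdout = LinearRegression()
lr_holdout.fit(X_train, Y_train)

print('train R2   :', r2_score(Y_train, lr_holdout.predict(X_train)))
print('hold-out R2:', r2_score(Y_test, lr_holdout.predict(X_test)))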

python machine-learning deep-learning scikit-learn pytorch

2022-09-30 21:44

2 Answers

I would like to know why the evaluation results differ between the training (validation) data and the test data.

The reason is simple: machine learning models are trained on the training data and therefore produce results adapted to it. Evaluation on the training data is generally better (lower loss) because the model has, in effect, already seen the answers.

Test data and cross-validation data, on the other hand, must not be used for training and should only be used for evaluation. If the model generalizes well, their scores will also be good; if it is overfitting, they will be worse.

Reference: Wikipedia - Overfitting
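As a small illustration of this gap (with synthetic data, not the question's data), an unconstrained model scores far better on the data it was fitted to than on held-out data:

from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)  # no depth limit: it memorizes the noise
print('train R2:', r2_score(y_tr, tree.predict(X_tr)))        # essentially 1.0
print('test  R2:', r2_score(y_te, tree.predict(X_te)))        # noticeably lower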

I would also like to know how to improve generalization performance, other than cross-validation.

As pointed out in the comments, cross-validation is not a method for improving generalization performance but a way of measuring it, so instead I will list some things that, in my experience, have helped improve generalization. The basic approach is to plot the loss on the training data and on the cross-validation (test) data, and take countermeasures against whatever problem the two curves reveal.
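A minimal sketch of such a plot, with a toy PyTorch model and synthetic data (all names and settings here are illustrative, not the question's setup):

import matplotlib.pyplot as plt
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(600, 20)
y = X @ torch.randn(20, 1) + 0.5 * torch.randn(600, 1)
X_tr, y_tr, X_va, y_va = X[:400], y[:400], X[400:], y[400:]

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

train_losses, val_losses = [], []
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_tr), y_tr)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())

    model.eval()
    with torch.no_grad():
        val_losses.append(loss_fn(model(X_va), y_va).item())

# If the curves diverge (training loss keeps falling while validation loss rises), the model is overfitting
plt.plot(train_losses, label='train loss')
plt.plot(val_losses, label='validation loss')
plt.xlabel('epoch')
plt.ylabel('MSE')
plt.legend()
plt.show()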

Summary: the most common ways to prevent overfitting in neural networks are as follows (a sketch of the last two items is shown after this list):

Get more training data
Reduce the capacity of the network
Add weight regularization
Add dropout
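As a sketch of the last two items (PyTorch, since it appears in the question's tags; the layer sizes and hyperparameters are illustrative):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training only
    nn.Linear(64, 1),
)

# weight_decay adds an L2 penalty on the weights at each optimizer update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)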

Personally, I have also found it effective to halve the learning rate as training progresses.
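One way to do this, assuming the PyTorch optimizer from the sketch above, is a StepLR scheduler, which multiplies the learning rate by gamma at fixed epoch intervals:

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=50, gamma=0.5)  # halve the learning rate every 50 epochs

# in the training loop, once per epoch after optimizer.step():
#     scheduler.step()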


2022-09-30 21:44

I extracted only the usable features from the given training data

On what basis did you extract the features?
Important features may have been left out, or unnecessary features may have been included.

Instead of simply extracting features, why not try something like the following? (A sketch of both is shown below.)

·Remove features that are strongly correlated with each other
·Compute feature importances and drop the features with low importance
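A rough sketch of both suggestions (synthetic data; the 0.95 and 0.01 thresholds are arbitrary examples):

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_arr, y = make_regression(n_samples=500, n_features=15, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(15)])
X['f15'] = X['f0'] * 0.95 + np.random.default_rng(0).normal(scale=0.1, size=500)  # near-duplicate of f0

# 1) Drop one of each pair of features whose absolute correlation exceeds a threshold
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)

# 2) Rank features by importance and drop the least important ones
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_reduced, y)
importances = pd.Series(rf.feature_importances_, index=X_reduced.columns).sort_values()
low_importance = importances[importances < 0.01].index
X_final = X_reduced.drop(columns=low_importance)

print('dropped (correlated):    ', to_drop)
print('dropped (low importance):', list(low_importance))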


2022-09-30 21:44
