Hello. I'm asking a question because I got stuck on my university assignment. The task is to build a logistic regression model and report its performance on a training dataset.
In #4 Logistic regression, the model is created with

model_formula = sm.Logit.from_formula("Survived ~ Age + Parch + Fare + Pclass_1 + Pclass_2 + Pclass_3", df)

and I wonder why the set of variables in "Survived ~ Age + Parch + Fare + Pclass_1 + Pclass_2 + Pclass_3" was chosen this way. Here is the full code:
import pandas as pd
import numpy as np
#1. Read Data
df = pd.read_csv('C:/Users/minki/Downloads/Sample_data.csv', header=0)
df['Sex'] = df['Sex'].astype('category')
df['Pclass'] = df['Pclass'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')
df = pd.get_dummies(df)
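## get_dummies one-hot encodes the category columns, e.g. Pclass becomes
## the Pclass_1 / Pclass_2 / Pclass_3 columns used in the formula below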
#2 data scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
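## note: df_scaled is never used below; the split and the model both work on the unscaled df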
#3 data splitting
from sklearn.model_selection import train_test_split
Y = df['Survived']
X = df.iloc[:, 1:12] ##important to check the index
print(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
#4 Logistic regression
import statsmodels.api as sm
model_formula = sm.Logit.from_formula("Survived ~ Age + Parch + Fare + Pclass_1 + Pclass_2 + Pclass_3", df)
result_model = model_formula.fit() ##Build your log reg.
print(result_model.summary()) ## check the regression result
print(np.exp(result_model.params)) ## calculate the odds ratio of each variable
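## (an odds ratio > 1 means the variable raises the odds of Survived = 1; < 1 lowers them)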
Y_pred = result_model.predict(X_test) ## predict "Survived" probabilities on the test dataset
Y_pred = list(map(round, Y_pred)) ## round to 0/1, i.e. threshold the probabilities at 0.5
print(Y_pred)
print("----")
print(list(Y_test))
from sklearn import metrics
print(metrics.confusion_matrix(Y_test, Y_pred))
accuracy = metrics.accuracy_score(Y_test, Y_pred)
recall = metrics.recall_score(Y_test, Y_pred)
f1 = metrics.f1_score(Y_test, Y_pred)
print("Accuracy:", accuracy, ", F1-score:", f1, ", Recall:", recall)
The code itself doesn't say why those particular independent variables were chosen. As for the formula syntax: the name to the left of ~ is the dependent variable and the + terms to its right are the independent variables, and Pclass_1, Pclass_2, Pclass_3 are the one-hot columns that pd.get_dummies created from Pclass.
Usually you spend a long time on exploratory data analysis (EDA) first, decide which features are important and which to discard, and only then build the regression model. Here the author just one-hot encoded the categorical features, applied standard scaling, picked a few variables by hand, and trained a logistic regression model to check the result.
The variable-selection step, building a model over and over, looking at the results, and dropping the weak variables, was most likely simply skipped.
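As an illustration of that loop, here is a minimal backward-elimination sketch, assuming the same df and column names as in the assignment. The 0.05 p-value cutoff is my own assumption, and I start from Pclass_1 and Pclass_2 only, leaving Pclass_3 as the baseline category so the dummies are not perfectly collinear with the intercept:

import statsmodels.api as sm

## make sure the dummy columns are numeric 0/1 (newer pandas returns bool from get_dummies)
df[["Pclass_1", "Pclass_2", "Pclass_3"]] = df[["Pclass_1", "Pclass_2", "Pclass_3"]].astype(int)

## backward elimination: fit, drop the least significant term, refit
candidates = ["Age", "Parch", "Fare", "Pclass_1", "Pclass_2"]
threshold = 0.05 ## assumed significance cutoff
while candidates:
    formula = "Survived ~ " + " + ".join(candidates)
    result = sm.Logit.from_formula(formula, df).fit(disp=0)
    worst = result.pvalues.drop("Intercept").idxmax() ## term with the highest p-value
    if result.pvalues[worst] <= threshold:
        break ## everything left is significant, stop
    candidates.remove(worst)
print(result.summary())

Whatever survives the loop is one defensible variable set; the assignment simply hard-coded one such set up front. Real feature selection would also lean on EDA and domain knowledge rather than p-values alone.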