I started learning Python and machine learning about a week ago, so I'm still a beginner.
Sorry if this is a very basic question, but I'd appreciate your help.
I'm thinking of doing logistic regression analysis using Python.
I have the following user data:
Each user has an age, a gender, an annual income, and a purchase flag that is 0 or 1.
User ID, age, gender, annual income, purchase flag
1,30,man,500,1
2,40,woman,400,0
...
Do younger users buy more? Do users with a higher annual income buy more?
I'd like to analyze questions like these with logistic regression.
I can see online that the regression itself can be done with scikit-learn, but
I couldn't find a way to show how strong each factor's impact is.
I think it would be something like a P value...
I'd appreciate it if you could tell me a good way to do this.
python python3 machine-learning scikit-learn
Analyzing this data is very similar to predicting the survival of passengers on the Titanic in Kaggle's famous beginner's challenge, "Titanic: Machine Learning from Disaster," so I'll explain using the Titanic data.
First, prepare the data. Pclass is the cabin class (1st, 2nd, or 3rd), and 1st class is the highest.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def load_file_train():
    train_df = pd.read_csv("../input/train.csv")
    cols = ["Pclass", "Sex", "Age"]
    # Encode men as 1 and women as 0
    train_df["Sex"] = train_df["Sex"].apply(lambda sex: 1 if sex == "male" else 0)
    # Fill in missing ages with the average age
    train_df["Age"] = train_df["Age"].fillna(train_df["Age"].mean())
    train_df["Fare"] = train_df["Fare"].fillna(train_df["Fare"].mean())
    survived = train_df["Survived"].values
    data = train_df[cols].values
    return survived, data

survived, data_train = load_file_train()
model = LogisticRegression()
model.fit(data_train, survived)
scikit-learn does all the calculations for the analysis, so you only need to pass in the training data to create a model.
To check whether the model is appropriate, hold out 20-30% of the prepared data as test data and evaluate the model with the predict method as follows:
predicted=model.predict(data_test)
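The hold-out evaluation described above could look roughly like the sketch below. It assumes scikit-learn's train_test_split and accuracy_score, and it uses randomly generated data as a stand-in for the Titanic file, so the accuracy it prints has no real meaning. The coefficients used to generate the data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data in the same column order as above: Pclass, Sex (1 = male), Age
rng = np.random.default_rng(0)
n = 500
pclass = rng.integers(1, 4, n)   # 1, 2 or 3
sex = rng.integers(0, 2, n)      # 0 or 1
age = rng.uniform(1, 70, n)
X = np.column_stack([pclass, sex, age])

# Hypothetical rule used only to generate labels for this demo
logit = 3.0 - 1.0 * pclass - 2.4 * sex - 0.02 * age
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out 25% of the rows as test data
data_train, data_test, survived_train, survived_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(data_train, survived_train)

# Compare predictions on the held-out rows with the true labels
predicted = model.predict(data_test)
accuracy = accuracy_score(survived_test, predicted)
print(f"accuracy on held-out data: {accuracy:.3f}")
```

Fixing random_state makes the split reproducible, which helps when comparing different sets of features.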
For example, the probability of survival of a 20-year-old woman in a first class cabin can be calculated using predict_proba as follows:
model.predict_proba([[1, 0, 20]])
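As a small illustration of what predict_proba returns, the sketch below fits a model on ten made-up rows (not the Titanic data) and checks that the result has one column per class, in the order given by model.classes_. It also recomputes the probability by hand from coef_ and intercept_, which is how a binary logistic regression arrives at it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up rows in the column order used above: Pclass, Sex (1 = male), Age
X = np.array([
    [1, 0, 22], [1, 0, 35], [1, 1, 40], [2, 0, 28], [2, 1, 30],
    [2, 1, 50], [3, 0, 19], [3, 1, 25], [3, 1, 33], [3, 0, 45],
])
y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])  # 1 = survived

model = LogisticRegression(max_iter=1000).fit(X, y)

# One row in, one row out; columns follow model.classes_ (here 0 then 1)
proba = model.predict_proba([[1, 0, 20]])
print(model.classes_)
print(proba)

# The probability of class 1 is the sigmoid of intercept + coefficients . features
x = np.array([1, 0, 20])
logit = model.intercept_[0] + model.coef_[0] @ x
manual = 1 / (1 + np.exp(-logit))
print(manual, proba[0, 1])
```

The second column, proba[0, 1], is the value to read as the survival probability.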
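The direction and strength of the relationship between each feature (cabin class, gender, age) and the survival probability are given by coef_. Here all three coefficients are negative, which means that a smaller cabin class number (that is, a higher class), being a woman, and being younger each go with a higher survival probability.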
model.coef_
array([[-0.97924449, -2.4057234 , -0.02413822]])
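One common way to read these coefficients is to exponentiate them into odds ratios. The sketch below only uses the three values shown in the coef_ output above; the variable names are chosen for illustration. Note that a coefficient's size depends on the unit of its feature, so the small Age value is per single year.

```python
import math

# Values copied from the coef_ output above, in the order Pclass, Sex, Age
coefficients = {"Pclass": -0.97924449, "Sex": -2.4057234, "Age": -0.02413822}

# exp(coefficient) = factor by which the odds of survival change
# when the feature increases by one unit and the others stay fixed
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per unit = {ratio:.3f}")

# The same coefficient expressed per 10 years of age instead of per year
age_per_decade = math.exp(10 * coefficients["Age"])
print(f"Age: odds ratio per 10 years = {age_per_decade:.3f}")
```

An odds ratio below 1 means the odds go down as the feature goes up. For Sex it is roughly 0.09, so with men encoded as 1 the odds of survival for men are about a tenth of those for women of the same class and age.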
If you don't need to build a model and make predictions, and only want to see how strong the impact of age and annual income is, I think multiple regression analysis is more appropriate because it gives you a P value.
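If you would rather stay with logistic regression, a P value can also be computed for each coefficient with a Wald test. scikit-learn's LogisticRegression does not report P values, so the following is only a minimal sketch that fits the model by hand with NumPy (Newton's method). The data is synthetic, generated under the assumption that age matters and income does not, and names such as standardize and true_logit are made up for illustration.

```python
import math
import numpy as np

# Synthetic stand-in for the user data in the question (made-up numbers)
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 60, n)
income = rng.uniform(200, 800, n)
# Assumed ground truth for this demo: younger users buy more, income has no effect
true_logit = 3.0 - 0.08 * age
purchased = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(float)

def standardize(x):
    """Rescale to mean 0 and standard deviation 1 so coefficients are comparable."""
    return (x - x.mean()) / x.std()

# Design matrix: intercept column plus standardized features
X = np.column_stack([np.ones(n), standardize(age), standardize(income)])
names = ["intercept", "age", "income"]

# Fit an unpenalized logistic regression with Newton's method
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (purchased - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

# Wald test: z = coefficient / standard error, two-sided normal P value
p = 1 / (1 + np.exp(-X @ beta))
covariance = np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))
std_err = np.sqrt(np.diag(covariance))
z = beta / std_err
p_values = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

for name, b, pv in zip(names, beta, p_values):
    print(f"{name:>9}: coef={b:+.3f}  p={pv:.4f}")
```

Because the features are standardized, each coefficient is the change in log-odds per one standard deviation, which makes age and income directly comparable. A small P value suggests the coefficient is unlikely to be zero; it does not by itself say how large the effect is.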
For multiple regression analysis, bin age and annual income into classes, aggregate by group, calculate the purchase rate for each group, and then analyze those rates. Graphing the aggregated results with matplotlib will also help you understand the data.
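The group aggregation could be sketched with pandas as follows. The ten rows, the column names (age, income, purchased), and the bin edges are all made up for illustration; real data would be read from your own file.

```python
import pandas as pd

# Made-up user data with the same kind of columns as in the question
df = pd.DataFrame({
    "age": [22, 25, 28, 33, 37, 41, 45, 52, 58, 63],
    "income": [300, 450, 500, 400, 650, 700, 350, 800, 500, 600],
    "purchased": [1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
})

# Bin age into classes; right=False makes each bin include its lower edge
df["age_group"] = pd.cut(
    df["age"],
    bins=[20, 30, 40, 50, 70],
    right=False,
    labels=["20s", "30s", "40s", "50+"],
)

# The mean of a 0/1 flag is the purchase rate; count shows the group size
summary = df.groupby("age_group", observed=True)["purchased"].agg(["mean", "count"])
print(summary)
```

The same pattern applies to annual income by binning the income column instead. Calling summary["mean"].plot(kind="bar") would draw the rates with matplotlib, and the count column is worth checking because a rate based on only a few users is unreliable.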