The data has one fewer row after applying the one-hot encoder.

Asked 2 years ago, Updated 2 years ago, 67 views

import pandas as pd
from sklearn.model_selection import train_test_split
mushroom = pd.read_csv("../data/mushroom.csv", header = None)
mushroom[0] = mushroom[0].replace("p", float(1))
mushroom[0] = mushroom[0].replace("e", float(0))

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(categories='auto'), [0])],
    remainder='passthrough'                                   
)

feat = mushroom.iloc[1:]
mush = ct.fit_transform(feat)

y = mushroom[0] == 1

X_train, X_test, y_train, y_test = train_test_split(mush, y, random_state = 0)

The task is to predict whether a mushroom is edible or poisonous. Since the data is stored as strings, I converted only the edible/poisonous label to floats (1.0 for poisonous, 0.0 for edible) and handled the rest of the features with a one-hot encoder. An error occurs when I try to split the data into a training set and a test set.

ValueError: Found input variables with inconsistent numbers of samples: [8123, 8124]

I suspected that the lengths of the data and the labels did not match, so I checked the shapes.

mush.shape
(8123, 24)
y.shape
(8124,)

This confirms that the data that went through the one-hot encoder has one fewer row than the labels.

python scikit-learn

2022-09-20 21:54

1 Answer

>>> import pandas as pd


>>> df = pd.DataFrame({"A":[1,2,1,1,1,1], "B":[33,24,52,66,22,111]})
>>> df
   A    B
0  1   33
1  2   24
2  1   52
3  1   66
4  1   22
5  1  111
>>> df.shape
(6, 2)
>>> f = df.iloc[1:]
>>> f
   A    B
1  2   24
2  1   52
3  1   66
4  1   22
5  1  111

>>> y = df['A'] == 1
>>> y
0     True
1    False
2     True
3     True
4     True
5     True
Name: A, dtype: bool
>>> f.shape
(5, 2)
>>> y.shape
(6,)

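This is the expected result.

The example above reproduces the situation with a very simple DataFrame.

As it shows, the row is not lost during one-hot encoding. It is lost at iloc[1:], which slices rows and drops the first one before the encoder ever runs, while y is still built from every row. What you probably wanted is mushroom.iloc[:, 1:], which keeps all rows and drops the first column instead.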

>>> f1 = df.iloc[:,1:]
>>> f1
     B
0   33
1   24
2   52
3   66
4   22
5  111
>>> f1.shape
(6, 1)
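
Applied to the code in the question, the fix might look like the sketch below. The small DataFrame is a made-up stand-in for mushroom.csv (its values are assumptions, not the real data), and encoding every feature column is a guess at what the questioner intended. Note that an integer column list passed to ColumnTransformer is interpreted positionally, so [0] in the original code selected the first column of feat, which was the label column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for mushroom.csv read with header=None:
# column 0 is the label ("p" = poisonous, "e" = edible),
# the remaining columns are string-valued features.
mushroom = pd.DataFrame([
    ["p", "x", "s"],
    ["e", "x", "y"],
    ["e", "b", "s"],
    ["p", "x", "y"],
    ["e", "b", "s"],
    ["p", "x", "y"],
    ["e", "b", "y"],
    ["e", "x", "s"],
])

# Labels: True for poisonous, one entry per row.
y = mushroom[0] == "p"

# Features: keep every row, drop only the label column.
feat = mushroom.iloc[:, 1:]

# One-hot encode all feature columns, selected by position.
ct = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(categories="auto"),
      list(range(feat.shape[1])))],
    remainder="passthrough",
)
mush = ct.fit_transform(feat)

# The row counts agree, so the split does not raise ValueError.
assert mush.shape[0] == y.shape[0]
X_train, X_test, y_train, y_test = train_test_split(mush, y, random_state=0)
```

Comparing the label as a string here avoids the two replace calls from the question. Keeping the float conversion and testing mushroom[0] == 1 would work just as well, as long as feat is sliced by column rather than by row.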


2022-09-20 21:54
