The number of rows decreased after applying the one-hot encoder.

import pandas as pd
from sklearn.model_selection import train_test_split
mushroom = pd.read_csv("../data/mushroom.csv", header = None)
mushroom[0] = mushroom[0].replace("p", float(1))
mushroom[0] = mushroom[0].replace("e", float(0))

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(categories='auto'), [0])],
    remainder='passthrough'                                   
)

feat = mushroom.iloc[1:]
mush = ct.fit_transform(feat)

y = mushroom[0] == 1

X_train, X_test, y_train, y_test = train_test_split(mush, y, random_state = 0)

The task is to predict whether mushrooms are edible or poisonous. Since the data is in str form, I converted only the edible/poisonous label to float (1, 0). The remaining features were handled with a one-hot encoder. An error occurs when I try to split the data into a training set and a test set.

ValueError: Found input variables with inconsistent numbers of samples: [8123, 8124]

I checked the shapes because I suspected the error came from the data and the labels having different lengths.

mush.shape
(8123, 24)

y.shape
(8124,)

This confirmed that the data has one row fewer than the labels after going through the one-hot encoder.

python scikit-learn

2022-09-20 21:54

1 Answer

>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,1,1,1,1], "B":[33,24,52,66,22,111]})
>>> df
   A    B
0  1   33
1  2   24
2  1   52
3  1   66
4  1   22
5  1  111
>>> df.shape
(6, 2)
>>> f = df.iloc[1:]
>>> f
   A    B
1  2   24
2  1   52
3  1   66
4  1   22
5  1  111

>>> y = df['A'] == 1
>>> y
0     True
1    False
2     True
3     True
4     True
5     True
Name: A, dtype: bool
>>> f.shape
(5, 2)
>>> y.shape
(6,)

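This is the expected result.

The very simple data frame above serves as an example: df.iloc[1:] drops the first row, while y is built from the full column, so f has 5 rows and y has 6.

As you can see, the length is not reduced by one during one-hot encoding; it is reduced by the iloc[1:] slice that runs before it. The asker probably wanted mushroom.iloc[:, 1:], which keeps every row and drops only the first column.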

>>> f1 = df.iloc[:,1:]
>>> f1
     B
0   33
1   24
2   52
3   66
4   22
5  111
>>> f1.shape
(6, 1)
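
Applied to the question's setup, the column slice could look like the sketch below. It is only an illustration: the small DataFrame is made-up stand-in data instead of mushroom.csv, and it assumes every column after the first should be one-hot encoded, which may not match exactly what the asker intended.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# Made-up stand-in for mushroom.csv (assumption): column 0 is the label,
# the remaining columns are categorical features.
mushroom = pd.DataFrame({
    0: ["p", "e", "e", "p", "e", "p", "e", "p"],
    1: ["x", "b", "x", "s", "b", "x", "s", "b"],
    2: ["n", "y", "w", "n", "y", "w", "n", "y"],
})

# Label: True for poisonous, False for edible. Built from all rows.
y = mushroom[0] == "p"

# Slice columns, not rows: keep every row and drop only column 0.
feat = mushroom.iloc[:, 1:]

# One-hot encode every remaining feature column.
encoder = OneHotEncoder()
X = encoder.fit_transform(feat)

# Both now have the same number of rows, so the split no longer fails.
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```

Because the label is taken from mushroom[0] and the features from mushroom.iloc[:, 1:], both come from the same set of rows, so train_test_split receives inputs of equal length.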

2022-09-20 21:54
