My goal is to model shopping mall rankings; ultimately, I want to learn how the rank comes out. Among the columns I currently have, encoding name is an integer encoding of the product's title after tokenization.
I am currently planning to train a model with Linear Regression.
Because of the integer encoding, I get an error such as ValueError: could not convert string to float: '[1]'.
How can I convert the integer encoding so that the model can learn from it together with the other columns? I'd really appreciate it if you could let me know.
import pandas as pd
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
train_df=pd.read_csv("/add purified 2.csv",encoding='UTF-8')
test_df = pd.read_csv("/add purified 2_test.csv", encoding='UTF-8')
print(train_df.info())
search_df = train_df[(train_df['price']==0) & (train_df['review']==0) & (train_df['buy']==0) & (train_df['buy_date']==0)]
train_df = train_df.drop(search_df.index, axis=0)
search_df = test_df[(test_df['price']==0) & (test_df['review']==0) & (test_df['buy']==0) &(test_df['buy_date']==0)]
test_df = test_df.drop(search_df.index, axis=0)
x_train_df = train_df.drop(['id', 'rank'], axis=1)
x_test_df = test_df.drop(['id', 'rank'], axis=1)
y_train_df = train_df['rank']
y_test_df = test_df['rank']
print(x_train_df.head())
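# NOTE: `transformer` is not defined in this snippet; it is presumably built with make_column_transformer and fitted beforehand.
x_train = transformer.transform(x_train_df)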
x_test = transformer.transform(x_test_df)
y_train = y_train_df.to_numpy()
y_test = y_test_df.to_numpy()
------------------------------------------------------------------------
model = LinearRegression()
model.fit(x_train, y_train)
#Model Validation
print(model.score(x_test, y_test)) #0.5462414358589345
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-fb4f716a55ea> in <module>
1 model = LinearRegression()
2
----> 3 model.fit(x_train, y_train)
4
5 ###########Model verification
C:\Anaconda3\envs\venv\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, sample_weight)
456 n_jobs_ = self.n_jobs
457 X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 458 y_numeric=True, multi_output=True)
459
460 if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:
C:\Anaconda3\envs\venv\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
754 ensure_min_features=ensure_min_features,
755 warn_on_dtype=warn_on_dtype,
--> 756 estimator=estimator)
757 if multi_output:
758 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
C:\Anaconda3\envs\venv\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
525 try:
526 warnings.simplefilter('error', ComplexWarning)
--> 527 array = np.asarray(array, dtype=dtype, order=order)
528 except ComplexWarning:
529 raise ValueError("Complex data not supported\n"
C:\Anaconda3\envs\venv\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: '[1]'
Linear regression predicts a numerical value y from n numbers (x1, x2, x3, ..., xn). The error occurs because each value in the encoding name column is not a single number: it is a string such as '[1]' that holds a list of token IDs.
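To illustrate the point above, here is a minimal sketch that reproduces the error on a tiny stand-in DataFrame and then parses the strings back into Python lists. The column name encoding_name and the sample values are assumptions made for illustration; they are not taken from the original CSV.

```python
import ast

import pandas as pd

# Hypothetical stand-in for the "encoding name" column as pandas reads it from CSV:
# each cell is a string that looks like a list, not a number.
df = pd.DataFrame({"encoding_name": ["[1]", "[2, 5]", "[3, 1, 4]"]})

# Converting such a cell to float fails, which is what scikit-learn runs into.
try:
    float(df["encoding_name"].iloc[0])
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

# ast.literal_eval safely turns the string "[2, 5]" into the list [2, 5].
token_lists = df["encoding_name"].apply(ast.literal_eval).tolist()

print(error_message)
print(token_lists)
```

Once the column holds real lists of integers, it still has to be turned into fixed-length numeric features before LinearRegression can use it.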
I'm not sure whether this is the usual approach, but if you convert each integer-encoded word into a one-hot vector of length len(vocab), you can run linear regression on it anyway.
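What is a one-hot vector?
It represents a word as a vector in which the element at that word's index is 1 and every other element is 0. For example, with a vocabulary of 5 words and 0-based indices, word 1 becomes [0, 1, 0, 0, 0].
So data with multiple words can be represented by combining the one-hot vectors of its words into a single vector of the same length.
For example, [1, 3] becomes [0, 1, 0, 1, 0].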
scikit-learn probably has a function that performs the one-hot conversion, so you can do this quite simply.
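As a rough sketch of this idea, the example below uses scikit-learn's MultiLabelBinarizer to turn each list of token IDs into a fixed-length 0/1 vector and then fits LinearRegression on it together with other numeric columns. MultiLabelBinarizer is intended for label sets, so using it on features is one possible choice rather than the only one. The vocabulary size, the feature values, and the rank values are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical token-ID lists (already parsed from strings) and target ranks.
token_lists = [[1], [2, 5], [3, 1, 4], [2, 4]]
ranks = np.array([1.0, 2.0, 3.0, 4.0])

# Hypothetical numeric columns, e.g. price and review count.
other_features = np.array([
    [1000.0, 12.0],
    [2500.0, 3.0],
    [800.0, 40.0],
    [1200.0, 7.0],
])

# Assumed vocabulary size; fixing `classes` keeps the column order stable
# between the training data and the test data.
vocab_size = 6
mlb = MultiLabelBinarizer(classes=list(range(vocab_size)))
multi_hot = mlb.fit_transform(token_lists)  # shape: (n_samples, vocab_size)

# Stack the numeric columns and the 0/1 word columns into one feature matrix.
X = np.hstack([other_features, multi_hot])

model = LinearRegression()
model.fit(X, ranks)
predictions = model.predict(X)

print(multi_hot)
print(predictions.shape)
```

With a real vocabulary the 0/1 matrix can become very wide, so a sparse representation (MultiLabelBinarizer accepts sparse_output=True) or a regularized model may be worth considering.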