How can I convert an integer-encoded column so that a model can learn from it as int data?

Asked 2 years ago, Updated 2 years ago, 91 views

My goal is to model the ranking of products in a shopping mall. The final goal is to predict what Rank will come out. Among the columns shown here, encoding name is the integer encoding of the product's title after tokenization.

I am currently planning to train a model with Linear Regression. Because of the integer encoding, I get an error such as ValueError: could not convert string to float: '[1]'.

How can I convert the integer encoding so that the model can learn from it together with the other columns? I'd really appreciate it if you could let me know.

import pandas as pd
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression

# Load the training and test data
train_df = pd.read_csv("/add purified 2.csv", encoding='UTF-8')
test_df = pd.read_csv("/add purified 2_test.csv", encoding='UTF-8')
print(train_df.info())

# Drop rows where price, review, buy and buy_date are all 0
search_df = train_df[(train_df['price']==0) & (train_df['review']==0) & (train_df['buy']==0) & (train_df['buy_date']==0)]
train_df = train_df.drop(search_df.index, axis=0)
search_df = test_df[(test_df['price']==0) & (test_df['review']==0) & (test_df['buy']==0) & (test_df['buy_date']==0)]
test_df = test_df.drop(search_df.index, axis=0)

# Separate the features from the target (rank)
x_train_df = train_df.drop(['id', 'rank'], axis=1)
x_test_df = test_df.drop(['id', 'rank'], axis=1)
y_train_df = train_df['rank']
y_test_df = test_df['rank']

print(x_train_df.head())

# Note: `transformer` is not defined anywhere in the code as posted
x_train = transformer.transform(x_train_df)
x_test = transformer.transform(x_test_df)

y_train = y_train_df.to_numpy()
y_test = y_test_df.to_numpy()

------------------------------------------------------------------------

model = LinearRegression()
model.fit(x_train, y_train)

#Model Validation

print(model.score(x_test, y_test)) #0.5462414358589345
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-fb4f716a55ea> in <module>
      1 model = LinearRegression()
      2 
----> 3 model.fit(x_train, y_train)
      4 
      5 ###########Model verification

C:\Anaconda3\envs\venv\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, sample_weight)
    456         n_jobs_ = self.n_jobs
    457         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 458                          y_numeric=True, multi_output=True)
    459 
    460         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

C:\Anaconda3\envs\venv\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    754                     ensure_min_features=ensure_min_features,
    755                     warn_on_dtype=warn_on_dtype,
--> 756                     estimator=estimator)
    757     if multi_output:
    758         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\Anaconda3\envs\venv\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525             try:
    526                 warnings.simplefilter('error', ComplexWarning)
--> 527                 array = np.asarray(array, dtype=dtype, order=order)
    528             except ComplexWarning:
    529                 raise ValueError("Complex data not supported\n"

C:\Anaconda3\envs\venv\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: could not convert string to float: '[1]'

text-mining deep-learning

2022-09-20 22:12

1 Answer

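Linear regression takes n numbers (x1, x2, x3, ..., xn) and predicts a numerical value y. The error occurs because each value in the encoding name column is not a single number: as the traceback shows, it is a string such as '[1]', which cannot be converted to a float.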
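To make the cause concrete, here is a minimal sketch that reproduces the same ValueError and then parses the strings into lists of integers. The toy DataFrame, the column name `encoding_name`, and its values are assumptions for illustration; the real column may be named or formatted differently.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: the integer-encoded title
# column holds text such as "[1]" rather than numbers.
df = pd.DataFrame({
    "price": [1000, 2500, 1800],
    "encoding_name": ["[1]", "[2, 5]", "[3, 1, 4]"],
})

# Converting the text directly to float fails, as in the traceback.
try:
    np.asarray(df["encoding_name"].tolist(), dtype=float)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("ValueError:", err)

# Parsing each string with ast.literal_eval yields real lists of ints.
token_ids = df["encoding_name"].apply(ast.literal_eval)
print(token_ids.tolist())
```

Even after parsing, each row is still a list of varying length, so it cannot be passed to the model as a single numeric column.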

I'm not sure whether this is the usual approach, but if you convert each integer-encoded word into a one-hot vector, so that each row becomes a vector of length len(vocab), then you can run linear regression on it.

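What is a one-hot vector?

It is a representation that converts a word's integer index into a vector of length len(vocab) that is 0 everywhere except for a 1 at that index.

So data with multiple words becomes a single vector with a 1 at the position of every word it contains.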
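As an illustration of the conversion described above, here is a small NumPy sketch. The helper `multi_hot` and the vocabulary size of 6 are hypothetical, chosen only to keep the output short.

```python
import numpy as np


def multi_hot(token_ids, vocab_size):
    """Return a vector of length vocab_size with 1.0 at every index in token_ids."""
    vec = np.zeros(vocab_size, dtype=float)
    vec[list(token_ids)] = 1.0
    return vec


vocab_size = 6  # assumed vocabulary size; valid indices are 0 to 5

print(multi_hot([1], vocab_size))        # [0. 1. 0. 0. 0. 0.]
print(multi_hot([3, 1, 4], vocab_size))  # [0. 1. 0. 1. 1. 0.]
```

Note that this representation discards word order and repeated words; it only records which words appear in the title.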

scikit-learn probably has a function that does the one-hot conversion, so you can implement this simply.
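One scikit-learn class that fits this case is `MultiLabelBinarizer`, which turns a list of ids per row into a 0/1 matrix. The sketch below assumes the title column holds strings like '[2, 5]'; the toy data and the column name `encoding_name` are hypothetical, and this is only one possible way to wire it up, not necessarily what the asker's `transformer` was meant to do.

```python
import ast

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data standing in for the real training CSV.
train_df = pd.DataFrame({
    "price": [1000.0, 2500.0, 1800.0, 900.0],
    "encoding_name": ["[1]", "[2, 5]", "[3, 1, 4]", "[2]"],
    "rank": [1, 2, 3, 4],
})

# Parse the strings into lists of ints, then binarize them:
# one column per token id seen in the training data.
token_ids = train_df["encoding_name"].apply(ast.literal_eval)
mlb = MultiLabelBinarizer()
title_features = mlb.fit_transform(token_ids)

# Combine the numeric column with the binarized title features.
x_train = np.hstack([train_df[["price"]].to_numpy(), title_features])
y_train = train_df["rank"].to_numpy()

model = LinearRegression()
model.fit(x_train, y_train)
print(x_train.shape)
```

For the test set, the same fitted `mlb` would be reused with `mlb.transform(...)` so that the columns line up; token ids not seen during fitting are ignored with a warning.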


2022-09-20 22:12
