How to find features for clustering by comparing differences in group means

I'm thinking of using the difference in group means when selecting features for clustering.
I believe that if I pick, from among several attributes, the ones with the larger difference in means, the groups can be separated more appropriately.
However, I don't know how to compare mean differences across attributes.

For example, suppose there are two high school classes (Group A and Group B), and you have calculated the average height (cm), weight (kg), and part-time job income (yen) for each.

|                      | Group A | Group B |
|----------------------|---------|---------|
| Average height (cm)  | 190     | 180     |
| Average weight (kg)  | 75      | 70      |
| Average income (yen) | 10000   | 500     |

In this case, the difference in average height is 10 cm, the difference in average weight is 5 kg, and the difference in average part-time job income is 9,500 yen. If I have to select just one of these features,
I don't know which one to choose.
(I don't think they can be compared easily because the units are different.)

When I looked into it, I found that scikit-learn has MinMaxScaler and StandardScaler.
Should I simply use these?
Or do I need to apply some other measure?

python pandas machine-learning scikit-learn

2022-09-30 21:33

1 Answer

After some research, it seems that the subject of the question, "how to find data that can be used as features," is handled by hand according to the purpose of the classification when there are few dimensions, and by feature selection methods such as the following when there are many.

Feature Selection - Wikipedia

Filter method: Select features using a criterion for feature quality, such as the information gain between the target variable and each feature.
Wrapper method: Actually apply a learning algorithm to subsets of the features, and select the subset that minimizes the generalization error estimated by cross-validation.
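
As a rough illustration of the two approaches, here is a minimal sketch using scikit-learn. The data is synthetic and entirely an assumption (only the height and weight means loosely follow the question), the `noise` column is a hypothetical uninformative feature, and the choice of a k-nearest-neighbors classifier is arbitrary. Both methods need a target variable, so the sketch uses the known Group A / Group B labels.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100  # synthetic students per group (assumption)

# Synthetic data: group 0 = A, group 1 = B. The standard deviations are made up.
labels = np.repeat([0, 1], n)
height = np.concatenate([rng.normal(190, 5, n), rng.normal(180, 5, n)])
weight = np.concatenate([rng.normal(75, 10, n), rng.normal(70, 10, n)])
noise = rng.normal(0, 1, 2 * n)  # hypothetical feature unrelated to the group
X = np.column_stack([height, weight, noise])
names = ["height", "weight", "noise"]

# Filter method: score each feature on its own against the target.
# Mutual information is used here as the "information gain" criterion.
mi = mutual_info_classif(X, labels, random_state=0)
filter_choice = names[int(np.argmax(mi))]

# Wrapper method: fit a model on candidate subsets and keep the subset
# with the best cross-validated score (forward selection, 5-fold CV).
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
sfs = SequentialFeatureSelector(
    model, n_features_to_select=1, direction="forward", cv=5
)
sfs.fit(X, labels)
wrapper_choice = names[int(np.argmax(sfs.get_support()))]

print("filter method picks: ", filter_choice)
print("wrapper method picks:", wrapper_choice)
```

The filter score is cheap because it never trains a model, while the wrapper is slower but evaluates features with the same kind of model that would be used afterwards.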

Feature selection - Machine learning "Toki no Mori Wiki"

Also, regarding the point that "it cannot be simply compared because the units are different," as long as clustering is performed within the same dimensions, there seems to be no problem.

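If you do want to rank features with different units against each other, one option (not something the sources above prescribe) is to divide each mean difference by the pooled standard deviation, which is similar in spirit to the rescaling that StandardScaler performs. The sketch below uses only the standard library. The per-student values are hypothetical, made up so that the group means match the table in the question; the spreads are pure assumptions, and a different spread would change the ranking.

```python
import math
import statistics

# Hypothetical per-student values, chosen so the group means match the
# question's table. The spreads are assumptions, not data from the question.
group_a = {
    "height_cm": [184, 188, 190, 192, 196],
    "weight_kg": [60, 70, 75, 80, 90],
    "income_yen": [0, 0, 0, 20000, 30000],
}
group_b = {
    "height_cm": [174, 178, 180, 182, 186],
    "weight_kg": [55, 65, 70, 75, 85],
    "income_yen": [0, 0, 0, 0, 2500],
}


def standardized_mean_difference(a, b):
    """Absolute mean difference divided by the pooled sample standard deviation."""
    pooled_var = (
        (len(a) - 1) * statistics.variance(a)
        + (len(b) - 1) * statistics.variance(b)
    ) / (len(a) + len(b) - 2)
    return abs(statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


scores = {
    name: standardized_mean_difference(group_a[name], group_b[name])
    for name in group_a
}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
# height_cm: 2.24
# income_yen: 0.95
# weight_kg: 0.45
```

With these made-up values, income has by far the largest raw difference (9,500 yen) but height has the largest unit-free difference, because income varies so much within each group.
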
The above is what I found as a beginner, so it may contain mistakes or inaccuracies, but I'm posting it as an answer in the hope that it helps. I would appreciate comments from anyone with more detailed knowledge.

2022-09-30 21:33
