A simple implementation method for classifying multiple Japanese sentences.

I would like to know the advantages and disadvantages of different ways to classify sentences.

■ Input
·Multiple Japanese texts of about 200 characters each
·Classification destinations (about 10 items defined in advance, such as romance, horror, suspense, etc.)

■ Output
Sentence A->Romance
Sentence B->Suspense
Sentence C->Horror
...

■ How to categorize
After a little research, I thought the following two methods would be easy to implement even for me, a Rails engineer who is an amateur at machine learning.

I would like to ask about the advantages and disadvantages of these two methods. I would also appreciate hearing about any other good approaches.

·Extract feature words with TF-IDF, use them as tags, and manually assign those tags to categories (romance, horror, suspense, etc.). I've heard that TF should be computed from my own texts, while IDF can be taken from a general-purpose corpus.

·Classify with naive Bayes (sorry, I'm not familiar with it)

■ Supplemental
I'm implementing this in Rails, so I'd appreciate it if there is a gem for it.

ruby-on-rails machine-learning

2022-09-30 20:18

1 Answer

The benefits of using TF-IDF include the following (there are probably others, but these are what come to mind):

·You can compute it with a library (such as scikit-learn), so designing features is easy.
·If the model trained on these feature vectors exposes coefficients or feature importances, you can see which words were important to a prediction.
·Simple machine learning algorithms such as logistic regression, naive Bayes, and decision trees can be applied.
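To make the feature-design point concrete, here is a minimal sketch of what a TF-IDF computation does, written with only the Python standard library so the arithmetic is visible. It is not scikit-learn's implementation (which applies its own smoothing and normalization). The character-bigram tokenizer, the function names `char_bigrams` and `tfidf_vectors`, and the sample texts are all assumptions made for illustration; character bigrams are used only because they need no Japanese morphological analyzer.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping 2-character tokens (assumed tokenizer)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def tfidf_vectors(texts):
    """Return one {token: weight} dict per text, using plain TF * IDF."""
    tokenized = [char_bigrams(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: the number of texts in which each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({
            tok: (count / total) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return vectors


# Hypothetical sample texts, far shorter than the 200 characters in the question.
texts = ["幽霊が出る古い屋敷", "幽霊と恋に落ちる", "恋人と海辺を歩く"]
vectors = tfidf_vectors(texts)

# Tokens unique to one text get a higher weight than tokens shared across texts.
for tok, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(tok, round(weight, 3))
```

In this sketch a token that appears in every text gets a weight of zero, which is the behavior that lets TF-IDF surface the feature words the question wants to use as tags.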

If you use naive Bayes, you can obtain the feature importances by following the approach in the link below.
https://stackoverflow.com/questions/50526898/how-to-get-feature-importance-in-naive-bayes
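As a rough illustration of the idea, the sketch below implements a small multinomial naive Bayes classifier by hand and ranks tokens by their per-class log probability. It is not the code from the linked answer. The class name `TinyNaiveBayes`, the character-bigram tokenizer, the add-one smoothing, and the training texts are all assumptions chosen for illustration. Note that a high log probability only means a token is frequent within a class, so treat the ranking as a starting point rather than a definitive importance measure.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Split text into overlapping 2-character tokens (assumed tokenizer)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter(labels)         # label -> number of texts
        for text, label in zip(texts, labels):
            self.token_counts[label].update(char_bigrams(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def token_log_prob(self, token, label):
        """Smoothed log P(token | label)."""
        counts = self.token_counts[label]
        total = sum(counts.values())
        return math.log((counts[token] + 1) / (total + len(self.vocab)))

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n in self.doc_counts.items():
            score = math.log(n / n_docs)  # log prior
            for tok in char_bigrams(text):
                if tok in self.vocab:     # ignore tokens never seen in training
                    score += self.token_log_prob(tok, label)
            scores[label] = score
        return max(scores, key=scores.get)

    def top_tokens(self, label, k=3):
        """Tokens with the highest log P(token | label)."""
        return sorted(self.vocab,
                      key=lambda t: self.token_log_prob(t, label),
                      reverse=True)[:k]


# Hypothetical training data with two of the categories from the question.
train_texts = ["幽霊が出る古い屋敷", "夜の墓場で幽霊を見た",
               "恋人と海辺を歩く", "初めての恋人に手紙を書く"]
train_labels = ["horror", "horror", "romance", "romance"]
model = TinyNaiveBayes().fit(train_texts, train_labels)

print(model.predict("幽霊の出る屋敷"))
print(model.top_tokens("horror", k=1))
```

With real data you would need many labeled texts per category, and a proper tokenizer would likely give more readable "important words" than character bigrams.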

If you want to stay on Rails, calling Python from Rails is relatively easy with pycall.rb.
https://github.com/mrkn/pycall.rb

The disadvantages include the following (again, there are probably others, but these are what come to mind):

·As the data and vocabulary grow, the dimension of the feature vector grows, which makes memory use inefficient.
·It does not directly capture latent information (for example, that word A and word B are similar).
·Retraining to accommodate new words in new data incurs the cost of recomputing the feature vectors.
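Two of these drawbacks can be seen in a tiny example. The sketch below uses raw character-bigram counts instead of full TF-IDF weights to keep it short, but the same effects apply: the vocabulary (and so the vector dimension) grows as texts are added, and two phrases with related meanings but no shared tokens have a similarity of zero. The tokenizer, the `cosine` helper, and the sample phrases are assumptions for illustration only.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping 2-character tokens (assumed tokenizer)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def cosine(a, b):
    """Cosine similarity between two sparse {token: count} vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


texts = ["怖い話", "恐怖の物語", "恋の物語"]

# Drawback 1: every new text can add new dimensions to the feature space.
vocab = set()
sizes = []
for text in texts:
    vocab.update(char_bigrams(text))
    sizes.append(len(vocab))
print(sizes)

# Drawback 2: "怖い話" and "恐怖の物語" are both about scary stories,
# but they share no bigram, so their vectors look completely unrelated.
scary_a = Counter(char_bigrams(texts[0]))
scary_b = Counter(char_bigrams(texts[1]))
print(cosine(scary_a, scary_b))
```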

Other options include:

·Use word2vec, tensorflow-hub's NNLM or Universal Sentence Encoder, LASER, and so on to represent each sentence as a vector of lower dimension than TF-IDF, then feed those vectors into a model built with Keras for training.
·Use SentencePiece with a BPE model trained on your sentence set to split each sentence into subwords (and convert those subwords into IDs), then feed them into a typical sentence classification model built with Keras (a model made of layers such as Embedding and LSTM).
·Fine-tune a pre-trained language model such as BERT.

2022-09-30 20:18
