A method for Japanese stemming (judging that two different spellings of "good" are the same word)

Asked 2 years ago, Updated 2 years ago, 84 views

Thank you for your help.

I use MeCab to segment Japanese sentences into words. Suppose I input two very similar sentences.

Results

The dictionary used is IPADIC. To a Japanese speaker these two sentences sound the same when read aloud, but a plain string comparison of the tokens finds two differences.

There are also cases where the segmentation itself comes out differently.

Since each pair sounds identical when read aloud, I would like each pair to be judged as the same sentence. Is there a good way to do this?

Thank you for your cooperation.

The part I wrote in the postscript is also valid as an answer, so I transferred it to an answer.

japanese natural-language-processing mecab

2022-09-30 19:05

1 Answer

It's not perfect yet, but I've made some progress, so I've moved the postscript here and re-edited it.

UniDic has a feature that IPADIC lacks: "It has a hierarchy of lexeme, word form, orthographic form, and pronunciation form, and can assign the same headword regardless of orthographic variation or word-form variation."

Look at the CSV file in the dictionary.

Sushi restaurant, 5142, 5142, 9609, noun, common noun, general, *, *, *, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Japanese, *, *, *, *
Sushi restaurant, 5142, 5142, 10432, noun, common noun, general, *, *, *, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Japanese, *, *, *, *
Sushi restaurant, 5142, 5142, 8402, noun, common noun, general, *, *, *, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Japanese, *, *, *, *

Using these entries, I think the three spellings can be unified into a single "sushi restaurant".

However, this only works for parts of speech that do not conjugate (nouns and the like). For verbs and other conjugating words, the entries look like this:

Arguised, 1272,1272,14547, Verbs, General, *,*, Lower 1st - La line, Preform - General, Architskarel, Walking tired, Arguitskarel, Arguitskarel, Arguitskarel, Sum, *, *, *, *
tired of walking, 1272, 1272, 12668, verb, general, *, *, lower stage - line - La, preform - general, alkyzkarel, tired of walking, alkyzkale, tired of walking, alkyzkarel, sum, *, *, *, *

So

Sentence 1: I slept tired of walking -> I slept tired of walking
Sentence 2: I fell asleep under the influence of -> I fell asleep

I would like these two sentences to be judged the same, but if every token is replaced with its lexeme and the result is output,

Sentence 1: I slept tired of walking -> I slept tired of walking
Sentence 2: I fell asleep under the influence of -> I fell asleep tired of walking

the result is no longer natural Japanese, because the lexeme is the dictionary form and the conjugation is lost.
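The effect described above can be reproduced with a tiny sketch. The token list below is hand-written for illustration (it is not actual MeCab or UniDic output), and `join_surfaces` and `join_lexemes` are hypothetical helper names.

```python
# Hand-written (surface, lexeme) pairs standing in for tokenizer output.
# Assumption: the lexeme of a conjugated word is its dictionary form.
TOKENS = [
    ("歩き疲れ", "歩き疲れる"),
    ("て", "て"),
    ("寝", "寝る"),
    ("た", "た"),
]


def join_surfaces(tokens):
    """Rebuild the sentence from the surface forms as written."""
    return "".join(surface for surface, _ in tokens)


def join_lexemes(tokens):
    """Rebuild the sentence after replacing every token with its lexeme."""
    return "".join(lexeme for _, lexeme in tokens)


print(join_surfaces(TOKENS))  # 歩き疲れて寝た
print(join_lexemes(TOKENS))   # 歩き疲れるて寝るた (conjugation lost)
```

Replacing tokens with lexemes does unify the spelling, but every conjugated word falls back to its dictionary form, so the output cannot be read as a sentence.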

This can be improved in two steps:

In the first step, spelling variants are grouped by lexeme. For example, the three spellings of "sushi restaurant" all share the same lexeme, "sushi restaurant". They are grouped as follows:

  • Sushi restaurant->[Sushi restaurant, sushi restaurant]
  • Sushi restaurant->[Sushi restaurant, sushi restaurant]
  • Sushi restaurant->[Sushi restaurant, sushi restaurant]

When processing a "sushi restaurant" token, look at the words that have the same pronunciation, "sushiya", and choose the spelling that is most similar to the lexeme. In this way, the three spellings can be merged into one.

I used the Levenshtein distance to measure how similar two strings are.
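As a rough sketch of this selection, assuming a small hand-written table in place of the real UniDic CSV: the `Entry` tuple, the `ENTRIES` rows, and the `normalize` function are all hypothetical names and data made up for illustration. The toy table also contains one conjugated verb form, which the same pronunciation filter handles.

```python
from collections import namedtuple

# Hypothetical stand-in for a few dictionary rows: (surface, lexeme, pron).
Entry = namedtuple("Entry", "surface lexeme pron")

ENTRIES = [
    Entry("寿司屋", "寿司屋", "スシヤ"),
    Entry("すし屋", "寿司屋", "スシヤ"),
    Entry("鮨屋", "寿司屋", "スシヤ"),
    Entry("歩き疲れ", "歩き疲れる", "アルキツカレ"),
    Entry("あるきつかれ", "歩き疲れる", "アルキツカレ"),
    Entry("歩き疲れる", "歩き疲れる", "アルキツカレル"),
]


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize(surface, entries=ENTRIES):
    """Return the spelling closest to the lexeme among the entries that
    share the token's lexeme and pronunciation."""
    hit = next((e for e in entries if e.surface == surface), None)
    if hit is None:
        return surface  # unknown word: leave it unchanged
    candidates = [e for e in entries
                  if e.lexeme == hit.lexeme and e.pron == hit.pron]
    best = min(candidates, key=lambda e: levenshtein(e.surface, e.lexeme))
    return best.surface


print(normalize("すし屋"))        # 寿司屋
print(normalize("あるきつかれ"))  # 歩き疲れ
```

Filtering on pronunciation before comparing with the lexeme is what keeps a conjugated form conjugated: the dictionary form has a different pronunciation and never becomes a candidate.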

The same goes for verbs.

"Arukitsukare" and "walking fatigue" share a common lexeme, "walking fatigue". When processing an "arukitsukare" token, keep only the words with that lexeme whose pronunciation is also "arukitsukare", then compare them against the lexeme; the result is "walking fatigue". Because the candidates are filtered by pronunciation, the conjugated form is preserved.

When dividing words,

Example
Input

  • Read the beginning of the book
  • Read the beginning of the book
  • I took a rest because I was tired of walking
  • Take a break
  • I'm going to a sushi restaurant today
  • I'm going to a sushi restaurant today
  • I'm going to a sushi restaurant today

Output

  • [Book, no, start, read, read]
  • [Book, no, start, read, read]
  • [I'm tired of walking]
  • [I'm tired of walking]
  • [Today, I'm going to a sushi restaurant]
  • [Today, I'm going to a sushi restaurant]
  • [Today, I'm going to a sushi restaurant]

Step 1 alone still leaves some problems.

To address them, in Step 2 I tokenized a Wikipedia dump with UniDic and counted word frequencies.

Some words never appear in Wikipedia, so Step 1 is applied first. Then, among words that share the same lexeme, part of speech, and reading, the spelling that appears most often in Wikipedia is used as the "standard form".
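The frequency-based choice described above might be sketched like this. The `OBSERVED` tokens and their counts are invented toy data, not the Wikipedia figures, and `build_standard_forms` and `standardize` are hypothetical names.

```python
from collections import Counter, defaultdict

# Invented corpus tokens: (surface, lexeme, part_of_speech, reading).
OBSERVED = (
    [("東京", "東京", "名詞", "トウキョウ")] * 5
    + [("TOKYO", "東京", "名詞", "トウキョウ")] * 1
    + [("蛇", "蛇", "名詞", "ヘビ")] * 2
)


def build_standard_forms(tokens):
    """Map each (lexeme, pos, reading) key to its most frequent surface."""
    counts = defaultdict(Counter)
    for surface, lexeme, pos, reading in tokens:
        counts[(lexeme, pos, reading)][surface] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def standardize(token, standard_forms):
    """Replace a token's surface with the standard form for its key.

    Keys never seen in the corpus fall back to the original surface.
    """
    surface, lexeme, pos, reading = token
    return standard_forms.get((lexeme, pos, reading), surface)


forms = build_standard_forms(OBSERVED)
print(standardize(("TOKYO", "東京", "名詞", "トウキョウ"), forms))  # 東京
print(standardize(("鮨屋", "寿司屋", "名詞", "スシヤ"), forms))      # 鮨屋
```

The fallback in `standardize` reflects the point that some words never occur in the corpus; for those, only the result of Step 1 is available.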

For example, for Tokyo,

Tokyo, noun, proper noun, place name, general, Tokyo 306123
Tokyo, noun, proper noun, place name, general, Tokyo 3467
Tokyo, noun, proper noun, place name, general, Tokyo 356
Tokyo, noun, proper noun, place name, general, Tokyo 758
TOKYO, noun, proper noun, place name, general, Tokyo 6
Tokyo, noun, proper noun, place name, general, Tokyo 2
Tokyo, noun, proper noun, place name, general, Tokyo 3

In this case, the most frequent form, "Tokyo" (306123 occurrences), is adopted as the standard form.

This method is not perfect either, and some problems remain. For example, for "snake" the counts are

Snake, noun, common noun, general, *, snake 3854
Ja, noun, common noun, general, *, snake 4091
Snakes, noun, common noun, general, *, snake 407

For some reason, "Ja" is the most frequent, so replacing with it gives
  • I don't like snakes->I don't like jaja

I still have a lot of work to do.


2022-09-30 19:05
