Thank you for your help.
I use MeCab to segment Japanese sentences. Suppose I input two very similar sentences.
Results
The dictionary used is IPADIC. These two sentences sound the same to a Japanese speaker when read aloud, but a plain string comparison shows two differences.
Also, in this case, the segmentation results will differ.
Since each pair sounds the same when read aloud, I would like the two sentences in each pair to be judged as the same sentence. Is there a good way to do this?
Thank you for your cooperation.
The part I wrote in the postscript is also valid as an answer, so I transferred it to an answer.
japanese natural-language-processing mecab
It's not perfect yet, but we've made some progress, so we'll move the additional part over here and re-edit it.
UniDic has a feature that IPADIC lacks: "It has a hierarchy of lexemes, word forms, orthographic forms, and pronunciation forms, and can assign the same headword regardless of orthographic variation or word-form variation."
Look at the CSV file in the dictionary. The three entries below are differently written surface forms of the same word.
Sushi restaurant, 5142, 5142, 9609, noun, common noun, general, *, *, *, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Japanese, *, *, *, *
Sushi restaurant, 5142, 5142, 10432, noun, common noun, general, *, *, *, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Japanese, *, *, *, *
Sushi restaurant, 5142, 5142, 8402, noun, common noun, general, *, *, *, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Sushiya, Japanese, *, *, *, *
I think these can all be normalized to a single spelling of "sushi restaurant" by way of the shared lexeme.
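As a rough illustration of this grouping idea, the sketch below collects surface forms that share a lexeme and a pronunciation. The rows are hypothetical and simplified to three fields; the Japanese spellings are illustrative only and are not taken from the dictionary entries above. Real UniDic CSV rows have many more columns, and their positions depend on the UniDic version.

```python
from collections import defaultdict

# Hypothetical, simplified rows: (surface form, pronunciation, lexeme).
# The column layout of a real UniDic CSV differs by version, so check it
# against your own dictionary before relying on fixed positions.
rows = [
    ("寿司屋", "スシヤ", "寿司屋"),
    ("すし屋", "スシヤ", "寿司屋"),
    ("鮨屋", "スシヤ", "寿司屋"),
    ("東京", "トーキョー", "東京"),
]


def group_by_lexeme(rows):
    """Map (lexeme, pronunciation) to the surface forms that share it."""
    groups = defaultdict(list)
    for surface, pronunciation, lexeme in rows:
        groups[(lexeme, pronunciation)].append(surface)
    return dict(groups)


groups = group_by_lexeme(rows)
for key, surfaces in groups.items():
    print(key, surfaces)
```

Each group then holds the spelling variants that are candidates for being merged into one form.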
However, this only works for parts of speech that do not conjugate (nouns, etc.). For verbs and the like, the entries look like this:
Arguised, 1272,1272,14547, Verbs, General, *,*, Lower 1st - La line, Preform - General, Architskarel, Walking tired, Arguitskarel, Arguitskarel, Arguitskarel, Sum, *, *, *, *
tired of walking, 1272, 1272, 12668, verb, general, *, *, lower stage - line - La, preform - general, alkyzkarel, tired of walking, alkyzkale, tired of walking, alkyzkarel, sum, *, *, *, *
So
Sentence 1: I slept tired of walking -> I slept tired of walking
Sentence 2: I fell asleep under the influence of -> I fell asleep
I would like to judge these two sentences as the same, but if each word is replaced with its lexeme and output, the result is
Sentence 1: I slept tired of walking -> I slept tired of walking
Sentence 2: I fell asleep under the influence of -> I fell asleep tired of walking
which is unnatural Japanese, because conjugated verbs are replaced by their dictionary forms. This can be improved in two steps:
For example, the three spellings of "sushi restaurant" share the lexeme "sushi restaurant". These are to be merged into one form. When processing an entry for "sushi restaurant", choose, from the entries with the same pronunciation "sushiya", the spelling most similar to the lexeme. In this way, the three spellings of "sushi restaurant" can be combined into one.
I tried using Levenshtein distance to measure how similar the strings are.
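The selection described here might look roughly like the following sketch: a plain dynamic-programming Levenshtein distance, plus a helper that picks the candidate spelling closest to the lexeme. The helper name `closest_to_lexeme` and the Japanese sample strings are assumptions made for illustration, not the code or data actually used in this post.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_to_lexeme(lexeme: str, candidates: list[str]) -> str:
    """Return the candidate spelling with the smallest distance to the lexeme.

    Ties are resolved in favour of the candidate listed first.
    """
    return min(candidates, key=lambda s: levenshtein(s, lexeme))


# Illustrative candidates that share one pronunciation (hypothetical data).
noun = closest_to_lexeme("寿司屋", ["すし屋", "鮨屋", "寿司屋"])
verb = closest_to_lexeme("歩き疲れる", ["あるきつかれ", "歩きつかれ", "歩き疲れ"])
print(noun, verb)
```

For the verb case, the candidates are conjugated forms, so none matches the dictionary-form lexeme exactly; the closest spelling is kept instead.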
The same goes for verbs.
The differently spelled forms pronounced "arukitsukare" share the lexeme "walking fatigue". When processing an entry for "arukitsukare", keep only the entries that have this lexeme and the pronunciation "arukitsukare", then compare each spelling with the lexeme; the closest spelling, "walking fatigue", is obtained.
When dividing words,
Example
Input
Output
Step 1 alone will leave some problems.
To solve this problem, I tokenized the Wikipedia dump with UniDic and counted word frequencies.
Some words never appear in Wikipedia, so Step 1 is applied first. Then, among words with the same lexeme, part of speech, and reading, the spelling that appears most frequently in Wikipedia is taken as the (tentative) "standard form".
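A minimal sketch of this frequency-based choice, assuming the counts have already been collected: the tuple layout, the function name `standard_forms`, and the `spelling_N` surface names are placeholders introduced here, while the counts are the ones given for "Tokyo" in this post.

```python
from collections import Counter, defaultdict

# (surface, lexeme, part of speech, reading, count in the corpus).
# Surface names are placeholders; only the counts come from the post.
observations = [
    ("spelling_1", "Tokyo", "noun", "Tokyo", 306123),
    ("spelling_2", "Tokyo", "noun", "Tokyo", 3467),
    ("spelling_3", "Tokyo", "noun", "Tokyo", 356),
    ("spelling_4", "Tokyo", "noun", "Tokyo", 758),
    ("spelling_5", "Tokyo", "noun", "Tokyo", 6),
    ("spelling_6", "Tokyo", "noun", "Tokyo", 2),
    ("spelling_7", "Tokyo", "noun", "Tokyo", 3),
]


def standard_forms(observations):
    """Pick the most frequent surface for each (lexeme, pos, reading) key."""
    totals = defaultdict(Counter)
    for surface, lexeme, pos, reading, count in observations:
        totals[(lexeme, pos, reading)][surface] += count
    return {
        key: counter.most_common(1)[0][0]
        for key, counter in totals.items()
    }


standard = standard_forms(observations)
print(standard)
```

Words that never occur in the corpus get no entry here, which is why the Step 1 result is still needed as a fallback.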
For example, for Tokyo,
Tokyo, noun, proper noun, place name, general, Tokyo 306123
Tokyo, noun, proper noun, place name, general, Tokyo 3467
Tokyo, noun, proper noun, place name, general, Tokyo 356
Tokyo, noun, proper noun, place name, general, Tokyo 758
TOKYO, noun, proper noun, place name, general, Tokyo 6
Tokyo, noun, proper noun, place name, general, Tokyo 2
Tokyo, noun, proper noun, place name, general, Tokyo 3
Here, "Tokyo" is adopted as the standard form.
This method is not perfect either, and some problems remain. For example, "snake" gives
Snake, noun, common noun, general, *, snake 3854
Ja, noun, common noun, general, *, snake 4091
Snakes, noun, common noun, general, *, snake 407
For some reason, "Ja" is the most frequent, so it would be chosen as the replacement.
There is still a lot of work to do.