Procedure to install MeCab on a Mac
I installed MeCab by following the link above, ran pip3 install mecab-python3 for Python 3, and then tried the following:
import MeCab
mecab=MeCab.Tagger("-Ochasen")
print(mecab.parse("Dachshund is walking."))
However, I got the error "'utf-8' codec can't decode bytes in position".
Please let me know how to fix it.
[~] mecab -D  13:23:03
filename: /usr/local/mecab/lib/mecab/dic/ipadic/sys.dic
version: 102
charset: utf8
type: 0
size: 392126
left size: 1316
right size: 1316
[~] mecab -P  13:24:06
bos-feature: BOS/EOS,*,*,*,*,*,*,*,*
bos-format:
config-charset: EUC-JP
cost-factor: 700
dicdir: /usr/local/mecab/lib/mecab/dic/ipadic
dump-config: 1
eon-format:
eos-format: EOS\n
eos-format-chasen: EOS\n
eos-format-chasen2: EOS\n
eos-format-simple: EOS\n
eos-format-yomi: \n
eval-size: 8
lattice-level: 0
max-grouping-size: 24
nbest: 1
node-format: %m\t%H\n
node-format-chasen: %m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-chasen2: %M\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-simple: %m\t%F-[0,1,2,3]\n
node-format-yomi: %pS%f[7]
theta: 0.75
unk-eval-size: 4
unk-format: %m\t%H\n
unk-format-chasen: %m\t%m\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-chasen2: %M\t%m\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-yomi: %M
[~] echo Dachshund is walking. | mecab -Ochasen  13:24:13
Dachshund, Dachshund, Dachshund noun - General
Gaga is a particle-case particle-general
Walking A-Walking Verbs - Self-supporting 5-Step-C-Step-Step-Step-Step-Step-Step-Step
tete particle - conjunctive particle
Iriru verbs - non-autonomous basic form
。 Yes. Symbols - Phrase Points
EOS
Python reports UTF-8 as its default encoding:
import sys
sys.getdefaultencoding()
'utf-8'
python python3 mecab
In Python 3, the default source encoding is UTF-8, and strings (str) hold Unicode text.
MeCab's ipadic dictionary (probably the one the questioner is using) is built in EUC-JP by default.
This mismatch is the cause of the error. (Addendum: this was wrong. The questioner's dictionary is UTF-8.)
Rebuilding the dictionary as UTF-8 with ./configure --with-charset=utf8 is the easiest solution.
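The kind of mismatch described here can be reproduced without MeCab. The following is a minimal sketch, assuming a sample Japanese sentence (the original input is not shown on this page) and a hypothetical helper named try_decode. It only illustrates how EUC-JP bytes read as UTF-8 produce this class of error; it does not show that this is what happened in the questioner's setup.

```python
# Assumed sample sentence; the actual input from the question is not known.
text = "ダックスフンドが歩いている。"


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


# Bytes as an EUC-JP dictionary or tool would emit them.
euc_bytes = text.encode("euc_jp")

print(try_decode(euc_bytes, "euc_jp"))  # decodes back to the original text
print(try_decode(euc_bytes, "utf-8"))   # reports a 'utf-8' codec error
```

If both sides agree on the encoding, the round trip succeeds, which is why matching the dictionary charset to what the Python binding expects matters.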
The mecab command prints dictionary information with the -D (--dictionary-info) option.
$ mecab -D
If the dictionary's character encoding is UTF-8, the output contains a line such as charset: UTF-8.
It also seems possible to check this from Python:
import MeCab
mecab = MeCab.Tagger("-Ochasen")
info = mecab.dictionary_info()  # information about the loaded dictionary
print(info.charset)
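The charset reported by mecab -D can be spelled in more than one way (the question shows utf8, the answer mentions UTF-8). The following is a rough sketch of how such output could be checked in Python. The function names parse_dictionary_info and is_utf8 are hypothetical, and the sample text is the dictionary information pasted in the question, assuming a simple "key: value" layout.

```python
import codecs


def parse_dictionary_info(output: str) -> dict:
    """Parse 'key: value' lines, as printed by `mecab -D`, into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


def is_utf8(charset: str) -> bool:
    """True if the charset name is any alias Python recognises as UTF-8."""
    try:
        return codecs.lookup(charset).name == "utf-8"
    except LookupError:
        return False


# Dictionary information as pasted in the question (layout assumed).
sample = """filename: /usr/local/mecab/lib/mecab/dic/ipadic/sys.dic
version: 102
charset: utf8
type: 0
size: 392126
left size: 1316
right size: 1316
"""

info = parse_dictionary_info(sample)
print(info["charset"], is_utf8(info["charset"]))
```

Normalising through codecs.lookup avoids comparing raw strings, so utf8, UTF-8, and utf-8 are all treated as the same encoding.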