I want to use MeCab in Python 3, but I get the error "'utf-8' codec can't decode"

Asked 2 years ago, Updated 2 years ago, 169 views

Procedure to install MeCab on a Mac

I installed MeCab following the link above, then ran pip3 install mecab-python3 for Python 3 and executed the following:

import MeCab

# Build a tagger that uses the ChaSen-compatible output format
mecab = MeCab.Tagger("-Ochasen")
print(mecab.parse("Dachshund is walking."))

However, I get the error "'utf-8' codec can't decode bytes in position". Please let me know how to fix this. The output of a few checks follows.

[~] mecab -D  13:23:03
filename: /usr/local/mecab/lib/mecab/dic/ipadic/sys.dic
version: 102
charset: utf8
type: 0
size: 392126
left size: 1316
right size: 1316

[~] mecab -P  13:24:06
bos-feature: BOS/EOS,*,*,*,*,*,*,*,*,*
bos-format:
config-charset: EUC-JP
cost-factor: 700
dicdir: /usr/local/mecab/lib/mecab/dic/ipadic
dump-config: 1
eon-format:
eos-format: EOS\n
eos-format-chasen: EOS\n
eos-format-chasen2: EOS\n
eos-format-simple: EOS\n
eos-format-yomi: \n
eval-size: 8
lattice-level: 0
max-grouping-size: 24
nbest: 1
node-format: %m\t%H\n
node-format-chasen: %m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-chasen2: %M\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-simple: %m\t%F-[0,1,2,3]\n
node-format-yomi: %pS%f[7]
theta: 0.75
unk-eval-size: 4
unk-format: %m\t%H\n
unk-format-chasen: %m\t%m\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-chasen2: %M\t%m\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-yomi: %M

[~] echo Dachshund is walking. | mecab -Ochasen  13:24:13
Dachshund	Dachshund	Dachshund	noun-general
ga	ga	ga	particle-case particle-general
arui	arui	aruku	verb-independent	godan (ka-row, i-onbin)	continuative (ta-connection)
te	te	te	particle-conjunctive particle
iru	iru	iru	verb-non-independent	basic form
。	。	。	symbol-period
EOS

Python also reports UTF-8 as its default encoding:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

python python3 mecab

2022-09-30 18:33

1 Answer

Character code mismatch

The default encoding for Python 3 source code is UTF-8, and strings (str) hold Unicode text.
MeCab's ipadic dictionary (probably the one the questioner is using) is built as EUC-JP by default.
This mismatch is the cause of the error. (Note: this was wrong. The questioner's dictionary is UTF-8.)
Rebuilding the dictionary as UTF-8 with ./configure --with-charset=utf8 is the easiest solution.
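
As an illustration of the general mechanism described above (not of the questioner's specific setup, and without calling MeCab at all), the following sketch shows how bytes written as EUC-JP fail when they are decoded as UTF-8. The sample text and the try_decode helper are assumptions made for the example.

```python
# Illustration only: no MeCab involved. The sample text is an assumption.
text = "日本語"

# Bytes as they would look if the text had been written in EUC-JP.
euc_bytes = text.encode("euc_jp")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


# Decoding with the wrong codec fails; decoding with the matching one works.
utf8_result = try_decode(euc_bytes, "utf-8")
euc_result = try_decode(euc_bytes, "euc_jp")
print(utf8_result)
print(euc_result)
```

The same kind of UnicodeDecodeError would be expected whenever a library hands Python bytes in one encoding and they are decoded as another.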

Checking Dictionary Character Codes

The mecab command prints dictionary information when given the -D (--dictionary-info) option.

$ mecab -D

If the dictionary is UTF-8, the output includes a line such as charset: UTF-8 (or charset: utf8, as in the question's output).
It also appears possible to check this from Python as follows:

import MeCab

# dictionary_info() describes the dictionary the tagger loaded
mecab = MeCab.Tagger("-Ochasen")
info = mecab.dictionary_info()
print(info.charset)
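
The question's output shows charset utf8 from mecab -D and config-charset EUC-JP from mecab -P. A minimal sketch for pulling both values out of saved command output and comparing them is given below. The helper names read_setting and same_encoding are hypothetical, and the sample strings are shortened copies of the output in the question. The sketch only reports whether the two names refer to the same encoding; it does not establish the cause of the error.

```python
import codecs


def read_setting(output: str, key: str):
    """Return the value of the first 'key: value' line, or None (hypothetical helper)."""
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            return value.strip()
    return None


def same_encoding(a: str, b: str) -> bool:
    """Compare two encoding names after normalizing aliases such as utf8 and UTF-8."""
    return codecs.lookup(a).name == codecs.lookup(b).name


# Shortened copies of the `mecab -D` and `mecab -P` output from the question.
dict_info = (
    "filename: /usr/local/mecab/lib/mecab/dic/ipadic/sys.dic\n"
    "version: 102\n"
    "charset: utf8\n"
)
config = "config-charset: EUC-JP\ncost-factor: 700\n"

dict_charset = read_setting(dict_info, "charset")
config_charset = read_setting(config, "config-charset")
print(dict_charset, config_charset, same_encoding(dict_charset, config_charset))
```

Normalizing through codecs.lookup avoids treating spellings like utf8 and UTF-8 as different encodings.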


2022-09-30 18:33
