Environment: Python 3.5.2, macOS Sierra
Below is code that reads Excel data, builds strings from it, and uses MeCab to list the nouns and how many times each one appears.
import collections
import pandas as pd
import MeCab
import sys

df = pd.read_excel("filename.xls", sheetname=0)
df = df.dropna()
m = MeCab.Tagger()
noun_list = []
for i in df:
    # Each MeCab output line is "surface<TAB>feature1,feature2,..."; the last line is 'EOS'
    for l in m.parse(i).splitlines():
        if l != 'EOS' and l.split('\t')[1].split(',')[0] == '名詞':  # '名詞' is the noun tag
            noun_list.append(l.split('\t')[0])
noun_cnt = collections.Counter(noun_list)
noun = pd.DataFrame(list(noun_cnt.items()), columns=['noun', 'number of appearances'])
noun = noun.sort_values('number of appearances', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
noun = noun[noun['number of appearances'] > 10]  # Only those that appear more than 10 times
print(noun.tail())
Output:
     noun        number of appearances
51   Worry       18
199  Yes         23
171  Security    31
156  Anxiety     40
154  Convenient  81
The code ran fine on my original file, but when I ran it on a different set of Japanese data I got the following error, so I suspect an encoding problem.
NotImplementedError: Wrong number or type of arguments for overloaded function 'Tagger_parse'.
I think I need to add encode and decode calls for UTF-8 somewhere, but could someone please tell me how?
This issue is not really about pandas. It appears to be caused by the value of i passed to m.parse(i) in MeCab having a different format in the case that works and in the case that fails. It may be a character-encoding problem, but it may not be.
To find out what i actually is, inspect the data you are about to pass by adding print(i), print(repr(i)), or print(type(i)) on the line just before the call. Once you have confirmed that, try to isolate whether the same problem occurs when you are not using pandas.
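The check described above could be sketched as follows. The helper name `describe` and the sample values are hypothetical and only for illustration; in the real loop you would call it on each `i` just before `m.parse(i)`.

```python
def describe(value):
    """Return a one-line summary (repr and type name) of a value."""
    return "{!r} (type: {})".format(value, type(value).__name__)


# Hypothetical values standing in for whatever the loop variable i might hold.
samples = ["不安", 123, float("nan"), None, ["便利", "安心"]]

for i in samples:
    verdict = "is a str" if isinstance(i, str) else "is NOT a str"
    print(describe(i), "->", verdict)
```

Any value reported as not being a `str` is a candidate for the input that `m.parse` rejects.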
This is only a guess, but in the case that fails, i may be an array or some other data type rather than a string.
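If that guess is right, one way to confirm it is to pass only real strings to the parser and collect everything else for inspection. This is a minimal sketch, not a definitive fix: `parse_strings_only` is a hypothetical helper, and `str.upper` stands in for `m.parse` so the example runs without MeCab installed. Note also that iterating a pandas DataFrame directly (`for i in df`) yields its column labels rather than its cell values, so a non-string column label would end up in the skipped list.

```python
def parse_strings_only(values, parse):
    """Apply parse() to str values only; return (results, skipped)."""
    results = []
    skipped = []
    for value in values:
        if isinstance(value, str):
            results.append(parse(value))
        else:
            # Keep the offending value so it can be inspected afterwards.
            skipped.append(value)
    return results, skipped


# str.upper is a stand-in for m.parse; the input values are made up.
results, skipped = parse_strings_only(["abc", 42, None, "def"], str.upper)
print("parsed:", results)
print("skipped:", [(repr(v), type(v).__name__) for v in skipped])
```

In the original loop, the equivalent would be calling `parse_strings_only(df, m.parse)` and looking at what ends up in `skipped`.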