I am processing sentences with Python and MeCab.
for text in df["msg_body"]:
    for line in mecab.parse(text).rstrip().splitlines():
        items = line.split("\t")
        if len(items) == 2:
            surface, feature = items
            # keep nouns and independent verbs, but not numerals or symbols
            if re.search("^(noun|verb, independent)", feature) and not \
               re.search("^(BOS/EOS|noun, number|symbol)", feature):
                small_list.append(surface)
            else:
                # tokens that fail the condition are stored as empty strings
                surface = ""
                small_list.append(surface)
As shown above, the condition keeps only nouns and independent verbs (numeral nouns and symbols are excluded).
However, when I count the parts of speech in the output, I get:
['Noun', 'General', '*'] 216
['Verb', 'Independent', '*'] 139
['Noun', 'Proper Noun', 'General'] 121
['Noun', 'Sa-variable connection', '*'] 63
['Noun', 'Proper Noun', 'Organization'] 29
['Noun', 'Proper Noun', 'Person Name'] 28
['Noun', 'Proper Noun', 'Region'] 24
['Adverb', 'General', '*'] 7
['Adjective', 'Independent', '*'] 7
['Noun', 'Adjective Verb Stem', '*'] 5
so parts of speech that should have been excluded (adverbs and adjectives) are included.
Why is this?
Does MeCab really behave this inconsistently?
Or is the code wrong?
Thank you for your cooperation.
What does feature contain? How do you create the MeCab instance?
feature is what the code shows: the output parsed with mecab = MeCab.Tagger("-b5242880") is basically split into a word and its part-of-speech information (surface and feature here). The mecab instance is created as above. df is the data frame containing the text. small_list is just a list of the words that passed the condition.
If you only collect the surface forms that meet the condition, how are you counting feature after filtering?
I'm just parsing the remaining words again.
python regular-expression pandas mecab
Parsing "あった" gives the following (the result depends on the dictionary):
あっ	verb, independent, *, *, godan ra-row, continuative ta-connection, ある, アッ, アッ
た	auxiliary verb, *, *, *, special/ta, basic form, た, タ, タ
If you then parse the surface form あっ again on its own, the result is naturally
あっ	interjection, *, *, *, *, *, あっ, アッ, アッ
In other words, re-parsing a word in isolation can assign it a different part of speech than it had in the original sentence, which is why excluded parts of speech appear in your count.
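One way to avoid this mismatch is to record each token's feature at the moment it passes the filter, rather than re-parsing the surface forms afterwards. The following is only a minimal sketch: SAMPLE_LINES is hand-written data standing in for the lines of mecab.parse() output, the labels follow the translated names used in this thread, and filter_tokens is a hypothetical helper name. Real output depends on the dictionary in use.

```python
import re
from collections import Counter

# Hand-written stand-ins for lines of mecab.parse() output (surface<TAB>feature).
# Assumption: labels use the translated names from this thread, not real dictionary output.
SAMPLE_LINES = [
    "あっ\tverb, independent, *, *, godan ra-row, continuative ta-connection, ある, アッ, アッ",
    "た\tauxiliary verb, *, *, *, special/ta, basic form, た, タ, タ",
    "EOS",
]

KEEP = re.compile(r"^(noun|verb, independent)")
DROP = re.compile(r"^(BOS/EOS|noun, number|symbol)")


def filter_tokens(lines):
    """Return (surface, feature) pairs for tokens that pass the filter."""
    kept = []
    for line in lines:
        items = line.split("\t")
        if len(items) != 2:
            continue  # lines without a tab, such as EOS, carry no feature
        surface, feature = items
        if KEEP.search(feature) and not DROP.search(feature):
            kept.append((surface, feature))
    return kept


kept = filter_tokens(SAMPLE_LINES)

# Count the first two feature fields as recorded during the first (and only) parse.
pos_counts = Counter(tuple(feature.split(", ")[:2]) for _, feature in kept)
print(pos_counts)
```

Because the count uses the features seen in the original sentence context, a token kept as an independent verb is counted as one, even if its surface form alone would be parsed differently.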
Additional information
if re.search("^(noun|verb, independent)", feature) and not \
   re.search("^(BOS/EOS|noun, number|symbol)", feature):
There is no particular problem with this code.
if re.search("^(noun|verb, independent)", feature) and not re.search("^noun, number", feature):
would have been enough, I think, since a feature that starts with noun or verb can never start with BOS/EOS or symbol. That is not an essential mistake, though.
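To illustrate why the shorter condition should behave the same way, here is a small sketch that compares the two conditions on a few feature strings. The strings in SAMPLE_FEATURES are made up for illustration using the translated labels from this thread, and the function names are hypothetical.

```python
import re


def original_condition(feature):
    """Condition from the question: keep nouns/independent verbs, drop BOS/EOS, numerals, symbols."""
    return bool(
        re.search(r"^(noun|verb, independent)", feature)
        and not re.search(r"^(BOS/EOS|noun, number|symbol)", feature)
    )


def simplified_condition(feature):
    """Shorter condition: only the numeral exclusion can overlap with the positive pattern."""
    return bool(
        re.search(r"^(noun|verb, independent)", feature)
        and not re.search(r"^noun, number", feature)
    )


# Made-up feature strings for illustration (assumption: translated labels).
SAMPLE_FEATURES = [
    "noun, general, *",
    "noun, number, *",
    "verb, independent, *",
    "verb, non-independent, *",
    "symbol, general, *",
    "BOS/EOS, *, *",
    "adverb, general, *",
]

results = [(f, original_condition(f), simplified_condition(f)) for f in SAMPLE_FEATURES]
for feature, original, simplified in results:
    print(f"{feature!r}: original={original}, simplified={simplified}")
```

Both patterns are anchored with ^, so a string accepted by the first pattern cannot also start with BOS/EOS or symbol; only the "noun, number" exclusion ever changes the outcome.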
© 2024 OneMinuteCode. All rights reserved.