Unable to exclude certain parts of speech

I am processing sentences with Python and MeCab.

import re

for text in df["msg_body"]:
    for line in mecab.parse(text).rstrip().splitlines():
        items = line.split("\t")
        if len(items) == 2:
            surface, feature = items
            # keep nouns and independent verbs; drop numbers and symbols
            if (re.search("^(noun|verb, independent)", feature)
                    and not re.search("^(BOS/EOS|noun, number|symbol)", feature)):
                small_list.append(surface)
            else:
                # rejected tokens are stored as empty strings
                surface = ""
                small_list.append(surface)

As shown above, the condition keeps only nouns and independent verbs (and excludes numeric nouns and symbols).
However, when I count the parts of speech in the output, I get:

['Noun', 'General', '*'] 216
['Verb', 'Independent', '*'] 139
['Noun', 'Proper Noun', 'General'] 121
['Noun', 'Sa-hen connection', '*'] 63
['Noun', 'Proper Noun', 'Organization'] 29
['Noun', 'Proper Noun', 'Person Name'] 28
['Noun', 'Proper Noun', 'Region'] 24
['Adverb', 'General', '*'] 7
['Adjective', 'Independent', '*'] 7
['Noun', 'Adjectival Noun Stem', '*'] 5

so the result contains parts of speech that should have been excluded, such as adverbs and adjectives.
Why is this?
Does MeCab really behave this inconsistently?
Or is the code wrong?
Thank you for your cooperation.

Addendum (replies to comments)

What is in feature? How do you create the MeCab instance?

feature is what the code shows: the output of mecab = MeCab.Tagger("-b5242880") is split into a word and its part-of-speech information (surface and feature here). The mecab instance is created as above. df is the data frame containing the text. small_list is just a list of the words that passed the condition.

How do you count the features after filtering when you collect only the surfaces that meet the condition?

I simply parse the remaining words again.

python regular-expression pandas mecab

2022-09-30 11:17

1 Answer

Analysis of "あった" (the result depends on the dictionary):

あっ	verb, independent, *, *, godan (ra-row), continuative ta-connection, ある, アッ, アッ
た	auxiliary verb, *, *, *, special/ta, basic form, た, タ, タ

If you then parse this surface form again on its own, you will of course get

あっ	interjection, *, *, *, *, *, あっ, アッ, アッ

as the result. A surface form taken out of its sentence can be assigned a different part of speech, so counting after a second parse does not reflect what the filter actually kept.
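One way to avoid the second parse is to keep each feature from the first parse next to its surface and count from that. The following is only a minimal sketch: it assumes IPAdic-style Japanese labels (名詞 for noun, 動詞,自立 for independent verb, 名詞,数 for numeric noun), the helper names filter_parsed and count_pos are hypothetical, and the sample text is hand-written in MeCab's default output format rather than produced by a real run.

```python
import re
from collections import Counter

# Assumed IPAdic-style labels: 名詞 = noun, 動詞,自立 = verb (independent),
# 名詞,数 = noun (number). Adjust these for your dictionary.
KEEP = re.compile(r"^(名詞|動詞,自立)")
DROP = re.compile(r"^名詞,数")


def filter_parsed(parsed):
    """Return (surface, feature) pairs that pass the filter.

    `parsed` is the text returned by mecab.parse(text).
    """
    kept = []
    for line in parsed.rstrip().splitlines():
        items = line.split("\t")
        if len(items) != 2:  # skips the trailing "EOS" line
            continue
        surface, feature = items
        if KEEP.search(feature) and not DROP.search(feature):
            kept.append((surface, feature))
    return kept


def count_pos(kept):
    """Count the first three feature fields of the kept tokens."""
    return Counter(tuple(feature.split(",")[:3]) for _, feature in kept)


# Hand-written sample in MeCab's default output format (hypothetical data).
sample = (
    "本\t名詞,一般,*,*,*,*,本,ホン,ホン\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "あっ\t動詞,自立,*,*,五段・ラ行,連用タ接続,ある,アッ,アッ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)
kept = filter_parsed(sample)
pos_counts = count_pos(kept)
```

Because the counts come from the features seen in context, a token kept as an independent verb is still counted as a verb, whatever it would become if parsed alone.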

Additional information

if (re.search("^(noun|verb, independent)", feature)
        and not re.search("^(BOS/EOS|noun, number|symbol)", feature)):

There is no particular problem with this condition.

if re.search("^(noun|verb, independent)", feature) and not re.search("^noun, number", feature):

I think this shorter condition would be enough, but the original is not fundamentally wrong.
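The equivalence of the two conditions can be checked directly: a feature that begins with BOS/EOS or with the symbol label can never match the first pattern, so only the numeric-noun exclusion does any work. A small sketch, again assuming IPAdic-style Japanese labels (記号 for symbol, 副詞 for adverb, 感動詞 for interjection); the function names and the sample feature strings are made up for illustration.

```python
import re


def original(feature):
    """Condition as written in the question."""
    return bool(
        re.search(r"^(名詞|動詞,自立)", feature)
        and not re.search(r"^(BOS/EOS|名詞,数|記号)", feature)
    )


def simplified(feature):
    """Shorter condition: only the numeric-noun exclusion is kept."""
    return bool(
        re.search(r"^(名詞|動詞,自立)", feature)
        and not re.search(r"^名詞,数", feature)
    )


# Hypothetical feature prefixes covering each case.
features = [
    "名詞,一般,*",
    "名詞,数,*",
    "動詞,自立,*",
    "動詞,非自立,*",
    "記号,句点,*",
    "BOS/EOS,*,*",
    "副詞,一般,*",
    "感動詞,*,*",
]
results = [(f, original(f), simplified(f)) for f in features]
accepted = [f for f, keep, _ in results if keep]
```

Both functions accept and reject the same sample features, which is consistent with the filter itself not being the source of the unexpected adverbs and adjectives.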

2022-09-30 11:17
