I want to sort the words in the text by TF-IDF score

Asked 2 years ago, Updated 2 years ago, 126 views

I want to list the nouns in the text file in order of tf-idf score.

In Python, I want to analyze text containing tweets with MeCab (via natto), compute a tf-idf score for each extracted noun, and sort the nouns by that score. When I ran the code, I got the error below.
I have only just started programming and have no one to ask, and I don't really understand what is going wrong or how to fix it, so I am asking here.
Could you give me some advice?

Traceback (most recent call last):
  File "tfidf_test_dataset.py", line 41, in <module>
    tfidf = vectorizer.fit_transform(corpus)
  File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Users/macuser/Workspaces/jxpress/trendword/.direnv/python-3.7.3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'generator' object has no attribute 'lower'

Relevant source code

from natto import MeCab
import codecs
import sys
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

with codecs.open("tfidf_test.txt", "r", "utf-8") as f:
    corpus = f.read().split("\n")

mecab = MeCab('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

# if tagger.lang == 'ja':
for txt in corpus:
    words = mecab.parse(txt, as_nodes=True)

    for w in words:
        rm_list = ["RT", "https", "co"]
        if w.feature.split(",")[0] == "名詞":  # "名詞" means noun
            if len(w.surface) >= 2:
                if not any(rm in w.surface for rm in rm_list):
                    print(str(w.surface))
                else:
                    print("")
            else:
                print("")
        else:
            print("")

corpus = [mecab.parse(txt, as_nodes=True) for line in corpus]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Show the scores
print(tfidf.toarray())
# (number of texts, number of distinct words)
print(tfidf.shape)

# Sort
feature_names = np.array(vectorizer.get_feature_names())
for vec in tfidf:
    index = np.argsort(vec.toarray(), axis=1)[:, ::-1]
    feature_words = feature_names[index]
    print(feature_words[:, :10])

Input text (tfidf_test.txt):

The story of a man traveling around the world on a bicycle or motorcycle meeting a kitten desperately chasing him and changing his journey

We were able to win the gold prize in the Kyoto wind instrument competition for high school students! Thanks to the people who supported me so far. Thank you for your support.

This year, we filmed it again in the sunflower field behind the Hiraya Village playground. Most of the pictures I took were of me with a strange face. The most decent picture of my face. It's hard to tell where they are.

Supplementary information (for example, FW/Tool Version)

macOS 10.12.6, Python 3.7.3, Atom

python natural-language-processing

2022-09-29 21:34

1 Answer

If you use MeCab from Python, you can also use mecab-python3. The following script is the question's code modified to use mecab-python3 instead of natto:

import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

mecab = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

corpus = []
with open("tfidf_test.txt") as f:
    targets = {}  # surface forms that were tagged as nouns
    rm_list = ["RT", "https", "co"]
    for line in f:
        # One parse per line: each output row is "surface<TAB>features"
        words = mecab.parse(line).split("\n")
        tmp = []
        for w in words:
            w = w.strip().split()
            if len(w) == 2:
                tmp.append(w[0])
                if w[1].startswith("名詞,"):  # "名詞" means noun
                    targets[w[0]] = True
        # TfidfVectorizer expects each document as one space-separated str
        corpus.append(' '.join(tmp))

print(targets)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Show the scores
print(tfidf.toarray())
# (number of texts, number of distinct words)
print(tfidf.shape)

# Sort
# On scikit-learn 1.0 and later, use get_feature_names_out();
# get_feature_names() was removed in 1.2.
feature_names = np.array(vectorizer.get_feature_names())
for vec in tfidf:
    # argsort is ascending, so reverse it to put the highest scores first
    index = np.argsort(vec.toarray(), axis=1)[:, ::-1]
    feature_words = feature_names[index]
    print([x for x in feature_words[0] if x in targets][:10])

TfidfVectorizer.fit_transform must be given a list of str values.

The corpus tokenized by natto is clearly not of type str.

Alternatively, you could use mecab-python3 with the -Owakati option, which returns each line as a single space-separated str. Here, however, the part-of-speech information is needed as well, and calling MeCab several times per line would be inefficient, so I deliberately did not use -Owakati and did everything in a single parse.
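
As a rough sketch of what "everything in a single parse" means: MeCab's default output has one token per row, in the form surface, tab, comma-separated features, and ends with an EOS row. The helper below takes text in that layout and returns both the space-separated str that the vectorizer needs and the set of nouns. The name split_parse_output and the sample string are assumptions made for illustration; the sample is hand-written, not real output from the question's file.

```python
def split_parse_output(parsed: str):
    """Split MeCab-style default output into (space-joined text, set of nouns)."""
    tokens = []
    nouns = set()
    for row in parsed.splitlines():
        if row == "EOS" or "\t" not in row:
            continue  # skip the end-of-sentence marker and empty rows
        surface, feature = row.split("\t", 1)
        tokens.append(surface)
        # With IPA-style dictionaries the first feature field is the part of speech.
        if feature.split(",")[0] == "名詞":  # "名詞" means noun
            nouns.add(surface)
    return " ".join(tokens), nouns


# Hand-written sample in the default output layout (assumed, for illustration only).
sample = (
    "子猫\t名詞,一般,*,*,*,*,子猫,コネコ,コネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "走る\t動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル\n"
    "EOS\n"
)
text, nouns = split_parse_output(sample)
print(text)   # 子猫 が 走る
print(nouns)  # {'子猫'}
```

With real MeCab, the argument would be the return value of the tagger's parse call for one line, so the tagger runs only once per line.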

AttributeError: 'generator' object has no attribute 'lower'

As this error shows, each element of the list passed to fit_transform must have a lower method, which means it must be a str.
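
The same failure can be reproduced with the standard library alone. In this small illustration, preprocess is a simplified stand-in for the lowercasing step at the bottom of the traceback, and fake_nodes imitates a tokenizer that yields tokens lazily; both names are assumptions for the example and are not part of scikit-learn or natto.

```python
def preprocess(doc):
    # Simplified stand-in for the vectorizer's lowercasing step.
    return doc.lower()


def fake_nodes(text):
    # Hypothetical lazy tokenizer: yields tokens one at a time, so calling it
    # returns a generator rather than a str.
    for token in text.split():
        yield token


doc = fake_nodes("Kitten Bicycle World")

try:
    preprocess(doc)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)  # 'generator' object has no attribute 'lower'

# Joining the tokens first gives a str, which does have .lower().
joined = " ".join(fake_nodes("Kitten Bicycle World"))
print(preprocess(joined))  # kitten bicycle world
```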

The output looks like this:

 (omitted)
["World", "Good morning", "Kitten", "Desperate", "Bike", "Men", "Bicycle", "Where", "High School Student", "Sama"]
["High School Student", "Cheering", "Thanks", "Here", "Kotoko", "Dear", "Golden Prize", "Blow Music Contest", "Kyoto", "People"]
["Photo", "Photo", "Hiraya Village Office", "Sunflower Field", "This Year", "Where", "Strange Face", "Strange Face", "Strongest", "First", "Self"]
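
For readers curious how such a ranking comes about, here is a minimal pure-Python sketch of ranking words by TF-IDF over space-separated documents. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default, but it skips L2 normalization and the vectorizer's tokenization rules, so the scores will not match TfidfVectorizer exactly. The function name rank_by_tfidf and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def rank_by_tfidf(docs, top_n=10):
    """Return, for each space-separated document, its words sorted by TF-IDF (highest first)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    rankings = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in tf.items()
        }
        # Sort by score, highest first; break ties alphabetically for stable output.
        ordered = sorted(scores, key=lambda w: (-scores[w], w))
        rankings.append(ordered[:top_n])
    return rankings


# Toy documents (assumed), already split into space-separated tokens.
docs = [
    "kitten bicycle world kitten",
    "bicycle prize kyoto",
    "photo sunflower photo photo",
]
for ranked in rank_by_tfidf(docs, top_n=3):
    print(ranked)
```

A word that is frequent in one document but rare across the others ranks first, while a word shared between documents, such as "bicycle" here, drops toward the end.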


2022-09-29 21:34
