I have data in dfs[0], dfs[1], ..., dfs[8].
from transformers import AutoTokenizer
from collections import defaultdict

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=True)

# Number zero
word_freqs_0 = defaultdict(int)
for text in dfs[0]['comment']:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs_0[word] = word_freqs_0[word] + 1

# Number one
word_freqs_1 = defaultdict(int)
for text in dfs[1]['comment']:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs_1[word] = word_freqs_1[word] + 1
I've written this out the same way all the way up to number 8.
I want to turn the code above into word_freqs[0] ~ word_freqs[8] using a loop. What should I do?
After declaring the variable as
word_freqs = defaultdict(int)
I tried to append to it, but it doesn't work.
This is a very simple refactoring problem: extract the repeated code into a function.
# Number zero
word_freqs_0 = defaultdict(int)
for text in dfs[0]['comment']:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs_0[word] = word_freqs_0[word] + 1

# Number one
word_freqs_1 = defaultdict(int)
for text in dfs[1]['comment']:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs_1[word] = word_freqs_1[word] + 1
If you compare the repeated blocks, the only things that change are dfs[0] versus dfs[1], and word_freqs_0 versus word_freqs_1. Everything else is identical. Looking closely at this, you can see that given a df as input, you can write a function that builds and returns its word_freq, and then just repeat the call.
import pandas as pd

def get_word_freq_from_df(df: pd.DataFrame):
    # Count how often each pre-tokenized word appears in df['comment']
    word_freq = defaultdict(int)
    for text in df['comment']:
        words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
        new_words = [word for word, offset in words_with_offsets]
        for word in new_words:
            word_freq[word] += 1
    return word_freq
That completes the function.
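If you want to sanity-check it first, you can call it on a small throwaway DataFrame; the sample data below is made up for illustration. Note that pre_tokenize_str returns (word, (start, end)) tuples, which is why the comprehension inside the function discards the offsets.

# Hypothetical sanity check: a tiny DataFrame with a 'comment' column
sample = pd.DataFrame({"comment": ["Hello world", "Hello again"]})
# The pre-tokenizer splits on whitespace and punctuation, so this prints
# something like: defaultdict(<class 'int'>, {'Hello': 2, 'world': 1, 'again': 1})
print(get_word_freq_from_df(sample))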
Since you want to repeat this chunk once per dataframe, the code becomes:
word_freqs = []
for df in dfs:
    word_freq = get_word_freq_from_df(df)
    word_freqs.append(word_freq)
If you rewrite this as a list comprehension,
word_freqs = [get_word_freq_from_df(df) for df in dfs]
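That also explains why your original attempt failed: word_freqs = defaultdict(int) creates a dictionary, and dictionaries have no append method; append belongs to lists. A minimal sketch of the difference, reusing the names above:

# A defaultdict(int) is a dict of counters, not a list
counts = defaultdict(int)
counts["hello"] += 1   # fine: a missing key starts at 0 and is incremented
# counts.append(x)     # AttributeError: 'collections.defaultdict' object has no attribute 'append'

# The per-dataframe results go into a list, so indexing works as you wanted:
# word_freqs[0] is the counter for dfs[0], ..., word_freqs[8] for dfs[8]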