I am currently learning how to retrieve and preprocess sentence data in Jupyter Notebook using Python. I succeeded in getting the sentence data and am now preprocessing it, but this is where I'm stuck.
There are three things I would like to do with preprocessing:
I entered the code as shown in the reference book, but it didn't work properly. I'd appreciate it if someone experienced could point out what's wrong.
(By the way, the text data had already been loaded into original_text before I ran this code.)
import re

# First and last sentences of the body, used as delimiters
# ("I've seen a picture of the man before." / "He was a good boy like a god")
first_sentence = '私は、その男の写真を三葉、見たことがある。'
last_sentence = '神様みたいないい子でした'
_, text = original_text.split(first_sentence)
text, _ = text.split(last_sentence)
text = first_sentence + text + last_sentence

# Full-width vertical bar and full-width space
text = text.replace('｜', '').replace('\u3000', '')
# Ruby readings 《...》 and annotations ［＃...］
text = re.sub(r'《\w+》', '', text)
text = re.sub(r'［＃\w+］', '', text)
text = text.replace('\r', '').replace('\n', '')
text = re.sub('[、「」？]', '', text)
text = re.sub(r'（\w+）', '', text)
text = re.sub(r'［\w+］', '', text)

# Split at the full-width full stop
sentences = text.split('。')
print('Number of statements:', len(sentences))
sentences[:10]
If you distinguish full-width symbols from half-width symbols, using the comments as a reference, the text will be processed correctly.
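To illustrate why the full-width/half-width distinction matters, here is a minimal sketch. The sample string and the variable names are assumptions made up for this example, not taken from the actual file. With half-width brackets, the pattern is read as a character class and deletes individual characters; with full-width brackets, the brackets are ordinary characters and the pattern matches one whole annotation.

```python
import re

# Hypothetical sample line in Aozora Bunko markup (made up for illustration).
sample = '［＃３字下げ］第一の手記［＃「第一の手記」は大見出し］'

# Half-width [ ] form a character class: every half-width '#', '+' and
# word character is deleted one by one, and the full-width symbols remain.
half_width = re.sub(r'[#\w+]', '', sample)

# Full-width ［ ］ are literal characters, so this removes each whole
# ［＃...］ annotation, whatever it contains.
full_width = re.sub(r'［＃[^］]+］', '', sample)

print(repr(half_width))
print(repr(full_width))
```

In this sketch only the full-width pattern leaves the heading text intact.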
Cut out the body with the split() method, using its first and last sentences as delimiters
It's working.
Remove unnecessary strings and symbols using string replacement or regular expressions
Please review the text.replace calls below with reference to the sample code.
After removing the unnecessary strings, split the text into sentences at the full stop (。)
It's working.
However, if you split only at full stops, a heading is merged into the body sentence that follows it.
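One possible way to keep a heading from being merged into the next sentence is to turn the heading annotation into a full stop before the other annotations are removed. The following is only a sketch under assumptions: the function name split_sentences, the sample text, and the assumption that headings are always marked with an annotation of the form ［＃「…」は…見出し］ are mine, not from the reference book.

```python
import re

def split_sentences(text):
    """Split Aozora-style text into sentences, keeping headings separate."""
    # Assumption: a heading line ends with an annotation such as
    # ［＃「第一の手記」は大見出し］. Replace it with a full stop so that
    # the heading becomes an item of its own.
    text = re.sub(r'［＃「[^」]*」は[^］]*見出し］', '。', text)
    # Remove all remaining ［＃...］ annotations.
    text = re.sub(r'［＃[^］]+］', '', text)
    # Remove line breaks, then split at the full-width full stop.
    text = text.replace('\r', '').replace('\n', '')
    return [s for s in text.split('。') if s]

# Hypothetical sample (made up for illustration).
sample = (
    '前の段落の最後の文。\r\n'
    '［＃改ページ］\r\n'
    '［＃３字下げ］第一の手記［＃「第一の手記」は大見出し］\r\n'
    '恥の多い生涯を送って来ました。\r\n'
)
print(split_sentences(sample))
```

If headings should not appear in the result at all, the same pattern could be used to drop the whole heading line instead of replacing the annotation.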
Download and extract the zip file of the text file (with ruby annotations) from Aozora Bunko before running. Rewrite file_path to the path of the text file generated after extraction.
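If you prefer to do the extraction from Python rather than by hand, it could look roughly like the sketch below. The function names extract_text_file and read_text and the file name in the usage comment are hypothetical; the sketch also assumes the zip has already been downloaded and that the text file is encoded in Shift_JIS, as in the code that follows.

```python
import zipfile
from pathlib import Path

def extract_text_file(zip_path, dest_dir):
    """Extract the first .txt member of the zip and return its path."""
    with zipfile.ZipFile(zip_path) as zf:
        txt_names = [n for n in zf.namelist() if n.lower().endswith('.txt')]
        if not txt_names:
            raise FileNotFoundError('no .txt file found in ' + str(zip_path))
        # ZipFile.extract returns the path of the extracted file.
        return Path(zf.extract(txt_names[0], path=dest_dir))

def read_text(file_path):
    """Read a text file, assuming Shift_JIS encoding."""
    with open(file_path, 'r', encoding='shift_jis') as f:
        return f.read()

# Usage (the zip file name is a placeholder for the file you downloaded):
# file_path = extract_text_file('downloaded.zip', '.')
# original_text = read_text(file_path)
```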
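import re

# TODO: rewrite file_path as needed
file_path = r'ningen_shikkaku.txt'
with open(file_path, 'r', encoding='shift_jis') as f:
    original_text = f.read()

# First and last sentences of the body, used as delimiters
# ("I've seen a picture of the man before." / "He was a good boy like a god")
first_sentence = '私は、その男の写真を三葉、見たことがある。'
last_sentence = '神様みたいないい子でした'
_, text = original_text.split(first_sentence)
text, _ = text.split(last_sentence)
text = first_sentence + text + last_sentence

# Remove full-width vertical bars (ruby start markers) and full-width spaces
text = text.replace('｜', '').replace('\u3000', '')
# Remove ruby readings: 《...》
text = re.sub(r'《\w+》', '', text)
# Remove annotations: ［＃...］ (anything up to the closing full-width bracket)
text = re.sub(r'［＃[^］]+］', '', text)
# Remove line breaks
text = text.replace('\r', '').replace('\n', '')
# Remove the full-width symbols 、「」？
text = re.sub('[、「」？]', '', text)
# Remove text in full-width parentheses: （...）
text = re.sub(r'（\w+）', '', text)
# Remove text in full-width square brackets: ［...］
text = re.sub(r'［\w+］', '', text)

# Split into sentences at the full-width full stop
sentences = text.split('。')
print('Number of statements:', len(sentences))
print(sentences[45:49])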
Number of statements: 1177
There must be something more expression and impression on what we call death, but if you put a horse's neck on the human body, it would look like this, anyway, it would make the viewer shudder and disgusting
I had never seen such a strange man's face before.
I've lived a life full of shame.
There must be something more expression and impression on what we call a "death face," but if we put a horse's neck on the human body, it would be like this, anyway, it would make the viewer shudder and disgusting. I had never seen such a strange man's face before.
[#New page]
[#3-character indent] The first note [#"The first note" is a large heading]
I've lived a life full of shame.