I don't know how to preprocess sentences and would appreciate your help.

I am currently learning how to retrieve and preprocess sentence data in Jupyter Notebook using Python. I succeeded in getting the sentence data and am now preprocessing it, but I'm having a hard time with this step.

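There are three things I would like to do in preprocessing:

1. Remove the text before and after the body by using its first and last sentences as delimiters with the split() method
2. Remove unnecessary strings and symbols using string replacement or regular expressions
3. After removing the unnecessary strings, split the text into sentences at the full stops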

I entered the code as shown in the reference book, but it didn't work properly. I'd appreciate it if the experts could point out what's wrong.
(By the way, I had already loaded the text data into original_text before running this code.)

import re

first_sentence = '私は、その男の写真を三葉、見たことがある。'  # "I've seen a picture of the man before."
last_sentence = '神様みたいないい子でした'  # "He was a good boy like a god"
_, text = original_text.split(first_sentence)
text, _ = text.split(last_sentence)
text = first_sentence + text + last_sentence

text = text.replace('|', '').replace(' ', '')
text = re.sub('《\w+》', '', text)
text = re.sub('[#\w+]', '', text)
text = text.replace('\r', '').replace('\n', '')
text = re.sub('[、「」?]', '', text)
text = re.sub('(\w+)', '', text)
text = re.sub('[\w+]', '', text)

sentences = text.split('。')
print('Number of statements:', len(sentences))
sentences[:10]

python jupyter-notebook

2022-09-30 19:19

1 Answer

If you distinguish correctly between full-width and half-width symbols, as pointed out in the comments, the text is processed correctly.
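To see why the width matters, here is a minimal sketch. The sample string is a hypothetical line written in the Aozora Bunko ruby notation, not a line taken from the actual file. It shows that the half-width and full-width vertical bars are different characters, so replacing one leaves the other in place.

```python
import unicodedata

half_width = '|'   # U+007C
full_width = '｜'  # U+FF5C

# Hypothetical sample line: a full-width bar marks where the ruby base starts
sample = '｜人間《にんげん》'

# Replacing the half-width bar does not touch the full-width bar
after_half = sample.replace(half_width, '')
# Replacing the full-width bar removes it
after_full = sample.replace(full_width, '')

# Print the Unicode names of both characters, then the two results
print(unicodedata.name(half_width), '/', unicodedata.name(full_width))
print(after_half)
print(after_full)
```

The same applies to the space: the indentation at the start of a paragraph in the text file is a full-width space, which a replace on a half-width space does not remove.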

Remove the text before and after the body by using its first and last sentences as delimiters with the split() method

This part works.
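As a side note, unpacking the result of split() into two names raises a ValueError with a rather unhelpful message when the delimiter is missing or appears more than once. A slightly more defensive variant could use str.partition and str.rpartition instead. This is only a sketch: extract_body is a hypothetical helper name, and unlike split() it cuts at the first occurrence of the opening sentence and the last occurrence of the closing sentence.

```python
def extract_body(original_text, first_sentence, last_sentence):
    """Return the text from first_sentence through last_sentence, inclusive."""
    # Cut at the first occurrence of the opening sentence
    _, found_first, rest = original_text.partition(first_sentence)
    if not found_first:
        raise ValueError('first_sentence not found in the text')
    # Cut at the last occurrence of the closing sentence
    body, found_last, _ = rest.rpartition(last_sentence)
    if not found_last:
        raise ValueError('last_sentence not found in the text')
    return first_sentence + body + last_sentence


# Hypothetical stand-in for the downloaded file contents
demo_text = 'header START middle END footer'
demo_body = extract_body(demo_text, 'START', 'END')
print(demo_body)
```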

Remove unnecessary strings and symbols using string replacement or regular expressions

Please check your text.replace calls against the sample code below.

After removing the unnecessary strings, split the text into sentences at the full stops

This part works.
Note that if you split only at the full stop (。), each heading is joined to the body sentence that follows it.
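One possible way to keep headings separate is to turn each heading annotation into a full stop before the remaining annotations are stripped. The following is a rough sketch, not part of the reference book's code. The function name split_sentences and the sample text are hypothetical, and the pattern assumes headings are annotated in the form ［＃「…」は大見出し］ (or 中見出し / 小見出し) as in the excerpt from the file.

```python
import re

# Hypothetical sample modeled on the excerpt: a sentence, a page break,
# an indented heading with its annotation, then the first body sentence
sample = (
    '見たことが無かった。\n'
    '［＃改ページ］\n\n'
    '［＃３字下げ］第一の手記［＃「第一の手記」は大見出し］\n\n'
    '　恥の多い生涯を送って来ました。'
)


def split_sentences(text, keep_headings_separate):
    if keep_headings_separate:
        # Replace the heading annotation with a full stop so the heading ends there
        text = re.sub(r'［＃「[^」]+」は[大中小]見出し］', '。', text)
    # Remove all remaining annotations such as page breaks and indents
    text = re.sub(r'［＃[^］]+］', '', text)
    # Remove full-width spaces and line breaks
    text = text.replace('　', '').replace('\r', '').replace('\n', '')
    # Split at full stops and drop empty pieces
    return [s for s in text.split('。') if s]


merged = split_sentences(sample, keep_headings_separate=False)
separate = split_sentences(sample, keep_headings_separate=True)
print(merged)
print(separate)
```

With keep_headings_separate=False the heading stays glued to the following sentence; with True it becomes its own list element, which can then be filtered out if headings are not wanted.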

Sample Code

Before running the code, download the zip file of the text file (with ruby) from Aozora Bunko and extract it.
Set file_path to the path of the text file produced by the extraction.

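import re

# TODO: rewrite file_path as needed
file_path = r'ningen_shikkaku.txt'
with open(file_path, 'r', encoding='shift_jis') as f:
    original_text = f.read()

# Opening and closing sentences of the body; they must match the file text exactly
first_sentence = '私は、その男の写真を三葉、見たことがある。'  # "I've seen a picture of the man before."
last_sentence = '神様みたいないい子でした'  # "He was a good boy like a god"
_, text = original_text.split(first_sentence)
text, _ = text.split(last_sentence)
text = first_sentence + text + last_sentence

# Full-width vertical bar (ruby start marker) and full-width space
text = text.replace('｜', '').replace('　', '')
# Ruby readings in full-width double angle brackets
text = re.sub(r'《\w+》', '', text)
# Annotations such as page breaks and headings, in full-width square brackets
text = re.sub(r'［＃[^］]+］', '', text)
text = text.replace('\r', '').replace('\n', '')
# Full-width comma, quotation brackets, and question mark
text = re.sub(r'[、「」？]', '', text)
text = re.sub(r'（\w+）', '', text)
text = re.sub(r'［\w+］', '', text)

# Split at the full-width full stop
sentences = text.split('。')
print('Number of statements:', len(sentences))
print(sentences[45:49])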

Output

Number of statements: 1177
There must be something more expression and impression on what we call death, but if you put a horse's neck on the human body, it would look like this, anyway, it would make the viewer shudder and disgusting
I had never seen such a strange man's face before.
I've lived a life full of shame.

Text before processing (excerpt from the Aozora Bunko text file (with ruby) of Osamu Dazai's "No Longer Human" (Ningen Shikkaku))

There must be something more expression and impression on what we call a "death face," but if we put a horse's neck on the human body, it would be like this, anyway, it would make the viewer shudder and disgusting. I had never seen such a strange man's face before.
[#New page]

[#Indent 3 characters] The First Note [#"The First Note" is a large heading]

　I've lived a life full of shame.

2022-09-30 19:19
