Improve training tokenization #8

shiffman · 2017-10-26T18:11:54Z

The current tokenization produces a lot of junk tokens because it doesn't account for apostrophes and other punctuation. It could probably be improved by using nltk, or just by being more thoughtful about the regular expressions?

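import re

# Read the training text, lowercase it, and flatten newlines to spaces
with open(path) as f:
    text = f.read().lower().replace("\n", " ")

# Split into sentences (this could be improved! Using nltk?)
sentences = re.split(r"[.?!]", text)

# Split each sentence into words! (this could also be improved!)
final_sentences = []
for sentence in sentences:
    # \W+ also splits on apostrophes ("don't" -> "don", "t") and leaves empty strings
    words = re.split(r"\W+", sentence)
    final_sentences.append(words)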
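
One possible direction for the "more thoughtful" option, sketched with only the standard `re` module: match word tokens directly instead of splitting on `\W+`, so internal apostrophes survive and empty strings never appear. The names `tokenize`, `WORD_RE`, and `SENTENCE_END_RE` are hypothetical and not part of the existing script, and the patterns assume plain ASCII English text. This is a rough sketch, not a tested replacement. Abbreviations such as "e.g." would still be split as sentence ends, which is where nltk's sentence tokenizer might do better.

```python
import re

# Hypothetical names for illustration; assumes ASCII English text.
# A word is a run of letters/digits that may contain internal
# apostrophes (straight or curly), e.g. "don't" or "o'clock".
WORD_RE = re.compile(r"[a-z0-9]+(?:['’][a-z0-9]+)*")

# A sentence ends at ., ? or ! followed by whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.?!])\s+")


def tokenize(text):
    """Return a list of sentences, each a list of lowercase word tokens."""
    text = text.lower().replace("\n", " ")
    sentences = []
    for sentence in SENTENCE_END_RE.split(text):
        words = WORD_RE.findall(sentence)
        if words:  # drop sentences that contain no word tokens
            sentences.append(words)
    return sentences


sample = "Don't panic! It's only a test.\nIsn't it?"
print(tokenize(sample))
```

Using `findall` rather than `split` means punctuation is simply never captured, so there is no need to filter out empty strings afterwards.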
