How to use Tokenizer with punctuation?

Hey, this is a question I had that I answered myself after some
research. I can’t find a flair more applicable than ‘Question’ so I
will just answer it myself haha.

I was trying to use tf.keras.preprocessing.text.Tokenizer to
train a model for a language task. I wanted my model to include
certain punctuation in its output, like exclamation points and
commas and whatnot, and I wasn’t sure how to do this.

I figured that since the default value for filters in the
Tokenizer constructor is:

filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

then I would just have to remove from that string the punctuation
I want to be recognized. I then spent a few hours training my model.

DON’T DO THIS. The Tokenizer only splits on whitespace, so it will
not treat the punctuation as separate tokens. Instead, your
vocabulary will be filled with entries such as ‘man’ vs ‘man.’ vs
‘man,’, etc., and these will all be separate tokens.
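
To make the problem concrete, here is a small pure-Python sketch that roughly mimics what the Tokenizer does to text before it builds the vocabulary: lowercase, replace every filter character with a space, then split on whitespace. Note that simple_tokenize is a hypothetical stand-in I wrote for illustration, not the actual Keras code, and the sentences are made up.

```python
# Default filter string of the Keras Tokenizer (punctuation plus tab and newline).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=DEFAULT_FILTERS):
    """Hypothetical stand-in: lowercase, blank out filter chars, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

# The mistake: only remove '.' and ',' from the filters and change nothing else.
reduced = DEFAULT_FILTERS.replace('.', '').replace(',', '')

sentences = ['A man.', 'The man, again.', 'One man']
vocab = set()
for s in sentences:
    vocab.update(simple_tokenize(s, filters=reduced))

print(sorted(vocab))
# → ['a', 'again.', 'man', 'man,', 'man.', 'one', 'the']
```

The punctuation stays glued to the word it follows, so ‘man’ ends up in the vocabulary three times and ‘.’ and ‘,’ never appear on their own.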

Instead, you should preprocess all of your sentences to put spaces
around any punctuation that you want to keep (you still need to
remove those characters from filters). This is how I did it:

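    def separate_punctuation(s, filters=',.()?'):
        # Pad every punctuation mark we want to keep with spaces
        # so it becomes its own whitespace-separated token.
        new_s = ''
        for char in s:
            if char in filters:
                new_s += ' ' + char + ' '
            else:
                new_s += char
        # Collapse the doubled spaces that the padding leaves behind.
        return ' '.join(new_s.split())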

This way ‘Hello neighbor, how are you?’ will become ‘Hello
neighbor , how are you ?’. Each punctuation mark then takes up only
one entry in your vocabulary, and your model will generalize much,
much better.
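
Putting both pieces together, here is a rough end-to-end sketch: pad the punctuation with spaces, and also take those same characters out of the filter string so they are not stripped. Again, simple_tokenize is only my hypothetical approximation of the Tokenizer's preprocessing, and KEEP is just a name I picked for the set of characters to keep.

```python
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
KEEP = ',.()?'  # punctuation that should survive as its own token (assumed set)

def separate_punctuation(s, filters=KEEP):
    """Surround each kept punctuation mark with spaces."""
    padded = ''.join(' ' + ch + ' ' if ch in filters else ch for ch in s)
    return ' '.join(padded.split())

def simple_tokenize(text, filters=DEFAULT_FILTERS):
    """Hypothetical stand-in: lowercase, blank out filter chars, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

# Drop the kept characters from the filters; with Keras this string
# would be passed as Tokenizer(filters=reduced).
reduced = ''.join(ch for ch in DEFAULT_FILTERS if ch not in KEEP)

tokens = simple_tokenize(separate_punctuation('Hello neighbor, how are you?'),
                         filters=reduced)
print(tokens)
# → ['hello', 'neighbor', ',', 'how', 'are', 'you', '?']
```

If you skip the second step and leave the default filters in place, the padded punctuation gets filtered out again and you are back to a vocabulary with no punctuation at all.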

Hope this saves someone else’s time.

submitted by /u/LivingPornFree
