Understanding what’s a token and the main problems related to the tokenization process
According to the Oxford dictionary, a token is “a thing serving as a visible or tangible representation of a fact, quality, feeling, etc.”. But what’s a token from a Natural Language Processing perspective?
It may seem a very basic discussion, but you will see that the subject is not yet settled, and discussions about the tokenization process will rise from the dead to haunt you quite often.
What’s a token?
For general purposes, the answer to the question “what’s a token?” is: a token is a well-defined unit inside a string, and the process of dividing a string into tokens is called “tokenization”*.
Another definition of tokenization comes from Stanford1. According to them, “given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation”.
For instance, in the sentence “She came here yesterday”, the tokens could be split into “She”, “came”, “here”, “yesterday”.
You just consider that anything between spaces is a single token. It makes sense because words are units of meaning, so they can be treated as single tokens.
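In code, this naive rule is just a whitespace split. A minimal sketch in Python (the function name is mine, not a standard API):

```python
# Naive whitespace tokenizer: anything between spaces is one token.
def whitespace_tokenize(text):
    return text.split()

print(whitespace_tokenize("She came here yesterday"))
# ['She', 'came', 'here', 'yesterday']
```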
So, what’s a token? Is a token a word?
No, a token is not a word. But sometimes it can be a word. In general, a token can be a single unit of meaning or anything comprised between spaces and punctuation in a string, including the punctuation itself.
For instance, a comma can be a token. A period can be a token. The word “token” can be a token. But words such as “aren’t” are tricky because depending on the case, it could be considered one single token or two merged tokens†.
If you take Stanford’s example, you will see that there are other ways to tokenize “aren’t” too, including the variations “aren’t”, “arent”, “are n’t” and “aren t”.
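For illustration, each of those variants can be produced with a line or two of Python. The rules below are arbitrary sketches, not the splitting logic that any particular tokenizer actually uses:

```python
import re

text = "aren't"

keep_whole = [text]                                        # ["aren't"]
strip_apostrophe = [text.replace("'", "")]                 # ["arent"]
split_clitic = [t for t in re.split(r"(n't)", text) if t]  # ["are", "n't"]
split_on_apostrophe = text.split("'")                      # ["aren", "t"]
```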
If a token can be anything, how to tokenize it correctly?
There’s no such thing as “how to tokenize correctly”. Tokenization depends on a lot of factors, including the language itself. For instance, in Portuguese, we have what’s called a “compound word”.
Compound words can be formed by agglutination or by juxtaposition. The first occurs when two or more words or radicals are merged to form a new token. During the process, one of the original tokens can lose a letter. So, what’s a token in this context?
In the case of juxtaposition, two or more tokens are connected by a hyphen and keep the spelling and accentuation they had before the composition, but the meaning of the compound changes.
So, an example of agglutination is the token “aguardente” (liquor or brandy), which merges the words for “water” and “burning”:
Aguardente (brandy) = Água (water) + Ardente (burning)
And an example of juxtaposition is the word “arco-íris” (rainbow), formed by the tokens “arco” (bow) and “íris” (iris):
Arco-íris (rainbow) = “Arco” (bow) + “Íris” (iris)
In the first case, it’s easy to define what’s a token because you just treat the whole word “aguardente” as one thing. But if you tokenize “arco-íris” with an English-language tokenizer, you will get “arco”, “-” and “íris”.
You completely lose the meaning of the word: the isolated tokens “arco” and “íris” have nothing to do with an “arco-íris”.
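One way around this, sketched below under the assumption that hyphenated compounds should stay together, is a regex that treats letters joined by hyphens as a single token (in Python 3, `\w` already matches accented characters such as “í”):

```python
import re

# Treat hyphen-joined words like "arco-íris" as one token.
def tokenize_pt(text):
    return re.findall(r"\w+(?:-\w+)*", text)

print(tokenize_pt("O arco-íris apareceu"))
# ['O', 'arco-íris', 'apareceu']
```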
There’s no perfect tokenizer. There’s only the tokenizer that suits your needs 😛 However, it’s important to be consistent and to always use the same tokenizer inside the same project1.
Why is tokenization an important thing?
A single character, like the letter “b”, often carries no meaning on its own. The word, or token, is therefore the smallest piece of meaning in a string.
You usually count words, not characters. You analyze meaning by looking at words, not letters. Things such as syntax, entity relationships, and logic are built on top of words (or tokens).
Remember that words are individual units that, together, form one big string.
However, a letter can be a token. That’s the case of the indefinite article “a” in English, of the definite articles “o” and “a” in Portuguese, and of the pronoun “y” in French. It’s all a question of semantics.
Finding word associations is also something that depends on the tokenization process. First, you have to tokenize a corpus; only then can you discover how tokens are associated with each other.
Thanks to tokenization, it’s possible to create word embeddings, get word frequency and apply other algorithms to text.
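As a small example of that last point, once a text is tokenized, counting word frequency is nearly a one-liner (the toy corpus below is made up for illustration):

```python
from collections import Counter

corpus = "the cat saw the dog and the dog saw the cat"
freq = Counter(corpus.split())

print(freq.most_common(1))
# [('the', 4)]
```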
So, as you just saw, it’s hard to define what’s a token. It depends on the context, the language, and on the tokenizer that you are using.
- *Not to be confused with the process of encrypting sensitive data in a database.
- †In general, contractions are considered one single token because they form a unit of meaning. However, some tokenizers may expand contractions, transforming “aren’t” into “are not”.
- 1. Tokenization. Stanford NLP Group, Introduction to Information Retrieval. Accessed January 27, 2021. https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html