Opentopia Directory Encyclopedia Tools

Tokenization

Encyclopedia : T : TO : TOK : Tokenization


In computer science, tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example, the following string. Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters.

The quick brown fox jumps over the lazy dog
A process of tokenization could be used to split the sentence into word tokens. Although following example is given as XML there are many ways to store tokenized input:


The
quick
brown
fox
jumps
over
the
lazy
dog

The term is also used when, during the parsing of source code of some programming languages, the symbols are converted into a shorter representation which uses less memory. Commands such as print may be mapped to a number representation. Most BASIC interpreters used this to save room, a command such as print would be replaced by a single number which uses much less room in memory. In fact most lossless compression systems use a form of tokenization, although it's typically not referred to as such.

In human cognition tokenization is often used to refer to the process of converting a sensory stimulus into a cognitive "token" suitable for internal processing. A stimulus that is not correctly tokenized may not be processed or may be incorrectly merged with other stimuli.

 


From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.

Search Titles
0123456789
ABCDEFGHIJ
KLMNOPQRST
UVWXYZ?

E-mail this article to:

Personal Message: