A tokenizer segments text into a sequence of tokens. As shown in Figure 8.4, CLAMP-Cancer provides three tokenizer models, each of which is described in more detail below.
DF_CLAMP_Tokenizer is the default tokenizer, designed specifically for clinical notes. Advanced users can edit its config.conf file to change the default tokenization behavior.
The OpenNLP tokenizer relies on a trained OpenNLP model, en-token.bin by default. Advanced users can edit its config.conf file to substitute a different model.
The whitespace tokenizer splits a sentence into tokens wherever a space occurs.
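To make the space-based approach concrete, the following is a minimal sketch in Python. It is not CLAMP-Cancer's own implementation: the function name whitespace_tokenize and the sample sentence are hypothetical, and the sketch assumes that any run of whitespace (spaces, tabs, newlines) separates tokens and that character offsets are worth keeping alongside each token.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(sentence: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) triples, one per run of non-whitespace characters.

    Hypothetical helper for illustration only; the offsets index into `sentence`,
    with `end` exclusive.
    """
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


# Made-up sample sentence in the style of a clinical note.
sentence = "Pt denies chest pain, SOB."
tokens = whitespace_tokenize(sentence)

for text, start, end in tokens:
    print(f"{start:>2}-{end:<2} {text}")
```

In this sketch punctuation stays attached to the neighboring word, so "pain," and "SOB." each come out as a single token. That is the main practical limitation of splitting on spaces alone, and one reason a tokenizer tuned for clinical text may be preferable.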