Tokenizer

A tokenizer segments text into a sequence of tokens. As shown in Figure 8.4, CLAMP-Cancer provides three tokenizer models:

DF_CLAMP_Tokenizer
DF_OpenNLP_Tokenizer
DF_Tokenize_by_spaces

Each model is described in more detail below.

Three tokenizers and their configuration files

DF_CLAMP_Tokenizer

DF_CLAMP_Tokenizer is the default tokenizer designed specifically for clinical notes. Advanced users can use the config.conf file to change the default tokenization.

To replace the default file:
1. Double-click the config.conf file to open it.
2. Click the button with three dots to browse for your own file.
3. Click the Open button.
  
  How to replace the default file

DF_OpenNLP_Tokenizer

This tokenizer is based on OpenNLP. Advanced users can use its config.conf file to replace the default model, en-token.bin.

To replace the default model:
1. Double-click the config.conf file to open it.
2. Click the button with three dots to browse for your own model file.
3. Click the Open button.
  
  How to replace the default model

DF_Tokenize_by_spaces

This tokenizer uses the spaces in a sentence to separate the tokens.
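
To illustrate the idea, here is a minimal Python sketch of space-based tokenization. It is not CLAMP-Cancer's actual implementation: the function name tokenize_by_spaces and the (token, start, end) output shape are hypothetical, and treating tabs and newlines as separators alongside spaces is an assumption.

```python
import re


def tokenize_by_spaces(sentence):
    """Split a sentence on whitespace (illustrative sketch only).

    Returns a list of (token, start, end) tuples, where start and end
    are character offsets into the original sentence. The name and the
    return shape are hypothetical, not part of CLAMP-Cancer.
    """
    # \S+ matches each maximal run of non-whitespace characters, so
    # consecutive spaces never produce empty tokens.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


if __name__ == "__main__":
    for token, start, end in tokenize_by_spaces("BP 120/80 mmHg."):
        print(token, start, end)
```

Note that punctuation stays attached to the neighboring word under this scheme: in the sample sentence, "mmHg." is a single token that includes the period.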