Tokenizer

A tokenizer segments the input text into a sequence of tokens. As shown in Figure 8.4, CLAMP-Cancer provides three tokenizer models:

  1. DF_CLAMP_Tokenizer
  2. DF_OpenNLP_Tokenizer
  3. DF_Tokenize_by_spaces

Each model is described in more detail below.

Figure 8.4: Three tokenizers and their configuration files
DF_CLAMP_Tokenizer

DF_CLAMP_Tokenizer is the default tokenizer designed specifically for clinical notes. Advanced users can use the config.conf file to change the default tokenization.

  1. To replace the default file:
    1. Double click on the config.conf file to open it
    2. Click on the button with three dots to browse for your own file
    3. Click on the Open button
Figure: How to replace the default file
DF_OpenNLP_Tokenizer

This is an OpenNLP tokenizer. Advanced users can use its config.conf file to change its default model, en-token.bin.

  1. To replace the default model:
    1. Double click on the config.conf file to open it
    2. Click on the button with three dots to browse for your own file
    3. Click on the Open button
Figure: How to replace the default model
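If you want to check a custom model outside of CLAMP-Cancer before plugging it in, the sketch below shows how a tokenizer model such as en-token.bin can be loaded with the standard OpenNLP API. This is only an illustrative Java sketch; the file path and the sample sentence are placeholders, not part of CLAMP-Cancer.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TokenizerCheck {
        public static void main(String[] args) throws Exception {
            // Load the tokenizer model; "en-token.bin" is the default OpenNLP English
            // model mentioned above. Replace the path with your own model file.
            try (InputStream modelIn = new FileInputStream("en-token.bin")) {
                TokenizerModel model = new TokenizerModel(modelIn);
                TokenizerME tokenizer = new TokenizerME(model);

                // Tokenize a sample sentence (placeholder text) and print one token per line.
                String[] tokens = tokenizer.tokenize("The patient denies chest pain or dyspnea.");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        }
    }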
DF_Tokenize_by_spaces

This tokenizer splits a sentence into tokens at the spaces between words.
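As an illustration only (not CLAMP-Cancer's actual implementation), splitting on whitespace amounts to the following Java sketch:

    public class SpaceTokenizerSketch {
        // Split a sentence into tokens at runs of whitespace.
        public static String[] tokenize(String sentence) {
            return sentence.trim().split("\\s+");
        }

        public static void main(String[] args) {
            // Sample sentence (placeholder text); prints one token per line.
            for (String token : tokenize("Patient reports mild headache since Tuesday.")) {
                System.out.println(token);
            }
        }
    }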