Sentence Detector

A sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. A sentence detector can use different methods to identify sentence boundaries. As shown in the figure below, CLAMP-Cancer provides three sentence detection models:

  1. DF_CLAMP_Sentence_Detector
  2. DF_CLAMP_Sentence_by_newline
  3. DF_CLAMP_OpenNLP_sentence_detector

Each model is described in detail in the following sections.

Figure: Three sentence detectors and their configuration files

DF_CLAMP_Sentence_Detector

DF_CLAMP_Sentence_Detector is the default sentence detector in CLAMP-Cancer. It is designed specifically for clinical notes and takes into account the distinctive characteristics of sentences found in clinical texts. To configure the DF_CLAMP_Sentence_Detector, click its config file. A pop-up window opens in which you can modify two parameters: Medical Abbreviation and Max Sentence Length.

Medical Abbreviation:

Some medical abbreviations have a punctuation mark at the beginning (.NO2), while others have one at the end (spec.). Providing a list of such abbreviations helps the detector identify sentences more accurately. By default, CLAMP-Cancer provides a comprehensive list of medical abbreviations, which can be found in this file: defaultAbbrs.txt
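
To illustrate why an abbreviation list helps, here is a minimal Python sketch of abbreviation-aware splitting. It is not the CLAMP-Cancer implementation: the function name `split_sentences`, the sample entries in `ABBREVIATIONS` (only `spec.` comes from the text above), and the rule "a mark followed by whitespace ends a sentence" are all assumptions made for illustration.

```python
import re

# Hypothetical sample list, stored in lowercase. In CLAMP-Cancer the real
# list is kept in defaultAbbrs.txt; its exact format is not shown here.
ABBREVIATIONS = {"spec.", "pt.", "dr."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on '.', '!' or '?' followed by whitespace (or end of text),
    unless the token ending at that mark is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        # The whitespace-delimited token that ends at this punctuation mark.
        token = text[start:end].split()[-1]
        if token.lower() in abbreviations:
            continue  # the mark belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that has no closing punctuation mark.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

note = "The spec. was sent to the lab. Patient is stable."
print(split_sentences(note))                       # abbreviation-aware
print(split_sentences(note, abbreviations=set()))  # naive splitting
```

With an empty abbreviation set, the sketch breaks the note after "spec."; with the list in place, that period is treated as part of the abbreviation. A mark at the start of a token, as in .NO2, is never followed by whitespace, so this particular rule does not split on it.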

  1. To replace the abbreviation file:
    1. Double-click the config.conf file to open it
    2. Click the button with three dots to browse for your own file
    3. Click the Open button
      Figure: How to replace the abbreviation file
  2. To edit the current file:
    1. Double-click the defaultAbbrs.txt file to open it
    2. Add the terms that you want to include in the abbreviation file
    3. Click the Save button on the toolbar
      Figure: How to replace the abbreviation file

Max Sentence Length:

Selecting the "Break long sentences or not?" checkbox lets you break long sentences according to the number of words you enter in the input text box. Please refer to the figure below for more information.
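
The effect of a maximum sentence length can be sketched in Python as follows. How CLAMP-Cancer actually chooses the break points is not described here, so this sketch assumes the simplest rule: words are whitespace-delimited and a long sentence is cut into consecutive chunks of at most `max_words` words. The function name `break_long_sentence` is hypothetical.

```python
def break_long_sentence(sentence, max_words):
    """Split a sentence into consecutive chunks of at most max_words words.

    Assumption: words are separated by whitespace; the real detector may
    tokenize and choose break points differently.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = sentence.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

for chunk in break_long_sentence("patient denies fever chills nausea vomiting or diarrhea", 3):
    print(chunk)
```

A sentence that already fits within the limit comes back as a single chunk, so under this assumption the option only affects sentences longer than the configured value.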

Figure: Interface for config.conf of the DF_CLAMP_Sentence_Detector

DF_CLAMP_Sentence_by_newline

This detector will identify new sentences using the line breaks in the file, i.e., each line in the file is treated as a single sentence.
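
The line-per-sentence behavior can be sketched in a few lines of Python. The function name `sentences_by_newline` is hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions of this sketch rather than documented behavior of the detector.

```python
def sentences_by_newline(text):
    """Treat every non-empty line of the input as one sentence.

    Assumption: blank lines are skipped and surrounding whitespace is
    trimmed; the actual detector may handle these cases differently.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]

note = "BP 120/80.\n\nNo acute distress\nFollow up in 2 weeks.\n"
for sentence in sentences_by_newline(note):
    print(sentence)
```

This approach suits notes in which each line is already a self-contained statement, such as lists or templated fields, because it does not depend on punctuation at all.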

DF_CLAMP_OpenNLP_sentence_detector

This is an OpenNLP sentence detector. Advanced users can change its default model through its config.conf file.

  1. To replace the default model:
    1. Double-click the config.conf file to open it
    2. Click the button with three dots to browse for your own file
    3. Click the Open button
      Figure: How to replace the default model