This component consists of several feature extractors (Figure 9.1), each of which extracts a different type of feature for named entity recognition (NER). CLAMP-Cancer users use this component to build their own named entity recognizer in a corpus annotation project (see Section 4.2). As with the previous components, these features can be customized by changing or replacing their default config files. Each extractor is explained below:
Brown clustering is a type of word representation feature generated from the unlabeled data provided by the SemEval 2014 Challenge. Advanced users can replace the system's default Brown clustering file with their own.
For more information on how to create your own Brown clustering file, visit:
https://github.com/percyliang/brown-cluster
This extractor uses a dictionary consisting of terms and their semantic types from UMLS to extract potential features.
Advanced users can replace or edit the default file following the steps below:
Note: The content must follow the same format as the default file: a phrase, then a tab, then the semantic type.
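As an illustration of the format described in the note above, the following minimal sketch reads lines of the form "phrase, tab, semantic type" into a lookup table. It is not CLAMP-Cancer's own loader: the function name `load_dictionary`, the lowercasing of phrases, and the two sample entries are assumptions made for this example.

```python
import io


def load_dictionary(lines):
    """Parse 'phrase<TAB>semantic type' lines into a dict (hypothetical helper)."""
    entries = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        phrase, sep, semantic_type = line.partition("\t")
        if not sep or not phrase.strip() or not semantic_type.strip():
            raise ValueError(f"expected 'phrase<TAB>semantic type', got {line!r}")
        # Lowercasing the phrase is an assumption, made so lookups ignore case.
        entries[phrase.strip().lower()] = semantic_type.strip()
    return entries


# Invented sample content, standing in for a real dictionary file.
sample = io.StringIO("breast cancer\tproblem\ntamoxifen\tdrug\n")
dictionary = load_dictionary(sample)
```

Running a check like this on an edited file before using it may help catch lines where the tab separator was accidentally replaced by spaces.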
This module uses words together with their part-of-speech (POS) tags as NER features.
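To make the idea concrete, here is a minimal sketch of how words and their POS tags might be turned into features for one token. The window size, the feature names, and the function `word_pos_features` are assumptions for illustration only; the text does not specify how CLAMP-Cancer encodes these features.

```python
def word_pos_features(tokens, index, window=1):
    """Collect word and POS features around tokens[index].

    `tokens` is a list of (word, pos_tag) pairs for one sentence.
    """
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            word, tag = tokens[position]
            features[f"word[{offset}]"] = word.lower()
            features[f"pos[{offset}]"] = tag
    return features


# Invented example sentence with Penn Treebank style tags.
sentence = [("Patient", "NN"), ("received", "VBD"), ("tamoxifen", "NN")]
features = word_pos_features(sentence, 1)
```

Tokens at the sentence boundary simply receive fewer features, since positions outside the sentence are skipped.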
This function extracts the prefixes and suffixes of words, which may be indicative of a specific type of named entity.
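A minimal sketch of prefix and suffix extraction is shown below. The maximum affix length of 3, the lowercasing, and the name `affix_features` are assumptions for this example rather than CLAMP-Cancer settings.

```python
def affix_features(word, max_len=3):
    """Return prefixes and suffixes of up to max_len characters."""
    features = {}
    lowered = word.lower()
    for n in range(1, max_len + 1):
        # Only emit an affix that is strictly shorter than the word itself.
        if len(lowered) > n:
            features[f"prefix_{n}"] = lowered[:n]
            features[f"suffix_{n}"] = lowered[-n:]
    return features


features = affix_features("Tamoxifen")
```

Suffixes such as "fen" or "mab" can hint at drug names, which is the kind of signal this extractor is meant to capture.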
Similar to Brown clustering, random indexing is a type of word representation feature generated from unlabeled data using a third-party package. For more information, visit: https://jcheminf.springeropen.com
This function extracts the section in which a candidate named entity appears.
This function identifies the pattern of a sentence using CLAMP-Cancer's built-in rules.
Similar to Brown clustering and random indexing, this is a type of distributed word representation feature, generated by a neural network from the unlabeled data (MIMIC II) provided by the SemEval 2014 Challenge. Advanced users can replace the default file with their own file.
This function extracts the type of a word; for example, it identifies whether the word begins with an English letter, a number, etc.
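The check described above can be sketched as follows. The category labels and the function name `word_type` are hypothetical; the actual set of word types used by CLAMP-Cancer is not listed here.

```python
def word_type(word):
    """Classify a word by its first character (illustrative labels only)."""
    if not word:
        return "EMPTY"
    first = word[0]
    if first.isascii() and first.isalpha():
        return "LETTER"  # begins with an English letter
    if first.isascii() and first.isdigit():
        return "NUMBER"  # begins with a digit 0-9
    return "OTHER"       # punctuation, symbols, non-English characters


types = [word_type(w) for w in ["HER2", "5mg", "-3", ""]]
```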
This function extracts the regular expression patterns of words that may indicate a specific type of named entity. Advanced users can create their own regular expressions or edit the default file.
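As a rough illustration of word-level regular expression features, the sketch below maps pattern names to compiled expressions and reports which ones match a word in full. Both patterns, their names, and the function `regex_features` are invented for this example and do not reflect the contents or syntax of CLAMP-Cancer's default file.

```python
import re

# Hypothetical patterns; the default file's actual entries are not shown here.
PATTERNS = {
    "DOSE": re.compile(r"\d+(\.\d+)?(mg|ml|mcg|g)", re.IGNORECASE),
    "STAGE": re.compile(r"(I{1,3}|IV)[ABC]?"),
}


def regex_features(word):
    """Return the names of all patterns that match the whole word."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(word)]


matches = [regex_features(w) for w in ["50mg", "IIIA", "IV", "patient"]]
```

Using `fullmatch` rather than `search` keeps a pattern from firing on a word that merely contains a matching substring.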