NER
Locate and classify named entities mentioned in unstructured text into pre-defined categories (person names, organizations, locations, time expressions, quantities, etc.)
Keywords: NER
Approaches: rule-based, machine learning
Tools: Stanford NER, Corleone, AllenNLP, Spacy, flair, BERT
NER applied to a newspaper front page (The New York Herald, 1888). Source: retronews.fr
Performing named entity recognition (NER) in unstructured natural language text, or in the text within metadata, is useful in a variety of use cases, from document management and knowledge organisation to information extraction and information retrieval.
The main tasks involve named entity recognition (identifying a portion of text as an entity), categorisation (determining the nature of the entity, e.g. a person's name), disambiguation/linking (resolving the entity unambiguously, e.g. linking a person's name to a unique referent) and relation extraction (discovering relations between named entities).
Real-world applications of NER in libraries include:
- information retrieval: named entities used as a resource for information retrieval use cases in digital libraries
- population of knowledge bases: enrichment of catalog records with named entity information; linking records between knowledge bases
- cross-lingual document clustering: documents mentioning the same entities are likely to be related
- summarization: named entities are informational "anchors" that help identify the key elements of a text
- anonymization: removing named entities (particularly person names) from documents
- text analytics: digital humanities, quantitative analysis, etc.
Recognition and categorisation of named entities make use of morphological, lexical or contextual features through rules, gazetteers and other linguistic resources. In real-life systems, these kinds of clues are never totally reliable (in particular for historical materials) and statistical models are needed.
The following sections present a variety of approaches and techniques for NER. This survey is a recent resource that is highly recommended reading.
The online demos of commercial AI and NLP platforms can give a broad sense of what NER produces when applied to heritage textual documents:
- Google Cloud Natural Language
- AllenNLP (Python open source library)
- SparkNLP
Text samples (e.g. taken from the New York Herald) can be copied and pasted into the demos. The following illustration shows an AllenNLP NER model applied to a paragraph of text.
AllenNLP NER model applied to newspaper content
These out-of-the-box systems have the advantage of being immediately operational, especially for the English language. On the other hand, since NER is known to be domain sensitive, they will not provide the best results on specialised or historical material.
Rule-based systems were the first approach used for NER. Manually crafted rules are generally expressed as regular expressions which combine morphological clues (like uppercase letters), lexical clues (names, titles) and contextual clues (local grammar).
The input text then traverses a finite-state automaton (execution is fully "automatic", but the rules are crafted by hand). When the automaton matches a rule, the action part of the rule is triggered to construct the named entity annotation.
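As a minimal illustration, the following sketch combines a tiny gazetteer of titles with a capitalisation clue in a single regular expression (the rule and the sample sentence are toy examples, not taken from any of the tools listed below):

```python
import re

# Toy gazetteer (lexical clue): a few person titles.
TITLES = r"(?:Mr|Mrs|Dr|Captain|President)"

# Rule: a title followed by one or two capitalised words
# (morphological clue) is annotated as a person name.
PERSON_RULE = re.compile(TITLES + r"\.?\s+(?:[A-Z][a-z]+\s?){1,2}")

text = "Captain Smith and Mr Bruce Ismay boarded the ship in Southampton."
for match in PERSON_RULE.finditer(text):
    print("PERS:", match.group().strip())
# PERS: Captain Smith
# PERS: Mr Bruce Ismay
```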
Several NER platforms support rule-based NER:
- GATE (graphical interface and Java API),
- Apache OpenNLP (Java library)
- SpaCy (Python library)
- NOOJ (Java app)
- [Unitex/GramLab](https://unitexgramlab.org/fr) (IDE)
The impresso NER tutorial includes a hands-on session on a rule-based system and its gazetteer component.
Pro/Cons:
- Developers and linguists create language-specific and domain-specific linguistic resources (gazetteers, sets of rules...), which can be very time consuming. These resources are used for development and evaluation.
- The developer is in control of the overall annotation pipeline.
With machine learning approaches, annotators create annotated text corpora according to the target typology of entities. A statistical model can then learn from these annotated data.
The data is used for training, development and evaluation; the developer only specifies the features, the statistical model and the learning algorithm.
Pro/Cons:
- Tools are language-independent and the annotation task does not require skilled linguists.
- But statistical NER typically requires a large amount of manually annotated training data (the annotated data, rather than the developer, is in control of the system's behaviour).
The next sections introduce various machine learning approaches to NER.
Conditional random fields (CRF) are a class of statistical modeling methods used for structured prediction (introductions to conditional random fields are listed in the resources section). A CRF model can take context into account and consider "neighboring" samples, an essential feature for text processing and particularly for NER, which can be cast as a sequence labeling problem. For natural language processing, linear-chain CRFs, which model sequential dependencies between predictions, are popular. Skip-chain CRFs, another variant, can handle long-distance dependencies in the text flow.
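For reference, a linear-chain CRF defines the probability of a label sequence y given an observed token sequence x through feature functions f_k weighted by learned parameters, with a normalising partition function Z(x):

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)
```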
Stanford NLP Group's named entity recognizer is an implementation of linear-chain CRF sequence models. Stanford NER is available for download and the package includes components for command-line invocation, running as a server, and a Java API. It can also be tested online.
Stanford NER was used to apply NER to heritage newspapers during the Europeana Newspapers project. Annotated datasets (BIO tagging scheme) for a variety of languages can be downloaded from the project's GitHub. A model for French, trained on 200 chunks of 1,000 words each extracted from newspapers (1870-1945), is available on api.bnf.fr. A classical beginning-inside-outside (BIO) tagging scheme is used to distinguish multiple adjacent instances of the same type of named entity from a single named entity spanning multiple words. After downloading the EN-Stanford.zip archive, open a terminal and launch the Java annotator:
> java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier BnF.ner-model.ser.gz -outputFormat inlineXML -textFile sample.txt
It outputs annotated text (inline IOB or inline XML formats) with 3 named entity categories (Person, Location, Organization):
Le naufrage du Titanic constitue bien la
plus effroyable catastrophe maritime que
l'on ait eu à enregistrer jusqu'à présent.
Après les angoisses suscitées par les
premières dépêches annonçant l'accident, on
s'était remis à espérer. Les dépêches de
<I-LIEU>New-York</I-LIEU>, d'Halifax et
de <I-LIEU>Montréa</I-LIEU>
avaient en partie dissipé Jes terribles
appréhensions qui avaient d'abord étreint
tous les coeurs.
...
On y trouve cependant les noms de MM.
<I-PERS>Bruce Ismay</I-PERS>, président de la <I-ORG>White Star
Line</I-ORG>, J.-B. <I-PERS>Thayer</I-PERS>, président du
<I-ORG>Pensylvania Raiiroad</I-ORG>.
This blog post demonstrates how to use the CRF implementation provided by the sklearn-crfsuite Python package.
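A minimal sketch of the same idea, assuming the sklearn-crfsuite package (the feature function and the single training sentence below are illustrative only):

```python
import sklearn_crfsuite

def token_features(sent, i):
    """Hand-crafted morphological and contextual features for token i."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "prev.istitle": sent[i - 1].istitle() if i > 0 else False,
        "BOS": i == 0,
    }

# Toy training data: one tokenised sentence with BIO labels.
sentences = [["Bruce", "Ismay", "heads", "the", "White", "Star", "Line"]]
labels = [["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # predicted BIO tag sequence per sentence
```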
Pro/Cons:
- CRF has higher accuracy than other classical methods (hidden Markov models, MaxEnt).
- Training and inference can be slow.
- CRF, like all supervised training methods, requires data that has been annotated for a specific task. Enhancing supervised methods with unsupervised text representations can alleviate this issue. These representations can be trained on large unannotated corpora and can learn implicit semantic and syntactic information.
Rule-based and statistical NLP methods consider words as atomic symbols. Distributional semantics, on the contrary, tries to quantify and categorize semantic similarities between linguistic items based on their distributional properties in very large corpora of language data. The underlying idea can be summed up in this hypothesis: linguistic items with similar distributions have similar meanings.
The WebVectors demo lets you submit a word and get its nearest semantic associates. Similarity between words is calculated as a distance (cosine similarity) between their corresponding vectors, and reflects both syntactic and semantic relatedness.
Capturing these distributional characteristics and using them in practical applications (word prediction, survey analysis, recommendation, etc.) to measure similarity between words, phrases or documents can be done with a variety of techniques: from the vector space model (each word is represented by a vector whose dimension corresponds to the size of the vocabulary), which results in a very sparse, high-dimensional vector space, to word embedding techniques, which produce dense vectors that are a perfect input for numeric machine learning methods.
Word embedding techniques rely on a variety of approaches: neural network inspired, probabilistic, algebraic. word2vec (2013) is the most successful example of word embeddings.
This resource visually explains the underlying concepts and the way word embeddings are trained from text data.
This word2vec demo trains word embeddings in the browser, given an input text.
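A minimal sketch of the same process with the gensim library (the toy corpus below stands in for the millions of sentences a usable model is trained on):

```python
from gensim.models import Word2Vec

# Tokenised training corpus; real models need vastly more text.
corpus = [
    ["the", "ship", "sank", "near", "new", "york"],
    ["the", "liner", "sailed", "from", "new", "york"],
    ["the", "vessel", "left", "the", "harbour"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=100)

# Nearest semantic associates, ranked by cosine similarity.
print(model.wv.most_similar("ship", topn=3))
print(model.wv.similarity("ship", "liner"))
```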
For a NER task, the hypothesis is that word vectors belonging to the same NE category occur in close vicinity in the vector space of the word embeddings. Applying a classification approach to the word vectors learns a decision boundary between the NER classes, as sketched below. The next figure illustrates this hypothesis with the WebVectors demo: proper noun vectors are closely clustered.
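A minimal sketch of this idea, reusing the `model` from the gensim example above: a nearest-neighbour classifier trained on labelled word vectors assigns a NE category to an unseen word (words and labels are purely illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# Word vectors labelled with NE categories (toy data).
train_words = ["york", "harbour", "ship", "liner"]
train_labels = ["LOC", "LOC", "O", "O"]
X_train = [model.wv[w] for w in train_words]

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)
print(clf.predict([model.wv["vessel"]]))  # category of the nearest vector
```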
Pro/Cons:
- Pre-trained word embeddings can be used in NLP tasks that use small amounts of labeled data.
- word2vec (and other similar approaches like GloVe or fastText) develops a unique representation for each word (by synthesising its different possible contexts), which means polysemy and homonymy are not handled properly. Modern approaches (such as ELMo, BERT, GPT) produce a representation of each word in its particular context in the sentence.
- word2vec is vocabulary based. Most recent methods instead work with frequent sequences of characters (subwords), which allows them to also represent "out of vocabulary" words (for which no representation has been previously learned) by combining the representations of their parts, as illustrated after this list.
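A quick illustration of such subword tokenization with the Hugging Face transformers library (the checkpoint name is just one example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# An out-of-vocabulary word (here, an OCR error) is split into
# known subword units rather than mapped to a single unknown token.
print(tokenizer.tokenize("Pensylvania"))  # e.g. ['Pen', '##sy', '##lvania']
```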
Recurrent neural network (RNN) based models have been proposed to tackle sequence tagging problems like named entity recognition. Neural nets enable effective representation learning and have full access to the contextual cues needed for NER. A variety of NN-based models for the sequence tagging task have succeeded one another over the years and are considered the state of the art: LSTM, bidirectional LSTM (BI-LSTM), LSTM networks with a CRF layer (LSTM-CRF) and bidirectional LSTM networks with a CRF layer (BI-LSTM-CRF). Check out this article for an introduction to these different architectures or this one for the theoretical background. On the LSTM architecture, read this blog post. Chapter 9 of the SLP "bible" is another essential reading.
In an NLP context, understanding a sentence requires processing the data in sequence, interpreting each word in the context of the words that came before it. RNNs support the processing of sequential data through the addition of a loop, which allows the network to step through sequential input data while persisting the state of the nodes in the hidden layer between steps.
RNNs maintain a memory of the history using a hidden layer or specially designed cells, which enables the model to predict the current output conditioned on long-distance features.
For a NER task, the network learns to output the most probable sequence of NER tags. Its input layer represents features (one-hot encoding for word features, or dense vector features) and has the same dimensionality as the feature size. Its output layer represents a probability distribution over named entity category labels and has the same dimensionality as the number of NE categories.
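A minimal sketch of such an architecture in PyTorch; dimensions and the tag set are illustrative, and a production model would add pre-trained embeddings and typically a CRF output layer:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Embedding -> bidirectional LSTM -> per-token tag scores."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq_len, 2 * hidden_dim)
        return self.fc(h)           # (batch, seq_len, num_tags)

# Toy usage: 5 tags (O, B-PER, I-PER, B-LOC, I-LOC), one 7-token sentence.
model = BiLSTMTagger(vocab_size=1000, embed_dim=64, hidden_dim=32, num_tags=5)
token_ids = torch.randint(0, 1000, (1, 7))
scores = model(token_ids)
print(scores.argmax(dim=-1))  # most probable tag index for each token
```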
Sequence tagging with a LSTM using Python
impresso tutorial: neural NER with Spacy and Flair
Pro/Cons:
- Character and word embeddings trained on large text corpora contain a lot of morphological and lexical information
- Explicit feature engineering can be reduced to a minimum
- Relatively small amounts of task-specific annotation data give good performance
Big transformer-based language models like BERT (Devlin et al., 2019) have become increasingly popular in NLP due to their high performance. They are based on the principle of transfer learning: in the self-supervised pre-training phase they learn general language properties from large amounts of text, which can then be applied to specific downstream tasks through fine-tuning.
Pre-training a big transformer model is rather resource intensive: it requires many GB of text data and GPU-accelerated machines. Fortunately, a lot of pre-trained models of different flavors can be freely downloaded from the Hugging Face repository and fine-tuned for your particular needs. Fine-tuning does not require as many computational resources as pre-training, but it does require an annotated dataset for the task at hand.
In order to fine-tune a language model for NER, an annotated NER dataset is required. Open source datasets are available for most major languages, but the quality does vary. Annotating a new dataset or supplementing an existing one might be necessary to have full control over the types of entities that are recognized. As an example, in a library context it might be of particular interest to have a model that can accurately recognize works of art and publishers. In order to do this, additional material might have to be annotated and added to the model fine-tuning step.
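A minimal sketch of applying an already fine-tuned model with the Hugging Face pipeline API (dslim/bert-base-NER is one publicly available English checkpoint; substitute your own fine-tuned model as needed):

```python
from transformers import pipeline

# Load a fine-tuned token-classification model from the Hugging Face hub.
ner = pipeline("ner", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = "Bruce Ismay, president of the White Star Line, arrived in New York."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```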
The AllenNLP platform makes available pretrained models like the Fine-Grained Named Entity Recognition tagger: "This model identifies a broad range of 16 semantic types in the input text. It is a reimplementation of Lample (2016) and uses a bi-LSTM with a CRF layer, character embeddings and ELMo embeddings." The Model Usage tab of the AllenNLP NER demo page shows how to proceed. See also this tutorial.
A basic Python script (AllenNLP folder) applies the model to a sample of the New York Herald:
>pip3 install allennlp==2.1.0 allennlp-models==2.1.0
>python3 AllenNLP-NER.py
This resource shows how one can fine-tune the BERT model to perform named entity recognition.
See also how a language model for Swedish is first built and then used for classical NLP tasks like POS tagging and named entity recognition.
A Hugging Face NER demo for Swedish is available.
Pro/Cons:
- The best performance of the approaches presented here
- Training a transformer model is computationally intensive
- Named Entity Recognition and Classification on Historical Documents: A Survey, Maud Ehrmann, Ahmed Hamdi, Elvys Linhares Pontes, Matteo Romanello, Antoine Doucet
- impresso project:
- blog post: Named entity processing in a nutshell
- (extensive) NER tutorial: Named Entity Processing for Digital Humanities tutorial, DH2019 conference
- Resources on NER:
- A series of blog posts on practical NER
- Resources for named entities (datasets and more) and the related paper
- Datasets
- CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers (or the CEUR extended version), an evaluation campaign for English, German and French. Noisy OCR has a strong impact on performance.
- NLP-progress: a repository to track the progress in NER
- BIO tagging scheme for labelling text
- Conditional Random Fields:
- Introduction to CRF: EN; FR
- Stanford NER: Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
- CRF-NER with Python
- word2vec:
- Efficient Estimation of Word Representations in Vector Space by Mikolov et al.
- Sequence Models in Machine Learning Course, Andrew Ng, on Coursera
- Neural Networks:
- Explaining Recurrent Neural Networks
- Speech and Language Processing, Dan Jurafsky and James H. Martin, Chapter 9
- NER for information retrieval on OCRed content:
- NER for automatic metadata generation:
- Anonymization/pseudonymisation: