An Intro to Named Entity Recognition Using Hidden Markov Models

A few weeks ago, I was asked to create an Named Entity Recognition (NER) model as part of a take home assesment. Though I haven't gotten the job (yet), I really enjoyed working on the problem. And I'd love to share my work with you.

*Please note that the company has graciously assured me that the work I did was my own. I used an open source dataset and open source libraries, and I am not disclosing any confidential information.

Let's break this post down into 6 parts:

  1. The Problem
  2. The Data
  3. Markov Processes
  4. The Hidden Markov Model (HMM)
  5. Feature Engineering
  6. Model Selection
  7. Analysis & Conclusions

The Problem

Named Entity Recognition is a particularly interesting NLP task. A subtask of information extraction, it comprises identifying named entities - objects that may have names, such as people, places, organizations - in text documents.

There a number of ways to define and tag entities. For this post, we'll use the definitions and entity tags outlined by the Gronigen Meaning Bank (GMB). They are:

  • Person (PER)
  • Location (GEO)
  • Organization (ORG)
  • Geo-political Entity (GPE)
  • Artifact (ART)
  • Event (EVE)
  • Natural Object (NAT)
  • Time (TIM)

Note that all non-entity words/tokens will be tagged with 'O', for other.

Consider the following sentence:

On Sunday the United Nations condemned Kim Jong-Un and North Korea for continued Nuclear weapons testing.

There are 4 different entities mentioned within it - a day/time, an organization, a specific person, and a nation. Here's what the tagged version of the sentence would look like:

On SundayTim the United NationsOrg condemned Kim Jong-UnPer and North KoreaNat for continued Nuclear weapons testing.

Or, in a tabular representation:

Index Word Tag
0 On O
1 Sunday TIM
2 the O
3 United B-ORG
4 Nations I-ORG
5 condemned O
... ... ...

The Data

We'll use a corpus of documents, also drawn from the GMB, to train and evaluate our model. Specifically, we'll use a small sample of about 6,000 unique sentences.

Here's a link to the dataset.

As with many NLP tasks, we will be working with sequence data. We consider sentences to be a sequence in which each word's meaning is dependent on both the other words in the sentence and the order in which they appear.

In this post, we'll assume sentences to be a special type of sequence - one that results from a type of Markov Process called a Markov Chain.

Markov Process

What is a Markov Process? It's a process that generates a sequence in which each element in the sequence is the result of a conditional probability distribution, conditioned on the state of the sequence at that position.

Here's a more formal description:

  • Consider a sequence of elements $y$
  • The element (AKA emission) $y_i$ at the $i$th position is a random variable that takes the conditional distribution parameterized on state $x_i$ (AKA emission probability): $$y_i \sim P( y \mid x_i )$$
  • The state $x_i$ at the $i$th position is a random variable takes a conditional distribution parameterized on the previous state: $$x_i \sim P( x \mid x_{i-1} )$$

A Markov chain is simply a Markov Process in which the state space is discrete.

We'll consider Sentences to be the result of a Markov Chain in which the states are entity tags. Let's modify the description above to demonstrate:

  • Consider a sequence of words $w$
  • The word $w_i$ at the $i$th position is a random variable that takes the conditional distribution parameterized on the entity tag $x_i$: $$w_i \sim P( w \mid x_i )$$
  • The entity tag $x_i$ at the $i$th position is a random variable that takes a conditional distribution parameterized on the previous tag: $$x_i \sim P( x \mid x_{i-1} )$$

Note that at each position $i$, both the word $w_i$ and the future tag $x_{i+1}$ depend only on the current tag $x_i$. Thus, at any position, future predictions are independent of the past history of the sequence. This is an important assumption for a Markov Process.

One can make the case that this is assumption is flawed in the context of language, and claim that semantic meaning cannot be accurately modeled by a Markov Process. However, that discussion is outside the context of this post. So we'll just roll with it :)

This assumption enables us to use a Hidden Markov Model (HMM) to solve the problem. HMMs are commonly used in sequential text classification tasks, such as POS tagging.

You can learn more about Markov Processes here.

The Hidden Markov Model (HMM)

How do we model a Markov Chain? Just as we described above! There's just one consideration we have to take into account: the observability of the state.

In a standard Markov Chain, the state is observable. That is to say that, at a position $i$, we know the state $x_i$. For example: a model that predicts the temperature on a given day conditional on the state of cloud cover (sunny, partially cloudy, cloudy, etc), the state (cloud cover) is clearly observable.

In our problem, the state (entity tag) is only partially observable. That is to say that, for a word at position $i$, we cannot directly observe the tag state $x_i$. What we do know: the word at position $i$. Thus, we're interested in building a model that will estimate tag $x_i$ by finding the tag with the highest probability of emitting word $w_i$, given $w_i$ and $x_{i-1}$. In other words: $MAX( P(x_i | w_i, x_{i-1}) )$.

The model does so by learning and estimating the conditional probability distributions $w_i \sim P( w \mid x_i )$ and $x_i \sim P( x \mid x_{i-1} )$.

How? There are a few methods. We'll be using an open-source implementation of an HMM called seqlearn, which uses the Viterbi algorithm.

Feature Engineering

In our discussion thus far, we've used only 2 features: previous state $x_{i-1}$ and word $w_i$. However, we aren't restricted to these alone. We can add additional information to help improve our probability distribution estimations, and ultimately our hidden state predictions.

We can add additional features about $w_i$ - for example, capitalization, wether it includes a digit, its position in the sentence, etc. We can also extend our previous state history from a single word ($w_{i-1}$), or a unigram, to multiple words, or n-grams. For example: we could use bigrams and seek to find $P(x_i | w_i, x_{i-1}, x_{i-2})$.

For this post, we'll fit and compare HMMs using a set of unigram models with the following features:

  • POS: Part-of-speech tag.
  • capitalized: Wether or not a word is capitalized.
  • position: Index of word position in a sentence.

We'll also drop the prefixes from Tag (e.g. transform B-geo to geo) for training the model.

This is to avoid potential errors - for example, classifying a sequence of tags as ['O', 'O', 'I-nat']. In other words - we don't need transmission probabilities for B- to B- tags, from O to I- tags, or from B-<entity_a> to I-<entity_b>.

This will also reduce the number of class labels we'll have to predict, which will help address class imbalance (though not much).

We should note that we might lose some information. Specifically, we would miss the cases in which we transition from B-<entity_a> to B-<entity_b, and from I-<entity_a> to B-<entity_b>. However, for now we'll assume these cases are rare enough that model performance will be largely unaffected.

Model Selection

We'll use K-fold cross validation with f1, precision and recall scores to assess model performance. You can see the code I used for training and fitting the models here. For brevity, I'll assume you're familiar with these accuracy metrics and won't go into the details.

Specifically, we'll examine 3 versions of each metric:

  • Weighted: Average for each Tag, weighted by support
  • Macro: Average for each Tag, unweighted by support
  • Micro: Average for all tags, unweighted

With that in mind, the scores for each of the 3 model versions we trained are:



Weighted:	0.912
Micro:		0.924
Macro:		0.450

Weighted:	0.916
Micro:		0.924
Macro:		0.607

Weighted:	0.924
Micro:		0.924
Macro:		0.389



Weighted:	0.918
Micro:		0.927
Macro:		0.441

Weighted:	0.925
Micro:		0.927
Macro:		0.552

Weighted:	0.927
Micro:		0.927
Macro:		0.398



Weighted:	0.867
Micro:		0.895
Macro:		0.320

Weighted:	0.883
Micro:		0.895
Macro:		0.541

Weighted:	0.895
Micro:		0.895
Macro:		0.267


We can see that a basic unigram model using 2 features (Word and POS) performs best accross the board, and would be the model we'd select for use or presentation.

However there is reason for concern: while the weighted and micro f1 scores seem strong at 0.912 and 0.914, respectively, the macro score is a concerningly low 0.441. Why is this? Let's take a look at the average f1 score, broken down by Tag:



O:	    0.967
per:	0.630
geo:	0.476
org:	0.702
gpe:	0.556
tim:	0.591
art:	0.073
nat:	0.020
eve:	0.036


With this view, we can see that the model performance is high for the majority non-entity tag O, and extremely varied for the other tags. We'll examine the reasons behind this in the next section.

Analysis & Potential Next Steps

To recap - our baseline, 2 feature unigram model clearly performs the best, and is the final model we should select for the task.

However, its performance is lacking for most non-majority classes. Though its weighted and micro f1, precision and recall scores are relatively high, they're inflated by the model's performance on the majority class O.

The macro scores and class-specific scores show reason for concern. Accuracy metrics for many classes are less than impressive. This is because the extreme prevalance of the O class is affecting the model's transition probabilities.

Why is this? It's likely the result of our relatively small sample size and class imbalance. Breaking down our dataset by Tag, we can see that there are relatively few observations for many of the entity tags:

Tag   Count  %_Total
nat     29   0.0438
eve     82   0.1239
art     87   0.1315
gpe   1264   1.9105
tim   1494   2.2581
org   2163   3.2693
per   2341   3.5383
geo   2484   3.7545
O    56217  84.9700

Since the HMM is a probabalistic model that relies on a set of estimated conditional probability distributions, training it on a dataset this imbalanced skews the emission probabilities for every state towards the O class.

For next steps, there are two approaches we could use to address the issue of class imbalance:

  • Use an ensemble of binary classifiers (one for each tag, either HMM or Random Forest) trained on the entire dataset.
  • Use an ensemble of binary classifiers (one for each tag) trained on subsets in which each minority class is balanced with the majority class.
    • This has been shown to improve performance of HMMs in the presence of imbalanced classes, as outlined here.

Closing Remarks

That's all! Thanks for reading. If you'd like to learn more, please reference the linked materials above, or feel free to reach out on LinkedIn.

Written by