How to Serve a Web App With NGINX, uWSGI and Flask. And Why You'd Want to.

So, you want to serve your Python app on the web? And you’re not sure how to? You know, you remind me a lot of myself. Seems like just yesterday that I was in your shoes.

And that’s because I was. Yesterday I had to patch and re-deploy a Flask app I’d thrown together a few years back. Patching was fine, but when it came around to re-deploying, I realized that I had no idea what I was doing - I’d originally deployed it by hastily copying and pasting code snippets from tutorials and Stack Overflow. But I hadn’t actually learned anything in the process.

So I went and did some research. My goal was to get a high level understanding of each component and its purpose. I’m glad I did - having a basic understanding of the network architecture behind my app made deploying it much, much easier. And I’d love to share what I learned with you.

In this post, we’ll answer the following questions:

  1. What are NGINX, uWSGI and Flask?
  2. How do they interact with each other? Why would I want to use all 3?
  3. How do I serve an app with them?

What are NGINX, uWSGI and Flask?

  • NGINX: A server application.
  • uWSGI: A server application that supports the WSGI calling convention.
    • WSGI: A convention for calling Python applications. Think of it as the language Python applications speak when communicating with the internet.
    • uwsgi: A protocol for transmitting WSGI calls.
  • Flask: A micro-framework for Python applications.
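
To make the WSGI calling convention concrete, here's a minimal sketch of a bare WSGI application - the interface that uWSGI ultimately invokes. (Flask app objects implement this same interface under the hood; the function name and response text here are just illustrative.)

```python
def application(environ, start_response):
    """A minimal WSGI application.

    `environ` is a dict of request information (path, headers, etc.),
    and `start_response` is a callable used to begin the HTTP response.
    """
    status = '200 OK'
    headers = [('Content-Type', 'text/plain')]
    start_response(status, headers)
    # WSGI applications return an iterable of bytes
    return [b'Hello from WSGI!']
```

uWSGI imports a callable like this and invokes it once per request; a Flask app object is itself a WSGI callable of exactly this shape.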

How do they interact with each other? Why would I want to use them?

Let’s say you were to click on this link for my app, AmazonFeatures. The browser on your local computer would open a new tab and send a request to the remote machine where the app is hosted. The request would be routed to Flask via NGINX and uWSGI, and the app’s response would be returned along the same path in reverse. Here’s a diagram of that process:

app_architecture

So, why set it up this way? To understand, let’s walk through the details:

  • Your browser (the client) requests the app’s landing page.
  • This request is sent via https to the app’s host.
  • NGINX receives and decrypts the request.
  • NGINX forwards the request parameters via uwsgi to uWSGI.
  • uWSGI receives the request.
  • uWSGI extracts the WSGI parameters from the uwsgi request and calls the Flask app.
  • Flask executes with the parameters and produces a response.
  • The response is sent back to your browser through these steps in reverse.

You’re probably wondering why we’re using 2 server applications. So, why not use NGINX alone? We can’t. We need uWSGI because NGINX doesn’t support WSGI - thus, NGINX can’t send or receive information from Flask.

Why not use uWSGI alone? Convenience. NGINX is relatively easy to configure for SSL. It can also serve standard HTML and JS apps, and is easy to set up as a reverse proxy (a server that forwards traffic to other servers). For example, this site runs on the Node-based Ghost framework and is served via NGINX. Since I already knew how to handle SSL with NGINX, it was easier for me to do so and forward traffic to uWSGI than to learn how to handle SSL with uWSGI itself.
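
As a rough sketch, the NGINX side of that setup might look like the server block below. The domain and socket path are placeholders, and your certbot-managed SSL certificate directives would also live in this block:

```nginx
server {
    listen 443 ssl;
    server_name example.com;  # placeholder domain

    location / {
        # forward requests to uWSGI over the uwsgi protocol
        include uwsgi_params;
        uwsgi_pass unix:/path/to/app.sock;  # placeholder socket path
    }
}
```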

How do I serve an app with these components?

I won’t go into the configuration details here. Digital Ocean has an excellent set of posts for this purpose, which you can read here.

However, I will cover the high-level steps:

  1. Get a remote host for your app (I use an AWS EC2 instance).
    • For security, it’s best to disable HTTP/HTTPS traffic for now.
  2. Ensure you have SSH access to your remote host.
  3. Install your app files.
  4. Ensure your app runs on the server.
    • Run your app bound to localhost only - I like to use localhost:8000
    • Check that your app is running at the host address with an SSH tunnel
  5. Install and configure NGINX and uWSGI.
    • Check that your setup works by navigating to your host’s IP address in your browser. You should see your app’s landing page.
  6. Change the A-record for your app’s domain. Wait for the change to propagate.
  7. Install certbot. Obtain certificates for your app domain.
    • Enable HTTP/HTTPS traffic if you disabled it earlier.
  8. Navigate to the domain in a browser. Make sure the app runs properly.
  9. Pat yourself on the back. Go get celebratory Nachos or something.
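
For step 5, a minimal uWSGI configuration might look something like this (the module name, socket path, and process count are placeholders to adapt to your app):

```ini
[uwsgi]
; the importable module and the WSGI callable within it
module = app:app

master = true
processes = 4

; socket shared with NGINX; permissions let NGINX read/write it
socket = /path/to/app.sock
chmod-socket = 660
vacuum = true

die-on-term = true
```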

Closing Remarks

That's it! Before I end, I just want to encourage you to not be discouraged when something doesn't work. It's frustrating, but if you stick with it I'm sure you'll either solve it or come up with a workable solution. For example - I could never get uwsgi to run as a service as outlined in this Digital Ocean post. Instead, I run the app as a background process using tmux. It's hacky, but it works.

As always, I hope you learned something useful. Thanks for reading!

Time Series Forecasting With Autoregression

Forecasting (predicting future values) with time series can be tricky. This is because time series data may exhibit behavior that violates the assumptions of many modeling methods. Because of this, there are a few special considerations you need to make when working with time series data. This post will serve as an introduction to and reference for some of the behaviors you should look for when modeling time series data with autoregression.

Why We Need To Be Careful

First, we should note that time series behavior is of particular concern for parametric models (models for which we make an assumption about the functional form of the process that generates the series). For non-parametric models (models where we don't make assumptions about the form of our series' generative process, such as neural networks and tree based methods), we may not need to worry about time's effect on our model parameters.

We'll be talking about considerations you need to make when using a parametric model to forecast a response variable $Y$ with time series. This is important because the behavior of the time series will determine which functional form (AKA characteristic equation) we choose to model $Y$ with.

Let's assume we're going to use an autoregressive forecast model, fitted using Ordinary Least Squares (OLS). A key assumption of OLS is that the observations in our dataset are independent. This means that the characteristic equation of $Y$ is based on a random variable that follows a probability distribution whose moments (mean, variance, skew, kurtosis, etc.) are constant. In other words: none of the parameters of the underlying distribution of $Y$ are a function of time. A time series that meets these criteria is said to be stationary.

Formally, this means that the characteristic equation we've chosen does not have a unit root. If the characteristic equation of $Y$ has a unit root, then at least one of the moments of its underlying probability distribution is a function of time. Thus, our key assumptions of regression are violated and our regression model will not accurately model the data. A time series whose characteristic equation has a unit root is non-stationary.

Using linear models with stationary time series, though imperfect, works reasonably well. We can transform a non-stationary time series into a stationary one by identifying its problematic behavior and applying the appropriate transformation(s).

Stationary Time Series

Let's examine the characteristics of a stationary time series. Here's an example below that I generated using random data:

stationary_example-1

By itself, it doesn't look like much. There are no patterns in the data over time - and that's exactly what we want. Patterns over time generally indicate a dependence on it. Let's add a rolling mean and a rolling standard deviation to see what the absence of time dependent characteristics looks like:

stationary_example_rolling
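
The statistics behind a plot like this can be sketched with pandas (the simulated series and the 12-observation window here are my own arbitrary choices; any reasonable window shows the same flat behavior for stationary data):

```python
import numpy as np
import pandas as pd

# simulate a stationary series: i.i.d. draws from a fixed normal distribution
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(loc=0.0, scale=1.0, size=200))

# rolling statistics over a 12-observation window
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()
```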

Note how the rolling mean and rolling standard deviation are relatively constant. This indicates that $Y$ does not depend on a function of time. We can show this more clearly by looking at both the residuals of the data (each value $y_i$ minus the sample mean $\bar{y}$) and a best fit line fit on the residuals (the line that minimizes the RSS):

stationary_example_residuals

Our best fit line has a near-zero slope, and the residuals appear to be independently distributed and centered around 0 with constant variance. Note that this step - in which we examine how $Y$ varies around its mean - is also referred to as its first difference.

Dickey-Fuller Test

How can we determine if the characteristic equation of $Y$ has a unit root? With a unit root test. There are a multitude of unit root tests (Dickey-Fuller, Phillips-Perron, KPSS, etc.), but the specific test you should use depends on your null hypothesis about the underlying process of $Y$. In this post we'll examine the Augmented Dickey-Fuller test, which is appropriate when the characteristic equation we've chosen is an autoregressive process.

Let's take a moment to recall what an autoregressive process is. It's a way to represent sequential data in which each response variable $y_t$ depends on one or more of its previous, or "lagged," values $y_{t-1}, y_{t-2},...,y_{t-n}$ and a stochastic (random) term. For example, a simple autoregressive model is represented by:

$$y_t = \rho y_{t-1} + \mu_t$$

Where $\rho$ is a coefficient representing the effect of the lagged value $y_{t-1}$ on $y_t$, and $\mu_t$ is an error term. In this case, a unit root is present if $\rho \geq 1$: this means that $Y$ may grow or shrink unbounded. If $\rho < 1$, then $Y$ will oscillate around the mean of its probability distribution, which satisfies the assumptions we've made about $Y$.
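
We can see the difference a unit root makes by simulating the AR(1) equation above for $\rho < 1$ and $\rho = 1$ (the coefficients, sample size, and seed here are arbitrary choices for illustration):

```python
import numpy as np

def simulate_ar1(rho, n=500, seed=0):
    """Simulate y_t = rho * y_{t-1} + mu_t with standard normal noise."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = rho * y[t - 1] + noise[t]
    return y

stationary = simulate_ar1(rho=0.5)   # oscillates around 0 with constant variance
unit_root = simulate_ar1(rho=1.0)    # a random walk; variance grows with t
```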

We'll use a slightly more complicated characteristic equation:

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t-p+1} + \varepsilon_t,$$

In which $\alpha$ is a constant, $\beta$ is the coefficient on a time trend, $\gamma$ is analogous to $\rho$, and $p$ is the number of lag terms to include. This form allows us to model problematic behavior such as drift and seasonality. One thing to note: when performing a unit root test, our null hypothesis is that a unit root is present, while our alternative can vary. Generally, it's that the series is either stationary or trend-stationary (a unit root is not present but a trend is).

Test

Based on the data above, we should use the Augmented Dickey Fuller test with an alternative hypothesis that the series is stationary with no trend. Using statsmodels.tsa.stattools.adfuller to test this, we obtain the following p-value:

from statsmodels.tsa.stattools import adfuller

results = adfuller(df.data, regression='c', autolag='AIC')
print(f"P-value: {results[1]}")

>>> P-value: 0.007923477202111397

Since the p-value is well below .05, we reject the null hypothesis that a unit root is present. We can safely assume this series to be stationary, and we should feel comfortable using a trendless autoregression model to forecast it.

Non-Stationary Time Series

Let's take a look at a non-stationary time series and examine how its behavior violates the assumptions of a linear model. The plot below shows the number of homes sold in Seattle each month from June 2008 through May 2016.

home_sales

Off the bat, we can tell there are some patterns. Here's the same plot, with a 6 month rolling mean and standard deviation:

home_sales_rolling_avg

There's clearly an upward trend in both mean and standard deviation. Taking a look at the first difference confirms this:

home_sales_residuals

However, there also appears to be a cyclical effect that repeats itself every 12 months. This type of pattern is referred to as seasonality, and it's a common behavior of time series. To confirm its presence, we can take a seasonal difference of the data based on the cycle period length (find each value of $y_i - y_{i-12}$):

home_sales_seasonal_difference

Note how the best fit line is now horizontal - this indicates that we've removed the seasonal trend from the data, and may have succeeded in stationarizing it. Its non-zero location indicates that there's a constant upward trend, just as the plot of the first difference showed.

We can test for a unit root in the presence of a constant upward trend with a variant of the augmented Dickey Fuller test:

# get seasonal difference, dropping the first 12 undefined values
seasonal_diff = df['Homes Sold'] - df['Homes Sold'].shift(12)
seasonal_diff = seasonal_diff.dropna()

results = adfuller(seasonal_diff, regression='c', autolag='AIC')
print(f"P-value: {round(results[1], 4)}")

>>> P-value: 0.0694

We can't reject our null at a significance level of .05, which indicates that there may be some behavior we still haven't accounted for. However, for the purpose of this blog post we'll simply cheat and use a level of .1 :)

Forecast

Assuming we've stationarized the data, the plots above tell us that we should use a seasonal AR model with a constant upward trend, a lag of 12, and a seasonal period of 12 (for an annual cycle). We'll use a basic train-test split to assess our model performance, using the first 6 years to train the forecast model and the last 2 to test our forecast.

Below we set up and train the model, forecast values over the 2 year period from 2014-06-30 to 2016-05-31, and plot the original data along with the model's predictions over its training period and the forecasted values.

import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

# get train, test split
train = df['Homes Sold'].iloc[:72]
test = df['Homes Sold'].iloc[72:]

# fit model, get results
# seasonal_order is (P, D, Q, s): seasonal AR and MA terms with an annual period
model = SARIMAX(
    train,
    order=(12, 0, 0),
    seasonal_order=(1, 0, 1, 12),
    trend='c'
)
sarimax_fit = model.fit()

# forecast over the 2 year test period
forecast = sarimax_fit.forecast(steps=len(test))

# setup figure, axis
fig = plt.figure(figsize=(12, 4))
ax = plt.subplot(111)

# plot the original data and the in-sample predictions
data = ax.plot(df['Homes Sold'], alpha=0.67)[0]
y_pred = ax.plot(train.index, sarimax_fit.fittedvalues, alpha=0.67)[0]

# plot the out-of-sample forecast
y_pred_forecast = ax.plot(forecast)[0]

# set title, labels
ax.set_title('Seattle Monthly Home Sales SARIMAX Forecast')
ax.set_ylabel('Homes')

ax.legend([data, y_pred, y_pred_forecast], ['Homes Sold', 'In-Sample Prediction', 'Forecast'], loc=0)

plt.tight_layout()

home_sales_sarimax-1

Results

Our model appears to follow the data reasonably well, with some caveats. Below is a plot of the model's in-sample residuals (the errors from predictions made over the training period), shown in blue along with their best fit line, and the residuals of the forecasted values, shown in green with a best fit line of their own.

home_sales_sarimax_residuals

The best fit line on the training values is near zero, which implies that the model has accounted for the data's non-stationary behavior. However, note that the best fit line of the forecast residuals has a noticeable negative slope: this means that the errors of the model are growing in a negative direction over time, which signals non-stationarity. Should we be concerned about this?

Yes and no. The reason is that the amplitude (the spread between max and min) of the annual cycles in the test data (2015 and 2016) is greater than that of the cycles in our training data. We can see this change in volatility by calculating the coefficient of variation (standard deviation over mean) for sales in each year:

Year Coefficient of Variation
2009 0.26
2010 0.22
2011 0.17
2012 0.16
2013 0.19
2014 0.18
2015 0.25
2016 0.28

The average from 2009 through 2014 is 0.197, and from 2015 through 2016 it's 0.265. Unfortunately, since there's no increasing trend in volatility in the training data, our model has not captured this behavior.
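
The per-year coefficients above can be computed with a groupby. Here's a sketch, assuming a DataFrame with a DatetimeIndex and a 'Homes Sold' column as in the code above (the toy data is just for illustration):

```python
import numpy as np
import pandas as pd

def coefficient_of_variation(df):
    """Per-year coefficient of variation: std / mean of monthly sales."""
    by_year = df['Homes Sold'].groupby(df.index.year)
    return by_year.std() / by_year.mean()

# toy illustration: two years of fake monthly data
idx = pd.date_range('2015-01-01', periods=24, freq='MS')
toy = pd.DataFrame({'Homes Sold': np.tile([100, 150, 200], 8)}, index=idx)
cv = coefficient_of_variation(toy)
```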

However, this doesn't mean our forecasts are bad. Here's a plot of the residuals of our SARIMAX forecast compared to those of a simple linear regression model and a standard ARIMA model:

home_sales_model_comparison_residuals

The Root Mean Squared Error (RMSE) of the models:

Model RMSE
SLR 1104.026
ARIMA 1022.765
SARIMAX 867.245

You can see that our SARIMAX model performs significantly better than the others. If you look closely at the graph of the residuals above, you'll see that the residuals from the ARIMA and SARIMAX models start relatively close and begin to diverge. This is because the ARIMA model does not take drift into account - it has no constant time trend. If we were to extend our forecast further out, the difference in RMSE between the ARIMA and SARIMAX models would likely keep increasing.

Closing Remarks

I hope this post was helpful for you. However, if something about my explanation didn't click, here's a great intro post on the subject. And if you have any questions, concerns or corrections for me, please feel free to reach out to me on LinkedIn. Thanks for reading!

An Intro to Named Entity Recognition Using Hidden Markov Models

A few weeks ago, I was asked to create a Named Entity Recognition (NER) model as part of a take-home assessment. Though I haven't gotten the job (yet), I really enjoyed working on the problem. And I'd love to share my work with you.

*Please note that the company has graciously assured me that the work I did was my own. I used an open source dataset and open source libraries, and I am not disclosing any confidential information.

Let's break this post down into 7 parts:

  1. The Problem
  2. The Data
  3. Markov Processes
  4. The Hidden Markov Model (HMM)
  5. Feature Engineering
  6. Model Selection
  7. Analysis & Conclusions

The Problem

Named Entity Recognition is a particularly interesting NLP task. A subtask of information extraction, it involves identifying named entities - objects that may have names, such as people, places, and organizations - in text documents.

There are a number of ways to define and tag entities. For this post, we'll use the definitions and entity tags outlined by the Groningen Meaning Bank (GMB). They are:

  • Person (PER)
  • Location (GEO)
  • Organization (ORG)
  • Geo-political Entity (GPE)
  • Artifact (ART)
  • Event (EVE)
  • Natural Object (NAT)
  • Time (TIM)

Note that all non-entity words/tokens will be tagged with 'O', for other.

Consider the following sentence:

On Sunday the United Nations condemned Kim Jong-Un and North Korea for continued Nuclear weapons testing.

There are 4 different entities mentioned within it - a day/time, an organization, a specific person, and a nation. Here's what the tagged version of the sentence would look like:

On [Sunday]TIM the [United Nations]ORG condemned [Kim Jong-Un]PER and [North Korea]GPE for continued Nuclear weapons testing.

Or, in a tabular representation:

Index Word Tag
0 On O
1 Sunday B-TIM
2 the O
3 United B-ORG
4 Nations I-ORG
5 condemned O
... ... ...

The Data

We'll use a corpus of documents, also drawn from the GMB, to train and evaluate our model. Specifically, we'll use a small sample of about 6,000 unique sentences.

Here's a link to the dataset.

As with many NLP tasks, we will be working with sequence data. We consider sentences to be a sequence in which each word's meaning is dependent on both the other words in the sentence and the order in which they appear.

In this post, we'll assume sentences to be a special type of sequence - one that results from a type of Markov Process called a Markov Chain.

Markov Process

What is a Markov Process? It's a process that generates a sequence in which each element in the sequence is the result of a conditional probability distribution, conditioned on the state of the sequence at that position.

Here's a more formal description:

  • Consider a sequence of elements $y$
  • The element (AKA emission) $y_i$ at the $i$th position is a random variable that takes the conditional distribution parameterized on state $x_i$ (AKA emission probability): $$y_i \sim P( y \mid x_i )$$
  • The state $x_i$ at the $i$th position is a random variable that takes a conditional distribution parameterized on the previous state: $$x_i \sim P( x \mid x_{i-1} )$$

A Markov chain is simply a Markov Process in which the state space is discrete.
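
Here's a small sketch of that generative process - a two-state chain with made-up transition and emission probabilities (all of the numbers here are arbitrary):

```python
import numpy as np

states = ['A', 'B']
# transition[i][j] = P(next state j | current state i)
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
# emission[i][k] = P(emitting symbol k | state i)
emissions = ['x', 'y']
emission = np.array([[0.7, 0.3],
                     [0.1, 0.9]])

def simulate(n, seed=0):
    """Generate a sequence of (state, emission) pairs from the chain."""
    rng = np.random.default_rng(seed)
    state = 0
    sequence = []
    for _ in range(n):
        # emit a symbol conditioned on the current state...
        symbol = rng.choice(len(emissions), p=emission[state])
        sequence.append((states[state], emissions[symbol]))
        # ...then move to the next state conditioned on the current one
        state = rng.choice(len(states), p=transition[state])
    return sequence

seq = simulate(10)
```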

We'll consider sentences to be the result of a Markov Chain in which the states are entity tags. Let's modify the description above to demonstrate:

  • Consider a sequence of words $w$
  • The word $w_i$ at the $i$th position is a random variable that takes the conditional distribution parameterized on the entity tag $x_i$: $$w_i \sim P( w \mid x_i )$$
  • The entity tag $x_i$ at the $i$th position is a random variable that takes a conditional distribution parameterized on the previous tag: $$x_i \sim P( x \mid x_{i-1} )$$

Note that at each position $i$, both the word $w_i$ and the future tag $x_{i+1}$ depend only on the current tag $x_i$. Thus, at any position, future predictions are independent of the past history of the sequence. This "memorylessness" is the defining assumption of a Markov Process.

One can make the case that this assumption is flawed in the context of language, and claim that semantic meaning cannot be accurately modeled by a Markov Process. However, that discussion is outside the scope of this post. So we'll just roll with it :)

This assumption enables us to use a Hidden Markov Model (HMM) to solve the problem. HMMs are commonly used in sequential text classification tasks, such as POS tagging.

You can learn more about Markov Processes here.

The Hidden Markov Model (HMM)

How do we model a Markov Chain? Just as we described above! There's just one consideration we have to take into account: the observability of the state.

In a standard Markov Chain, the state is observable. That is to say that, at position $i$, we know the state $x_i$. For example: in a model that predicts the temperature on a given day conditional on the state of cloud cover (sunny, partially cloudy, cloudy, etc.), the state (cloud cover) is clearly observable.

In our problem, the state (entity tag) is only partially observable. That is to say that, for a word at position $i$, we cannot directly observe the tag state $x_i$. What we do know is the word at position $i$. Thus, we're interested in building a model that estimates the tag $x_i$ by finding the tag with the highest probability of emitting word $w_i$, given $w_i$ and $x_{i-1}$. In other words, the tag that maximizes $P(x_i \mid w_i, x_{i-1})$.

The model does so by learning and estimating the conditional probability distributions $w_i \sim P( w \mid x_i )$ and $x_i \sim P( x \mid x_{i-1} )$.

How? There are a few methods. We'll be using an open-source implementation of an HMM called seqlearn, which uses the Viterbi algorithm.
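
As a sketch of what the Viterbi algorithm does: given estimated transition and emission probabilities, it finds the single most likely hidden state sequence for an observed sequence via dynamic programming. Here's a minimal toy implementation (my own illustration, not seqlearn's actual code):

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence via dynamic programming.

    start_p[s], trans_p[s][s'], and emit_p[s][obs] are probabilities.
    """
    # best[i][s] = log-prob of the best path ending in state s at step i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # best previous state to transition from, and its score
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1])
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # backtrack from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))
```

With toy probabilities where "john" is far more likely to be emitted by a per state, `viterbi(['john', 'ran'], ...)` recovers the tag sequence `['per', 'O']`.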

Feature Engineering

In our discussion thus far, we've used only 2 features: previous state $x_{i-1}$ and word $w_i$. However, we aren't restricted to these alone. We can add additional information to help improve our probability distribution estimations, and ultimately our hidden state predictions.

We can add additional features about $w_i$ - for example, capitalization, whether it includes a digit, its position in the sentence, etc. We can also extend our previous state history from a single word ($w_{i-1}$), or a unigram, to multiple words, or n-grams. For example, we could use bigrams and seek to find $P(x_i \mid w_i, x_{i-1}, x_{i-2})$.

For this post, we'll fit and compare HMMs using a set of unigram models with the following features:

  • POS: Part-of-speech tag.
    • capitalized: Whether or not a word is capitalized.
  • position: Index of word position in a sentence.
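
A sketch of how those per-word features might be extracted (the function and feature names here are my own illustration; the exact encoding depends on the library you feed them to):

```python
def word_features(sentence, i):
    """Extract illustrative features for the word at position i.

    `sentence` is a list of (word, pos_tag) pairs; the feature names
    are arbitrary and just need to be used consistently.
    """
    word, pos = sentence[i]
    return {
        'word': word.lower(),
        'POS': pos,
        'capitalized': word[:1].isupper(),
        'has_digit': any(ch.isdigit() for ch in word),
        'position': i,
    }
```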

We'll also drop the prefixes from Tag (e.g. transform B-geo to geo) for training the model.

This is to avoid potential errors - for example, classifying a sequence of tags as ['O', 'O', 'I-nat']. In other words, we don't need transition probabilities from B- to B- tags, from O to I- tags, or from B-<entity_a> to I-<entity_b>.

This will also reduce the number of class labels we'll have to predict, which will help address class imbalance (though not much).

We should note that we might lose some information. Specifically, we would miss the cases in which we transition from B-<entity_a> to B-<entity_b>, and from I-<entity_a> to B-<entity_b>. However, for now we'll assume these cases are rare enough that model performance will be largely unaffected.

Model Selection

We'll use K-fold cross validation with f1, precision and recall scores to assess model performance. You can see the code I used for training and fitting the models here. For brevity, I'll assume you're familiar with these accuracy metrics and won't go into the details.

Specifically, we'll examine 3 versions of each metric:

  • Weighted: Average for each Tag, weighted by support
  • Macro: Average for each Tag, unweighted by support
  • Micro: Average for all tags, unweighted

With that in mind, the scores for each of the 3 model versions we trained are:

==========================

UNIGRAM

F1 SCORES
Weighted:	0.912
Micro:		0.924
Macro:		0.450

PRECISION SCORES
Weighted:	0.916
Micro:		0.924
Macro:		0.607

RECALL SCORES
Weighted:	0.924
Micro:		0.924
Macro:		0.389

==========================

UNIGRAM_CAPITALIZED

F1 SCORES
Weighted:	0.918
Micro:		0.927
Macro:		0.441

PRECISION SCORES
Weighted:	0.925
Micro:		0.927
Macro:		0.552

RECALL SCORES
Weighted:	0.927
Micro:		0.927
Macro:		0.398

==========================

UNIGRAM_CAPITALIZED_POSITION

F1 SCORES
Weighted:	0.867
Micro:		0.895
Macro:		0.320

PRECISION SCORES
Weighted:	0.883
Micro:		0.895
Macro:		0.541

RECALL SCORES
Weighted:	0.895
Micro:		0.895
Macro:		0.267

==========================

We can see that a basic unigram model using 2 features (Word and POS) performs best across the board, and would be the model we'd select for use or presentation.

However, there is reason for concern: while the weighted and micro f1 scores seem strong at 0.912 and 0.924, respectively, the macro score is a concerningly low 0.450. Why is this? Let's take a look at the average f1 score, broken down by Tag:

==========================

UNIGRAM

F1 SCORES
O:	    0.967
per:	0.630
geo:	0.476
org:	0.702
gpe:	0.556
tim:	0.591
art:	0.073
nat:	0.020
eve:	0.036

==========================

With this view, we can see that the model performance is high for the majority non-entity tag O, and extremely varied for the other tags. We'll examine the reasons behind this in the next section.

Analysis & Potential Next Steps

To recap - our baseline, 2 feature unigram model clearly performs the best, and is the final model we should select for the task.

However, its performance is lacking for most non-majority classes. Though its weighted and micro f1, precision and recall scores are relatively high, they're inflated by the model's performance on the majority class O.

The macro scores and class-specific scores show reason for concern. Accuracy metrics for many classes are less than impressive. This is because the extreme prevalence of the O class is affecting the model's transition probabilities.

Why is this? It's likely the result of our relatively small sample size and class imbalance. Breaking down our dataset by Tag, we can see that there are relatively few observations for many of the entity tags:

Tag   Count  %_Total
nat     29   0.0438
eve     82   0.1239
art     87   0.1315
gpe   1264   1.9105
tim   1494   2.2581
org   2163   3.2693
per   2341   3.5383
geo   2484   3.7545
O    56217  84.9700

Since the HMM is a probabilistic model that relies on a set of estimated conditional probability distributions, training it on a dataset this imbalanced skews the emission probabilities for every state towards the O class.

For next steps, there are two approaches we could use to address the issue of class imbalance:

  • Use an ensemble of binary classifiers (one for each tag, either HMM or Random Forest) trained on the entire dataset.
  • Use an ensemble of binary classifiers (one for each tag) trained on subsets in which each minority class is balanced with the majority class.
    • This has been shown to improve performance of HMMs in the presence of imbalanced classes, as outlined here.

Closing Remarks

That's all! Thanks for reading. If you'd like to learn more, please reference the linked materials above, or feel free to reach out on LinkedIn.

Feature-Based Sentiment Analysis: An Introduction

The goal of this post is to provide a high level introduction to the core concepts of Sentiment Analysis. We'll define the Sentiment Analysis task, discuss the concepts of subjectivity and objectivity, and briefly discuss how Sentiment Analysis can be applied to extract specific feature-opinion pairs from text.

The Sentiment Analysis Task

What exactly is Sentiment Analysis? It's the classification and extraction of sentiment - opinions and their associated emotions - from text.

An opinion is a sentiment expressed about a specific entity, such as a product, person, organization, or location. The entity that expresses an opinion is called the opinion holder.

We use the term object to refer to the target entity of an opinion. An object may consist of a set of components and attributes, which we refer to as features. Opinions may also be expressed on features, and a feature may itself consist of sub-features. An opinion about the object itself is a general opinion. An opinion about a feature of an object is a specific opinion.

The sentiment of an opinion - whether the opinion is positive, negative, or neutral - has several names. It may be referred to as its orientation, polarity, or semantic orientation. Consider the following passage:

"The iPhone is great. However, the battery doesn't last very long."

It comprises 2 sentences, each of which expresses an opinion. The first is a positive general opinion about the iPhone object. The second is a negative specific opinion about the battery component feature.

Both of the features mentioned are explicit - they are explicitly referenced by name. However, features may also be implicit - not directly referenced, but implied. For example:

"The phone is a bit large."

This sentence contains a negative specific opinion about the implicit feature, "size".

Let's use these concepts to formally define the task:

An object $o$ comprises a set of features $F = \{f_1, f_2, \dots, f_n\}$. This includes a special feature that represents the object itself. Each feature $f_i$ may be explicitly represented by any term or phrase from the set $W_i = \{w_{i1}, w_{i2}, \dots, w_{im}\}$, or implicitly indicated by any term or phrase from the set $I_i = \{i_1, i_2, \dots, i_q\}$.

An opinionated text document $d$ comprises a set of sentences $S = \{s_1, s_2, \dots, s_m\}$. These sentences contain opinions expressed by a set of opinion holders $\{h_1, h_2, \dots, h_q\}$ on a set of objects $\{o_1, o_2, \dots, o_q\}$. For each object $o_j$, opinions are expressed on a subset of the object's features, $F_j$.

For each opinion in $d$, we seek to obtain a quintuple of information $(h_i, o_j, f_{jk}, oo_{jk}, t)$ where:

  • $h_i$ is the opinion holder.
  • $o_j$ is the target object of the opinion.
  • $f_{jk}$ is the target feature of the opinion.
  • $oo_{jk}$ is the orientation of the opinion.
  • $t$ is the time at which the opinion was expressed.

For each feature $f_{jk}$ we seek to identify all of its direct representations $W_{jk}$ and implicit references $I_{jk}$.

Note that this is a simplified version. It only covers direct opinions, and omits comparative and indirect opinions. However, for the purpose of this introductory post, it'll do.
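
In code, each extracted opinion could be carried around as a small record like this. The field names mirror the quintuple above; this is just an illustrative structure, not from any particular library:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Opinion:
    """The (h_i, o_j, f_jk, oo_jk, t) quintuple for one direct opinion."""
    holder: str       # h_i: who expressed the opinion
    obj: str          # o_j: the target object
    feature: str      # f_jk: the target feature of the object
    orientation: str  # oo_jk: 'positive', 'negative', or 'neutral'
    time: datetime    # t: when the opinion was expressed

op = Opinion('reviewer_1', 'iPhone', 'battery', 'negative',
             datetime(2012, 6, 1))
```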

Subjectivity and Objectivity

Not all of the text in an opinionated document contains opinions. Generally, we use sentences as the basic units of text, so part of the task is identifying a document's opinionated sentences.

There are two types of opinionated sentence:

  • Subjective: Expresses feelings or beliefs.
  • Objective: Expresses factual information.

Opinions expressed in subjective sentences are explicit. For example, in this subjective sentence:

"The UI was intuitive and easy to use."

The positive ("intuitive", "easy to use") specific opinion on the explicit feature "UI" is stated directly.

Opinions expressed in objective sentences are implicit. For example, in this objective sentence:

"I returned the phone after 2 days."

A negative general opinion of the phone is implied by the fact that the opinion holder returned it after such a short period of time.

This is an important distinction to keep in mind. Though objective sentences may be opinionated, nearly all unopinionated sentences are objective. And because sentiment is inherently subjective, we typically use subjective as a synonym for opinionated and objective as a synonym for unopinionated.

Approaches for Opinion Extraction

Identifying and extracting opinions is perhaps the most difficult sub-task of Sentiment Analysis. It involves modeling semantic meaning, which is a notoriously challenging problem in NLP. Advanced methods for opinion extraction range from using manually created POS patterns and opinion word lexicons in conjunction with Conditional Random Fields, to using Recurrent Neural Networks to perform unsupervised identification of expressive phrases. These approaches rely on advanced Machine-Learning and Deep Learning concepts, and are out of scope for this post. However, if you have the time and requisite knowledge, the above papers are worth a read.

A more basic approach is to simply disregard the implicit opinions in objective sentences and focus only on explicit subjective opinions, as outlined in this paper. This involves using a hand-generated set of POS tag patterns to identify explicit subjective opinions (such as noun-adjective pairs), then applying a series of ML techniques to the identified opinions to determine their orientation and target features. This is the approach we'll take to provide an example in the next post.
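As a toy sketch of this pattern-based idea: the snippet below scans a POS-tagged sentence for adjectives and assigns orientation from a tiny hand-made lexicon. The Penn Treebank tags, the lexicon entries, and the first-noun heuristic are all simplifications of my own, not the method from the paper:

```python
# Toy pattern-based opinion extraction over POS-tagged tokens.
# NN = noun, JJ = adjective (Penn Treebank tags); the lexicon is illustrative.
ORIENTATION = {"intuitive": "positive", "easy": "positive",
               "slow": "negative", "broken": "negative"}

def extract_opinions(tagged):
    """tagged: list of (word, pos) pairs for one sentence."""
    opinions = []
    nouns = [w for w, pos in tagged if pos.startswith("NN")]
    for word, pos in tagged:
        if pos == "JJ" and word in ORIENTATION:
            # Naively attach the adjective to the first noun in the sentence.
            feature = nouns[0] if nouns else None
            opinions.append((feature, word, ORIENTATION[word]))
    return opinions

sentence = [("The", "DT"), ("UI", "NN"), ("was", "VBD"),
            ("intuitive", "JJ"), ("and", "CC"), ("easy", "JJ"),
            ("to", "TO"), ("use", "VB"), (".", ".")]
print(extract_opinions(sentence))
# [('UI', 'intuitive', 'positive'), ('UI', 'easy', 'positive')]
```

In practice you'd get the tags from a real tagger (e.g. nltk's pos_tag) and use a far larger opinion lexicon, but the pattern-matching core looks much like this.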

Conclusions

If you'd like to learn more, here's an excellent paper by Bing Liu that goes deeper into the details of the problem definition.

In the next post, we'll highlight the concepts we've learned so far by performing opinion extraction using the approach mentioned above.

Tutorial: Create Your Own Package With Homebrew and Python

A few weeks ago, I wrote a script to manage AWS EC2 spot instances. It was the most complex Bash script I'd ever written. I was proud, and riding that wave of pride, I confidently decided to extend my script into a full-blown package I could share with the world.

Two weeks and an uncountable number of hours later, I've succeeded. But I now recognize that my confidence was borderline arrogance. It turns out that turning your project into a package isn't hard - you really just need to create an extra file or two. However, learning WHAT those files should contain, WHERE your project files will be installed to, and the WHY behind those two items is extremely difficult. I couldn't find a single resource that covered all of those items.

So I've created one. This tutorial is intended for people who've never created a package before, and are unfamiliar with how Linux handles terminal commands and executable files. It will walk you through HOW to create a package step by step, and WHY each step is taken. I hope this saves you time, hours of pain, and enables you to share something cool with the world. Here we go!

Contents

  • What is a package?
  • Overview of Homebrew
  • Basic Package Requirements
  • Package Creation Steps
  • Example
    • Example Package
    • Linux Commands
    • Formula Content
  • Closing Remarks

What is a package?

A package is simply a set of files that do something. It could be a full-blown application with its own modules and libraries, or simply a handful of scripts. Some of the most useful packages are command line utilities, such as wget and awscli.

Overview of Homebrew

Homebrew is a package manager for OS X. Its gimmick is that it 'brews' and installs packages for you. It's typically used to install command line utilities and other general packages. And it can be used for packages written in any language.

Let's walk through the steps Homebrew takes when you install a package. First, there are 4 terms you should know:

  • Formula: A ruby file, <package>.rb. It contains a description of the package, as well as instructions for how to 'brew' it. Specifically:
    • Where to find the package files.
    • Steps for installing and setting up the package (e.g. compiling code, installing dependencies, running tests, etc)
    • Where package files should be installed on your computer.
  • Tap: Directory or GitHub repo that contains Formulas.
  • Keg: Local directory where the 'brewed' package files are stored.
  • Cellar: Local directory with Kegs of 'brewed' packages.

Note that the path for the Cellar is /usr/local/Cellar, and the path for a Keg is /usr/local/Cellar/<package>/<version_number>.

Here's what installing a package looks like:

  • You run the command brew install <package>.
  • Homebrew searches its Taps for the corresponding Formula.
  • Homebrew then reads the Formula's instructions and:
    • Downloads the package files.
    • Creates a Keg in your computer's Cellar for the package.
    • 'Brews' the package.

Basic Package Requirements

A Homebrew package requires 4 things:

  1. A tarball of the package files - e.g. awspot-0.1.tar.gz. This is sometimes referred to as the source tarball.
  2. A GitHub repo containing the tarball.
  3. A Formula file for the package - e.g. awspot.rb.
  4. A Tap containing the package formula.

You should keep your package files and their tarball in the same repo. And, if you're planning on sharing your package with anyone, I'd recommend creating a repo for your Tap.

Here's the repo containing our example formula.

Package Creation Steps

At a high level, here are the steps you need to take to create your own package:

1. Compress your package files.

Compress your project files into a tarball using tar. Name your file using this pattern: <project name>-<version>.tar.gz, e.g. hworld-1.0.tar.gz

Here's an example command:

tar --exclude='./.git' --exclude='./README.md' -zcvf "hworld-1.0.tar.gz" .

2. Setup your Tap.

If you don't have an existing Tap, you should create a directory or repo for your package's formula file. You can then add your tap with the command: brew tap <path to tap>, e.g. brew tap https://github.com/rob-dalton/homebrew-tap.

3. Create your package Formula.

Run brew create <link to tarball>, e.g. brew create https://github.com/rob-dalton/hworld/raw/master/hworld-1.0.tar.gz. This will automatically generate a formula file with the appropriate link and SHA256 hash value.

4. Fill out your package Formula.

Define the installation instructions and any dependencies your package may have.

Example

Let's examine the details of a formula using an example.

Hworld

I've created a simple command line utility, Hworld. It prints "Hello world!" to the console when the command hworld is run.

It's stored in this GitHub repo. The tap containing its formula is located here.

You can install it yourself by running brew tap https://github.com/rob-dalton/homebrew-tap and brew install hworld.

Linux Commands

In order to understand how Hworld works, we need to understand how Linux handles commands. It's pretty simple: when you run a command, Linux searches for an executable file that matches the command. It looks for this file in the locations specified in the environment variable PATH (a list of directories).

For example, when you run the command, brew install hworld, Linux will do the following:

  1. Search through the directories listed in PATH, in order.
  2. Look for an executable file named brew.
  3. If the file brew is found, execute the file and pass the entire command and its args to it.
  4. If no matching file is found, return a command not found error.

Note that Linux will execute the FIRST match it finds.
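You can watch this lookup from the terminal. Here's a quick sketch - the directories in your PATH will vary from machine to machine:

```shell
# Print the directories searched, in order
echo "$PATH" | tr ':' '\n'

# Ask the shell which executable file a command resolves to
command -v ls
```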

For our package, the hworld command will execute a single script file, also named hworld:

#!/bin/bash

echo "Hello world!"

Hworld Formula

The formula for hworld is shown below. You can see that it simply extends Homebrew's Formula class. Let's take a look at it and expand on several key attributes and methods:

class Hworld < Formula
  desc "Simple hello world script."
  homepage "https://github.com/rob-dalton/hworld"
  url "https://github.com/rob-dalton/hworld/raw/master/hworld-1.0.tar.gz"
  sha256 "8443118e257c4c109332ae58df932da99f3bd1291a67b8a8a0283f529bc4f48e"
  version "1.0"

  def install
    # install hworld script, create symlink to script in /usr/local/bin
    bin.install "hworld"
  end

  test do
    # test script output
    assert_equal "Hello world!\n", shell_output("#{bin}/hworld")
  end
  
end

sha256

Every Formula has a sha256 value - this is the hash of your compressed package files. Homebrew checks this hash to ensure the tarball it downloads from url is the one specified in the Formula, and contains only the files the Formula expects.

You can obtain the sha256 value for your package source by running brew create <url> - this will create a Formula file similar to the one above with the url and sha256 fields filled out for you.

Alternatively, you can run shasum -a 256 <tarball> to generate the sha256 value alone.

install

Brew only installs files specified under the install method. Brew will discard all package files not explicitly handled in install.

The line bin.install "hworld" tells brew to do 2 things:

  • Install the script file hworld under our package's prefix.
  • Create a symlink to our file with the same name in /usr/local/bin

It's a good practice to install your command's executable script to /usr/local/bin. It's where Homebrew generally installs package executables (for example, the script that the brew command executes lives there), and it should exist in most users' PATH.

Now, when you run the command hworld, Linux should find and execute /usr/local/bin/hworld.
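Roughly speaking, bin.install plus the symlink step behaves like the shell commands below. This is only an illustration of the Keg-and-symlink layout - Homebrew does the real work for you - and it uses a temporary directory in place of /usr/local:

```shell
# Stand-in for /usr/local, so we don't touch the real Cellar
PREFIX="$(mktemp -d)"
KEG="$PREFIX/Cellar/hworld/1.0"

# 1. Install the script into the package's Keg
mkdir -p "$KEG/bin" "$PREFIX/bin"
printf '#!/bin/bash\necho "Hello world!"\n' > "$KEG/bin/hworld"
chmod +x "$KEG/bin/hworld"

# 2. Symlink it into a bin directory that would be on PATH
ln -s "$KEG/bin/hworld" "$PREFIX/bin/hworld"

"$PREFIX/bin/hworld"   # prints: Hello world!
```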

test

After a package is successfully installed, you can run its test method with the command brew test <package name>. This is a good way to verify that your package works properly, even if it installs without any errors. It's also a best practice - in fact, Homebrew won't even review a package for inclusion in homebrew/core if it doesn't include a few tests.

Our test simply executes our script file and checks that its output matches our expectations - i.e. that it prints the string "Hello world!".

Documentation

For more information, you should refer to the documentation here.

Closing Remarks

That's it! Again, I hope this was useful.

For more information on Homebrew and Formula creation, please refer to Homebrew's website. Also, please note that for brevity and simplicity, I didn't discuss how Homebrew handles dependencies in this post. However, it's a topic worth reviewing.