Tech in T: depth + breadth‎ > ‎AI‎ > ‎Machine Learning‎ > ‎Sequence‎ > ‎


To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.

Come up with a list of features
f(sentence s, word ith, label_i, label_i-1) ∈ R , but usually just returns {0,1}. e.g. 
// by restricting our features to depend on only the current and previous labels, rather than arbitrary labels throughout the sentence, we are building a linear-chain CRF
// E.g., one feature function could be how much we suspect that the current word should be labeled as an adjective given that the previous word is “very”.

Score of labeling l for sequence s:
score = 0
for all words in s
for all features
       score += feature_weight * feature_value

Convert this labeling's score to probability among all labelings:

Make prediction:
Find most probable prediction
argmax_y* p(y* | x)      // prob of labelings given x (conditioned on x)

Example Feature Functions
f1(s,i,li,li−1)=1 if li=ADVERB and the ith word ends in “-ly”; 0 otherwise.
If the weight λ1 associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.

f2(s,i,li,li−1)=1 if i=1 and li=VERB, and the sentence ends in a question mark; 0 otherwise.
If the weight λ2 associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., “Is this a sentence beginning with a verb?”) are preferred.

f3(s,i,li,li−1)=1 if li−1=ADJECTIVE and li=NOUN; 0 otherwise.
A large, positive weight for this feature means that adjectives tend to be followed by nouns.

f4(s,i,li,li−1)=1 if li−1=PREPOSITION and li=PREPOSITION.
A *negative* weight λ4 for this function would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.

Smells like Logistic Regression…

The form of the CRF probabilities p(l|s) might look familiar. That’s because CRFs are indeed basically the sequential version of logistic regression: whereas logistic regression is a log-linear model for classification, CRFs are a log-linear model for sequential labels.


HMMs take a generative approach to labeling, defining

where p(li|li−1) are transition probabilities (e.g., the probability that a preposition is followed by a noun);
p(wi|li) are emission probabilities (e.g., the probability that a noun emits the word “dad”).

transition probabilities and emission probabilities (also known as output probabilities). The transition probabilities control the way the hidden state at time t is chosen given the hidden state at time t − 1 {\displaystyle t-1} t-1.

CRFs are more powerful than HMMs– they can model everything HMMs can and more.

Note that the log of the HMM probability is logp(l,s)=logp(l0)+∑ilogp(li|li−1)+∑ilogp(wi|li)

This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated to binary transition and emission indicator features.

That is, we can build a CRF equivalent to any HMM by:
    For each HMM transition probability p(li=y|li−1=x), define a set of CRF transition features of the form fx,y(s,i,li,li−1)=1 if li=y and li−1=x. Give each feature a weight of wx,y=logp(li=y|li−1=x)

    Similarly, for each HMM emission probability p(wi=z|li=x), define a set of CRF emission features of the form gx,y(s,i,li,li−1)=1 if wi=z and li=x. Give each feature a weight of wx,z=logp(wi=z|li=x)

Thus, the score p(l|s) computed by a CRF using these feature functions is precisely proportional to the score computed by the associated HMM, and so every HMM is equivalent to some CRF. However, CRFs can model a much richer set of label distributions as well, for two main reasons:

    CRFs can define a much larger set of features. Whereas HMMs are necessarily local in nature (because they’re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark.
    CRFs can have arbitrary weights. Whereas the probabilities of an HMM must satisfy certain constraints (e.g., 0<=p(wi|li<=1,∑wp(wi=w|l1)=1) , the weights of a CRF are unrestricted (e.g., logp(wi|li) can be anything it wants).

Learn Weights

Use gradient descent: Randomly initialize the weights of our CRF model. To shift these randomly initialized weights to the correct ones:
For each training example
For each feature function fi
calculate the gradient of the log probability of the training example with respect to λi

// Note that the first term in the gradient is the contribution of feature fi under the true label, and the second term in the gradient is the expected contribution of feature fi under the current model. This is exactly the form you’d expect gradient ascent to take.

Move λi in the direction of the gradient

Finding the Optimal Labeling

Suppose we’ve trained our CRF model, and now a new sentence comes in. How do we do label it?

The naive way is to calculate p(l|s) for every possible labeling l, and then choose the label that maximizes this probability. However, since there are k^m possible labels for a tag set of size k and a sentence of length m, this approach would have to check an exponential number of labels.

A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property that allows us to use a (polynomial-time) dynamic programming algorithm to find the optimal label, similar to the Viterbi algorithm for HMMs.

Application: Gift receiving as tagging using CRF

To gather data for the graphs above, I simply looked for phrases of the form “I want XXX for Christmas” and “I got XXX for Christmas”. However, a more sophisticated CRF variant could use a GIFT part-of-speech-like tag (even adding other tags like GIFT-GIVER and GIFT-RECEIVER, to get even more information on who got what from whom) and treat this like a POS tagging problem. Features could be based around things like “this word is a GIFT if the previous word was a GIFT-RECEIVER and the word before that was ‘gave’” or “this word is a GIFT if the next two words are ‘for Christmas’”.

dynamic programming for HMM (viterbi)

Naive Bayes vs Logistic Regression

As others have said, they both train feature weights wj for the linear decision function jwjxj (decide true if above 0, false if below).  The difference is how you fit the weights from training data.
In NB, you set each feature's weight independently, based on how much it correlates with the label.  (Weights come out to be the features' log-likelihood ratios for the different classes.)

In logistic regression, by contrast, you set all the weights together such that the linear decision function tends to be high for positive classes and low for negative classes.  (Linear SVM's work the same, except for a technical tweak of what "tends to be high/low" means.)

The difference between NB and LogReg happens when features are correlated.  Say you have two features which are useful predictors -- they correlate with the labels -- but they themselves are repetitive, having extra correlation with each other as well.  NB will give both of them strong weights, so their influence is double-counted.  But logistic regression will compensate by weighting them lower.

This is a way to view the probabilistic assumptions of the models; namely, Naive Bayes makes a conditional independence assumption, which is violated when you have correlated/repetitive features.

One nice thing about NB is that training has no optimization step.  You just calculate a count table for each feature and you're done with it -- it's single pass and trivially parallelizable every which way.

One nice thing about LR is that you can be sloppy with feature engineering.  You can throw in multiple variations of a feature without hurting the overall model (provided you're regularizing appropriately); but in NB this can be problematic.

Naive Bayes:
P(Y_k|x1,..,xn)=P(Y_k) Πi P(xi|Y_k)

A generative model (e.g., naive Bayes) explicitly models the joint probability distribution p(x,y) and then uses the Bayes rule to compute p(y|x). On the other hand, a discriminative model (e.g., logistic regression) directly models p(y|x). the generative model has its own advantages such as the capability of dealing with missing data, low # of data, low labeled data, etc. but given enough data discriminative models outperform


HMMHMM is a generative model and it gives the output directly by modeling the transition matrix based on the training data. The results can be improved by providing more datapoints, but there is no direct control over the output labels. HMM learns the transition probabilities on its own based on the training data provided. Hence if we provide more datapoints, then we can improve the model to include wider variety. CRF is a discriminative model which outputs a confidence measure. This is really useful in most cases because we want to know how sure the model is about the label at that point. This confidence measure can be thresholded to suit various applications. The good thing about confidence measure is that the number of false alarms is low compared to HMM.

The primary advantage of CRFs over HMMs is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs. Additionally, CRFs avoid the label bias problem, a weakness exhibited by Markov models based on directed graphical models. A CRF can be considered as a generalization of HMM or we can say that a HMM is a particular case of CRF where constant probabilities are used to model state transitions. CRFs outperform HMMs on a number of real-world sequence labeling tasks.