Machine Learning with Haskell - EM Algorithm and State Monads
Several months into teaching myself Haskell, I now genuinely think Haskell is HOT. As an all-time Python lover, I at first found Haskell code mostly incomprehensible, but I have finally come to grasp the dark magic of monads.
This article is about my little experiment with Haskell’s monads in a machine learning algorithm.
Most intractable algorithms in machine learning involve iterative computation, optimizing performance by updating parameters step by step.
Naturally, iteration and parameter updates connect to the concept of ‘state’, or storage of variables.
The words update and variable do not sound appropriate for a functional language that constantly emphasizes “purity” and deeply hates side effects.
However, there is a Haskellic way of dealing with states, using the State Monad.
In this article, I will explain my own implementation of one of the most popular iterative algorithms in machine learning, the EM (Expectation Maximization) algorithm, making use of Haskell’s State Monad.
EM Algorithm and Coin Flip Example
The EM algorithm tries to find the best parameters of a model when some of the data are missing or unobserved. It would be difficult to fully derive the EM algorithm in this article, so here is a brief explanation of a simple problem setting to implement (for a full derivation, refer to this nice Stanford University CS229 material).
Let’s follow the popular coin-flip example. Suppose the villain “Two-Face” now has two coins, COIN_A and COIN_B, to choose from. These coins are inside a pouch, and “Two-Face” randomly picks a coin from the pouch without looking, so the probability of picking each coin is 0.5.
P(COIN_A) == P(COIN_B) == 0.5
Each coin is biased, meaning that the coins do not give heads and tails with equal probability. If a coin’s bias is 0.8, then out of 100 flips the coin is likely to give around 80 heads. “Two-Face” is being gentle and throws 10 flips once he picks a coin. If more than half of the throws give heads, the victim is doomed; otherwise, spared. Now suppose that “Two-Face” doesn’t know how biased each coin is.
P(BIAS_A) = ?
P(BIAS_B) = ?
“Two-Face” has done this ‘act of judgement’ against 10 targets. The results are shown in the table below:
| Coin (A or B) | Flips | Result (No. Heads) |
| --- | --- | --- |
| ? | H T T T T T T T T H | SPARED (2) |
| ? | H H H H T H H H H H | KILLED (9) |
| ? | T T H H H H T H H H | KILLED (7) |
| ? | H H H H H H T H T H | KILLED (8) |
| ? | H H H T H H H H H H | KILLED (9) |
| ? | H H T H H H T H H H | KILLED (8) |
| ? | T T T T T H T H T H | SPARED (3) |
| ? | T T H H H T T T T T | SPARED (3) |
| ? | H H H T H H H H T T | KILLED (7) |
| ? | H T T T T T T T H T | SPARED (2) |
Clearly, the coins are biased. Our job is to find estimates of the coin biases, given the training data - the flip results for the 10 targets.
HMatrix - the numpy of Haskell
We now need a way to represent the numeric values used during the calculations.
A really nice package for this purpose is hmatrix.
Just like numpy in the Python world, hmatrix provides nice abstractions for representing vectors and matrices, as well as functions for performing linear algebra operations.
We can express the probability of the coins as a Vector of real values (R). The vector function transforms a list of Doubles into type Vector R.
import Numeric.LinearAlgebra
probCoin :: Vector R -- R is an alias for Double
probCoin = vector [0.5, 0.5] -- P(COIN_A) == P(COIN_B) == 0.5
Similarly, the coin-flip results can also be expressed as a Vector containing the number of heads in 10 throws.
headObserved :: Vector R -- converted to R for convenience
headObserved = vector [2, 9, 7, 8, 9, 8, 3, 3, 7, 2] -- number of heads
Keeping the state with the State Monad
Before proceeding with the actual implementation of EM, we need to understand how the State Monad makes sense for EM.
Haskell’s State Monad allows us to pass the state around as we iterate.
As a sidenote, the State Monad’s name is somewhat misleading, since it doesn’t actually contain any state value; it wraps a state-processing function, which can be accessed with runState.
The state-processing function takes the previous state as input and outputs both the (possibly) modified state and an output value as a tuple.
-- State [state type] [output type]
newtype State s a = State { runState :: s -> (a, s) }
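To see this in action, here is a minimal, self-contained counter example (the name tick and the counter semantics are made up for illustration, not part of the EM code):

```haskell
import Control.Monad.Trans.State (State, state, runState)

-- a tiny state-processing function: output the current counter, then increment it
tick :: State Int Int
tick = state $ \n -> (n, n + 1)

-- runState unwraps the processing function and applies it to an initial state;
-- runState (tick >> tick >> tick) 0 evaluates to (2, 3):
-- the output of the last tick is 2, and the final state is 3
```

Notice how (>>) silently threads the counter from one tick to the next - this is exactly the plumbing we will rely on for EM.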
EM is an iterative algorithm repeating the E-step and then the M-step until the log-likelihood converges.
# pseudocode
while not converged:
    model.e_step()
    model.m_step()
In our problem setting, the only parameters that get updated are the estimates of the coin biases (which “Two-Face” doesn’t know!). This is the only “state” we need to keep in our model.
type Params = Vector R -- coin biases
Then the state type for the State Monad in our model will be Params.
State Params ? -- State [state type] [output type]
The type for output values will be determined as we proceed with our implementation of EM step.
We have just gone through the first step of designing a model that can carry “state” using the State monad!
Parameter initialization
We don’t know anything about the coin biases (of course - we are doing all this mess precisely in order to estimate them!). However, in order to run the EM algorithm, we need to initialize the biases (parameters) with arbitrary probability values - say, 0.4 and 0.6.
This is like saying:
Just imagine for now that the biases are 0.4 and 0.6. How probable are each trials for each coin?
This value is temporary and will be updated in the M-Step described later on.
-- initial coin bias
initParam :: Params
initParam = vector [0.4, 0.6]
E-Step
The main purpose of the E-Step (expectation step) is to fill in the blanks from the missing (unobserved) data, using the temporary parameters (coin biases). We must calculate the probabilities of the coins used for each trial (10 throws), given the bias and the trial results. It is easy to derive the formula for this probability using Bayes’ Rule.
For example, calculating the probability of COIN_A and COIN_B for trial 1 (which gave 2 heads) looks like:
# pseudocode
EXP(COIN_A, TRIAL_1)
= P(COIN_A | TRIAL_1, COIN_BIAS)
= P(TRIAL_1 | COIN_A, COIN_BIAS) * P(COIN_A) / sum_over_coin(P(TRIAL_1 | COIN_X, COIN_BIAS) * P(COIN_X))
EXP(COIN_B, TRIAL_1)
= P(COIN_B | TRIAL_1, COIN_BIAS)
= P(TRIAL_1 | COIN_B, COIN_BIAS) * P(COIN_B) / sum_over_coin(P(TRIAL_1 | COIN_X, COIN_BIAS) * P(COIN_X))
Let’s break down the terms in the expression above.
the likelihood: P(TRIAL_1 | COIN_A, COIN_BIAS)
The expression P(TRIAL_1 | COIN_A, COIN_BIAS) can be evaluated as a binomial probability: the probability of getting TRIAL_1 heads out of 10 throws of a coin (COIN_A) that has a bias of COIN_BIAS_A.
# pseudocode
# 1 - COIN_BIAS_A = probability of giving tails
# 10 - TRIAL_1 = number of tails in trial 1
P(TRIAL_1 | COIN_A, COIN_BIAS) = Binom(TRIAL_1, 10, COIN_BIAS_A) = (COIN_BIAS_A ^ TRIAL_1) * ((1 - COIN_BIAS_A) ^ (10 - TRIAL_1))
The pseudocode above can easily be turned into a Haskell function. (Note that the binomial coefficient is omitted: it is identical for both coins, so it cancels out during normalization.)
binomProb :: Matrix R -> Matrix R -> Matrix R -> Matrix R
binomProb heads tails bias = (bias ** heads) * ((1 - bias) ** tails)
What’s happening here: binomProb takes inputs of Matrix type.
Each Matrix simply represents the collection of training samples (i.e. numbers of heads).
The mathematical operators are applied element-wise, so heads, tails and bias all have the same dimensions.
This way, we can simply multiply the matrices once instead of iterating through all the training examples.
The heads matrix has size (num_coins, num_samples), which is (2, 10) since we have two coins and ten trials.
The bias matrix contains the biases of the coins, replicated by the number of samples.
The output will then also have size (num_coins, num_samples), where each row holds the likelihoods of all samples for one coin.
This will become clearer once we get to the prior.
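As a quick sanity check, the same element-wise formula can be evaluated on plain scalars (binomTerm is just an illustrative helper, not part of the model):

```haskell
-- scalar version of the element-wise binomial term (binomial coefficient omitted)
binomTerm :: Double -> Double -> Double -> Double
binomTerm heads tails bias = (bias ** heads) * ((1 - bias) ** tails)

-- e.g. trial 1 gave 2 heads (8 tails); under a bias of 0.4 this is 0.4^2 * 0.6^8
```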
the prior: P(COIN_A)
The prior is simple: the probability of picking the coin from the pouch, which is probCoin.
the joint probability: P(TRIAL_1 | COIN_A, COIN_BIAS) * P(COIN_A)
Multiplying the prior and the likelihood gives us the joint probability.
-- probability that an event will happen = P(coin) * P(num_heads | coin)
eventProb :: Vector R -> Vector R -> Vector R -> Matrix R
eventProb numheads headBias coinprobs = coinprobMat * binomProb headMat tailMat headBiasMat
  where
    numCoins = size headBias -- 2
    numSamples = size numheads -- 10
    headMat :: Matrix R = fromRows $ replicate numCoins numheads -- size (numCoins, numSamples)
    tailMat = 10 - headMat -- number of tails per trial
    headBiasMat :: Matrix R = fromColumns $ replicate numSamples headBias
    coinprobMat :: Matrix R = fromColumns $ replicate numSamples coinprobs -- size (numCoins, numSamples)
The eventProb function defined above simply multiplies the likelihood and the prior element-wise, after formatting the input vectors into Matrix form to make the computation easier.
The output matrix of eventProb contains the joint probability P(COIN_i, TRIAL_j) at position (i, j).
the denominator: sum_over_coin(P(TRIAL_1 | COIN_X, COIN_BIAS) * P(COIN_X))
The denominator of the probability expression is there for normalization: it makes the sum of P(COIN_X | TRIAL_1, COIN_BIAS) over the coins equal to 1.0.
We already have the values we need to sum over.
Since we are summing in the column direction, we create a function that does exactly that.
-- calculate the sum over columns
-- this is like: map sum [column_vectors]
columnSum :: Matrix R -> Matrix R
columnSum x = (1><dim) (repeat 1) <> x -- matmul with a (1 x dim) row of ones = sum over each column
  where
    dim :: Int = fst $ size x -- number of rows (length of each column)
-- normalize each column so that it sums to 1
columnNormalize :: Matrix R -> Matrix R
columnNormalize x = x / colSum
  where
    colSum = columnSum x
The infix operator (><) creates a Matrix of a specific size: (1><dim) (repeat 1) creates a Matrix of size (1, dim) filled with 1’s.
The other infix operator (<>) is matrix multiplication.
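To convince yourself of the column-sum and normalization logic without hmatrix, the same idea can be replayed on plain lists (these list helpers are only illustrative):

```haskell
import Data.List (transpose)

-- sum each column of a row-major list-of-lists (what columnSum computes via matmul)
columnSumL :: [[Double]] -> [Double]
columnSumL = map sum . transpose

-- divide each entry by its column's sum, so every column sums to 1
columnNormalizeL :: [[Double]] -> [[Double]]
columnNormalizeL xss = [zipWith (/) row sums | row <- xss]
  where
    sums = columnSumL xss

-- columnNormalizeL [[1, 2], [3, 2]] evaluates to [[0.25, 0.5], [0.75, 0.5]]
```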
probability of coins per sample
We have now prepared everything we need to evaluate the probability of each coin for each sample (trial).
-- E-step => probability of coins
-- each row (axis 0) => coins
-- each column (axis 1) => samples (trials)
-- resulting size : (num_coins, num_examples)
coinProbEst :: Vector R -> Vector R -> Vector R -> Matrix R
coinProbEst heads biases coinprobs = columnNormalize jointp
  where
    jointp = eventProb heads biases coinprobs -- joint probability
Calculating coinProbEst for our coin-flip example and then transposing the result (for visual purposes), we get:
(10><2)
[ 0.9192938209331653, 8.070617906683486e-2
, 3.755317588381989e-2, 0.9624468241161802
, 0.16494845360824748, 0.8350515463917526
, 8.070617906683486e-2, 0.9192938209331653
, 3.755317588381989e-2, 0.9624468241161802
, 8.070617906683486e-2, 0.9192938209331653
, 0.8350515463917526, 0.16494845360824748
, 0.8350515463917526, 0.16494845360824748
, 0.16494845360824748, 0.8350515463917526
, 0.9192938209331653, 8.070617906683486e-2 ]
where each row is a sample and the two columns give the probability of each coin for that sample.
For example, in the first row, the probability that COIN_A was the coin that gave the result is 92%, while the probability that it was COIN_B is just 8%.
Since trial 1 gave only 2 heads, COIN_A, with its initial bias of 0.4, is MUCH more probable than COIN_B, with its bias of 0.6.
M-Step
Now that we have the probabilities of the coin labels for each trial, we need to compute the parameters (= coin biases) that maximize the data likelihood (hence the M-step). The bias value that maximizes the data probability is expressed as:
# pseudocode
UPDATED_BIAS_A = TOTAL_HEADS_FOR_COIN_A / TOTAL_THROWS_FOR_COIN_A
This is intuitive: if a coin gave 8 heads out of 10 throws, we would naturally guess that the coin has a bias of about 0.8.
However, we only have the probabilities of the coins for each trial.
Instead of counting the number of heads directly, we evaluate a weighted sum of heads, where the weights are the probabilities of the coin that gave those heads.
For example, for the first trial, which gave 2 heads, 92% is attributed to COIN_A while the other 8% is attributed to COIN_B.
So the number of heads that COIN_A gave in trial 1 is 0.92 * 2 = 1.84.
A weighted sum can be directly interpreted as the dot product of the probability vector and the heads vector. Using this interpretation, we can go ahead and update the parameters.
-- calculate updated parameters
calcUpdatedParams :: Vector R -> Matrix R -> Vector R
calcUpdatedParams obsvd exps = fromList $
  map (\weightVec -> -- expected values for one coin
        weightVec <.> obsvd / weightVec <.> totalThrows) -- weighted sum = dot product
      sampleExpected
  where
    -- Expected values of the samples, per coin. With two coins, length sampleExpected == 2
    sampleExpected :: [Vector R] = toRows exps -- list of: [expected values of all samples for one coin]
    totalThrows :: Vector R = vector $ replicate numSamps totalThrow -- totalThrow = 10 throws per trial
    numSamps :: Int = size $ head sampleExpected
The dot product weightVec <.> obsvd replaces TOTAL_HEADS_FOR_COIN_A in the pseudocode above.
In the same manner, weightVec <.> totalThrows replaces TOTAL_THROWS_FOR_COIN_A.
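The per-coin update can be replayed on plain lists with toy numbers (updateBias is a hypothetical helper, not part of the hmatrix implementation):

```haskell
-- M-step update for one coin: weighted heads / weighted total throws
-- weights = P(coin | trial) for each trial; heads = observed heads per trial
updateBias :: [Double] -> [Double] -> Double
updateBias weights heads =
  sum (zipWith (*) weights heads) / (10 * sum weights) -- 10 throws per trial
```

For instance, if a coin is fully responsible for a 2-head trial and not at all responsible for a 9-head one, updateBias [1, 0] [2, 9] gives 2 / 10 = 0.2, just as counting heads directly would.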
Calculating the updated parameters once with our problem setting yields new bias values: 0.318 for COIN_A and 0.760 for COIN_B.
The results are, again, intuitive.
The trials that spared the targets gave only 2 or 3 heads, so the initial bias of 0.4 seems too high; after the update, it has been lowered to 0.318.
Similarly, the trials that killed the targets gave around 7 to 9 heads, so the initial bias of 0.6 is definitely too low; after the update, it has been raised to 0.76.
Iteration and State Monad
With calcUpdatedParams, we can now update the coin biases - but how do we iterate the update process?
Moreover, where do we store the intermediate values?
We could write a recursion by hand, passing the iteration number, stopping when it reaches zero, appending intermediate parameter values as we recurse, and passing THEM around.
This is dirty. What a mess!
The State Monad introduced above answers both questions.
redefining functions using State Monad
Recall that the parameters of our model are the coin bias values.
In the E-step, instead of directly returning the coin probability matrix, coinProbEst can be modified to return a State.
This State wraps a state-processing function that takes the previous parameters (coin biases) and outputs both the next parameters and the coin probability matrix.
coinProbEst :: Vector R -> Vector R -> State Params (Matrix R)
coinProbEst heads coinProbs = state $ \params ->
(columnNormalize (eventProb heads params coinProbs), params) -- doesn't modify the state
Note that the E-step doesn’t update the coin bias values, so the input state is passed on unmodified.
The state function creates a State Monad from a state-processing function (of type s -> (a, s), which is Params -> (Matrix R, Params) in this case).
In the M-step, calcUpdatedParams can likewise be used to create a State.
The state-processing function of this State monad outputs the updated parameters as the next state.
updateParams :: Vector R -> Matrix R -> State Params ()
updateParams heads probcoin = state $ const ((), calcUpdatedParams heads probcoin) -- pass on the updated state
Parameter initialization can also be expressed as a State using the put function, which wraps a state-processing function that replaces the state being passed around.
-- input = initial coin bias values
initEm :: Vector R -> State Params ()
initEm = put
-- note the definition of put
put newState = state $ \_ -> ((), newState)
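A one-line round trip (with a made-up initial state) shows the effect:

```haskell
import Control.Monad.Trans.State (put, runState)

-- the initial state 0 is discarded and replaced by 42;
-- runState (put 42) 0 evaluates to ((), 42)
demo :: ((), Int)
demo = runState (put 42) 0
```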
combination and iteration
One EM step consists of one E-Step followed by one M-Step.
Now that we have monads, we can combine these two steps easily with bind.
emStep :: Vector R -> Vector R -> State Params ()
emStep heads coinProbs = coinProbEst heads coinProbs >>= updateParams heads
The (>>=) operator, or monadic bind, takes a State monad and a function.
The output value of the State monad’s processing function (in this case the coin probabilities produced by coinProbEst) is fed to that function, producing the next State monad (in this case the output State of updateParams).
We can make use of the fact that the two other values required for the iteration - the observed heads and the probability of a coin being picked from the pouch - are constant: headObserved and probCoin, respectively.
This leads to a simplified wrapper State that runs one EM step.
-- one stepper state for EM
-- after the step, return the current state
emStepper :: State Params Params
emStepper = emStep headObserved probCoin >> get
The get function used here simply returns the current state as the output value.
get = state $ \s -> (s, s)
Now, performing the iteration is as simple as calling replicateM on the emStepper monad!
-- iterate EM algorithm multiple times and collect intermediate params
emIter :: Int -> State Params [Params]
emIter numIter = replicateM numIter emStepper
This is how stateful iterations are made in Haskell. No for-loops, no while-loops.
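The same pattern in miniature, with an invented halveStepper standing in for emStepper:

```haskell
import Control.Monad (replicateM)
import Control.Monad.Trans.State (State, state, get, evalState)

-- one step: halve the state, then return the new state as output (like emStep >> get)
halveStepper :: State Double Double
halveStepper = state (\s -> ((), s / 2)) >> get

-- evalState (replicateM 3 halveStepper) 8.0 collects [4.0, 2.0, 1.0]
```

replicateM runs the stepper the given number of times, threading the state through and collecting each intermediate output in a list - exactly what emIter does with the coin biases.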
Results & Conclusion
Checking out the intermediate values over 10 iterations, after initializing the coin biases with [0.4, 0.6], is expressed as:
-- intermediate parameters
intermParams :: [Vector R] = evalState (initEm (vector [0.4, 0.6]) >> emIter 10) $ vector []
evalState takes a State and an initial state value (of type Params), and returns the output value after all the state processing is finished.
The empty vector vector [] provided here is meaningless, because it is immediately overwritten by initEm.
The results of the 10 iterations are as follows (with some pretty-printing):
Step : 1 Params: [0.3181271315275837,0.7601145899025433]
Step : 2 Params: [0.26299184390385155,0.799827186851992]
Step : 3 Params: [0.2550167719144156,0.8001081427495718]
Step : 4 Params: [0.2541708988752831,0.7999958870233562]
Step : 5 Params: [0.25408453342034,0.7999816586068763]
Step : 6 Params: [0.25407572598399125,0.7999801541376964]
Step : 7 Params: [0.25407482733626807,0.7999799996411089]
Step : 8 Params: [0.2540747356291876,0.7999799838568136]
Step : 9 Params: [0.2540747262701262,0.7999799822456467]
Step : 10 Params: [0.25407472531499103,0.7999799820812141]
The bias for COIN_A converges to 0.25, while the bias for COIN_B converges to 0.8.
Let’s hope we made “Two-Face” happy with our results!
The results are consistent with the fact that I actually generated the 10 trials presented above from binomial distributions with biases of 0.25 and 0.8, respectively.
This was my little experiment, spawned from curiosity about how powerful Haskell’s monads can be, even for implementing machine learning algorithms.
Machine learning without Python was unimaginable to me at first, but the joy of playing with monads and actually getting these nice results with the EM algorithm was more thrilling and exciting than I had ever imagined
(look at the iteration using monads in emIter - a one-liner!?).
I have come to think that functional languages may be even better suited for machine learning than Python or any other imperative language.
Such thoughts were further boosted after reading this article about neural networks and functional programming.
I will also have to admit, though, that learning Haskell and the State Monad was painful.
It gave me so many headaches that I haven’t yet attempted even more awesome concepts like GADTs, Dependent Types, and/or DataKinds.
Even so, there might be something deeper between machine learning and functional programming, and I would be very glad to see how the two get along in the future.
- The full code can be seen in this repository, and the following is the full Haskell code for this article:
{-# LANGUAGE ScopedTypeVariables #-}
module EmCoinState where
import Prelude hiding ((<>)) -- avoid the clash with Prelude's Semigroup (<>) on GHC >= 8.4
import Control.Monad.Trans.State (State, put, state, get)
import Control.Monad (replicateM)
import Numeric.LinearAlgebra
  ( Vector
  , R
  , Matrix
  , vector
  , fromList
  , toList
  , fromRows
  , (<.>) -- dot product (vector)
  , (><) -- matrix formation
  , (<>) -- matmul
  , size
  , toRows
  , fromColumns
  )
-- total number of throws
totalThrow :: R
totalThrow = 10
-- initial parameters
-- initial bias of two coins giving heads
initParam :: Params -- vector of real values - R is just an alias of Double
initParam = vector [0.4, 0.6]
-- observed data represented by the number of heads in 10 coin-flips
headObserved :: Vector R
headObserved = vector [2, 9, 7, 8, 9, 8, 3, 3, 7, 2]
-- test the validity of observed data - each head count should be at most 10
testObserved :: Vector R -> Bool
testObserved = all (<= 10) . toList
-- probability of coin being generated
probCoin :: Vector R
probCoin = vector [0.5, 0.5]
-- define type parameters
type Params = Vector R
-- binomial probability
binomProb :: Matrix R -> Matrix R -> Matrix R -> Matrix R
binomProb hs tails bs = (bs ** hs) * ((1 - bs) ** tails)
-- probability that an event will happen = P(coin) * P(num_heads | coin)
eventProb :: Vector R -> Vector R -> Vector R -> Matrix R
eventProb numheads headBias coinprobs = coinprobMat * binomProb headMat tailMat headBiasMat
  where
    numCoins = size headBias
    numSamples = size numheads
    headMat :: Matrix R = fromRows $ replicate numCoins numheads
    tailMat = 10 - headMat
    headBiasMat :: Matrix R = fromColumns $ replicate numSamples headBias
    coinprobMat :: Matrix R = fromColumns $ replicate numSamples coinprobs
-- calculate the sum over columns
-- this is like: map sum [column_vectors]
columnSum :: Matrix R -> Matrix R
columnSum x = (1><dim) (repeat 1) <> x -- matmul with a (1 x dim) row of ones = sum over each column
  where
    dim :: Int = fst $ size x -- number of rows (length of each column)
-- normalize each column so that it sums to 1
columnNormalize :: Matrix R -> Matrix R
columnNormalize x = x / colSum
  where
    colSum = columnSum x
-- E-step == probability of coins given the observations
-- each row (axis 0) => coins
-- each column (axis 1) => samples (trials)
-- P(coin_i | observed, theta) = P(observed | coin_i, theta) * P(coin_i) / sum_over_k( P(observed | coin_k, theta) * P(coin_k) )
-- resulting size : (num_coins, num_examples)
coinProbEst :: Vector R -> Vector R -> State Params (Matrix R)
coinProbEst heads coinProbs = state $ \params ->
  (columnNormalize (eventProb heads params coinProbs), params) -- doesn't modify the state
-- M-step = calculate updated theta
-- new_theta_coin = sum (weighted heads) / sum (weighted total_throws)
updateParams :: Vector R -> Matrix R -> State Params ()
updateParams heads coinExps = state $ const ((), calcUpdatedParams heads coinExps)
-- calculate the next parameters
calcUpdatedParams :: Vector R -> Matrix R -> Vector R
calcUpdatedParams obsvd exps = fromList $
  map (\weightVec -> -- expected values for one coin
        weightVec <.> obsvd / weightVec <.> totalThrows) -- weighted sum = dot product
      sampleExpected
  where
    -- Expected values of the samples, per coin. With two coins, length sampleExpected == 2
    sampleExpected :: [Vector R] = toRows exps -- list of: [expected values of all samples for one coin]
    totalThrows :: Vector R = vector $ replicate numSamps totalThrow
    numSamps :: Int = size $ head sampleExpected
-- initial coin bias values
initEm :: Vector R -> State Params ()
initEm = put
-- state contains the parameters.
-- TODO : make state processing function output a log-likelihood + current state
emStep :: Vector R -> Vector R -> State Params ()
emStep heads coinProbs = coinProbEst heads coinProbs >>= updateParams heads
-- one stepper state for EM
-- after the step, return the current state
emStepper :: State Params Params
emStepper = emStep headObserved probCoin >> get
-- iterate EM algorithm multiple times and collect intermediate params
emIter :: Int -> State Params [Params]
emIter numIter = replicateM numIter emStepper