Functional Machine Learning - Training MNIST with simple Naive Bayes Classifier in Scala
A few months into functional programming with Haskell, I made the capitalist choice to learn Scala. Scala didn’t seem quite as beautiful and elegant as my college professor had claimed, but with my love of functional programming, I tried to implement a simple Naive Bayes Classifier, which I had also studied a while ago.
Abstraction of DataSet
Before we begin training anything, we first need an abstraction of the dataset.
- type LabeledData, representing a single labeled data point (we’re doing supervised learning here)
- type DataSet, representing a data set consisting of multiple LabeledDatas, along with some helper functions
LabeledData
LabeledData is parameterized by two types: the type of the label and the type of the sample. It is simply a tuple of two elements, (label, sample) (a Product2 in Scala), but with label and sample accessors aliasing _1 and _2 to improve readability.
trait LabeledData[LabelType, SampleType] extends Product2[LabelType, SampleType] {
  def label: LabelType = _1
  def sample: SampleType = _2
}

// companion object
object LabeledData {
  def apply[A, B](l: A, s: B): LabeledData[A, B] = new LabeledData[A, B] {
    override def _1: A = l
    override def _2: B = s
    override def canEqual(that: Any): Boolean = that.isInstanceOf[LabeledData[A, B]]
  }
}
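As a quick sanity check, here is a small hypothetical usage sketch (the values are made up, not from the post):
// hypothetical example: the label 7 paired with a tiny "sample" of pixel values
val ld = LabeledData(7, Seq(0f, 128f, 255f))
ld.label  // 7
ld.sample // Seq(0.0, 128.0, 255.0)
ld._1     // 7 - the underlying Product2 accessor still works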
DataSet
DataSet is just a sequence of LabeledDatas, but it should also offer some functions for easy processing later on. In particular, the groupByClass function, which groups samples by their labels, will come in very handy when we calculate means and variances. Any derivative of DataSet contains data as a member value: a sequence of LabeledData with labels of type Label and samples of type Sample.
trait DataSet[Label, Sample] {
  // type alias for a sequence of labeled data
  type DataSeqType = Seq[LabeledData[Label, Sample]]

  // the actual data
  def data: Seq[LabeledData[Label, Sample]]

  def size: Int = data.size

  // distinct classes (labels) of the dataset
  def classes: Seq[Label] = data.map(_.label).distinct

  // groups samples by their classes
  lazy val groupByClass: Map[Label, Seq[Sample]] =
    data groupBy (_.label) mapValues (_.map(_.sample))
}
object DataSet {
  // construct labeled data from a sequence of tuples
  def fromTupleSequence[A, B](rawData: Seq[(A, B)]): Seq[LabeledData[A, B]] =
    rawData.map(a => LabeledData(a._1, a._2))
}
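To see how these two pieces fit together, here is a small hypothetical sketch (toy values, not the real MNIST data):
// toy (label, sample) tuples
val raw = Seq(0 -> Seq(0f, 255f), 1 -> Seq(128f, 64f), 0 -> Seq(10f, 20f))
val labeled = DataSet.fromTupleSequence(raw)
// an anonymous DataSet instance, just to exercise the helpers
val tiny = new DataSet[Int, Seq[Float]] { def data = labeled }
tiny.size         // 3
tiny.classes      // Seq(0, 1)
tiny.groupByClass // Map(0 -> Seq(Seq(0.0, 255.0), Seq(10.0, 20.0)), 1 -> Seq(Seq(128.0, 64.0)))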
MnistDataSet
Now we extend DataSet to handle the MNIST dataset.
MNIST consists of 60,000 training examples and 10,000 test examples.
Each sample is a handwritten digit of 28 x 28 pixels, labeled 0 to 9 with the actual digit written.
MNIST dataset - handwritten digits from 0 to 9
The label type is Int (the digit), and each sample is a Seq[Float].
Each sample has length 784 (28 x 28 pixels), with values from 0 to 255.
So we extend DataSet[Int, Seq[Float]] to create MnistDataSet.
Implementations of helper functions such as mean and variance are added to compute statistical values conveniently.
These are implemented in the inheriting class rather than in the trait DataSet, because both the Label and Sample types are ambiguous there - what if Sample were a sequence of Booleans? (Any functional advice on this?)
I don’t think the implementations of mean and variance are optimal, but they are, at least, straightforward.
case class MnistDataSet(data: Seq[LabeledData[Int, Seq[Float]]])
  extends DataSet[Int, Seq[Float]] {

  // find the mean of the values
  def mean(list: Seq[Float]): Float = list.sum / list.size

  // find the variance of the values
  def variance(list: Seq[Float], mean: Float): Float =
    (list.map(x => math.pow(x - mean, 2)).sum / list.size).toFloat

  // given a list, calculate both the mean and the variance
  def meanVariance(list: Seq[Float]): (Float, Float) = {
    lazy val avg = mean(list)
    (avg, variance(list, mean = avg))
  }

  // per class, a (mean, variance) pair for each attribute
  lazy val sampleMeanVariances: Map[Int, Seq[(Float, Float)]] =
    // the values here are sequences of Seq[Float], each of length 784 (for MNIST)
    // transposing them yields 784 sequences, grouping the samples by attribute
    groupByClass.mapValues(_.transpose.map(meanVariance).toVector)
}
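Here is a quick hypothetical check of meanVariance and sampleMeanVariances on a toy data set with two attributes instead of 784 (values made up):
val toy = MnistDataSet(DataSet.fromTupleSequence(Seq(
  0 -> Seq(0f, 10f),   // class 0
  0 -> Seq(2f, 14f),   // class 0
  1 -> Seq(8f, 8f))))  // class 1
toy.meanVariance(Seq(1f, 2f, 3f)) // (2.0, 0.6666667)
toy.sampleMeanVariances(0)        // Vector((1.0,1.0), (12.0,4.0)) - per-attribute (mean, variance) of class 0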
We can now read in the MNIST dataset from file using DataSet’s fromTupleSequence function, with the help of a file reader I’ve borrowed from swiftlearner.
val minstDataSet: MnistDataSet = MnistDataSet(DataSet fromTupleSequence labeledTrainIterator.toSeq)
Gaussian Naive Bayes Classifier
Finally, the classifier! We’re building a Naive Bayes classifier on sample data with continuous pixel values. One could argue that actual pixel values are discrete - integers from 0 to 255 - but for the sake of the problem we can treat them as continuous (they’re not boolean values). So we assume each pixel value follows a Gaussian distribution.
The Naive Bayes model makes an important “naive” assumption - conditional independence:
- each feature x_i is conditionally independent of every other x_j, for i != j
This assumption, if you think about it, is actually false for the MNIST dataset. Are the pixels right above or right below a certain pixel truly independent of it? Nevertheless, it is a ‘safe enough’ assumption. We can decide whether a digit is ‘0’ given the test pixel values by checking whether certain pixel positions were ‘generally dark’ across the examples of ‘0’ we trained on.
The Naive Bayes classifier finds the most probable ‘class’ given the data.
In other words, we’re finding the class that maximizes the conditional probability P(class | data).
The conditional independence assumption simplifies the likelihood part.
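In symbols, this is the standard Naive Bayes decision rule (the evidence term is dropped since it does not depend on the class; 784 is the number of pixels):
\hat{y} = \arg\max_{c} P(c \mid x_1, \dots, x_{784})
        = \arg\max_{c} P(c) \prod_{i=1}^{784} P(x_i \mid c)
        = \arg\max_{c} \Big( \log P(c) + \sum_{i=1}^{784} \log P(x_i \mid c) \Big)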
GaussianNaiveBayes trait
All we need for the Naive Bayes classifier are the prior and likelihood estimates. To keep the calculation simple, we use the log-likelihood and log-prior instead; this lets us compute a sum of log-likelihoods instead of a product of likelihoods.
trait GaussianNaiveBayes {
  def data: DataSet[Int, Seq[Float]]

  // estimate log P(X | Y)
  def logLikelihood: Map[Int, Seq[Float => Option[Double]]]

  // estimate log P(Y)
  def logPrior: Map[Int, Float]

  // pick the class maximizing P(Y | X)
  def predict(x: Seq[Float]): Int
}
Using MnistDataSet data, we can now create a class MnistGaussianNaiveBayesClassifier.
case class MnistGaussianNaiveBayesClassifier(data: MnistDataSet)
  extends GaussianNaiveBayes
Parameter estimates
The log prior estimate. The prior is the probability of each class’s occurrence; the log prior is just its logarithm. MNIST should have prior estimates of roughly 1/10 for every class.
lazy val logPrior: Map[Int, Float] = {
  lazy val totalSize = data.size
  data.groupByClass.mapValues(x => math.log(x.size.toFloat / totalSize).toFloat)
}
The log likelihood estimate.
For each attribute of an input sample, the likelihood is modeled as a Gaussian distribution with that attribute’s sample mean and variance.
To compute this, we use the sampleMeanVariances function of MnistDataSet, which uses transpose to group the samples into 784 attributes.
This was actually a nice part of functional programming: it inherently supports something like the numpy.transpose I’m used to from the Python world.
logLikelihood maps each attribute to a function that takes an attribute value (a Float) and outputs the likelihood (an Option[Double], matching the trait).
Mapping to a curried function gaussianProb(mean: Float, variance: Float)(x: Float): Option[Double] makes it easy to compute the probability of a value given later (in this case x, the test example’s attribute value) while fixing the parameters we can compute beforehand (in this case, the mean and variance of each attribute).
This removes the need to explicitly keep the means and variances of all attributes around, as one would in an imperative language.
By currying the parameter x after mean and variance, a Gaussian distribution is defined, waiting only for an input value to compute the probability.
val epsilon = 0.1f

lazy val logLikelihood: Map[Int, Seq[Float => Option[Double]]] =
  // epsilon is added as a prior variance (hence making this a MAP estimate)
  data.sampleMeanVariances.mapValues(_.map({ case (mean, variance) =>
    gaussianProb(mean, variance + epsilon) _
  }))
MAP and a Gaussian prior.
If a certain pixel has the value 0 across all training samples, then its sample distribution has both mean and variance 0.
The probability of that pixel for a test sample will then be 0, and this is a problem - the entire likelihood term becomes 0, since it is the product of all the attributes’ probabilities.
Thus we compute the maximum a posteriori (MAP) estimate instead of the MLE.
Setting the prior for each attribute to a normal distribution with known variance (0.1f in my case), the posterior distribution has a variance of sample variance + prior variance.
This is why gaussianProb accepts variance + epsilon as its variance parameter.
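The snippets above never show gaussianProb itself, so here is a minimal sketch of what it might look like, assuming it returns the log of the Gaussian density wrapped in an Option (with None guarding against a non-positive variance), which is the Float => Option[Double] shape the trait expects:
// sketch under the assumptions above - not necessarily the post's exact implementation
// log N(x; mean, variance) = -0.5 * log(2 * pi * variance) - (x - mean)^2 / (2 * variance)
def gaussianProb(mean: Float, variance: Float)(x: Float): Option[Double] =
  if (variance <= 0f) None // no valid density without a positive variance
  else Some(-0.5 * math.log(2 * math.Pi * variance) - math.pow(x - mean, 2) / (2 * variance))
Partially applying it as gaussianProb(mean, variance + epsilon) _ then yields exactly the Float => Option[Double] functions that logLikelihood stores, and predict below simply sums the returned log densities.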
Prediction
Now that we know our priors and likelihoods, we are ready to predict which digit a handwritten sample is. This is as easy as computing the argmax of the conditional probability in the equation above.
The argmax function returns the key that has the maximum value.
def argmax[A](map: Map[A, Double]): A =
  map.keysIterator.reduceLeft((x, y) => if (map(x) > map(y)) x else y)
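For example, with hypothetical log posterior values:
argmax(Map(0 -> -120.5, 1 -> -98.3, 2 -> -250.0)) // returns 1, the key with the largest value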
And the final predict function will be:
override def predict(x: Seq[Float]): Int = argmax(logPrior.map({
  case (cls, pri) =>
    val like = logLikelihood(cls).zip(x).flatMap { case (f, xVal) => f(xVal) }
    val posterior = like.sum + pri
    cls -> posterior // map each class to its log posterior (up to a constant)
}))
Testing the classifier
With the following script, we can test the classifier with the first 10 test samples.
object Main {
  def main(args: Array[String]): Unit = {
    val mnist: MnistDataSet = Mnist.minstDataSet
    val mnistTest: MnistDataSet = Mnist.mnistTestDataSet
    val testData = mnistTest.data

    val classifier = MnistGaussianNaiveBayesClassifier(mnist)
    println("classifier loaded")

    println("Testing with 10 examples: ")
    val predictTestSet = testData.take(10).map(samp => (samp.label, classifier.predict(samp.sample)))

    println("Results: (correct_label, predicted_label)")
    println(predictTestSet.toVector)
  }
}
And the results show:
Results: (correct_label, predicted_label)
Vector((7,7), (2,2), (1,1), (0,0), (4,9), (1,1), (4,8), (9,9), (5,4), (9,9))
It predicts more than half of the test samples correctly (which is good) while getting some wrong. Mispredicting a 5 as a 4 seems especially odd. Still, given such a naive assumption of conditional independence, it is pleasing to see the algorithm actually work, although the accuracy is poor compared to state-of-the-art deep learning models with errors as low as 0.21%.
Conclusion
The test script runs VERY SLOWLY, and I will try to improve this. Fresh out of the imperative world, I still find performance optimization in a functional language hard to grasp.
However, playing with functional programming for machine learning algorithms was an exciting challenge. As simple as the Naive Bayes classifier is, it already offered plenty of room to practice functional programming ideas. I look forward to writing some iterative algorithms such as EM or PCM - which will be much more fun.
The complete code used in this post is uploaded to this GitHub repository.
Until next time!