STK++ 0.9.13
Clustering: How to add new mixture models (Part 1)

Introduction

This tutorial will show how to add a mixture model to STK++.

The design pattern of the Clustering project is based on the idea of plugin so that you can add your own mixture model to the architecture without many effort. If you want a better integration of your mixture model to the library, the Clustering project proposes a set of interface classes with predefined behavior that can be used at your convenience.

In short, the main Interface to derive is the IMixture class. This class is well documented and pure virtual methods to implement are self explaining. The class MixtureComposer estimates the mixture model using only pointers on this interface.

It is thus possible to implement and integrate mixture models to the project with many freedom. However, if you want a closer integration of the mixture model to the Clustering project, it can be convenient to use general design pattern adopted by the project and to implement the Interface classes furnished.

This tutorial will describe step by step how to integrate new models to the project Clustering.

About Mixture Models

A mixture model on some subset $ J\subset \mathbb{R}^d $ is a density of the form

\[
    f(\mathbf{x}|\boldsymbol{\theta})
    = \sum_{k=1}^K p_k f(\mathbf{x};\boldsymbol{\lambda}_k,\boldsymbol{\alpha})
    \quad \mathbf{x} \in J.
\]

The $ p_k > 0$ with $ \sum_{k=1}^K p_k =1$ are the mixing proportions. The density f is called the component of the model. The parameters $\boldsymbol{\lambda}_k, \, k=1,\ldots K $ are the cluster specific parameters and the parameters $ \boldsymbol{\alpha} $ are the shared parameters.

The whole set of parameters is thus

\[
\boldsymbol{\theta} = \{ (p_k)_{k=1}^K, (\boldsymbol{\lambda}_k)_{k=1}^K, \boldsymbol{\alpha} \}.
\]

For example the diagonal Gaussian mixture model STK::DiagGaussian_sjk is the most general diagonal Gaussian model and has a density function of the form

\[
 f(\mathbf{x}|\theta) = \sum_{k=1}^K p_k \prod_{j=1}^d
   \frac{1}{\sqrt{2\pi}\sigma^j_{k}} \exp\left\{-\frac{1}{2}\left(\frac{x^j-\mu^j_k}{\sigma^j_{k}}\right)^2\right\}
\]

All the parameters are cluster specific. There is no shared parameters and $ \boldsymbol{\lambda}_k = \{ (\mu^j_k)_{j=1}^d, (\sigma^j_{k})_{j=1}^d \} $

On the other side the diagonal Gaussian mixture model STK::DiagGaussian_s is the most parsimonious diagonal Gaussian model and has density function

\[
 f(\mathbf{x}|\theta) = \sum_{k=1}^K p_k \prod_{j=1}^d
   \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{1}{2}\left(\frac{x^j-\mu^j_k}{\sigma}\right)^2\right\}.
\]

In both cases the means are class specific parameters, i.e. $ \boldsymbol{\lambda}_k = \{ (\mu^j_k)_{j=1}^d \}$. In the second model the common standard deviation is a shared parameter, i.e. $ \boldsymbol{\alpha} = \{ \sigma \} $.

We will illustrate this tutorial with the intermediate mixture model STK::DiagGaussian_sjsk with density function

\[
 f(\mathbf{x}|\theta) = \sum_{k=1}^K p_k \prod_{j=1}^d
   \frac{1}{\sqrt{2\pi}\sigma^j\sigma_{k}}
   \exp\left\{-\frac{1}{2}\left(\frac{x^j-\mu^j_k}{\sigma^j\sigma_{k}}\right)^2\right\}.
\]

This mixture model was not implemented in previous version of the Clustering project (before 2017/09/05).

Creating an Identification (Id)

The first step is to be able to identify the new model as a model recognized by the STK interfaces. This is achieved by updating the files STK_Clust_Util.h and STK_Clust_Util.cpp.

In the first file, we just add a new line in the Mixture enumeration

Before:After:
enum Mixture
{
Gamma_ajk_bjk_ =0,
//...
Gamma_a_bk_, // = 11
Gaussian_sjk_ =20,
Gaussian_sk_,
Gaussian_sj_,
Gaussian_s_, // = 23
//...
}
enum Mixture
{
Gamma_ajk_bjk_ =0,
//...
Gamma_a_bk_, // = 11
Gaussian_sjk_ =20,
Gaussian_sk_,
Gaussian_sj_,
Gaussian_s_,
Gaussian_sjsk_, // = 24
//...
}

In the second file we update the input/output utilities functions

MixtureClass mixtureToMixtureClass( Mixture const& type);
Mixture stringToMixture( std::string const& type);
Mixture stringToMixture( std::string const& type, bool& freeProp);
std::string mixtureToString( Mixture const& type);
std::string mixtureToString(Mixture type, bool freeProp);

by adding the following pieces of code

// The model Gaussian_sjsk_ is part of the DiagGaussian_ models
MixtureClass mixtureToMixtureClass( Mixture const& type)
{
//...
if (type == Gaussian_sjsk_) return DiagGaussian_;
//...
}
// The model Gaussian_sjsk_ is encoded as "Gaussian_sjsk"
Mixture stringToMixture( std::string const& type)
{
//...
if (toUpperString(type) == toUpperString(_T("Gaussian_sjsk"))) return Gaussian_sjsk_;
//...
}
// The model Gaussian_sjsk_ is encoded as "Gaussian_p_sjsk" (fixed proportions)
// or "Gaussian_pk_sjsk" (free proportions)
Mixture stringToMixture( std::string const& type, bool& freeProp)
{
freeProp = false;
//...
if (toUpperString(type) == toUpperString(_T("Gaussian_p_sjsk"))) return Gaussian_sjsk_;
//...
freeProp = true;
//...
if (toUpperString(type) == toUpperString(_T("Gaussian_pk_sjsk"))) return Gaussian_sjsk_;
//...
}
// The model Gaussian_sjsk_ is encoded as "Gaussian_sjsk"
std::string mixtureToString( Mixture const& type)
{
//...
if (type == Gaussian_sjsk_) return String(_T("Gaussian_sjsk"));
//...
}
// The model Gaussian_sjsk_ is encoded as "Gaussian_p_sjsk" (fixed proportions)
// or "Gaussian_pk_sjsk" (free proportions)
std::string mixtureToString(Mixture type, bool freeProp)
{
if (freeProp == false)
{
//...
if (type == Gaussian_sjsk_) return String(_T("Gaussian_p_sjsk"));
//...
}
else
{
//...
if (type == Gaussian_sjsk_) return String(_T("Gaussian_pk_sjsk"));
//...
}
#define _T(x)
Let x unmodified.
String mixtureToString(Mixture const &type)
convert a Mixture to a String.
std::basic_string< Char > String
STK fundamental type of a String.

Encoding the Parameters

The struct STK::ModelParameters encapsulates the parameters of the mixture model. It is a template struct that must be fully specialized and store the parameters of the mixture model $ \mu_{kj}, \sigma_j, \sigma_k $. This class also compute statistics about the values of the parameters along the iterations (the mean and the standard deviation).

template<>
struct ModelParameters<Clust::Gaussian_sjsk_>
{
Array1D<CPointX> mean_; // An array of size K with the row-vector of the mean
CVectorX sigmak_; // Standard deviations of the classes
CPointX sigmaj_; // Standard deviations of the variables
Array1D< Stat::Online<CPointX, Real> > stat_mean_; // An array with the statistics of the means
Stat::Online<CVectorX, Real> stat_sigmak_; // Statistics for the sd of the classes
Stat::Online<CPointX, Real> stat_sigmaj_; // Statistics for the sd of the variables
//....
}

This class structure must also implement:

  • a constructor with the number of cluster given,
  • a copy constructor,
  • an access to the mean and to the standard deviation independent of the model,
  • a method which resize the arrays (mean_ and sigmaj_, ) when the data will be set and the number of variables will be known,
  • updating/getting/releasing the statistics computed online,
  • a way to set the parameter to some values (for initialization to a specific value purpose).
ModelParameters(int nbCluster);
ModelParameters( ModelParameters const& param);
inline Real const& mean(int k, int j) const { return mean_[k][j];}
inline Real sigma(int k, int j) const { return sigmak_[k] * sigmaj_[j];}
void resize(Range const& range);
void updateStatistics();
void setStatistics();
void releaseStatistics();
template<class Array>
void setParameters( ExprBase<Array> const& params);

This structure is defined in file STK_DiagGaussianParameters.h. We don't detail the implementation which can be found in file STK_DiagGaussianParameters.cpp.

Creating the Mixture

The class DiagGaussian_sjsk is a template terminal class derived (recursively) from the class DiagGaussianBase . The template parameter is the type of the Array storing the data. This class must implement

  • a constructor with the number of cluster given,
  • a copy constructor,
  • a random initialization methods of the parameters,
  • a run method updating the values of the parameters,
  • compute the number of parameters.
DiagGaussian_sjsk( int nbCluster);
DiagGaussian_sjsk( DiagGaussian_sjsk const& model);
void randomInit( CArrayXX const* const& p_tik, CPointX const* const& p_tk);
bool run( CArrayXX const* const& p_tik, CPointX const* const& p_tk) ;
inline int computeNbFreeParameters() const
{ return (this->nbCluster()+1)*p_data()->sizeCols();}
};

The run method update the parameters $ (\mu_{k}^j), (\sigma^j), (\sigma_{k}) $ by maximizing

\[
 \sum_{i=1}^n \sum_{k=1}^K t_{ik}
 \left(
   -\log({\sigma^j\sigma_{k}})
   -\frac{1}{2}\left(\frac{x^j-\mu^j_k}{\sigma^j\sigma_{k}}\right)^2
 \right).
\]

Moreover, the traits class MixtureTraits must be instanced (it is used by base class which don't know the parameter type)

namespace hidden
{
template<class Array_>
struct MixtureTraits< DiagGaussian_sjsk<Array_> >
{
typedef Array_ Array;
typedef ModelParameters<Clust::Gaussian_sjsk_> Parameters;
};
} // namespace hidden

Concrete implementation can be found in file STK_DiagGaussian_sjsk.h