STK++ 0.9.13
A STK++ application example: the MixBin program

Introduction

This tutorial will show you how to construct an application (call MixBin) using the STK++ library as background. We will detail

  • how to read an option file,
  • how to read a data file and handle the missing values,
  • how to use an existing implemented law in STK++,
  • how to implement STK++ interfaces in order to estimate a statistical model.

A (toy) statistical model

We consider a statistical model of Bernoulli random variable of the following form

\[
      f(\mathbf{x}|\theta) =
       \prod_{j=1}^p p_{j}^{x^j} (1-p_{j})^{1-x^j},
       \quad x^j\in\{0,1\}.
  \]

The model will be estimated using a sample $ (\mathbf{x}_i)_{i=1}^n $ with the possibility that some values are missing at random.

We will illustrate the usage of our toy application using the 1984 United States Congressional Voting Records Database

Schlimmer, J. C. (1987). Concept acquisition through representational adjustment.
Doctoral dissertation, Department of Information and Computer Science,
University of California, Irvine, CA.
* 

that can be found at the UCI Machine Learning Repository.

The first lines of the file "house-votes-84.data" are given hereafter

  republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
  republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
  democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
  democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
  democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
  ...

We will not make use of the first column. This data set is interesting in the sense that we will have to parameterize the Read/Write classes in order to adapt our reading to the coding convention used in it.

Reading an option file

Our first task, is to communicate with the end-user. MixBin is a console-based program and the computations are driven by an option file furnished by the user like this one

Option file "house-votes-84.opt"
# The MIXBIN page: the name of a page, like the name of an option, is not case sensitive
[MixBin]
filename = ./data/house-votes-84.data # name of the csv file containing the data

# How is coded the binary data set. Default values are 0,1,.
mapping= y,n,?         # Note: a list of parameter used always comma separators

# dimensions of the models to estimate, the model with the best BIC criteria
# will be returned.
dims = 2:2

This file can be found in the "tutorial/MixBin/data/" directory.

In this option file, there exists only one page ([MixBin]), but there could more that one pages: for example we could have added a page for the outputs, a specific page for the parameters of the models, and so on...

See also
IPage,
ReadWritePages.

How to implement a new Page

The DManager project allows to extend the pre-defined class IPage in order to read options file like this one. For each page, we have to implement the IPage interface class

MixBinPage class header
#ifndef MIXBINPAGE_H
#define MIXBINPAGE_H
#include <STKernel.h>
#include "DManager.h"
#include <map>
namespace tuto
{
class MixBinPage: public STK::IPage
{
public:
MixBinPage();
MixBinPage( MixBinPage const& page);
inline virtual ~MixBinPage() {}
inline STK::String const& fileName() const { return fileName_;}
inline std::map<STK::Binary, STK::String> const& mapping() const { return mapping_;}
inline STK::Range const& dims() const { return dims_;}
virtual bool validate();
/*The clone method must always be reimplemented */
inline virtual IPage* clone() const { return new MixBinPage(*this); }
private:
STK::String fileName_;
std::map<STK::Binary, STK::String> mapping_;
STK::Range dims_;
};
} // namespace tuto
#endif /* MIXBINPAGE_H */
This file include all the other header files of the project DManager.
This file include all the header files of the project STKernel.
A IPage is an interface base class for reading and/or writing a page of option in a file.
Definition STK_IPage.h:94
Index sub-vector region: Specialization when the size is unknown.
Definition STK_Range.h:265
std::basic_string< Char > String
STK fundamental type of a String.

The constructor have been implemented in the following way

MixBinPage::MixBinPage(): IPage(_T("MixBin"), 1, false)
, fileName_()
, mapping_()
, dims_(2,1) // dims_=2:2
{
mapping_.insert(std::pair<Binary, String>(zero_, String(_T("0"))));
mapping_.insert(std::pair<Binary, String>(one_, String(_T("2"))));
mapping_.insert(std::pair<Binary, String>(one_, String(_T("2"))));
options_.reserve(3);
options_.push_back(Option(_T("filename"), Option::string_, false));
options_.push_back(Option(_T("mapping"), Option::lstring_, true));
options_.push_back(Option(_T("dims"), Option::range_, true));
options_.back().setValue(typeToString(dims_));
}
#define _T(x)
Let x unmodified.

we give the name of the page in the constructor of IPage and initialize the members of the class to its default value. Observe that some options are optional. The only one necessary is the filename option. The second option, "mapping", is a Option::lstring_, it means that when we will read this option we will get a list of String.

The final step is to overwrite the default value in the map mapping_ if the option mapping is present. This can be achieved by overwriting the validate() virtual function. This function is called when a page is read from a file and allows to check if the values given are correct. Here is the piece of code that accomplish this task

bool MixBinPage::validate()
{
std::list<String> const& coding = options_[1].get(std::list<String>());
// ... get the other fields, more tests, here
typename std::list<String>::const_iterator it = coding.begin();
if (coding.size()>0) mapping_[zero_] = *it; ++it;
if (coding.size()>1) mapping_[one_] = *it; ++it;
if (coding.size()>2) mapping_[binaryNA_] = *it;
}

Reading option pages

When the pages required by the application have been defined, they can be read using the class ReadWritePages. In the MixBin application, we use it directly in the main.

ReadWritePages pages;
pages.addPage(MixBinPage());
if (!pages.read(option_file)) { ... } // manage any error
pages.write(stk_cout); // write the options in the console
MixBinPage const* p_page = static_cast<MixBinPage const*>(pages[0]); // get the first page, MixBinPage
std::string fileName = p_page->fileName();
std::map<STK::Binary, STK::String> encoding = p_page->mapping();
Range dims = p_page->dims();
#define stk_cout
Standard stk output stream.

Reading a data set (csv files)

The STK++ library provide a template class TReadWriteCsv which allow to read and write csv files of any type. In our example, we want to read the (Binary) data set "houses-votes-84.data" which is composed of binary and missing data.

This can be achieved in the following way

TReadWriteCsv<Binary> rw(fileName);
rw.setWithNames(false);
rw.setInMapping(decoding);
rw.setWithMapping(true);
if (!rw.read()) { ...} // manage error

The map decoding have been constructed using the map encoding and is just the reversed mapping. For more details have a look directly to the mixbin.cpp file.

The first few lines of the values stored in the variable rw are given hereafter

 .,1,0,1,0,0,0,1,1,1,0,.,0,0,0,1,0
 .,1,0,1,0,0,0,1,1,1,1,1,0,0,0,1,.
 .,.,0,0,.,0,0,1,1,1,1,0,1,0,0,1,1
 .,1,0,0,1,.,0,1,1,1,1,0,1,0,1,1,0
 .,0,0,0,1,0,0,1,1,1,1,0,.,0,0,0,0
 ...
Note
In case the data file handle different kind of data, the class TReadWriteCsv can be used using String as template parameter and it can make use of the ImportFromCsv class.
See also
ImportFromCsv, ExportToCsv.

Finally we transfer data stored in the TReadWriteCsv, into some container adapted to the computation. We have mainly two kind of arrays that can be used: the STK::Array2D or the STK::CArray classes. As we don't need to perform algebraic operations with the data set, we chose to use the Array2D class which is more flexible for managing data. Moreover, in this case, the transfer from the TReadWriteCsv to the Array2D is linear in the number of variable (the number of columns) and not in the number of data as if would have been with CArray class).

The following lines of code will do the job:

CsvToArray<Array2D<Binary> > trans(rw, 0.2);
if (!trans.run()) {... } // manage any error during the transfer
Array2D<Binary> data;
data.move(*trans.p_data());

Note that how we make use of the move function which allow us to transfer the data in constant time (see Using references and move).

The threshold 0.2 is the rate of missing values in each column that we can afford. If the rate of missing values is higher, the column is discarded. Set 0 if you don't want NA values in your sample. But take care that you may want to remove rows (samples) from your data set rather than columns.

The first lines of the value stored in data are given hereafter:

 1 0 1 0 0 0 1 1 1 0 . 0 0 0 1
 1 0 1 0 0 0 1 1 1 1 1 0 0 0 1
 . 0 0 . 0 0 1 1 1 1 0 1 0 0 1
 1 0 0 1 . 0 1 1 1 1 0 1 0 1 1
 0 0 0 1 0 0 1 1 1 1 0 . 0 0 0
 ...

Define a Probability law in STK++

The STK++ library provide some predefined tools for handling Bernoulli random variates:

See also
Law::Bernoulli, MultiLaw::JointBernoulli, Law::IUnivLaw, MultiLaw::JointProbability.
Note
A joint Bernoulli law is the joint law of independent random Bernoulli variates not necessarily identically distributed.

All these classes allow to compute the probability density function (pdf), the log-pdf (lpdf) and to simulate Bernoulli random variates or (uncorrelated) multi-dimensional Bernoulli random variates. The Bernoulli class implement the computation of the pdf and log-pdf in the following way:

Real Bernoulli::pdf(Binary const& x) const
{
switch (x)
{
case zero_: return 1.-prob_;
case one_: return prob_;
default: break;
}
return Arithmetic<Real>::NA();
}
Real Bernoulli::lpdf(Binary const& x) const
{
switch (x)
{
case zero_: return (prob_ == 1) ? -Arithmetic<Real>::infinity() : std::log(1.-prob_);
case one_: return (prob_ == 0) ? -Arithmetic<Real>::infinity() : std::log(prob_);
default: break;
}
return Arithmetic<Real>::NA();
}

As can be noticed, the NA (Not-Available) values are handle by the class, but the returned value Arithmetic<Real>::NA() can be a problem. The computation of the log-pdf in the JointProbability class is implemented in the following way:

virtual Real lpdf( RowVector const& x) const
{
if (x.range() != jointLaw_.range())
{STKRUNTIME_ERROR_NO_ARG(JointProbability::lpdf(x),dimensions mismatch);}
Real sum = 0.;
for (int j = x.begin(); j < x.end(); ++j)
{ sum+= Arithmetic<Type>::isNA(x[j]) ? 0. : jointLaw_[j].lpdf(x[j]);}
return sum;
}
#define STKRUNTIME_ERROR_NO_ARG(Where, Error)
Definition STK_Macros.h:138

Define a statistical model in STK++

Let us see how to build a statistical model in STK++. For the MixBin application we have two ways to define the binary model. We can either implement the IUnivStatModel interface and create a statistical model for each column of the array data or implement the IMultiStatModel interface class.

These interfaces are running classes, derived from the interface STK::IRunnerUnsupervised. A IRunnerUnsupervised declare two pure virtual functions run() and run(weights). The call to these methods will launch the estimation process.

In the first case, the code would be the following

BernoulliModel< Array2DVector<Binary> > univariateModel;
Array2DVector<Binary> var;
for (int j= data.beginCols(); j <data.endCols(); ++j)
{
var.move(data.col(j));
univariateModel.setData(var);
univariateModel.run();
stk_cout << _T("j= ") << j << _T(". ")
<< _T("Model.law().prob() = ") << univariateModel.law().prob() << _T("\n";);
}

and in the second case the code will be

DataBridge<Array2D<Binary> > bridge("Binary Data Id", data);
ModelBernoulli_pj<Array2D<Binary> > jb(bridge);
jb.run();
for (int j= data.beginCols(); j < data.endCols(); ++j)
{
stk_cout << _T("j= ") << j << _T(". ")
<< _T("jb.law().prob(") << j << _T(") = ")
<< jb.param().prob()[j] << _T("\n";);
}