STK++ 0.9.13
|
This tutorial will show you how to construct an application (call MixBin) using the STK++ library as background. We will detail
We consider a statistical model of Bernoulli random variable of the following form
The model will be estimated using a sample
We will illustrate the usage of our toy application using the 1984 United States Congressional Voting Records Database
Schlimmer, J. C. (1987). Concept acquisition through representational adjustment. Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine, CA. *
that can be found at the UCI Machine Learning Repository.
The first lines of the file "house-votes-84.data" are given hereafter
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,? democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y ...
We will not make use of the first column. This data set is interesting in the sense that we will have to parameterize the Read/Write classes in order to adapt our reading to the coding convention used in it.
Our first task, is to communicate with the end-user. MixBin is a console-based program and the computations are driven by an option file furnished by the user like this one
Option file "house-votes-84.opt" |
---|
# The MIXBIN page: the name of a page, like the name of an option, is not case sensitive [MixBin] filename = ./data/house-votes-84.data # name of the csv file containing the data # How is coded the binary data set. Default values are 0,1,. mapping= y,n,? # Note: a list of parameter used always comma separators # dimensions of the models to estimate, the model with the best BIC criteria # will be returned. dims = 2:2 |
This file can be found in the "tutorial/MixBin/data/" directory.
In this option file, there exists only one page ([MixBin]), but there could more that one pages: for example we could have added a page for the outputs, a specific page for the parameters of the models, and so on...
The DManager project allows to extend the pre-defined class IPage in order to read options file like this one. For each page, we have to implement the IPage
interface class
MixBinPage class header |
---|
#ifndef MIXBINPAGE_H
#define MIXBINPAGE_H
#include <STKernel.h>
#include "DManager.h"
#include <map>
namespace tuto
{
{
public:
MixBinPage();
MixBinPage( MixBinPage const& page);
inline virtual ~MixBinPage() {}
inline std::map<STK::Binary, STK::String> const& mapping() const { return mapping_;}
virtual bool validate();
/*The clone method must always be reimplemented */
inline virtual IPage* clone() const { return new MixBinPage(*this); }
private:
STK::String fileName_;
std::map<STK::Binary, STK::String> mapping_;
STK::Range dims_;
};
} // namespace tuto
#endif /* MIXBINPAGE_H */
This file include all the other header files of the project DManager. This file include all the header files of the project STKernel. A IPage is an interface base class for reading and/or writing a page of option in a file. Definition STK_IPage.h:94 Index sub-vector region: Specialization when the size is unknown. Definition STK_Range.h:265 |
The constructor have been implemented in the following way
we give the name of the page in the constructor of IPage and initialize the members of the class to its default value. Observe that some options are optional. The only one necessary is the filename option. The second option, "mapping", is a Option::lstring_
, it means that when we will read this option we will get a list of String.
The final step is to overwrite the default value in the map mapping_
if the option mapping is present. This can be achieved by overwriting the validate()
virtual function. This function is called when a page is read from a file and allows to check if the values given are correct. Here is the piece of code that accomplish this task
When the pages required by the application have been defined, they can be read using the class ReadWritePages
. In the MixBin application, we use it directly in the main.
The STK++ library provide a template class TReadWriteCsv
which allow to read and write csv files of any type. In our example, we want to read the (Binary) data set "houses-votes-84.data" which is composed of binary and missing data.
This can be achieved in the following way
The map decoding have been constructed using the map encoding and is just the reversed mapping. For more details have a look directly to the mixbin.cpp file.
The first few lines of the values stored in the variable rw
are given hereafter
.,1,0,1,0,0,0,1,1,1,0,.,0,0,0,1,0 .,1,0,1,0,0,0,1,1,1,1,1,0,0,0,1,. .,.,0,0,.,0,0,1,1,1,1,0,1,0,0,1,1 .,1,0,0,1,.,0,1,1,1,1,0,1,0,1,1,0 .,0,0,0,1,0,0,1,1,1,1,0,.,0,0,0,0 ...
Finally we transfer data stored in the TReadWriteCsv, into some container adapted to the computation. We have mainly two kind of arrays that can be used: the STK::Array2D or the STK::CArray classes. As we don't need to perform algebraic operations with the data set, we chose to use the Array2D class which is more flexible for managing data. Moreover, in this case, the transfer from the TReadWriteCsv to the Array2D
is linear in the number of variable (the number of columns) and not in the number of data as if would have been with CArray
class).
The following lines of code will do the job:
Note that how we make use of the move function which allow us to transfer the data in constant time (see Using references and move).
The threshold 0.2 is the rate of missing values in each column that we can afford. If the rate of missing values is higher, the column is discarded. Set 0 if you don't want NA values in your sample. But take care that you may want to remove rows (samples) from your data set rather than columns.
The first lines of the value stored in data
are given hereafter:
1 0 1 0 0 0 1 1 1 0 . 0 0 0 1 1 0 1 0 0 0 1 1 1 1 1 0 0 0 1 . 0 0 . 0 0 1 1 1 1 0 1 0 0 1 1 0 0 1 . 0 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 0 . 0 0 0 ...
The STK++ library provide some predefined tools for handling Bernoulli random variates:
All these classes allow to compute the probability density function (pdf), the log-pdf (lpdf) and to simulate Bernoulli random variates or (uncorrelated) multi-dimensional Bernoulli random variates. The Bernoulli class implement the computation of the pdf and log-pdf in the following way:
As can be noticed, the NA (Not-Available) values are handle by the class, but the returned value Arithmetic<Real>::NA()
can be a problem. The computation of the log-pdf in the JointProbability class is implemented in the following way:
Let us see how to build a statistical model in STK++. For the MixBin application we have two ways to define the binary model. We can either implement the IUnivStatModel
interface and create a statistical model for each column of the array data
or implement the IMultiStatModel
interface class.
These interfaces are running classes, derived from the interface STK::IRunnerUnsupervised. A IRunnerUnsupervised declare two pure virtual functions run()
and run(weights)
. The call to these methods will launch the estimation process.
In the first case, the code would be the following
and in the second case the code will be