Compile C with CMake - A Machine Learning Example
CMake can generate native build files, such as Makefiles, for a variety of systems. On Windows with MinGW-w64 (MSYS2), it can be installed with:
pacman -S mingw-w64-x86_64-cmake
It can also specify the libraries and flags to use when linking object files. The C code below, for example, requires the GNU Scientific Library (GSL) to draw random numbers from a flat (uniform) distribution.
When compiled and run, it prints data for a 3-class classification problem: one line per point, holding the x and y coordinates and the class index. Each coordinate is drawn uniformly within 0.1 of one of three predefined centers.
/* filename: kmeansdata.c */
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int
main (void)
{
  int numClusts;
  int i, j;
  /* points generated per cluster; the program prints 3 * 200 = 600 lines */
  int pointsPerCluster = 200;

  // GSL random number generator
  const gsl_rng_type * T;
  gsl_rng * r;

  // Initialize the random number generator (Tausworthe)
  T = gsl_rng_taus;
  r = gsl_rng_alloc (T);
  gsl_rng_set (r, gsl_rng_default_seed);

  // number of centers
  numClusts = 3;
  double centers[numClusts][2];
  centers[0][0] = 0.3;
  centers[0][1] = 0.4;
  centers[1][0] = 0.7;
  centers[1][1] = 0.8;
  centers[2][0] = 0.5;
  centers[2][1] = 0.6;

  // On each pass, print one point per cluster as "x, y, label"
  for (j = 0; j < pointsPerCluster; j++)
    {
      for (i = 0; i < numClusts; i++)
        {
          // each coordinate is the center plus a uniform offset in [-0.1, 0.1)
          printf ("%0.6e, %0.6e, %d\n",
                  centers[i][0] + gsl_ran_flat (r, -0.1, 0.1),
                  centers[i][1] + gsl_ran_flat (r, -0.1, 0.1), i);
        }
    }

  gsl_rng_free (r);
  return 0;
}
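For readers who want data in the same format without building against GSL, the generation step can be approximated in Python. This is only a minimal sketch, not a port: it relies on the standard library's random module rather than GSL's Tausworthe generator, so the numbers will differ from the C program's output. The names CENTERS, HALF_WIDTH, generate_points and format_row are illustrative and do not appear in the original code.

```python
import random

# Centers copied from kmeansdata.c; all names in this sketch are illustrative.
CENTERS = [(0.3, 0.4), (0.7, 0.8), (0.5, 0.6)]
HALF_WIDTH = 0.1  # each coordinate gets a uniform offset in [-0.1, 0.1]


def generate_points(points_per_cluster, seed=0):
    """Return (x, y, label) tuples, cycling through the clusters like the C loops."""
    rng = random.Random(seed)
    rows = []
    for _ in range(points_per_cluster):
        for label, (cx, cy) in enumerate(CENTERS):
            x = cx + rng.uniform(-HALF_WIDTH, HALF_WIDTH)
            y = cy + rng.uniform(-HALF_WIDTH, HALF_WIDTH)
            rows.append((x, y, label))
    return rows


def format_row(x, y, label):
    """Mimic the "%0.6e, %0.6e, %d" layout of the printf call in the C program."""
    return f"{x:.6e}, {y:.6e}, {label}"


# Show only the first pass over the three clusters
for row in generate_points(200)[:3]:
    print(format_row(*row))
```

Calling generate_points(200) yields 600 rows, matching the number of lines printed by the C program.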
For this simple project, CMake needs only a single file named CMakeLists.txt, shown below.
cmake_minimum_required(VERSION 3.18)
project(kmeansdata
    VERSION 0.0.2
    LANGUAGES C
    DESCRIPTION "test kmeans data for unsupervised machine learning"
    HOMEPAGE_URL https://tessarinseve.pythonanywhere.com/nws/index.html
)
add_executable(${PROJECT_NAME})
# The source file is only needed to build this executable, hence PRIVATE
target_sources(${PROJECT_NAME}
    PRIVATE
        kmeansdata.c
)
# Link against GSL, its CBLAS implementation, and the math library
target_link_libraries(${PROJECT_NAME} PRIVATE gsl gslcblas m)
The project Makefile is then created from the command line with:
$ cmake -G "Unix Makefiles" .
Followed by:
$ make
This produces kmeansdata.exe in the current directory. To display the distribution of the points, I can reuse the script csvgui.py described here.
The figure below shows the generated points around the initial centers.
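Alongside the plot, a quick numeric check can confirm that each class is centered where expected. The sketch below assumes only the "x, y, label" line format printed by the C program; the function name summarize and the sample rows are made up for illustration and are not real program output.

```python
from collections import defaultdict


def summarize(lines):
    """Group 'x, y, label' rows by label and return {label: (count, mean_x, mean_y)}."""
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        x_text, y_text, label_text = line.split(",")
        entry = sums[int(label_text)]
        entry[0] += 1
        entry[1] += float(x_text)
        entry[2] += float(y_text)
    return {label: (n, sx / n, sy / n) for label, (n, sx, sy) in sums.items()}


# Hand-written sample rows in the program's output format (not real output)
sample = [
    "2.5e-01, 4.5e-01, 0",
    "3.5e-01, 3.5e-01, 0",
    "7.2e-01, 7.8e-01, 1",
    "5.0e-01, 6.0e-01, 2",
]
for label, (count, mean_x, mean_y) in sorted(summarize(sample).items()):
    print(f"class {label}: {count} points, mean = ({mean_x:.3f}, {mean_y:.3f})")
```

With real data, the program's output could be redirected to a file and the open file object passed to summarize. With 200 points per class, the means should land close to the three centers set in kmeansdata.c.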
Unsupervised Machine Learning with KMeans
When the labels are removed from the dataset, the data can be reclassified with an unsupervised algorithm. This example is particularly simple: the number of centroids can be easily guessed and the three blobs are well separated. KMeans should be able to perfectly recreate the initial grouping for this dataset, as demonstrated in the video below. Note that KMeans numbers its clusters arbitrarily, so the recovered labels may be a permutation of the original ones.
# filename: testkmeans.py
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt

# recover the serialized dataframe (columns: x1, x2 and the label c)
df = pd.read_pickle("last.pkl")

# remove the labels
del df["c"]

# fit KMeans on the two remaining coordinate columns
kmeans = KMeans(n_clusters=3, random_state=0).fit(df)

# recreate the labels from the cluster assignments
df["c"] = kmeans.labels_

ax = df.plot.scatter(x="x1", y="x2", c="c", colormap="viridis")
plt.show()
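To make the clustering step less of a black box, here is a minimal pure-Python sketch of Lloyd's algorithm, the basic iteration behind k-means. It is not scikit-learn's implementation, which adds a smarter initialization and other refinements. The function names lloyd, assign and relabel, the six sample points, and the hand-picked starting centroids are all assumptions made for this illustration.

```python
from collections import Counter


def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each point."""
    return [
        min(range(len(centroids)),
            key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2)
        for x, y in points
    ]


def lloyd(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until labels stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels, centroids


def relabel(found, truth):
    """Map each found cluster index to the most common true label among its members."""
    votes = {}
    for f, t in zip(found, truth):
        votes.setdefault(f, Counter())[t] += 1
    mapping = {f: counts.most_common(1)[0][0] for f, counts in votes.items()}
    return [mapping[f] for f in found]


# Two hand-written points near each center used in kmeansdata.c (illustrative data)
points = [(0.28, 0.41), (0.33, 0.37), (0.71, 0.79),
          (0.68, 0.83), (0.52, 0.58), (0.47, 0.63)]
truth = [0, 0, 1, 1, 2, 2]

# Start from one point per blob, deliberately in a different order than the labels
labels, centroids = lloyd(points, [points[4], points[0], points[2]])
print(labels)                  # cluster indices as found by the algorithm
print(relabel(labels, truth))  # same clusters, renamed to the original labels
```

The cluster indices depend on the order of the starting centroids, which is why relabel is needed before comparing the result with the original classes. With poorly chosen starting centroids, this basic iteration can also settle on a worse grouping.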