2022-08-18

Compile C with CMake - A Machine Learning Example

CMake can generate native Makefiles for a variety of systems. On Windows with MSYS2/MinGW-w64, it can be installed with:

pacman -S mingw-w64-x86_64-cmake

It also supports specifying the libraries and flags to use when linking object files. The C code below, for example, requires the GNU Scientific Library (GSL) to generate random numbers from a flat (uniform) distribution.

When compiled and run, it outputs data for a 3-class classification problem, where points are uniformly distributed around three predefined centers.

    /* filename: kmeansdata.c */
    #include <stdio.h>
    #include <math.h>
    #include <gsl/gsl_rng.h>
    #include <gsl/gsl_randist.h>
    int
    main (void)
    {   
        int numClusts;
        int i,j;
        int totalPoints = 200;
        
        // GSL random numbers generator
        const gsl_rng_type * T;
        gsl_rng * r;

        // Initialize the random number generator (Tausworthe)
        T = gsl_rng_taus;
        r = gsl_rng_alloc (T);
        gsl_rng_set (r, gsl_rng_default_seed);
        
        //number of centers
        numClusts =3;
        double centers[numClusts][2];
        centers[0][0]=0.3 ;
        centers[0][1]=0.4 ;
        centers[1][0]=0.7 ;
        centers[1][1]=0.8 ;
        centers[2][0]=0.5 ;
        centers[2][1]=0.6 ;
        // Emit totalPoints rows per center, formatted as: x1, x2, class label
        for (j = 0; j < totalPoints; j++) {
            for (i = 0; i < numClusts; i++) {
                printf("%0.6e, %0.6e, %d\n",
                       centers[i][0] + gsl_ran_flat(r, -0.1, 0.1),
                       centers[i][1] + gsl_ran_flat(r, -0.1, 0.1), i);
            }
        }
        gsl_rng_free (r);
        return 0;
    }
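
As a quick cross-check of what the C program emits, the same sampling scheme can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `generate_points` and `format_row` are hypothetical, and Python's `random` module uses a different generator than GSL's Tausworthe, so the individual numbers will not match the C output. Only the structure (one point per center on each pass, each coordinate within 0.1 of its center) is the same.

```python
import random

# Centers copied from kmeansdata.c; HALF_WIDTH matches gsl_ran_flat(r, -0.1, 0.1)
CENTERS = [(0.3, 0.4), (0.7, 0.8), (0.5, 0.6)]
HALF_WIDTH = 0.1


def generate_points(points_per_center=200, seed=0):
    """Return (x1, x2, label) rows, one point per center on each pass."""
    rng = random.Random(seed)  # not the GSL generator: values differ from the C program
    rows = []
    for _ in range(points_per_center):
        for label, (cx, cy) in enumerate(CENTERS):
            x1 = cx + rng.uniform(-HALF_WIDTH, HALF_WIDTH)
            x2 = cy + rng.uniform(-HALF_WIDTH, HALF_WIDTH)
            rows.append((x1, x2, label))
    return rows


def format_row(x1, x2, label):
    """Format one row like the printf call in the C program."""
    return f"{x1:.6e}, {x2:.6e}, {label}"
```

With the default arguments this yields 3 x 200 = 600 rows, which is also the number of lines the C loop prints.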

For this simple project, CMake requires only a single file named CMakeLists.txt, shown below.

    cmake_minimum_required(VERSION 3.18)
    project(kmeansdata
        VERSION 0.0.2
        LANGUAGES C
        DESCRIPTION "test kmeans data for unsupervised machine learning"
        HOMEPAGE_URL https://tessarinseve.pythonanywhere.com/nws/index.html
        )

    add_executable(${PROJECT_NAME})
    target_sources(${PROJECT_NAME}
        PRIVATE
        kmeansdata.c
        )

    target_link_libraries(${PROJECT_NAME} PRIVATE gsl gslcblas m)

The project Makefile is then created from the command line with:

$ cmake -G "Unix Makefiles" .

Followed by:

$ make

This produces kmeansdata.exe in the current directory. To display the distribution of the points, I can reuse the script csvgui.py described here.

The figure below shows the generated points around the initial centers.
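
The program writes its rows to standard output, so they have to be captured and parsed before plotting. The sketch below is not the csvgui.py script (which is not shown here). It is a minimal standard-library example, with a hypothetical function name `parse_points`, that assumes the output text has the `x1, x2, label` layout produced by the printf call above.

```python
import csv
import io


def parse_points(text):
    """Parse 'x1, x2, label' rows into a list of (float, float, int) tuples."""
    # skipinitialspace drops the blank that follows each comma in the C output
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    points = []
    for row in reader:
        if not row:  # ignore empty lines
            continue
        x1, x2, label = row
        points.append((float(x1), float(x2), int(label)))
    return points


sample = "3.000000e-01, 4.000000e-01, 0\n7.500000e-01, 8.100000e-01, 1\n"
points = parse_points(sample)
```

The resulting tuples could then be loaded into a pandas DataFrame with the columns `x1`, `x2` and `c`, which are the column names the KMeans script in the next section expects.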

Unsupervised Machine Learning with KMeans

When the labels are removed from the dataset, it's possible to reclassify the data with an unsupervised algorithm. This example is particularly simple: the number of centroids can be easily guessed and the three blobs barely overlap. KMeans should be able to recreate the initial classification for this dataset, up to a renumbering of the clusters (the indices KMeans assigns are arbitrary), as demonstrated in the video below.

    # filename: testkmeans.py
    from sklearn.cluster import KMeans
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # recover serialized dataframe
    df = pd.read_pickle("last.pkl")

    # remove labels
    del df["c"]

    kmeans = KMeans(n_clusters=3, random_state=0).fit(df)
    
    # recreate labels
    df["c"]=kmeans.labels_
    ax=df.plot.scatter(x="x1",y="x2",c="c", colormap='viridis')
    plt.show()
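
Because the cluster indices returned by KMeans are arbitrary, comparing them directly with the original labels can report mismatches even when the grouping is identical. The following is a minimal sketch of one way to check the agreement: it tries every relabeling of the predicted clusters and keeps the best match. The function name `best_label_agreement` is hypothetical, and the brute-force search is only sensible for a handful of clusters such as the three used here.

```python
from itertools import permutations


def best_label_agreement(true_labels, predicted_labels):
    """Return the highest fraction of matching labels over all relabelings."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must not be empty")
    ids = sorted(set(true_labels) | set(predicted_labels))
    best = 0.0
    # Try every one-to-one mapping of predicted ids onto the label ids
    for candidate in permutations(ids):
        mapping = dict(zip(ids, candidate))
        matches = sum(
            1 for t, p in zip(true_labels, predicted_labels) if mapping[p] == t
        )
        best = max(best, matches / len(true_labels))
    return best


# Same grouping, different numbering: counts as full agreement
score = best_label_agreement([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
```

To use it with the script above, the original column would have to be saved before it is deleted (for example `original = df["c"].tolist()`), and then compared with `kmeans.labels_.tolist()`. For larger numbers of clusters, a permutation-invariant score such as scikit-learn's `adjusted_rand_score` is a more practical choice.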