2022-08-22

Command Line Machine Learning - GNU Octave

In my previous note, I compiled a C program that generates a dataset for a simple classification problem. More complex patterns can be obtained with this MATLAB/Octave function.

It's rather straightforward to turn this function into a script that I can call from the command line. I cloned this repository and then added the code below at the top and at the bottom of the generateData.m file.

    #! /c/Users/Seve/workplace/octave-6.4.0-w64/mingw64/bin/octave-cli.exe -qf
    %  the shebang line above holds the path to octave-cli.exe
    %  see https://tessarinseve.pythonanywhere.com/nws/2022-01-10.wiki.html
    arg_list = argv ();
    % exactly nine parameters are needed; with fewer, arg_list{n} below would fail
    if numel (arg_list) != 9
        disp ("Expected 9 input parameters");
        exit (1);
    endif
    % str2double parses plain numbers only and never evaluates its input
    angleMean    = str2double (arg_list{1});
    angleStd     = str2double (arg_list{2});
    numClusts    = str2double (arg_list{3});
    xClustAvgSep = str2double (arg_list{4});
    yClustAvgSep = str2double (arg_list{5});
    lengthMean   = str2double (arg_list{6});
    lengthStd    = str2double (arg_list{7});
    lateralStd   = str2double (arg_list{8});
    totalPoints  = str2double (arg_list{9});
    
    %generateData.m
    ...
       
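    % the first two arguments are divisors of pi: "2 8" means pi/2 and pi/8
    [data, cp, idx] = generateData(pi/angleMean,pi/angleStd,numClusts,xClustAvgSep,yClustAvgSep, ...
        lengthMean,lengthStd,lateralStd,totalPoints);

    % sum(cp) is used as the total number of generated points
    numpoints = sum(cp);
    for i = 1:numpoints
        % one CSV line per point: x, y, cluster index
        printf("%f, %f, %d\n", data(i,1), data(i,2), idx(i));
    endfor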


    

The generated 2D clusters are written to standard output and piped to the csvgui.py script. Any GNU Octave warnings, which go to the standard error stream, can be suppressed by redirecting that stream to the null device (/dev/null).

   $ ./generateData.m 2 8 4 0.4 0.4 0.1 0.04 0.06 500 2>/dev/null | ./csvgui.py
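
The csvgui.py script is not reproduced in this note. As a rough sketch of the receiving end of the pipe, the lines printed by generateData.m could be parsed as below. The function name read_clusters is hypothetical, and the field names x1, x2 and c are borrowed from showcentroids.py further down; only the standard library is used here, so this is not the actual csvgui.py code.

```python
import csv
import io

def read_clusters(stream):
    """Parse lines of the form 'x, y, idx' into (x1, x2, c) tuples."""
    rows = []
    for record in csv.reader(stream, skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        x1, x2, c = record
        rows.append((float(x1), float(x2), int(c)))
    return rows

# Stand-in for the piped output; in a real pipeline pass sys.stdin instead.
sample = io.StringIO("0.123456, -0.654321, 1\n0.500000, 0.250000, 2\n")
points = read_clusters(sample)
print(len(points), "points,", len({c for _, _, c in points}), "clusters")
```

Passing sys.stdin instead of the in-memory sample would let the same function consume the output of ./generateData.m directly.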

PandasGUI makes it possible to visualise the data, which are grouped into four clusters, as shown below.

It's also easy to drop the last column, which contains the cluster index, and save the remaining two columns into a comma-separated values file:

   $ ./generateData.m 2 8 4 0.4 0.4 0.1 0.04 0.06 500 2>/dev/null | awk -F"," '{print $1", " $2}' > testmlpack.csv
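
For readers without awk, the same column filter could be sketched in Python with the standard csv module. The function name drop_index_column is hypothetical, and unlike the awk one-liner this version writes the two fields without a space after the comma.

```python
import csv
import io

def drop_index_column(src, dst):
    """Copy 'x, y, idx' rows from src to dst, keeping only x and y."""
    writer = csv.writer(dst, lineterminator="\n")
    for record in csv.reader(src, skipinitialspace=True):
        if len(record) >= 2:  # ignore blank lines
            writer.writerow(record[:2])

# In-memory stand-ins; sys.stdin and an open file would be used in a pipeline.
src = io.StringIO("0.10, 0.20, 1\n0.30, 0.40, 2\n")
dst = io.StringIO()
drop_index_column(src, dst)
print(dst.getvalue(), end="")
```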

The number of centroids and their positions, which are unknown a priori, can be obtained from the command line with the mlpack_mean_shift binary:

   $ ~/workplace/vimfastml/mlpack/mlpack_mean_shift.exe -i ./testmlpack.csv -C centroids.csv -m 5000 -v
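
To give an idea of why mean shift does not need the number of clusters in advance, here is a minimal flat-kernel sketch in plain Python. It is not mlpack's implementation: the function name, the hand-picked radius and the toy points are all assumptions made for illustration. Each point is moved repeatedly to the average of its neighbours until it stops moving, and the places where points settle are reported as centroids.

```python
import math

def mean_shift(points, radius, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift: return one centroid per discovered mode."""
    modes = []
    for px, py in points:
        cx, cy = px, py
        for _ in range(max_iter):
            # Average all points within `radius` of the current estimate.
            near = [(x, y) for x, y in points
                    if math.hypot(x - cx, y - cy) <= radius]
            if not near:
                break
            nx = sum(x for x, _ in near) / len(near)
            ny = sum(y for _, y in near) / len(near)
            shift = math.hypot(nx - cx, ny - cy)
            cx, cy = nx, ny
            if shift < tol:
                break
        # Keep the result only if no earlier mode lies within `radius` of it.
        if not any(math.hypot(cx - mx, cy - my) < radius for mx, my in modes):
            modes.append((cx, cy))
    return modes

# Two small, well-separated groups of toy points.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
centroids = mean_shift(toy, radius=0.5)
print(len(centroids), "centroids found")
```

The number of centroids falls out of the merging step rather than being supplied by the user, which is the property used in the command above.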

Finally, this script plots the centroids (large red dots) on top of each group of data points.

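    # filename: showcentroids.py
    import pandas as pd
    import matplotlib.pyplot as plt

    # last.pkl: pickled DataFrame with the columns x1, x2 and c (cluster index)
    df = pd.read_pickle("last.pkl")
    # centroids.csv: written by mlpack_mean_shift (-C option), no header row
    centroids = pd.read_csv("./centroids.csv", header=None)

    # data points coloured by cluster index
    ax = df.plot.scatter(x="x1", y="x2", c="c", colormap="viridis")
    # centroids drawn on the same axes as large red dots
    centroids.plot.scatter(ax=ax, x=0, y=1, s=75, c="r")
    plt.show()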