2022-02-10

VIM For Fast Machine Learning

A common misconception is that VIM isn't well suited to machine learning projects. In practice, a generic ML problem that consists of a collection of Python scripts and command-line instructions can be handled entirely by a standard Makefile, and VIM has several useful tools for driving that kind of workflow.

A synthetic dataset in CSV format for a multiclass classification problem can be generated with the script below.

    #!/usr/bin/env python
    import pandas as pd
    from sklearn.datasets import make_blobs

    # 300 two-dimensional points grouped around 5 centers
    X, y = make_blobs(n_samples=300, centers=5, random_state=0,
                      center_box=(-1, 1), cluster_std=0.2)

    # features matrix
    fm = pd.DataFrame(data=X)

    # labels vector
    lv = pd.DataFrame(data=y)

    # features in columns 1-2, label in column 3
    df = pd.concat([fm, lv], axis=1)

    df.to_csv("./dataset.csv", header=False, index=False)

In VIM, the script can be run directly from the buffer in which it is open (py is the Python launcher on Windows) with:

:!py %

As described in one of my previous notes, I can use the Gnuplot bundled with GNU Octave to display the distribution of the data points and their labels (the plotting script is in the Appendix).

Label 4, for instance, appears to be well separated from the other groups, so I expect an ML algorithm to classify those points correctly.
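
The visual impression can also be checked numerically. The sketch below is not part of the original workflow: it assumes dataset.csv has the x1,x2,label layout produced above, and the helper names (load_points, centroids, nearest_centroid_distance) are made up for illustration. It computes the centroid of each label and the distance to the closest centroid of any other label.

```python
#!/usr/bin/env python
import csv
import math
from collections import defaultdict


def load_points(path):
    """Read rows of x1,x2,label from a headerless CSV file."""
    points = []
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if row:  # skip blank lines
                points.append((float(row[0]), float(row[1]), int(float(row[2]))))
    return points


def centroids(points):
    """Return {label: (mean_x1, mean_x2)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for x1, x2, label in points:
        acc = sums[label]
        acc[0] += x1
        acc[1] += x2
        acc[2] += 1
    return {lab: (s[0] / s[2], s[1] / s[2]) for lab, s in sums.items()}


def nearest_centroid_distance(cents):
    """For each label, distance to the closest centroid of another label."""
    return {
        lab: min(math.dist(c, other) for o, other in cents.items() if o != lab)
        for lab, c in cents.items()
    }
```

Calling nearest_centroid_distance(centroids(load_points("dataset.csv"))) gives one distance per label. A distance that is large compared with the cluster_std of 0.2 used to generate the data suggests the label is easy to separate, although this is only a rough heuristic.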

MLPACK

MLPACK is a C++ library for supervised and unsupervised machine learning. It provides precompiled executables for a set of ML algorithms, which can be extracted from the latest MLPACK zip file (mlpack-3.4.2.zip at the time of writing) into a local directory and called directly from a Windows terminal or, more conveniently, from a Makefile.

As expected from compiled C++ binaries, MLPACK is really fast. The only known limitation is that CSV files must be smaller than half the available virtual memory. On Windows, the amount of available virtual memory can be checked with:

$ systeminfo
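
The size rule above can be turned into a quick check before launching a long run. This is a minimal sketch: the function names are hypothetical, and the available virtual memory is passed in by hand (for example, the "Virtual Memory: Available" figure reported by systeminfo, converted to bytes) rather than queried automatically.

```python
#!/usr/bin/env python
import os


def within_limit(file_bytes, available_virtual_bytes):
    """True if the file is smaller than half the available virtual memory."""
    return file_bytes < available_virtual_bytes / 2


def csv_fits(csv_path, available_virtual_bytes):
    """Apply the same rule to a file on disk."""
    return within_limit(os.path.getsize(csv_path), available_virtual_bytes)
```

For example, csv_fits("dataset.csv", 8 * 1024**3) checks the dataset against 8 GiB of available virtual memory.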

Makefile

The Makefile automates the steps required to solve an ML problem. Csvkit is used to separate the features and the labels in the initial dataset.

The dataset is then split into 80% training and 20% test subsets, a standard practice for a supervised ML classification problem. Finally, the decision-tree classifier is fitted to the training data and predicts the labels for the test data.

    # Intermediate and output files
    X=features.csv
    y=labels.csv
    Xtrain=features_training.csv
    ytrain=labels_training.csv
    Xtest=features_test.csv
    ytest=labels_test.csv
    # Directory containing the MLPACK executables
    MLPACK=.

    all: labels_pred.csv

    dataset.csv: ./create_dataset.py
        ./create_dataset.py

    # Split columns, split rows 80/20, train, predict, then score
    labels_pred.csv: dataset.csv
        csvcut -c 1,2 dataset.csv > $(X)
        csvcut -c 3 dataset.csv > $(y)
        $(MLPACK)/mlpack_preprocess_split.exe \
            --input_file $(X) \
            --input_labels_file $(y) \
            --training_file $(Xtrain) \
            --training_labels_file $(ytrain) \
            --test_file $(Xtest) \
            --test_labels_file $(ytest)
        $(MLPACK)/mlpack_decision_tree.exe \
            --training_file $(Xtrain) \
            --labels_file $(ytrain) \
            --test_file $(Xtest) \
            --test_labels_file $(ytest) \
            -p labels_pred.csv
        ./metrics.py

    .PHONY: clean all

    clean:
        rm -f *.csv
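
To make the two data-preparation steps of the recipe concrete, here is a rough Python equivalent using only the standard library. It is an illustration rather than a description of how csvcut or mlpack_preprocess_split work internally: the function names, the fixed seed and the shuffling strategy are assumptions, and the 0.2 ratio simply mirrors the 80/20 split described above.

```python
#!/usr/bin/env python
import random


def split_columns(rows, n_features=2):
    """Separate each row into features and label, like the two csvcut calls."""
    X = [row[:n_features] for row in rows]
    y = [row[n_features] for row in rows]
    return X, y


def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Shuffle row indices and hold out test_ratio of the rows for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(X) * test_ratio))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(data, ids):
        return [data[i] for i in ids]

    return (pick(X, train_idx), pick(y, train_idx),
            pick(X, test_idx), pick(y, test_idx))
```

With the 300 rows of dataset.csv this yields 240 training rows and 60 test rows, and each feature row stays paired with its label because both lists are indexed by the same shuffled indices.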

A Python script compares the predicted labels with the test labels and shows how successfully the model predicts the test data.

    #!/usr/bin/env python
    from sklearn import metrics
    import numpy as np

    ypred = np.loadtxt("./labels_pred.csv", dtype=int)
    ytest = np.loadtxt("./labels_test.csv", dtype=int)

    # classification_report expects the true labels first, then the predictions
    print(metrics.classification_report(ytest, ypred))
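
For readers who want to see what the report measures, the sketch below computes per-label precision and recall plus overall accuracy by hand. It is a simplified stand-in with a hypothetical function name, not scikit-learn's implementation, and it takes plain Python lists. It also shows why the argument order matters: precision divides by the number of predictions for a label, recall by the number of true occurrences, so swapping the two inputs swaps the two scores.

```python
#!/usr/bin/env python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return ({label: (precision, recall)}, accuracy)."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1  # correct prediction for label t

    scores = {}
    for label in sorted(set(true_count) | set(pred_count)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = (precision, recall)

    n = len(y_true)
    accuracy = sum(tp.values()) / n if n else 0.0
    return scores, accuracy
```

The lists could be obtained from the two CSV files with, for example, ytest.tolist() and ypred.tolist() on the arrays loaded above.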

With Dispatch.vim I can run the Makefile with:

:Make

and display the results in the quickfix list with:

:Copen

If you need to test several models on a single dataset, this repository provides a simple example of a project with more than one Makefile.

Appendix

Gnuplot script for a 2D scatter plot with a discrete color palette:

    set key off
    set ylabel 'x2'
    set xlabel 'x1'
    set datafile sep ","
    set palette defined (0 "red", 1 "green", 2 "blue", 3 "yellow", 4 "black")
    set palette maxcolors 5
    plot 'C:\Users\seve\workplace\vimfastml\data\dataset.csv'  u 1:2:3  w p pt 5 ps 3 palette