2022-02-10

VIM For Fast Machine Learning

A common misconception is that VIM isn't well suited to machine learning projects. In fact, a generic ML problem that consists of a collection of Python scripts and command-line instructions can be handled entirely by a standard Makefile, and VIM has several useful tools for working with this kind of project.

A synthetic dataset in CSV format for a multivariate classification problem (two features, five classes) can be generated with the script below.

    #!/usr/bin/env python
    import pandas as pd
    from sklearn.datasets import make_blobs

    # 300 points in 5 clusters; make_blobs produces two features by default
    X, y = make_blobs(n_samples=300, centers=5, random_state=0,
                      center_box=(-1, 1), cluster_std=0.2)

    # features matrix
    fm = pd.DataFrame(data=X)

    # labels vector
    lv = pd.DataFrame(data=y)

    # features end up in columns 1-2, the label in column 3
    df = pd.concat([fm, lv], axis=1)

    df.to_csv("./dataset.csv", header=False, index=False)
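
Before moving on, it can be worth checking that the CSV looks as expected. The following is only a minimal sketch that uses the standard library; the function name count_labels is my own, and it assumes the label sits in the last column, as in the script above.

```python
import csv
from collections import Counter


def count_labels(path, label_column=-1):
    """Return a Counter mapping each label to its number of rows."""
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:  # skip blank lines
                # accept labels written as "3" or "3.0"
                counts[int(float(row[label_column]))] += 1
    return counts
```

Calling count_labels("./dataset.csv") on the file generated above should report 60 rows for each of the five labels, because make_blobs spreads n_samples evenly across the centers.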
    

In VIM, the script can be run while it is open in the current buffer with:

   :!py %

As described in one of my previous notes, I can use the Gnuplot bundled with GNU Octave to display the distribution of the data points and their labels (the plotting commands are in the Appendix).

Label 4, for instance, appears to be well separated from the other groups, so I expect an ML algorithm to classify those points correctly.

MLPACK

MLPACK is a C++ library for supervised and unsupervised machine learning; it provides precompiled executables for a set of ML algorithms. They can be extracted from the latest MLPACK zip file (mlpack-3.4.2.zip at the time of writing) into a local directory and called directly from a Windows terminal or, more conveniently, from a Makefile.

As expected from compiled C++ binaries, MLPACK is really fast. The only known limitation is that CSV files must be smaller than half the available virtual memory. On Windows, the amount of available virtual memory can be checked with:

   $ systeminfo
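
The comparison itself is easy to script. The helper below is a rough sketch, not part of MLPACK: the function name is hypothetical, and it assumes the available virtual memory is copied by hand from the systeminfo output as a number of megabytes.

```python
import os


def fits_mlpack_limit(csv_path, available_virtual_mb):
    """Return True if the CSV is smaller than half the available virtual memory.

    available_virtual_mb is an assumption: the figure read manually from
    systeminfo, expressed in megabytes.
    """
    limit_bytes = available_virtual_mb * 1024 * 1024 / 2
    return os.path.getsize(csv_path) < limit_bytes
```

For example, fits_mlpack_limit("dataset.csv", 8000) tells whether dataset.csv stays below the 4000 MB limit implied by 8000 MB of available virtual memory.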

Makefile

The Makefile automates the steps required to solve an ML problem. Csvkit separates the features and the labels of the initial dataset.

The dataset is then split into training and test subsets (80% and 20%), a standard practice for a supervised ML classification problem. Finally, the decision-tree classifier is fitted to the training data and predicts the labels for the test data.

    X=features.csv
    y=labels.csv
    Xtraing=features_training.csv
    ytraing=labels_training.csv
    Xtest=features_test.csv
    ytest=labels_test.csv
    MLPACK=.

    # Note: in the real Makefile every recipe line must start with a tab.
    all: labels_pred.csv

    dataset.csv: ./create_dataset.py
        ./create_dataset.py

    # One explicit target instead of a %.csv pattern rule, so the pipeline
    # runs once whenever dataset.csv changes.
    labels_pred.csv: dataset.csv
        csvcut -c 1,2 dataset.csv > $(X)
        csvcut -c 3 dataset.csv > $(y)
        $(MLPACK)/mlpack_preprocess_split.exe \
            --input_file $(X) \
            --input_labels_file $(y) \
            --training_file $(Xtraing) \
            --training_labels_file $(ytraing) \
            --test_file $(Xtest) \
            --test_labels_file $(ytest)
        $(MLPACK)/mlpack_decision_tree.exe \
            --training_file $(Xtraing) \
            --labels_file $(ytraing) \
            --test_file $(Xtest) \
            --test_labels_file $(ytest) \
            -p labels_pred.csv
        ./metrics.py


    .PHONY: clean all

    clean:
        rm *.csv
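
To make the first two steps of the recipe concrete, here is a rough pure-Python equivalent of the column cut and of the 80/20 split. It is only an illustration under my own assumptions: the function names are hypothetical, and neither csvkit nor MLPACK is implemented this way (MLPACK's shuffling in particular will pick different rows).

```python
import random


def cut_columns(rows, columns):
    """Keep only the given 1-based columns of each row, like `csvcut -c`."""
    return [[row[c - 1] for c in columns] for row in rows]


def split_rows(features, labels, test_ratio=0.2, seed=0):
    """Shuffle features and labels together, then hold out test_ratio of them."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # The same index lists are used for both, so rows and labels stay aligned.
    return ([features[i] for i in train_idx], [labels[i] for i in train_idx],
            [features[i] for i in test_idx], [labels[i] for i in test_idx])
```

With the 300 rows of dataset.csv, cut_columns(rows, [1, 2]) and cut_columns(rows, [3]) give the features and the labels, and split_rows then returns 240 training rows and 60 test rows.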
    

A Python script compares the predicted labels with the test labels and reports how well the model predicts the test data.

    #!/usr/bin/env python
    from sklearn import metrics
    import numpy as np
    ypred = np.loadtxt("./labels_pred.csv", dtype=int)
    ytest = np.loadtxt("./labels_test.csv", dtype=int)
    # classification_report expects the true labels first, then the predictions
    print(metrics.classification_report(ytest, ypred))
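
The report lists precision and recall for each label. scikit-learn's classification_report takes the true labels first and the predictions second, and swapping them swaps the two columns. The sketch below recomputes both numbers by hand to show why; it is a simplified illustration with a function name of my own, not scikit-learn's implementation.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label sequences."""
    actual = Counter(y_true)      # how often each label really occurs
    predicted = Counter(y_pred)   # how often each label was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores
```

For y_true = [0, 0, 1, 1, 1] and y_pred = [0, 1, 1, 1, 1], per_class_scores(y_true, y_pred) gives (1.0, 0.5) for label 0 and (0.75, 1.0) for label 1; calling it with the arguments reversed gives (0.5, 1.0) and (1.0, 0.75).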
    

With Dispatch.vim I can run the Makefile with:

   :Make

and display the results in a QuickFix list with:

   :Copen

In case you need to test several models for a single dataset, this repository provides a simple example of a project with more than one Makefile.

Appendix

  • 2D scatter plot with discrete color palette.
    set key off
    set ylabel 'x2'
    set xlabel 'x1'
    set datafile sep ","
    set palette defined (0 "red", 1 "green", 2 "blue", 3 "yellow", 4 "black")
    set palette maxcolors 5
    plot 'C:\Users\seve\workplace\vimfastml\data\dataset.csv'  u 1:2:3  w p pt 5 ps 3 palette