VIM For Fast Machine Learning
A common misconception is that VIM isn't particularly suitable for machine learning projects. In fact, a generic ML problem consisting of a collection of Python scripts and command-line instructions can be handled entirely by a standard Makefile, and VIM offers several useful tools for working on exactly this kind of project.
A synthetic dataset in CSV format for a multivariate classification problem can be generated with the script below.
#!/usr/bin/env python
import pandas as pd
from sklearn.datasets import make_blobs

# 300 points drawn from 5 Gaussian clusters
X, y = make_blobs(n_samples=300, centers=5, random_state=0,
                  center_box=(-1, 1), cluster_std=0.2)
# features matrix
fm = pd.DataFrame(data=X)
# labels vector
lv = pd.DataFrame(data=y)
# concatenate features and labels column-wise
df = pd.concat([fm, lv], axis=1)
df.to_csv("./dataset.csv", header=False, index=False)
In VIM, with the script open in the current buffer, it can be run directly with:
:!py %
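After the script has run, a quick sanity check confirms the CSV layout that the Makefile below relies on: two feature columns followed by one label column. A minimal sketch (my addition, not part of the pipeline):
import pandas as pd

# the CSV has no header: columns 0-1 are the features, column 2 is the label
df = pd.read_csv("./dataset.csv", header=None)
print(df.shape)        # expected: (300, 3)
print(df[2].unique())  # expected: the five cluster labels 0..4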
As described in one of my previous notes, I can use the Gnuplot bundled with GNU Octave to display the distribution of the data points and their labels.
Label 4, for instance, appears to be well separated from the other groups, so I expect an ML algorithm to classify those points correctly.
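For readers without Octave at hand, a rough matplotlib equivalent of that plot (a sketch assuming matplotlib is installed; the column names are my own):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./dataset.csv", header=None, names=["x1", "x2", "label"])
# color each point by its cluster label, mirroring the Gnuplot palette plot
plt.scatter(df["x1"], df["x2"], c=df["label"], cmap="tab10", s=15)
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()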
MLPACK
MLPACK is a C++ library for supervised and unsupervised machine learning that also provides precompiled executables for a set of ML algorithms. They can be extracted from the latest MLPACK zip file (mlpack-3.4.2.zip at the time of writing) into a local directory and called directly from a Windows terminal or, more conveniently, from a Makefile.
As expected from compiled C++ binaries, MLPACK is really fast. The only known limitation is that a CSV file must be smaller than half the available virtual memory. The amount of available virtual memory can be checked with:
$ systeminfo
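That manual check can be turned into a small helper. The available-memory figure below is a placeholder to be read off the systeminfo output; the script itself is my sketch, not something MLPACK provides:
import os

# value read from `systeminfo` under "Virtual Memory: Available" (placeholder)
available_virtual_mb = 8192

# MLPACK's limitation: the CSV must be smaller than half the available virtual memory
csv_mb = os.path.getsize("./dataset.csv") / 2**20
print("OK" if csv_mb < available_virtual_mb / 2 else "CSV too large for MLPACK")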
Makefile
The Makefile automates the various steps required to solve an ML problem. Csvkit separates the features and the labels of the initial dataset.
The dataset is then split into 80/20% training and test subsets, a standard practice for a supervised ML classification problem. The decision-tree classifier finally fits the training data and predicts the corresponding labels for the test data.
X=features.csv
y=labels.csv
Xtrain=features_training.csv
ytrain=labels_training.csv
Xtest=features_test.csv
ytest=labels_test.csv
MLPACK=.

all: *.csv

dataset.csv: ./create_dataset.py
	./create_dataset.py

%.csv: dataset.csv
	# separate features (columns 1-2) and labels (column 3)
	csvcut -c 1,2 dataset.csv > $(X)
	csvcut -c 3 dataset.csv > $(y)
	# 80/20 training/test split
	$(MLPACK)/mlpack_preprocess_split.exe \
		--input_file $(X) \
		--input_labels_file $(y) \
		--training_file $(Xtrain) \
		--training_labels_file $(ytrain) \
		--test_file $(Xtest) \
		--test_labels_file $(ytest)
	# fit a decision tree on the training set and predict the test labels
	$(MLPACK)/mlpack_decision_tree.exe \
		--training_file $(Xtrain) \
		--labels_file $(ytrain) \
		--test_file $(Xtest) \
		--test_labels_file $(ytest) \
		-p labels_pred.csv
	./metrics.py

.PHONY: clean all

clean:
	rm *.csv
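For reference, the two MLPACK invocations above roughly correspond to the following scikit-learn sketch (the API calls and the random_state are my choices; the 0.2 test size matches the 80/20 split mentioned earlier):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = np.loadtxt("./features.csv", delimiter=",")
y = np.loadtxt("./labels.csv", delimiter=",", dtype=int)

# 80/20 training/test split, matching the split performed in the Makefile
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# fit a decision tree and write the predicted test labels to disk
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
np.savetxt("./labels_pred.csv", clf.predict(X_test), fmt="%d")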
A Python script compares the predicted labels with the test labels and shows how well the model predicts the test data.
#!/usr/bin/env python
from sklearn import metrics
import numpy as np

# labels predicted by mlpack and ground-truth labels from the split
ypred = np.loadtxt("./labels_pred.csv", dtype=int)
ytest = np.loadtxt("./labels_test.csv", dtype=int)
# classification_report expects (y_true, y_pred)
print(metrics.classification_report(ytest, ypred))
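To check the earlier expectation that the well-separated label 4 is classified correctly, a confusion matrix can be appended to the same script (my addition, still using sklearn.metrics):
# rows are true labels, columns are predicted labels; a well-separated
# cluster such as label 4 should contribute to the diagonal only
print(metrics.confusion_matrix(ytest, ypred))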
With Dispatch.vim I can run the Makefile with:
:Make
and display the results in the quickfix list with:
:Copen
If you need to test several models on a single dataset, this repository provides a simple example of a project with more than one Makefile.
Appendix
- 2D scatter plot with a discrete color palette.
set key off
set ylabel 'x2'
set xlabel 'x1'
set datafile sep ","
set palette defined (0 "red", 1 "green", 2 "blue", 3 "yellow", 4 "black")
set palette maxcolors 5
plot 'C:\Users\seve\workplace\vimfastml\data\dataset.csv' u 1:2:3 w p pt 5 ps 3 palette