2022-03-23

Machine Learning without Pandas

Pandas is an amazing library, but it is not as irreplaceable as it seems. Thanks to Python's vast ecosystem there are several alternatives; here is one of my favorites.

As an example, I can reuse the same synthetic dataset described here.

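SQLite Database

I can easily create a SQLite database in the working directory with csvkit. The -H flag tells csvsql that the file has no header row, so the columns are named a, b, c and so on: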

csvsql -H --db sqlite:///mlpack.db --tables dataset --insert --overwrite data/dataset.csv
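If csvkit is not available, roughly the same load can be sketched with Python's standard library. This is only an approximation of what csvsql does, not its implementation: the helper name load_headerless_csv is hypothetical, every column is assumed to be numeric (csvsql infers types instead), and the sample rows are made up to stand in for data/dataset.csv.

```python
import csv
import io
import sqlite3
from string import ascii_lowercase


def load_headerless_csv(rows, conn, table="dataset"):
    """Insert header-less CSV rows into `table`, naming columns a, b, c, ..."""
    rows = [r for r in rows if r]  # skip blank lines
    if not rows:
        raise ValueError("no data rows")
    columns = list(ascii_lowercase[:len(rows[0])])  # up to 26 columns
    col_defs = ", ".join(f"{name} REAL" for name in columns)  # assumption: all numeric
    conn.execute(f"DROP TABLE IF EXISTS {table}")  # similar in spirit to --overwrite
    conn.execute(f"CREATE TABLE {table} ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        ([float(v) for v in row] for row in rows),
    )
    conn.commit()
    return columns


# Made-up sample rows standing in for data/dataset.csv
sample = io.StringIO("0.1,1.2,4\n0.3,1.0,4\n2.5,3.1,1\n")
conn = sqlite3.connect(":memory:")  # pass "mlpack.db" for a file on disk
columns = load_headerless_csv(csv.reader(sample), conn)
count = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
```

To load a real file, open it with open(path, newline="") and pass csv.reader(f) in place of the in-memory sample.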

PyDBLite

PyDBLite provides a Pythonic interface to SQLite. After installing the module with:

pip install pydblite

I can:

1. Inspect the dataset with dataset.field_info:

    from pydblite.sqlite import Database, Table

    mldb = Database("../../Data/mlpack.db")  # open the SQLite file
    dataset = Table("dataset", mldb)         # bind to the existing table
    dataset.field_info

2. Extract features for a particular group with a simple list comprehension, e.g. for the group with the label 4:

    x1 = [r['a'] for r in dataset(c=4.0)]
    x2 = [r['b'] for r in dataset(c=4.0)]

x1 and x2 are plain lists, a built-in data type, so they can be used directly to plot the data.

3. Explore my data by displaying feature value distributions, scatterplots and correlation coefficients.

See Uni/Bivariate analysis
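Because the features are plain lists, the numeric side of that exploration needs nothing beyond the standard library. The following is a minimal sketch, not the analysis from the linked post: the pearson helper is my own illustration, and the x1 and x2 values are made up to stand in for the lists extracted above.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant
    return cov / sqrt(var_x * var_y)


# Made-up values standing in for the x1 and x2 lists
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8]

# Mean and sample standard deviation for each feature
summary = {name: (mean(v), stdev(v)) for name, v in (("x1", x1), ("x2", x2))}
r = pearson(x1, x2)
```

On Python 3.10 or later, statistics.correlation(x1, x2) computes the same coefficient without the helper.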