2022-07-11

Data Tricks

Let's suppose to start with a csv data file (testdata.csv) structured in three columns as shown below, where each row represents an item that belongs to a particular category, and has a specific value.

    item,cat,value
    7656866085,D,46998772
    6516448034,A,58853106
    1588654381,A,538375067
    5222817048,C,83453539
    475012984,C,39846344
    1532273337,H,39754601
    ...
    

GNU datamash latest version (1.7) can be compiled with Mingw 64 bit toolchain on Windows, and it can be used to check if all rows have the same number of fields:

   $ datamash.exe -t, check <testdata.csv 
   454 lines, 3 fields

The sum of all the values in the last column for each category can be obtained with:

   $ cat testdata.csv | datamash.exe -t, -H  -s  groupby 2 sum 3 >pareto.csv

and saved to a new csv file.

    GroupBy(cat),sum(value)
    A,5502572868
    B,408375237
    C,3887575164
    D,1455346293
    F,836073125
    H,560329867
    M,110014
    R,17583634
    S,14034088
    T,23165466
    W,13736123
    

This file can then be imported inside the Ipython console directly from the command line with csvpy as shown in the video below:

To create a Pareto chart from the previous data, it's necessary to extend the dataframe with a column that contains the categorical cumulative percentage (right vertical axis).

    ...
    In [8]: df=pd.DataFrame(data[1:],columns=data[0])                                                                     
    In [9]: df = df.apply(pd.to_numeric, errors='ignore')                                                                 
    In [10]: df=df.sort_values(by='sum(value)',ascending=False)
    In [11]: df["per_c"]=100*df['sum(value)'].cumsum()/df['sum(value)'].sum()
    ...
    

The below Pareto chart is a bar and line graph that displays the relative importance attributed to each category. Bars are arranged in decreasing magnitude from left to right along the x-axis, while the solid red line shows the cumulative percentage. Applied to the Business Model Canvas, a Pareto diagram shows that the majority of costs, for instance, are due to a small subset of resources and activities. This may not be trivial to identify by just looking at the data.

Pandas GUI

From the Ipython console, it's also possible to export all the dataframes to the Pandas-GUI with this simple set of instructions.