Binning demonstration on locally generated fake data

In this example, we generate a table with random data simulating a single event dataset. We showcase the binning method, first on a simple single table using the bin_partition method and then in the distributed method bin_dataframe, using daks dataframes. The first method is never really called directly, as it is simply the function called by the bin_dataframe on each partition of the dask dataframe.

[1]:
import sys

import dask
import numpy as np
import pandas as pd
import dask.dataframe

import matplotlib.pyplot as plt

sys.path.append("../")
from sed.binning import bin_partition, bin_dataframe

Generate Fake Data

[2]:
n_pts = 100000
cols = ["posx", "posy", "energy"]
df = pd.DataFrame(np.random.randn(n_pts, len(cols)), columns=cols)
df
[2]:
posx posy energy
0 -1.597617 -0.543767 0.729540
1 -1.765663 -1.610388 -1.131555
2 2.330603 2.251475 -1.256111
3 -0.729069 -0.330417 1.571411
4 1.003576 1.423138 0.859336
... ... ... ...
99995 -0.063640 0.004117 -0.801435
99996 -1.198671 -0.618133 1.084128
99997 -0.085724 2.608801 0.226714
99998 0.237666 1.075434 -0.435475
99999 -0.633201 0.019438 -0.329833

100000 rows × 3 columns

Define the binning range

[3]:
binAxes = ["posx", "posy", "energy"]
nBins = [120, 120, 120]
binRanges = [(-2, 2), (-2, 2), (-2, 2)]
coords = {ax: np.linspace(r[0], r[1], n) for ax, r, n in zip(binAxes, binRanges, nBins)}

Compute the binning along the pandas dataframe

[4]:
%%time
res = bin_partition(
    part=df,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 1.24 s, sys: 39.5 ms, total: 1.28 s
Wall time: 1.43 s
[5]:
fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)
for i in range(3):
    axs[i].imshow(res.sum(i))
../_images/tutorial_1_binning_fake_data_8_0.png

Transform to dask dataframe

[6]:
ddf = dask.dataframe.from_pandas(df, npartitions=50)
ddf
[6]:
Dask DataFrame Structure:
posx posy energy
npartitions=50
0 float64 float64 float64
2000 ... ... ...
... ... ... ...
98000 ... ... ...
99999 ... ... ...
Dask Name: from_pandas, 1 graph layer

Compute distributed binning on the partitioned dask dataframe

In this example, the small dataset does not give significant improvement over the pandas implementation, at least using this number of partitions. A single partition would be faster (you can try…) but we use multiple for demonstration purposes.

[7]:
%%time
res = bin_dataframe(
    df=ddf,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 401 ms, sys: 521 ms, total: 922 ms
Wall time: 509 ms
[8]:
fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)
for dim, ax in zip(binAxes, axs):
    res.sum(dim).plot(ax=ax)
../_images/tutorial_1_binning_fake_data_13_0.png
[ ]: