Binning demonstration on locally generated fake data#

In this example, we generate a table with random data simulating a single event dataset. We showcase the binning method, first on a simple single table using the bin_partition method and then in the distributed method bin_dataframe, using daks dataframes. The first method is never really called directly, as it is simply the function called by the bin_dataframe on each partition of the dask dataframe.

[1]:
import dask
import numpy as np
import pandas as pd
import dask.dataframe

import matplotlib.pyplot as plt

from sed.binning import bin_partition, bin_dataframe

%matplotlib widget
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/dask/dataframe/__init__.py:42: FutureWarning:
Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.

  warnings.warn(msg, FutureWarning)

Generate Fake Data#

[2]:
n_pts = 100000
cols = ["posx", "posy", "energy"]
df = pd.DataFrame(np.random.randn(n_pts, len(cols)), columns=cols)
df
[2]:
posx posy energy
0 -0.525636 1.736934 0.488467
1 -0.361820 1.829446 -1.176020
2 -0.458677 0.480398 0.523094
3 -1.085659 -0.619722 2.176555
4 -0.307780 -0.407004 -1.708962
... ... ... ...
99995 -0.550153 0.753477 -0.014569
99996 1.439914 0.181370 -1.634102
99997 1.846806 1.458536 0.941022
99998 -0.162405 -0.792115 -2.135910
99999 1.332117 0.407654 0.683675

100000 rows × 3 columns

Define the binning range#

[3]:
binAxes = ["posx", "posy", "energy"]
nBins = [120, 120, 120]
binRanges = [(-2, 2), (-2, 2), (-2, 2)]
coords = {ax: np.linspace(r[0], r[1], n) for ax, r, n in zip(binAxes, binRanges, nBins)}

Compute the binning along the pandas dataframe#

[4]:
%%time
res = bin_partition(
    part=df,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 1.14 s, sys: 25.4 ms, total: 1.17 s
Wall time: 1.17 s
[5]:
fig, axs = plt.subplots(1, 3, figsize=(6, 1.875), constrained_layout=True)
for i in range(3):
    axs[i].imshow(res.sum(i))

Transform to dask dataframe#

[6]:
ddf = dask.dataframe.from_pandas(df, npartitions=50)
ddf
[6]:
Dask DataFrame Structure:
posx posy energy
npartitions=50
0 float64 float64 float64
2000 ... ... ...
... ... ... ...
98000 ... ... ...
99999 ... ... ...
Dask Name: from_pandas, 1 graph layer

Compute distributed binning on the partitioned dask dataframe#

In this example, the small dataset does not give significant improvement over the pandas implementation, at least using this number of partitions. A single partition would be faster (you can try…) but we use multiple for demonstration purposes.

[7]:
%%time
res = bin_dataframe(
    df=ddf,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 669 ms, sys: 196 ms, total: 865 ms
Wall time: 766 ms
[8]:
fig, axs = plt.subplots(1, 3, figsize=(6, 1.875), constrained_layout=True)
for dim, ax in zip(binAxes, axs):
    res.sum(dim).plot(ax=ax)
[ ]: