Binning demonstration on locally generated fake data
In this example, we generate a table of random data simulating a single-event dataset. We demonstrate binning first on a single pandas table with bin_partition, and then in distributed form with bin_dataframe on a Dask dataframe. bin_partition is rarely called directly: it is the function that bin_dataframe applies to each partition of the Dask dataframe.
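The partition-wise idea described above can be sketched without sed or Dask. The snippet below is only an illustration under stated assumptions: `np.histogramdd` and the hypothetical helper `bin_chunk` stand in for the per-partition binning, and `np.array_split` stands in for Dask partitioning. It is not how sed implements `bin_partition` or `bin_dataframe`; it only shows why binning each partition on a fixed grid and adding the results gives the same histogram as binning the whole table at once.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Three columns standing in for posx, posy and energy.
data = rng.standard_normal((100_000, 3))

n_bins = [20, 20, 20]
ranges = [(-2, 2), (-2, 2), (-2, 2)]


def bin_chunk(chunk):
    """Histogram one block of rows on the fixed grid (NumPy stand-in, not sed code)."""
    hist, _ = np.histogramdd(chunk, bins=n_bins, range=ranges)
    return hist


# Bin everything at once.
full = bin_chunk(data)

# Bin 50 row-wise chunks separately, then add the partial histograms.
partitions = np.array_split(data, 50)
summed = sum(bin_chunk(p) for p in partitions)

# Counts are additive, so both routes agree exactly.
identical = np.array_equal(full, summed)
print(identical)
```

Because every chunk is binned on the same grid, the partial histograms can be added in any order, which is what makes the per-partition approach suitable for distributed execution.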
[1]:
import dask
import numpy as np
import pandas as pd
import dask.dataframe
import matplotlib.pyplot as plt
from sed.binning import bin_partition, bin_dataframe
%matplotlib widget
Generate Fake Data
[2]:
n_pts = 100000
cols = ["posx", "posy", "energy"]
df = pd.DataFrame(np.random.randn(n_pts, len(cols)), columns=cols)
df
[2]:
 | posx | posy | energy |
---|---|---|---|
0 | 0.163472 | -0.909197 | -0.207413 |
1 | -0.574284 | 0.080758 | -2.146202 |
2 | -0.746272 | 0.180816 | 0.359223 |
3 | 0.896156 | 0.060686 | 0.209593 |
4 | 0.246686 | -0.145365 | -0.042317 |
... | ... | ... | ... |
99995 | 1.222646 | 0.584658 | -1.180370 |
99996 | 0.417908 | 0.248488 | -0.250342 |
99997 | -1.653766 | 1.610031 | 0.904236 |
99998 | 0.595726 | 0.244187 | -1.182691 |
99999 | -1.991101 | -0.389323 | 0.167254 |
100000 rows × 3 columns
Define the binning range
[3]:
binAxes = ["posx", "posy", "energy"]
nBins = [120, 120, 120]
binRanges = [(-2, 2), (-2, 2), (-2, 2)]
coords = {ax: np.linspace(r[0], r[1], n) for ax, r, n in zip(binAxes, binRanges, nBins)}
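When building coordinate arrays for a binned axis, it helps to keep bin edges and bin centers apart. The sketch below uses plain NumPy and the same range and bin count as the cell above. The names `edges`, `centers`, and `width` are illustrative only, and which convention sed uses to label its output axes is not stated here, so treat this as a general note on histogram grids rather than a description of the library.

```python
import numpy as np

n_bins = 120
lo, hi = -2.0, 2.0

# n_bins bins are delimited by n_bins + 1 equally spaced edges.
edges = np.linspace(lo, hi, n_bins + 1)

# Each bin center is the midpoint of two neighbouring edges.
centers = 0.5 * (edges[:-1] + edges[1:])

# Width of a single bin.
width = (hi - lo) / n_bins

print(len(edges), len(centers), width)
```

Note that `np.linspace(lo, hi, n_bins)` returns `n_bins` points that include both end points, so its spacing is `(hi - lo) / (n_bins - 1)` rather than the bin width.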
Compute the binning on the pandas dataframe
[4]:
%%time
res = bin_partition(
    part=df,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 1.16 s, sys: 22 ms, total: 1.18 s
Wall time: 1.18 s
[5]:
fig, axs = plt.subplots(1, 3, figsize=(6, 1.875), constrained_layout=True)
# Project the 3D histogram onto two axes by summing over the remaining one
for i in range(3):
    axs[i].imshow(res.sum(i))
Transform to dask dataframe
[6]:
ddf = dask.dataframe.from_pandas(df, npartitions=50)
ddf
[6]:
Dask DataFrame Structure:
 | posx | posy | energy |
---|---|---|---|
npartitions=50 | | | |
0 | float64 | float64 | float64 |
2000 | ... | ... | ... |
... | ... | ... | ... |
98000 | ... | ... | ... |
99999 | ... | ... | ... |
Dask Name: from_pandas, 1 graph layer
Compute distributed binning on the partitioned dask dataframe
With a dataset this small, the distributed version gives no significant improvement over the pandas implementation, at least with this number of partitions. A single partition would be faster (you can try…), but we use several here for demonstration purposes.
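The cost of splitting a small dataset into many pieces can be explored with a rough timing sketch. The code below is an assumption-laden stand-in: it uses `np.histogramdd` and a hypothetical helper `bin_in_chunks` instead of sed, and it runs the chunks serially, so it shows only the per-chunk overhead and none of the parallel speed-up that Dask can provide. Absolute timings depend on the machine and are not meant to match the cells in this notebook.

```python
import time

import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.standard_normal((100_000, 3))
n_bins = [60, 60, 60]
ranges = [(-2, 2)] * 3


def bin_in_chunks(array, n_chunks):
    """Histogram each chunk separately and add the results.

    Returns the summed histogram and the elapsed wall time in seconds.
    """
    start = time.perf_counter()
    total = np.zeros(n_bins)
    for chunk in np.array_split(array, n_chunks):
        hist, _ = np.histogramdd(chunk, bins=n_bins, range=ranges)
        total += hist
    return total, time.perf_counter() - start


results = {}
timings = {}
for n_chunks in (1, 10, 50):
    results[n_chunks], timings[n_chunks] = bin_in_chunks(data, n_chunks)
    print(f"{n_chunks:>3} chunks: {timings[n_chunks]:.3f} s")
```

Each chunk allocates and fills a full-size histogram regardless of how few rows it holds, so the fixed cost per chunk grows with the number of chunks while the result stays the same.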
[7]:
%%time
res = bin_dataframe(
    df=ddf,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 633 ms, sys: 192 ms, total: 825 ms
Wall time: 708 ms
[8]:
fig, axs = plt.subplots(1, 3, figsize=(6, 1.875), constrained_layout=True)
# Sum over one named dimension at a time and plot the remaining 2D projection
for dim, ax in zip(binAxes, axs):
    res.sum(dim).plot(ax=ax)