Binning demonstration on locally generated fake data
In this example, we generate a table of random data simulating a single-event dataset. We showcase the binning method, first on a single pandas table using the bin_partition function, and then distributed over a dask dataframe using bin_dataframe. The first function is rarely called directly: it is the function that bin_dataframe applies to each partition of the dask dataframe.
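As a rough mental model (an illustrative sketch, not the actual sed implementation), the distributed call amounts to binning each partition independently and summing the per-partition histograms:

import dask
import numpy as np
from sed.binning import bin_partition

def binned_sum_over_partitions(ddf, bins, axes, ranges):
    # bin each dask partition separately, then sum the per-partition histograms
    delayed_hists = [
        dask.delayed(bin_partition)(part=p, bins=bins, axes=axes, ranges=ranges)
        for p in ddf.to_delayed()
    ]
    return np.sum(dask.compute(*delayed_hists), axis=0)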
[1]:
import sys
import dask
import numpy as np
import pandas as pd
import dask.dataframe
import matplotlib.pyplot as plt
sys.path.append("../")
from sed.binning import bin_partition, bin_dataframe
Generate fake data
[2]:
n_pts = 100000
cols = ["posx", "posy", "energy"]
df = pd.DataFrame(np.random.randn(n_pts, len(cols)), columns=cols)
df
[2]:
| | posx | posy | energy |
|---|---|---|---|
| 0 | -1.597617 | -0.543767 | 0.729540 |
| 1 | -1.765663 | -1.610388 | -1.131555 |
| 2 | 2.330603 | 2.251475 | -1.256111 |
| 3 | -0.729069 | -0.330417 | 1.571411 |
| 4 | 1.003576 | 1.423138 | 0.859336 |
| ... | ... | ... | ... |
| 99995 | -0.063640 | 0.004117 | -0.801435 |
| 99996 | -1.198671 | -0.618133 | 1.084128 |
| 99997 | -0.085724 | 2.608801 | 0.226714 |
| 99998 | 0.237666 | 1.075434 | -0.435475 |
| 99999 | -0.633201 | 0.019438 | -0.329833 |
100000 rows × 3 columns
Define the binning range
[3]:
binAxes = ["posx", "posy", "energy"]
nBins = [120, 120, 120]
binRanges = [(-2, 2), (-2, 2), (-2, 2)]
coords = {ax: np.linspace(r[0], r[1], n) for ax, r, n in zip(binAxes, binRanges, nBins)}
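As an aside, the coords dictionary built above holds one coordinate array per binned axis, so it can be used to attach labeled axes to any histogram of matching shape, for instance with xarray. The snippet below is a minimal sketch under that assumption (xarray is not required by the binning functions themselves); it reuses binAxes, nBins, and coords from the cell above.

import numpy as np
import xarray as xr

# wrap a histogram-shaped array with the axis labels defined above
labeled = xr.DataArray(np.zeros(nBins), dims=binAxes, coords=coords)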
Compute the binning of the pandas dataframe
[4]:
%%time
res = bin_partition(
    part=df,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 1.24 s, sys: 39.5 ms, total: 1.28 s
Wall time: 1.43 s
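As a sanity check (a sketch, not part of the tutorial workflow), the same histogram can be computed with plain numpy's histogramdd. This is typically slower than the numba-accelerated path, but it gives an independent reference of the same shape; exact counts may differ slightly at the bin edges depending on the edge convention.

# reference histogram with plain numpy, using the same axes, bins, and ranges
ref, _ = np.histogramdd(df[binAxes].values, bins=nBins, range=binRanges)
print(ref.shape, res.shape)  # expected: (120, 120, 120) for both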
[5]:
fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)
for i in range(3):
    # sum over one axis to show the three 2D projections of the 3D histogram
    axs[i].imshow(res.sum(i))
![../_images/tutorial_1_binning_fake_data_8_0.png](../_images/tutorial_1_binning_fake_data_8_0.png)
Transform to dask dataframe
[6]:
ddf = dask.dataframe.from_pandas(df, npartitions=50)
ddf
[6]:
Dask DataFrame Structure:
| | posx | posy | energy |
|---|---|---|---|
| npartitions=50 | | | |
| 0 | float64 | float64 | float64 |
| 2000 | ... | ... | ... |
| ... | ... | ... | ... |
| 98000 | ... | ... | ... |
| 99999 | ... | ... | ... |
Dask Name: from_pandas, 1 graph layer
Compute distributed binning on the partitioned dask dataframe
In this example, the dataset is small, so distributed binning does not give a significant improvement over the pandas implementation, at least with this number of partitions. A single partition would in fact be faster (you can try it; see the sketch after the timing below), but we use multiple partitions for demonstration purposes.
[7]:
%%time
res = bin_dataframe(
    df=ddf,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)
CPU times: user 401 ms, sys: 521 ms, total: 922 ms
Wall time: 509 ms
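To try the single-partition case mentioned above, one possibility is to repartition the dask dataframe and rerun the same call (a sketch; the resulting timing will depend on your machine):

# collapse the dask dataframe into a single partition and bin it again
ddf_single = ddf.repartition(npartitions=1)
res_single = bin_dataframe(
    df=ddf_single,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)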
[8]:
fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)
for dim, ax in zip(binAxes, axs):
    # sum over one dimension to show the 2D projection onto the remaining two axes
    res.sum(dim).plot(ax=ax)
![../_images/tutorial_1_binning_fake_data_13_0.png](../_images/tutorial_1_binning_fake_data_13_0.png)
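Since the distributed result carries labeled axes (as the plotting calls above suggest), it can also be inspected and sliced with xarray's selection methods. A minimal sketch, assuming res is an xarray.DataArray:

print(res.dims, res.shape)
# 2D cut through the volume at the energy bin closest to zero
res.sel(energy=0.0, method="nearest").plot()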