# Binning demonstration on locally generated fake data
In this example, we generate a table of random data that simulates a single-event dataset.
We demonstrate binning first on a single pandas table using bin_partition, and then in distributed form using bin_dataframe on a dask dataframe.
bin_partition is rarely called directly; it is the function that bin_dataframe applies to each partition of the dask dataframe.
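
The idea of binning each partition separately and combining the results can be sketched with plain NumPy. This is only an illustration of the principle, not the sed implementation: `np.histogramdd` stands in for bin_partition, `np.array_split` stands in for dask partitioning, and the sizes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three columns standing in for posx, posy and energy.
data = rng.standard_normal((10_000, 3))

bins = [20, 20, 20]
ranges = [(-2, 2), (-2, 2), (-2, 2)]

# Histogram of the whole table in one call.
full_hist, _ = np.histogramdd(data, bins=bins, range=ranges)

# Histogram each chunk independently, then add the partial histograms.
partitions = np.array_split(data, 5)
summed_hist = sum(
    np.histogramdd(part, bins=bins, range=ranges)[0] for part in partitions
)

# Both routes count the same events into the same bins.
print(np.array_equal(full_hist, summed_hist))
```

Because every event falls into exactly one bin regardless of which chunk it sits in, the partial histograms add up to the full one. This is what makes per-partition binning suitable for distributed execution.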

In [None]:
import dask
import numpy as np
import pandas as pd
import dask.dataframe

import matplotlib.pyplot as plt

from sed.binning import bin_partition, bin_dataframe

%matplotlib widget

## Generate Fake Data

In [None]:
n_pts = 100000
cols = ["posx", "posy", "energy"]
df = pd.DataFrame(np.random.randn(n_pts, len(cols)), columns=cols)
df

## Define the binning range

In [None]:
binAxes = ["posx", "posy", "energy"]
nBins = [120, 120, 120]
binRanges = [(-2, 2), (-2, 2), (-2, 2)]
coords = {ax: np.linspace(r[0], r[1], n) for ax, r, n in zip(binAxes, binRanges, nBins)}
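
The `coords` dictionary above is not passed to the binning calls in this notebook, so it only serves as a set of axis values. As a side note on what such values represent, the following sketch compares `n` evenly spaced points with the `n + 1` edges and `n` centers of `n` equal-width bins. It uses NumPy only; how sed labels the axes of its results is not stated here and is not assumed.

```python
import numpy as np

lo, hi, n = -2.0, 2.0, 4  # a small n keeps the arrays readable

# n bins over [lo, hi] are delimited by n + 1 edges.
edges = np.linspace(lo, hi, n + 1)

# The center of each bin is the midpoint of its two edges.
centers = 0.5 * (edges[:-1] + edges[1:])

# n evenly spaced points including both endpoints, as in the cell above.
points = np.linspace(lo, hi, n)

print("edges:  ", edges)
print("centers:", centers)
print("points: ", points)
```

The `points` array has the same length as `centers` but spans the full range including both endpoints, so the two differ slightly. The difference shrinks as the number of bins grows.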

## Compute the binning along the pandas dataframe

In [None]:
%%time
res = bin_partition(
    part=df,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)
# Sum over one axis at a time to show the three 2D projections.
for i in range(3):
    axs[i].imshow(res.sum(i))

## Transform to dask dataframe

In [None]:
ddf = dask.dataframe.from_pandas(df, npartitions=50)
ddf

## Compute distributed binning on the partitioned dask dataframe
With a dataset this small, distributed binning gives no significant speedup over the pandas implementation, at least with this number of partitions.
A single partition would be faster (try it), but we use several here for demonstration purposes.

In [None]:
%%time
res = bin_dataframe(
    df=ddf,
    bins=nBins,
    axes=binAxes,
    ranges=binRanges,
    hist_mode="numba",
)

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)
# Sum over one named dimension at a time and plot the remaining 2D projection.
for dim, ax in zip(binAxes, axs):
    res.sum(dim).plot(ax=ax)