{ "cells": [ { "cell_type": "markdown", "id": "8ad4167a-e4e7-498d-909a-c04da9f177ed", "metadata": { "tags": [] }, "source": [ "# Binning demonstration on locally generated fake data\n", "In this example, we generate a table with random data simulating a single event dataset.\n", "We showcase the binning method, first on a simple single table using the bin_partition method and then in the distributed method bin_dataframe, using daks dataframes.\n", "The first method is never really called directly, as it is simply the function called by the bin_dataframe on each partition of the dask dataframe." ] }, { "cell_type": "code", "execution_count": null, "id": "fb045e17-fa89-4c11-9d51-7f06e80d96d5", "metadata": {}, "outputs": [], "source": [ "import dask\n", "import numpy as np\n", "import pandas as pd\n", "import dask.dataframe\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "from sed.binning import bin_partition, bin_dataframe\n", "\n", "%matplotlib widget" ] }, { "cell_type": "markdown", "id": "42a6afaa-17dd-4637-ba75-a28c4ead1adf", "metadata": {}, "source": [ "## Generate Fake Data" ] }, { "cell_type": "code", "execution_count": null, "id": "2aa8df59-224a-46a2-bb77-0277ff504996", "metadata": {}, "outputs": [], "source": [ "n_pts = 100000\n", "cols = [\"posx\", \"posy\", \"energy\"]\n", "df = pd.DataFrame(np.random.randn(n_pts, len(cols)), columns=cols)\n", "df" ] }, { "cell_type": "markdown", "id": "6902fd56-1456-4da6-83a4-0f3f6b831eb6", "metadata": {}, "source": [ "## Define the binning range" ] }, { "cell_type": "code", "execution_count": null, "id": "a7601cd7-cd51-40a9-8fc7-8b7d32ff15d0", "metadata": {}, "outputs": [], "source": [ "binAxes = [\"posx\", \"posy\", \"energy\"]\n", "nBins = [120, 120, 120]\n", "binRanges = [(-2, 2), (-2, 2), (-2, 2)]\n", "coords = {ax: np.linspace(r[0], r[1], n) for ax, r, n in zip(binAxes, binRanges, nBins)}" ] }, { "cell_type": "markdown", "id": "00054b5d-fc96-4959-b562-7cb8545a9535", "metadata": {}, "source": [ "## Compute the binning along the pandas dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "758a0e95-7a03-4d44-9dae-e6bd2334554c", "metadata": {}, "outputs": [], "source": [ "%%time\n", "res = bin_partition(\n", " part=df,\n", " bins=nBins,\n", " axes=binAxes,\n", " ranges=binRanges,\n", " hist_mode=\"numba\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "c4f2b55f-11b3-4456-abd6-b0865749df96", "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)\n", "for i in range(3):\n", " axs[i].imshow(res.sum(i))" ] }, { "cell_type": "markdown", "id": "e632dc1d-5eb5-4621-8bef-4438ce2c6a0c", "metadata": {}, "source": [ "## Transform to dask dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "ba0416b3-b4b6-4b18-8ed3-a76ab4889892", "metadata": {}, "outputs": [], "source": [ "ddf = dask.dataframe.from_pandas(df, npartitions=50)\n", "ddf" ] }, { "cell_type": "markdown", "id": "01066d40-010a-490b-9033-7339e5a21b26", "metadata": {}, "source": [ "## Compute distributed binning on the partitioned dask dataframe\n", "In this example, the small dataset does not give significant improvement over the pandas implementation, at least using this number of partitions.\n", "A single partition would be faster (you can try...) but we use multiple for demonstration purposes." ] }, { "cell_type": "code", "execution_count": null, "id": "cbed3261-187c-498d-8ee0-0c3a3c8a8c1e", "metadata": {}, "outputs": [], "source": [ "%%time\n", "res = bin_dataframe(\n", " df=ddf,\n", " bins=nBins,\n", " axes=binAxes,\n", " ranges=binRanges,\n", " hist_mode=\"numba\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "99d7d136-b677-4c16-bc8f-31ba8216579c", "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 3, figsize=(8, 2.5), constrained_layout=True)\n", "for dim, ax in zip(binAxes, axs):\n", " res.sum(dim).plot(ax=ax)" ] }, { "cell_type": "code", "execution_count": null, "id": "b8adf386-669f-4b77-920f-6dfec7b637c9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "interpreter": { "hash": "728003ee06929e5fa5ff815d1b96bf487266025e4b7440930c6bf4536d02d243" }, "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }