Tutorial for binning data from the SXP instrument at the European XFEL#

Preparation#

Import necessary libraries#

[1]:

%load_ext autoreload
%autoreload 2
from pathlib import Path
import os
import xarray as xr
import numpy as np

from sed import SedProcessor
from sed.dataset import dataset

%matplotlib widget
import matplotlib.pyplot as plt

Get data paths#

The paths are such that if you are on Maxwell, it uses those. Otherwise data is downloaded in current directory from Zenodo.

Generally, if it is your beamtime, you can both read the raw data and write to processed directory. However, for the public data, you can not write to processed directory.

[2]:

beamtime_dir = "/gpfs/exfel/exp/SXP/202302/p004316/" # on Maxwell
if os.path.exists(beamtime_dir) and os.access(beamtime_dir, os.R_OK):
    path = beamtime_dir + "/raw/"
    buffer_path = "Au_Mica/processed/"
else:
    # data_path can be defined and used to store the data in a specific location
    dataset.get("Au_Mica") # Put in Path to a storage of at least 10 GByte free space.
    path = dataset.dir
    buffer_path = path + "/processed/"

INFO - Not downloading Au_Mica data as it already exists at "/home/runner/work/sed/sed/docs/tutorial/datasets/Au_Mica".
Set 'use_existing' to False if you want to download to a new location.
INFO - Using existing data path for "Au_Mica": "/home/runner/work/sed/sed/docs/tutorial/datasets/Au_Mica"
INFO - Au_Mica data is already present.

Config setup#

Here we get the path to the config file and setup the relevant directories. This can also be done directly in the config file.

[3]:

# pick the default configuration file for SXP@XFEL
config_file = Path('../src/sed/config/sxp_example_config.yaml')
assert config_file.exists()

[4]:

# here we setup a dictionary that will be used to override the path configuration
config_override = {
    "core": {
        "paths": {
            "raw": path,
            "processed": buffer_path,
        },
    },
}

cleanup previous config files#

In this notebook, we will show how calibration parameters can be generated. Therefore we want to clean the local directory of previously generated files.

WARNING running the cell below will delete the “sed_config.yaml” file in the local directory. If these contain precious calibration parameters, DO NOT RUN THIS CELL.

[5]:

local_folder_config = Path('./sed_config.yaml')
if local_folder_config.exists():
    os.remove(local_folder_config)
    print(f'deleted local config file {local_folder_config}')
assert not local_folder_config.exists()

deleted local config file sed_config.yaml

Load Au/Mica data#

Now we load a couple of scans from Au 4f core levels. Data will be processed to parquet format first, if not existing yet, and then loaded into the processor.

[6]:

sp = SedProcessor(
    runs=["0058", "0059", "0060", "0061"],
    config=config_override,
    system_config=config_file,
    collect_metadata=False,
    verbose=True,
)

INFO - System config loaded from: [/home/runner/work/sed/sed/docs/src/sed/config/sxp_example_config.yaml]
INFO - Default config loaded from: [/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/sed/config/default.yaml]
Reading files: 0 new files of 65 total.
All files converted successfully!
Filling nan values...
loading complete in  0.14 s

Inspect the dataframe#

We first try to get an overview of the structure of the data. For this, we look at the loaded dataframe:

[7]:

sp.dataframe.head()

[7]:

	trainId	pulseId	timeStamp	dldPosX	dldPosY	dldTimeSteps	delayStage
0	1862196735	20	1.700983e+09	1745	3024	6066	-125.495093
1	1862196735	25	1.700983e+09	524	1708	4830	-125.495093
2	1862196735	37	1.700983e+09	3385	3403	9148	-125.495093
3	1862196735	41	1.700983e+09	2099	2309	6542	-125.495093
4	1862196735	47	1.700983e+09	1713	850	18838	-125.495093

Train IDs in scans#

Next, let’s look at the trainIDs contained in these runs

[8]:

plt.figure()
ids=sp.dataframe.trainId.compute().values
plt.plot(ids)
plt.show()

Channel Histograms#

Let’s look at the single histograms of the main data channels

[9]:

sp.view_event_histogram(dfpid=3)

PulseIds, ElectronIds#

To get a better understanding of the structure of the data, lets look at the histograms of microbunches and electrons. We see that we have more hits at later microbunches, and only few multi-hits.

[10]:

axes = ["pulseId", "electronId"]
bins = [101, 11]
ranges = [(-0.5, 800.5), (-0.5, 10.5)]
sp.view_event_histogram(dfpid=1, axes=axes, bins=bins, ranges=ranges)

We can also inspect the counts per train as function of the trainId and the pulseId, which gives us a good idea about the evolution of the count rate over the run(s)

[11]:

plt.figure()
axes = ["trainId", "pulseId"]
bins = [100, 100]
ranges = [(ids.min()+1, ids.max()), (0, 800)]
res = sp.compute(bins=bins, axes=axes, ranges=ranges)
res.plot()

[11]:

<matplotlib.collections.QuadMesh at 0x7fd930e99f60>

Spectrum vs. MicrobunchId#

Let’s check the TOF spectrum as function of microbunch ID, to understand if the increasing hit probability has any influence on the spectrum.

[12]:

axes = ["dldTimeSteps", "pulseId"]
bins = [200, 800]
ranges = [(8000, 28000), (0, 800)]
res = sp.compute(bins=bins, axes=axes, ranges=ranges)
plt.figure()
res.plot(robust=True)

[12]:

<matplotlib.collections.QuadMesh at 0x7fd944289ae0>

We see that the background below the Au 4f core levels slightly changes with microbunch ID. The origin of this is not quite clear yet.

[13]:

plt.figure()
(res.loc[{"pulseId":slice(0,50)}].sum(axis=1)/res.loc[{"pulseId":slice(0,50)}].sum(axis=1).mean()).plot()
(res.loc[{"pulseId":slice(700,750)}].sum(axis=1)/res.loc[{"pulseId":slice(700,750)}].sum(axis=1).mean()).plot()
plt.legend(("mbID=0-50", "mbID=700-750"))

[13]:

<matplotlib.legend.Legend at 0x7fd94c4f8e50>

Energy Calibration#

We now load a bias series, where the sample bias was varied, effectively shifting the energy spectra. This allows us to calibrate the conversion between the digital values of the dld and the energy.

time-of-flight spectrum#

to compare with what we see on the measurement computer, we might want to plot the time-of-flight spectrum. This is done here.

[14]:

sp.append_tof_ns_axis()

INFO - Adding time-of-flight column in nanoseconds to dataframe.
INFO - Dask DataFrame Structure:
               trainId pulseId electronId timeStamp dldPosX dldPosY dldTimeSteps delayStage  dldTime
npartitions=65
                uint64   int64      int64   float64  uint16  uint16       uint64    float64  float64
                   ...     ...        ...       ...     ...     ...          ...        ...      ...
...                ...     ...        ...       ...     ...     ...          ...        ...      ...
                   ...     ...        ...       ...     ...     ...          ...        ...      ...
                   ...     ...        ...       ...     ...     ...          ...        ...      ...
Dask Name: assign, 12 graph layers

Now, to determine proper binning ranges, let’s have again a look at the event histograms:

[15]:

axes = ["dldTime"]
bins = [150]
ranges = [(-0.5, 150.5)]
sp.view_event_histogram(dfpid=1, axes=axes, bins=bins, ranges=ranges)

Load energy calibration files#

We now load a range of runs sequentially, that were recorded with different sample bias values, and load them afterwards into an xarray

[16]:

runs = ["0074", "0073", "0072", "0071", "0070", "0064", "0065", "0066", "0067", "0068", "0069"]
biases = np.arange(962, 951, -1)
data = []
for run in runs:
    sp.load(runs=[run])
    axes = ["dldTimeSteps"]
    bins = [2000]
    ranges = [(1000, 25000)]
    res = sp.compute(bins=bins, axes=axes, ranges=ranges)
    data.append(res)

biasSeries = xr.concat(data, dim=xr.DataArray(biases, dims="sampleBias", name="sampleBias"))

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 11 total.
All files converted successfully!
Filling nan values...
loading complete in  0.04 s

Reading files: 0 new files of 11 total.
All files converted successfully!
Filling nan values...
loading complete in  0.04 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Reading files: 0 new files of 4 total.
All files converted successfully!
Filling nan values...
loading complete in  0.03 s

Load bias series#

Now we load the bias series xarray into the processor for calibration

[17]:

sp.load_bias_series(binned_data=biasSeries)

find calibration parameters#

We now will fit the tof-energy relation. This is done by finding the maxima of a peak in the tof spectrum, and then fitting the square root relation to obtain the calibration parameters.

[18]:

ranges=(6380, 6700)
ref_id=6
sp.find_bias_peaks(ranges=ranges, ref_id=ref_id, apply=True)

INFO - Use feature ranges: [(6784.0, 7096.0), (6736.0, 7024.0), (6664.0, 6964.0), (6580.0, 6892.0), (6496.0, 6832.0), (6448.0, 6772.0), (6376.0, 6700.0), (6316.0, 6640.0), (6196.0, 6568.0), (6088.0, 6508.0), (6052.0, 6448.0)].
INFO - Extracted energy features: [[6.892e+03 1.000e+00]
 [6.820e+03 1.000e+00]
 [6.760e+03 1.000e+00]
 [6.688e+03 1.000e+00]
 [6.628e+03 1.000e+00]
 [6.556e+03 1.000e+00]
 [6.496e+03 1.000e+00]
 [6.436e+03 1.000e+00]
 [6.376e+03 1.000e+00]
 [6.304e+03 1.000e+00]
 [6.244e+03 1.000e+00]].

[19]:

sp.calibrate_energy_axis(
    ref_energy=-33,
    method="lmfit",
    energy_scale='kinetic',
    d={'value':1.1,'min': .2, 'max':5.0, 'vary':False},
    t0={'value':-1E-8, 'min': -1E-6, 'max': 1e-4, 'vary':True},
    E0={'value': 0., 'min': -1500, 'max': 1500, 'vary': True},
)

INFO - [[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 25
    # data points      = 11
    # variables        = 2
    chi-square         = 0.02957200
    reduced chi-square = 0.00328578
    Akaike info crit   = -61.1070499
    Bayesian info crit = -60.3112593
[[Variables]]
    d:   1.1 (fixed)
    t0: -9.9902e-08 +/- 2.6372e-10 (0.26%) (init = -1e-08)
    E0: -1120.58964 +/- 0.59620132 (0.05%) (init = 0)
[[Correlations]] (unreported correlations are < 0.100)
    C(t0, E0) = -0.9996

Save calibration#

Now we save the calibration parameters into a local configuration file, that will be loaded in the next step

[20]:

sp.save_energy_calibration()

INFO - Saved energy calibration parameters to "sed_config.yaml".

Bin data with energy axis#

Now that we have the calibration parameters, we can generate the energy axis for our dataset. We need to load it again, and apply the calibration

[21]:

sp.load(runs=np.arange(58, 62))
sp.add_jitter()
sp.filter_column("pulseId", max_value=756)
sp.append_energy_axis(bias_voltage=957)

Reading files: 0 new files of 65 total.
All files converted successfully!
Filling nan values...
loading complete in  0.07 s
INFO - add_jitter: Added jitter to columns ['dldPosX', 'dldPosY', 'dldTimeSteps'].
INFO - Adding energy column to dataframe:
INFO - Using energy calibration parameters generated on 07/20/2025, 13:33:50
INFO - Dask DataFrame Structure:
               trainId pulseId electronId timeStamp  dldPosX  dldPosY dldTimeSteps delayStage   energy
npartitions=65
                uint64   int64      int64   float64  float64  float64      float64    float64  float64
                   ...     ...        ...       ...      ...      ...          ...        ...      ...
...                ...     ...        ...       ...      ...      ...          ...        ...      ...
                   ...     ...        ...       ...      ...      ...          ...        ...      ...
                   ...     ...        ...       ...      ...      ...          ...        ...      ...
Dask Name: assign, 25 graph layers

Now, we can bin as function fo energy and delay stage position

[22]:

axes = ['energy', "delayStage"]
bins = [200, 100]
ranges = [[-37,-31], [-135, -115]]
res = sp.compute(bins=bins, axes=axes, ranges=ranges, normalize_to_acquisition_time="delayStage")

INFO - Calculate normalization histogram for axis 'delayStage'...

[23]:

fig, axs = plt.subplots(1, 1, figsize=(4, 3), constrained_layout=True)
res.plot(ax=axs)

[23]:

<matplotlib.collections.QuadMesh at 0x7fd930c41600>

Correct delay stage offset.#

We can also offset the zero delay of the delay stage

[24]:

sp.add_delay_offset(constant=126.9)

INFO - Adding delay offset to dataframe:
INFO - Delay offset parameters:
   Constant: 126.9
INFO - Dask DataFrame Structure:
               trainId pulseId electronId timeStamp  dldPosX  dldPosY dldTimeSteps delayStage   energy
npartitions=65
                uint64   int64      int64   float64  float64  float64      float64    float64  float64
                   ...     ...        ...       ...      ...      ...          ...        ...      ...
...                ...     ...        ...       ...      ...      ...          ...        ...      ...
                   ...     ...        ...       ...      ...      ...          ...        ...      ...
                   ...     ...        ...       ...      ...      ...          ...        ...      ...
Dask Name: assign, 27 graph layers

[25]:

axes = ['energy', "delayStage"]
bins = [200, 100]
ranges = [[-37,-31], [-8, 8]]
res = sp.compute(bins=bins, axes=axes, ranges=ranges, normalize_to_acquisition_time="delayStage")

INFO - Calculate normalization histogram for axis 'delayStage'...

[26]:

res_sub = res - res.loc[{"delayStage": slice(-8, -1)}].mean(axis=1)
fig, axs = plt.subplots(3, 1, figsize=(4, 8), constrained_layout=True)
res.plot(ax=axs[0])
res_sub.plot(ax=axs[1])
res_sub.loc[{"energy":slice(-32.5,-32)}].sum(axis=0).plot(ax=axs[2])

[26]:

[<matplotlib.lines.Line2D at 0x7fd9302ad030>]

[ ]: