Dataset

SED comes with the ability to download and extract any URL-based dataset. By default, the “WSe2”, “TaS2” and “Gd_W110” datasets are available, but it is easy to extend this list.
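For a quick start, the typical usage looks like this (each step is explained in detail below):

from sed.dataset import dataset

dataset.get("WSe2")  # download and extract to the default data path
print(dataset.dir)   # path of the extracted data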

Getting datasets

import os
from sed.dataset import dataset

Get

The “get” method only needs the dataset name, but a different root_dir can also be provided.

Try interrupting the download process and restarting it to see that the download continues from where it stopped.
dataset.get("WSe2", remove_zip = False)
Using default data path for "WSe2": "<user_path>/datasets/WSe2"

3%|▎         | 152M/5.73G [00:02<01:24, 71.3MB/s]

Using default data path for "WSe2": "<user_path>/datasets/WSe2"

100%|██████████| 5.73G/5.73G [01:09<00:00, 54.3MB/s]

Download complete.
If “remove_zip” is not provided, the zip file is deleted after extraction by default.
dataset.get("WSe2")
Setting the “use_existing” keyword to False allows downloading the data to another location. The default is to use the existing data.
dataset.get("WSe2", root_dir = "new_datasets", use_existing=False)
Using specified data path for "WSe2": "<user_path>/new_datasets/datasets/WSe2"
Created new directory at <user_path>/new_datasets/datasets/WSe2


  3%|▎         | 152M/5.73G [00:02<01:24, 71.3MB/s]
Interrupting the extraction behaves similarly to interrupting the download: it continues from where it stopped.
If the user deletes the extracted files, they are re-extracted from the zip file.
dataset.get("WSe2", remove_zip = False)

## Try to remove some files and rerun this command.
Using default data path for "WSe2": "<user_path>/datasets/WSe2"
WSe2 data is already fully downloaded.


5.73GB [00:00, 12.6MB/s]

Download complete.
Extracting WSe2 data...



100%|██████████| 113/113 [02:41<00:00,  1.43s/file]

WSe2 data extracted successfully.

The “remove” method allows removal of some or all instances of existing data.

This would remove only one of the two existing paths.
dataset.remove("WSe2", instance = dataset.existing_data_paths[0])
Removed <user_path>/datasets/WSe2
This removes all instances, if any are present.
dataset.remove("WSe2")
WSe2 data is not present.

Attributes useful for the user

dataset.available
['WSe2', 'TaS2', 'Gd_W110']
dataset.dir
'<user_path>/datasets/WSe2'
dataset.subdirs
['<user_path>/datasets/WSe2/Scan049_1',
 '<user_path>/datasets/WSe2/energycal_2019_01_08']
dataset.existing_data_paths
['<user_path>/new_datasets/datasets/WSe2',
 '<user_path>/datasets/WSe2']
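These attributes can be used directly in a workflow, for example to point further processing at the downloaded data (the variable names here are only illustrative):

data_path = dataset.dir            # e.g. '<user_path>/datasets/WSe2'
scan_folder = dataset.subdirs[0]   # e.g. the Scan049_1 subdirectory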

Example of adding custom datasets

import os
from sed.dataset import DatasetsManager
example_dset_name = "Example"
example_dset_info = {}

example_dset_info["url"] = "https://example-dataset.com/download" # not a real path
example_dset_info["subdirs"] = ["Example_subdir"]
example_dset_info["rearrange_files"] = True

DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder", "user"])
Added Example dataset to folder datasets.json
Added Example dataset to user datasets.json
assert os.path.exists("./datasets.json")
dataset.available
['Example', 'WSe2', 'TaS2', 'Gd_W110']
DatasetsManager.remove(data_name=example_dset_name, levels=["user"])
Removed Example dataset from user datasets.json
# This should give an error
DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder"])
ValueError: Dataset Example already exists in folder datasets.json.
dataset.get("Example")
Using default data path for "Example": "<user_path>/datasets/Example"
Created new directory at <user_path>/datasets/Example
Download complete.
Extracting Example data...


100%|██████████| 4/4 [00:00<00:00, 28.10file/s]

Example data extracted successfully.
Removed Example.zip file.
Rearranging files in Example_subdir.



100%|██████████| 3/3 [00:00<00:00, 696.11file/s]

File movement complete.
Rearranging complete.
print(dataset.dir)
print(dataset.subdirs)
<user_path>/datasets/Example
[]
dataset.get("Example", root_dir = "new_datasets", use_existing = False)
print(dataset.existing_data_paths)
path_to_remove = dataset.existing_data_paths[0]
['<user_path>/new_datasets/datasets/Example', '<user_path>/datasets/Example']
dataset.remove(data_name="Example", instance=path_to_remove)
Removed <user_path>/new_datasets/datasets/Example
assert not os.path.exists(path_to_remove)
print(dataset.existing_data_paths)
['<user_path>/datasets/Example']

Default datasets.json

{
    "WSe2": {
        "url": "https://zenodo.org/record/6369728/files/WSe2.zip",
        "subdirs": [
            "Scan049_1",
            "energycal_2019_01_08"
        ]
    },
    "Gd_W110": {
        "url": "https://zenodo.org/records/10658470/files/single_event_data.zip",
        "subdirs": [
            "analysis_data",
            "calibration_data"
        ],
        "rearrange_files": true
    },
    "TaS2": {
        "url": "https://zenodo.org/records/10160182/files/TaS2.zip",
        "subdirs": [
            "Scan0121_1",
            "energycal_2020_07_20"
        ]
    },
    "Test": {
        "url": "http://test.com/files/file.zip",
        "subdirs": [
            "subdir"
        ],
        "rearrange_files": true
    }
}

API

This module provides a Dataset class to download and extract datasets from the web. These datasets are defined in a JSON file. The Dataset class provides an easy API:

from sed.dataset import dataset
dataset.get("NAME")

class sed.dataset.dataset.DatasetsManager

Bases: object

Class to manage adding and removing datasets from the JSON file.

NAME = 'datasets'
FILENAME = 'datasets.json'
json_path = {'folder': './datasets.json', 'module': '<module_path>/datasets.json', 'user': '<user_path>/.config/sed/datasets.json'}
static load_datasets_dict()

Loads the datasets configuration dictionary from the user’s datasets JSON file.

If the file does not exist, it copies the default datasets JSON file from the module directory to the user’s datasets directory.

Returns:

The datasets dict loaded from the user’s datasets JSON file.

Return type:

dict
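A minimal usage sketch (the returned keys depend on the user’s datasets.json):

from sed.dataset import DatasetsManager

datasets_dict = DatasetsManager.load_datasets_dict()
print(list(datasets_dict.keys()))    # e.g. ['WSe2', 'Gd_W110', 'TaS2', 'Test']
print(datasets_dict["WSe2"]["url"])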

static add(data_name, info, levels=['user'])

Adds a new dataset to the datasets JSON file.

Parameters:
  • data_name (str) – Name of the dataset.

  • info (dict) – Information about the dataset.

  • levels (list) – List of levels to add the dataset to. Default is [“user”].
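For example, to register the custom dataset from the tutorial above at the user level only:

DatasetsManager.add(
    data_name="Example",
    info={"url": "https://example-dataset.com/download",  # not a real URL
          "subdirs": ["Example_subdir"]},
    levels=["user"],
)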

static remove(data_name, levels=['user'])

Removes a dataset from the datasets JSON file.

Parameters:
  • data_name (str) – Name of the dataset.

  • levels (list) – List of levels to remove the dataset from. Default is [“user”].

class sed.dataset.dataset.Dataset

Bases: object

property available: list

Returns a list of available datasets.

Returns:

List of available datasets.

Return type:

list

property data_name: str

Get the data name.

property existing_data_paths: list

Get paths where dataset exists.

get(data_name, **kwargs)

Fetches the specified data and extracts it to the given data path.

Parameters:
  • data_name (str) – Name of the data to fetch.

  • root_dir (str) – Path where the data should be stored. Default is the current directory.

  • use_existing (bool) – Whether to use the existing data path. Default is True.

  • remove_zip (bool) – Whether to remove the ZIP file after extraction. Default is True.

  • ignore_zip (bool) – Whether to ignore ZIP files when listing files. Default is True.
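A combined sketch with several keywords (the target directory name is illustrative):

from sed.dataset import dataset

# store under a custom root directory and keep the downloaded zip file
dataset.get("TaS2", root_dir="my_data", use_existing=False, remove_zip=False)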

remove(data_name, instance='all')

Removes directories of all or defined instances of the specified dataset.

Parameters:
  • data_name (str) – Name of the dataset.

  • instance (str) – Name of the instance to remove. Default is “all”.
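For example, mirroring the tutorial above:

from sed.dataset import dataset

# remove one specific copy (existing_data_paths is populated after a previous get)
dataset.remove("TaS2", instance=dataset.existing_data_paths[0])
# or remove all existing copies
dataset.remove("TaS2")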