Dataset

SED comes with the ability to download and extract any URL-based dataset. By default, the “WSe2”, “TaS2” and “Gd_W110” datasets are available, but it is easy to extend this list.
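For a quick start, the typical usage looks like this (each step is explained in detail below):

from sed.dataset import dataset

dataset.get("WSe2")  # download and extract to the default data path
print(dataset.dir)   # path of the extracted data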

Getting datasets

import os
from sed.dataset import dataset

Get

The “get” method only needs the dataset name, but a different root_dir can also be provided.

Try interrupting the download process and restarting it to see that the download continues from where it stopped.
dataset.get("WSe2", remove_zip = False)
Using default data path for "WSe2": "<user_path>/datasets/WSe2"

3%|▎         | 152M/5.73G [00:02<01:24, 71.3MB/s]

Using default data path for "WSe2": "<user_path>/datasets/WSe2"

100%|██████████| 5.73G/5.73G [01:09<00:00, 54.3MB/s]

Download complete.
If “remove_zip” is not provided, the zip file is deleted after extraction by default.
dataset.get("WSe2")
Setting the “use_existing” keyword to False allows downloading the data to another location. The default is to use the existing data.
dataset.get("WSe2", root_dir = "new_datasets", use_existing=False)
Using specified data path for "WSe2": "<user_path>/new_datasets/datasets/WSe2"
Created new directory at <user_path>/new_datasets/datasets/WSe2


  3%|▎         | 152M/5.73G [00:02<01:24, 71.3MB/s]
Interrupting the extraction behaves similarly to interrupting the download: it continues from where it stopped.
If the user deletes the extracted files, they are re-extracted from the zip file.
dataset.get("WSe2", remove_zip = False)

## Try to remove some files and rerun this command.
Using default data path for "WSe2": "<user_path>/datasets/WSe2"
WSe2 data is already fully downloaded.


5.73GB [00:00, 12.6MB/s]

Download complete.
Extracting WSe2 data...



100%|██████████| 113/113 [02:41<00:00,  1.43s/file]

WSe2 data extracted successfully.

The “remove” method allows removal of some or all instances of existing data.

This would remove only one of the two existing paths.
dataset.remove("WSe2", instance = dataset.existing_data_paths[0])
Removed <user_path>/datasets/WSe2
This removes all instances, if any are present.
dataset.remove("WSe2")
WSe2 data is not present.

Attributes useful for the user

dataset.available
['WSe2', 'TaS2', 'Gd_W110']
dataset.dir
'<user_path>/datasets/WSe2'
dataset.subdirs
['<user_path>/datasets/WSe2/Scan049_1',
 '<user_path>/datasets/WSe2/energycal_2019_01_08']
dataset.existing_data_paths
['<user_path>/new_datasets/datasets/WSe2',
 '<user_path>/datasets/WSe2']
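These attributes can be used directly in a workflow, for example to point further processing at the downloaded data (the variable names here are only illustrative):

data_path = dataset.dir            # e.g. '<user_path>/datasets/WSe2'
scan_folder = dataset.subdirs[0]   # e.g. the Scan049_1 subdirectory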

Example of adding custom datasets

import os
from sed.dataset import DatasetsManager
example_dset_name = "Example"
example_dset_info = {}

example_dset_info["url"] = "https://example-dataset.com/download" # not a real path
example_dset_info["subdirs"] = ["Example_subdir"]
example_dset_info["rearrange_files"] = True

DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder", "user"])
Added Example dataset to folder datasets.json
Added Example dataset to user datasets.json
assert os.path.exists("./datasets.json")
dataset.available
['Example', 'WSe2', 'TaS2', 'Gd_W110']
DatasetsManager.remove(data_name=example_dset_name, levels=["user"])
Removed Example dataset from user datasets.json
# This should give an error
DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder"])
ValueError: Dataset Example already exists in folder datasets.json.
dataset.get("Example")
Using default data path for "Example": "<user_path>/datasets/Example"
Created new directory at <user_path>/datasets/Example
Download complete.
Extracting Example data...


100%|██████████| 4/4 [00:00<00:00, 28.10file/s]

Example data extracted successfully.
Removed Example.zip file.
Rearranging files in Example_subdir.



100%|██████████| 3/3 [00:00<00:00, 696.11file/s]

File movement complete.
Rearranging complete.
print(dataset.dir)
print(dataset.subdirs)
<user_path>/datasets/Example
[]
dataset.get("Example", root_dir = "new_datasets", use_existing = False)
print(dataset.existing_data_paths)
path_to_remove = dataset.existing_data_paths[0]
['<user_path>/new_datasets/datasets/Example', '<user_path>/datasets/Example']
dataset.remove(data_name="Example", instance=path_to_remove)
Removed <user_path>/new_datasets/datasets/Example
assert not os.path.exists(path_to_remove)
print(dataset.existing_data_paths)
['<user_path>/datasets/Example']

Default datasets.json

{
    "WSe2": {
        "url": "https://zenodo.org/record/6369728/files/WSe2.zip",
        "subdirs": [
            "Scan049_1",
            "energycal_2019_01_08"
        ]
    },
    "Gd_W110": {
        "url": "https://zenodo.org/records/10658470/files/single_event_data.zip",
        "subdirs": [
            "analysis_data",
            "calibration_data"
        ],
        "rearrange_files": true
    },
    "TaS2": {
        "url": "https://zenodo.org/records/10160182/files/TaS2.zip",
        "subdirs": [
            "Scan0121_1",
            "energycal_2020_07_20"
        ]
    },
    "Test": {
        "url": "http://test.com/files/file.zip",
        "subdirs": [
            "subdir"
        ],
        "rearrange_files": true
    }
}

API

This module provides a Dataset class to download and extract datasets from the web. These datasets are defined in a JSON file. The Dataset class provides an easy API:

from sed.dataset import dataset
dataset.get("NAME")

class sed.dataset.dataset.DatasetsManager

Bases: object

Class to manage adding and removing datasets from the JSON file.

NAME = 'datasets'
FILENAME = 'datasets.json'
json_path = {'folder': './datasets.json', 'module': '<module_path>/datasets.json', 'user': '<user_path>/.config/sed/datasets.json'}
static load_datasets_dict()

Loads the datasets configuration dictionary from the user’s datasets JSON file.

If the file does not exist, it copies the default datasets JSON file from the module directory to the user’s datasets directory.

Returns:

The datasets dict loaded from the user’s datasets JSON file.

Return type:

dict
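A minimal usage sketch (the returned keys depend on the user’s datasets.json):

from sed.dataset import DatasetsManager

datasets_dict = DatasetsManager.load_datasets_dict()
print(list(datasets_dict.keys()))    # e.g. ['WSe2', 'Gd_W110', 'TaS2', 'Test']
print(datasets_dict["WSe2"]["url"])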

static add(data_name, info, levels=['user'])

Adds a new dataset to the datasets JSON file.

Parameters:
  • data_name (str) – Name of the dataset.

  • info (dict) – Information about the dataset.

  • levels (list) – List of levels to add the dataset to. Default is [“user”].
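For example, to register the custom dataset from the tutorial above at the user level only:

DatasetsManager.add(
    data_name="Example",
    info={"url": "https://example-dataset.com/download",  # not a real URL
          "subdirs": ["Example_subdir"]},
    levels=["user"],
)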

static remove(data_name, levels=['user'])

Removes a dataset from the datasets JSON file.

Parameters:
  • data_name (str) – Name of the dataset.

  • levels (list) – List of levels to remove the dataset from. Default is [“user”].

class sed.dataset.dataset.Dataset

Bases: object

property available: list

Returns a list of available datasets.

Returns:

List of available datasets.

Return type:

list

property data_name: str

Get the data name.

property existing_data_paths: list

Get paths where dataset exists.

get(data_name, **kwargs)

Fetches the specified data and extracts it to the given data path.

Parameters:
  • data_name (str) – Name of the data to fetch.

  • root_dir (str) – Path where the data should be stored. Default is the current directory.

  • use_existing (bool) – Whether to use the existing data path. Default is True.

  • remove_zip (bool) – Whether to remove the ZIP file after extraction. Default is True.

  • ignore_zip (bool) – Whether to ignore ZIP files when listing files. Default is True.
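A combined sketch with several keywords (the target directory name is illustrative):

from sed.dataset import dataset

# store under a custom root directory and keep the downloaded zip file
dataset.get("TaS2", root_dir="my_data", use_existing=False, remove_zip=False)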

remove(data_name, instance='all')

Removes directories of all or defined instances of the specified dataset.

Parameters:
  • data_name (str) – Name of the dataset.

  • instance (str) – Name of the instance to remove. Default is “all”.
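For example, mirroring the tutorial above:

from sed.dataset import dataset

# remove one specific copy (existing_data_paths is populated after a previous get)
dataset.remove("TaS2", instance=dataset.existing_data_paths[0])
# or remove all existing copies
dataset.remove("TaS2")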