Datasets API
culicidaelab.datasets
Dataset management components for the CulicidaeLab library.
This module provides the DatasetsManager, a high-level interface for accessing, loading, and managing datasets as defined in the application's configuration. It simplifies interactions with different data sources and providers.
__all__ = ['DatasetsManager']
module-attribute
DatasetsManager
Manages access, loading, and caching of configured datasets.
This manager provides a high-level interface that uses the global settings for configuration and a dedicated provider service for the actual data loading. This decouples the logic of what datasets are available from how they are loaded and sourced.
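The separation described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: `StubDatasetConfig`, `StubSettings`, `StubProviderService`, `SketchManager`, and the helper methods `dataset_names`, `get_config`, and `get_provider` are hypothetical names invented for illustration. Only the attribute names (`settings`, `provider_service`, `loaded_datasets`) and the `provider_name` field come from this page.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class StubDatasetConfig:
    """Hypothetical stand-in for DatasetConfig (only provider_name is documented)."""

    name: str
    provider_name: str


class StubSettings:
    """Hypothetical settings object: knows WHAT datasets are configured."""

    def __init__(self, configs: dict[str, StubDatasetConfig]) -> None:
        self._configs = configs

    def dataset_names(self) -> list[str]:
        return sorted(self._configs)

    def get_config(self, name: str) -> Optional[StubDatasetConfig]:
        return self._configs.get(name)


class StubProviderService:
    """Hypothetical provider service: knows HOW data is fetched and loaded."""

    def __init__(self, providers: dict[str, Any]) -> None:
        self._providers = providers

    def get_provider(self, provider_name: str) -> Any:
        return self._providers[provider_name]


class SketchManager:
    """Toy manager mirroring the documented attributes, not the real class."""

    def __init__(self, settings: StubSettings, provider_service: StubProviderService) -> None:
        self.settings = settings
        self.provider_service = provider_service
        self.loaded_datasets: dict[str, str] = {}

    def list_datasets(self) -> list[str]:
        # Availability is answered purely from configuration.
        return self.settings.dataset_names()

    def get_dataset_info(self, dataset_name: str) -> StubDatasetConfig:
        config = self.settings.get_config(dataset_name)
        if config is None:
            raise KeyError(f"Dataset '{dataset_name}' not found in configuration.")
        return config


settings = StubSettings(
    {"classification": StubDatasetConfig("classification", "example_provider")}
)
manager = SketchManager(settings, StubProviderService({"example_provider": object()}))

names = manager.list_datasets()
info = manager.get_dataset_info("classification")
```

Because the manager only holds references to its two collaborators, either one can be swapped (for example, with a stub in tests) without touching the other.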
Attributes:
Name | Type | Description |
---|---|---|
settings | Settings | The main settings object for the library. |
provider_service | ProviderService | The service for resolving and using data providers. |
loaded_datasets | dict[str, str \| Path] | A cache for storing the paths of downloaded datasets. |
Source code in culicidaelab/datasets/datasets_manager.py
settings = settings
instance-attribute
provider_service = provider_service
instance-attribute
loaded_datasets: dict[str, str | Path] = {}
instance-attribute
__init__(settings: Settings, provider_service: ProviderService)
Initializes the DatasetsManager with its dependencies.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
settings | Settings | The main Settings object for the library. | required |
provider_service | ProviderService | The ProviderService for resolving dataset paths and loading data. | required |
get_dataset_info(dataset_name: str) -> DatasetConfig
Retrieves the configuration for a specific dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_name | str | The name of the dataset (e.g., 'classification'). | required |
Returns:
Name | Type | Description |
---|---|---|
DatasetConfig | DatasetConfig | A Pydantic model instance containing the dataset's validated configuration. |
Raises:
Type | Description |
---|---|
KeyError | If the specified dataset is not found in the configuration. |
Example
    >>> manager = DatasetsManager(settings, provider_service)
    >>> try:
    ...     info = manager.get_dataset_info('classification')
    ...     print(info.provider_name)
    ... except KeyError as e:
    ...     print(e)
list_datasets() -> list[str]
Lists all available dataset names from the configuration.
Returns:
Type | Description |
---|---|
list[str] | A list of configured dataset names. |
Example
    >>> manager = DatasetsManager(settings, provider_service)
    >>> available_datasets = manager.list_datasets()
    >>> print(available_datasets)
list_loaded_datasets() -> list[str]
Lists all datasets that have been loaded during the session.
Returns:
Type | Description |
---|---|
list[str] | A list of names for datasets that are currently cached. |
Example
    >>> manager = DatasetsManager(settings, provider_service)
    >>> _ = manager.load_dataset('classification', split='train')
    >>> loaded = manager.list_loaded_datasets()
    >>> print(loaded)
    ['classification']
load_dataset(dataset_name: str, split: str | None = None, **kwargs: Any) -> Any
Loads a specific dataset, downloading it if not already cached.
This method first checks a local cache for the dataset path. If the dataset is not cached, it resolves the path using the settings, instructs the appropriate provider to download it, and caches the path. Finally, it uses the provider to load the dataset into memory.
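The flow described above (check the cache, download on a miss, then load through the provider) can be sketched as follows. This is a simplified stand-in under stated assumptions, not the actual method body: `StubProvider`, its `download_dataset` and `load_dataset` methods, the `/tmp/datasets/...` path, and the `SketchManager` class are all hypothetical. The sketch also assumes the cache is keyed by dataset name only, which is consistent with the `list_loaded_datasets` example but not confirmed by this page.

```python
from typing import Any, Optional


class StubProvider:
    """Hypothetical provider that records calls instead of touching the network."""

    def __init__(self) -> None:
        self.download_calls = 0
        self.load_calls: list[tuple[str, Optional[str], dict[str, Any]]] = []

    def download_dataset(self, dataset_name: str, split: Optional[str] = None, **kwargs: Any) -> str:
        self.download_calls += 1
        return f"/tmp/datasets/{dataset_name}"  # made-up local path

    def load_dataset(self, dataset_path: str, split: Optional[str] = None, **kwargs: Any) -> dict[str, Any]:
        self.load_calls.append((dataset_path, split, kwargs))
        return {"path": dataset_path, "split": split}


class SketchManager:
    """Toy manager illustrating the documented cache-then-load flow."""

    def __init__(self, configs: dict[str, str], providers: dict[str, StubProvider]) -> None:
        self.configs = configs  # dataset name -> provider name (assumed shape)
        self.providers = providers
        self.loaded_datasets: dict[str, str] = {}

    def load_dataset(self, dataset_name: str, split: Optional[str] = None, **kwargs: Any) -> Any:
        # 1. Unknown dataset names fail early with KeyError.
        if dataset_name not in self.configs:
            raise KeyError(f"Dataset '{dataset_name}' not found in configuration.")
        provider = self.providers[self.configs[dataset_name]]

        # 2. Reuse the cached path, or download once and remember the path.
        if dataset_name in self.loaded_datasets:
            path = self.loaded_datasets[dataset_name]
        else:
            path = provider.download_dataset(dataset_name, split=split, **kwargs)
            self.loaded_datasets[dataset_name] = path

        # 3. Always load into memory through the provider, forwarding kwargs.
        return provider.load_dataset(path, split=split, **kwargs)

    def list_loaded_datasets(self) -> list[str]:
        return list(self.loaded_datasets)


provider = StubProvider()
manager = SketchManager(
    {"classification": "example_provider"},
    {"example_provider": provider},
)

first = manager.load_dataset("classification", split="train")
second = manager.load_dataset("classification", split="test", shuffle=True)
```

In this sketch the second call skips the download step because the path is already cached, while the provider's load step still runs for each call and receives the extra keyword argument unchanged.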
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_name | str | The name of the dataset to load. | required |
split | str \| None | The specific dataset split to load (e.g., 'train', 'test'). Defaults to None. | None |
**kwargs | Any | Additional keyword arguments to pass to the provider's dataset loading function. | {} |
Returns:
Name | Type | Description |
---|---|---|
Any | Any | The loaded dataset object, with its type depending on the provider. |
Raises:
Type | Description |
---|---|
KeyError | If the dataset configuration does not exist. |