Datasets API
culicidaelab.datasets
    Dataset management components for the CulicidaeLab library.
This module provides the DatasetsManager, a high-level interface for accessing, loading, and managing datasets as defined in the application's configuration. It simplifies interactions with different data sources and providers.
__all__ = ['DatasetsManager']
  
      module-attribute
  
    
DatasetsManager
    Manages access, loading, and caching of configured datasets.
This manager provides a high-level interface that uses the global settings for configuration and a dedicated provider service for the actual data loading. This decouples the logic of what datasets are available from how they are loaded and sourced.
Attributes:
| Name | Type | Description | 
|---|---|---|
| settings | Settings | The main settings object for the library. |
| provider_service | ProviderService | The service for resolving and using data providers. |
| loaded_datasets | dict[str, str | Path] | A cache for storing the paths of downloaded datasets. | 
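
The separation described above (configuration decides which datasets exist, a provider decides how they are fetched, and a dictionary caches the result) can be illustrated with a small stand-alone sketch. This is not the library's implementation: `StubProvider`, `SketchManager`, `fetch`, and `load` are hypothetical names used only to show the pattern.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


class StubProvider:
    """Hypothetical stand-in for a data provider; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def fetch(self, name: str) -> Path:
        # A real provider would download data; here we just build a path.
        self.calls.append(name)
        return Path("cache") / name


class SketchManager:
    """Illustrative manager: the config says what exists, the provider says how to get it."""

    def __init__(self, config: dict[str, dict[str, Any]], provider: StubProvider) -> None:
        self.config = config
        self.provider = provider
        self.loaded_datasets: dict[str, str | Path] = {}

    def load(self, name: str) -> str | Path:
        if name not in self.config:
            raise KeyError(f"Dataset '{name}' is not configured.")
        # Only call the provider on a cache miss.
        if name not in self.loaded_datasets:
            self.loaded_datasets[name] = self.provider.fetch(name)
        return self.loaded_datasets[name]


provider = StubProvider()
manager = SketchManager({"classification": {"provider_name": "stub"}}, provider)
first = manager.load("classification")
second = manager.load("classification")  # served from the cache
```

Because the manager only talks to the provider through one method, swapping the data source would not require touching the lookup or caching logic.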
Source code in culicidaelab\datasets\datasets_manager.py
settings = settings
  
      instance-attribute
  
    
provider_service = ProviderService(settings)
  
      instance-attribute
  
    
loaded_datasets: dict[str, str | Path] = {}
  
      instance-attribute
  
    
__init__(settings: Settings)
    Initializes the DatasetsManager with its dependencies.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| settings | Settings | The main Settings object for the library. | required | 
Source code in culicidaelab\datasets\datasets_manager.py
              
get_dataset_info(dataset_name: str) -> DatasetConfig
    Retrieves the configuration for a specific dataset.
Example

    from culicidaelab.settings import Settings
    from culicidaelab.datasets import DatasetsManager

    settings = Settings()
    manager = DatasetsManager(settings)
    try:
        info = manager.get_dataset_info('classification')
        print(info.provider_name)
    except KeyError as e:
        print(e)

Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| dataset_name | str | The name of the dataset (e.g., 'classification'). | required | 
Returns:
| Type | Description | 
|---|---|
| DatasetConfig | A Pydantic model instance containing the dataset's validated configuration. |
Raises:
| Type | Description | 
|---|---|
| KeyError | If the specified dataset is not found in the configuration. | 
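
Since an unknown name raises `KeyError`, callers may want to turn that into a helpful message. The sketch below relies only on the documented `get_dataset_info` and `list_datasets` methods and the `provider_name` field used in the example above. `describe_dataset` is a hypothetical helper, and `StubManager` is a stand-in so the snippet runs without the library installed.

```python
from types import SimpleNamespace
from typing import Any


def describe_dataset(manager: Any, dataset_name: str) -> str:
    """Return a one-line description, or a hint listing the configured names."""
    try:
        info = manager.get_dataset_info(dataset_name)
    except KeyError:
        known = ", ".join(sorted(manager.list_datasets()))
        return f"Unknown dataset '{dataset_name}'. Configured datasets: {known}"
    return f"{dataset_name}: provider={info.provider_name}"


class StubManager:
    """Hypothetical stand-in that mimics only the two documented methods."""

    def __init__(self) -> None:
        self._configs = {
            "classification": SimpleNamespace(provider_name="stub_provider"),
            "detection": SimpleNamespace(provider_name="stub_provider"),
        }

    def get_dataset_info(self, dataset_name: str) -> SimpleNamespace:
        return self._configs[dataset_name]  # raises KeyError for unknown names

    def list_datasets(self) -> list[str]:
        return list(self._configs)


stub = StubManager()
print(describe_dataset(stub, "classification"))
print(describe_dataset(stub, "segmentation"))
```

With a real `DatasetsManager`, the same helper should work unchanged as long as the configuration object exposes `provider_name`.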
Source code in culicidaelab\datasets\datasets_manager.py
              
list_datasets() -> list[str]
    Lists all available dataset names from the configuration.
Example

    from culicidaelab.settings import Settings
    from culicidaelab.datasets import DatasetsManager

    settings = Settings()
    manager = DatasetsManager(settings)
    available_datasets = manager.list_datasets()
    print(available_datasets)

Returns:
| Type | Description | 
|---|---|
| list[str] | A list of configured dataset names. | 
Source code in culicidaelab\datasets\datasets_manager.py
              
list_loaded_datasets() -> list[str]
    Lists all datasets that have been loaded during the session.
Example

    from culicidaelab.settings import Settings
    from culicidaelab.datasets import DatasetsManager

    settings = Settings()
    manager = DatasetsManager(settings)
    _ = manager.load_dataset('classification', split='train')
    loaded = manager.list_loaded_datasets()
    print(loaded)
    # ['classification']

Returns:
| Type | Description | 
|---|---|
| list[str] | A list of names for datasets that are currently cached. | 
Source code in culicidaelab\datasets\datasets_manager.py
              
load_dataset(name: str, split: str | list[str] | None = None, config_name: str | None = 'default') -> Any
    Loads a dataset, handling complex splits and caching automatically.
Example

    from culicidaelab.settings import Settings
    from culicidaelab.datasets import DatasetsManager

    # This example assumes you have a configured settings object
    settings = Settings()
    manager = DatasetsManager(settings)

    # Load the training split of the classification dataset
    train_dataset = manager.load_dataset('classification', split='train')

    # Load all splits
    all_splits = manager.load_dataset('classification')

Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| name | str | The name of the dataset to load. | required | 
| split | str \| list[str] \| None | The split(s) to load. A str selects a single split (e.g., "train", "test"); None loads all available splits. | None |
| config_name | str | None | The name of the dataset configuration to use. Defaults to "default". | 'default' | 
Returns:
| Type | Description | 
|---|---|
| Any | The loaded dataset object; its concrete type depends on the provider and the splits requested. |
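
When several named splits are needed individually, one option is to call `load_dataset` once per split and collect the results. The sketch below uses only the documented signature (`name`, `split`, `config_name`) and `list_loaded_datasets`. `load_splits` is a hypothetical helper, and `StubManager` returns placeholder strings instead of real dataset objects so the snippet runs on its own.

```python
from __future__ import annotations

from typing import Any


def load_splits(manager: Any, name: str, splits: list[str]) -> dict[str, Any]:
    """Load each requested split separately through the `split` keyword."""
    return {split: manager.load_dataset(name, split=split) for split in splits}


class StubManager:
    """Hypothetical stand-in that mimics only the documented method signatures."""

    def __init__(self) -> None:
        self._loaded: set[str] = set()

    def load_dataset(
        self,
        name: str,
        split: str | list[str] | None = None,
        config_name: str | None = "default",
    ) -> Any:
        self._loaded.add(name)
        # Placeholder string standing in for a real dataset object.
        return f"{name}[{split or 'all'}]"

    def list_loaded_datasets(self) -> list[str]:
        return sorted(self._loaded)


manager = StubManager()
parts = load_splits(manager, "classification", ["train", "test"])
loaded_names = manager.list_loaded_datasets()
```

Passing a real `DatasetsManager` instead of the stub should behave the same way, although the returned objects will be whatever the configured provider produces.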
Source code in culicidaelab\datasets\datasets_manager.py