Datasets API

culicidaelab.datasets

Dataset management components for the CulicidaeLab library.

This module provides the DatasetsManager, a high-level interface for accessing, loading, and managing datasets as defined in the application's configuration. It simplifies interactions with different data sources and providers.

__all__ = ['DatasetsManager'] module-attribute
DatasetsManager

Manages access, loading, and caching of configured datasets.

This manager provides a high-level interface that uses the global settings for configuration and a dedicated provider service for the actual data loading. This decouples the logic of what datasets are available from how they are loaded and sourced.

Attributes:

Name Type Description
settings

The main settings object for the library.

provider_service

The service for resolving and using data providers.

loaded_datasets dict[str, str | Path]

An in-memory cache that maps dataset names to the paths of downloaded datasets.
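The decoupling described above can be illustrated with a small sketch. Judging only from the source listing for this class, the manager calls `get_config` and `list_datasets` on the settings object, `get_provider` on the provider service, and `download_dataset` and `load_dataset` on the provider it receives. The `*Like` protocols and `InMemory*` classes below are hypothetical names invented for this illustration and are not part of culicidaelab; the real `Settings` and `ProviderService` classes may expose more than this.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import Any, Optional, Protocol, Union, runtime_checkable


@runtime_checkable
class SettingsLike(Protocol):
    """Hypothetical: the settings methods the manager appears to call."""

    def get_config(self, path: str) -> Any: ...
    def list_datasets(self) -> list: ...


@runtime_checkable
class ProviderLike(Protocol):
    """Hypothetical: the provider methods the manager appears to call."""

    def download_dataset(
        self, dataset_name: str, split: Optional[str] = None, **kwargs: Any
    ) -> Union[str, Path]: ...

    def load_dataset(
        self, dataset_path: Union[str, Path], split: Optional[str] = None, **kwargs: Any
    ) -> Any: ...


@runtime_checkable
class ProviderServiceLike(Protocol):
    """Hypothetical: the provider-service method the manager appears to call."""

    def get_provider(self, provider_name: str) -> Any: ...


class InMemorySettings:
    """Stand-in settings: knows WHAT datasets exist, not how to fetch them."""

    def __init__(self, datasets: dict):
        self._datasets = datasets

    def get_config(self, path: str) -> Any:
        section, _, name = path.partition(".")
        return self._datasets.get(name) if section == "datasets" else None

    def list_datasets(self) -> list:
        return list(self._datasets)


class InMemoryProvider:
    """Stand-in provider: knows HOW to fetch and load, not what is configured."""

    def download_dataset(self, dataset_name, split=None, **kwargs):
        return Path("datasets") / dataset_name

    def load_dataset(self, dataset_path, split=None, **kwargs):
        return {"path": str(dataset_path), "split": split}


class InMemoryProviderService:
    """Stand-in service that always resolves to the same provider."""

    def __init__(self):
        self._provider = InMemoryProvider()

    def get_provider(self, provider_name: str) -> InMemoryProvider:
        return self._provider


settings = InMemorySettings(
    {"classification": SimpleNamespace(provider_name="in_memory")}
)
provider_service = InMemoryProviderService()

# runtime_checkable protocols only verify that the methods exist.
checks = [
    isinstance(settings, SettingsLike),
    isinstance(provider_service, ProviderServiceLike),
    isinstance(provider_service.get_provider("in_memory"), ProviderLike),
]
print(checks)
```

Because the manager only stores these two objects and calls the methods listed above, stand-ins of this shape are one plausible way to exercise it in tests without touching the network.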

Source code in culicidaelab/datasets/datasets_manager.py
class DatasetsManager:
    """Manages access, loading, and caching of configured datasets.

    This manager provides a high-level interface that uses the global settings
    for configuration and a dedicated provider service for the actual data
    loading. This decouples the logic of what datasets are available from how
    they are loaded and sourced.

    Attributes:
        settings: The main settings object for the library.
        provider_service: The service for resolving and using data providers.
        loaded_datasets: A cache for storing the paths of downloaded datasets.
    """

    def __init__(self, settings: Settings, provider_service: ProviderService):
        """Initializes the DatasetsManager with its dependencies.

        Args:
            settings (Settings): The main Settings object for the library.
            provider_service (ProviderService): The ProviderService for resolving
                dataset paths and loading data.
        """
        self.settings = settings
        self.provider_service = provider_service
        self.loaded_datasets: dict[str, str | Path] = {}

    def get_dataset_info(self, dataset_name: str) -> DatasetConfig:
        """Retrieves the configuration for a specific dataset.

        Args:
            dataset_name (str): The name of the dataset (e.g., 'classification').

        Returns:
            DatasetConfig: A Pydantic model instance containing the dataset's
                validated configuration.

        Raises:
            KeyError: If the specified dataset is not found in the configuration.

        Example:
            >>> manager = DatasetsManager(settings, provider_service)
            >>> try:
            ...     info = manager.get_dataset_info('classification')
            ...     print(info.provider_name)
            ... except KeyError as e:
            ...     print(e)
        """
        dataset_config = self.settings.get_config(f"datasets.{dataset_name}")
        if not dataset_config:
            raise KeyError(f"Dataset '{dataset_name}' not found in configuration.")
        return dataset_config

    def list_datasets(self) -> list[str]:
        """Lists all available dataset names from the configuration.

        Returns:
            list[str]: A list of configured dataset names.

        Example:
            >>> manager = DatasetsManager(settings, provider_service)
            >>> available_datasets = manager.list_datasets()
            >>> print(available_datasets)
        """
        return self.settings.list_datasets()

    def list_loaded_datasets(self) -> list[str]:
        """Lists all datasets that have been loaded during the session.

        Returns:
            list[str]: A list of names for datasets that are currently cached.

        Example:
            >>> manager = DatasetsManager(settings, provider_service)
            >>> _ = manager.load_dataset('classification', split='train')
            >>> loaded = manager.list_loaded_datasets()
            >>> print(loaded)
            ['classification']
        """
        return list(self.loaded_datasets.keys())

    def load_dataset(self, dataset_name: str, split: str | None = None, **kwargs: Any) -> Any:
        """Loads a specific dataset, downloading it if not already cached.

        This method first checks a local cache for the dataset path. If the
        dataset is not cached, it resolves the path using the settings,
        instructs the appropriate provider to download it, and caches the path.
        Finally, it uses the provider to load the dataset into memory.

        Args:
            dataset_name (str): The name of the dataset to load.
            split (str, optional): The specific dataset split to load (e.g.,
                'train', 'test'). Defaults to None.
            **kwargs (Any): Additional keyword arguments to pass to the provider's
                dataset loading function.

        Returns:
            Any: The loaded dataset object, with its type depending on the provider.

        Raises:
            KeyError: If the dataset configuration does not exist.
        """
        dataset_config = self.get_dataset_info(dataset_name)
        provider = self.provider_service.get_provider(dataset_config.provider_name)

        if dataset_name in self.loaded_datasets:
            dataset_path = self.loaded_datasets[dataset_name]
        else:
            print(f"Dataset '{dataset_name}' not in cache. Downloading...")
            dataset_path = provider.download_dataset(dataset_name, split=split, **kwargs)
            self.loaded_datasets[dataset_name] = dataset_path
            print(f"Dataset '{dataset_name}' downloaded and path cached.")

        print(f"Loading '{dataset_name}' from path: {dataset_path}")
        dataset = provider.load_dataset(dataset_path, split=split, **kwargs)
        print(f"Dataset '{dataset_name}' loaded successfully.")

        return dataset
settings = settings instance-attribute
provider_service = provider_service instance-attribute
loaded_datasets: dict[str, str | Path] = {} instance-attribute
__init__(settings: Settings, provider_service: ProviderService)

Initializes the DatasetsManager with its dependencies.

Parameters:

Name Type Description Default
settings Settings

The main Settings object for the library.

required
provider_service ProviderService

The ProviderService for resolving dataset paths and loading data.

required
Source code in culicidaelab/datasets/datasets_manager.py
def __init__(self, settings: Settings, provider_service: ProviderService):
    """Initializes the DatasetsManager with its dependencies.

    Args:
        settings (Settings): The main Settings object for the library.
        provider_service (ProviderService): The ProviderService for resolving
            dataset paths and loading data.
    """
    self.settings = settings
    self.provider_service = provider_service
    self.loaded_datasets: dict[str, str | Path] = {}
get_dataset_info(dataset_name: str) -> DatasetConfig

Retrieves the configuration for a specific dataset.

Parameters:

Name Type Description Default
dataset_name str

The name of the dataset (e.g., 'classification').

required

Returns:

Name Type Description
DatasetConfig DatasetConfig

A Pydantic model instance containing the dataset's validated configuration.

Raises:

Type Description
KeyError

If the specified dataset is not found in the configuration.

Example

    >>> manager = DatasetsManager(settings, provider_service)
    >>> try:
    ...     info = manager.get_dataset_info('classification')
    ...     print(info.provider_name)
    ... except KeyError as e:
    ...     print(e)
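As a rough illustration of the lookup, the sketch below repeats the two steps visible in the method's source: a dotted-key query of the form `datasets.<name>`, followed by a truthiness check that turns an empty result into a `KeyError`. `DictSettings`, the standalone `get_dataset_info` function, and the provider name `example_provider` are hypothetical stand-ins so the snippet runs without the library. The sketch also assumes that a missing key yields `None`; the real `Settings.get_config` may behave differently.

```python
from types import SimpleNamespace
from typing import Any


class DictSettings:
    """Hypothetical stand-in for Settings, backed by a nested dict."""

    def __init__(self, config: dict):
        self._config = config

    def get_config(self, path: str) -> Any:
        # Walk the dotted path one key at a time; assume None when a key is missing.
        node: Any = self._config
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        return node


def get_dataset_info(settings: DictSettings, dataset_name: str) -> Any:
    # Same two steps as the method: dotted lookup, then a truthiness check.
    dataset_config = settings.get_config(f"datasets.{dataset_name}")
    if not dataset_config:
        raise KeyError(f"Dataset '{dataset_name}' not found in configuration.")
    return dataset_config


settings = DictSettings(
    {"datasets": {"classification": SimpleNamespace(provider_name="example_provider")}}
)

info = get_dataset_info(settings, "classification")
print(info.provider_name)

try:
    get_dataset_info(settings, "segmentation_typo")
except KeyError as exc:
    message = exc.args[0]
    print(message)
```

Note that the check is `if not dataset_config`, so any falsy value returned by the settings lookup, not only a missing key, would raise the same `KeyError`.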

Source code in culicidaelab/datasets/datasets_manager.py
def get_dataset_info(self, dataset_name: str) -> DatasetConfig:
    """Retrieves the configuration for a specific dataset.

    Args:
        dataset_name (str): The name of the dataset (e.g., 'classification').

    Returns:
        DatasetConfig: A Pydantic model instance containing the dataset's
            validated configuration.

    Raises:
        KeyError: If the specified dataset is not found in the configuration.

    Example:
        >>> manager = DatasetsManager(settings, provider_service)
        >>> try:
        ...     info = manager.get_dataset_info('classification')
        ...     print(info.provider_name)
        ... except KeyError as e:
        ...     print(e)
    """
    dataset_config = self.settings.get_config(f"datasets.{dataset_name}")
    if not dataset_config:
        raise KeyError(f"Dataset '{dataset_name}' not found in configuration.")
    return dataset_config
list_datasets() -> list[str]

Lists all available dataset names from the configuration.

Returns:

Type Description
list[str]

A list of configured dataset names.

Example

    >>> manager = DatasetsManager(settings, provider_service)
    >>> available_datasets = manager.list_datasets()
    >>> print(available_datasets)

Source code in culicidaelab/datasets/datasets_manager.py
def list_datasets(self) -> list[str]:
    """Lists all available dataset names from the configuration.

    Returns:
        list[str]: A list of configured dataset names.

    Example:
        >>> manager = DatasetsManager(settings, provider_service)
        >>> available_datasets = manager.list_datasets()
        >>> print(available_datasets)
    """
    return self.settings.list_datasets()
list_loaded_datasets() -> list[str]

Lists all datasets that have been loaded during the session.

Returns:

Type Description
list[str]

A list of names for datasets that are currently cached.

Example

    >>> manager = DatasetsManager(settings, provider_service)
    >>> _ = manager.load_dataset('classification', split='train')
    >>> loaded = manager.list_loaded_datasets()
    >>> print(loaded)
    ['classification']

Source code in culicidaelab/datasets/datasets_manager.py
def list_loaded_datasets(self) -> list[str]:
    """Lists all datasets that have been loaded during the session.

    Returns:
        list[str]: A list of names for datasets that are currently cached.

    Example:
        >>> manager = DatasetsManager(settings, provider_service)
        >>> _ = manager.load_dataset('classification', split='train')
        >>> loaded = manager.list_loaded_datasets()
        >>> print(loaded)
        ['classification']
    """
    return list(self.loaded_datasets.keys())
load_dataset(dataset_name: str, split: str | None = None, **kwargs: Any) -> Any

Loads a specific dataset, downloading it if not already cached.

This method looks up the dataset's configuration in the settings and resolves the matching provider. If no path for the dataset is cached in loaded_datasets, it instructs the provider to download the dataset and caches the returned path under the dataset name. Finally, it uses the provider to load the dataset from that path into memory.

Parameters:

Name Type Description Default
dataset_name str

The name of the dataset to load.

required
split str | None

The specific dataset split to load (e.g., 'train', 'test'). Defaults to None.

None
**kwargs Any

Additional keyword arguments to pass to the provider's dataset loading function.

{}

Returns:

Name Type Description
Any Any

The loaded dataset object, with its type depending on the provider.

Raises:

Type Description
KeyError

If the dataset configuration does not exist.
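The download-once behaviour described above can be sketched with a condensed copy of the method's control flow. `SketchManager` and `CountingProvider` are hypothetical stand-ins written for this illustration (the configuration lookup and progress messages are left out), and the path layout returned by the fake provider is an assumption, not how any real provider stores data.

```python
from pathlib import Path


class CountingProvider:
    """Hypothetical provider that counts how often a download is requested."""

    def __init__(self):
        self.download_calls = 0

    def download_dataset(self, dataset_name, split=None, **kwargs):
        self.download_calls += 1
        # Assumed layout: one directory per dataset name.
        return Path("datasets") / dataset_name

    def load_dataset(self, dataset_path, split=None, **kwargs):
        return {"path": Path(dataset_path), "split": split}


class SketchManager:
    """Condensed copy of the load_dataset control flow from the source listing."""

    def __init__(self, provider):
        self.provider = provider
        self.loaded_datasets = {}  # dataset name -> downloaded path

    def load_dataset(self, dataset_name, split=None, **kwargs):
        if dataset_name in self.loaded_datasets:
            # Cache hit: reuse the stored path and skip the download.
            dataset_path = self.loaded_datasets[dataset_name]
        else:
            dataset_path = self.provider.download_dataset(
                dataset_name, split=split, **kwargs
            )
            self.loaded_datasets[dataset_name] = dataset_path
        # Loading always goes through the provider, cached path or not.
        return self.provider.load_dataset(dataset_path, split=split, **kwargs)


provider = CountingProvider()
manager = SketchManager(provider)

train = manager.load_dataset("classification", split="train")
test = manager.load_dataset("classification", split="test")

print(provider.download_calls)          # one download for two loads
print(train["path"] == test["path"])    # both loads read from the same path
print(list(manager.loaded_datasets))    # names currently cached
```

In the source listing the cache is keyed by dataset name alone, so a second call with a different `split` reuses the path obtained by the first call. Whether that matters depends on whether a given provider's download path varies by split.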

Source code in culicidaelab/datasets/datasets_manager.py
def load_dataset(self, dataset_name: str, split: str | None = None, **kwargs: Any) -> Any:
    """Loads a specific dataset, downloading it if not already cached.

    This method first checks a local cache for the dataset path. If the
    dataset is not cached, it resolves the path using the settings,
    instructs the appropriate provider to download it, and caches the path.
    Finally, it uses the provider to load the dataset into memory.

    Args:
        dataset_name (str): The name of the dataset to load.
        split (str, optional): The specific dataset split to load (e.g.,
            'train', 'test'). Defaults to None.
        **kwargs (Any): Additional keyword arguments to pass to the provider's
            dataset loading function.

    Returns:
        Any: The loaded dataset object, with its type depending on the provider.

    Raises:
        KeyError: If the dataset configuration does not exist.
    """
    dataset_config = self.get_dataset_info(dataset_name)
    provider = self.provider_service.get_provider(dataset_config.provider_name)

    if dataset_name in self.loaded_datasets:
        dataset_path = self.loaded_datasets[dataset_name]
    else:
        print(f"Dataset '{dataset_name}' not in cache. Downloading...")
        dataset_path = provider.download_dataset(dataset_name, split=split, **kwargs)
        self.loaded_datasets[dataset_name] = dataset_path
        print(f"Dataset '{dataset_name}' downloaded and path cached.")

    print(f"Loading '{dataset_name}' from path: {dataset_path}")
    dataset = provider.load_dataset(dataset_path, split=split, **kwargs)
    print(f"Dataset '{dataset_name}' loaded successfully.")

    return dataset
