Datasets¶
This document provides comprehensive information about the datasets used in the CulicidaeLab Server platform, including sample data, training datasets, and data collection methodologies.
Overview¶
CulicidaeLab utilizes multiple datasets to support mosquito species identification, disease mapping, and ecological research. The platform combines curated sample data with real-world observations to provide comprehensive mosquito surveillance capabilities.
Sample Datasets¶
Species Dataset¶
The species dataset contains comprehensive information about mosquito species worldwide, with multilingual support and detailed taxonomic information.
Dataset Structure: - Records: 17 mosquito species across 4 genera - Languages: English and Russian localization - Fields: 16 attributes per species including taxonomy, ecology, and disease relationships
Key Species Included:
Aedes Genus (8 species)¶
- Aedes aegypti - Yellow Fever Mosquito
- Aedes albopictus - Asian Tiger Mosquito
- Aedes canadensis - Canada Mosquito
- Aedes dorsalis - Coastal Rock Pool Mosquito
- Aedes geniculatus - Treehole Mosquito
- Aedes koreicus - Korean Bush Mosquito
- Aedes triseriatus - Eastern Treehole Mosquito
- Aedes vexans - Inland Floodwater Mosquito
Anopheles Genus (3 species)¶
- Anopheles arabiensis - Arabian Malaria Mosquito
- Anopheles freeborni - Western Malaria Mosquito
- Anopheles sinensis - Chinese Malaria Mosquito
Culex Genus (4 species)¶
- Culex inatomii
- Culex pipiens - Common House Mosquito
- Culex quinquefasciatus - Southern House Mosquito
- Culex tritaeniorhynchus - Japanese Encephalitis Mosquito
Culiseta Genus (2 species)¶
- Culiseta annulata - Ringed Mosquito
- Culiseta longiareolata - Striped Mosquito
Data Attributes:
{
"id": "species_identifier",
"scientific_name": "Genus species",
"vector_status": "High|Moderate|Low",
"image_url": "path/to/species/image",
"common_name_en": "English common name",
"common_name_ru": "Russian common name",
"description_en": "English description",
"description_ru": "Russian description",
"key_characteristics_en": ["characteristic1", "characteristic2"],
"key_characteristics_ru": ["характеристика1", "характеристика2"],
"habitat_preferences_en": ["habitat1", "habitat2"],
"habitat_preferences_ru": ["среда1", "среда2"],
"geographic_regions": ["region1", "region2"],
"related_diseases": ["disease_id1", "disease_id2"]
}
Disease Dataset¶
The disease dataset contains information about mosquito-borne diseases with comprehensive medical and epidemiological data.
Dataset Structure: - Records: 13 major mosquito-borne diseases - Languages: English and Russian medical terminology - Coverage: Global disease distribution and vector relationships
Diseases Included:
Viral Diseases¶
- Dengue Fever - Transmitted by Aedes aegypti, Aedes albopictus
- Zika Virus - Transmitted by Aedes aegypti, Aedes albopictus
- Chikungunya - Transmitted by Aedes aegypti, Aedes albopictus
- Yellow Fever - Transmitted by Aedes aegypti
- West Nile Virus - Transmitted by Culex pipiens, Culex quinquefasciatus
- Japanese Encephalitis - Transmitted by Culex tritaeniorhynchus
- Eastern Equine Encephalitis - Transmitted by Aedes canadensis
- St. Louis Encephalitis - Transmitted by Culex pipiens, Culex quinquefasciatus
- La Crosse Encephalitis - Transmitted by Aedes triseriatus
- Rift Valley Fever - Transmitted by multiple Culex and Aedes species
Parasitic Diseases¶
- Malaria - Transmitted by Anopheles species
- Filariasis - Transmitted by Culex quinquefasciatus, Aedes aegypti
- Avian Malaria - Transmitted by Culex species
Medical Information Fields: - Symptoms and clinical presentation - Treatment protocols and medications - Prevention strategies - Epidemiological data and prevalence - Geographic distribution - Vector species relationships
Observation Dataset¶
The observation dataset contains field observation records with geospatial information and metadata.
Dataset Format: GeoJSON Feature Collection Coordinate System: WGS84 (EPSG:4326) Temporal Coverage: Configurable date ranges
GeoJSON Structure:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {
"id": "unique_observation_id",
"species_scientific_name": "Genus species",
"observed_at": "ISO_datetime_string",
"count": "number_of_specimens",
"observer_id": "observer_identifier",
"data_source": "source_information",
"location_accuracy_m": "accuracy_in_meters",
"notes": "observation_notes",
"image_filename": "associated_image_file",
"model_id": "ai_model_identifier",
"confidence": "prediction_confidence_score",
"metadata": "additional_json_metadata"
},
"geometry": {
"type": "Point",
"coordinates": [longitude, latitude]
}
}
]
}
Geographic Datasets¶
Regions Dataset¶
- Administrative Boundaries: Country, state/province, and local regions
- Ecological Zones: Biomes, climate zones, and habitat classifications
- Multilingual Names: English and Russian region names
- Hierarchical Structure: Nested geographic relationships
Data Sources Dataset¶
- Research Institutions: Universities and research organizations
- Government Agencies: Health departments and environmental agencies
- Citizen Science: Community-contributed observation platforms
- Literature Sources: Published research and survey data
Training Datasets¶
Image Classification Dataset¶
The AI model training utilizes curated image datasets from the culicidaelab library:
Dataset Characteristics: - Species Coverage: 17+ mosquito species - Image Quality: High-resolution microscopy and field photography - Standardization: Consistent lighting, background, and orientation - Augmentation: Synthetic variations for improved model robustness
Training/Validation Split: - Training Set: 70% of images for model learning - Validation Set: 15% for hyperparameter tuning - Test Set: 15% for final performance evaluation
Data Augmentation Techniques: - Rotation and flipping transformations - Color space adjustments - Noise injection and blur effects - Scale and crop variations
Model Performance Datasets¶
Benchmark Datasets: - Accuracy Testing: Curated test sets with expert annotations - Confidence Calibration: Datasets for confidence score validation - Cross-Validation: Multiple dataset splits for robust evaluation - Real-World Testing: Field images for practical performance assessment
Data Collection Methodology¶
Field Observation Protocols¶
Standardized Collection¶
- GPS Coordinates: Precise location recording (±5m accuracy)
- Temporal Data: Date, time, and environmental conditions
- Specimen Counts: Quantitative abundance measurements
- Photography: Standardized imaging protocols
- Metadata: Observer information and collection methods
Quality Assurance¶
- Expert Validation: Taxonomic verification by specialists
- Data Verification: Cross-checking of observation records
- Outlier Detection: Statistical analysis for anomalous data
- Completeness Checks: Validation of required fields
Image Dataset Curation¶
Collection Standards¶
- Resolution Requirements: Minimum pixel dimensions for analysis
- Focus Quality: Sharpness and clarity standards
- Lighting Conditions: Consistent illumination protocols
- Background Standards: Neutral backgrounds for feature extraction
Annotation Process¶
- Expert Labeling: Species identification by taxonomists
- Multi-Reviewer Validation: Independent verification process
- Confidence Scoring: Annotation certainty levels
- Morphological Features: Detailed anatomical annotations
Data Quality and Validation¶
Quality Metrics¶
Completeness¶
- Field Coverage: Percentage of required fields populated
- Geographic Coverage: Spatial distribution of observations
- Temporal Coverage: Time series completeness
- Species Representation: Balanced coverage across taxa
Accuracy¶
- Taxonomic Validation: Expert verification of species identifications
- Coordinate Accuracy: GPS precision and validation
- Temporal Accuracy: Date/time verification protocols
- Image Quality: Technical quality assessments
Consistency¶
- Naming Conventions: Standardized taxonomic nomenclature
- Unit Standardization: Consistent measurement units
- Format Compliance: Schema adherence validation
- Cross-Reference Integrity: Relationship consistency checks
Validation Procedures¶
Automated Validation¶
- Schema Validation: PyArrow schema compliance checking
- Range Validation: Acceptable value range verification
- Format Validation: Data type and structure verification
- Relationship Validation: Foreign key integrity checks
Manual Review¶
- Expert Review: Specialist validation of complex records
- Statistical Analysis: Outlier detection and trend analysis
- Cross-Validation: Independent verification processes
- Feedback Integration: User-reported corrections and updates
Data Usage and Licensing¶
Usage Guidelines¶
Research Applications¶
- Academic Research: Open access for educational institutions
- Commercial Use: Licensing terms for commercial applications
- Attribution Requirements: Proper citation and acknowledgment
- Modification Rights: Permissions for data enhancement
Privacy and Ethics¶
- Personal Data: Protection of observer personal information
- Location Privacy: Coordinate precision limitations for sensitive areas
- Consent Management: Observer consent for data sharing
- Ethical Guidelines: Compliance with research ethics standards
Data Sharing Protocols¶
API Access¶
- Rate Limiting: Request throttling for fair usage
- Authentication: Secure access control mechanisms
- Format Options: Multiple export formats (JSON, CSV, GeoJSON)
- Filtering Capabilities: Query-based data subset access
Bulk Downloads¶
- Dataset Packages: Complete dataset downloads
- Version Control: Timestamped dataset releases
- Change Logs: Documentation of dataset updates
- Integrity Verification: Checksums and validation tools
Future Dataset Enhancements¶
Planned Expansions¶
Species Coverage¶
- Additional Genera: Expansion to other mosquito genera
- Regional Variants: Subspecies and geographic variants
- Life Stages: Egg, larva, pupa, and adult stage data
- Morphological Variants: Sexual dimorphism and seasonal variations
Geographic Expansion¶
- Global Coverage: Worldwide species distribution data
- Climate Data Integration: Environmental parameter correlation
- Habitat Modeling: Ecological niche modeling datasets
- Temporal Dynamics: Seasonal and annual variation data
Technology Integration¶
- Molecular Data: Genetic sequences and phylogenetic information
- Acoustic Data: Wing beat frequency and sound signatures
- Behavioral Data: Flight patterns and feeding behavior
- Environmental Sensors: Real-time environmental monitoring data
Data Infrastructure Improvements¶
Performance Optimization¶
- Indexing Strategies: Advanced database indexing for faster queries
- Caching Systems: Intelligent data caching for improved response times
- Compression Techniques: Efficient storage of large datasets
- Distributed Storage: Scalable storage architecture
Integration Capabilities¶
- External APIs: Integration with global biodiversity databases
- Real-time Feeds: Live data streaming from monitoring networks
- Collaborative Platforms: Integration with citizen science platforms
- Research Networks: Connection to international research consortiums