scitex-dataset - Multi-domain Scientific Dataset Fetcher
scitex-dataset is a unified Python interface for discovering and fetching scientific datasets across 11 repositories — neuroscience (OpenNeuro, DANDI, PhysioNet), general science (Zenodo, Figshare, OpenML, HuggingFace Hub), biology (GEO), pharmacology (MoleculeNet, ChEMBL), and medical (ClinicalTrials.gov).
Getting Started
User Guide
API Reference
- scitex_dataset API Reference
db_build()db_search()db_show_stats()filter_results()list_sources()openneuro_fetch()dandi_fetch()physionet_fetch()gin_fetch()gin_search()gin_info()gin_download()zenodo_fetch()figshare_fetch()openml_fetch()moleculenet_fetch()geo_fetch()chembl_fetch()clinicaltrials_fetch()huggingface_fetch()huggingface_search()huggingface_info()huggingface_download_file()download_dataset()- Search Module
- Database Module
- Neuroscience Sources
- General Sources
- Biology Sources
- Pharmacology Sources
- Medical Sources
- CLI
Key Features
Unified Interface: Single API across 10 catalog sources + HuggingFace Hub (on-demand)
Multi-domain: Neuroscience, biology, pharmacology, medical, and general science
Local SQLite + FTS5 Index: Build an offline-searchable cache via
db_build()CLI & Python API: Use from command line or import as a library
MCP Integration: AI agents can discover and fetch datasets via MCP server
Supported Sources
Neuroscience - OpenNeuro: BIDS-formatted neuroimaging (MRI, EEG, MEG, iEEG, PET) - DANDI Archive: NWB-formatted neurophysiology data - PhysioNet: Physiological signal databases (ECG, EEG, waveforms)
General Science - Zenodo: General scientific data repository (CERN) - Figshare: Research data sharing platform - OpenML: Machine learning datasets and benchmarks - HuggingFace Hub: ML datasets and models (on-demand)
Biology - GEO: Gene Expression Omnibus (NCBI) — transcriptomics, microarray
Pharmacology - MoleculeNet: Molecular ML benchmark suite - ChEMBL: Bioactivity database (EBI) — IC50/Ki/EC50 assays
Medical - ClinicalTrials.gov: NIH clinical study registry
Quick Example
Python API:
from scitex_dataset import openneuro_fetch, filter_results, db_build, db_search
# Fetch from OpenNeuro
records = openneuro_fetch(max_datasets=100)
# Filter + rank in memory
top = filter_results(records, modality="eeg", min_subjects=20, sort_by="downloads", limit=10)
# Build the local SQLite + FTS5 index for offline queries
db_build()
results = db_search("Alzheimer EEG")
CLI:
# Fetch datasets
scitex-dataset neuroscience openneuro fetch -n 50
# Search for datasets
scitex-dataset general huggingface search "biology" -n 20
# Build/search the local database
scitex-dataset db build
scitex-dataset db search "Alzheimer EEG"
# List available sources (embedded in --help)
scitex-dataset --help