Quickstart
This guide gets you to a first dataset fetch in under a minute, then shows the local index and the HuggingFace flow.
Install
pip install scitex-dataset
# Optional extras:
pip install "scitex-dataset[huggingface]" # HF Hub access
pip install "scitex-dataset[mcp]" # MCP server
Python API
The package exposes per-source <src>_fetch aliases plus a unified
filter_results helper. Every alias has a matching MCP tool of the
same name (under the dataset namespace).
from scitex_dataset import (
openneuro_fetch, dandi_fetch, hf_search,
filter_results, list_sources,
db_build, db_search,
)
# 1) Fetch from a catalog source.
records = openneuro_fetch(max_datasets=200)
# 2) Filter + rank in memory.
top = filter_results(
records, modality="eeg", min_subjects=20,
sort_by="downloads", limit=10,
)
# 3) Search HuggingFace Hub directly.
hf_hits = hf_search("biology", limit=20)
# 4) Build the local SQLite + FTS5 index for offline queries.
db_build()
db_search("Alzheimer EEG")
# 5) Inspect the supported sources.
list_sources()["count"] # 11
CLI
The CLI uses the SciTeX 3-level grammar
scitex-dataset <domain> <dataset> <action>:
# Catalog fetch
scitex-dataset neuroscience openneuro fetch -n 50 -o openneuro.json
scitex-dataset general zenodo fetch -q "neuroscience" -n 20
# HuggingFace
scitex-dataset general huggingface fetch Anthropic/BioMysteryBench-full
scitex-dataset general huggingface search "biology" -n 10 --json
# Local index
scitex-dataset db build
scitex-dataset db search "Alzheimer EEG"
scitex-dataset db show-stats
See CLI Reference for the full subtree, or run
scitex-dataset --help-recursive.
Configuration
Project scope wins over user scope, per the SciTeX local-state layout:
--config <path>
$SCITEX_DATASET_CONFIG
<project>/.scitex/dataset/config.yaml
$SCITEX_DIR/dataset/config.yaml (default ~/.scitex/dataset/config.yaml)
Runtime artifacts (the SQLite index, snapshots) live under
<scope-root>/runtime/.
Next steps
CLI Reference — full CLI grammar and per-action flags.
Data Sources — every supported repository.
scitex_dataset API Reference — generated Python API reference.