scitex_dataset API Reference
SciTeX Dataset - Unified interface for scientific dataset discovery.
Domains:
- neuroscience: OpenNeuro, DANDI, PhysioNet
- general: Scientific Data, Zenodo
- biology: GEO (Gene Expression Omnibus)
- pharmacology: ChEMBL
- medical: ClinicalTrials.gov
- Usage:
>>> from scitex_dataset import neuroscience
>>> datasets = neuroscience.fetch_all_datasets(max_datasets=10)
>>> # Or direct import for convenience
>>> from scitex_dataset import fetch_all_datasets, search_datasets
>>> # Local database for fast searching
>>> from scitex_dataset import database as db
>>> db.build()  # Fetch all sources and index
>>> results = db.search("alzheimer EEG", min_subjects=20)
- scitex_dataset.db_build(sources=None, db_path=None, logger=None)
Build the local database from all sources.
- Parameters:
sources (list, optional) – Sources to fetch: [“openneuro”, “dandi”, “physionet”]. Default: all sources.
db_path (Path, optional) – Database file path. Default: $SCITEX_DIR/dataset/runtime/datasets.db (~/.scitex/dataset/runtime/datasets.db when SCITEX_DIR is unset).
logger (optional) – Logger for progress messages.
- Returns:
Count of datasets indexed per source.
- Return type:
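The default `db_path` rule described above can be sketched as follows. `default_db_path` is a hypothetical helper written for illustration, not a function exported by scitex_dataset, and treating an empty `SCITEX_DIR` the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def default_db_path(environ=None):
    """Resolve the documented default database location.

    Hypothetical helper for illustration; not part of scitex_dataset.
    """
    env = os.environ if environ is None else environ
    scitex_dir = env.get("SCITEX_DIR")
    # Assumption: an empty SCITEX_DIR is treated like an unset one.
    base = Path(scitex_dir) if scitex_dir else Path.home() / ".scitex"
    return base / "dataset" / "runtime" / "datasets.db"


# Passing a plain dict keeps the example independent of the real environment.
print(default_db_path({"SCITEX_DIR": "/tmp/scitex"}))
print(default_db_path({}))
```

Passing an explicit `db_path` to `db_build` bypasses this lookup entirely.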
- scitex_dataset.db_search(query=None, source=None, modality=None, min_subjects=None, max_subjects=None, min_downloads=None, has_readme=False, limit=50, offset=0, order_by='downloads', db_path=None)
Search the local database.
- Parameters:
query (str, optional) – Full-text search query (searches name, readme, tasks).
source (str, optional) – Filter by source: “openneuro”, “dandi”, “physionet”.
modality (str, optional) – Filter by modality (e.g., “mri”, “eeg”).
min_subjects (int, optional) – Minimum number of subjects.
max_subjects (int, optional) – Maximum number of subjects.
min_downloads (int, optional) – Minimum download count.
has_readme (bool) – Only include datasets with readme.
limit (int) – Maximum results (default: 50).
offset (int) – Skip first N results (for pagination).
order_by (str) – Order by: downloads, views, n_subjects, size_gb, name.
db_path (Path, optional) – Database file path.
- Returns:
List of matching datasets.
- Return type:
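Because `limit` and `offset` are the only pagination controls listed, walking a large result set means issuing repeated calls. The sketch below shows one way to do that. `iter_search_results` and `fake_search` are hypothetical names; the stand-in replaces `db_search` so the example runs without a built database, and stopping on a short page is an assumption about how the end of the results can be detected.

```python
def iter_search_results(search, page_size=50, **filters):
    """Yield every match by walking limit/offset pages.

    `search` is any callable that accepts `limit` and `offset` keywords
    and returns a list, as db_search is documented to do.
    """
    offset = 0
    while True:
        page = search(limit=page_size, offset=offset, **filters)
        yield from page
        if len(page) < page_size:
            break  # a short (or empty) page means nothing is left
        offset += page_size


# Stand-in for db_search, backed by an in-memory list.
_FAKE_ROWS = [{"id": f"ds{i:03d}", "n_subjects": i} for i in range(7)]


def fake_search(limit=50, offset=0, min_subjects=None):
    rows = [
        row for row in _FAKE_ROWS
        if min_subjects is None or row["n_subjects"] >= min_subjects
    ]
    return rows[offset:offset + limit]


all_rows = list(iter_search_results(fake_search, page_size=3, min_subjects=2))
print([row["id"] for row in all_rows])
```

With the real function, the call would presumably look like `iter_search_results(db_search, query="alzheimer EEG", min_subjects=20)`.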
- scitex_dataset.db_show_stats(db_path=None)
Get database statistics.
- Returns:
Statistics including counts per source, last build time, etc.
- Return type:
- scitex_dataset.filter_results(datasets, **kwargs)[source]
Filter and rank dataset dicts. Matches the dataset_filter_results MCP tool.
- scitex_dataset.list_sources()[source]
Return the 11-source registry. Matches the dataset_list_sources MCP tool.
- Return type:
- scitex_dataset.openneuro_fetch(batch_size=100, max_datasets=None, logger=None)
Fetch every dataset record from OpenNeuro by paginating GraphQL.
Walks the public crn/graphql endpoint with cursor-based pagination until exhausted (or max_datasets is reached). Use format_dataset to project each raw record into the package’s common dataset schema.
- Parameters:
batch_size (int, default 100) – Records per HTTP request. The OpenNeuro server caps this; the function does not validate the upper bound.
max_datasets (int, optional) – Stop after this many records. None (default) fetches the entire catalog.
logger (logging.Logger, optional) – If provided, HTTP and GraphQL errors are logged. Errors are otherwise silent (the function returns whatever it has so far).
- Returns:
Raw GraphQL node dicts, in catalog order. Pass each through format_dataset for the normalized schema.
- Return type:
Examples
>>> records = fetch_all_datasets(max_datasets=10)
>>> len(records) <= 10
True
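The cursor-based loop described above can be sketched without network access. This is an illustration under stated assumptions, not the package's implementation: `fetch_all` and `fake_fetch_page` are hypothetical names, and the response shape (`edges`, `pageInfo.hasNextPage`, `pageInfo.endCursor`) follows the common GraphQL connection pattern rather than anything confirmed here about the OpenNeuro endpoint.

```python
def fetch_all(fetch_page, batch_size=100, max_datasets=None):
    """Collect nodes page by page until the cursor runs out.

    `fetch_page(first, after)` is a hypothetical callable returning
    {"edges": [{"node": ...}], "pageInfo": {"hasNextPage": ..., "endCursor": ...}}.
    """
    records, cursor = [], None
    while True:
        page = fetch_page(first=batch_size, after=cursor)
        records.extend(edge["node"] for edge in page["edges"])
        if max_datasets is not None and len(records) >= max_datasets:
            return records[:max_datasets]  # trim the overshoot from the last page
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            return records
        cursor = info["endCursor"]


# In-memory stand-in for the remote endpoint.
_CATALOG = [{"id": f"ds{i:06d}"} for i in range(1, 8)]


def fake_fetch_page(first, after=None):
    start = 0 if after is None else int(after)
    chunk = _CATALOG[start:start + first]
    end = start + len(chunk)
    return {
        "edges": [{"node": node} for node in chunk],
        "pageInfo": {"hasNextPage": end < len(_CATALOG), "endCursor": str(end)},
    }


records = fetch_all(fake_fetch_page, batch_size=3, max_datasets=5)
print(len(records))
```

Trimming after the final page keeps the `len(records) <= max_datasets` guarantee from the doctest even when `batch_size` does not divide `max_datasets` evenly.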
- scitex_dataset.dandi_fetch(max_datasets=None, page_size=100, logger=None)
Fetch all dandisets from DANDI Archive with pagination.
- scitex_dataset.physionet_fetch(max_datasets=None, logger=None)
Fetch all databases from PhysioNet with pagination.
- scitex_dataset.zenodo_fetch(query='', max_datasets=None, page_size=25, type_filter='dataset', logger=None)
Fetch all datasets from Zenodo with pagination.
- Parameters:
- Returns:
List of raw record dictionaries.
- Return type:
Fetch all datasets from Figshare with pagination.
- scitex_dataset.openml_fetch(max_datasets=None, page_size=100, logger=None)
Fetch all datasets from OpenML with pagination.
- scitex_dataset.moleculenet_fetch(max_datasets=None, logger=None)
Fetch all MoleculeNet datasets.
- scitex_dataset.geo_fetch(max_datasets=None, logger=None)
Fetch all datasets from GEO with pagination.
- scitex_dataset.chembl_fetch(max_datasets=None, logger=None)
Fetch all assays from ChEMBL with pagination.
- scitex_dataset.clinicaltrials_fetch(max_datasets=None, logger=None)
Fetch all studies from ClinicalTrials.gov with pagination.
- scitex_dataset.huggingface_fetch(query='', max_datasets=None, logger=None, **_unused)
Catalog-style adapter so HuggingFace can plug into database.build.
Unlike OpenNeuro, DANDI, etc., HuggingFace has no bounded catalog, so query is required for meaningful results. Without one, this calls search_hub("") which lists by recency up to max_datasets.
- scitex_dataset.huggingface_search(query, limit=50)
Search for datasets on HuggingFace.
- scitex_dataset.huggingface_info(repo_id, repo_type='dataset')
Get metadata about a HuggingFace dataset or model.
- scitex_dataset.huggingface_download_file(repo_id, filename, local_dir=None, repo_type='dataset')
Download a single file from a HuggingFace repository.
- Parameters:
repo_id (str) – Repository ID (e.g., “username/dataset_name”).
filename (str) – Path within the repository (e.g., “data/train.csv”).
local_dir (str, optional) – Local directory for download. If None, uses ~/.scitex/dataset/huggingface/<repo_id>/.
repo_type (str) – Repository type: “dataset” (default) or “model”.
- Returns:
Path to the downloaded file.
- Return type:
Path
- Raises:
Exception – If download fails.
Search Module
Unified search interface for neuroscience datasets.
Currently supports:
- OpenNeuro (BIDS neuroimaging)
Future sources:
- DANDI (NWB neurophysiology)
- PhysioNet (EEG/ECG/physiology)
- Zenodo (general scientific)
- scitex_dataset.search.search_datasets(datasets, modality=None, min_subjects=None, max_subjects=None, task_contains=None, text_query=None, min_downloads=None, has_readme=False)[source]
Filter datasets by various criteria.
- Parameters:
- Return type:
- Returns:
Filtered list of datasets
Example
>>> from scitex_dataset import fetch_all_datasets, format_dataset
>>> from scitex_dataset.search import search_datasets
>>> raw = fetch_all_datasets(max_datasets=100)
>>> datasets = [format_dataset(d) for d in raw]
>>> eeg_data = search_datasets(datasets, modality="eeg", min_subjects=20)
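The parameter descriptions for search_datasets are missing from this page, so the following is only a sketch of how such in-memory filtering might behave on normalized records. `filter_datasets` is a hypothetical function, and its semantics (case-insensitive modality match against `modalities`, inclusive `min_subjects`, empty readme counted as absent) are assumptions rather than documented behaviour.

```python
def filter_datasets(datasets, modality=None, min_subjects=None, has_readme=False):
    """Keep records that satisfy every criterion that was supplied."""
    kept = []
    for ds in datasets:
        modalities = [m.lower() for m in ds.get("modalities") or []]
        if modality is not None and modality.lower() not in modalities:
            continue
        # Missing subject counts are treated as 0 (assumption).
        if min_subjects is not None and (ds.get("n_subjects") or 0) < min_subjects:
            continue
        if has_readme and not ds.get("readme"):
            continue
        kept.append(ds)
    return kept


sample = [
    {"id": "a", "modalities": ["EEG"], "n_subjects": 32, "readme": "..."},
    {"id": "b", "modalities": ["MRI"], "n_subjects": 50, "readme": ""},
    {"id": "c", "modalities": ["eeg", "mri"], "n_subjects": 8, "readme": None},
]
eeg = filter_datasets(sample, modality="eeg", min_subjects=20)
print([ds["id"] for ds in eeg])
```

Criteria left at their defaults are skipped, so calling the function with no filters returns every record unchanged.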
Database Module
Local SQLite database for fast dataset searching.
- Usage:
>>> from scitex_dataset import database as db
>>> db.build()  # Fetch all sources and build database
>>> results = db.search("alzheimer EEG", min_subjects=20)
- scitex_dataset.database.build(sources=None, db_path=None, logger=None)[source]
Build the local database from all sources.
- Parameters:
sources (list, optional) – Sources to fetch: [“openneuro”, “dandi”, “physionet”]. Default: all sources.
db_path (Path, optional) – Database file path. Default: $SCITEX_DIR/dataset/runtime/datasets.db (~/.scitex/dataset/runtime/datasets.db when SCITEX_DIR is unset).
logger (optional) – Logger for progress messages.
- Returns:
Count of datasets indexed per source.
- Return type:
- scitex_dataset.database.update(source, db_path=None, logger=None)[source]
Update a single source in the database.
- scitex_dataset.database.search(query=None, source=None, modality=None, min_subjects=None, max_subjects=None, min_downloads=None, has_readme=False, limit=50, offset=0, order_by='downloads', db_path=None)[source]
Search the local database.
- Parameters:
query (str, optional) – Full-text search query (searches name, readme, tasks).
source (str, optional) – Filter by source: “openneuro”, “dandi”, “physionet”.
modality (str, optional) – Filter by modality (e.g., “mri”, “eeg”).
min_subjects (int, optional) – Minimum number of subjects.
max_subjects (int, optional) – Maximum number of subjects.
min_downloads (int, optional) – Minimum download count.
has_readme (bool) – Only include datasets with readme.
limit (int) – Maximum results (default: 50).
offset (int) – Skip first N results (for pagination).
order_by (str) – Order by: downloads, views, n_subjects, size_gb, name.
db_path (Path, optional) – Database file path.
- Returns:
List of matching datasets.
- Return type:
Neuroscience Sources
OpenNeuro
OpenNeuro dataset fetcher using GraphQL API.
Example
>>> from scitex_dataset import fetch_all_datasets, format_dataset
>>> datasets = fetch_all_datasets(max_datasets=10)
>>> formatted = [format_dataset(ds) for ds in datasets]
- scitex_dataset.neuroscience.openneuro.fetch_datasets(first=10, after=None)[source]
Fetch a single page of datasets from OpenNeuro.
- Return type:
- scitex_dataset.neuroscience.openneuro.fetch_all_datasets(batch_size=100, max_datasets=None, logger=None)[source]
Fetch every dataset record from OpenNeuro by paginating GraphQL.
Walks the public crn/graphql endpoint with cursor-based pagination until exhausted (or max_datasets is reached). Use format_dataset to project each raw record into the package’s common dataset schema.
- Parameters:
batch_size (int, default 100) – Records per HTTP request. The OpenNeuro server caps this; the function does not validate the upper bound.
max_datasets (int, optional) – Stop after this many records. None (default) fetches the entire catalog.
logger (logging.Logger, optional) – If provided, HTTP and GraphQL errors are logged. Errors are otherwise silent (the function returns whatever it has so far).
- Returns:
Raw GraphQL node dicts, in catalog order. Pass each through format_dataset for the normalized schema.
- Return type:
Examples
>>> records = fetch_all_datasets(max_datasets=10)
>>> len(records) <= 10
True
- scitex_dataset.neuroscience.openneuro.format_dataset(node)[source]
Project a raw OpenNeuro GraphQL node into the common dataset schema.
Every catalog source exposes format_dataset returning the same shape so they can plug into database.build and search.search_datasets uniformly.
- Parameters:
node (dict) – A single edges[].node element from the OpenNeuro GraphQL response (the draft/analytics keys are read; missing fields fall back to None / 0).
- Returns:
Normalized record with keys: id, name, n_subjects, modalities, tasks, size_gb, downloads, views, readme, license, doi, url, source.
- Return type:
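Since every source is expected to return the same keys, a small conformance check can be useful when writing or testing a formatter. The sketch below uses only the key list given above. `NORMALIZED_KEYS` and `missing_keys` are hypothetical names, not part of scitex_dataset.

```python
# Key names copied from the documented return value of format_dataset.
NORMALIZED_KEYS = (
    "id", "name", "n_subjects", "modalities", "tasks", "size_gb",
    "downloads", "views", "readme", "license", "doi", "url", "source",
)


def missing_keys(record):
    """Return the sorted schema keys that `record` does not provide."""
    return sorted(set(NORMALIZED_KEYS) - set(record))


# A record with every key present (values left as None where unknown).
example = dict.fromkeys(NORMALIZED_KEYS)
example.update(id="ds000001", source="openneuro", n_subjects=0)

print(missing_keys(example))
print(missing_keys({"id": "ds000001", "name": "demo"}))
```

The check only looks at key presence; it says nothing about value types, which this page does not specify.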
DANDI
DANDI Archive dataset fetcher.
DANDI (Distributed Archives for Neurophysiology Data Integration) hosts neurophysiology data in NWB (Neurodata Without Borders) format.
API: https://api.dandiarchive.org/api
Example
>>> from scitex_dataset.neuroscience import dandi
>>> datasets = dandi.fetch_all_datasets(max_datasets=10)
>>> formatted = [dandi.format_dataset(ds) for ds in datasets]
- scitex_dataset.neuroscience.dandi.fetch_datasets(page=1, page_size=100, ordering='-modified')[source]
Fetch a single page of dandisets from DANDI Archive.
- Return type:
PhysioNet
PhysioNet dataset fetcher.
PhysioNet hosts physiological signal databases including EEG, ECG, EMG, and other biomedical signals.
API: https://physionet.org/api/v1/
Example
>>> from scitex_dataset.neuroscience import physionet
>>> datasets = physionet.fetch_all_datasets(max_datasets=10)
>>> formatted = [physionet.format_dataset(ds) for ds in datasets]
- scitex_dataset.neuroscience.physionet.fetch_datasets(page=1)[source]
Fetch a single page of databases from PhysioNet.
- Return type:
General Sources
Zenodo
Zenodo API client for scientific dataset discovery.
Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit research papers, datasets, research software, reports, and any other research-related digital artifacts.
API Documentation: https://developers.zenodo.org/
- scitex_dataset.general.zenodo.fetch_datasets(query='', page=1, size=25, sort='mostrecent', type_filter='dataset')[source]
Fetch datasets from Zenodo.
- Parameters:
query (str) – Search query string (Elasticsearch query syntax).
page (int) – Page number (1-indexed).
size (int) – Number of results per page (max 10000).
sort (str) – Sort order: ‘bestmatch’, ‘mostrecent’, ‘-mostrecent’.
type_filter (str) – Resource type filter: ‘dataset’, ‘software’, ‘publication’, etc.
- Returns:
API response with ‘hits’ containing records.
- Return type:
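To make the parameters above concrete, the sketch below builds a request URL for Zenodo's public records endpoint without sending it. `build_records_url` is a hypothetical helper, and the mapping of `query` to `q` and `type_filter` to `type` is an assumption about how fetch_datasets translates its arguments; the endpoint and parameter names themselves come from the Zenodo REST API.

```python
from urllib.parse import urlencode

ZENODO_RECORDS_URL = "https://zenodo.org/api/records"


def build_records_url(query="", page=1, size=25, sort="mostrecent",
                      type_filter="dataset"):
    """Assemble a records search URL; no request is made."""
    params = {"page": page, "size": size, "sort": sort}
    if query:
        params["q"] = query  # Elasticsearch query syntax
    if type_filter:
        params["type"] = type_filter
    return f"{ZENODO_RECORDS_URL}?{urlencode(params)}"


url = build_records_url(query="EEG sleep", size=10)
print(url)
```

Empty `query` and `type_filter` values are omitted from the URL rather than sent as blank parameters, which is a design choice of this sketch.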
CLI
Command-line interface for scitex-dataset.
The command grammar is:
scitex-dataset <domain> <dataset> <action> [OPTIONS]
For example:
scitex-dataset neuroscience openneuro fetch -n 50
scitex-dataset general huggingface fetch Anthropic/BioMysteryBench-full
scitex-dataset pharmacology chembl fetch
scitex-dataset db build
The flat fetch-<source> and hf <verb> shapes from earlier
versions are kept as hidden deprecation aliases that print the new path
and exit with status 2.
See general/03_interface_02_cli/02_subcommand-structure-noun-verb.md
for the SciTeX CLI grammar.