scitex_dataset API Reference

SciTeX Dataset - Unified interface for scientific dataset discovery.

Domains:
  • neuroscience: OpenNeuro, DANDI, PhysioNet
  • general: Scientific Data, Zenodo
  • biology: GEO (Gene Expression Omnibus)
  • pharmacology: ChEMBL
  • medical: ClinicalTrials.gov

Usage:
>>> from scitex_dataset import neuroscience
>>> datasets = neuroscience.fetch_all_datasets(max_datasets=10)
>>> # Or direct import for convenience
>>> from scitex_dataset import fetch_all_datasets, search_datasets
>>> # Local database for fast searching
>>> from scitex_dataset import database as db
>>> db.build()  # Fetch all sources and index
>>> results = db.search("alzheimer EEG", min_subjects=20)

scitex_dataset.db_build(sources=None, db_path=None, logger=None)

Build the local database from all sources.

Parameters:
  • sources (list, optional) – Sources to fetch: [“openneuro”, “dandi”, “physionet”]. Default: all sources.

  • db_path (Path, optional) – Database file path. Default: $SCITEX_DIR/dataset/runtime/datasets.db (~/.scitex/dataset/runtime/datasets.db when SCITEX_DIR is unset).

  • logger (optional) – Logger for progress messages.

Returns:

Count of datasets indexed per source.

Return type:

dict

Search the local database.

Parameters:
  • query (str, optional) – Full-text search query (searches name, readme, tasks).

  • source (str, optional) – Filter by source: “openneuro”, “dandi”, “physionet”.

  • modality (str, optional) – Filter by modality (e.g., “mri”, “eeg”).

  • min_subjects (int, optional) – Minimum number of subjects.

  • max_subjects (int, optional) – Maximum number of subjects.

  • min_downloads (int, optional) – Minimum download count.

  • has_readme (bool) – Only include datasets with readme.

  • limit (int) – Maximum results (default: 50).

  • offset (int) – Skip first N results (for pagination).

  • order_by (str) – Order by: downloads, views, n_subjects, size_gb, name.

  • db_path (Path, optional) – Database file path.

Returns:

List of matching datasets.

Return type:

list

scitex_dataset.db_show_stats(db_path=None)

Get database statistics.

Returns:

Statistics including counts per source, last build time, etc.

Return type:

dict

scitex_dataset.filter_results(datasets, **kwargs)

Filter and rank dataset dicts — matches dataset_filter_results MCP tool.

scitex_dataset.list_sources()

Return the 11-source registry — matches dataset_list_sources MCP tool.

Return type:

dict

scitex_dataset.openneuro_fetch(batch_size=100, max_datasets=None, logger=None)

Fetch every dataset record from OpenNeuro by paginating GraphQL.

Walks the public crn/graphql endpoint with cursor-based pagination until exhausted (or max_datasets is reached). Use format_dataset to project each raw record into the package’s common dataset schema.

Parameters:
  • batch_size (int, default 100) – Records per HTTP request. The OpenNeuro server caps this; the function does not validate the upper bound.

  • max_datasets (int, optional) – Stop after this many records. None (default) fetches the entire catalog.

  • logger (logging.Logger, optional) – If provided, HTTP and GraphQL errors are logged. Errors are otherwise silent (the function returns whatever it has so far).

Returns:

Raw GraphQL node dicts, in catalog order. Pass each through format_dataset for the normalized schema.

Return type:

list[dict]

Examples

>>> records = fetch_all_datasets(max_datasets=10)
>>> len(records) <= 10
True

scitex_dataset.dandi_fetch(max_datasets=None, page_size=100, logger=None)

Fetch all dandisets from DANDI Archive with pagination.

Return type:

list[dict]

scitex_dataset.physionet_fetch(max_datasets=None, logger=None)

Fetch all databases from PhysioNet with pagination.

Return type:

list[dict]

scitex_dataset.zenodo_fetch(query='', max_datasets=None, page_size=25, type_filter='dataset', logger=None)

Fetch all datasets from Zenodo with pagination.

Parameters:
  • query (str) – Search query string.

  • max_datasets (int, optional) – Maximum number of datasets to fetch.

  • page_size (int) – Datasets per request.

  • type_filter (str) – Resource type filter (default: ‘dataset’).

  • logger (optional) – Logger for progress messages.

Returns:

List of raw record dictionaries.

Return type:

list[dict]

scitex_dataset.figshare_fetch(query='', max_datasets=None, page_size=25, logger=None)

Fetch all datasets from Figshare with pagination.

Parameters:
  • query (str) – Search query string.

  • max_datasets (int, optional) – Maximum number of datasets to fetch.

  • page_size (int) – Datasets per request.

  • logger (optional) – Logger for progress messages.

Returns:

List of raw article dictionaries.

Return type:

list[dict]

scitex_dataset.openml_fetch(max_datasets=None, page_size=100, logger=None)

Fetch all datasets from OpenML with pagination.

Parameters:
  • max_datasets (int, optional) – Maximum number of datasets to fetch.

  • page_size (int) – Datasets per request.

  • logger (optional) – Logger for progress messages.

Returns:

List of raw dataset dictionaries.

Return type:

list[dict]

scitex_dataset.moleculenet_fetch(max_datasets=None, logger=None)

Fetch all MoleculeNet datasets.

Parameters:
  • max_datasets (int, optional) – Maximum number of datasets to return.

  • logger (optional) – Logger for progress messages.

Returns:

List of MoleculeNet dataset records.

Return type:

list[dict]

scitex_dataset.geo_fetch(max_datasets=None, logger=None)

Fetch all datasets from GEO with pagination.

Return type:

list[dict]

scitex_dataset.chembl_fetch(max_datasets=None, logger=None)

Fetch all assays from ChEMBL with pagination.

Return type:

list[dict]

scitex_dataset.clinicaltrials_fetch(max_datasets=None, logger=None)

Fetch all studies from ClinicalTrials.gov with pagination.

Return type:

list[dict]

scitex_dataset.huggingface_fetch(query='', max_datasets=None, logger=None, **_unused)

Catalog-style adapter so HuggingFace can plug into database.build.

Unlike OpenNeuro/DANDI/etc., HuggingFace has no bounded catalog — query is required for meaningful results. Without one this calls search_hub("") which lists by recency up to max_datasets.

Parameters:
  • query (str) – Search query. Empty string lists by recency (HF default).

  • max_datasets (int, optional) – Cap on results. When left as None, a cap of 1000 applies to avoid runaway indexing.

Return type:

list[dict]

Search for datasets on HuggingFace.

Parameters:
  • query (str) – Search query string.

  • limit (int) – Maximum number of results (default: 50).

Returns:

List of search result dictionaries with fields: id, name, description, likes, downloads, private, gated, etc.

Return type:

list[dict]
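
The result fields listed above lend themselves to simple post-filtering. The following is a minimal sketch, assuming each result is a plain dict carrying the `private`, `gated`, and `downloads` fields; `open_results` and the sample repository IDs are hypothetical and not part of the package.

```python
def open_results(results, min_downloads=0):
    """Keep results that are neither private nor gated and meet a download floor.

    Hypothetical helper; treats a missing or None download count as 0.
    """
    return [
        r
        for r in results
        if not r.get("private")
        and not r.get("gated")
        and (r.get("downloads") or 0) >= min_downloads
    ]


# Made-up records shaped like the documented search results.
sample_results = [
    {"id": "example-org/eeg-demo", "downloads": 1500, "private": False, "gated": False},
    {"id": "example-org/restricted", "downloads": 9000, "private": False, "gated": True},
    {"id": "example-org/tiny", "downloads": None, "private": False, "gated": False},
]

popular_open = open_results(sample_results, min_downloads=100)
```

Filtering on truthiness rather than `is True` is a deliberate choice here, so that a non-boolean but non-empty `gated` value is also treated as gated.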

scitex_dataset.huggingface_info(repo_id, repo_type='dataset')

Get metadata about a HuggingFace dataset or model.

Parameters:
  • repo_id (str) – Repository ID (e.g., “username/dataset_name”).

  • repo_type (str) – Repository type: “dataset” (default) or “model”.

Returns:

Dataset metadata: id, name, description, downloads, likes, private, gated, size_gb, created_at, last_modified, etc.

Return type:

dict

scitex_dataset.huggingface_download_file(repo_id, filename, local_dir=None, repo_type='dataset')

Download a single file from a HuggingFace repository.

Parameters:
  • repo_id (str) – Repository ID (e.g., “username/dataset_name”).

  • filename (str) – Path within the repository (e.g., “data/train.csv”).

  • local_dir (str, optional) – Local directory for download. If None, uses ~/.scitex/dataset/huggingface/<repo_id>/.

  • repo_type (str) – Repository type: “dataset” (default) or “model”.

Returns:

Path to the downloaded file.

Return type:

Path

Raises:

Exception – If download fails.

Search Module

Unified search interface for neuroscience datasets.

Currently supports:
  • OpenNeuro (BIDS neuroimaging)

Future sources:
  • DANDI (NWB neurophysiology)
  • PhysioNet (EEG/ECG/physiology)
  • Zenodo (general scientific)

scitex_dataset.search.search_datasets(datasets, modality=None, min_subjects=None, max_subjects=None, task_contains=None, text_query=None, min_downloads=None, has_readme=False)

Filter datasets by various criteria.

Parameters:
  • datasets (list[dict]) – List of formatted dataset dictionaries

  • modality (Optional[str]) – Filter by modality (e.g., “mri”, “eeg”, “meg”)

  • min_subjects (Optional[int]) – Minimum number of subjects

  • max_subjects (Optional[int]) – Maximum number of subjects

  • task_contains (Optional[str]) – Filter by task name substring

  • text_query (Optional[str]) – Search in name and readme text

  • min_downloads (Optional[int]) – Minimum download count

  • has_readme (bool) – Only include datasets with readme

Return type:

list[dict]

Returns:

Filtered list of datasets

Example

>>> from scitex_dataset import fetch_all_datasets, format_dataset
>>> from scitex_dataset.search import search_datasets
>>> raw = fetch_all_datasets(max_datasets=100)
>>> datasets = [format_dataset(d) for d in raw]
>>> eeg_data = search_datasets(datasets, modality="eeg", min_subjects=20)
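
The criteria above can be pictured as a chain of predicates applied to each formatted dataset dict. The sketch below is a stand-alone illustration, not the package's implementation: `filter_sketch` is a hypothetical name, the record keys follow the schema listed for format_dataset, and the matching rules (case-insensitive, substring-based) are assumptions.

```python
from __future__ import annotations

from typing import Optional


def filter_sketch(
    datasets: list[dict],
    modality: Optional[str] = None,
    min_subjects: Optional[int] = None,
    max_subjects: Optional[int] = None,
    text_query: Optional[str] = None,
    has_readme: bool = False,
) -> list[dict]:
    """Hypothetical stand-in for search_datasets; matching rules are assumed."""
    kept = []
    for ds in datasets:
        n_subjects = ds.get("n_subjects") or 0
        readme = ds.get("readme") or ""
        modalities = [m.lower() for m in ds.get("modalities") or []]
        if modality and modality.lower() not in modalities:
            continue
        if min_subjects is not None and n_subjects < min_subjects:
            continue
        if max_subjects is not None and n_subjects > max_subjects:
            continue
        if has_readme and not readme.strip():
            continue
        if text_query:
            # Search name and readme together, ignoring case.
            haystack = f"{ds.get('name') or ''} {readme}".lower()
            if text_query.lower() not in haystack:
                continue
        kept.append(ds)
    return kept


# Made-up records for illustration only.
sample = [
    {"id": "ds-a", "name": "Resting EEG", "n_subjects": 32, "modalities": ["eeg"], "readme": "Eyes closed."},
    {"id": "ds-b", "name": "Task fMRI", "n_subjects": 12, "modalities": ["mri"], "readme": ""},
    {"id": "ds-c", "name": "Small EEG pilot", "n_subjects": 5, "modalities": ["EEG"], "readme": None},
]
eeg_large = filter_sketch(sample, modality="eeg", min_subjects=20)
```

Each criterion is skipped when its argument is left at the default, so the filters combine with AND semantics.
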
scitex_dataset.search.sort_datasets(datasets, by='downloads', descending=True)

Sort datasets by a field.

Parameters:
  • datasets (list[dict]) – List of formatted dataset dictionaries

  • by (str) – Field to sort by (downloads, views, n_subjects, size_gb, created)

  • descending (bool) – Sort in descending order

Return type:

list[dict]

Returns:

Sorted list of datasets
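
Sorting by one of the numeric fields can be sketched with the built-in `sorted`. This is an illustration under assumptions rather than the package's code: `sort_sketch` is a hypothetical name, and treating a missing or None value as 0 is a choice made here so that incomplete records do not raise a comparison error.

```python
from __future__ import annotations


def sort_sketch(datasets: list[dict], by: str = "downloads", descending: bool = True) -> list[dict]:
    """Hypothetical stand-in for sort_datasets, for numeric fields only."""
    # Missing or None values sort as 0 (an assumption of this sketch).
    return sorted(datasets, key=lambda ds: ds.get(by) or 0, reverse=descending)


# Made-up records for illustration only.
sample = [
    {"id": "ds-a", "downloads": 120, "n_subjects": 32},
    {"id": "ds-b", "downloads": None, "n_subjects": 12},
    {"id": "ds-c", "downloads": 900, "n_subjects": 5},
]
by_downloads = sort_sketch(sample)
by_subjects = sort_sketch(sample, by="n_subjects", descending=False)
```

The sketch returns a new list and leaves the input untouched; a string-valued field such as created would need its own fallback value instead of 0.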

Database Module

Local SQLite database for fast dataset searching.

Usage:
>>> from scitex_dataset import database as db
>>> db.build()  # Fetch all sources and build database
>>> results = db.search("alzheimer EEG", min_subjects=20)

scitex_dataset.database.build(sources=None, db_path=None, logger=None)

Build the local database from all sources.

Parameters:
  • sources (list, optional) – Sources to fetch: [“openneuro”, “dandi”, “physionet”]. Default: all sources.

  • db_path (Path, optional) – Database file path. Default: $SCITEX_DIR/dataset/runtime/datasets.db (~/.scitex/dataset/runtime/datasets.db when SCITEX_DIR is unset).

  • logger (optional) – Logger for progress messages.

Returns:

Count of datasets indexed per source.

Return type:

dict

scitex_dataset.database.update(source, db_path=None, logger=None)

Update a single source in the database.

Parameters:
  • source (str) – Source to update: “openneuro”, “dandi”, or “physionet”.

  • db_path (Path, optional) – Database file path.

  • logger (optional) – Logger for progress messages.

Returns:

Number of datasets indexed.

Return type:

int

scitex_dataset.database.search(query=None, source=None, modality=None, min_subjects=None, max_subjects=None, min_downloads=None, has_readme=False, limit=50, offset=0, order_by='downloads', db_path=None)

Search the local database.

Parameters:
  • query (str, optional) – Full-text search query (searches name, readme, tasks).

  • source (str, optional) – Filter by source: “openneuro”, “dandi”, “physionet”.

  • modality (str, optional) – Filter by modality (e.g., “mri”, “eeg”).

  • min_subjects (int, optional) – Minimum number of subjects.

  • max_subjects (int, optional) – Maximum number of subjects.

  • min_downloads (int, optional) – Minimum download count.

  • has_readme (bool) – Only include datasets with readme.

  • limit (int) – Maximum results (default: 50).

  • offset (int) – Skip first N results (for pagination).

  • order_by (str) – Order by: downloads, views, n_subjects, size_gb, name.

  • db_path (Path, optional) – Database file path.

Returns:

List of matching datasets.

Return type:

list
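
The limit and offset parameters support paging through a large result set. A minimal sketch of that pattern follows, assuming only that the search callable accepts `limit` and `offset` keywords and returns a list; `iter_pages` and `fake_search` are hypothetical names, and `fake_search` stands in for the real function so the example runs without a built database.

```python
from typing import Callable, Iterator


def iter_pages(search_fn: Callable[..., list], page_size: int = 50, **filters) -> Iterator[list]:
    """Yield successive result pages until an empty or short page is returned."""
    offset = 0
    while True:
        page = search_fn(limit=page_size, offset=offset, **filters)
        if not page:
            return
        yield page
        if len(page) < page_size:
            # A short page means the result set is exhausted.
            return
        offset += page_size


# Stand-in data and search function (hypothetical, for illustration only).
_ROWS = [{"id": f"ds{i:03d}", "n_subjects": i} for i in range(1, 8)]


def fake_search(min_subjects=None, limit=50, offset=0):
    rows = [r for r in _ROWS if min_subjects is None or r["n_subjects"] >= min_subjects]
    return rows[offset : offset + limit]


pages = list(iter_pages(fake_search, page_size=3, min_subjects=2))
```

With the real function, the same helper would be called as `iter_pages(db.search, query="alzheimer EEG", min_subjects=20)`.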

scitex_dataset.database.get_stats(db_path=None)

Get database statistics.

Returns:

Statistics including counts per source, last build time, etc.

Return type:

dict

scitex_dataset.database.get_db_path()

Get the database file path.

Return type:

Path
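
The default location documented for db_path ($SCITEX_DIR/dataset/runtime/datasets.db, falling back to ~/.scitex when SCITEX_DIR is unset) can be sketched as below. `resolve_db_path` is a hypothetical helper written for illustration; whether get_db_path resolves the path in exactly this way is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_db_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper mirroring the documented default for db_path."""
    env = os.environ if env is None else env
    base = env.get("SCITEX_DIR")
    # Fall back to ~/.scitex when SCITEX_DIR is unset; treating an empty
    # value the same way is an assumption of this sketch.
    root = Path(base).expanduser() if base else Path.home() / ".scitex"
    return root / "dataset" / "runtime" / "datasets.db"


custom = resolve_db_path({"SCITEX_DIR": "/data/scitex"})
fallback = resolve_db_path({})
```

Passing the environment as a mapping keeps the sketch easy to test without modifying `os.environ`.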

scitex_dataset.database.clear(db_path=None)

Delete the database file.

Returns:

True if the file was deleted, False if it did not exist.

Return type:

bool

Neuroscience Sources

OpenNeuro

OpenNeuro dataset fetcher using GraphQL API.

Example

>>> from scitex_dataset import fetch_all_datasets, format_dataset
>>> datasets = fetch_all_datasets(max_datasets=10)
>>> formatted = [format_dataset(ds) for ds in datasets]

scitex_dataset.neuroscience.openneuro.fetch_datasets(first=10, after=None)

Fetch a single page of datasets from OpenNeuro.

Return type:

dict

scitex_dataset.neuroscience.openneuro.fetch_all_datasets(batch_size=100, max_datasets=None, logger=None)

Fetch every dataset record from OpenNeuro by paginating GraphQL.

Walks the public crn/graphql endpoint with cursor-based pagination until exhausted (or max_datasets is reached). Use format_dataset to project each raw record into the package’s common dataset schema.

Parameters:
  • batch_size (int, default 100) – Records per HTTP request. The OpenNeuro server caps this; the function does not validate the upper bound.

  • max_datasets (int, optional) – Stop after this many records. None (default) fetches the entire catalog.

  • logger (logging.Logger, optional) – If provided, HTTP and GraphQL errors are logged. Errors are otherwise silent (the function returns whatever it has so far).

Returns:

Raw GraphQL node dicts, in catalog order. Pass each through format_dataset for the normalized schema.

Return type:

list[dict]

Examples

>>> records = fetch_all_datasets(max_datasets=10)
>>> len(records) <= 10
True
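
The cursor-based pagination described for this function can be sketched as a loop that requests one page at a time and follows the cursor until the catalog is exhausted or the cap is reached. This is an illustration, not the package's code: `paginate_by_cursor` and `fake_fetch_page` are hypothetical, and the `pageInfo` / `hasNextPage` / `endCursor` response shape is assumed from the common GraphQL connection convention rather than taken from the source.

```python
from __future__ import annotations

from typing import Callable, Optional


def paginate_by_cursor(
    fetch_page: Callable[[int, Optional[str]], dict],
    batch_size: int = 100,
    max_datasets: Optional[int] = None,
) -> list[dict]:
    """Collect nodes page by page, following the cursor (assumed page shape)."""
    records: list[dict] = []
    cursor = None
    while max_datasets is None or len(records) < max_datasets:
        page = fetch_page(batch_size, cursor)
        records.extend(edge["node"] for edge in page["edges"])
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            break
        cursor = info["endCursor"]
    # Trim the overshoot from the last page when a cap was given.
    return records if max_datasets is None else records[:max_datasets]


# In-memory stand-in for a single-page fetch (hypothetical, no network access).
_CATALOG = [{"id": f"ds{i:06d}"} for i in range(1, 8)]


def fake_fetch_page(first, after):
    start = 0 if after is None else int(after)
    chunk = _CATALOG[start : start + first]
    end = start + len(chunk)
    return {
        "edges": [{"node": node} for node in chunk],
        "pageInfo": {"hasNextPage": end < len(_CATALOG), "endCursor": str(end)},
    }


records = paginate_by_cursor(fake_fetch_page, batch_size=3, max_datasets=5)
```

Here two pages of three records are fetched and the result is trimmed to the five requested records.
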
scitex_dataset.neuroscience.openneuro.format_dataset(node)

Project a raw OpenNeuro GraphQL node into the common dataset schema.

Every catalog source exposes format_dataset returning the same shape so they can plug into database.build and search.search_datasets uniformly.

Parameters:

node (dict) – A single edges[].node element from the OpenNeuro GraphQL response (the draft / analytics keys are read; missing fields fall back to None / 0).

Returns:

Normalized record with keys: id, name, n_subjects, modalities, tasks, size_gb, downloads, views, readme, license, doi, url, source.

Return type:

dict
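
The projection with fallbacks can be illustrated as follows. The output keys are the ones listed above; everything about the input is an assumption made for this sketch, including the nested field names under `draft` and `analytics` and the byte-to-gigabyte conversion. `format_sketch` is a hypothetical name and does not reproduce the package's implementation.

```python
def format_sketch(node: dict) -> dict:
    """Hypothetical projection of a raw node; input field names are assumed."""
    draft = node.get("draft") or {}
    summary = draft.get("summary") or {}
    description = draft.get("description") or {}
    analytics = node.get("analytics") or {}
    dataset_id = node.get("id")
    size_bytes = summary.get("size") or 0
    return {
        "id": dataset_id,
        "name": description.get("Name"),
        "n_subjects": len(summary.get("subjects") or []),
        "modalities": summary.get("modalities") or [],
        "tasks": summary.get("tasks") or [],
        "size_gb": round(size_bytes / 1e9, 2),
        "downloads": analytics.get("downloads") or 0,
        "views": analytics.get("views") or 0,
        "readme": draft.get("readme"),
        "license": description.get("License"),
        "doi": description.get("DatasetDOI"),
        "url": f"https://openneuro.org/datasets/{dataset_id}" if dataset_id else None,
        "source": "openneuro",
    }


# A sparse node: every missing field falls back to None, 0, or an empty list.
sparse = format_sketch({"id": "ds000001"})

# A fuller, made-up node.
full = format_sketch(
    {
        "id": "ds000002",
        "draft": {
            "description": {"Name": "Demo dataset", "License": "CC0"},
            "summary": {"subjects": ["01", "02"], "modalities": ["eeg"], "size": 2_500_000_000},
            "readme": "Demo readme.",
        },
        "analytics": {"downloads": 40, "views": 300},
    }
)
```

Using `or` for the fallbacks means an explicit null in the response is handled the same way as a missing key.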

DANDI

DANDI Archive dataset fetcher.

DANDI (Distributed Archives for Neurophysiology Data Integration) hosts neurophysiology data in NWB (Neurodata Without Borders) format.

API: https://api.dandiarchive.org/api

Example

>>> from scitex_dataset.neuroscience import dandi
>>> datasets = dandi.fetch_all_datasets(max_datasets=10)
>>> formatted = [dandi.format_dataset(ds) for ds in datasets]

scitex_dataset.neuroscience.dandi.fetch_datasets(page=1, page_size=100, ordering='-modified')

Fetch a single page of dandisets from DANDI Archive.

Return type:

dict

scitex_dataset.neuroscience.dandi.fetch_all_datasets(max_datasets=None, page_size=100, logger=None)

Fetch all dandisets from DANDI Archive with pagination.

Return type:

list[dict]

scitex_dataset.neuroscience.dandi.format_dataset(dandiset)

Extract and format dandiset information.

Return type:

dict

PhysioNet

PhysioNet dataset fetcher.

PhysioNet hosts physiological signal databases including EEG, ECG, EMG, and other biomedical signals.

API: https://physionet.org/api/v1/

Example

>>> from scitex_dataset.neuroscience import physionet
>>> datasets = physionet.fetch_all_datasets(max_datasets=10)
>>> formatted = [physionet.format_dataset(ds) for ds in datasets]

scitex_dataset.neuroscience.physionet.fetch_datasets(page=1)

Fetch a single page of databases from PhysioNet.

Return type:

dict

scitex_dataset.neuroscience.physionet.fetch_all_datasets(max_datasets=None, logger=None)

Fetch all databases from PhysioNet with pagination.

Return type:

list[dict]

scitex_dataset.neuroscience.physionet.format_dataset(database)

Extract and format PhysioNet database information.

Return type:

dict

General Sources

Zenodo

Zenodo API client for scientific dataset discovery.

Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit research papers, datasets, research software, reports, and any other research-related digital artifacts.

API Documentation: https://developers.zenodo.org/

scitex_dataset.general.zenodo.fetch_datasets(query='', page=1, size=25, sort='mostrecent', type_filter='dataset')

Fetch datasets from Zenodo.

Parameters:
  • query (str) – Search query string (Elasticsearch query syntax).

  • page (int) – Page number (1-indexed).

  • size (int) – Number of results per page (max 10000).

  • sort (str) – Sort order: ‘bestmatch’, ‘mostrecent’, ‘-mostrecent’.

  • type_filter (str) – Resource type filter: ‘dataset’, ‘software’, ‘publication’, etc.

Returns:

API response with ‘hits’ containing records.

Return type:

dict
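
To make the parameters concrete, the sketch below builds the kind of request URL they would translate to, assuming the function maps its arguments onto the `q`, `page`, `size`, `sort`, and `type` query parameters of Zenodo's public records endpoint. `build_records_url` is a hypothetical helper; no request is sent, and the package's actual request construction may differ.

```python
from urllib.parse import urlencode

ZENODO_RECORDS_URL = "https://zenodo.org/api/records"


def build_records_url(query="", page=1, size=25, sort="mostrecent", type_filter="dataset"):
    """Hypothetical helper: compose a records-search URL from the documented arguments."""
    params = {"page": page, "size": size, "sort": sort, "type": type_filter}
    if query:
        # Only send q when a query was given, so an empty search lists everything.
        params["q"] = query
    return f"{ZENODO_RECORDS_URL}?{urlencode(params)}"


url = build_records_url(query="sleep EEG", size=10)
```

`urlencode` takes care of escaping, so spaces and Elasticsearch operators in the query do not need manual quoting.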

scitex_dataset.general.zenodo.fetch_all_datasets(query='', max_datasets=None, page_size=25, type_filter='dataset', logger=None)

Fetch all datasets from Zenodo with pagination.

Parameters:
  • query (str) – Search query string.

  • max_datasets (int, optional) – Maximum number of datasets to fetch.

  • page_size (int) – Datasets per request.

  • type_filter (str) – Resource type filter (default: ‘dataset’).

  • logger (optional) – Logger for progress messages.

Returns:

List of raw record dictionaries.

Return type:

list[dict]
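
Page-numbered pagination with a cap can be sketched as below. The helper names are hypothetical, the stub replaces the network call, and the nested `hits.hits` layout of each response is an assumption about the API payload rather than something stated in this reference.

```python
def fetch_all_sketch(fetch_page, max_datasets=None, page_size=25):
    """Hypothetical loop: request pages 1, 2, ... until empty or the cap is hit."""
    records = []
    page = 1
    while max_datasets is None or len(records) < max_datasets:
        response = fetch_page(page=page, size=page_size)
        hits = response.get("hits", {}).get("hits", [])
        if not hits:
            break
        records.extend(hits)
        page += 1
    # Trim the overshoot from the last page when a cap was given.
    return records if max_datasets is None else records[:max_datasets]


# In-memory stand-in for a single-page fetch (hypothetical, no network access).
_RECORDS = [{"id": i} for i in range(1, 8)]


def fake_fetch_page(page, size):
    start = (page - 1) * size
    return {"hits": {"hits": _RECORDS[start : start + size], "total": len(_RECORDS)}}


capped = fetch_all_sketch(fake_fetch_page, max_datasets=5, page_size=3)
everything = fetch_all_sketch(fake_fetch_page, page_size=3)
```

Stopping on the first empty page avoids relying on a total count being present in the response.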

scitex_dataset.general.zenodo.format_dataset(record)

Format a Zenodo record into a standardized dataset dictionary.

Parameters:

record (dict) – Raw Zenodo record from API.

Returns:

Standardized dataset dictionary.

Return type:

dict

CLI

Command-line interface for scitex-dataset.

The command grammar is:

scitex-dataset <domain> <dataset> <action> [OPTIONS]

For example:

scitex-dataset neuroscience openneuro fetch -n 50
scitex-dataset general huggingface fetch Anthropic/BioMysteryBench-full
scitex-dataset pharmacology chembl fetch
scitex-dataset db build

The flat fetch-<source> and hf <verb> shapes from earlier versions are kept as hidden deprecation aliases that print the new path and exit with status 2.

See general/03_interface_02_cli/02_subcommand-structure-noun-verb.md for the SciTeX CLI grammar.
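
The three-level grammar can be pictured with nested subparsers. The sketch below uses `argparse` purely for illustration; which CLI framework the package actually uses is not stated here, only one branch of the tree is modeled, and reading `-n` as a cap on the number of datasets is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for one branch: neuroscience openneuro fetch."""
    parser = argparse.ArgumentParser(prog="scitex-dataset")
    domains = parser.add_subparsers(dest="domain", required=True)

    # <domain>
    neuroscience = domains.add_parser("neuroscience")
    datasets = neuroscience.add_subparsers(dest="dataset", required=True)

    # <dataset>
    openneuro = datasets.add_parser("openneuro")
    actions = openneuro.add_subparsers(dest="action", required=True)

    # <action> [OPTIONS]
    fetch = actions.add_parser("fetch")
    fetch.add_argument("-n", dest="max_datasets", type=int, default=None)  # meaning assumed
    return parser


args = build_parser().parse_args(["neuroscience", "openneuro", "fetch", "-n", "50"])
```

Each level stores its choice under its own `dest`, so a dispatcher can route on `(args.domain, args.dataset, args.action)`.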

scitex_dataset._cli._repositories_block()

Render the per-domain bullet list shown in top-level --help.

Return type:

str