Dataloaders

`manify.utils.dataloaders`

Dataloaders Submodule.

The dataloaders module lets users load datasets from Manify's datasets repo on Hugging Face.

The table below summarizes the available datasets, the data each one provides, and its original source.

Earlier versions of Manify included scripts to process raw data. These have been replaced by a single, centralized Hugging Face repo and the function `load_hf`. For transparency, the data generation code is preserved in the Dataset-Generation branch of Manify.

Dataset	Task	Distance Matrix	Features	Labels	Adjacency Matrix	Source/Citation
cities	none	✅	❌	❌	❌	Network Repository: Cities
cs_phds	regression	✅	❌	✅	✅	Network Repository: CS PhDs
polblogs	classification	✅	❌	✅	✅	Network Repository: Polblogs
polbooks	classification	✅	❌	✅	✅	Network Repository: Polbooks
cora	classification	✅	❌	✅	✅	Network Repository: Cora
citeseer	classification	✅	❌	✅	✅	Network Repository: Citeseer
karate_club	none	✅	❌	❌	✅	Network Repository: Karate
lesmis	none	✅	❌	❌	✅	Network Repository: Lesmis
adjnoun	none	✅	❌	❌	✅	Network Repository: Adjnoun
football	none	✅	❌	❌	✅	Network Repository: Football
dolphins	none	✅	❌	❌	✅	Network Repository: Dolphins
blood_cells	classification	❌	✅	✅	❌	Zheng et al. (2017), "Massively parallel digital transcriptional profiling of single cells": CD8+ Cytotoxic T-cells; CD8+/CD45RA+ Naive Cytotoxic T Cells; CD56+ Natural Killer Cells; CD4+ Helper T Cells; CD4+/CD45RO+ Memory T Cells; CD4+/CD45RA+/CD25- Naive T Cells; CD4+/CD25+ Regulatory T Cells; CD34+ Cells; CD19+ B Cells; CD14+ Monocytes
lymphoma	classification	❌	✅	✅	❌	10x Genomics: Hodgkin's Lymphoma; Healthy Donor PBMCs
cifar_100	classification	❌	✅	✅	❌	Hugging Face Datasets: CIFAR-100
mnist	classification	❌	✅	✅	❌	Hugging Face Datasets: MNIST
temperature	regression	❌	✅	✅	❌	[Citation]
landmasses	classification	❌	✅	✅	❌	Generated using basemap.is_land
neuron_33	classification	❌	✅	✅	❌	Allen Brain Atlas
neuron_46	classification	❌	✅	✅	❌	Allen Brain Atlas
traffic	regression	❌	✅	✅	❌	Kaggle: Traffic Prediction Dataset
qiita	none	✅	✅	❌	❌	NeuroSEED Git Repo
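
The availability columns above indicate which of the four values returned by `load_hf` should be expected to be `None`. As a minimal sketch, the lookup below transcribes a few rows of the table into a plain dictionary. The names `RETURN_ORDER`, `AVAILABLE`, and `missing_fields` are hypothetical and not part of Manify, and the mapping from table columns to return values (Distance Matrix to `dists`, Adjacency Matrix to `adj`) is an assumption based on the function's documented outputs.

```python
# Hypothetical helper, not part of Manify: a hand-transcribed subset of the
# table, used to see which load_hf outputs to expect as None.
RETURN_ORDER = ("features", "dists", "adj", "labels")

AVAILABLE = {
    "cities": {"dists"},
    "cs_phds": {"dists", "labels", "adj"},
    "cora": {"dists", "labels", "adj"},
    "karate_club": {"dists", "adj"},
    "mnist": {"features", "labels"},
    "qiita": {"dists", "features"},
}


def missing_fields(name: str) -> list[str]:
    """Return the outputs the table marks as unavailable, in return order."""
    present = AVAILABLE[name]
    return [field for field in RETURN_ORDER if field not in present]
```

For instance, `missing_fields("mnist")` lists `dists` and `adj`, matching the ❌ entries in that row.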

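`load_hf(name, namespace='manify')`

Load a dataset from the Hugging Face Hub repo at `{namespace}/{name}`. Here `name` is the dataset name and `namespace` is the Hub namespace, which defaults to `manify`.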

Returns:

features (Float[Tensor, 'n_points ...'] | None) – The features for each node, if any

dists (Float[Tensor, 'n_points n_points'] | None) – The pairwise distance matrix over all nodes, if any

adj (Float[Tensor, 'n_points n_points'] | None) – The adjacency matrix over all nodes, if any

labels (Real[Tensor, 'n_points'] | None) – The (classification or regression) labels for each node, if any
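
Because any of the four outputs can be `None`, callers generally need to check each one before use. The sketch below shows one way to do that. `summarize` is a hypothetical helper, not part of Manify; it takes the loader as an argument so it can be exercised without network access, and it assumes only that each non-`None` output exposes a `.shape` attribute, as torch tensors do.

```python
from typing import Callable, Optional

# Order in which load_hf returns its outputs.
FIELDS = ("features", "dists", "adj", "labels")


def summarize(name: str, loader: Callable[[str], tuple]) -> dict[str, Optional[tuple]]:
    """Map each output name to its shape, or to None when the output is absent."""
    outputs = loader(name)
    return {
        field: None if value is None else tuple(value.shape)
        for field, value in zip(FIELDS, outputs)
    }
```

With Manify installed, this would be called as `summarize("cora", load_hf)` after `from manify.utils.dataloaders import load_hf`.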

Source code in manify/utils/dataloaders.py

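# Imports this excerpt relies on (assumed; not shown in the original listing)
import torch
from datasets import load_dataset
from jaxtyping import Float, Real


def load_hf(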
    name: str, namespace: str = "manify"
) -> tuple[
    Float[torch.Tensor, "n_points ..."] | None,  # features
    Float[torch.Tensor, "n_points n_points"] | None,  # pairwise dists
    Float[torch.Tensor, "n_points n_points"] | None,  # adjacency matrix
    Real[torch.Tensor, "n_points"] | None,  # labels
]:
    """Load a dataset from HuggingFace Hub at {namespace}/{name}.

    Returns:
        features: The features for each node, if any
        dists: The pairwise distance matrix over all nodes, if any
        adj: The adjacency matrix over all nodes, if any
        labels: The (classification or regression) labels for each node, if any
    """
    # 1) fetch the single‑row dataset
    ds = load_dataset(f"{namespace}/{name}")
    # use the "train" split if available, else fall back to the first split
    data = ds["train"] if "train" in ds else next(iter(ds.values()))
    row = data[0]

    # 2) helper to turn lists → torch (or None)
    def to_tensor(key: str, dtype: torch.dtype) -> torch.Tensor | None:
        vals = row.get(key, [])
        if not vals:
            return None
        return torch.tensor(vals, dtype=dtype)

    # 3) reconstruct everything
    dists = to_tensor("distances", torch.float32)
    feats = to_tensor("features", torch.float32)
    adj = to_tensor("adjacency", torch.float32)

    cls_ls = row.get("classification_labels", [])
    reg_ls = row.get("regression_labels", [])
    if cls_ls:
        labels = torch.tensor(cls_ls, dtype=torch.int64)
    elif reg_ls:
        labels = torch.tensor(reg_ls, dtype=torch.float32)
    else:
        labels = None

    return feats, dists, adj, labels
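
The label-selection step in the source above can be isolated as follows. This is a sketch in plain Python over hand-written stand-in rows (the values are invented for illustration, and `pick_labels` is a hypothetical name, not part of Manify). It mirrors the precedence in `load_hf`: non-empty `classification_labels` win over `regression_labels`, and empty or missing lists yield `None`.

```python
from typing import Optional


def pick_labels(row: dict) -> Optional[tuple[str, list]]:
    """Mirror load_hf's precedence: classification, then regression, else None."""
    cls_ls = row.get("classification_labels", [])
    reg_ls = row.get("regression_labels", [])
    if cls_ls:
        return ("classification", cls_ls)  # load_hf builds an int64 tensor here
    if reg_ls:
        return ("regression", reg_ls)  # load_hf builds a float32 tensor here
    return None


# Invented stand-in rows, shaped like the single row that load_hf reads.
graph_row = {"distances": [[0.0, 1.0], [1.0, 0.0]], "classification_labels": [0, 1]}
unlabeled_row = {"distances": [[0.0, 1.0], [1.0, 0.0]], "regression_labels": []}
```

One consequence of this design is that the dtype of the returned `labels` tensor tells you the task type: `torch.int64` for classification and `torch.float32` for regression.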
