Dataloaders

manify.utils.dataloaders

Dataloaders Submodule.

The dataloaders module allows users to load datasets from Manify's datasets repo on Hugging Face.

The table below summarizes the available datasets, the data types each one provides, and their original sources.

Earlier versions of Manify included scripts to process raw data, which we have replaced with a single, centralized Hugging Face repo and the function load_hf. For transparency, we have preserved the data generation code in the Dataset-Generation branch of Manify.

| Dataset | Task | Distance Matrix | Features | Labels | Adjacency Matrix | Source/Citation |
|---|---|---|---|---|---|---|
| cities | none | | | | | Network Repository: Cities |
| cs_phds | regression | | | | | Network Repository: CS PhDs |
| polblogs | classification | | | | | Network Repository: Polblogs |
| polbooks | classification | | | | | Network Repository: Polbooks |
| cora | classification | | | | | Network Repository: Cora |
| citeseer | classification | | | | | Network Repository: Citeseer |
| karate_club | none | | | | | Network Repository: Karate |
| lesmis | none | | | | | Network Repository: Lesmis |
| adjnoun | none | | | | | Network Repository: Adjnoun |
| football | none | | | | | Network Repository: Football |
| dolphins | none | | | | | Network Repository: Dolphins |
| blood_cells | classification | | | | | See datasets from Zheng et al. (2017), Massively parallel digital transcriptional profiling of single cells: CD8+ Cytotoxic T-cells; CD8+/CD45RA+ Naive Cytotoxic T Cells; CD56+ Natural Killer Cells; CD4+ Helper T Cells; CD4+/CD45RO+ Memory T Cells; CD4+/CD45RA+/CD25- Naive T Cells; CD4+/CD25+ Regulatory T Cells; CD34+ Cells; CD19+ B Cells; CD14+ Monocytes |
| lymphoma | classification | | | | | See datasets from 10x Genomics: Hodgkin's Lymphoma; Healthy Donor PBMCs |
| cifar_100 | classification | | | | | Hugging Face Datasets: CIFAR-100 |
| mnist | classification | | | | | Hugging Face Datasets: MNIST |
| temperature | regression | | | | | [Citation] |
| landmasses | classification | | | | | Generated using basemap.is_land |
| neuron_33 | classification | | | | | Allen Brain Atlas |
| neuron_46 | classification | | | | | Allen Brain Atlas |
| traffic | regression | | | | | Kaggle: Traffic Prediction Dataset |
| qiita | none | | | | | NeuroSEED Git Repo |
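
When choosing datasets programmatically, the Task column above can be mirrored as a small lookup. The sketch below is illustrative only: `DATASET_TASKS` and `datasets_for_task` are hypothetical names that are not part of Manify, and the mapping is transcribed by hand from the table, so it may drift if the Hugging Face repo changes.

```python
# Hypothetical helper, not part of Manify.
# Task per dataset, transcribed by hand from the table above.
DATASET_TASKS: dict[str, str] = {
    "cities": "none",
    "cs_phds": "regression",
    "polblogs": "classification",
    "polbooks": "classification",
    "cora": "classification",
    "citeseer": "classification",
    "karate_club": "none",
    "lesmis": "none",
    "adjnoun": "none",
    "football": "none",
    "dolphins": "none",
    "blood_cells": "classification",
    "lymphoma": "classification",
    "cifar_100": "classification",
    "mnist": "classification",
    "temperature": "regression",
    "landmasses": "classification",
    "neuron_33": "classification",
    "neuron_46": "classification",
    "traffic": "regression",
    "qiita": "none",
}


def datasets_for_task(task: str) -> list[str]:
    """Return the dataset names whose task matches, in table order."""
    return [name for name, t in DATASET_TASKS.items() if t == task]


regression_sets = datasets_for_task("regression")
print(regression_sets)
```

Each returned name can then be passed as the `name` argument of `load_hf`.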

load_hf(name, namespace='manify')

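Load the dataset stored on the Hugging Face Hub at {namespace}/{name}. The namespace defaults to 'manify', and any component the dataset does not provide is returned as None.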

Returns:
  • features (Float[Tensor, 'n_points ...'] | None): The features for each node, if any.
  • dists (Float[Tensor, 'n_points n_points'] | None): The pairwise distance matrix over all nodes, if any.
  • adj (Float[Tensor, 'n_points n_points'] | None): The adjacency matrix over all nodes, if any.
  • labels (Real[Tensor, 'n_points'] | None): The (classification or regression) labels for each node, if any.
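
Because any of the four returned values may be None, callers usually need to check which components a dataset actually provides. The following is a minimal sketch under stated assumptions: `summarize_components` is a hypothetical helper, not part of Manify, and the demo uses stand-in objects with a `.shape` attribute so that it runs without network access. The real call appears only in the trailing comment.

```python
from types import SimpleNamespace
from typing import Optional

# Order in which load_hf returns its components.
COMPONENT_NAMES = ("features", "dists", "adj", "labels")


def summarize_components(result: tuple) -> dict[str, Optional[tuple]]:
    """Map each component name to its shape, or to None if it is absent."""
    if len(result) != len(COMPONENT_NAMES):
        raise ValueError(f"expected {len(COMPONENT_NAMES)} components, got {len(result)}")
    return {
        name: None if value is None else tuple(value.shape)
        for name, value in zip(COMPONENT_NAMES, result)
    }


# Stand-ins for tensors (anything exposing .shape works, including torch.Tensor).
fake_result = (
    SimpleNamespace(shape=(5, 3)),  # features
    None,                           # no distance matrix
    SimpleNamespace(shape=(5, 5)),  # adjacency
    SimpleNamespace(shape=(5,)),    # labels
)
summary = summarize_components(fake_result)
print(summary)

# With Manify installed and network access, the same helper could be used as:
# from manify.utils.dataloaders import load_hf
# print(summarize_components(load_hf("polblogs")))
```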

Source code in manify/utils/dataloaders.py
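# Imports assumed from the names used below; they are not part of this excerpt.
import torch
from datasets import load_dataset
from jaxtyping import Float, Real


def load_hf(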
    name: str, namespace: str = "manify"
) -> tuple[
    Float[torch.Tensor, "n_points ..."] | None,  # features
    Float[torch.Tensor, "n_points n_points"] | None,  # pairwise dists
    Float[torch.Tensor, "n_points n_points"] | None,  # adjacency
    Real[torch.Tensor, "n_points"] | None,  # labels
]:
    """Load a dataset from HuggingFace Hub at {namespace}/{name}.

    Returns:
        features: The features for each node, if any
        dists: The pairwise distance matrix over all nodes, if any
        adj: The adjacency matrix over all nodes, if any
        labels: The (classification or regression) labels for each node, if any
    """
    # 1) fetch the single-row dataset
    ds = load_dataset(f"{namespace}/{name}")
    # use the "train" split if available, else the first split present
    data = ds["train"] if "train" in ds else next(iter(ds.values()))
    row = data[0]

    # 2) helper to turn lists → torch (or None)
    def to_tensor(key: str, dtype: torch.dtype) -> torch.Tensor | None:
        vals = row.get(key, [])
        if not vals:
            return None
        return torch.tensor(vals, dtype=dtype)

    # 3) reconstruct everything
    dists = to_tensor("distances", torch.float32)
    feats = to_tensor("features", torch.float32)
    adj = to_tensor("adjacency", torch.float32)

    cls_ls = row.get("classification_labels", [])
    reg_ls = row.get("regression_labels", [])
    if cls_ls:
        labels = torch.tensor(cls_ls, dtype=torch.int64)
    elif reg_ls:
        labels = torch.tensor(reg_ls, dtype=torch.float32)
    else:
        labels = None

    return feats, dists, adj, labels
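
The decoding rules in the listing above (missing or empty fields become None, and classification labels take priority over regression labels) can be checked without downloading anything. The sketch below is a pure-Python mirror of that logic rather than the library's code: `decode_row` is a hypothetical name, it returns plain lists instead of tensors, and the sample row is made up for illustration.

```python
from typing import Any, Optional


def decode_row(row: dict[str, Any]) -> dict[str, Optional[list]]:
    """Mirror load_hf's decoding rules on a plain dict, returning lists or None."""

    def non_empty(key: str) -> Optional[list]:
        # A missing key, a None value, and an empty list all count as "absent".
        vals = row.get(key) or []
        return list(vals) if vals else None

    cls_ls = non_empty("classification_labels")
    reg_ls = non_empty("regression_labels")
    if cls_ls is not None:
        labels: Optional[list] = [int(v) for v in cls_ls]  # int64 in load_hf
    elif reg_ls is not None:
        labels = [float(v) for v in reg_ls]  # float32 in load_hf
    else:
        labels = None

    return {
        "features": non_empty("features"),
        "dists": non_empty("distances"),
        "adj": non_empty("adjacency"),
        "labels": labels,
    }


# Made-up row: distances and regression labels only.
sample_row = {
    "distances": [[0.0, 1.0], [1.0, 0.0]],
    "features": [],
    "classification_labels": [],
    "regression_labels": [0.5, 1.5],
}
decoded = decode_row(sample_row)
print(decoded)
```

One consequence of this design is that a dataset carrying both label types would expose only its classification labels.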