Utilities

`manify.utils`

Manify library utilities.

`benchmarks`

Implementation for benchmarking different product space machine learning methods.

`benchmark(X, y, pm, device='cpu', score=None, models=None, max_depth=5, n_estimators=12, min_samples_split=2, min_samples_leaf=1, task='classification', seed=None, use_special_dims=False, n_features='d_choose_2', X_train=None, X_test=None, y_train=None, y_test=None, batch_size=None, adj=None, A_train=None, A_test=None, epochs=4000, lr=0.0001, kappa_gcn_layers=1)`

Benchmarks various machine learning models on Riemannian manifold datasets.

Evaluates and compares different machine learning models on datasets with a product manifold structure, providing metrics for their performance.

Parameters:

X (Float[Tensor, 'batch dim']) –

Tensor of input features with shape (batch, dim).
y (Real[Tensor, 'batch']) –

Tensor of target labels with shape (batch,).
pm (ProductManifold) –

ProductManifold object defining the geometric structure for benchmarks.
device (Literal['cpu', 'cuda', 'mps'], default: 'cpu' ) –

Device for computation. Options: 'cpu', 'cuda', 'mps'. Defaults to 'cpu'.
score (list[SCORETYPE] | None, default: None ) –

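List of scoring metrics for model evaluation (e.g., 'accuracy', 'f1-micro'). If None, defaults to ['accuracy', 'f1-micro', 'f1-macro'], which are valid only for classification and link prediction; regression accepts 'mse', 'rmse', and 'percent_rmse'. 'time' is accepted for either task.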
models (list[MODELTYPE] | None, default: None ) –

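List of model names to evaluate. Options include:

* "sklearn_dt": Decision tree from scikit-learn
* "sklearn_rf": Random forest from scikit-learn
* "product_dt": Product space decision tree
* "product_rf": Product space random forest
* "tangent_dt": Decision tree on tangent space
* "tangent_rf": Random forest on tangent space
* "knn": k-nearest neighbors
* "ps_perceptron": Product space perceptron

If None, the default list in the source is used: the eight models above plus "svm", "ps_svm", "tangent_mlp", "ambient_mlp", "tangent_gcn", "ambient_gcn", "kappa_gcn", "ambient_mlr", "tangent_mlr", "kappa_mlr", and "single_manifold_rf" (whose code path is currently commented out).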
max_depth (int, default: 5 ) –

Maximum depth of tree-based models. Defaults to 5.
n_estimators (int, default: 12 ) –

Number of estimators for ensemble models. Defaults to 12.
min_samples_split (int, default: 2 ) –

Minimum samples required to split an internal node. Defaults to 2.
min_samples_leaf (int, default: 1 ) –

Minimum samples required in a leaf node. Defaults to 1.
task (TASKTYPE, default: 'classification' ) –

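Type of machine learning task. Options: 'classification' or 'regression'; the source also accepts 'link_prediction', which shares the classification metrics and model classes. Defaults to 'classification'.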
seed (int | None, default: None ) –

Random seed for reproducibility. Defaults to None.
use_special_dims (bool, default: False ) –

Whether to use special manifold dimensions. Defaults to False.
n_features (Literal['d', 'd_choose_2'], default: 'd_choose_2' ) –

Feature dimensionality type. Options: 'd' or 'd_choose_2'. Defaults to 'd_choose_2'.
X_train (Float[Tensor, 'n_samples n_manifolds'] | None, default: None ) –

Training feature tensor with shape (n_samples, n_manifolds). If X_train, X_test, y_train, and y_test are all provided, they replace the random 80/20 split of X and y. Defaults to None.
X_test (Float[Tensor, 'n_samples n_manifolds'] | None, default: None ) –

Testing feature tensor with shape (n_samples, n_manifolds). If provided, used with X_train. Defaults to None.
y_train (Real[Tensor, 'n_samples'] | None, default: None ) –

Training labels tensor with shape (n_samples,). Must be provided if X_train is given. Defaults to None.
y_test (Real[Tensor, 'n_samples'] | None, default: None ) –

Testing labels tensor with shape (n_samples,). Must be provided if X_test is given. Defaults to None.
batch_size (int | None, default: None ) –

Batch size for neural network models. Defaults to None.
adj (Float[Tensor, 'n_nodes n_nodes'] | None, default: None ) –

Adjacency matrix for graph-based models with shape (n_nodes, n_nodes). Defaults to None.
A_train (Float[Tensor, 'n_samples n_samples'] | None, default: None ) –

Training adjacency matrix with shape (n_samples, n_samples). Defaults to None.
A_test (Float[Tensor, 'n_samples n_samples'] | None, default: None ) –

Testing adjacency matrix with shape (n_samples, n_samples). Defaults to None.
hidden_dims –

List of hidden layer dimensions for neural networks. This argument does not appear in the function signature, so it cannot currently be passed.
epochs (int, default: 4000 ) –

Number of training epochs for iterative models. Defaults to 4000.
lr (float, default: 0.0001 ) –

Learning rate for gradient-based optimization. Defaults to 1e-4.
kappa_gcn_layers (int, default: 1 ) –

Number of layers in GCN models. Defaults to 1.

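Returns:	`dict[str, float]` – Flat dictionary keyed by `{model}_{metric}` (for example, `sklearn_dt_accuracy`), mapping each model and metric pair to its score. Every evaluated model also gets a `{model}_time` entry with its timing in seconds.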
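Because the source builds each key as `f"{model}_{metric}"`, the returned dictionary is flat. The sketch below mirrors that flattening and shows one way to regroup the result by model. `flatten_scores` and `regroup_scores` are hypothetical helpers, not part of manify, and the numbers are illustrative only.

```python
def flatten_scores(nested):
    """Flatten {model: {metric: value}} into {"model_metric": value}."""
    return {
        f"{model}_{metric}": value
        for model, metrics in nested.items()
        for metric, value in metrics.items()
    }


def regroup_scores(flat, metrics):
    """Rebuild {model: {metric: value}} from flat keys, given the metric names.

    Model names contain underscores (e.g. "sklearn_dt"), so each key is matched
    against known metric suffixes instead of being split on "_".
    """
    nested = {}
    # Try longer metric names first so "percent_rmse" wins over "rmse".
    ordered = sorted(metrics, key=len, reverse=True)
    for key, value in flat.items():
        for metric in ordered:
            suffix = f"_{metric}"
            if key.endswith(suffix):
                nested.setdefault(key[: -len(suffix)], {})[metric] = value
                break
    return nested


# Illustrative values only, not real benchmark results.
example = {
    "sklearn_dt": {"accuracy": 0.5, "time": 0.25},
    "knn": {"accuracy": 0.75, "time": 0.125},
}
flat = flatten_scores(example)
```

Matching on known suffixes is a design choice for this sketch: splitting on the last underscore would misread metrics such as `percent_rmse`.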

Source code in manify/utils/benchmarks.py

def benchmark(
    X: Float[torch.Tensor, "batch dim"],
    y: Real[torch.Tensor, "batch"],
    pm: ProductManifold,
    device: Literal["cpu", "cuda", "mps"] = "cpu",
    score: list[SCORETYPE] | None = None,
    models: list[MODELTYPE] | None = None,
    max_depth: int = 5,
    n_estimators: int = 12,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1,
    task: TASKTYPE = "classification",
    seed: int | None = None,
    use_special_dims: bool = False,
    n_features: Literal["d", "d_choose_2"] = "d_choose_2",
    X_train: Float[torch.Tensor, "n_samples n_manifolds"] | None = None,
    X_test: Float[torch.Tensor, "n_samples n_manifolds"] | None = None,
    y_train: Real[torch.Tensor, "n_samples"] | None = None,
    y_test: Real[torch.Tensor, "n_samples"] | None = None,
    batch_size: int | None = None,
    adj: Float[torch.Tensor, "n_nodes n_nodes"] | None = None,
    A_train: Float[torch.Tensor, "n_samples n_samples"] | None = None,
    A_test: Float[torch.Tensor, "n_samples n_samples"] | None = None,
    epochs: int = 4_000,
    lr: float = 1e-4,
    kappa_gcn_layers: int = 1,
) -> dict[str, float]:
    """Benchmarks various machine learning models on Riemannian manifold datasets.

    Evaluates and compares different machine learning models on datasets with a
    product manifold structure, providing metrics for their performance.

    Args:
        X: Tensor of input features with shape (batch, dim).
        y: Tensor of target labels with shape (batch,).
        pm: ProductManifold object defining the geometric structure for benchmarks.
        device: Device for computation. Options: 'cpu', 'cuda', 'mps'. Defaults to 'cpu'.
        score: List of scoring metrics for model evaluation (e.g., 'accuracy', 'f1-micro').
            Defaults to None.
        models: List of model names to evaluate. Options include:
            * "sklearn_dt": Decision tree from scikit-learn
            * "sklearn_rf": Random forest from scikit-learn
            * "product_dt": Product space decision tree
            * "product_rf": Product space random forest
            * "tangent_dt": Decision tree on tangent space
            * "tangent_rf": Random forest on tangent space
            * "knn": k-nearest neighbors
            * "ps_perceptron": Product space perceptron
            Defaults to None.
        max_depth: Maximum depth of tree-based models. Defaults to 5.
        n_estimators: Number of estimators for ensemble models. Defaults to 12.
        min_samples_split: Minimum samples required to split an internal node. Defaults to 2.
        min_samples_leaf: Minimum samples required in a leaf node. Defaults to 1.
        task: Type of machine learning task. Options: 'classification' or 'regression'.
            Defaults to 'classification'.
        seed: Random seed for reproducibility. Defaults to None.
        use_special_dims: Whether to use special manifold dimensions. Defaults to False.
        n_features: Feature dimensionality type. Options: 'd' or 'd_choose_2'.
            Defaults to 'd_choose_2'.
        X_train: Training feature tensor with shape (n_samples, n_manifolds).
            If provided, overrides split from X. Defaults to None.
        X_test: Testing feature tensor with shape (n_samples, n_manifolds).
            If provided, used with X_train. Defaults to None.
        y_train: Training labels tensor with shape (n_samples,).
            Must be provided if X_train is given. Defaults to None.
        y_test: Testing labels tensor with shape (n_samples,).
            Must be provided if X_test is given. Defaults to None.
        batch_size: Batch size for neural network models. Defaults to None.
        adj: Adjacency matrix for graph-based models with shape (n_nodes, n_nodes).
            Defaults to None.
        A_train: Training adjacency matrix with shape (n_samples, n_samples).
            Defaults to None.
        A_test: Testing adjacency matrix with shape (n_samples, n_samples).
            Defaults to None.
        hidden_dims: List of hidden layer dimensions for neural networks.
            Defaults to None.
        epochs: Number of training epochs for iterative models. Defaults to 4000.
        lr: Learning rate for gradient-based optimization. Defaults to 1e-4.
        kappa_gcn_layers: Number of layers in GCN models. Defaults to 1.

    Returns:
        Flat dictionary mapping `{model}_{metric}` keys to their corresponding evaluation scores.
    """
    score = score or ["accuracy", "f1-micro", "f1-macro"]
    models = models or [
        "sklearn_dt",
        "sklearn_rf",
        "product_dt",
        "product_rf",
        "tangent_dt",
        "tangent_rf",
        "knn",
        "ps_perceptron",
        "svm",
        "ps_svm",
        "tangent_mlp",
        "ambient_mlp",
        "tangent_gcn",
        "ambient_gcn",
        "kappa_gcn",
        "ambient_mlr",
        "tangent_mlr",
        "kappa_mlr",
        "single_manifold_rf",
    ]

    # Input validation on (task, score) pairing
    if task in {"classification", "link_prediction"}:
        assert all(s in {"accuracy", "f1-micro", "f1-macro", "time"} for s in score)
    elif task == "regression":
        assert all(s in {"mse", "rmse", "percent_rmse", "time"} for s in score)
    else:
        raise ValueError(f"Unknown task: {task}")

    # Make sure we're on the right device
    pm = pm.to(device)

    # Split data
    if X_train is not None and X_test is not None and y_train is not None and y_test is not None:
        # Coerce to tensor as needed
        if not torch.is_tensor(X_train):
            X_train = torch.tensor(X_train)
        if not torch.is_tensor(X_test):
            X_test = torch.tensor(X_test)
        if not torch.is_tensor(y_train):
            y_train = torch.tensor(y_train)
        if not torch.is_tensor(y_test):
            y_test = torch.tensor(y_test)

        # Move to device
        X_train = X_train.to(device)
        X_test = X_test.to(device)
        y_train = y_train.to(device)
        y_test = y_test.to(device)

        # Get X and y
        X = torch.cat([X_train, X_test])
        y = torch.cat([y_train, y_test])
        train_idx = np.arange(len(X_train))
        test_idx = np.arange(len(X_train), len(X))

    else:
        # Coerce to tensor as needed
        if not torch.is_tensor(X):
            X = torch.tensor(X)
        if not torch.is_tensor(y):
            y = torch.tensor(y)

        X = X.to(device)
        y = y.to(device)

        # Pass the seed so that the random split is reproducible
        X_train, X_test, y_train, y_test, train_idx, test_idx = train_test_split(
            X, y, np.arange(len(X)), test_size=0.2, random_state=seed
        )

    # Make sure classification labels are formatted correctly
    if task in {"classification", "link_prediction"}:
        y = torch.unique(y, return_inverse=True)[1]
        y_train = y[train_idx]
        y_test = y[test_idx]

    # Make sure everything is detached
    X, X_train, X_test = X.detach(), X_train.detach(), X_test.detach()
    y, y_train, y_test = y.detach(), y_train.detach(), y_test.detach()

    # Get pdists
    pdists = pm.pdist(X).detach()

    # Get tangent plane
    X_train_tangent = pm.logmap(X_train).detach()
    X_test_tangent = pm.logmap(X_test).detach()

    # Get numpy versions
    X_train_np, X_test_np = X_train.detach().cpu().numpy(), X_test.detach().cpu().numpy()
    y_train_np, y_test_np = y_train.detach().cpu().numpy(), y_test.detach().cpu().numpy()
    X_train_tangent_np, X_test_tangent_np = X_train_tangent.cpu().numpy(), X_test_tangent.cpu().numpy()

    # Get stereographic version
    pm_stereo, X_train_stereo, X_test_stereo = pm.stereographic(X_train, X_test)
    assert isinstance(X_train_stereo, torch.Tensor)
    X_train_stereo = X_train_stereo.detach()
    assert isinstance(X_test_stereo, torch.Tensor)
    X_test_stereo = X_test_stereo.detach()

    # Also build a Euclidean "product manifold" with a single zero-curvature component
    pm_euc = ProductManifold(signature=[(0.0, X.shape[1])], device=device, stereographic=True)

    # Get A_hat
    if adj is not None:
        A_hat = get_A_hat(adj).detach()
    else:
        dists = pdists**2
        dists_train = dists[train_idx][:, train_idx]
        dists /= dists_train[torch.isfinite(dists_train)].max()
        A_hat = get_A_hat(dists).detach()
    A_hat = A_hat.to(device)

    if A_train is None and A_test is None:
        A_train = A_hat[train_idx][:, train_idx].detach()
        A_test = A_hat[test_idx][:, test_idx].detach()
    else:
        assert A_train is not None
        assert A_test is not None
        A_train = A_train.to(device).detach()
        A_test = A_test.to(device).detach()

    # Aggregate arguments
    tree_kwargs = {"max_depth": max_depth, "min_samples_leaf": min_samples_leaf, "min_samples_split": min_samples_split}
    prod_kwargs = {"use_special_dims": use_special_dims, "n_features": n_features, "batch_size": batch_size}
    rf_kwargs = {"n_estimators": n_estimators, "n_jobs": -1, "random_state": seed}
    nn_outdim = 1 if task == "regression" else len(torch.unique(y))
    nn_kwargs = {"task": task, "output_dim": nn_outdim}
    nn_train_kwargs = {"epochs": epochs, "lr": lr}

    # Define your models
    if task in {"classification", "link_prediction"}:
        dt_class = DecisionTreeClassifier
        rf_class = RandomForestClassifier
        knn_class = KNeighborsClassifier
        svm_class = SVC

    else:  # task == "regression"
        dt_class = DecisionTreeRegressor
        rf_class = RandomForestRegressor
        knn_class = KNeighborsRegressor
        svm_class = SVR

    # Evaluate sklearn
    accs: dict[MODELTYPE, dict[SCORETYPE, float]] = {}
    if "sklearn_dt" in models:
        dt = dt_class(**tree_kwargs)
        t1 = time.time()
        dt.fit(X_train_np, y_train_np)
        t2 = time.time()
        accs["sklearn_dt"] = _score(X_test_np, y_test_np, dt, use_torch=False, score=score)
        accs["sklearn_dt"]["time"] = t2 - t1

    if "sklearn_rf" in models:
        rf = rf_class(**tree_kwargs, **rf_kwargs)
        t1 = time.time()
        rf.fit(X_train_np, y_train_np)
        t2 = time.time()
        accs["sklearn_rf"] = _score(X_test_np, y_test_np, rf, use_torch=False, score=score)
        accs["sklearn_rf"]["time"] = t2 - t1

    if "product_dt" in models:
        psdt = ProductSpaceDT(pm=pm, task=task, **tree_kwargs, **prod_kwargs)  # type: ignore
        t1 = time.time()
        psdt.fit(X_train, y_train)
        t2 = time.time()
        accs["product_dt"] = _score(X_test, y_test_np, psdt, use_torch=True, score=score)
        accs["product_dt"]["time"] = t2 - t1

    if "product_rf" in models:
        psrf = ProductSpaceRF(pm=pm, task=task, **tree_kwargs, **rf_kwargs, **prod_kwargs)  # type: ignore
        t1 = time.time()
        psrf.fit(X_train, y_train)
        t2 = time.time()
        accs["product_rf"] = _score(X_test, y_test_np, psrf, use_torch=True, score=score)
        accs["product_rf"]["time"] = t2 - t1

    # if "single_manifold_rf" in models:
    #     smrf = SingleManifoldEnsembleRF(pm=pm, task=task, n_estimators=n_estimators)
    #     t1 = time.time()
    #     smrf.fit(X_train, y_train)
    #     t2 = time.time()
    #     accs["single_manifold_rf"] = _score(X_test, y_test_np, smrf, torch=True, score=score)
    #     accs["single_manifold_rf"]["time"] = t2 - t1

    if "tangent_dt" in models:
        tdt = dt_class(**tree_kwargs)
        t1 = time.time()
        tdt.fit(X_train_tangent_np, y_train_np)
        t2 = time.time()
        accs["tangent_dt"] = _score(X_test_tangent_np, y_test_np, tdt, use_torch=False, score=score)
        accs["tangent_dt"]["time"] = t2 - t1

    if "tangent_rf" in models:
        trf = rf_class(**tree_kwargs, **rf_kwargs)
        t1 = time.time()
        trf.fit(X_train_tangent_np, y_train_np)
        t2 = time.time()
        accs["tangent_rf"] = _score(X_test_tangent_np, y_test_np, trf, use_torch=False, score=score)
        accs["tangent_rf"]["time"] = t2 - t1

    if "knn" in models:
        # Get dists - max imputation is a workaround for some nan values we occasionally get
        t1 = time.time()
        train_dists = pm.pdist(X_train)
        train_dists = torch.nan_to_num(train_dists, nan=train_dists[~train_dists.isnan()].max().item())
        train_test_dists = pm.dist(X_test, X_train)
        train_test_dists = torch.nan_to_num(
            train_test_dists,
            nan=train_test_dists[~train_test_dists.isnan()].max().item(),
        )

        # Convert to numpy
        train_dists = train_dists.detach().cpu().numpy()
        train_test_dists = train_test_dists.detach().cpu().numpy()

        # Train classifier on distances
        knn = knn_class(metric="precomputed")
        t2 = time.time()
        knn.fit(train_dists, y_train_np)
        t3 = time.time()
        accs["knn"] = _score(train_test_dists, y_test_np, knn, use_torch=False, score=score)
        accs["knn"]["time"] = t3 - t1

    # if "perceptron" in models:
    #     loss = "perceptron" if task == "classification" else "squared_error"
    #     ptron = perceptron_class(
    #         loss=loss,
    #         learning_rate="constant",
    #         fit_intercept=False,
    #         eta0=1.0,
    #         max_iter=10_000,
    #     )  # fit_intercept must be false for ambient coordinates
    #     t1 = time.time()
    #     ptron.fit(X_train_np, y_train_np)
    #     t2 = time.time()
    #     accs["perceptron"] = _score(X_test_np, y_test_np, ptron, torch=False, score=score)
    #     accs["perceptron"]["time"] = t2 - t1

    if "ps_perceptron" in models:
        if task == "classification":
            ps_per = ProductSpacePerceptron(pm=pm)
            t1 = time.time()
            ps_per.fit(X_train, y_train)
            t2 = time.time()
            accs["ps_perceptron"] = _score(X_test, y_test_np, ps_per, use_torch=True, score=score)
            accs["ps_perceptron"]["time"] = t2 - t1
        else:
            warnings.warn("Product Space Perceptron is only implemented for classification tasks.", stacklevel=2)

    if "svm" in models:
        # Get inner products for precomputed kernel matrix
        t1 = time.time()
        train_ips = pm.manifold.component_inner(X_train[:, None], X_train[None, :]).sum(dim=-1)
        train_test_ips = pm.manifold.component_inner(X_test[:, None], X_train[None, :]).sum(dim=-1)

        # Convert to numpy
        train_ips = train_ips.detach().cpu().numpy()
        train_test_ips = train_test_ips.detach().cpu().numpy()

        # Train SVM on precomputed inner products
        svm = svm_class(kernel="precomputed", max_iter=10_000)
        # Need max_iter because it can hang. It can be large, since this doesn't happen often.
        t2 = time.time()
        svm.fit(train_ips, y_train_np)
        t3 = time.time()
        accs["svm"] = _score(train_test_ips, y_test_np, svm, use_torch=False, score=score)
        accs["svm"]["time"] = t3 - t1

    if "ps_svm" in models:
        ps_svm = ProductSpaceSVM(pm=pm, task=task, h_constraints=False, e_constraints=False)  # type: ignore
        t1 = time.time()
        ps_svm.fit(X_train, y_train)
        t2 = time.time()
        accs["ps_svm"] = _score(X_test, y_test_np, ps_svm, use_torch=False, score=score)
        accs["ps_svm"]["time"] = t2 - t1

    if "kappa_mlp" in models:
        assert isinstance(X_test_stereo, torch.Tensor)
        kappa_mlp = KappaGCN(
            pm=pm_stereo,
            num_hidden=kappa_gcn_layers,
            task=task,
            output_dim=nn_outdim,  # type: ignore
        ).to(device)
        t1 = time.time()
        if task == "link_prediction":
            kappa_mlp.fit(X_train_stereo, y_train, A=A_train, tqdm_prefix="kappa_mlp", **nn_train_kwargs)
        else:
            kappa_mlp.fit(X_train_stereo, y_train, A=None, tqdm_prefix="kappa_mlp", **nn_train_kwargs)
        t2 = time.time()
        y_pred = kappa_mlp.predict(X_test_stereo, A=None)
        accs["kappa_mlp"] = _score(None, y_test_np, kappa_mlp, y_pred_override=y_pred, use_torch=True, score=score)
        accs["kappa_mlp"]["time"] = t2 - t1

    if "ambient_mlp" in models:
        ambient_mlp = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        ambient_mlp.fit(X_train, y_train, A=None, tqdm_prefix="ambient_mlp", **nn_train_kwargs)
        t2 = time.time()
        y_pred = ambient_mlp.predict(X_test, A=None)
        accs["ambient_mlp"] = _score(None, y_test_np, ambient_mlp, y_pred_override=y_pred, use_torch=True, score=score)
        accs["ambient_mlp"]["time"] = t2 - t1

    if "tangent_mlp" in models:
        tangent_mlp = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        tangent_mlp.fit(X_train_tangent, y_train, A=None, tqdm_prefix="tangent_mlp", **nn_train_kwargs)
        t2 = time.time()
        y_pred = tangent_mlp.predict(X_test_tangent, A=None)
        accs["tangent_mlp"] = _score(None, y_test_np, tangent_mlp, y_pred_override=y_pred, use_torch=True, score=score)
        accs["tangent_mlp"]["time"] = t2 - t1

    if "ambient_gcn" in models:
        ambient_gcn = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        ambient_gcn.fit(X_train, y_train, A=A_train, **nn_train_kwargs)
        t2 = time.time()
        y_pred = ambient_gcn.predict(X_test, A=A_test)
        accs["ambient_gcn"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["ambient_gcn"]["time"] = t2 - t1

    if "tangent_gcn" in models:
        tangent_gcn = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        tangent_gcn.fit(X_train_tangent, y_train, A=A_train, tqdm_prefix="tangent_gcn", **nn_train_kwargs)
        t2 = time.time()
        y_pred = tangent_gcn.predict(X_test_tangent, A=A_test)
        accs["tangent_gcn"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["tangent_gcn"]["time"] = t2 - t1

    if "kappa_gcn" in models:
        assert isinstance(X_test_stereo, torch.Tensor)
        kappa_gcn = KappaGCN(pm=pm_stereo, num_hidden=kappa_gcn_layers, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        kappa_gcn.fit(X_train_stereo, y_train, A=A_train, tqdm_prefix="kappa_gcn", **nn_train_kwargs)
        t2 = time.time()
        y_pred = kappa_gcn.predict(X_test_stereo, A=A_test)
        accs["kappa_gcn"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["kappa_gcn"]["time"] = t2 - t1

    if "kappa_mlr" in models:
        kappa_mlr = KappaGCN(pm=pm_stereo, num_hidden=0, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        kappa_mlr.fit(X_train_stereo, y_train, A=None, tqdm_prefix="kappa_mlr", **nn_train_kwargs)
        t2 = time.time()
        y_pred = kappa_mlr.predict(X_test_stereo, A=None)
        accs["kappa_mlr"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["kappa_mlr"]["time"] = t2 - t1

    if "tangent_mlr" in models:
        tangent_mlr = KappaGCN(pm=pm_euc, num_hidden=0, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        tangent_mlr.fit(X_train_tangent, y_train, A=None, tqdm_prefix="tangent_mlr", **nn_train_kwargs)
        t2 = time.time()
        y_pred = tangent_mlr.predict(X_test_tangent, A=None)
        accs["tangent_mlr"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["tangent_mlr"]["time"] = t2 - t1

    if "ambient_mlr" in models:
        ambient_mlr = KappaGCN(pm=pm_euc, num_hidden=0, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        ambient_mlr.fit(X_train, y_train, A=None, tqdm_prefix="ambient_mlr", **nn_train_kwargs)
        t2 = time.time()
        y_pred = ambient_mlr.predict(X_test, A=None)
        accs["ambient_mlr"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["ambient_mlr"]["time"] = t2 - t1

    # return accs
    return {
        **{
            f"{model}_{metric}": value
            for model, metrics in accs.items()
            if isinstance(metrics, dict)
            for metric, value in metrics.items()
        },
        **{k: v for k, v in accs.items() if not isinstance(v, dict)},  # type: ignore
    }
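
The (task, score) validation near the top of `benchmark` can be sketched as a standalone helper. The allowed metric sets are copied from the source above; `check_task_and_scores` is a hypothetical name, and raising `ValueError` for a bad metric is a swapped-in choice, since the source uses `assert` for that case.

```python
CLASSIFICATION_SCORES = {"accuracy", "f1-micro", "f1-macro", "time"}
REGRESSION_SCORES = {"mse", "rmse", "percent_rmse", "time"}


def check_task_and_scores(task, score):
    """Raise ValueError if any metric in `score` does not match `task`."""
    if task in {"classification", "link_prediction"}:
        allowed = CLASSIFICATION_SCORES
    elif task == "regression":
        allowed = REGRESSION_SCORES
    else:
        raise ValueError(f"Unknown task: {task}")

    bad = [s for s in score if s not in allowed]
    if bad:
        raise ValueError(f"Metrics {bad} are not valid for task {task!r}")
    return True


# The default metrics are classification metrics, so a regression run
# needs its metrics passed explicitly.
check_task_and_scores("regression", ["mse", "rmse", "time"])
```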

`dataloaders`

Dataloaders Submodule

The dataloaders module allows users to load datasets from Manify's datasets repo on Hugging Face.

The table below summarizes the available datasets, the data each one provides, and their original sources.

Earlier versions of Manify included scripts to process raw data, which we have replaced with a single, centralized Hugging Face repo and the function load_hf. For transparency, we have preserved the data generation code in the Dataset-Generation branch of Manify.

Dataset	Task	Distance Matrix	Features	Labels	Adjacency Matrix	Source/Citation
cities	none	✅	❌	❌	❌	Network Repository: Cities
cs_phds	regression	✅	❌	✅	✅	Network Repository: CS PhDs
polblogs	classification	✅	❌	✅	✅	Network Repository: Polblogs
polbooks	classification	✅	❌	✅	✅	Network Repository: Polbooks
cora	classification	✅	❌	✅	✅	Network Repository: Cora
citeseer	classification	✅	❌	✅	✅	Network Repository: Citeseer
karate_club	none	✅	❌	❌	✅	Network Repository: Karate
lesmis	none	✅	❌	❌	✅	Network Repository: Lesmis
adjnoun	none	✅	❌	❌	✅	Network Repository: Adjnoun
football	none	✅	❌	❌	✅	Network Repository: Football
dolphins	none	✅	❌	❌	✅	Network Repository: Dolphins
blood_cells	classification	❌	✅	✅	❌	See datasets from Zheng et al. (2017), "Massively parallel digital transcriptional profiling of single cells": CD8+ Cytotoxic T-cells; CD8+/CD45RA+ Naive Cytotoxic T Cells; CD56+ Natural Killer Cells; CD4+ Helper T Cells; CD4+/CD45RO+ Memory T Cells; CD4+/CD45RA+/CD25- Naive T Cells; CD4+/CD25+ Regulatory T Cells; CD34+ Cells; CD19+ B Cells; CD14+ Monocytes
lymphoma	classification	❌	✅	✅	❌	See datasets from 10x Genomics: Hodgkin's Lymphoma; Healthy Donor PBMCs
cifar_100	classification	❌	✅	✅	❌	Hugging Face Datasets: CIFAR-100
mnist	classification	❌	✅	✅	❌	Hugging Face Datasets: MNIST
temperature	regression	❌	✅	✅	❌	[Citation]
landmasses	classification	❌	✅	✅	❌	Generated using basemap.is_land
neuron_33	classification	❌	✅	✅	❌	Allen Brain Atlas
neuron_46	classification	❌	✅	✅	❌	Allen Brain Atlas
traffic	regression	❌	✅	✅	❌	Kaggle: Traffic Prediction Dataset
qiita	none	✅	✅	❌	❌	NeuroSEED Git Repo

`load_hf(name, namespace='manify')`

Load a dataset from HuggingFace Hub at {namespace}/{name}.

Returns:

features (Float[Tensor, 'n_points ...'] | None) –

The features for each node, if any.
dists (Float[Tensor, 'n_points n_points'] | None) –

The pairwise distance matrix over all nodes, if any.
adj (Float[Tensor, 'n_points n_points'] | None) –

The adjacency matrix over all nodes, if any.
labels (Real[Tensor, 'n_points'] | None) –

The (classification or regression) labels for each node, if any.
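
Since any of the four returned entries may be None, it helps to check which ones a dataset actually provides before using them. The sketch below assumes the return order shown above (features, dists, adj, labels); `available_fields` is a hypothetical helper, and a stand-in tuple replaces the real call so the snippet runs offline.

```python
FIELD_NAMES = ("features", "dists", "adj", "labels")


def available_fields(parts):
    """Return the names of the entries in a load_hf-style 4-tuple that are not None."""
    return [name for name, part in zip(FIELD_NAMES, parts) if part is not None]


# With manify installed and network access, the tuple would come from, e.g.:
#     from manify.utils.dataloaders import load_hf
#     features, dists, adj, labels = load_hf("polblogs")
# A stand-in tuple with the same layout (no features) is used here instead.
stand_in = (None, [[0.0, 1.0], [1.0, 0.0]], [[0, 1], [1, 0]], [0, 1])
print(available_fields(stand_in))  # ['dists', 'adj', 'labels']
```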

Source code in manify/utils/dataloaders.py

def load_hf(
    name: str, namespace: str = "manify"
) -> tuple[
    Float[torch.Tensor, "n_points ..."] | None,  # features
    Float[torch.Tensor, "n_points n_points"] | None,  # pairwise dists
    Float[torch.Tensor, "n_points n_points"] | None,  # adjacency matrix
    Real[torch.Tensor, "n_points"] | None,  # labels
]:
    """Load a dataset from HuggingFace Hub at {namespace}/{name}.

    Returns:
        features: The features for each node, if any
        dists: The pairwise distance matrix over all nodes, if any
        adj: The adjacency matrix over all nodes, if any
        labels: The (classification or regression) labels for each node, if any
    """
    # 1) fetch the single‑row dataset
    ds = load_dataset(f"{namespace}/{name}")
    data = ds.get("train", ds)  # use "train" split if available, else the only split
    row = data[0]

    # 2) helper to turn lists → torch (or None)
    def to_tensor(key: str, dtype: torch.dtype) -> torch.Tensor | None:
        vals = row.get(key, [])
        if not vals:
            return None
        return torch.tensor(vals, dtype=dtype)

    # 3) reconstruct everything
    dists = to_tensor("distances", torch.float32)
    feats = to_tensor("features", torch.float32)
    adj = to_tensor("adjacency", torch.float32)

    cls_ls = row.get("classification_labels", [])
    reg_ls = row.get("regression_labels", [])
    if cls_ls:
        labels = torch.tensor(cls_ls, dtype=torch.int64)
    elif reg_ls:
        labels = torch.tensor(reg_ls, dtype=torch.float32)
    else:
        labels = None

    return feats, dists, adj, labels
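
The label handling above gives classification labels precedence over regression labels and falls back to None when neither is present. A minimal sketch of that precedence, using plain Python lists instead of tensors so that it runs without torch; `pick_labels` and the example rows are hypothetical.

```python
def pick_labels(row):
    """Return (kind, values) using the same precedence as the source above."""
    cls_ls = row.get("classification_labels", [])
    reg_ls = row.get("regression_labels", [])
    if cls_ls:
        # Classification labels are stored as integers (int64 in the source).
        return "classification", [int(v) for v in cls_ls]
    if reg_ls:
        # Regression labels are stored as floats (float32 in the source).
        return "regression", [float(v) for v in reg_ls]
    return None, None


print(pick_labels({"classification_labels": [0, 1, 1]}))
print(pick_labels({"regression_labels": [2, 3]}))
print(pick_labels({"distances": [[0.0]]}))
```

An empty list and a missing key are treated the same way, which matches the `if not vals` check in the `to_tensor` helper.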

`link_prediction`

Preprocessing datasets for link prediction.

`make_link_prediction_dataset(X_embed, pm, adj, add_dists=True)`

Preprocess a graph link prediction task into a binary classification problem on a new product manifold.

This function constructs a dataset for link prediction by creating pairwise embeddings from the input node embeddings, optionally appending pairwise distances, and returning labels from an adjacency matrix. It also updates the manifold signature correspondingly.

Parameters:

X_embed (Float[Tensor, 'batch n_dim']) –

Node embeddings.
pm (ProductManifold) –

The manifold on which the embeddings lie.
adj (Float[Tensor, 'batch batch']) –

A binary adjacency matrix indicating edges between nodes.
add_dists (bool, default: True ) –

If True, appends pairwise distances to the feature vectors. Defaults to True.

Returns:

X (Float[Tensor, 'batch**2 n_dim*2']) –

Node-pair embeddings in \(\mathcal{M} \times \mathcal{M}\), with one extra column of pairwise distances when add_dists is True.
y (Float[Tensor, 'batch**2']) –

Edge labels derived from the adjacency matrix.
new_pm (ProductManifold) –

A new instance of `ProductManifold` with an updated signature reflecting the feature space \(\mathcal{M} \times \mathcal{M}\).

Source code in manify/utils/link_prediction.py

def make_link_prediction_dataset(
    X_embed: Float[torch.Tensor, "batch n_dim"],
    pm: ProductManifold,
    adj: Float[torch.Tensor, "batch batch"],
    add_dists: bool = True,
) -> tuple[Float[torch.Tensor, "batch**2 n_dim*2"], Float[torch.Tensor, "batch**2"], ProductManifold]:
    r"""Preprocess a graph link prediction task into a binary classification problem on a new product manifold.

    This function constructs a dataset for link prediction by creating pairwise embeddings from the input node
    embeddings, optionally appending pairwise distances, and returning labels from an adjacency matrix. It also updates
    the manifold signature correspondingly.

    Args:
        X_embed: Node embeddings.
        pm: The manifold on which the embeddings lie.
        adj: A binary adjacency matrix indicating edges between nodes.
        add_dists: If True, appends pairwise distances to the feature vectors. Default is True.

    Returns:
        X: Node-pair embeddings in $\mathcal{M} \times \mathcal{M}$
        y: Edge labels derived from the adjacency matrix.
        new_pm: A new instance of `ProductManifold` with an updated signature reflecting the feature space
            $\mathcal{M} \times \mathcal{M}$.

    """
    # Stack embeddings
    X = torch.stack([torch.cat([X_i, X_j]) for X_i in X_embed for X_j in X_embed])

    # Add distances
    if add_dists:
        dists = pm.pdist(X_embed)
        X = torch.cat([X, dists.flatten().unsqueeze(1)], dim=1)

    y = adj.flatten()

    # Binarize y
    y = (y > 0).long()

    # Make a new signature
    new_sig = pm.signature + pm.signature
    if add_dists:
        new_sig.append((0.0, 1))
    new_pm = ProductManifold(signature=new_sig)

    return X, y, new_pm
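
The row ordering produced by the nested comprehension matters when matching features to labels: row `k = i * n + j` holds the pair `(i, j)`, which is the same row-major order that `adj.flatten()` uses. The following is a minimal plain-Python sketch of that layout, not the library code. It uses lists instead of tensors and a Euclidean distance as a stand-in for `pm.pdist`; both are assumptions made for illustration, as are the toy values.

```python
import math

# Toy 2-D embeddings for three nodes (illustrative values, not from manify)
X_embed = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
# Weighted adjacency; any entry > 0 counts as an edge after binarization
adj = [
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0],
]
n = len(X_embed)

rows, labels = [], []
for i in range(n):
    for j in range(n):
        # Euclidean distance: an assumed stand-in for pm.pdist
        dist = math.dist(X_embed[i], X_embed[j])
        # Concatenate both endpoints, then append the distance column
        rows.append(X_embed[i] + X_embed[j] + [dist])
        labels.append(int(adj[i][j] > 0))

# The pair (i=1, j=2) lives at flat index i * n + j
k = 1 * n + 2
print(rows[k], labels[k])
```

Each row has `2 * n_dim + 1` entries here because the distance column is appended, which is why the real function also adds one Euclidean dimension to the new signature.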

`split_link_prediction_dataset(X, y, test_size=0.2, downsample=None, random_state=None, **kwargs)` ¶

Split a link prediction dataset into train and test sets.

Parameters:

X (Float[Tensor, 'n_pairs n_dims']) –

Node-pair embeddings of shape (n_nodes^2, n_dims).
y (Int[Tensor, 'n_pairs']) –

Edge labels of shape (n_nodes^2,).
test_size (float, default: 0.2 ) –

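Proportion of nodes to include in the test set.
downsample (int | None, default: None ) –

If provided, keep at most this many positive and this many negative pairs. Unselected pairs are zeroed out rather than removed, so the dataset keeps its shape.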
random_state (int | None, default: None ) –

Random seed for reproducibility.
**kwargs (Any, default: {} ) –

Additional arguments for train_test_split.

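Returns:	`tuple[Float[Tensor, '... n_dims'], Float[Tensor, '... n_dims'], Int[Tensor, '...'], Int[Tensor, '...'], Int[Tensor, '...'], Int[Tensor, '...']]` – Tuple of (X_train, X_test, y_train, y_test, idx_train, idx_test). The index tensors hold node indices, and each split contains only pairs whose two nodes both belong to that split.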

Source code in manify/utils/link_prediction.py

def split_link_prediction_dataset(
    X: Float[torch.Tensor, "n_pairs n_dims"],
    y: Int[torch.Tensor, "n_pairs"],
    test_size: float = 0.2,
    downsample: int | None = None,
    random_state: int | None = None,
    **kwargs: Any,
) -> tuple[
    Float[torch.Tensor, "... n_dims"],
    Float[torch.Tensor, "... n_dims"],
    Int[torch.Tensor, "..."],
    Int[torch.Tensor, "..."],
    Int[torch.Tensor, "..."],
    Int[torch.Tensor, "..."],
]:
    """Split a link prediction dataset into train and test sets.

    Args:
        X: Node-pair embeddings of shape (n_nodes^2, n_dims).
        y: Edge labels of shape (n_nodes^2,).
        test_size: Proportion of nodes to include in test set.
        downsample: If provided, downsample to this many pos/neg pairs each.
        random_state: Random seed for reproducibility.
        **kwargs: Additional arguments for train_test_split.

    Returns:
        Tuple of (X_train, X_test, y_train, y_test, idx_train, idx_test).
    """
    if random_state is not None:
        torch.manual_seed(random_state)

    n_pairs, n_dims = X.shape
    n_nodes = int(n_pairs**0.5)
    assert n_nodes**2 == n_pairs, f"Expected {n_nodes}^2 = {n_nodes**2} pairs, got {n_pairs}"

    # Downsample if requested (before split to maintain structure)
    if downsample is not None:
        pos_mask = y == 1
        neg_mask = y == 0

        pos_indices = torch.where(pos_mask)[0]
        neg_indices = torch.where(neg_mask)[0]

        # Sample up to 'downsample' examples from each class
        n_pos = min(len(pos_indices), downsample)
        n_neg = min(len(neg_indices), downsample)

        sampled_pos = pos_indices[torch.randperm(len(pos_indices))[:n_pos]]
        sampled_neg = neg_indices[torch.randperm(len(neg_indices))[:n_neg]]

        # Create a mask for selected pairs
        mask = torch.zeros(n_pairs, dtype=torch.bool)
        mask[sampled_pos] = True
        mask[sampled_neg] = True

        # Zero out unselected pairs
        X_filtered = X.clone()
        y_filtered = y.clone()
        X_filtered[~mask] = 0
        y_filtered[~mask] = 0
    else:
        X_filtered = X
        y_filtered = y

    # Reshape to adjacency format
    X_adj = X_filtered.view(n_nodes, n_nodes, n_dims)
    y_adj = y_filtered.view(n_nodes, n_nodes)

    # Split nodes into train/test
    node_indices = torch.arange(n_nodes)
    idx_train, idx_test = train_test_split(node_indices, test_size=test_size, random_state=random_state, **kwargs)

    # Extract train and test subgraphs and flatten
    X_train = X_adj[idx_train][:, idx_train].reshape(-1, n_dims)
    y_train = y_adj[idx_train][:, idx_train].reshape(-1)

    X_test = X_adj[idx_test][:, idx_test].reshape(-1, n_dims)
    y_test = y_adj[idx_test][:, idx_test].reshape(-1)

    return X_train, X_test, y_train, y_test, idx_train, idx_test
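
Because the split is over nodes rather than pairs, only pairs whose two endpoints fall on the same side survive; pairs connecting a train node to a test node appear in neither set. A small plain-Python sketch of that bookkeeping follows, with a hand-picked node partition standing in for `train_test_split` (an assumption for illustration):

```python
n_nodes = 5
# Hand-picked partition; the real function draws these at random
idx_train = [0, 2, 4]
idx_test = [1, 3]

# Flat row indices into the (n_nodes**2)-row pair dataset, row-major: k = i * n_nodes + j
train_rows = [i * n_nodes + j for i in idx_train for j in idx_train]
test_rows = [i * n_nodes + j for i in idx_test for j in idx_test]

# Pairs with one endpoint on each side are not used at all
dropped = n_nodes**2 - len(train_rows) - len(test_rows)
print(len(train_rows), len(test_rows), dropped)
```

With three train nodes and two test nodes, the train set has 9 pairs, the test set has 4, and the remaining 12 cross pairs are discarded. This keeps test nodes entirely unseen during training, at the cost of using fewer pairs.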

`visualization` ¶

Manify visualization utilities.

`hyperboloid_to_poincare(X)` ¶

Convert hyperboloid coordinates to Poincaré ball coordinates.

Parameters:	`X` (`Float[Tensor, 'n_points n_dim']`) – Input coordinates in the hyperboloid model.

Returns:	`poincare_coords`( `Float[Tensor, 'n_points n_dim-1']` ) – Coordinates in the Poincaré ball model.

Source code in manify/utils/visualization.py

def hyperboloid_to_poincare(X: Float[torch.Tensor, "n_points n_dim"]) -> Float[torch.Tensor, "n_points n_dim-1"]:
    """Convert hyperboloid coordinates to Poincaré ball coordinates.

    Args:
        X: Input coordinates in the hyperboloid model.

    Returns:
        poincare_coords: Coordinates in the Poincaré ball model.
    """
    # Spatial components: all columns except the first
    x_space = X[:, 1:]

    # Time-like component: first column, reshaped for broadcasting
    x_time = X[:, 0:1]

    # Convert to Poincaré ball coordinates
    poincare_coords = x_space / (1 + x_time)

    return poincare_coords
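
As a quick check of the projection formula, a point on the hyperboloid at geodesic distance `t` from the origin should land at radius `tanh(t / 2)` in the ball. Below is a plain-Python sketch for a single point, assuming the unit hyperboloid (curvature -1) with the time-like coordinate first; the helper name `to_poincare` is hypothetical and not part of the library.

```python
import math


def to_poincare(point):
    """Project one hyperboloid point (time-like coordinate first) into the ball."""
    x_time, x_space = point[0], point[1:]
    return [x / (1.0 + x_time) for x in x_space]


t = 1.5
# This point satisfies -x0^2 + x1^2 + x2^2 = -1, so it lies on the unit hyperboloid
point = [math.cosh(t), math.sinh(t), 0.0]
ball = to_poincare(point)

# Both values should agree up to floating-point error
print(ball[0], math.tanh(t / 2))
```

The hyperboloid origin `[1.0, 0.0, 0.0]` maps to the centre of the ball, and every other point maps strictly inside the unit ball.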

`spherical_to_polar(X)` ¶

Convert Cartesian coordinates of points on a sphere to angular (polar) coordinates. The radius is computed internally but dropped from the output.

Parameters:	`X` (`Float[Tensor, 'n_points n_dim']`) – Input coordinates in spherical form.

Returns:	`polar_coords`( `Float[Tensor, 'n_points n_dim-1']` ) – Coordinates in polar form.

Source code in manify/utils/visualization.py

def spherical_to_polar(X: Float[torch.Tensor, "n_points n_dim"]) -> Float[torch.Tensor, "n_points n_dim-1"]:
    """Convert spherical coordinates to polar coordinates.

    Args:
        X: Input coordinates in spherical form.

    Returns:
        polar_coords: Coordinates in polar form.
    """
    # Radius computation
    r = torch.norm(X, dim=1, keepdim=True)

    # Prepare output tensor
    out = torch.zeros_like(X)
    out[:, 0] = r.squeeze()  # Set the radius

    # Compute angles
    for i in range(1, X.size(1)):
        if i == X.size(1) - 1:
            # Last angle, use atan2 for full 360 degree
            out[:, i] = torch.atan2(X[:, i - 1], X[:, i - 2])
        else:
            # Compute angle from the higher dimension 'hypotenuse'
            hypotenuse = torch.norm(X[:, i:], dim=1, keepdim=True)
            # Prevent division by zero
            safe_hypotenuse = torch.where(hypotenuse > 0, hypotenuse, torch.tensor(1.0).to(X.device))
            # Ensure acos receives values within [-1, 1] and preserve dimensions
            angle = torch.acos(torch.clamp(X[:, i : i + 1] / safe_hypotenuse, -1, 1))
            out[:, i] = angle.squeeze()

    return out[:, 1:]

`S2_to_polar(X)` ¶

Convert S^2 (2-sphere) coordinates to polar coordinates.

Parameters:	`X` (`Float[Tensor, 'n_points 3']`) – Input coordinates on the 2-sphere.

Returns:	`polar_coords`( `Float[Tensor, 'n_points 2']` ) – Coordinates in polar form (elevation, azimuth), computed as (`acos(z)`, `atan2(y, x)`).

Source code in manify/utils/visualization.py

def S2_to_polar(X: Float[torch.Tensor, "n_points 3"]) -> Float[torch.Tensor, "n_points 2"]:
    """Convert S^2 (2-sphere) coordinates to polar coordinates.

    Args:
        X: Input coordinates on the 2-sphere.

    Returns:
        polar_coords: Coordinates in polar form (elevation, azimuth).
    """
    return torch.stack([torch.acos(X[:, 2]), torch.atan2(X[:, 1], X[:, 0])], dim=1)
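
For reference, the two returned angles are `acos(z)` and `atan2(y, x)`, so the first one is measured from the +z axis. The plain-Python sketch below applies the same formula to individual points; the helper `s2_point_to_polar` is hypothetical and only mirrors the expression above without tensors.

```python
import math


def s2_point_to_polar(p):
    """Return (acos(z), atan2(y, x)) for one point (x, y, z) on the unit sphere."""
    x, y, z = p
    return (math.acos(z), math.atan2(y, x))


north_pole = s2_point_to_polar((0.0, 0.0, 1.0))
on_x_axis = s2_point_to_polar((1.0, 0.0, 0.0))
on_y_axis = s2_point_to_polar((0.0, 1.0, 0.0))

print(north_pole, on_x_axis, on_y_axis)
```

The north pole gets a first angle of 0, while points on the equator get a first angle of π/2 and differ only in azimuth.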
