Utilities

manify.utils

Manify library utilities.

benchmarks

Implementation for benchmarking different product space machine learning methods.

benchmark(X, y, pm, device='cpu', score=None, models=None, max_depth=5, n_estimators=12, min_samples_split=2, min_samples_leaf=1, task='classification', seed=None, use_special_dims=False, n_features='d_choose_2', X_train=None, X_test=None, y_train=None, y_test=None, batch_size=None, adj=None, A_train=None, A_test=None, epochs=4000, lr=0.0001, kappa_gcn_layers=1)

Benchmarks various machine learning models on Riemannian manifold datasets.

Evaluates and compares different machine learning models on datasets with a product manifold structure, providing metrics for their performance.

Parameters:
  • X (Float[Tensor, 'batch dim']) –

    Tensor of input features with shape (batch, dim).

  • y (Real[Tensor, 'batch']) –

    Tensor of target labels with shape (batch,).

  • pm (ProductManifold) –

    ProductManifold object defining the geometric structure for benchmarks.

  • device (Literal['cpu', 'cuda', 'mps'], default: 'cpu' ) –

    Device for computation. Options: 'cpu', 'cuda', 'mps'. Defaults to 'cpu'.

  • score (list[SCORETYPE] | None, default: None ) –

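    List of scoring metrics for model evaluation. Classification and link prediction accept 'accuracy', 'f1-micro', 'f1-macro', and 'time'; regression accepts 'mse', 'rmse', 'percent_rmse', and 'time'. Defaults to None, which selects ['accuracy', 'f1-micro', 'f1-macro'].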

  • models (list[MODELTYPE] | None, default: None ) –

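    List of model names to evaluate. Options include:

      * "sklearn_dt": Decision tree from scikit-learn
      * "sklearn_rf": Random forest from scikit-learn
      * "product_dt": Product space decision tree
      * "product_rf": Product space random forest
      * "tangent_dt": Decision tree on tangent space
      * "tangent_rf": Random forest on tangent space
      * "knn": k-nearest neighbors
      * "ps_perceptron": Product space perceptron

    Defaults to None, which selects the full default list in the source. Beyond the options above, that list also names "svm", "ps_svm", "tangent_mlp", "ambient_mlp", "tangent_gcn", "ambient_gcn", "kappa_gcn", "ambient_mlr", "tangent_mlr", "kappa_mlr", and "single_manifold_rf".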

  • max_depth (int, default: 5 ) –

    Maximum depth of tree-based models. Defaults to 5.

  • n_estimators (int, default: 12 ) –

    Number of estimators for ensemble models. Defaults to 12.

  • min_samples_split (int, default: 2 ) –

    Minimum samples required to split an internal node. Defaults to 2.

  • min_samples_leaf (int, default: 1 ) –

    Minimum samples required in a leaf node. Defaults to 1.

  • task (TASKTYPE, default: 'classification' ) –

    Type of machine learning task. Options: 'classification' or 'regression'; the source also accepts 'link_prediction', which it scores with the classification metrics. Defaults to 'classification'.

  • seed (int | None, default: None ) –

    Random seed for reproducibility. In the source it is passed as random_state to the random forest models only. Defaults to None.

  • use_special_dims (bool, default: False ) –

    Whether to use special manifold dimensions. Defaults to False.

  • n_features (Literal['d', 'd_choose_2'], default: 'd_choose_2' ) –

    Feature dimensionality type. Options: 'd' or 'd_choose_2'. Defaults to 'd_choose_2'.

  • X_train (Float[Tensor, 'n_samples n_manifolds'] | None, default: None ) –

    Training feature tensor with shape (n_samples, n_manifolds). If provided, overrides split from X. Defaults to None.

  • X_test (Float[Tensor, 'n_samples n_manifolds'] | None, default: None ) –

    Testing feature tensor with shape (n_samples, n_manifolds). If provided, used with X_train. Defaults to None.

  • y_train (Real[Tensor, 'n_samples'] | None, default: None ) –

    Training labels tensor with shape (n_samples,). Must be provided if X_train is given. Defaults to None.

  • y_test (Real[Tensor, 'n_samples'] | None, default: None ) –

    Testing labels tensor with shape (n_samples,). Must be provided if X_test is given. Defaults to None.

  • batch_size (int | None, default: None ) –

    Batch size for neural network models. Defaults to None.

  • adj (Float[Tensor, 'n_nodes n_nodes'] | None, default: None ) –

    Adjacency matrix for graph-based models with shape (n_nodes, n_nodes). Defaults to None.

  • A_train (Float[Tensor, 'n_samples n_samples'] | None, default: None ) –

    Training adjacency matrix with shape (n_samples, n_samples). Defaults to None.

  • A_test (Float[Tensor, 'n_samples n_samples'] | None, default: None ) –

    Testing adjacency matrix with shape (n_samples, n_samples). Defaults to None.

  • hidden_dims –

    List of hidden layer dimensions for neural networks. Defaults to None. This entry comes from the docstring; the signature above has no hidden_dims argument.

  • epochs (int, default: 4000 ) –

    Number of training epochs for iterative models. Defaults to 4000.

  • lr (float, default: 0.0001 ) –

    Learning rate for gradient-based optimization. Defaults to 1e-4.

  • kappa_gcn_layers (int, default: 1 ) –

    Number of layers in GCN models. Defaults to 1.
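The source checks that every requested metric fits the chosen task before any model is trained. A minimal standalone sketch of that check is given here. The helper name validate_scores is hypothetical and not part of Manify, and it raises ValueError throughout, whereas the source uses assert for the metric check.

```python
# Hypothetical helper mirroring the (task, score) check in benchmark();
# it is an illustration, not part of the Manify API.
CLASSIFICATION_SCORES = {"accuracy", "f1-micro", "f1-macro", "time"}
REGRESSION_SCORES = {"mse", "rmse", "percent_rmse", "time"}


def validate_scores(task: str, score: list[str]) -> None:
    """Raise ValueError if the task is unknown or a metric does not fit it."""
    if task in {"classification", "link_prediction"}:
        allowed = CLASSIFICATION_SCORES
    elif task == "regression":
        allowed = REGRESSION_SCORES
    else:
        raise ValueError(f"Unknown task: {task}")

    # Collect every metric that is not valid for this task.
    invalid = [s for s in score if s not in allowed]
    if invalid:
        raise ValueError(f"Metrics {invalid} are not valid for task '{task}'")


validate_scores("classification", ["accuracy", "f1-macro"])  # passes silently
validate_scores("regression", ["rmse", "time"])  # passes silently
```

Raising an exception instead of asserting keeps the check active when Python runs with optimizations enabled, which is one reason a caller might prefer this variant.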

Returns:
  • dict[str, float]

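    Dictionary mapping '{model}_{metric}' keys (for example, 'sklearn_dt_accuracy') to the corresponding score. Each evaluated model also reports its timing in seconds under '{model}_time'.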
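The final return statement in the source flattens per-model results into keys of the form {model}_{metric}. The sketch below reproduces that flattening and shows one possible way to regroup the keys afterwards. The names flatten_scores and regroup_scores are hypothetical, and the numbers are made up for illustration; they are not real benchmark results.

```python
# Illustrative sketch; flatten_scores and regroup_scores are not Manify functions.
METRICS = ("accuracy", "f1-micro", "f1-macro", "mse", "rmse", "percent_rmse", "time")


def flatten_scores(nested: dict) -> dict:
    """Turn {model: {metric: value}} into {"model_metric": value}."""
    return {
        f"{model}_{metric}": value
        for model, metrics in nested.items()
        for metric, value in metrics.items()
    }


def regroup_scores(flat: dict) -> dict:
    """Invert flatten_scores by matching known metric suffixes."""
    nested: dict = {}
    for key, value in flat.items():
        # Try longer metric names first so "percent_rmse" is not read as "rmse".
        for metric in sorted(METRICS, key=len, reverse=True):
            suffix = f"_{metric}"
            if key.endswith(suffix):
                model = key[: -len(suffix)]
                nested.setdefault(model, {})[metric] = value
                break
    return nested


# Made-up values, for illustration only.
example = {"sklearn_dt": {"accuracy": 0.9, "time": 0.01}, "knn": {"percent_rmse": 0.2}}
flat = flatten_scores(example)
print(sorted(flat))  # ['knn_percent_rmse', 'sklearn_dt_accuracy', 'sklearn_dt_time']
```

Splitting on the last underscore would not be enough here, because both model names and the metric percent_rmse contain underscores.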

Source code in manify/utils/benchmarks.py
def benchmark(
    X: Float[torch.Tensor, "batch dim"],
    y: Real[torch.Tensor, "batch"],
    pm: ProductManifold,
    device: Literal["cpu", "cuda", "mps"] = "cpu",
    score: list[SCORETYPE] | None = None,
    models: list[MODELTYPE] | None = None,
    max_depth: int = 5,
    n_estimators: int = 12,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1,
    task: TASKTYPE = "classification",
    seed: int | None = None,
    use_special_dims: bool = False,
    n_features: Literal["d", "d_choose_2"] = "d_choose_2",
    X_train: Float[torch.Tensor, "n_samples n_manifolds"] | None = None,
    X_test: Float[torch.Tensor, "n_samples n_manifolds"] | None = None,
    y_train: Real[torch.Tensor, "n_samples"] | None = None,
    y_test: Real[torch.Tensor, "n_samples"] | None = None,
    batch_size: int | None = None,
    adj: Float[torch.Tensor, "n_nodes n_nodes"] | None = None,
    A_train: Float[torch.Tensor, "n_samples n_samples"] | None = None,
    A_test: Float[torch.Tensor, "n_samples n_samples"] | None = None,
    epochs: int = 4_000,
    lr: float = 1e-4,
    kappa_gcn_layers: int = 1,
) -> dict[str, float]:
    """Benchmarks various machine learning models on Riemannian manifold datasets.

    Evaluates and compares different machine learning models on datasets with a
    product manifold structure, providing metrics for their performance.

    Args:
        X: Tensor of input features with shape (batch, dim).
        y: Tensor of target labels with shape (batch,).
        pm: ProductManifold object defining the geometric structure for benchmarks.
        device: Device for computation. Options: 'cpu', 'cuda', 'mps'. Defaults to 'cpu'.
        score: List of scoring metrics for model evaluation (e.g., 'accuracy', 'f1-micro').
            Defaults to None.
        models: List of model names to evaluate. Options include:
            * "sklearn_dt": Decision tree from scikit-learn
            * "sklearn_rf": Random forest from scikit-learn
            * "product_dt": Product space decision tree
            * "product_rf": Product space random forest
            * "tangent_dt": Decision tree on tangent space
            * "tangent_rf": Random forest on tangent space
            * "knn": k-nearest neighbors
            * "ps_perceptron": Product space perceptron
            Defaults to None.
        max_depth: Maximum depth of tree-based models. Defaults to 5.
        n_estimators: Number of estimators for ensemble models. Defaults to 12.
        min_samples_split: Minimum samples required to split an internal node. Defaults to 2.
        min_samples_leaf: Minimum samples required in a leaf node. Defaults to 1.
        task: Type of machine learning task. Options: 'classification' or 'regression'.
            Defaults to 'classification'.
        seed: Random seed for reproducibility. Defaults to None.
        use_special_dims: Whether to use special manifold dimensions. Defaults to False.
        n_features: Feature dimensionality type. Options: 'd' or 'd_choose_2'.
            Defaults to 'd_choose_2'.
        X_train: Training feature tensor with shape (n_samples, n_manifolds).
            If provided, overrides split from X. Defaults to None.
        X_test: Testing feature tensor with shape (n_samples, n_manifolds).
            If provided, used with X_train. Defaults to None.
        y_train: Training labels tensor with shape (n_samples,).
            Must be provided if X_train is given. Defaults to None.
        y_test: Testing labels tensor with shape (n_samples,).
            Must be provided if X_test is given. Defaults to None.
        batch_size: Batch size for neural network models. Defaults to None.
        adj: Adjacency matrix for graph-based models with shape (n_nodes, n_nodes).
            Defaults to None.
        A_train: Training adjacency matrix with shape (n_samples, n_samples).
            Defaults to None.
        A_test: Testing adjacency matrix with shape (n_samples, n_samples).
            Defaults to None.
        hidden_dims: List of hidden layer dimensions for neural networks.
            Defaults to None.
        epochs: Number of training epochs for iterative models. Defaults to 4000.
        lr: Learning rate for gradient-based optimization. Defaults to 1e-4.
        kappa_gcn_layers: Number of layers in GCN models. Defaults to 1.

    Returns:
        Dictionary mapping model names to their corresponding evaluation scores.
    """
    score = score or ["accuracy", "f1-micro", "f1-macro"]
    models = models or [
        "sklearn_dt",
        "sklearn_rf",
        "product_dt",
        "product_rf",
        "tangent_dt",
        "tangent_rf",
        "knn",
        "ps_perceptron",
        "svm",
        "ps_svm",
        "tangent_mlp",
        "ambient_mlp",
        "tangent_gcn",
        "ambient_gcn",
        "kappa_gcn",
        "ambient_mlr",
        "tangent_mlr",
        "kappa_mlr",
        "single_manifold_rf",
    ]

    # Input validation on (task, score) pairing
    if task in {"classification", "link_prediction"}:
        assert all(s in {"accuracy", "f1-micro", "f1-macro", "time"} for s in score)
    elif task == "regression":
        assert all(s in {"mse", "rmse", "percent_rmse", "time"} for s in score)
    else:
        raise ValueError(f"Unknown task: {task}")

    # Make sure we're on the right device
    pm = pm.to(device)

    # Split data
    if X_train is not None and X_test is not None and y_train is not None and y_test is not None:
        # Coerce to tensor as needed
        if not torch.is_tensor(X_train):
            X_train = torch.tensor(X_train)
        if not torch.is_tensor(X_test):
            X_test = torch.tensor(X_test)
        if not torch.is_tensor(y_train):
            y_train = torch.tensor(y_train)
        if not torch.is_tensor(y_test):
            y_test = torch.tensor(y_test)

        # Move to device
        X_train = X_train.to(device)
        X_test = X_test.to(device)
        y_train = y_train.to(device)
        y_test = y_test.to(device)

        # Get X and y
        X = torch.cat([X_train, X_test])
        y = torch.cat([y_train, y_test])
        train_idx = np.arange(len(X_train))
        test_idx = np.arange(len(X_train), len(X))

    else:
        # Coerce to tensor as needed
        if not torch.is_tensor(X):
            X = torch.tensor(X)
        if not torch.is_tensor(y):
            y = torch.tensor(y)

        X = X.to(device)
        y = y.to(device)

        X_train, X_test, y_train, y_test, train_idx, test_idx = train_test_split(X, y, np.arange(len(X)), test_size=0.2)

    # Make sure classification labels are formatted correctly
    if task in {"classification", "link_prediction"}:
        y = torch.unique(y, return_inverse=True)[1]
        y_train = y[train_idx]
        y_test = y[test_idx]

    # Make sure everything is detached
    X, X_train, X_test = X.detach(), X_train.detach(), X_test.detach()
    y, y_train, y_test = y.detach(), y_train.detach(), y_test.detach()

    # Get pdists
    pdists = pm.pdist(X).detach()

    # Get tangent plane
    X_train_tangent = pm.logmap(X_train).detach()
    X_test_tangent = pm.logmap(X_test).detach()

    # Get numpy versions
    X_train_np, X_test_np = X_train.detach().cpu().numpy(), X_test.detach().cpu().numpy()
    y_train_np, y_test_np = y_train.detach().cpu().numpy(), y_test.detach().cpu().numpy()
    X_train_tangent_np, X_test_tangent_np = X_train_tangent.cpu().numpy(), X_test_tangent.cpu().numpy()

    # Get stereographic version
    pm_stereo, X_train_stereo, X_test_stereo = pm.stereographic(X_train, X_test)
    assert isinstance(X_train_stereo, torch.Tensor)
    X_train_stereo = X_train_stereo.detach()
    assert isinstance(X_test_stereo, torch.Tensor)
    X_test_stereo = X_test_stereo.detach()

    # Also build a Euclidean "product manifold": one zero-curvature component spanning all ambient coordinates
    pm_euc = ProductManifold(signature=[(0.0, X.shape[1])], device=device, stereographic=True)

    # Get A_hat
    if adj is not None:
        A_hat = get_A_hat(adj).detach()
    else:
        dists = pdists**2
        dists_train = dists[train_idx][:, train_idx]
        dists /= dists_train[torch.isfinite(dists_train)].max()
        A_hat = get_A_hat(dists).detach()
    A_hat = A_hat.to(device)

    if A_train is None and A_test is None:
        A_train = A_hat[train_idx][:, train_idx].detach()
        A_test = A_hat[test_idx][:, test_idx].detach()
    else:
        assert A_train is not None
        assert A_test is not None
        A_train = A_train.to(device).detach()
        A_test = A_test.to(device).detach()

    # Aggregate arguments
    tree_kwargs = {"max_depth": max_depth, "min_samples_leaf": min_samples_leaf, "min_samples_split": min_samples_split}
    prod_kwargs = {"use_special_dims": use_special_dims, "n_features": n_features, "batch_size": batch_size}
    rf_kwargs = {"n_estimators": n_estimators, "n_jobs": -1, "random_state": seed}
    nn_outdim = 1 if task == "regression" else len(torch.unique(y))
    nn_kwargs = {"task": task, "output_dim": nn_outdim}
    nn_train_kwargs = {"epochs": epochs, "lr": lr}

    # Define your models
    if task in {"classification", "link_prediction"}:
        dt_class = DecisionTreeClassifier
        rf_class = RandomForestClassifier
        knn_class = KNeighborsClassifier
        svm_class = SVC

    else:  # task == "regression"
        dt_class = DecisionTreeRegressor
        rf_class = RandomForestRegressor
        knn_class = KNeighborsRegressor
        svm_class = SVR

    # Evaluate sklearn
    accs: dict[MODELTYPE, dict[SCORETYPE, float]] = {}
    if "sklearn_dt" in models:
        dt = dt_class(**tree_kwargs)
        t1 = time.time()
        dt.fit(X_train_np, y_train_np)
        t2 = time.time()
        accs["sklearn_dt"] = _score(X_test_np, y_test_np, dt, use_torch=False, score=score)
        accs["sklearn_dt"]["time"] = t2 - t1

    if "sklearn_rf" in models:
        rf = rf_class(**tree_kwargs, **rf_kwargs)
        t1 = time.time()
        rf.fit(X_train_np, y_train_np)
        t2 = time.time()
        accs["sklearn_rf"] = _score(X_test_np, y_test_np, rf, use_torch=False, score=score)
        accs["sklearn_rf"]["time"] = t2 - t1

    if "product_dt" in models:
        psdt = ProductSpaceDT(pm=pm, task=task, **tree_kwargs, **prod_kwargs)  # type: ignore
        t1 = time.time()
        psdt.fit(X_train, y_train)
        t2 = time.time()
        accs["product_dt"] = _score(X_test, y_test_np, psdt, use_torch=True, score=score)
        accs["product_dt"]["time"] = t2 - t1

    if "product_rf" in models:
        psrf = ProductSpaceRF(pm=pm, task=task, **tree_kwargs, **rf_kwargs, **prod_kwargs)  # type: ignore
        t1 = time.time()
        psrf.fit(X_train, y_train)
        t2 = time.time()
        accs["product_rf"] = _score(X_test, y_test_np, psrf, use_torch=True, score=score)
        accs["product_rf"]["time"] = t2 - t1

    # if "single_manifold_rf" in models:
    #     smrf = SingleManifoldEnsembleRF(pm=pm, task=task, n_estimators=n_estimators)
    #     t1 = time.time()
    #     smrf.fit(X_train, y_train)
    #     t2 = time.time()
    #     accs["single_manifold_rf"] = _score(X_test, y_test_np, smrf, torch=True, score=score)
    #     accs["single_manifold_rf"]["time"] = t2 - t1

    if "tangent_dt" in models:
        tdt = dt_class(**tree_kwargs)
        t1 = time.time()
        tdt.fit(X_train_tangent_np, y_train_np)
        t2 = time.time()
        accs["tangent_dt"] = _score(X_test_tangent_np, y_test_np, tdt, use_torch=False, score=score)
        accs["tangent_dt"]["time"] = t2 - t1

    if "tangent_rf" in models:
        trf = rf_class(**tree_kwargs, **rf_kwargs)
        t1 = time.time()
        trf.fit(X_train_tangent_np, y_train_np)
        t2 = time.time()
        accs["tangent_rf"] = _score(X_test_tangent_np, y_test_np, trf, use_torch=False, score=score)
        accs["tangent_rf"]["time"] = t2 - t1

    if "knn" in models:
        # Get dists - max imputation is a workaround for some nan values we occasionally get
        t1 = time.time()
        train_dists = pm.pdist(X_train)
        train_dists = torch.nan_to_num(train_dists, nan=train_dists[~train_dists.isnan()].max().item())
        train_test_dists = pm.dist(X_test, X_train)
        train_test_dists = torch.nan_to_num(
            train_test_dists,
            nan=train_test_dists[~train_test_dists.isnan()].max().item(),
        )

        # Convert to numpy
        train_dists = train_dists.detach().cpu().numpy()
        train_test_dists = train_test_dists.detach().cpu().numpy()

        # Train classifier on distances
        knn = knn_class(metric="precomputed")
        t2 = time.time()
        knn.fit(train_dists, y_train_np)
        t3 = time.time()
        accs["knn"] = _score(train_test_dists, y_test_np, knn, use_torch=False, score=score)
        accs["knn"]["time"] = t3 - t1

    # if "perceptron" in models:
    #     loss = "perceptron" if task == "classification" else "squared_error"
    #     ptron = perceptron_class(
    #         loss=loss,
    #         learning_rate="constant",
    #         fit_intercept=False,
    #         eta0=1.0,
    #         max_iter=10_000,
    #     )  # fit_intercept must be false for ambient coordinates
    #     t1 = time.time()
    #     ptron.fit(X_train_np, y_train_np)
    #     t2 = time.time()
    #     accs["perceptron"] = _score(X_test_np, y_test_np, ptron, torch=False, score=score)
    #     accs["perceptron"]["time"] = t2 - t1

    if "ps_perceptron" in models:
        if task == "classification":
            ps_per = ProductSpacePerceptron(pm=pm)
            t1 = time.time()
            ps_per.fit(X_train, y_train)
            t2 = time.time()
            accs["ps_perceptron"] = _score(X_test, y_test_np, ps_per, use_torch=True, score=score)
            accs["ps_perceptron"]["time"] = t2 - t1
        else:
            warnings.warn("Product Space Perceptron is only implemented for classification tasks.", stacklevel=2)

    if "svm" in models:
        # Get inner products for precomputed kernel matrix
        t1 = time.time()
        train_ips = pm.manifold.component_inner(X_train[:, None], X_train[None, :]).sum(dim=-1)
        train_test_ips = pm.manifold.component_inner(X_test[:, None], X_train[None, :]).sum(dim=-1)

        # Convert to numpy
        train_ips = train_ips.detach().cpu().numpy()
        train_test_ips = train_test_ips.detach().cpu().numpy()

        # Train SVM on precomputed inner products
        svm = svm_class(kernel="precomputed", max_iter=10_000)
        # Need max_iter because it can hang. It can be large, since this doesn't happen often.
        t2 = time.time()
        svm.fit(train_ips, y_train_np)
        t3 = time.time()
        accs["svm"] = _score(train_test_ips, y_test_np, svm, use_torch=False, score=score)
        accs["svm"]["time"] = t3 - t1

    if "ps_svm" in models:
        ps_svm = ProductSpaceSVM(pm=pm, task=task, h_constraints=False, e_constraints=False)  # type: ignore
        t1 = time.time()
        ps_svm.fit(X_train, y_train)
        t2 = time.time()
        accs["ps_svm"] = _score(X_test, y_test_np, ps_svm, use_torch=False, score=score)
        accs["ps_svm"]["time"] = t2 - t1

    if "kappa_mlp" in models:
        assert isinstance(X_test_stereo, torch.Tensor)
        kappa_mlp = KappaGCN(
            pm=pm_stereo,
            num_hidden=kappa_gcn_layers,
            task=task,
            output_dim=nn_outdim,  # type: ignore
        ).to(device)
        t1 = time.time()
        if task == "link_prediction":
            kappa_mlp.fit(X_train_stereo, y_train, A=A_train, tqdm_prefix="kappa_mlp", **nn_train_kwargs)
        else:
            kappa_mlp.fit(X_train_stereo, y_train, A=None, tqdm_prefix="kappa_mlp", **nn_train_kwargs)
        t2 = time.time()
        y_pred = kappa_mlp.predict(X_test_stereo, A=None)
        accs["kappa_mlp"] = _score(None, y_test_np, kappa_mlp, y_pred_override=y_pred, use_torch=True, score=score)
        accs["kappa_mlp"]["time"] = t2 - t1

    if "ambient_mlp" in models:
        ambient_mlp = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        ambient_mlp.fit(X_train, y_train, A=None, tqdm_prefix="ambient_mlp", **nn_train_kwargs)
        t2 = time.time()
        y_pred = ambient_mlp.predict(X_test, A=None)
        accs["ambient_mlp"] = _score(None, y_test_np, ambient_mlp, y_pred_override=y_pred, use_torch=True, score=score)
        accs["ambient_mlp"]["time"] = t2 - t1

    if "tangent_mlp" in models:
        tangent_mlp = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        tangent_mlp.fit(X_train_tangent, y_train, A=None, tqdm_prefix="tangent_mlp", **nn_train_kwargs)
        t2 = time.time()
        y_pred = tangent_mlp.predict(X_test_tangent, A=None)
        accs["tangent_mlp"] = _score(None, y_test_np, tangent_mlp, y_pred_override=y_pred, use_torch=True, score=score)
        accs["tangent_mlp"]["time"] = t2 - t1

    if "ambient_gcn" in models:
        ambient_gcn = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        ambient_gcn.fit(X_train, y_train, A=A_train, **nn_train_kwargs)
        t2 = time.time()
        y_pred = ambient_gcn.predict(X_test, A=A_test)
        accs["ambient_gcn"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["ambient_gcn"]["time"] = t2 - t1

    if "tangent_gcn" in models:
        tangent_gcn = KappaGCN(pm=pm_euc, num_hidden=kappa_gcn_layers, **nn_kwargs).to(device)  # type: ignore
        t1 = time.time()
        tangent_gcn.fit(X_train_tangent, y_train, A=A_train, tqdm_prefix="tangent_gcn", **nn_train_kwargs)
        t2 = time.time()
        y_pred = tangent_gcn.predict(X_test_tangent, A=A_test)
        accs["tangent_gcn"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["tangent_gcn"]["time"] = t2 - t1

    if "kappa_gcn" in models:
        assert isinstance(X_test_stereo, torch.Tensor)
        kappa_gcn = KappaGCN(pm=pm_stereo, num_hidden=kappa_gcn_layers, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        kappa_gcn.fit(X_train_stereo, y_train, A=A_train, tqdm_prefix="kappa_gcn", **nn_train_kwargs)
        t2 = time.time()
        y_pred = kappa_gcn.predict(X_test_stereo, A=A_test)
        accs["kappa_gcn"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["kappa_gcn"]["time"] = t2 - t1

    if "kappa_mlr" in models:
        kappa_mlr = KappaGCN(pm=pm_stereo, num_hidden=0, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        kappa_mlr.fit(X_train_stereo, y_train, A=None, tqdm_prefix="kappa_mlr", **nn_train_kwargs)
        t2 = time.time()
        y_pred = kappa_mlr.predict(X_test_stereo, A=None)
        accs["kappa_mlr"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["kappa_mlr"]["time"] = t2 - t1

    if "tangent_mlr" in models:
        tangent_mlr = KappaGCN(pm=pm_euc, num_hidden=0, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        tangent_mlr.fit(X_train_tangent, y_train, A=None, tqdm_prefix="tangent_mlr", **nn_train_kwargs)
        t2 = time.time()
        y_pred = tangent_mlr.predict(X_test_tangent, A=None)
        accs["tangent_mlr"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["tangent_mlr"]["time"] = t2 - t1

    if "ambient_mlr" in models:
        ambient_mlr = KappaGCN(pm=pm_euc, num_hidden=0, task=task, output_dim=nn_outdim).to(device)  # type: ignore
        t1 = time.time()
        ambient_mlr.fit(X_train, y_train, A=None, tqdm_prefix="ambient_mlr", **nn_train_kwargs)
        t2 = time.time()
        y_pred = ambient_mlr.predict(X_test, A=None)
        accs["ambient_mlr"] = _score(None, y_test_np, None, y_pred_override=y_pred, use_torch=True, score=score)
        accs["ambient_mlr"]["time"] = t2 - t1

    # return accs
    return {
        **{
            f"{model}_{metric}": value
            for model, metrics in accs.items()
            if isinstance(metrics, dict)
            for metric, value in metrics.items()
        },
        **{k: v for k, v in accs.items() if not isinstance(v, dict)},  # type: ignore
    }
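For classification and link prediction, the source remaps labels to consecutive integers with torch.unique(y, return_inverse=True)[1], which is why the number of unique labels can serve as the network output dimension. A dependency-free sketch of the same remapping follows; reindex_labels is a hypothetical name, and plain lists stand in for tensors.

```python
# Illustrative, pure-Python equivalent of taking the inverse indices
# from torch.unique(y, return_inverse=True); not part of Manify.
def reindex_labels(labels: list) -> list:
    """Map arbitrary label values to 0..k-1, ordered by sorted label value."""
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    return [index_of[label] for label in labels]


print(reindex_labels([10, 3, 3, 7, 10]))  # [2, 0, 0, 1, 2]
```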

dataloaders

Dataloaders Submodule.

The dataloaders module allows users to load datasets from Manify's datasets repo on Hugging Face.

We provide a summary of the data types available, and their original sources, here.

Earlier versions of Manify included scripts to process raw data, which we have replaced with a single, centralized Hugging Face repo and the function load_hf. For transparency, we have preserved the data generation code in the Dataset-Generation branch of Manify.

Dataset | Task | Distance Matrix | Features | Labels | Adjacency Matrix | Source/Citation
------- | ---- | --------------- | -------- | ------ | ---------------- | ---------------
cities | none | | | | | Network Repository: Cities
cs_phds | regression | | | | | Network Repository: CS PhDs
polblogs | classification | | | | | Network Repository: Polblogs
polbooks | classification | | | | | Network Repository: Polbooks
cora | classification | | | | | Network Repository: Cora
citeseer | classification | | | | | Network Repository: Citeseer
karate_club | none | | | | | Network Repository: Karate
lesmis | none | | | | | Network Repository: Lesmis
adjnoun | none | | | | | Network Repository: Adjnoun
football | none | | | | | Network Repository: Football
dolphins | none | | | | | Network Repository: Dolphins
blood_cells | classification | | | | | Datasets from Zheng et al. (2017), "Massively parallel digital transcriptional profiling of single cells": CD8+ Cytotoxic T-cells; CD8+/CD45RA+ Naive Cytotoxic T Cells; CD56+ Natural Killer Cells; CD4+ Helper T Cells; CD4+/CD45RO+ Memory T Cells; CD4+/CD45RA+/CD25- Naive T Cells; CD4+/CD25+ Regulatory T Cells; CD34+ Cells; CD19+ B Cells; CD14+ Monocytes
lymphoma | classification | | | | | Datasets from 10x Genomics: Hodgkin's Lymphoma; Healthy Donor PBMCs
cifar_100 | classification | | | | | Hugging Face Datasets: CIFAR-100
mnist | classification | | | | | Hugging Face Datasets: MNIST
temperature | regression | | | | | [Citation]
landmasses | classification | | | | | Generated using basemap.is_land
neuron_33 | classification | | | | | Allen Brain Atlas
neuron_46 | classification | | | | | Allen Brain Atlas
traffic | regression | | | | | Kaggle: Traffic Prediction Dataset
qiita | none | | | | | NeuroSEED Git Repo

load_hf(name, namespace='manify')

Load a dataset from the Hugging Face Hub repository {namespace}/{name}.

Returns:
  • features( Float[Tensor, 'n_points ...'] | None ) –

    The features for each node, if any

  • dists( Float[Tensor, 'n_points n_points'] | None ) –

    The pairwise distance matrix over all nodes, if any

  • adj( Float[Tensor, 'n_points n_points'] | None ) –

    The adjacency matrix over all nodes, if any

  • labels( Real[Tensor, 'n_points'] | None ) –

    The (classification or regression) labels for each node, if any
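Any of the four returned values can be None, and classification labels take precedence over regression labels when both are present. The sketch below mirrors that parsing logic on a plain dictionary, with lists standing in for tensors. The function parse_row and the example row are hypothetical; only the key names are taken from the source.

```python
# Illustrative sketch of the row-parsing logic; parse_row is not a Manify function.
def parse_row(row: dict) -> tuple:
    """Return (features, dists, adj, labels), using None for empty or missing fields."""

    def non_empty(key: str):
        vals = row.get(key, [])
        return vals if vals else None

    cls_labels = row.get("classification_labels", [])
    reg_labels = row.get("regression_labels", [])
    if cls_labels:
        labels = [int(v) for v in cls_labels]  # integer class labels
    elif reg_labels:
        labels = [float(v) for v in reg_labels]  # floating-point regression targets
    else:
        labels = None

    return non_empty("features"), non_empty("distances"), non_empty("adjacency"), labels


# Made-up row: a two-node graph with class labels and no features or distances.
row = {"adjacency": [[0, 1], [1, 0]], "features": [], "classification_labels": [0, 1]}
features, dists, adj, labels = parse_row(row)
print(features, dists, adj, labels)  # None None [[0, 1], [1, 0]] [0, 1]
```

Callers should therefore check each returned value for None before using it, for example before passing a distance matrix to an embedding routine.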

Source code in manify/utils/dataloaders.py
def load_hf(
    name: str, namespace: str = "manify"
) -> tuple[
    Float[torch.Tensor, "n_points ..."] | None,  # features
    Float[torch.Tensor, "n_points n_points"] | None,  # pairwise dists
    Float[torch.Tensor, "n_points n_points"] | None,  # adjacency matrix
    Real[torch.Tensor, "n_points"] | None,  # labels
]:
    """Load a dataset from HuggingFace Hub at {namespace}/{name}.

    Returns:
        features: The features for each node, if any
        dists: The pairwise distance matrix over all nodes, if any
        adj: The adjacency matrix over all nodes, if any
        labels: The (classification or regression) labels for each node, if any
    """
    # 1) fetch the single‑row dataset
    ds = load_dataset(f"{namespace}/{name}")
    data = ds.get("train", ds)  # use "train" split if available, else the only split
    row = data[0]

    # 2) helper to turn lists → torch (or None)
    def to_tensor(key: str, dtype: torch.dtype) -> torch.Tensor | None:
        vals = row.get(key, [])
        if not vals:
            return None
        return torch.tensor(vals, dtype=dtype)

    # 3) reconstruct everything
    dists = to_tensor("distances", torch.float32)
    feats = to_tensor("features", torch.float32)
    adj = to_tensor("adjacency", torch.float32)

    cls_ls = row.get("classification_labels", [])
    reg_ls = row.get("regression_labels", [])
    if cls_ls:
        labels = torch.tensor(cls_ls, dtype=torch.int64)
    elif reg_ls:
        labels = torch.tensor(reg_ls, dtype=torch.float32)
    else:
        labels = None

    return feats, dists, adj, labels

link_prediction

Preprocessing datasets for link prediction.

make_link_prediction_dataset(X_embed, pm, adj, add_dists=True)

Preprocess a graph link prediction task into a binary classification problem on a new product manifold.

This function constructs a dataset for link prediction by creating pairwise embeddings from the input node embeddings, optionally appending pairwise distances, and returning labels from an adjacency matrix. It also updates the manifold signature accordingly.

Parameters:
  • X_embed (Float[Tensor, 'batch n_dim']) –

    Node embeddings.

  • pm (ProductManifold) –

    The manifold on which the embeddings lie.

  • adj (Float[Tensor, 'batch batch']) –

    A binary adjacency matrix indicating edges between nodes.

  • add_dists (bool, default: True ) –

    If True, appends pairwise distances to the feature vectors. Default is True.

Returns:
  • X( Float[Tensor, 'batch**2 n_dim*2'] ) –

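    Node-pair embeddings in \(\mathcal{M} \times \mathcal{M}\); row i * batch + j concatenates the embeddings of nodes i and j, followed by their distance when add_dists is True.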

  • y( Int[Tensor, 'batch**2'] ) –

    Edge labels derived from the adjacency matrix.

  • new_pm( ProductManifold ) –

    A new instance of ProductManifold with an updated signature reflecting the feature space \(\mathcal{M} \times \mathcal{M}\).

Source code in manify/utils/link_prediction.py
def make_link_prediction_dataset(
    X_embed: Float[torch.Tensor, "batch n_dim"],
    pm: ProductManifold,
    adj: Float[torch.Tensor, "batch batch"],
    add_dists: bool = True,
) -> tuple[Float[torch.Tensor, "batch**2 n_dim*2"], Int[torch.Tensor, "batch**2"], ProductManifold]:
    r"""Preprocess a graph link prediction task into a binary classification problem on a new product manifold.

    This function constructs a dataset for link prediction by creating pairwise embeddings from the input node
    embeddings, optionally appending pairwise distances, and returning labels from an adjacency matrix. It also updates
    the manifold signature correspondingly.

    Args:
        X_embed: Node embeddings.
        pm: The manifold on which the embeddings lie.
        adj: A binary adjacency matrix indicating edges between nodes.
        add_dists: If True, appends pairwise distances to the feature vectors. Default is True.

    Returns:
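        X: Node-pair embeddings in $\mathcal{M} \times \mathcal{M}$; row `i * batch + j` concatenates the embeddings
            of nodes `i` and `j`, followed by their distance when `add_dists` is True.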
        y: Edge labels derived from the adjacency matrix.
        new_pm: A new instance of `ProductManifold` with an updated signature reflecting the feature space
            $\mathcal{M} \times \mathcal{M}$.

    """
    # Stack embeddings
    X = torch.stack([torch.cat([X_i, X_j]) for X_i in X_embed for X_j in X_embed])

    # Add distances
    if add_dists:
        dists = pm.pdist(X_embed)
        X = torch.cat([X, dists.flatten().unsqueeze(1)], dim=1)

    y = adj.flatten()

    # Binarize y
    y = (y > 0).long()

    # Make a new signature
    new_sig = pm.signature + pm.signature
    if add_dists:
        new_sig.append((0.0, 1))
    new_pm = ProductManifold(signature=new_sig)

    return X, y, new_pm
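
To make the row layout concrete, here is a minimal sketch of the same pair construction in plain PyTorch. It is not the library's code path: `torch.cdist` (Euclidean distance) stands in for `pm.pdist` as an assumption, and the toy embeddings and adjacency matrix are made up.

```python
import torch

torch.manual_seed(0)
n_nodes, n_dim = 4, 3
X_embed = torch.randn(n_nodes, n_dim)  # toy node embeddings
adj = torch.tensor(
    [
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
)

# Row i * n_nodes + j holds the concatenation [X_i, X_j].
X_pairs = torch.cat(
    [X_embed.repeat_interleave(n_nodes, dim=0), X_embed.repeat(n_nodes, 1)],
    dim=1,
)

# Assumption: Euclidean distances as a stand-in for pm.pdist.
dists = torch.cdist(X_embed, X_embed)
X_pairs = torch.cat([X_pairs, dists.flatten().unsqueeze(1)], dim=1)

# Binary edge labels in the same row order.
y = (adj.flatten() > 0).long()
```

With 4 nodes of dimension 3, `X_pairs` has 16 rows and 7 columns: two copies of the embedding plus one distance column, which is why the new signature gains a one-dimensional Euclidean component when `add_dists` is True.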

split_link_prediction_dataset(X, y, test_size=0.2, downsample=None, random_state=None, **kwargs)

Split a link prediction dataset into train and test sets.

Parameters:
  • X (Float[Tensor, 'n_pairs n_dims']) –

    Node-pair embeddings of shape (n_nodes^2, n_dims).

  • y (Int[Tensor, 'n_pairs']) –

    Edge labels of shape (n_nodes^2,).

  • test_size (float, default: 0.2 ) –

    Proportion of nodes (not pairs) to hold out for the test set. Pairs with one endpoint in each split are discarded.

  • downsample (int | None, default: None ) –

    If provided, keep at most this many positive and this many negative pairs; the features and labels of all other pairs are zeroed out before the split.

  • random_state (int | None, default: None ) –

    Random seed for reproducibility.

  • **kwargs (Any, default: {} ) –

    Additional arguments for train_test_split.

Returns:
  • tuple[Float[Tensor, '... n_dims'], Float[Tensor, '... n_dims'], Int[Tensor, '...'], Int[Tensor, '...'], Int[Tensor, '...'], Int[Tensor, '...']]

    Tuple of (X_train, X_test, y_train, y_test, idx_train, idx_test).

Source code in manify/utils/link_prediction.py
def split_link_prediction_dataset(
    X: Float[torch.Tensor, "n_pairs n_dims"],
    y: Int[torch.Tensor, "n_pairs"],
    test_size: float = 0.2,
    downsample: int | None = None,
    random_state: int | None = None,
    **kwargs: Any,
) -> tuple[
    Float[torch.Tensor, "... n_dims"],
    Float[torch.Tensor, "... n_dims"],
    Int[torch.Tensor, "..."],
    Int[torch.Tensor, "..."],
    Int[torch.Tensor, "..."],
    Int[torch.Tensor, "..."],
]:
    """Split a link prediction dataset into train and test sets.

    Args:
        X: Node-pair embeddings of shape (n_nodes^2, n_dims).
        y: Edge labels of shape (n_nodes^2,).
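        test_size: Proportion of nodes (not pairs) to hold out for the test set. Pairs with one endpoint in each
            split are discarded.
        downsample: If provided, keep at most this many positive and this many negative pairs; the features and
            labels of all other pairs are zeroed out before the split.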
        random_state: Random seed for reproducibility.
        **kwargs: Additional arguments for train_test_split.

    Returns:
        Tuple of (X_train, X_test, y_train, y_test, idx_train, idx_test).
    """
    if random_state is not None:
        torch.manual_seed(random_state)

    n_pairs, n_dims = X.shape
    n_nodes = int(n_pairs**0.5)
    assert n_nodes**2 == n_pairs, f"Expected {n_nodes}^2 = {n_nodes**2} pairs, got {n_pairs}"

    # Downsample if requested (before split to maintain structure)
    if downsample is not None:
        pos_mask = y == 1
        neg_mask = y == 0

        pos_indices = torch.where(pos_mask)[0]
        neg_indices = torch.where(neg_mask)[0]

        # Sample up to 'downsample' examples from each class
        n_pos = min(len(pos_indices), downsample)
        n_neg = min(len(neg_indices), downsample)

        sampled_pos = pos_indices[torch.randperm(len(pos_indices))[:n_pos]]
        sampled_neg = neg_indices[torch.randperm(len(neg_indices))[:n_neg]]

        # Create a mask for selected pairs
        mask = torch.zeros(n_pairs, dtype=torch.bool)
        mask[sampled_pos] = True
        mask[sampled_neg] = True

        # Zero out unselected pairs
        X_filtered = X.clone()
        y_filtered = y.clone()
        X_filtered[~mask] = 0
        y_filtered[~mask] = 0
    else:
        X_filtered = X
        y_filtered = y

    # Reshape to adjacency format
    X_adj = X_filtered.view(n_nodes, n_nodes, n_dims)
    y_adj = y_filtered.view(n_nodes, n_nodes)

    # Split nodes into train/test
    node_indices = torch.arange(n_nodes)
    idx_train, idx_test = train_test_split(node_indices, test_size=test_size, random_state=random_state, **kwargs)

    # Extract train and test subgraphs and flatten
    X_train = X_adj[idx_train][:, idx_train].reshape(-1, n_dims)
    y_train = y_adj[idx_train][:, idx_train].reshape(-1)

    X_test = X_adj[idx_test][:, idx_test].reshape(-1, n_dims)
    y_test = y_adj[idx_test][:, idx_test].reshape(-1)

    return X_train, X_test, y_train, y_test, idx_train, idx_test
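
The node-level split can be illustrated on a toy grid. This is a minimal sketch, assuming hand-picked `idx_train` and `idx_test` in place of the indices that `train_test_split` would return; the data is synthetic.

```python
import torch

n_nodes, n_dims = 5, 2
# Synthetic pair features: row k encodes pair (k // n_nodes, k % n_nodes).
X = torch.arange(n_nodes * n_nodes * n_dims, dtype=torch.float32).view(-1, n_dims)

# Hypothetical node split (the function draws these with train_test_split).
idx_train = torch.tensor([0, 2, 4])
idx_test = torch.tensor([1, 3])

# View the flat pair list as an (n_nodes, n_nodes) grid of feature vectors.
X_adj = X.view(n_nodes, n_nodes, n_dims)

# Keep only pairs whose two endpoints fall in the same split.
X_train = X_adj[idx_train][:, idx_train].reshape(-1, n_dims)
X_test = X_adj[idx_test][:, idx_test].reshape(-1, n_dims)

n_dropped = n_nodes**2 - len(X_train) - len(X_test)
```

Here 9 train pairs and 4 test pairs survive out of 25, so 12 cross-split pairs are dropped. This keeps test nodes entirely unseen during training, at the cost of discarding data.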

visualization

Manify visualization utilities.

hyperboloid_to_poincare(X)

Convert hyperboloid coordinates to Poincaré ball coordinates.

Parameters:
  • X (Float[Tensor, 'n_points n_dim']) –

    Input coordinates in the hyperboloid model.

Returns:
  • poincare_coords( Float[Tensor, 'n_points n_dim-1'] ) –

    Coordinates in the Poincaré ball model.

Source code in manify/utils/visualization.py
def hyperboloid_to_poincare(X: Float[torch.Tensor, "n_points n_dim"]) -> Float[torch.Tensor, "n_points n_dim-1"]:
    """Convert hyperboloid coordinates to Poincaré ball coordinates.

    Args:
        X: Input coordinates in the hyperboloid model.

    Returns:
        poincare_coords: Coordinates in the Poincaré ball model.
    """
    # Spatial components: all columns except the first
    x_space = X[:, 1:]

    # Time-like component: first column, reshaped for broadcasting
    x_time = X[:, 0:1]

    # Convert to Poincaré ball coordinates
    poincare_coords = x_space / (1 + x_time)

    return poincare_coords
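
As a sanity check on the projection above, the sketch below builds a few points on the hyperboloid and confirms that they land inside the unit ball. It assumes curvature -1 and the time-like-first convention used in the listing; the sample points are arbitrary.

```python
import torch

# Arbitrary spatial coordinates; the first point is the hyperboloid's origin.
x_space = torch.tensor([[0.0, 0.0], [1.0, 2.0], [-3.0, 0.5]])

# Assumption: curvature -1, so -x0^2 + |x_space|^2 = -1 and
# x0 = sqrt(1 + |x_space|^2).
x_time = torch.sqrt(1 + (x_space**2).sum(dim=1, keepdim=True))
X = torch.cat([x_time, x_space], dim=1)

# Same projection as in the listing above.
poincare = X[:, 1:] / (1 + X[:, 0:1])
norms = poincare.norm(dim=1)
```

The hyperboloid origin `(1, 0, 0)` maps to the center of the ball, and every other point maps strictly inside the unit circle.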

spherical_to_polar(X)

Convert spherical coordinates to polar coordinates.

Parameters:
  • X (Float[Tensor, 'n_points n_dim']) –

    Input coordinates in spherical form.

Returns:
  • polar_coords( Float[Tensor, 'n_points n_dim-1'] ) –

    Coordinates in polar form.

Source code in manify/utils/visualization.py
def spherical_to_polar(X: Float[torch.Tensor, "n_points n_dim"]) -> Float[torch.Tensor, "n_points n_dim-1"]:
    """Convert spherical coordinates to polar coordinates.

    Args:
        X: Input coordinates in spherical form.

    Returns:
        polar_coords: Coordinates in polar form.
    """
    # Radius computation
    r = torch.norm(X, dim=1, keepdim=True)

    # Prepare output tensor
    out = torch.zeros_like(X)
    out[:, 0] = r.squeeze()  # Set the radius

    # Compute angles
    for i in range(1, X.size(1)):
        if i == X.size(1) - 1:
            # Last angle, use atan2 for full 360 degree
            out[:, i] = torch.atan2(X[:, i - 1], X[:, i - 2])
        else:
            # Compute angle from the higher dimension 'hypotenuse'
            hypotenuse = torch.norm(X[:, i:], dim=1, keepdim=True)
            # Prevent division by zero
            safe_hypotenuse = torch.where(hypotenuse > 0, hypotenuse, torch.tensor(1.0).to(X.device))
            # Ensure acos receives values within [-1, 1] and preserve dimensions
            angle = torch.acos(torch.clamp(X[:, i : i + 1] / safe_hypotenuse, -1, 1))
            out[:, i] = angle.squeeze()

    return out[:, 1:]

S2_to_polar(X)

Convert S^2 (2-sphere) coordinates to polar coordinates.

Parameters:
  • X (Float[Tensor, 'n_points 3']) –

    Input coordinates on the 2-sphere.

Returns:
  • polar_coords( Float[Tensor, 'n_points 2'] ) –

    Coordinates in polar form (polar angle measured from the +z axis, azimuth).

Source code in manify/utils/visualization.py
def S2_to_polar(X: Float[torch.Tensor, "n_points 3"]) -> Float[torch.Tensor, "n_points 2"]:
    """Convert S^2 (2-sphere) coordinates to polar coordinates.

    Args:
        X: Input coordinates on the 2-sphere.

    Returns:
        polar_coords: Coordinates in polar form (polar angle measured from the +z axis, azimuth).
    """
    return torch.stack([torch.acos(X[:, 2]), torch.atan2(X[:, 1], X[:, 0])], dim=1)
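
A round trip shows which angles the two output columns hold. This is a minimal sketch with arbitrary test angles: it builds unit vectors from a known polar angle `theta` and azimuth `phi`, then applies the same expression as the listing above.

```python
import torch

theta = torch.tensor([0.3, 1.2, 2.5])  # angle from the +z axis, in (0, pi)
phi = torch.tensor([-2.0, 0.5, 3.0])  # azimuth in (-pi, pi)

# Unit vectors on the 2-sphere from (theta, phi).
X = torch.stack(
    [
        torch.sin(theta) * torch.cos(phi),
        torch.sin(theta) * torch.sin(phi),
        torch.cos(theta),
    ],
    dim=1,
)

# Same expression as in the listing above.
polar = torch.stack([torch.acos(X[:, 2]), torch.atan2(X[:, 1], X[:, 0])], dim=1)
```

The first column recovers `theta` and the second recovers `phi`. Inputs whose z-coordinate drifts slightly outside [-1, 1] through rounding would make `acos` return NaN, so clamping beforehand may be worth considering for noisy data.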