Distributed Execution#

NKIPy supports multi-device execution with collective communication (CC) through DeviceKernel.compile_and_load. This guide covers the three execution patterns and when to use each.

Execution Patterns#

1. SPMD (default)#

When torch.distributed is initialized and is_spmd=True (the default), rank 0 traces and compiles the kernel, then broadcasts the NEFF path to all workers. All ranks load the same NEFF with CC enabled.

import torch.distributed as dist

dist.init_process_group(...)

kernel = DeviceKernel.compile_and_load(my_kernel, input_a, input_b)

Use this when every rank runs the same kernel with the same input shapes.

2. MPMD (is_spmd=False)#

Set is_spmd=False so every rank traces and compiles independently. This is required when different ranks run different kernels or different input shapes.

# With torch.distributed (CC auto-detected)
kernel = DeviceKernel.compile_and_load(
    my_kernel, input_a, input_b,
    is_spmd=False,
)

# Without torch.distributed (explicit CC)
kernel = DeviceKernel.compile_and_load(
    my_kernel, input_a, input_b,
    is_spmd=False,
    cc_enabled=True,
    rank_id=my_rank,
    world_size=total_workers,
)

3. No CC (single device or explicit opt-out)#

Without torch.distributed and without explicit CC parameters, the kernel loads for single-device execution. You can also pass cc_enabled=False to explicitly disable CC even when torch.distributed is active.

# Single device (no torch.distributed)
kernel = DeviceKernel.compile_and_load(my_kernel, input_a)

# Opt out of CC in a distributed setting
kernel = DeviceKernel.compile_and_load(my_kernel, input_a, cc_enabled=False)

Parameter Reference#

Parameter

Controls

Values

is_spmd

Compilation

True = rank-0 broadcast, False = all rank

cc_enabled

CC at load time

None = auto, True = on, False = off

rank_id

Rank for CC load

None = auto from dist, or explicit int

world_size

World size for CC

None = auto from dist, or explicit int

Comparison#

Setting

SPMD (default)

MPMD

No CC

is_spmd

True

False

Either

cc_enabled

None (auto)

None/True

False/None

torch.distributed

Required

Optional

N/A

Compilation

Rank 0 only + broadcast

Every rank

Every rank

Barrier

Yes

No

No

Use case

Same kernel, all ranks

Per-rank kernels

Single device

Build Directory Isolation#

In MPMD mode (is_spmd=False), the build directory is automatically namespaced by rank (e.g. build_dir/rank_0/, build_dir/rank_1/) to prevent concurrent writes when different ranks produce the same content hash. The rank is taken from the explicit rank_id parameter, or auto-detected from torch.distributed when available.

Caching#

Compiled NEFFs are cached in memory by a content hash of the HLO and compiler arguments. The cache key is the same regardless of CC mode, so a kernel compiled once can be reused across calls. Pass use_cached_if_exists=False to force recompilation.