# Generate NKI Kernels from NKIPy Kernels

NKIPy is a project under active development to help Neuron users write and execute kernels on Trainium with ease.

This demo focuses on NKIPy's ability to take NumPy kernels and generate NKI code from them.

Let's first look at a few examples!

## Softmax Examples

In [1]:
import numpy as np

# Here we have a softmax kernel implemented in NumPy
def softmax_kernel(x):
    exp_x = np.exp(np.subtract(x, np.max(x, axis=-1, keepdims=True)))
    sum_x = np.sum(exp_x, axis=-1, keepdims=True)

    return np.divide(exp_x, sum_x)

Since NKIPy kernels can be just NumPy kernels, they can run as such.

In [2]:
x = np.random.rand(2, 2).astype(np.float32)
print("Input:", x)
out = softmax_kernel(x)
print("Output:", out)

Input: [[0.9579238  0.1510723 ]
 [0.64579445 0.8481816 ]]
Output: [[0.69143814 0.3085618 ]
 [0.44957522 0.55042475]]
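
As a quick sanity check on the kernel above, the sketch below confirms that each output row sums to 1 and shows why the kernel subtracts the row maximum before exponentiating. `naive_softmax` is a hypothetical comparison function that is not part of the original demo; it is included only to show the overflow that the subtraction avoids.

```python
import numpy as np

def softmax_kernel(x):
    # Same kernel as above, repeated so this block runs on its own
    exp_x = np.exp(np.subtract(x, np.max(x, axis=-1, keepdims=True)))
    sum_x = np.sum(exp_x, axis=-1, keepdims=True)
    return np.divide(exp_x, sum_x)

def naive_softmax(x):
    # Hypothetical variant for comparison only: no max subtraction
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

x = np.random.rand(2, 2).astype(np.float32)
out = softmax_kernel(x)

# Every row of a softmax output should sum to 1 (up to float32 rounding)
rows_sum_to_one = np.allclose(out.sum(axis=-1), 1.0, atol=1e-5)

# With large inputs, exp() overflows float32 unless the max is subtracted first
big = np.array([[1000.0, 1001.0]], dtype=np.float32)
stable = softmax_kernel(big)
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(big)  # inf / inf yields NaN

print("Rows sum to one:", rows_sum_to_one)
print("Stable result:", stable)
print("Naive result:", naive)
```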


To generate NKI code from the NumPy function above, we first need to trace it. In this step, NKIPy goes through the NumPy kernel and converts it to a NKIPy kernel.

To trace it, we wrap it with `NKIPyKernel.trace`, then specialize it with a concrete input so that all shapes are known.

In [3]:
from nkipy.core.trace import NKIPyKernel

softmax_nkipy_kernel = NKIPyKernel.trace(softmax_kernel)

Now that the function is traced, it is a NKIPy kernel, and we are ready to convert it to NKI.

In [4]:
from nkipy.core.compile import lower_to_nki

softmax_nkipy_kernel.specialize(x)
nki_code = lower_to_nki(softmax_nkipy_kernel)

In [5]:
# Helper function to display the generated code with syntax highlighting
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from IPython.display import HTML, display

def display_code(code):
    formatter = HtmlFormatter(style='friendly', full=True)
    highlighted_code = highlight(code, PythonLexer(), formatter)
    
    custom_html: str = f"""
    <div>
        {highlighted_code}
    </div>
    """
    display(HTML(custom_html))


display_code(nki_code)

The above NKI code is translated from the optimized intermediate representation of the Neuron Compiler.

There are still some gaps here from a proper hand-written NKI kernel:

- We recently changed how NKI code returns output tensors. This generated NKI uses the old syntax, which takes the output tensor as an input argument.
- Some extra structures are created, such as `BB_entry_1`.
- All shapes are concrete values rather than variables.

Let's try to change the input tensor shape, and do the process again.

In [6]:
x = np.random.rand(256, 256).astype(np.float32)
softmax_nkipy_kernel = NKIPyKernel.trace(softmax_kernel)
_ = softmax_nkipy_kernel.specialize(x)
nki_code = lower_to_nki(softmax_nkipy_kernel)

display_code(nki_code)

The generated NKI code is certainly different!
We see a new loop being introduced, `for i0 in nl.affine_range(2):`. This is because the compiler tiles the computation to fit the partition dimension of the Trainium hardware, which holds at most 128 rows: the 256-row input is split into 256 / 128 = 2 tiles.
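
To make the tiling idea concrete, here is a minimal NumPy sketch of the same pattern. It is not the generated NKI code: `tiled_softmax` and `PARTITION_SIZE` are hypothetical names used for illustration, and the sketch assumes the row count is a multiple of 128. Because softmax is computed independently per row, processing 128 rows at a time gives the same result as processing the whole array.

```python
import numpy as np

PARTITION_SIZE = 128  # partition dimension size mentioned in the text

def softmax_kernel(x):
    # Same kernel as earlier in the demo, repeated so this block runs on its own
    exp_x = np.exp(np.subtract(x, np.max(x, axis=-1, keepdims=True)))
    sum_x = np.sum(exp_x, axis=-1, keepdims=True)
    return np.divide(exp_x, sum_x)

def tiled_softmax(x, tile_rows=PARTITION_SIZE):
    # Illustration only: assumes x.shape[0] is a multiple of tile_rows
    assert x.shape[0] % tile_rows == 0
    num_tiles = x.shape[0] // tile_rows
    out = np.empty_like(x)
    for i0 in range(num_tiles):  # plays the role of the i0 loop in the generated code
        rows = slice(i0 * tile_rows, (i0 + 1) * tile_rows)
        out[rows] = softmax_kernel(x[rows])
    return out, num_tiles

x = np.random.rand(256, 256).astype(np.float32)
tiled_out, num_tiles = tiled_softmax(x)
print("Number of tiles:", num_tiles)
print("Matches untiled result:", np.allclose(tiled_out, softmax_kernel(x)))
```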

## Matrix Multiplication Examples

Now let's move on to something more interesting: matrix multiplication, an operation that Trainium hardware is particularly good at. Let's see how the Neuron Compiler makes full use of the hardware!

This goes well with the [NKI Matrix Multiplication Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials/matrix_multiplication.html).

In [7]:
def matmul_kernel(x, y):
    return np.matmul(x, y)

def gen_matmul_nki(M, N, K):
    x = np.random.rand(M, K).astype(np.float32)
    y = np.random.rand(K, N).astype(np.float32)

    matmul_nkipy_kernel = NKIPyKernel.trace(matmul_kernel)
    _ = matmul_nkipy_kernel.specialize(x, y)
    nki_code = lower_to_nki(matmul_nkipy_kernel)
    return nki_code

display_code(gen_matmul_nki(M=64, N=512, K=128)) # [64, 128] @ [128, 512]

Compared to the [Basic Compute Kernel](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials/matrix_multiplication.html#basic-compute-kernel) in the tutorial, this generated kernel performs an additional `nc_matmul`. That extra call transposes `x`: unlike in the tutorial, the lhs is not pre-transposed when it is passed to the kernel.
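
The sketch below models in plain NumPy why an extra matmul can serve as a transpose. It rests on two assumptions that the demo itself does not state: that the matmul instruction contracts over the first axis of both operands (so it effectively computes `stationary.T @ moving`), and that the transpose is obtained by multiplying with an identity matrix. `nc_matmul_model` is a hypothetical stand-in, not the NKI API.

```python
import numpy as np

def nc_matmul_model(stationary, moving):
    # NumPy model only (assumption): contract over axis 0 of both operands
    return stationary.T @ moving

M, K, N = 64, 128, 512
x = np.random.rand(M, K).astype(np.float32)  # lhs, not pre-transposed
y = np.random.rand(K, N).astype(np.float32)  # rhs

# Step 1: transpose x with a matmul against the identity: x.T @ I == x.T
identity = np.eye(M, dtype=np.float32)
x_t = nc_matmul_model(x, identity)  # shape (K, M)

# Step 2: the actual product: (x.T).T @ y == x @ y
result = nc_matmul_model(x_t, y)  # shape (M, N)

print("Transposed shape:", x_t.shape)
print("Matches np.matmul:", np.allclose(result, np.matmul(x, y), rtol=1e-4))
```

If the lhs were supplied already transposed, as in the tutorial, step 1 would be unnecessary.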

Let's try with some larger sizes.

In [8]:
display_code(gen_matmul_nki(M=1024, N=1024, K=512)) # [1024, 512] @ [512, 1024]

Now we are seeing more tiling structure.
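
For intuition about that tiling structure, here is a rough NumPy sketch of a blocked matrix multiplication. It is not the generated code: `tiled_matmul` is a hypothetical helper, and the tile sizes (128 for M and K, 512 for N) are assumed from the NKI tutorial rather than read off the compiler output. The sketch uses smaller shapes than the cell above so that it runs quickly.

```python
import numpy as np

# Assumed tile sizes, following the NKI matrix multiplication tutorial
TILE_M, TILE_K, TILE_N = 128, 128, 512

def tiled_matmul(x, y):
    # Illustration only: assumes every dimension is a multiple of its tile size
    M, K = x.shape
    K2, N = y.shape
    assert K == K2
    assert M % TILE_M == 0 and K % TILE_K == 0 and N % TILE_N == 0
    out = np.zeros((M, N), dtype=np.float32)
    for m in range(0, M, TILE_M):
        for n in range(0, N, TILE_N):
            # Accumulate partial products over the contraction dimension K
            acc = np.zeros((TILE_M, TILE_N), dtype=np.float32)
            for k in range(0, K, TILE_K):
                acc += x[m:m + TILE_M, k:k + TILE_K] @ y[k:k + TILE_K, n:n + TILE_N]
            out[m:m + TILE_M, n:n + TILE_N] = acc
    return out

x = np.random.rand(256, 256).astype(np.float32)
y = np.random.rand(256, 1024).astype(np.float32)
tiled_result = tiled_matmul(x, y)
print("Matches np.matmul:", np.allclose(tiled_result, np.matmul(x, y), rtol=1e-4))
```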

Let's try something even larger to see how it gets handled!

In [9]:
display_code(gen_matmul_nki(M=4096, N=4096, K=2048)) # [4096, 2048] @ [2048, 4096]

The generated NKI code for matrix multiplication with the large size is very similar to the [Optimized Matmul Code](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials/matrix_multiplication.html#optimization-3-further-blocking-and-dma-efficiency-optimization) in the tutorial.

Have fun experimenting with NKIPy now!