Patterns for Mojo solutions on Tensara (Mojo 26.1)

Soham Jog · Feb 22, 2026


Mojo GPU on Tensara (Mojo 26.1): Practical Patterns That Don’t Break

We recently upgraded to Mojo 26.1 on Tensara, which introduced a new GPU programming model. This post shares practical patterns and gotchas we’ve learned while adapting our kernels to the new system. If you're using LLMs to help write Mojo GPU code, this context should help them generate more accurate and efficient code as well.

We cover:

  • how pointer arguments work in the current harness
  • the infamous 0.0 / Float64 type pitfall
  • when to stick with UnsafePointer vs when to graduate to LayoutTensor
  • how LayoutTensor unlocks shared memory tiling (the big performance lever)

Device pointers arrive as Int

In Mojo 26.1 on Tensara, GPU buffers are provided to solution(...) as integer addresses. You turn those raw addresses into something indexable by casting them to UnsafePointer:

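from memory import UnsafePointer

comptime dtype = DType.float32

@export
fn solution(input_addr: Int, output_addr: Int, n: Int32) raises:
    # The harness passes raw device addresses; rebuild typed pointers from them.
    input = UnsafePointer[Scalar[dtype], MutExternalOrigin](unsafe_from_address=input_addr)
    output = UnsafePointer[Scalar[dtype], MutExternalOrigin](unsafe_from_address=output_addr)
    # input[i] and output[i] now index Float32 elements; the kernel launch is omitted here.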

After that, treat your data as a flat 1D array and do the row-major indexing math yourself. For a densely packed matrix, element (row, col) lives at row * stride + col, where stride is the number of columns.
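The indexing math above can be written as a tiny helper. This is a minimal sketch: flat_index is a hypothetical name, not something the Tensara harness provides, and it assumes a densely packed matrix whose stride equals its column count.

```mojo
# Hypothetical helper for illustration; not part of the Tensara harness.
fn flat_index(row: Int, col: Int, stride: Int) -> Int:
    # Row-major: skip `row` full rows of `stride` elements, then move `col` in.
    return row * stride + col


fn main():
    # In a 3 x 4 matrix (stride 4), element (2, 1) sits at flat offset 2 * 4 + 1.
    print(flat_index(2, 1, 4))  # prints 9
```

Keeping the formula in one place makes it harder to swap row and col by accident when several matrices with different strides are in play.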

The 0.0 trap

Mojo is strict about numeric types. One of the most common “why won’t this compile?” moments is accidentally introducing Float64 into code that should be Float32.

A classic example is initializing an accumulator with 0.0 and then adding Float32 values to it with +=. When nothing else pins its type, a bare 0.0 can be treated as Float64, so the accumulator becomes Float64 and the mixed Float32/Float64 arithmetic stops type-checking.

The easiest workaround is to avoid writing 0.0 and instead derive a zero value from a Float32 you already have. This relies on the value being finite: for an infinite or NaN x, x - x is NaN rather than zero.

x = input[idx]     # Float32
zero = x - x       # Float32 zero, no literals involved

With that, ReLU-style code stays clean and type-stable:

if x > zero:
    output[idx] = x
else:
    output[idx] = zero
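If deriving zero from the data feels indirect, pinning the type through an explicit constructor is another option. The sketch below is an assumption rather than something taken from the Tensara harness: relu is a hypothetical helper, and it relies on Scalar[dtype](0) building a zero of exactly the element type, which also sidesteps the question of what x - x gives for non-finite inputs.

```mojo
comptime dtype = DType.float32


# Hypothetical helper for illustration only.
fn relu(x: Scalar[dtype]) -> Scalar[dtype]:
    # Constructing zero through Scalar[dtype] fixes its type to Float32,
    # so no Float64 value can leak into the comparison or the result.
    zero = Scalar[dtype](0)
    if x > zero:
        return x
    return zero
```

Because zero is tied to dtype, changing the comptime dtype line is enough to retarget the helper to another element type.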

UnsafePointer vs LayoutTensor: when to use which

For many EASY problems (elementwise ops like activations), UnsafePointer is enough:

  • one thread = one output element
  • flat indexing
  • bounds-check
  • store result
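The four steps above map onto a kernel body roughly like this. Treat it as a sketch under assumptions: relu_kernel is a hypothetical name, the pointer type simply mirrors the casts shown earlier, n is assumed to be converted to Int by the caller, and the launch code is left out because it depends on the harness.

```mojo
from gpu import block_dim, block_idx, thread_idx
from memory import UnsafePointer

comptime dtype = DType.float32


# Hypothetical elementwise kernel; launching it is not shown here.
fn relu_kernel(
    input: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    output: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    n: Int,
):
    # One thread = one output element, addressed by a flat global index.
    idx = Int(block_idx.x * block_dim.x + thread_idx.x)

    # Bounds-check: the last block may contain threads past the end of the data.
    if idx < n:
        x = input[idx]
        zero = x - x  # Float32 zero derived from the data (assumes finite x)
        # Store the result.
        if x > zero:
            output[idx] = x
        else:
            output[idx] = zero
```

The grid should cover at least n threads, so the block count is typically the ceiling of n divided by the block size.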

But once performance matters, especially for matrix multiplication, the bottleneck becomes memory traffic rather than arithmetic. That’s where LayoutTensor starts paying for itself.

You can think of LayoutTensor as doing two jobs:

1) Give structure to memory (optional)

Sometimes you use LayoutTensor to make indexing intent clearer (row-major layouts, shapes). This can reduce bugs, but it’s not automatically faster.

2) Allocate shared memory tiles (the important one)

The biggest win is using LayoutTensor to allocate and manage shared memory:

from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor

smem = LayoutTensor[
    dtype,
    Layout.row_major(256),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()

Once you have shared memory, you can use the standard GPU tiling strategy:

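  1. threads cooperatively load a tile into shared memory
  2. synchronize the block (barrier()) so every load is visible to every thread
  3. compute using the shared tile (fast reuse)
  4. synchronize again so no thread overwrites a tile that others are still reading, then repeat for the next tile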

That pattern is the difference between “works on small matrices” and “doesn’t hit the time limit (TLE) on large ones”.
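Put together, the tiling loop for a square matrix multiply might look like the sketch below. Everything beyond the calls already shown in this post is an assumption: tiled_matmul_kernel and TILE are hypothetical names, the matrices are assumed to be n x n and row-major, the block is assumed to be launched with TILE x TILE threads, and scalar stores into the shared tiles may need an explicit conversion depending on the layout.

```mojo
from gpu import barrier, block_idx, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from memory import UnsafePointer

comptime dtype = DType.float32
comptime TILE = 16  # assumed tile edge; launch with TILE x TILE threads per block


# Hypothetical kernel: C = A @ B for square n x n row-major matrices.
fn tiled_matmul_kernel(
    a: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    b: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    c: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    n: Int,
):
    tx = Int(thread_idx.x)
    ty = Int(thread_idx.y)
    row = Int(block_idx.y) * TILE + ty
    col = Int(block_idx.x) * TILE + tx

    # One shared tile per input matrix, allocated as shown earlier.
    a_tile = LayoutTensor[
        dtype,
        Layout.row_major(TILE, TILE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_tile = LayoutTensor[
        dtype,
        Layout.row_major(TILE, TILE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    acc = Scalar[dtype](0)
    for t in range((n + TILE - 1) // TILE):
        a_col = t * TILE + tx
        b_row = t * TILE + ty

        # 1. Cooperative load; out-of-range slots are padded with zero.
        if row < n and a_col < n:
            a_tile[ty, tx] = a[row * n + a_col]
        else:
            a_tile[ty, tx] = Scalar[dtype](0)
        if b_row < n and col < n:
            b_tile[ty, tx] = b[b_row * n + col]
        else:
            b_tile[ty, tx] = Scalar[dtype](0)

        # 2. Wait until the whole tile has been written.
        barrier()

        # 3. Reuse the shared tile; [0] extracts a scalar lane.
        for k in range(TILE):
            acc += a_tile[ty, k][0] * b_tile[k, tx][0]

        # 4. Wait before the next iteration overwrites the tiles.
        barrier()

    if row < n and col < n:
        c[row * n + col] = acc
```

Each element of A and B is read from global memory once per tile instead of once per multiply, which is where the reduction in memory traffic comes from.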

A small LayoutTensor gotcha: scalar vs SIMD element types

Depending on the layout/type, indexing a LayoutTensor can produce a SIMD-flavored element type. If you need an actual scalar Float32 value, extracting lane 0 is a simple fix:

val = smem[index][0]

If you ever see an error along the lines of “cannot convert LayoutTensor.element_type to Float32”, this is often the reason.

Specializing for known testcase sizes (when the judge repeats shapes)

If you’ve already seen the exact test shapes from earlier submissions, you can often squeeze out more performance by specializing for those sizes:

  • choose tile sizes that fit the shape perfectly
  • unroll fixed loop bounds
  • pick vector widths that stay aligned

The key is to keep a general path too: specialize the frequent cases, but don’t paint yourself into a corner if a hidden test uses a different size.
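One way to structure this is a compile-time parameter for the tile size plus a runtime dispatch that falls back to a general path. The sketch below only illustrates the shape of that dispatch: tiles_per_side and plan are hypothetical names, 4096 is a made-up example of a previously seen size, and the function body stands in for whatever a real specialized kernel launch would do.

```mojo
# Hypothetical stand-in for work specialized on a compile-time tile size.
fn tiles_per_side[tile: Int](n: Int) -> Int:
    # `tile` is known at compile time, so loops bounded by it can be unrolled.
    return (n + tile - 1) // tile


fn plan(n: Int) -> Int:
    # Fast path: a size assumed to have appeared in earlier submissions.
    if n == 4096:
        return tiles_per_side[32](n)
    # General path: still correct for any size a hidden test might use.
    return tiles_per_side[16](n)


fn main():
    print(plan(4096))  # specialized path: 4096 / 32 = 128
    print(plan(1000))  # general path: ceil(1000 / 16) = 63
```

The specialized branch can assume the tile divides the shape evenly, while the fallback keeps the bounds handling that arbitrary sizes need.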

Concrete implementation examples

Two complete submissions you can use as references:

