Patterns for Mojo solutions on Tensara (Mojo 26.1)

Soham Jog

Feb 22, 2026

Mojo GPU on Tensara (Mojo 26.1): Practical Patterns That Don’t Break

We recently upgraded to Mojo 26.1 on Tensara, which introduced a new GPU programming model. This post shares practical patterns and gotchas we’ve learned while adapting our kernels to the new system. If you're using LLMs to help write Mojo GPU code, this context should help them generate more accurate and efficient code as well.

We cover:

how pointer arguments work in the current harness
the infamous 0.0 / Float64 type pitfall
when to stick with UnsafePointer vs when to graduate to LayoutTensor
how LayoutTensor unlocks shared memory tiling (the big performance lever)

Device pointers arrive as `Int`

In Mojo 26.1 on Tensara, GPU buffers are provided to solution(...) as integer addresses. You turn those raw addresses into something indexable by casting them to UnsafePointer:

from memory import UnsafePointer

comptime dtype = DType.float32

@export
fn solution(input_addr: Int, output_addr: Int, n: Int32) raises:
    # Reinterpret each raw device address as a typed pointer that can be indexed.
    input = UnsafePointer[Scalar[dtype], MutExternalOrigin](unsafe_from_address=input_addr)
    output = UnsafePointer[Scalar[dtype], MutExternalOrigin](unsafe_from_address=output_addr)

After that, treat your data as a flat 1D array and do the row-major index math yourself. For a matrix, that usually means row * stride + col, where stride is the number of elements per row.
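As a small illustration of that index math, here is a minimal sketch. The helper name flat_index is hypothetical and not part of the Tensara harness; it only shows the arithmetic for a contiguous row-major matrix.

```mojo
# Hypothetical helper for illustration; not part of the Tensara harness.
fn flat_index(row: Int, col: Int, stride: Int) -> Int:
    # Row-major layout: each row starts `stride` elements after the previous one.
    return row * stride + col


fn main():
    # A 3 x 4 matrix stored contiguously has stride 4,
    # so element (1, 2) lives at flat offset 1 * 4 + 2 = 6.
    print(flat_index(1, 2, 4))
```

If rows are padded, stride is the padded row length rather than the logical column count, which is why it is passed separately here.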

The `0.0` trap

Mojo is strict about numeric types. One of the most common “why won’t this compile?” moments is accidentally introducing Float64 into code that should be Float32.

A classic example is starting an accumulator with 0.0 and then doing += with Float32 values. Depending on context, 0.0 can be treated as Float64, and suddenly your arithmetic stops type-checking.

The easiest workaround is to avoid writing 0.0 and instead derive a zero from a Float32 you already have. Note that this relies on x being finite: if x is infinite or NaN, x - x is NaN rather than zero.

x = input[idx]     # Float32
zero = x - x       # Float32 zero, no literals involved

With that, ReLU-style code stays clean and type-stable:

if x > zero:
    output[idx] = x
else:
    output[idx] = zero

`UnsafePointer` vs `LayoutTensor`: when to use which

For many EASY problems (elementwise ops like activations), UnsafePointer is enough:

one thread = one output element
flat indexing
bounds-check
store result
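The four steps above can be sketched as a single kernel. This is a sketch under assumptions, not a verified submission: the kernel name relu_kernel is hypothetical, the thread-index imports from the gpu package are assumed to be available in Mojo 26.1 as in earlier releases, and the launch code is left out because it depends on the harness.

```mojo
from gpu import block_dim, block_idx, thread_idx
from memory import UnsafePointer

comptime dtype = DType.float32


# Hypothetical kernel name; the launch from solution(...) is not shown.
fn relu_kernel(
    input: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    output: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    n: Int,
):
    # One thread = one output element, addressed with a flat index.
    idx = Int(block_idx.x * block_dim.x + thread_idx.x)

    # Bounds-check: the last block may have more threads than remaining elements.
    if idx < n:
        x = input[idx]
        zero = x - x  # Float32 zero without a literal (assumes x is finite)

        # Store the result.
        if x > zero:
            output[idx] = x
        else:
            output[idx] = zero
```

The bounds check matters whenever n is not a multiple of the block size, because the grid is rounded up to whole blocks.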

But once performance matters, especially for matrix multiplication, your bottleneck becomes memory traffic, not arithmetic. That’s where LayoutTensor starts paying for itself.

You can think of LayoutTensor as doing two jobs:

1) Give structure to memory (optional)

Sometimes you use LayoutTensor to make indexing intent clearer (row-major layouts, shapes). This can reduce bugs, but it’s not automatically faster.

2) Allocate shared memory tiles (the important one)

The biggest win is using LayoutTensor to allocate and manage shared memory:

from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor

smem = LayoutTensor[
    dtype,
    Layout.row_major(256),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()

Once you have shared memory, you can use the standard GPU tiling strategy:

threads cooperatively load a tile into shared memory
synchronize the block (barrier())
compute using the shared tile (fast reuse)
repeat for the next tile

That pattern is the difference between “works on small matrices” and “doesn’t hit the time limit (TLE) on large ones”.
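To make the load / barrier / compute / repeat loop concrete, here is a sketch of a tiled matrix multiplication kernel. Several things are assumptions rather than facts from the harness: the kernel name, the TILE value of 16, the launch shape (TILE x TILE threads per block, which is not shown), and 2D indexing on a Layout.row_major(TILE, TILE) tensor. It has not been compiled against Tensara, so treat it as an outline of the technique.

```mojo
from gpu import barrier, block_idx, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from memory import UnsafePointer

comptime dtype = DType.float32
comptime TILE = 16  # assumed tile edge; expects TILE x TILE threads per block


# Hypothetical kernel: C (m x n) = A (m x k) * B (k x n), all row-major.
fn tiled_matmul_kernel(
    a: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    b: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    c: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    m: Int,
    n: Int,
    k: Int,
):
    # One shared-memory tile per input matrix.
    a_tile = LayoutTensor[
        dtype,
        Layout.row_major(TILE, TILE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_tile = LayoutTensor[
        dtype,
        Layout.row_major(TILE, TILE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    tx = Int(thread_idx.x)
    ty = Int(thread_idx.y)
    row = Int(block_idx.y) * TILE + ty
    col = Int(block_idx.x) * TILE + tx

    zero = Scalar[dtype](0)  # explicitly typed, so no Float64 sneaks in
    acc = Scalar[dtype](0)

    for t in range((k + TILE - 1) // TILE):
        a_col = t * TILE + tx
        b_row = t * TILE + ty

        # 1) Cooperative load; out-of-range elements are padded with zero.
        if row < m and a_col < k:
            a_tile[ty, tx] = a[row * k + a_col]
        else:
            a_tile[ty, tx] = zero
        if b_row < k and col < n:
            b_tile[ty, tx] = b[b_row * n + col]
        else:
            b_tile[ty, tx] = zero

        # 2) Wait until the whole tile is in shared memory.
        barrier()

        # 3) Compute from the shared tiles (each value is reused TILE times).
        for i in range(TILE):
            acc += a_tile[ty, i][0] * b_tile[i, tx][0]

        # 4) Wait before the next iteration overwrites the tiles.
        barrier()

    if row < m and col < n:
        c[row * n + col] = acc
```

The second barrier is easy to forget: without it, a fast thread can start loading the next tile while slower threads are still reading the current one.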

A small `LayoutTensor` gotcha: scalar vs SIMD element types

Depending on the layout/type, indexing a LayoutTensor can produce a SIMD-flavored element type. If you need an actual scalar Float32 value, extracting lane 0 is a simple fix:

val = smem[index][0]

If you ever see an error along the lines of “cannot convert LayoutTensor.element_type to Float32”, this is often the reason.

Specializing for known testcase sizes (when the judge repeats shapes)

If you’ve already seen the exact test shapes from earlier submissions, you can often squeeze out more performance by specializing for those sizes:

choose tile sizes that fit the shape perfectly
unroll fixed loop bounds
pick vector widths that stay aligned

The key is to keep a general path too: specialize the frequent cases, but don’t paint yourself into a corner if a hidden test uses a different size.
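One way to keep both paths is to make the tile size a compile-time parameter and choose it at runtime from the problem size. The sketch below is illustrative only: run_with_tile is a hypothetical stand-in for a kernel launch (it just returns the number of blocks), and 4096 is an assumed example of a shape seen in earlier submissions, not a real Tensara test size.

```mojo
# Hypothetical stand-in for a launch that is compiled once per tile size.
fn run_with_tile[TILE: Int](n: Int) -> Int:
    # Number of TILE-wide blocks needed to cover n elements (ceiling division).
    return (n + TILE - 1) // TILE


fn dispatch(n: Int) -> Int:
    # Specialized path: a size that divides evenly by a larger tile.
    # 4096 is an assumed example, not a known test shape.
    if n == 4096:
        return run_with_tile[64](n)
    # General path: stays correct for any other size.
    return run_with_tile[16](n)


fn main():
    print(dispatch(4096))  # specialized path: 4096 / 64 = 64 blocks
    print(dispatch(1000))  # general path: ceil(1000 / 16) = 63 blocks
```

Because TILE is a parameter, each instantiation has fixed loop bounds that the compiler can unroll, while the fallback branch still handles unseen sizes.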

Concrete implementation examples

Two complete submissions you can use as references:
