We recently upgraded to Mojo 26.1 on Tensara, which introduced a new GPU programming model. This post shares practical patterns and gotchas we’ve learned while adapting our kernels to the new system. If you're using LLMs to help write Mojo GPU code, this context should help them generate more accurate and efficient code as well.
We cover:
- The 0.0 / Float64 type pitfall
- UnsafePointer vs when to graduate to LayoutTensor
- LayoutTensor unlocks shared memory tiling (the big performance lever)

GPU buffers arrive as Int addresses

In Mojo 26.1 on Tensara, GPU buffers are provided to solution(...) as integer addresses. You turn those raw addresses into something indexable by casting them to UnsafePointer:
from memory import UnsafePointer

comptime dtype = DType.float32

@export
fn solution(input_addr: Int, output_addr: Int, n: Int32) raises:
    # Reinterpret the raw integer addresses as typed pointers.
    input = UnsafePointer[Scalar[dtype], MutExternalOrigin](unsafe_from_address=input_addr)
    output = UnsafePointer[Scalar[dtype], MutExternalOrigin](unsafe_from_address=output_addr)
After that, treat your data as a flat 1D array and apply row-major indexing math yourself (for matrices, that usually means row * stride + col).
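As a minimal sketch of that indexing math, the helpers below compute the flat offset of a matrix element and read it through the raw pointer. The names flat_index and load_at are illustrative only (they are not part of Mojo or Tensara), and stride is assumed to be the number of elements per row.

```mojo
from memory import UnsafePointer

comptime dtype = DType.float32

# Hypothetical helper: flat offset of element (row, col) in a row-major
# buffer whose consecutive rows are `stride` elements apart.
fn flat_index(row: Int, col: Int, stride: Int) -> Int:
    return row * stride + col

# Hypothetical helper: read element (row, col) through the raw pointer.
fn load_at(
    data: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    row: Int,
    col: Int,
    stride: Int,
) -> Scalar[dtype]:
    return data[flat_index(row, col, stride)]
```

For a 3 x 4 matrix stored contiguously (stride 4), element (2, 1) sits at offset 2 * 4 + 1 = 9.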
The 0.0 trap

Mojo is strict about numeric types. One of the most common “why won’t this compile?” moments is accidentally introducing Float64 into code that should be Float32.
A classic example is starting an accumulator with 0.0 and then doing += with Float32 values. Depending on context, 0.0 can be treated as Float64, and suddenly your arithmetic stops type-checking.
The easiest workaround is to avoid writing 0.0 and instead derive a zero value from a Float32 you already have. This assumes x is finite: x - x is NaN when x is infinite or NaN.
x = input[idx] # Float32
zero = x - x # Float32 zero, no literals involved
With that, ReLU-style code stays clean and type-stable:
if x > zero:
    output[idx] = x
else:
    output[idx] = zero
UnsafePointer vs LayoutTensor: when to use which

For many EASY problems (elementwise ops like activations), UnsafePointer is enough.
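A minimal sketch of such an elementwise kernel is shown here. It assumes the thread-index built-ins (thread_idx, block_idx, block_dim) can be imported from the gpu module; the kernel name relu_kernel is hypothetical, and launching it from solution(...) is left out.

```mojo
from gpu import block_dim, block_idx, thread_idx  # assumed import path
from memory import UnsafePointer

comptime dtype = DType.float32

# Hypothetical ReLU kernel: each thread handles one element.
fn relu_kernel(
    input: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    output: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    n: Int,
):
    # Global 1D index of this thread.
    idx = Int(block_idx.x * block_dim.x + thread_idx.x)
    # Threads past the end of the buffer do nothing.
    if idx < n:
        x = input[idx]
        zero = x - x  # Float32 zero derived from x, as in the 0.0 trap section
        if x > zero:
            output[idx] = x
        else:
            output[idx] = zero
```

Every element is read once and written once, so there is no data reuse for shared memory to exploit; plain pointers are a reasonable fit here.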
But once performance matters—especially for matrix multiplication—your bottleneck becomes memory traffic, not arithmetic. That’s where LayoutTensor starts paying for itself.
You can think of LayoutTensor as doing two jobs:
Sometimes you use LayoutTensor to make indexing intent clearer (row-major layouts, shapes). This can reduce bugs, but it’s not automatically faster.
The biggest win is using LayoutTensor to allocate and manage shared memory:
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor

smem = LayoutTensor[
    dtype,
    Layout.row_major(256),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
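Once you have shared memory, you can use the standard GPU tiling strategy:
- Each block cooperatively loads a tile of the inputs from global memory into shared memory.
- All threads in the block synchronize (barrier()).
- Each thread computes on the tile from shared memory, reusing values that other threads loaded.
- The block synchronizes again before loading the next tile.

That pattern is the difference between “works on small matrices” and “doesn’t TLE on large ones”.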
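The tiling strategy can be sketched as a matrix multiplication kernel. This is an illustration under stated assumptions, not a reference submission: the kernel name and TILE width are hypothetical, barrier and the thread-index built-ins are assumed to be importable from the gpu module, the kernel is assumed to be launched with TILE x TILE threads per block, and scalar values from global memory are assumed to be assignable to shared-tile elements.

```mojo
from gpu import barrier, block_idx, thread_idx  # assumed import path
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from memory import UnsafePointer

comptime dtype = DType.float32
comptime TILE = 16  # hypothetical tile width

# Hypothetical kernel: C (m x n) = A (m x k) * B (k x n), all row-major.
fn tiled_matmul_kernel(
    a: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    b: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    c: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    m: Int,
    n: Int,
    k: Int,
):
    tx = Int(thread_idx.x)
    ty = Int(thread_idx.y)
    row = Int(block_idx.y) * TILE + ty
    col = Int(block_idx.x) * TILE + tx

    # One shared-memory tile per input matrix.
    tile_a = LayoutTensor[
        dtype,
        Layout.row_major(TILE, TILE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    tile_b = LayoutTensor[
        dtype,
        Layout.row_major(TILE, TILE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    var acc = Scalar[dtype](0)
    for t in range((k + TILE - 1) // TILE):
        a_col = t * TILE + tx
        b_row = t * TILE + ty
        # Cooperative load; out-of-range positions are filled with zero.
        if row < m and a_col < k:
            tile_a[ty, tx] = a[row * k + a_col]
        else:
            tile_a[ty, tx] = Scalar[dtype](0)
        if b_row < k and col < n:
            tile_b[ty, tx] = b[b_row * n + col]
        else:
            tile_b[ty, tx] = Scalar[dtype](0)
        # Wait until the whole tile has been written.
        barrier()
        # Compute on the tile; [0] extracts a scalar from each element.
        for i in range(TILE):
            acc += tile_a[ty, i][0] * tile_b[i, tx][0]
        # Wait before the next iteration overwrites the tiles.
        barrier()

    if row < m and col < n:
        c[row * n + col] = acc
```

Each value loaded from global memory is reused TILE times from shared memory, which is where the reduction in memory traffic comes from.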
LayoutTensor gotcha: scalar vs SIMD element types

Depending on the layout/type, indexing a LayoutTensor can produce a SIMD-flavored element type. If you need an actual scalar Float32 value, extracting lane 0 is a simple fix:
val = smem[index][0]
If you ever see an error along the lines of “cannot convert LayoutTensor.element_type to Float32”, this is often the reason.
If you’ve already seen the exact test shapes from earlier submissions, you can often squeeze out more performance by specializing for those sizes, for example by making tile widths or loop bounds compile-time constants.
The key is to keep a general path too: specialize the frequent cases, but don’t paint yourself into a corner if a hidden test uses a different size.
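One way to keep both paths is a small dispatch function, sketched below on a row-sum helper. All function names and the size 1024 are hypothetical placeholders, not shapes taken from any actual test set.

```mojo
from memory import UnsafePointer

comptime dtype = DType.float32

# Specialized path: the row length `n` is a compile-time parameter, so the
# compiler sees a constant loop bound.
fn row_sum_fixed[n: Int](
    data: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    row: Int,
) -> Scalar[dtype]:
    var acc = Scalar[dtype](0)
    for col in range(n):
        acc += data[row * n + col]
    return acc

# General path: works for any row length known only at run time.
fn row_sum_general(
    data: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    row: Int,
    n: Int,
) -> Scalar[dtype]:
    var acc = Scalar[dtype](0)
    for col in range(n):
        acc += data[row * n + col]
    return acc

# Dispatch: use the specialized version for a frequently seen size and
# fall back to the general version for everything else.
fn row_sum(
    data: UnsafePointer[Scalar[dtype], MutExternalOrigin],
    row: Int,
    n: Int,
) -> Scalar[dtype]:
    if n == 1024:  # hypothetical frequently seen size
        return row_sum_fixed[1024](data, row)
    return row_sum_general(data, row, n)
```

Both paths return the same result for the specialized size, so a hidden test with a different shape still runs correctly, just without the tuned path.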
Two complete submissions you can use as references: