Mojo:AIKernel
Write Python's syntax, run at C's speed, with MLIR under the hood — the goal Chris Lattner set in 2023. Three years on, Mojo runs production inference at frontier AI labs, codegens one kernel to NVIDIA / AMD / Apple Silicon, and is the first credible CUDA challenger since OpenCL.
What is Mojo
Mojo is an AI systems language released in 2023 by Modular (Chris Lattner). The design goal is blunt: weld Python-grade syntax onto C/Rust-grade performance, with MLIR under the hood — the same AI compiler IR Lattner led at Google in 2017.
Syntactically Mojo aims to be a Python superset: def, indentation, import are nearly identical. "Superset" is the roadmap, not today's exact state — a few Python corner cases still don't run.
The compiler's IR is MLIR directly — not LLVM IR. Multi-level dialects, extensible, the same program lowers to CPU / GPU / TPU. This is the deepest structural gap between Mojo and every other "speed up Python" project.
Three parameter conventions (borrowed / inout / owned), no GC, structs default to value types. Rust's memory model without hand-written lifetime annotations: explicit argument conventions plus compiler inference do the work.
SIMD is a type, not an intrinsic; GPU codegen flows through MLIR → PTX (NVIDIA) / ROCm (AMD). You don't "call CUDA from Mojo" — what you write IS the kernel.
def matmul(a, b, c, n):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
# 1024×1024 matrix · minutes-class runtime
# interpreted + refcounted + everything boxed

fn matmul(inout c: Matrix,
          a: Matrix, b: Matrix):
    alias nelts = simdwidthof[DType.float32]()
    for i in range(c.rows):
        for k in range(a.cols):
            @parameter
            fn v[w: Int](j: Int):
                c.store[w](i, j,
                    c.load[w](i, j)
                    + a[i,k] * b.load[w](k, j))
            vectorize[v, nelts](c.cols)
# Modular's number: ~35,000× over interpreted Python

History : Timeline
Mojo didn't appear from nowhere in 2023 — it is Lattner's third language, on a line that runs from his 2003 LLVM thesis through LLVM, Clang, Swift (/code/swift), Tesla, Google Brain's MLIR, and finally lands at Modular.
- 2003
Chris Lattner submits the LLVM thesis
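His UIUC master's thesis is titled "LLVM: An Infrastructure for Multi-Stage Optimization" (dated December 2002); the title often quoted for it, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation", belongs to the follow-up CGO 2004 paper with Vikram Adve. The infra it laid down became the substrate for the next two decades of Lattner's work: Clang, Swift, MLIR, and finally Mojo. The origin of the whole line.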
- 2014·06
WWDC: Swift ships
The "modern Objective-C replacement" Lattner led inside Apple goes public (/code/swift). Swift becomes Apple's full-stack language. Lattner stays through 2017. It is his second language to reshape an industry; Mojo is the third.
- 2017·01
Lattner → Tesla → Google Brain
January: he joins Tesla as VP of Autopilot, then resigns six months later. Autumn: he lands at Google Brain on TPU compiler infrastructure — where MLIR is born. MLIR is the direct technical seed of Mojo: multi-level IR, extensible dialects, designed for heterogeneous hardware.
- 2022·01
Modular is founded
Lattner and ex-Google ML-compiler lead Tim Davis co-found Modular. The mission is bluntly simple: "AI shouldn't be locked into CUDA + Python." Early engineers come from Google, Apple and SiFive's compiler crowd.
- 2023·05·02
Mojo unveiled at Modular's keynote
Three-line pitch: Python syntax, C-class speed, MLIR for the IR. The first demo runs a naive Python matmul 35,000× faster on the same hardware — the number stuns the room. The AI world learns the name overnight.
- 2023·09
Mojo SDK 0.1 — first downloadable build
The 0.1 SDK ships in September, Linux only, behind a license waitlist. The language surface is still volatile, but you can finally run it locally. The early crowd is ML engineers plus compiler hobbyists.
- 2024·01
macOS support
Mojo lands on Apple Silicon. M-series chips are sweet-spot targets for LLVM/MLIR codegen; "runs on a Mac" instantly doubles the developer hardware base. The first wave of non-Linux issues hits the tracker.
- 2024·03·29
Standard library open-sourced (Apache 2.0)
March 29: Modular open-sources the stdlib on modularml/mojo. The compiler stays closed for now, the same playbook as early Swift: hand the library to the community first, decide on the toolchain later. ~100 outside PRs land within a week.
- 2024·08
Mojo 24.4 — ownership overhaul
The Rust-flavoured ownership story gets a thorough rework: borrowed / inout / owned become parameter conventions, explicitly annotated rather than inferred. The Reference[T] type lines up at the same time. The community senses the language is "starting to set."
- 2024·09
GPU kernels land — H100 / A100
The first NVIDIA H100 / A100 codegen ships, via the MLIR → PTX pipeline. For Mojo, this is the first real proof that GPU kernels can be written without CUDA C++. Triton (OpenAI) had already been doing it via the Python AST, but the route is different: Mojo is a standalone compiled language rather than a DSL embedded in Python.
- 2025·02
MAX 24.6 — production inference
Modular's MAX inference engine wires Mojo kernels into production at multiple AI startups (Replit and Together AI have said so publicly). The "research-stage language" label starts peeling off.
- 2025·09
AMD GPU support — via ROCm/MLIR
Mojo gains AMD GPU support through a ROCm/MLIR backend. It matters: this is the first credible CUDA alternative since OpenCL. One kernel source, two vendors — the first real crack in the CUDA monopoly.
- 2025·11
Approaching 1.0 — package manager + stable stdlib
The new magic package manager ships; the stdlib enters its "breaking changes need an RFC" phase; the doc generator lands. The "still being torn up" feel fades; 1.0 is in sight.
- 2026
Three years in — production inference at frontier labs
Mojo in 2026: frontier AI labs use MAX + Mojo for production inference; the "Python is slow" pitch holds — Modular's matmul / softmax numbers are reproducible; the ecosystem is still tiny next to PyTorch-native. Hardware-portable AI kernels are the real 2026 battleground, with Mojo / Triton / CUDA / JAX-XLA all competing.
Language Essentials : MojoAlphabet
The eight cards below are where Mojo differs hardest from the other 11 languages on this site: def vs fn, ownership annotations, @parameter, SIMD types, @value, struct vs class, alias, perf annotations. The ninth covers the Python-superset story today.
def vs fn
Mojo carries both Python-loose def and strict typed fn. One file can mix: prototype with def like Python, then switch the hot path to fn for compile-time checks.
def loose(x):
    return x * 2
fn strict(x: Int) -> Int:
    return x * 2

borrowed / inout / owned
Rust-flavoured ownership, but explicitly annotated: the default is borrowed (read-only reference), inout is a mutable borrow, owned transfers ownership. None of Rust's lifetime-annotation pain, but the semantics stay clear.
fn peek(borrowed s: String):
    print(s)
fn grow(inout s: String):
    s += "!"
fn eat(owned s: String): ...

@parameter — compile-time programming
Square-bracket parameters plus the @parameter decorator cover generics, conditional compilation, and loop unrolling. Runtime and compile-time code read the same; MLIR decides when to specialise.
fn repeat[n: Int]():
    @parameter
    for i in range(n):
        print(i)  # unrolled at compile time

SIMD[T, n] as a first-class type
SIMD is a type, not an intrinsic. SIMD[DType.float32, 8] is a parametric vector; arithmetic auto-parallelises — reads like scalar, runs as AVX/NEON.
var a = SIMD[DType.float32, 8](1.0)
var b = SIMD[DType.float32, 8](2.0)
var c = a * b + a  # 8-wide FMA

@value — auto-derived methods
Tag a struct with @value and the member-wise __init__, __copyinit__ and __moveinit__ are generated for you. Roughly the equivalent of Rust's derive(Copy, Clone): one line saves 30 of boilerplate.
@value
struct Point:
    var x: Float64
    var y: Float64
# __init__ / __copyinit__ / __moveinit__ all derived

struct vs class
Mojo prefers struct (value types): stack-allocated, ownership-aware. class (reference type, GC-flavoured) is deferred for now — a deliberate trade: nail numeric / systems code first.
struct Vec3:
    var x: Float32
    var y: Float32
    var z: Float32
# class { ... }  # not stable yet, on roadmap

alias — compile-time constants
One keyword for every compile-time bound value. The three things C++ splits across #define, const and constexpr are all just alias here.
alias WIDTH: Int = 8
alias F32x8 = SIMD[DType.float32, WIDTH]
var v: F32x8 = 0

@always_inline and friends
Mojo doesn't gamble on LLVM heuristics; it gives programmers explicit knobs: @always_inline, @noinline, @register_passable. Performance becomes predictable — no more "regressed when the compiler upgraded".
@always_inline
fn dot(a: F32x8, b: F32x8) -> Float32:
    return (a * b).reduce_add()

Python superset — how far it actually goes today
"Python superset" is Modular's public slogan, but in 2026 the reality is: most Python runs, corners don't. Working: def, list/dict literals, indentation, import for third-party packages (over the GIL). Not working: full metaclass machinery, exec / eval dynamic bytecode, parts of the dunder protocol. "Use Mojo as Python" largely works — just don't expect 100%.
"Python compatibility is a roadmap, not a finished checklist — Modular itself is explicit about this."
Why Mojo : WhyMojo
Mojo isn't out to replace Python for scripting, or Rust for OS work. It targets the gap nobody filled in the last 15 years: AI kernels that are both fast and portable, without forcing ML engineers to drop down into CUDA C++.
No more pybind11
Speeding up Python used to mean C/C++ extensions, pybind11 wrangling, and GIL accounting — three languages and three build systems. Mojo just imports any Python package and lets you write fn hot paths in the same file. The FFI ceremony is gone.
from python import Python
var np = Python.import_module("numpy")
var arr = np.array([1,2,3])

MLIR-native, not bolted on
Most languages bolt "AI acceleration" onto the toolchain (TorchScript, JAX trace, TF graph). Mojo inverts that — MLIR is the IR itself. Multi-level dialects, extensible, the same program lowers to CPU / GPU / TPU back ends without language-level changes.
# Mojo source → MLIR → LLVM IR → CPU
#                    → PTX     → NVIDIA
#                    → ROCm    → AMD

Hardware-portable — one kernel, many chips
The same Mojo kernel codegens to CPU, NVIDIA, AMD GPU, and Apple Silicon. The CUDA-era world of "one kernel, one vendor" is loosening. This is the biggest contested ground in 2026's AI infrastructure.
# mojo build matmul.mojo --target=cuda
# mojo build matmul.mojo --target=rocm
# mojo build matmul.mojo --target=cpu

Lattner's track record
2003 LLVM → 2007 Clang → 2014 Swift → 2017 MLIR → 2023 Mojo. Every project Lattner has shipped became industry infrastructure. Whether Mojo joins them, time will tell — but the resume is the strongest signal developers bet on.
# LLVM  · 2003 — every modern compiler
# Clang · 2007 — C/C++/ObjC frontend
# Swift · 2014 — Apple full stack
# MLIR  · 2017 — AI compiler IR
# Mojo  · 2023 — ?

The first credible crack in CUDA's monopoly
For 15 years, GPU programming meant CUDA C++. Mojo plus its ROCm back end is the first credible CUDA challenger since OpenCL: open kernels, open IR, open competition. Not "kill CUDA," but "finally give an alternative a real path".
# single kernel source · NVIDIA + AMD
# open IR · open stdlib · Apache 2.0

Who's Using : ProductionUsers
Mojo is young; the list is far shorter than Python's — but every entry is a real user, none invented. Modular's own MAX is the biggest; Replit and Together AI are publicly named AI platforms; the rest are robotics / quant-finance / drug-discovery shops Modular has called out on its blog.
The AI Era : Built For AI
This is the heart of the page: Mojo is one of the very few languages designed for the AI era from day one, rather than retrofitted. PyTorch / vLLM / TensorRT-LLM are upper-layer frameworks; Mojo stands beside them at the kernel layer, not as a replacement.
"For fifteen years the AI stack has been pinned to a single thread: Python calling CUDA C++. Algorithm engineers write Python, performance engineers write CUDA, with a wall between them. We're not building Mojo to make Python faster — we want the same person, in the same language, to write both the algorithm and the kernel.
Modular's launch-demo number: a 1024×1024 float32 matmul, written in Mojo with SIMD + tiling + parallelize and compared against three nested Python for loops on the same Intel Xeon, comes out at ≈ 35,000×. The bounds are narrow and worth stating: one kernel, same hardware, interpreted-Python baseline.
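For a sense of scale, the arithmetic behind that benchmark can be sketched in plain Python: an n×n matmul costs about 2·n³ floating-point operations, and the baseline is the triple loop. The helper names (matmul_flops, matmul_naive) are illustrative, the timing depends entirely on your machine, and nothing here reproduces Modular's measurement.

```python
import time

def matmul_flops(n):
    """One multiply and one add per inner-loop step: 2 * n^3 operations."""
    return 2 * n ** 3

def matmul_naive(a, b, n):
    """Triple-loop baseline over lists of lists, as in interpreted Python."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

n = 32  # small on purpose; the page calls the 1024 case minutes-class
a = [[float(i + j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

start = time.perf_counter()
c = matmul_naive(a, identity, n)   # multiplying by the identity returns a
elapsed = time.perf_counter() - start

print(f"n={n}: {matmul_flops(n):,} flops in {elapsed:.4f} s")
print(f"n=1024 would need {matmul_flops(1024):,} flops")
```

Going from n = 32 to n = 1024 multiplies the operation count by 32³ = 32,768, which is why the interpreted baseline falls off so quickly and why the choice of baseline dominates any headline ratio.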
A more realistic comparison: Modular's fused softmax on H100 is roughly 7× faster than PyTorch eager and on par with Triton (OpenAI). Numbers shift across kernels and hardware, but the conclusion "on par with hand-written CUDA" is stable.
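What a fused softmax computes can be sketched without a GPU. The plain-Python reference below works row by row in the numerically stable form (subtract the row maximum first) and folds the exponentiation and the running sum into a single pass, which is the same shape of work a fused kernel does per row. It is a reference for the arithmetic only, with illustrative names, and says nothing about the measured speedups.

```python
import math

def softmax_rows(x):
    """Row-wise softmax over a list of lists, returning a new matrix."""
    out = []
    for row in x:
        m = max(row)                 # stabiliser: exp never sees a value > 0
        total = 0.0
        exps = []
        for v in row:                # single pass: exponentiate and accumulate
            e = math.exp(v - m)
            exps.append(e)
            total += e
        out.append([e / total for e in exps])  # scale the row by 1 / sum
    return out

# The second row would overflow math.exp without the max subtraction.
probs = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
```

An unfused version would walk each row separately for the exponentials, the sum and the scaling, materialising intermediates in between; fusing keeps the row's data in fast memory across those steps.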
One .mojo file targets NVIDIA (PTX), AMD (ROCm), Apple Silicon. The first credible path since OpenCL. CUDA's monopoly isn't broken — it's cracked for the first time, and what grows from here is worth tracking.
SIMD + GPU
What a Mojo kernel looks like: SIMD is a type, GPU is a back end. You write SIMD[DType.float32, 8]; the compiler, given a target, lowers it to AVX-512 / NEON / PTX / ROCm. One source, no abstraction tax, no performance lost.
- SIMD type — parametric vector, arithmetic auto-parallel
- GPU codegen — MLIR → PTX (NVIDIA) / ROCm (AMD)
- Apple Silicon — M-series NEON + Metal compute
- No CUDA C++ — No third language, no second build system
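A toy model may help with the first bullet. The pure-Python class below (Lanes is an invented name, not a Mojo or Python library type) stores a fixed number of values and applies + and * lane by lane, so an expression reads like scalar arithmetic while acting on every lane. It models the semantics only; a real SIMD type maps to hardware vector registers, and this sketch does not.

```python
class Lanes:
    """Fixed-width vector with lane-wise arithmetic (semantics only, not fast)."""

    def __init__(self, values):
        self.values = tuple(float(v) for v in values)

    @classmethod
    def splat(cls, value, width):
        """Broadcast one scalar to every lane."""
        return cls([value] * width)

    def _zip(self, other, op):
        if len(self.values) != len(other.values):
            raise ValueError("lane counts differ")
        return Lanes(op(a, b) for a, b in zip(self.values, other.values))

    def __add__(self, other):
        return self._zip(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._zip(other, lambda a, b: a * b)

    def reduce_add(self):
        """Horizontal sum across lanes."""
        return sum(self.values)


a = Lanes.splat(1.0, 8)
b = Lanes.splat(2.0, 8)
c = a * b + a            # lane-wise: 1*2 + 1 in each of the 8 lanes
total = c.reduce_add()   # collapses the 8 lanes to one scalar
```

Making the width part of the value is the point of the design: mixing an 8-lane and a 4-lane operand is rejected here at run time, whereas a parametric type can reject it at compile time.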
Compared with Triton (OpenAI): Triton uses a Python AST + JIT, GPU-only; Mojo is a standalone language covering CPU / GPU / edge. The two coexist; they don't substitute.
# one kernel · NVIDIA / AMD / CPU
# sketch: row_max / scale_row are helper names assumed by this snippet
from math import exp
from sys.info import simdwidthof
from tensor import Tensor
from algorithm import vectorize, parallelize

fn softmax(inout x: Tensor[DType.float32]):
    alias nelts = simdwidthof[DType.float32]()
    @parameter
    fn row(i: Int):
        var m = x.row_max(i)  # row maximum, subtracted for numerical stability
        var s: Float32 = 0
        @parameter
        fn v[w: Int](j: Int):
            var e = exp(x.load[w](i,j) - m)
            x.store[w](i, j, e)
            s += e.reduce_add()
        vectorize[v, nelts](x.cols)
        scale_row(x, i, 1/s)  # normalise the row by the accumulated sum
    parallelize[row](x.rows)
# build: mojo build softmax.mojo --target=cuda
#        mojo build softmax.mojo --target=rocm

2026 toolchain / backends / surroundings
Counter-intuitive: AIs write Mojo worse than older languages
An interesting paradox: Mojo is an AI-era language, yet LLMs write it less reliably than Python / Java / Rust. The cause is simple — less training data. Public Mojo on GitHub in 2026 is still a few thousand repos, four orders of magnitude below Java.
Actual workflow: "AI writes the Python prototype → human translates to a Mojo kernel" is still dominant. Modular ships its own "AI-assisted Mojo authoring" tooling — feeding the model the stdlib + docs — but in 2026 we're still far from "let the AI write kernels for you".
Ironic but expected: every new language pays a "training-data cold-start tax". Rust paid it early; Zig still pays it; Mojo can't dodge it either. Every early PR, blog post and public kernel is teaching the models this language.
# Status today: AI writes Python prototype → human ports to Mojo
# Python (AI-friendly)
def attention(q, k, v):
    scores = q @ k.T / sqrt(dim)
    return softmax(scores) @ v
# Mojo (human-translated · kernel-class speed)
fn attention(borrowed q: Tensor,
             borrowed k: Tensor,
             borrowed v: Tensor) -> Tensor:
    # SIMD-packed matmul + fused softmax
    # vectorize / parallelize / tile
    ...

In one line: Mojo isn't "Python glue" and isn't a "CUDA replacement" slogan; it's the first real language that puts the algorithm and the GPU kernel in one place. In 2026 it's young, the ecosystem is small, and the AIs still struggle with it, but the architecture is right: MLIR + Lattner's track record + real production users.
vs Python / Swift : Mojo vs Python vs Swift
Versus Python (/code/python): Mojo is Python's acceleration off-ramp, not a replacement. Versus Swift (/code/swift): same designer (Chris Lattner), but Swift targets app developers and Mojo targets ML compiler engineers. One person, two completely different audiences.
| | Python | Mojo | Swift |
|---|---|---|---|
| Origin | Guido · 1991 | Modular · 2023 | Apple · 2014 |
| Designer | Guido van Rossum | Chris Lattner | Chris Lattner |
| Primary audience | Scripts · data · AI algorithms | AI kernels · compiler engineers | iOS/macOS app developers |
| Syntax | Python itself | Python superset (in progress) | Own syntax · ML-flavoured |
| Performance | Interpreted (CPython) | C / Rust class · MLIR codegen | C-class · LLVM |
| Memory model | GC + refcount | borrowed / inout / owned | ARC + value types |
| GPU | CUDA C++ via PyTorch | Native · NVIDIA + AMD + Apple | Metal · Apple GPU only |
| SIMD | NumPy abstraction, outside the language | SIMD[T, n] first-class type | SIMD[N] · stdlib |
| Compile-time programming | None (dynamic language) | @parameter · shared with generics | Yes (associated types / macros) |
| Interop | Everything · pip ecosystem | Native Python import · across GIL | C direct · ObjC bridge |
| Ecosystem maturity | 35 years · the largest of all | 3 years · early (~10³ public repos) | 11 years · Apple-saturated |
| Open source | Fully · PSF | stdlib yes · compiler closed (open before 1.0) | Fully · Apache 2.0 |
Outlook : TheRoadAhead
Mojo in 2026 sits on the eve of 1.0 — open-sourcing the compiler is the final big gate. Deeper NumPy / PyTorch ABI interop is on the way; Apple's own MLX is a same-niche competitor. Whether Mojo escapes AI to become a general systems language is an open question.
Open-sourcing the compiler — the last gate on the way to 1.0
The stdlib opened in 2024; the compiler is still closed. Community pressure to fork, embed and audit codegen has been steady. Modular has publicly committed to "open it before 1.0" — the cadence mirrors early Swift's exactly.
What it unlocks: this is the final gate between "an interesting Modular product" and "a real industrial language." Only after open-source do you get third-party compilers, teaching distros, and a cross-vendor RFC process — the Rust pattern.
Deeper NumPy / PyTorch ABI
Today's Python interop crosses the GIL via Python.import_module; data has to round-trip. The next step is shared buffers / zero-copy tensors — operating on a PyTorch Tensor as a Mojo struct without moving the bytes. The path from "fast but isolated" to "fast and seamless."
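The difference between a round trip and a shared buffer can be illustrated with Python's own buffer protocol. In the sketch below, copying an array stands in for a copying hand-off and memoryview for a zero-copy one. This is an analogy using only the standard library, not a description of how Mojo's interop is or will be implemented.

```python
from array import array

data = array("f", [1.0, 2.0, 3.0, 4.0])   # contiguous C-float storage

copied = array("f", data)                  # round trip: a second allocation
shared = memoryview(data)                  # zero-copy: a view onto the same bytes

data[0] = 99.0                             # mutate the original buffer

seen_by_copy = copied[0]                   # the copy does not see the write
seen_by_view = shared[0]                   # the view does
same_size = shared.nbytes == len(data) * data.itemsize

shared.release()                           # drop the view so data can be resized
print(seen_by_copy, seen_by_view, same_size)
```

The copy costs memory and time proportional to the tensor size on every crossing; the view costs neither, but both sides must then agree on layout and lifetime, which is the hard part of any zero-copy ABI.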
Apple MLX — direct competition
Apple shipped MLX in 2023: NumPy-style, Apple-Silicon-tuned, LLVM/MLIR-flavoured. On macOS, Mojo competes with Apple's first-party stack. The very line Lattner walked away from, Apple has now picked up — he sees this competitor more clearly than anyone.
A general-purpose systems language?
Mojo in 2026 is positioned for AI, but the language itself is general — struct + ownership + SIMD + MLIR has no "AI-only" baked in. Can it leave the AI niche and challenge Rust for general systems work? Depends on where the post-1.0 community pushes it. An open question, not a roadmap item.