Roblox Calls It 'Hybrid Architecture.' We Just Call It Conch Engine.
Roblox published a detailed technical paper on their new Roblox Reality system in April 2026.
Two Problems, One Engine
Roblox and Conch are solving completely different problems. Roblox is trying to render photorealistic multiplayer worlds without requiring every creator to build AAA-level art assets. Conch is trying to run coherent, persistent audio/text-based RPG adventures with AI narration that does not contradict itself five minutes in.
The problems are different. The architecture is the same.
In their April 2026 paper, Roblox describes what they call a "hybrid architecture": a traditional game engine manages all world state, rules, and multiplayer consistency, while a Video World Model — a large generative neural network — handles the rendering of rich, photorealistic visuals on top. The engine is authoritative. The generative model is expressive.
That sentence could have been written about Conch on day one.
What Roblox Discovered About Generative Models
The Roblox paper is unusually candid about what generative models cannot do. It is worth quoting directly.
Video World Models, Roblox writes, excel at generating high-dimensional visuals — textures, lighting, secondary motion, environmental details — things that are expensive to simulate explicitly. But they struggle with:
- True multiplayer simulation and persistent state
- Long-term memory across sessions
- Rule enforcement and player input control
- Consistent logic over extended gameplay
Game engines, by contrast, excel at exactly those things — precise simulation, state persistence, rule systems, multiplayer synchronization — while struggling with the cost of photorealistic rendering.
So Roblox split the job: engine handles truth, generative model handles expression.
This Is Conch's Architecture, Word for Word
We have described this split before, in our post on why we built a game engine instead of a chatbot. But Roblox's paper sharpens the argument in a way that is worth spelling out directly.
In a Conch adventure, the deterministic game engine — backed by a real database — tracks everything that matters:
- Where you are (scene graph)
- What you are carrying (inventory)
- Who else is present and what they have (NPC state)
- What is locked, what is open, what has been solved (gating logic)
- Combat stats, health, equipment (mechanics)
- Win conditions and story progression
The LLM never decides any of this. It is not allowed to. When you pick up an item, the engine registers that event. When you move to a new room, the engine updates your location. When you fight a creature, the engine resolves the outcome with actual mechanics.
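As a minimal sketch of this split (the class and field names here are illustrative, not Conch's actual API), the engine-authoritative state might look like:

```python
from dataclasses import dataclass, field

@dataclass
class PlayerState:
    location: str                                # node in the scene graph
    inventory: list = field(default_factory=list)
    health: int = 10

@dataclass
class Engine:
    """Authoritative game state. The LLM never writes to this."""
    player: PlayerState
    locks: dict = field(default_factory=dict)    # door -> required item

    def pick_up(self, item: str) -> None:
        # The engine, not the model, registers the event.
        self.player.inventory.append(item)

    def try_open(self, door: str) -> bool:
        # Gating resolves against real state, not dramatically plausible prose.
        needed = self.locks.get(door)
        return needed is None or needed in self.player.inventory

engine = Engine(PlayerState(location="crypt"), locks={"iron door": "bone key"})
engine.pick_up("bone key")
print(engine.try_open("iron door"))  # True: the player actually holds the key
```

The point of the sketch is the direction of authority: every mutation goes through an engine method, and the narrator only ever reads the result.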
What the LLM does is narrate. Given a precise, structured picture of the current game state — here is where you are, here is what you are carrying, here is what just happened — it generates the prose that makes the moment feel alive. It does not need to remember your inventory, because the engine provides it. It does not need to track NPC locations, because the engine knows. It does not need to maintain a mental map, because the scene graph exists.
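That handoff can be sketched as a function that serializes the engine's snapshot into structured context for the narrator (the field names and format are hypothetical, not Conch's actual schema):

```python
def narration_context(state: dict, event: dict) -> str:
    """Render the engine's authoritative snapshot as structured context
    for the LLM. The model narrates this state; it never invents it."""
    lines = [
        f"LOCATION: {state['location']}",
        f"INVENTORY: {', '.join(state['inventory']) or 'empty'}",
        f"NPCS PRESENT: {', '.join(state['npcs']) or 'none'}",
        f"EVENT: {event['type']} -> {event['detail']}",
    ]
    return "\n".join(lines)

ctx = narration_context(
    {"location": "flooded crypt", "inventory": ["bone key"], "npcs": ["ferryman"]},
    {"type": "move", "detail": "player entered from the stairwell"},
)
# ctx is prepended to the narration prompt; the LLM adds voice, not facts.
```

Because the snapshot is rebuilt from the database every turn, the model's context window never has to carry state forward on its own.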
The result, as Roblox's paper puts it, is a system where the engine provides "the shared and consistent state" and the generative model adds the richness that would be prohibitively expensive to produce any other way.
For Roblox, that richness is photorealism. For Conch, it is narrative voice. The structural logic is identical.
Why the Generative Layer Cannot Run State
Roblox is explicit about why you cannot let the generative model be authoritative, and the reasoning applies just as cleanly to language models as it does to video models.
A generative model produces plausible output given recent context. It does not maintain a model of the world. It does not enforce rules. It does not remember what it decided three turns ago, or carry consistent facts across a long session. When you ask it to hold state, it approximates — and approximation is where coherence dies.
Every AI adventure platform built as a chatbot runs into this eventually. The sword the player picked up an hour ago is gone because the model's context window moved on. The NPC the player befriended now speaks as if they have never met. The "puzzle" resolves based on whatever the model thinks sounds dramatically satisfying, not based on whether the player actually holds the required item.
None of this happens in Conch, not because our LLM is smarter, but because we do not ask it to hold state. The engine holds state. The LLM narrates it.
What Roblox's Paper Adds to How We Think About This
The paper validates the architecture. But it also points toward a few directions worth developing further.
Richer signals from engine to generative layer. Roblox's engine feeds the video model more than raw geometry — it passes normals, occlusion data, and surface properties that guide how the generative layer renders. The equivalent in Conch would be the state machine emitting explicit narrative signals: tension level, relationship delta, scene mood, narrative beat. The LLM already receives game state, but structured mood signals would let it calibrate atmosphere the same way surface properties calibrate a renderer.
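A hypothetical shape for that mood channel, analogous to the normals and occlusion data Roblox's engine feeds its video model (every field here is an assumption, not an existing Conch feature):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NarrativeSignals:
    """Hypothetical per-turn signals from engine to narrator."""
    tension: float           # 0.0 calm .. 1.0 climactic
    scene_mood: str          # e.g. "oppressive", "wistful"
    relationship_delta: int  # change in NPC disposition this turn
    beat: str                # e.g. "setup", "reversal", "payoff"

def signals_to_prompt(s: NarrativeSignals) -> str:
    # Serialized alongside game state so the model can calibrate atmosphere.
    return (f"tension={s.tension:.1f}; mood={s.scene_mood}; "
            f"relationship_delta={s.relationship_delta:+d}; beat={s.beat}")

print(signals_to_prompt(NarrativeSignals(0.8, "oppressive", -2, "reversal")))
# tension=0.8; mood=oppressive; relationship_delta=-2; beat=reversal
```

The frozen dataclass mirrors the authority rule above: the engine computes these values from state transitions, and the narrator consumes them read-only.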
Model routing by event complexity. Roblox does local avatar rendering for low-latency cases, reserving edge GPU capacity for expensive global renders. In Conch, simple, predictable events — moving through a visited room, picking up a common item — could route to a smaller model with a lightweight prompt. High-stakes moments — first encounters, combat resolution, revelation scenes — earn the full model and a richer context. Latency goes down. Quality goes where it matters.
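A router like that could be as simple as the following sketch (the event taxonomy and model names are invented for illustration):

```python
# Hypothetical router: cheap, predictable events go to a small model,
# high-stakes events to the full model with a richer context.
HIGH_STAKES = {"first_encounter", "combat_resolution", "revelation"}

def pick_model(event_type: str, scene_visited: bool) -> str:
    if event_type in HIGH_STAKES:
        return "full-model"    # earns the full context and latency budget
    if event_type == "move" and scene_visited:
        return "small-model"   # familiar room: lightweight prompt suffices
    if event_type == "pickup":
        return "small-model"   # common item: predictable narration
    return "full-model"        # default to quality when unsure

print(pick_model("move", scene_visited=True))  # small-model
```

The engine is what makes this routing possible at all: because it knows the event type and whether the scene has been visited, complexity can be judged before any model is called.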
Pre-generation for predictable branches. Roblox pre-renders frames where engine state is predictable. The scene graph gives Conch the same opportunity: while a player reads the current response, the engine already knows which exits exist and which NPCs are present. Background generation of likely next-turn narratives would make the experience feel instant — not because the model is faster, but because it already did the work.
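A sketch of that speculative pre-generation, with a stub standing in for the real model call (function and key names are illustrative assumptions):

```python
# While the player reads the current response, narrate likely next turns
# in the background. generate() is a placeholder for an LLM call.
from concurrent.futures import ThreadPoolExecutor

def generate(state_key: str) -> str:
    return f"[narration for {state_key}]"  # stub: real call hits the model

def pregenerate(current_scene: str, exits: list[str]) -> dict:
    # The scene graph already tells the engine which moves are legal,
    # so each candidate next state can be narrated ahead of time.
    with ThreadPoolExecutor() as pool:
        futures = {e: pool.submit(generate, f"{current_scene}->{e}")
                   for e in exits}
        return {e: f.result() for e, f in futures.items()}

cache = pregenerate("crypt", ["stairwell", "ossuary"])
print(cache["ossuary"])  # [narration for crypt->ossuary]
```

On the player's next move, a cache hit returns instantly and a miss falls back to a live call; either way correctness is unaffected, because the engine still resolves the move itself.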
What This Means for AI Games
There is a broader signal here worth paying attention to.
When one of the largest gaming companies in the world publishes a technical paper independently arriving at the same fundamental architecture as a startup in this space, it is not a coincidence. It is a convergence. Two very different teams, looking at two very different problems, reached the same structural conclusion: generative models are extraordinary at producing high-dimensional richness, and terrible at running consistent state. So you build a real engine, and you put the generative model on top.
The chatbot-as-game approach was always going to hit a ceiling. The novelty of "the AI goes along with anything" wears off quickly once players realize the world does not remember what happened. The future of AI-powered games is not smarter chatbots. It is real game engines with AI as the creative layer.
That is what Roblox is building for photorealistic worlds. It is what Conch has been running for narrative adventures. The medium is different. The insight is the same.
If you want to experience what this architecture feels like in practice — a world that holds together, state that persists, narrative that stays coherent — play an adventure. The engine is invisible when it is working. That is the point.