May 2026

How VR Works

A walk down the stack of a virtual reality headset - from the photoreceptors at the back of your eye to the runtime that owns the coordinate frame of your living room - and the strange, beautiful tricks every layer is playing on you.

I think VR is one of the most quietly miraculous pieces of technology we've built. Not because the experience is always good - much of it is still ugly, the headsets are heavy, the content libraries are uneven. But because of what it has to do to work at all. A VR headset has to lie to your visual system convincingly enough that your brain commits to the lie, while also moving fast enough that the lie never breaks, while also being light enough that you can wear it on your face, while also being cheap enough that you can buy it. The fact that any of this is possible at all turns out to depend on a long chain of clever, slightly desperate engineering compromises.

And the most interesting thing is that every one of those compromises is shaped by something further up the stack - and the thing at the top of the stack isn't software, isn't hardware, isn't optics. It's the way your eye actually works. So that's where we'll start.

This article is long. It has four parts: perception, optics and display, tracking and rendering, and the runtime layer. There's some math, none of it scary; you can skim the equations and lose nothing important. The interactive demos are where most of the intuition lives. Play with them.

Part 1 - Perception

1.1 The eye is a sensor with weird specs

We talk about the eye as a camera, and the analogy gets you surprisingly far. Light enters through an aperture (the pupil), passes through a deformable lens, and lands on a curved sensor (the retina), where photoreceptors convert photons into electrical signals. Same problem as a camera, same general architecture.

But the spec sheet is bizarre. Start with the photoreceptors themselves. The retina contains two main types: cones, of which you have about 6 million, and rods, of which you have about 120 million. Cones come in three subtypes - short, medium, and long-wavelength sensitive (the "S/M/L" cones, often loosely called blue, green, and red) - and they handle color and fine detail. Rods are colorblind, much more sensitive to dim light, and tuned for motion.

The two are not uniformly mixed. Cones are crammed into a tiny patch in the center, the fovea, perhaps a millimeter and a half across, with a cone density that drops off a cliff outside it. Rods are everywhere except the fovea, peaking around 18 degrees of eccentricity. The result is that your color and detail vision is concentrated in a thumbnail-sized region in the center of your gaze, and your peripheral vision is essentially a colorblind, high-sensitivity, motion-detecting wide-field sensor. Two different cameras, sharing one optic.

The retina also outputs processed signals, not raw pixels. Retinal ganglion cells perform an early stage of opponent processing - encoding red-vs-green and blue-vs-yellow contrasts rather than absolute red, green, blue values - before the signal even reaches the optic nerve. By the time the visual cortex receives anything, edge detection, motion detection, and color encoding are already partially done. The eye is not a passive sensor; it's a cooperating front-end of the visual system.

The fovea covers about 1-2 degrees of your visual field. By 10 degrees out, your visual acuity has dropped by an order of magnitude. By 30 degrees, you're essentially blind to fine detail. Your peripheral vision is a low-resolution motion detector wearing the costume of normal sight.

You don't notice this. The reason you don't notice this is that your eyes are moving, constantly, in tiny rapid jumps called saccades - three or four per second, every waking minute. Each saccade aims the fovea at a new point of interest. The brain stitches the resulting sequence of high-detail snapshots into the illusion that you're seeing everything in front of you in detail at once. You aren't. You're seeing a mosaic, and the seams have been hidden from you so thoroughly you've never even thought to look for them.

Saccades aren't the only kind of eye movement either. There are at least four:

  • Saccades: the fast jumps we just described, 3-4 Hz, ballistic, lasting 30-80ms each. During the movement itself, your visual system briefly suppresses input - a trick called saccadic suppression - so you don't perceive the smear.
  • Smooth pursuit: when tracking a moving object, your eyes lock to it and move continuously, no jumps. You can't do this voluntarily without a moving target; try and you'll fall back into saccades.
  • Vergence: both eyes rotating inward or outward together to focus on objects at different depths.
  • Vestibulo-ocular reflex (VOR): when your head moves, your eyes counter-rotate at the same speed in the opposite direction, holding the world stable on the retina. This is wired through the inner ear and completes in 7-10ms - much faster than any conscious movement. It's also the most important reflex for VR: a system that violates the VOR, even slightly, makes users sick fast.

The visual system also adapts in time. When you walk into a dark room, your pupils widen within seconds, but full dark adaptation - your photopigments regenerating to peak sensitivity - takes 20 to 30 minutes. Going the other direction is faster: a dark-adapted eye adjusts to bright light in seconds to a few minutes. Pupil diameter also affects the eye's optical performance: a wider pupil gathers more light but introduces more aberration; a narrow pupil is sharper but dimmer.

Below: a side-cutaway view of an eye, with the retina colored by photoreceptor density. Light from a distant source enters from the right, refracts through the lens, and converges somewhere behind it. Squeeze the lens by changing its curvature - only one value lands the convergence point exactly on the retina. Toggle on a near light source to watch the same lens fail to focus both at once. Your eye doesn't have this problem in practice because it accommodates to whatever you're attending to; objects outside that depth blur naturally. That blur is a depth cue your visual system uses, called the retinal blur cue.

[Interactive demo: orbitable side-cutaway of the eye, retina colored by photoreceptor density, with a lens-curvature control (shown at f ≈ 1.60) and a near-source toggle.]

Squeeze or relax the lens. With both sources on, you'll find that the curvature that focuses the far source blurs the near one, and vice versa. Your eye accommodates to whichever object you're attending to — depth of field is built in, and it's the reason the world is always sharp where you look and softer everywhere else.

A few things to notice. The bright peak in the center-back of the retina is the fovea. The small dark gap on one side is the blind spot, where the optic nerve exits the eye - there are no photoreceptors there at all. You have a literal hole in your visual field, and your brain fills it in with a confident guess so seamless that almost no one ever notices it without being shown a parlor trick to expose it.

One last thing about the eye, which becomes a problem for VR design: the lens stiffens with age. Children can accommodate over a range of about 14 diopters; by around 50, you're down to 1-2. This is presbyopia, and it's why most adults eventually need reading glasses. For VR, presbyopia means that older users may struggle to focus on the panel even when the optics are nominally set for their distance - the eye can no longer make the small accommodation adjustments that compensate for individual differences. Some headsets ship with prescription-lens inserts to address this; most assume your eyes can still do the accommodation work the optics are asking of them.

One more demonstration. Below is a simulated scene with the "fovea" jumping around like real eyes do. With peripheral blur on, you can see what the visual system is actually receiving at any single instant. Without it, you see the stitched illusion your brain constructs.

The fovea jumps about 3 times per second. The line traces where it's been.

With peripheral blur on, you can see what the visual system is actually receiving at any instant. With it off, your brain's composited illusion is restored. The illusion is so good you usually forget the underlying input ever looked like the first version.

Hold on to all of this, because we'll keep cashing in on it. Your visual system lies to you constantly, and the lies are the product. VR doesn't have to construct ground truth. It has to construct lies your visual system finds plausible. That's a much easier problem, and it's the only reason any of this works at all.

1.2 Depth is a calculation, not a property

Here's something worth sitting with: a 2D image of a 3D scene does not contain depth. There's no depth channel arriving at the retina. What arrives is a flat array of brightness and color. Depth is something the brain computes, by combining a long list of unreliable cues, none of which would be enough on their own.

Cutting and Vishton (1995) ranked the depth cues by relative strength at different distances, dividing space into three zones:

  • Personal space (within ~2 meters). Stereopsis, accommodation, vergence, motion parallax, and occlusion are all strong. This is the zone where you reach, manipulate, eat. Stereo dominates.
  • Action space (~2-30 meters). Stereopsis weakens dramatically - disparity becomes too small to measure reliably. Motion parallax, relative size, and occlusion do the heavy lifting.
  • Vista space (beyond 30 meters). Stereopsis is dead. Atmospheric haze, relative size, height in field of view, and known-object size are what's left.

The cues themselves, in roughly decreasing strength at personal-space distances:

  • Occlusion: things in front cover things behind. The strongest cue at any distance, but it only gives you ordinal information (this is in front of that), not metric depth.
  • Relative size: known-similar objects appear smaller when farther. Requires familiarity.
  • Motion parallax: as your head moves, near things shift more than far things across your visual field. The amount they shift encodes depth quantitatively.
  • Stereopsis: your two eyes get slightly different views, and the disparity encodes depth.
  • Vergence: how far your eyes have to rotate inward to fuse the two images. The brain uses the convergence angle as a depth signal.
  • Accommodation: how much your lens has to deform to bring the object into sharp focus. Weaker than the others, but real.
  • Texture gradients: regular textures (a tile floor) compress in apparent size with depth.
  • Aerial perspective: distant things appear hazier, bluer, lower-contrast.
  • Shading and shadow: shape-from-shading is surprisingly powerful; cast shadows ground objects in 3D space.
  • Kinetic depth effect: a rigid object rotating reveals its 3D structure even from a 2D silhouette.

VR can reproduce most of these for free. Occlusion, relative size, motion parallax, lighting, texture gradients - those just fall out of rendering a scene from two eye positions and updating with head motion. Stereopsis comes from rendering one frame per eye. Vergence comes along for the ride: if the stereo geometry says an object is fifty centimeters away, your eyes will rotate inward by exactly the right amount to fuse the two images into a single percept.

The cue that VR cannot reproduce honestly is accommodation. We'll get to that in section 1.3. First, stereopsis itself.

How does stereopsis actually work? The geometry is simpler than people make it out to be. Imagine looking at a point with both eyes. Each eye has a forward direction; let's measure the angle from that forward direction to the point you're looking at. Call those angles αL for the left eye and αR for the right eye. The difference, αL − αR, is the binocular disparity. For an object directly in front of you, both eyes turn inward and the angles are equal in magnitude but opposite in sign, so the difference works out to the sum of their absolute values.

And that disparity, given a known interpupillary distance, uniquely determines the depth:

Z ≈ IPD / (2 · tan(δ/2))

where Z is the distance to the object, IPD is the interpupillary distance, and δ is the disparity. For small disparities (objects beyond about 1 meter), this approximates to Z ≈ IPD/δ, with δ in radians. Disparity falls as the inverse of distance: an object at 50cm with a 6cm IPD produces 6.9° of disparity, while at 5m it's only 0.69°, and at 50m it's 0.069° - about four arcminutes. At that range the disparity differences between neighboring objects are a small fraction of that figure, down near the limit of what your visual system can reliably measure. That matches the vista-space boundary above: somewhere past a few tens of meters, stereo is effectively dead.
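To make those numbers concrete, here is a minimal sketch of the formula above in Python. The function names are illustrative, not from any library, and the sketch assumes the object sits straight ahead on the midline between the eyes.

```python
import math

def disparity_deg(distance_m: float, ipd_m: float = 0.06) -> float:
    """Angle between the two lines of sight for a point straight ahead, in degrees."""
    return math.degrees(2.0 * math.atan((ipd_m / 2.0) / distance_m))

def depth_from_disparity(delta_deg: float, ipd_m: float = 0.06) -> float:
    """Invert the relation: Z = IPD / (2 * tan(delta / 2))."""
    return ipd_m / (2.0 * math.tan(math.radians(delta_deg) / 2.0))

# Distances used in the text, with the same 6 cm IPD.
for z in (0.5, 5.0, 50.0):
    d = disparity_deg(z)
    print(f"{z:5.1f} m -> {d:.3f} deg ({d * 60:.1f} arcmin)")
```

Running the two functions back to back (distance to disparity, then disparity to distance) returns the original distance, which is the "calculation backwards" the brain is doing.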

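[Interactive diagram: top-down view of two eyes (L and R, IPD exaggerated to 25.0cm) fixating a draggable object on a 15-75cm depth scale. Example readout: α_L = 16.4°, α_R = -16.4°; disparity (α_L − α_R) = 32.78° (1967 arcmin); depth from disparity, Z = IPD / (2·tan(δ/2)), is 43cm, matching the geometric depth of 43cm read from the diagram.]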

Drag the object. Notice how disparity grows quickly when the object is close and collapses to zero at infinity. The brain does this calculation backwards: given the disparity, recover the depth. That's why stereopsis dies past about 30m — the disparity becomes too small to measure.

One subtlety the brain handles: the two retinal images aren't merged into a single average. The visual cortex keeps both available and uses their differences directly. What you perceive - the seemingly unified, depth-rich view - is a construct called the cyclopean image, assembled from the disparity field. You can prove this with random-dot stereograms: pairs of pure-noise images with no recognizable features that, when fused, reveal a 3D shape made entirely of depth-encoded disparity. The brain doesn't need recognition to see depth. It needs disparity.

Now a stereo demo to ground all this. Two side-by-side viewports below render the same scene from two virtual eye positions. The slider controls the distance between them - the IPD. At zero, both views are identical and the depth cue dies. Around 65mm you have roughly natural human stereo. Push past 100mm and you're in hyperstereo, where the world starts to feel like a tabletop diorama because your brain assumes the eyes that gave it those views must belong to a much larger creature.

[Interactive demo: side-by-side left-eye and right-eye views with an IPD slider, shown at 65mm ("natural").]

Try crossing or relaxing your eyes to fuse the two views - depth comes from the brain reconciling the small differences between them.

VR headsets need to know your IPD to render stereo correctly. Adult human IPDs range from about 54mm (small faces) to 74mm (large), with a mean near 63mm. Get it wrong by even a few millimeters and the whole world will feel slightly off - wrong scale, wrong distances, vague nausea. That's why every modern headset either lets you set IPD manually, has a mechanical IPD adjustment (sliding the optics apart), or measures it automatically with eye tracking. The brain is good at fusing slightly wrong stereo, but it pays for the work.

1.3 The conflict at the heart of every current VR headset

In the real world, vergence and accommodation are linked. When you focus on something thirty centimeters away, your eyes rotate inward to point at it (vergence) and your lenses thicken to bring it into sharp focus (accommodation). Both responses are triggered by the same stimulus - light coming from a real point in space - and they always give the same answer. They are so tightly coupled in your nervous system that one even drives the other. Force your eyes to converge harder and your accommodation will follow, and vice versa. The mapping is built into the near triad: convergence, accommodation, and pupillary constriction, all firing together.

The unit optometrists use for these is the diopter (D), which is the inverse of distance in meters. Accommodating to focus on something at 50cm is a 2 D demand; on something at 1m, 1 D; on infinity, 0 D. A healthy eye can settle on any value in its accommodation range (about 14 D in young eyes, declining to 1-2 D by 50) within roughly a second. Vergence is normally measured in meter angles or prism diopters but scales the same way: closer object → both more convergence and more accommodation, in lockstep.

Now consider what a VR headset actually is. There's a small display panel about two centimeters from your eye. Between your eye and the panel sits a lens, whose job is to make that panel appear to be much further away - typically optically focused at infinity, or sometimes at around 1.3-2 meters. That focal distance is fixed. It's determined by the geometry of the lens and the panel, and it never changes regardless of what is being shown.

But the vergence distance - where your eyes rotate to point - depends on the stereo content. If the renderer puts a virtual object thirty centimeters away, your eyes converge to thirty centimeters. If it puts one at five meters, your eyes converge to five meters. The vergence cue is fully under software control.

So the two cues, locked together for the entire evolutionary history of vertebrate vision, come apart. Your eyes converge to wherever the virtual object lives, but they have to focus where the panel actually is. Your visual system spends every waking second in VR trying to reconcile two depth signals that disagree, and it can't, because the disagreement is structural and constant.

This is the vergence-accommodation conflict. Shibata et al. (2011) measured the comfort zone empirically and found that most users are comfortable within roughly ±0.3 diopters of mismatch - less at the near end, more at the far end. Outside that band, fatigue and eyestrain build up quickly. Inside it, most people don't notice. That tolerance is the only reason current VR is usable at all. Headset manufacturers are careful to keep most virtual content within the comfort band; that's why text and UI in VR sit at a fixed "comfortable" depth, typically 1-2 meters.
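The comfort band is easier to reason about in diopters than in meters. A minimal sketch, assuming a symmetric tolerance of 0.3 D (the figure quoted above) and using illustrative function names:

```python
def diopters(distance_m: float) -> float:
    """Optical demand of a distance: 1 / meters (infinity gives 0 D)."""
    return 1.0 / distance_m

def vac_mismatch(vergence_m: float, focal_m: float) -> float:
    """Absolute vergence-accommodation mismatch, in diopters."""
    return abs(diopters(vergence_m) - diopters(focal_m))

def comfort_range(focal_m: float, tolerance_d: float = 0.3):
    """Nearest and farthest virtual distances that stay inside the band."""
    focal_d = diopters(focal_m)
    near = 1.0 / (focal_d + tolerance_d)
    far_d = focal_d - tolerance_d
    far = float("inf") if far_d <= 0 else 1.0 / far_d
    return near, far

near, far = comfort_range(1.3)  # optics focused at 1.3 m
print(f"comfortable from {near:.2f} m to {far:.2f} m")
print(f"object at 0.3 m: {vac_mismatch(0.3, 1.3):.2f} D of mismatch")
```

With the optics focused at 1.3 m, the band runs from roughly 0.94 m to 2.1 m, which lines up with the 1-2 meter depth where headset UI tends to sit. An object at 30 cm lands about 2.6 D outside the focal plane, far beyond the tolerance.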

Below is a top-down schematic of the conflict. The two sliders let you control the vergence depth and the accommodation depth independently - a thing reality never lets you do. Pull them apart and watch the conflict open up. In a real headset, the accommodation slider is fixed and only the vergence slider is under content control.

[Interactive demo: top-down schematic showing the focal plane (where light is sharp) and the perceived object (vergence), with one slider for each, both shown at 120. Conflict readout: matched.]

In nature these two are always locked - light from a real object carries both cues at once. In VR they don't have to be, and your visual system pays the tax in fatigue and nausea.

Hardware companies have spent staggering sums on fixing this. The general approach is called varifocal: make the focal distance of the optics adjustable, so it can be matched to wherever the user is currently converging. Several flavors have been tried:

  • Mechanically actuated panels. Move the panel physically closer to or further from the lens to change the apparent focal distance. Meta's Half Dome 1 (2018) did this. Works, but mechanically loud and not fast enough for saccade-rate adjustments.
  • Liquid lenses. A lens whose curvature can be changed by varying voltage across an electrowetted oil interface. Solid-state, fast, but optically noisier and more expensive than fixed glass. Half Dome 3 (2019) took a related electronic route, replacing the moving parts with a stack of switchable liquid-crystal lenses.
  • Multi-focal-plane displays. Stack two or three displays at different focal depths (driven by beamsplitters or polarized stacks). Each pixel can be composited at the depth nearest its content. Works but dims the image and makes the optical stack thick.
  • Light field displays. Capture and emit not just an image but a directionally-varying field of light rays per pixel. Each ray naturally focuses at the right depth because the eye reconstructs the depth from the ray bundle. Beautiful in theory; computationally expensive and resolution-limited in practice. Magic Leap 1 shipped a two-state version.
  • Holographic displays. Use computer-generated holograms to reconstruct the actual wavefront of light that would have come from a real 3D scene. State-of-the-art research, far from production.

None of them have shipped to consumers in a fully solved form. Meta has shown working Half Dome prototypes for years. Magic Leap had a two-state varifocal in their first headset and abandoned it in their second when they couldn't stretch it to a continuous range. Apple's Vision Pro famously does not include varifocal - the focal distance is fixed at about 1.3m. They bet that the rest of the experience could be good enough to make it not matter. The bet seems to be holding, but the conflict is still there in every Vision Pro session, every Quest session, every commercial VR headset on the market. It's a tax we haven't figured out how to stop paying.

Part 2 - Optics & Display

2.1 The panel

The display panel in a VR headset is a tiny, dense, ferocious piece of hardware. It sits two to three centimeters from your eye. It must produce an image sharp enough that detail isn't blurred when magnified by a lens directly in front of it. It must refresh fast enough that motion looks smooth. And it must do all of this while you whip your head around like an ungrateful passenger.

The metric that matters most is not raw resolution but pixels per degree (PPD): how many pixels fall in each degree of your visual field. PPD = horizontal pixels per eye / horizontal field of view in degrees. The human fovea resolves detail at about 60 PPD (some research goes as high as 120 PPD when accounting for super-resolution from saccades). Below 60, you can in principle see the pixels. Below 30, you definitely can - text legibility falls off, fine detail in textures collapses.

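[Interactive demo: PPD calculator with sliders for horizontal pixels per eye and horizontal FOV, presets for real headsets, and markers for text legibility and the foveal limit. Shown at 2064px and 110°: 18.8 pixels per degree, noticeably pixelated.]

At that setting you're rendering at 31% of what your eye could resolve at the fovea. The other 69% is information your panel can never deliver.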

The trade-off is built in: more PPD means either more pixels (more cost, more rendering work, more bandwidth) or less FOV (less immersion). Bigscreen Beyond hits 32 PPD by keeping FOV narrow at 90°; Vision Pro hits 34 by throwing an absurd number of pixels at it; consumer Quests sit around 20-25 by accepting that you can see the pixels and making up for it elsewhere.
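The trade-off can be put in numbers with a few lines of Python. This is a rough sketch: it uses the average-PPD definition from above (real lenses spread pixels unevenly across the field), the 60 PPD foveal figure quoted in the text, and the 2064-pixel, 110° example values; the function names are illustrative.

```python
import math

FOVEAL_LIMIT_PPD = 60.0  # foveal resolution figure quoted in the text

def pixels_per_degree(h_pixels_per_eye: int, h_fov_deg: float) -> float:
    """Average pixels per degree across the horizontal field of view."""
    return h_pixels_per_eye / h_fov_deg

def pixels_needed(target_ppd: float, h_fov_deg: float) -> int:
    """Horizontal pixels per eye required to reach a target PPD."""
    return math.ceil(target_ppd * h_fov_deg)

ppd = pixels_per_degree(2064, 110.0)
print(f"{ppd:.1f} PPD, {ppd / FOVEAL_LIMIT_PPD:.0%} of the foveal limit")
print(f"60 PPD at 110 degrees needs {pixels_needed(60.0, 110.0)} px per eye")
```

Holding the field of view at 110°, reaching 60 PPD takes 6600 horizontal pixels per eye, more than three times the example panel. That gap is why headsets pick a point on the trade-off rather than trying to close it.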

Two related effects come from low PPD. The first is the screen-door effect: visible black gaps between pixels, an artifact of the unlit space between each pixel's light-emitting subpixels. As density rises, the gaps shrink relative to lit pixel area, and the effect fades. Modern headsets have largely killed it.

The second is more subtle: subpixel layout. Most consumer displays use one of two arrangements. RGB stripe places three full-resolution subpixels (red, green, blue) per logical pixel - the layout in basically every monitor. PenTile, common in OLED panels and mobile, uses a diamond pattern where green is full resolution but red and blue are half-resolution and shared between adjacent pixels. PenTile gives you apparent higher density on green-heavy content (like grass, skin tones, most photographs) at the cost of chromatic edge detail - text especially looks a bit fringed on PenTile if you look closely. The original Quest's OLED panels used PenTile; Quest 2 and Quest 3 use RGB-stripe LCDs. The difference is most visible in fine UI text.

Other panel parameters that affect VR but rarely make spec-sheet headlines:

  • Bit depth. Most VR panels are 8-bit per channel. In dim scenes you can see banding in smooth gradients. 10-bit panels exist (Vision Pro, some PCVR headsets) and the difference is real.
  • Color gamut. sRGB coverage is ~100% for any modern panel. P3 coverage is harder; only the high-end micro-OLED panels comfortably cover P3.
  • HDR. Until micro-OLED came along, basically absent in VR. Some flagship headsets are starting to ship limited HDR.

The technology choice is mostly between OLED, LCD, and the newer micro-OLED. OLED wins on contrast (true blacks, high dynamic range) and switching speed; LCD wins on raw pixel density and cost; micro-OLED, used in Vision Pro, gives you both at the price of a small country.

But the more interesting parameter - and the one almost no consumer-facing comparison ever talks about - is persistence. A panel pixel doesn't switch on and off instantly. It takes some time to reach its target brightness, and some more time to fade back. The total amount of time the pixel spends emitting light during a single frame is its persistence.

High persistence (the pixel stays lit for most of the frame) is what every TV and monitor you've ever owned does. It looks fine when your head isn't moving. But in VR, your head is moving, and during a single frame your eye has tracked across an arc of the panel. If the pixel is still lit during that motion, the lit-up rectangle smears across your retina. The world looks wiped with vaseline every time you turn your head. Oculus DK1 had this problem badly; nobody who tried it and then the low-persistence DK2 a year later forgets the difference.

The fix is low persistence: light up each pixel for a tiny fraction of the frame and let it stay dark for the rest. Like a strobe light running at 90Hz. Your eye integrates the briefly-flashed image and your visual system stitches it into apparent continuity. This is what every modern VR headset does. It's also why VR headsets are dimmer than you might expect - the panel is only emitting light maybe 10-20% of the time.
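A back-of-the-envelope estimate shows why the duty cycle matters so much. The sketch below assumes the eye stays locked on the virtual world while the head turns, so the image slides across the retina at head speed for as long as the pixel is lit. The 100°/s head turn and 20 PPD panel are illustrative values, not measurements, and the function name is made up for this example.

```python
def smear_pixels(head_speed_dps: float, refresh_hz: float,
                 duty_cycle: float, ppd: float) -> float:
    """Approximate smear width in pixels during one frame.

    duty_cycle is the fraction of the frame the pixel is lit
    (1.0 = full persistence, 0.1 = lit for 10% of the frame).
    """
    lit_time_s = duty_cycle / refresh_hz
    smear_deg = head_speed_dps * lit_time_s
    return smear_deg * ppd

full = smear_pixels(100.0, 90.0, 1.0, 20.0)  # full persistence
low = smear_pixels(100.0, 90.0, 0.1, 20.0)   # 10% duty cycle
print(f"full persistence: {full:.1f} px, low persistence: {low:.1f} px")
```

Under those assumptions a full-persistence panel smears each pixel across about 22 pixels, while a 10% duty cycle cuts that to about 2. The price is the brightness loss described above, since the same duty cycle scales the emitted light.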

2.2 The lens

Sitting between your eye and the panel is a lens. The lens has two jobs: make the panel - which is two centimeters from your face - appear far enough away that your eye can focus on it, and magnify it so the small panel fills a wide field of view. Both of these are demanding, and the way you achieve them shapes how the whole headset feels.

Two parameters that almost never make marketing material but dominate how comfortable a headset is:

  • Eye relief. The distance from your cornea to the back surface of the lens. Typical: 12-20mm. More eye relief means more clearance for eyelashes, glasses frames, eye tracking cameras - but also a smaller effective FOV (because your eye is further from the optic) and worse off-axis quality. Less eye relief means a bigger, sharper image but might literally touch your eyelashes.
  • Eye box. The volume of space in which your eye can sit and still see a correct image. If you move outside the eye box - by tilting your head, by the headset shifting on your face - you start seeing vignetting, blur, or chromatic fringing. A large eye box is forgiving; a small one means even small fit changes ruin the image. Pancake lenses, which we'll get to, trade eye box for thinness.

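FOV itself is a function of the lens focal length, the panel size, and the eye position. Roughly, treating the lens as a simple magnifier with the panel at its focal plane: FOV ≈ 2·arctan(panel_half_width / focal_length). For a 70mm-wide-per-eye panel and a 35mm focal length, that gives 2·arctan(35/35) = 90° per eye. Eye position enters through the lens aperture: the eye can only see as much of that image as the lens rim allows, about 2·arctan(lens_half_diameter / eye_relief). Move the eye closer (shorter eye relief) or shorten the focal length to widen FOV; the tradeoffs scale.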

For most of VR's first decade, the workhorse lens was the Fresnel lens - a flat lens etched with concentric grooves that approximate the surface of a thicker, curved lens. They're light and cheap. They're also full of compromises: visible god-ray artifacts where light scatters off the grooves, a sharp sweet spot in the center surrounded by a ring of soft distortion. If you've used an Oculus Rift S or original Quest, you've squinted through a Fresnel.

Some headsets used aspheric lenses - single curved elements with non-spherical profiles to correct off-axis aberration. Sharper than Fresnels but heavier and longer. Vision Pro and high-end PCVR headsets use multi-element designs, stacking several lens elements (a real camera lens, basically) to cancel aberrations across the field. Beautiful image quality, heavy and expensive.

The newer flagship trend is pancake lenses, which fold the light path back on itself by bouncing it between polarized surfaces. The optical path is much longer than the physical thickness of the lens, which means you can get the image distance you need in a much shorter tube. That's why modern headsets are visibly thinner than their predecessors. The cost is that pancake lenses are dim - they throw away most of the panel's light to the polarizers - which forces panel brightness back up. The other cost is a smaller eye box, which is why pancake-lens headsets are noticeably fussier about fit.

Augmented-reality headsets, which we're not focusing on here, replace the lens entirely with waveguides - thin pieces of glass with microstructures (diffractive, reflective, or holographic) that pipe a projected image into the eye while passing through the real-world scene unchanged. HoloLens and Magic Leap use these. They're optically clever but currently FOV-limited (40-60° at best) and tend to have visible color separation.

But here's the part that matters for the rest of this article. Every VR lens distorts the image. Not by accident - by necessity. The lens is short, the eye is close, the field of view is wide. There is no design that gives you all three of those without introducing significant pincushion distortion: straight lines bowing inward at the corners. There is also chromatic aberration, where different wavelengths of light refract by slightly different amounts, causing colored fringes at the edges.

If the renderer just drew an undistorted image and put it on the panel, the lens would warp it on the way to your eye and you'd see a curved, color-fringed mess. The trick to making VR look right involves pre-distorting the image in software, on purpose, so that the lens can undo the distortion in hardware on the way to your eye.

2.3 The pre-warp trick

This is one of the more counterintuitive ideas in real-time graphics, so it's worth slowing down for. The renderer wants to display a clean, undistorted image. The lens is a fixed piece of physics that distorts whatever passes through it. The renderer has no control over the lens. It only has control over the image it puts on the panel.

So the renderer plays a trick. After drawing the scene, it applies the inverse of the lens's distortion. Whatever curve the lens introduces, the renderer pre-applies the opposite curve. If the lens makes straight lines bow inward, the renderer bows them outward by the matching amount. The image on the panel ends up looking grotesque - a fish-eye view with the edges bulging outward - but when light from that image passes through the lens to your eye, the two distortions cancel, and you see straight lines.

For chromatic aberration, the renderer goes further: it applies different distortion strengths to the red, green, and blue channels separately, knowing the lens will refract each of them by a slightly different amount. Each color is bowed opposite to the direction the lens will bow it, so the three converge cleanly at the eye.

The demo below shows this end to end. Three images:

  1. The clean grid the renderer wants to display.
  2. The actual contents of the panel buffer - what would be photographed if you held a camera up to the screen.
  3. What your eye sees after light from the panel passes through the lens.

[Interactive demo: three panels - 1. scene the app rendered, 2. panel buffer (pre-warped), 3. through the lens (eye) - with a pre-warp toggle.]

With pre-warp on, the panel looks "wrong" and the eye sees a clean grid. Turn pre-warp off - the panel looks fine, but the lens warps it on the way to the eye, which is what the user would experience if the headset shipped without correction.

Now the math underneath. The standard model for radial lens distortion is the Brown-Conrady polynomial:

r' = r · (1 + k₁·r² + k₂·r⁴ + k₃·r⁶ + ...)

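where r is the radial distance from the optical center (normalized to 0-1), r' is the distorted radial distance, and k₁, k₂, k₃ are coefficients fit to the specific lens. The sign convention depends on which way you apply the polynomial. Used as a sampling lookup, where for each output pixel at radius r you fetch the source image at radius r', negative k₁ produces pincushion (lens-typical) and positive k₁ produces barrel (which is what the renderer applies as pre-warp); that is the convention used here. Used as a forward map that moves a point from r to r', the signs flip. Most engines truncate after k₂; the full Brown-Conrady model also has tangential terms p₁, p₂ that account for decentering, where the optical elements are not perfectly aligned with each other or the panel. Try it below.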

[Interactive demo, two panels: distortion curve · grid through this polynomial. Three coefficient sliders, shown at -0.30, 0.05, 0.00.]

The Brown-Conrady model: r' = r·(1 + k₁r² + k₂r⁴ + k₃r⁶). This curve is currently pincushion (lens-like). Lens manufacturers fit these coefficients to a specific lens; the renderer applies the inverse to the rendered image so the panel-times-lens product is identity.
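To make the "renderer applies the inverse" step concrete, here is a minimal sketch of the radial polynomial and a numerical inverse. The function names and the fixed-point solver are my own choices for illustration, not how any particular engine does it; the coefficients are the ones shown on the demo sliders.

```python
def distort(r, k1, k2=0.0, k3=0.0):
    """Radial Brown-Conrady term: r' = r * (1 + k1*r^2 + k2*r^4 + k3*r^6)."""
    r2 = r * r
    return r * (1.0 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2)


def undistort(r_target, k1, k2=0.0, k3=0.0, iterations=30):
    """Find r with distort(r) == r_target by fixed-point iteration.

    Assumes mild distortion (the scale factor stays well away from zero),
    which is what makes this simple iteration converge.
    """
    r = r_target
    for _ in range(iterations):
        r2 = r * r
        scale = 1.0 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2
        r = r_target / scale
    return r


if __name__ == "__main__":
    k1, k2 = -0.30, 0.05
    for r in (0.2, 0.5, 0.8):
        warped = distort(r, k1, k2)
        print(r, round(warped, 4), round(undistort(warped, k1, k2), 4))
```

Running one function and then the other should hand back the radius you started with, which is the "panel-times-lens product is identity" property in one dimension.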

Real-world implementations don't do per-pixel polynomial math on every frame. That would burn far too much GPU time. Instead, the renderer draws the scene normally, and then a final post-process pass samples the rendered image through a precomputed distortion mesh - a tessellated grid of triangles where each vertex's texture coordinate has been pre-warped through the inverse polynomial. Drawing that mesh with the rendered scene as a texture gives you the pre-warped panel buffer. The GPU is very good at this; the distortion pass costs perhaps half a millisecond on a modern mobile chip.

A few practical wrinkles every shipping VR engine has to handle:

  • Per-eye distortion is asymmetric. Each eye's lens has its own optical center and slightly different coefficients due to manufacturing tolerances. The distortion mesh is generated separately per eye and calibrated per-headset.
  • Chromatic correction is per-channel. The mesh actually has three texture-coordinate sets per vertex, one for each color channel, with the channel-specific k coefficients baked in. The fragment shader samples the rendered image three times, one per channel, at slightly different coordinates.
  • The mesh has to be denser than you'd think. The distortion is curved and the warp is non-linear in both axes. A coarse mesh shows visible polygon edges after warping. Modern systems use roughly a 50×50 mesh per eye.
  • Distortion is the last thing. It happens AFTER everything else: after rendering, after timewarp, after compositing. You want the final image, the one that's about to scan out, to be the pre-warped image. We'll see in Part 3 why that ordering matters.
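The distortion mesh described above can be sketched in a few lines. This is a toy version under stated assumptions: the per-channel coefficients in COEFFS are made-up values that stand in for already-inverted, per-headset calibration data, the polynomial is used directly as a lookup from panel position to source texture coordinate, and coordinates run from -1 to 1 rather than whatever a real engine uses.

```python
# Hypothetical pre-warp coefficients (k1, k2) per colour channel.
COEFFS = {"r": (-0.28, 0.05), "g": (-0.30, 0.05), "b": (-0.32, 0.05)}


def build_distortion_mesh(n, coeffs_by_channel, center=(0.0, 0.0)):
    """Build an (n+1) x (n+1) vertex grid covering the panel.

    Each vertex is (panel_position, {channel: source_uv}). The GPU would
    interpolate the three UV sets across each triangle and sample the
    rendered image once per channel.
    """
    vertices = []
    for j in range(n + 1):
        for i in range(n + 1):
            x = -1.0 + 2.0 * i / n
            y = -1.0 + 2.0 * j / n
            dx, dy = x - center[0], y - center[1]
            r2 = dx * dx + dy * dy
            uvs = {}
            for channel, (k1, k2) in coeffs_by_channel.items():
                scale = 1.0 + k1 * r2 + k2 * r2 * r2
                uvs[channel] = (center[0] + dx * scale, center[1] + dy * scale)
            vertices.append(((x, y), uvs))
    return vertices


if __name__ == "__main__":
    mesh = build_distortion_mesh(50, COEFFS)
    print(len(mesh), "vertices")
    print("corner:", mesh[-1])
```

The polynomial runs once per vertex at startup (or at calibration time), not once per pixel per frame. The per-frame cost is just drawing a few thousand textured triangles.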

Part 3 - Tracking, Latency, Rendering

3.1 Tracking the head

Everything in VR - every frame, every sound, every reprojected timewarp - is computed relative to where the headset is in space. So the headset has to know, with bewildering precision, where it is. Not just within the room. Within millimeters, several hundred times per second, in six degrees of freedom (position and rotation). A few millimeters of error in head pose translates immediately into a stuttering, swimming, nauseating world.

The two pieces of hardware that do this work are:

  • The IMU (inertial measurement unit). A small chip containing accelerometers and gyroscopes (usually a magnetometer too). It tells you how the headset is accelerating and rotating, sampled at 1000Hz or higher. Cheap, fast, immediate.
  • The cameras. Several wide-angle cameras pointed outward from the headset, sampled at 30-120Hz. They watch the room and pick out features (corners, edges, textured patches) to triangulate the headset's absolute position. Slow, expensive, but ground-truth.

Neither is sufficient on its own. The IMU is fast but only measures change. To turn that change into a position, you have to integrate it twice (acceleration → velocity → position), and integration accumulates noise. Tiny errors in each sample grow into huge errors in absolute pose within seconds. This is called drift, and it is the central enemy of inertial-only tracking. Worse: gyroscopes have a slow bias drift on top of their sample noise, which means even rotation drifts over time.
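A quick way to feel how fast double integration goes wrong: take a headset sitting perfectly still, give its accelerometer a small constant bias, and integrate. The bias value below is hypothetical, picked only to make the numbers readable.

```python
def integrate_position(accel_samples, dt):
    """Integrate acceleration twice (Euler steps) and return final position."""
    velocity = 0.0
    position = 0.0
    for a in accel_samples:
        velocity += a * dt
        position += velocity * dt
    return position


if __name__ == "__main__":
    rate_hz = 1000
    dt = 1.0 / rate_hz
    bias = 0.02  # m/s^2, hypothetical accelerometer bias; true motion is zero
    for seconds in (1, 5, 10):
        samples = [bias] * (seconds * rate_hz)
        print(seconds, "s ->", round(integrate_position(samples, dt), 3), "m of drift")
```

The error grows roughly as half the bias times time squared, so ten times the duration means about a hundred times the drift.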

The cameras are the opposite. They give you a (relatively) absolute fix on where you are, with no drift, but they're slow - by the time you've grabbed a frame, run the feature detector, and matched it against your map of the room, dozens of milliseconds have passed and you've already moved. Cameras also need features to track. In a featureless white room, vision-based tracking degenerates to whatever the IMU says.

The trick is to fuse them. The combined approach is called visual-inertial odometry (VIO), and the algorithms have a few moving parts worth understanding.

Feature detection. When a camera frame arrives, the system runs a detector that finds interest points - small patches of image that are visually distinctive and stable across views. Classic algorithms here include FAST (corner detection: a pixel is a corner if a long enough contiguous arc of the ring of pixels around it is all brighter or all darker than it by some threshold), ORB (FAST corners + a binary descriptor for fast matching), and the older SIFT (more accurate but slower; mostly historical now in real-time systems). The output of this stage is a set of 2D points in the image plane along with a fingerprint you can match against future frames.

Feature matching. When the next frame comes in, the system tries to find the same features again. Most implementations search a small region around where the IMU prediction says the feature should now appear, massively narrowing the search and rejecting bad matches. Outliers are filtered out with RANSAC: try random subsets of matches, see which subset agrees on a consistent camera motion, throw out the rest.
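The RANSAC loop is easier to see with a deliberately tiny motion model. A real tracker fits a full camera motion; the sketch below assumes the image simply shifted in 2D, so a single match is enough to propose a model. The match data and the function name are invented for the example.

```python
import random


def ransac_translation(matches, threshold=2.0, iterations=100, seed=0):
    """matches: list of ((x0, y0), (x1, y1)). Model: one 2D shift for all."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        (x0, y0), (x1, y1) = rng.choice(matches)  # minimal sample: one match
        dx, dy = x1 - x0, y1 - y0
        inliers = [m for m in matches
                   if abs((m[1][0] - m[0][0]) - dx) <= threshold
                   and abs((m[1][1] - m[0][1]) - dy) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    n = len(best_inliers)
    shift = (sum(m[1][0] - m[0][0] for m in best_inliers) / n,
             sum(m[1][1] - m[0][1] for m in best_inliers) / n)
    return shift, best_inliers


# Eight consistent matches (shift of +5, -3) plus three bad ones.
points = [(10, 20), (40, 80), (120, 60), (200, 150),
          (310, 40), (90, 220), (260, 190), (30, 130)]
good = [((float(x), float(y)), (x + 5.0, y - 3.0)) for x, y in points]
bad = [((50.0, 50.0), (120.0, 10.0)),
       ((150.0, 90.0), (100.0, 140.0)),
       ((220.0, 30.0), (221.0, 90.0))]
matches = good + bad

if __name__ == "__main__":
    shift, inliers = ransac_translation(matches)
    print("estimated shift:", shift, "inliers:", len(inliers))
```

The bad matches never gather support from anyone but themselves, so the consensus set ends up being the eight that agree.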

Triangulation. Each tracked feature seen from two or more known camera positions can be triangulated to a 3D world position. The first time the system sees a feature it has unknown depth; on the second view it gets a depth estimate; on subsequent views the estimate refines.

The state estimator. All of this - the headset's current pose, its velocity, IMU biases, the 3D positions of the tracked features - gets stuffed into a single high-dimensional state vector. Every IMU sample and camera frame updates the state through some flavor of Extended Kalman Filter (EKF) or, in higher-end systems, an error-state Kalman filter or a sliding-window bundle adjustment. The EKF is a recursive Bayesian update: given a prior belief and a new measurement, compute the posterior belief. Bundle adjustment is heavier but more accurate: re-optimize all recent poses and feature positions jointly to minimize reprojection error.
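The "prior plus measurement gives posterior" step is clearest in one dimension. This is a plain scalar Kalman filter, not the extended or error-state variants a real headset runs, and every number in it is made up; it only shows the shape of the predict/update cycle.

```python
def predict(x, p, velocity, dt, q):
    """IMU-driven step: move the estimate forward and grow its variance by q."""
    return x + velocity * dt, p + q


def update(x, p, z, r):
    """Camera fix z with variance r: blend prior and measurement."""
    k = p / (p + r)  # Kalman gain: how much to trust the measurement
    return x + k * (z - x), (1.0 - k) * p


if __name__ == "__main__":
    x, p = 0.0, 1.0  # position estimate (m) and its variance
    for step in range(1, 31):
        x, p = predict(x, p, velocity=0.5, dt=0.01, q=0.0001)
        if step % 10 == 0:  # a camera fix arrives every tenth IMU sample
            x, p = update(x, p, z=0.005 * step, r=0.01)
            print(step, round(x, 4), round(p, 5))
```

Between camera fixes the variance creeps up; each fix pulls it back down. That sawtooth is the drift-and-correct behavior in miniature.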

Loop closure. When you walk around a room and come back to where you started, your accumulated drift has probably moved your estimated pose a few centimeters from where it actually is. Loop closure detects this: the system recognizes "I've seen this view before" by comparing the current frame's features against a database of past frames. When a match is found, it corrects the entire trajectory backwards to be consistent. This is what makes long-running VR sessions stay accurate.

Scale recovery and the gravity vector. Pure monocular vision has a fundamental ambiguity: a small scene seen from up close looks identical to a big scene seen from far away. The IMU resolves this by giving you absolute acceleration in m/s². Combined with the visual system's relative motion estimates, you can solve for absolute scale. The accelerometer also gives you gravity for free (when the headset is at rest, the only measured acceleration is gravitational), which lets the system establish a "down" direction even before the cameras have seen anything.

Controllers. Most systems track controllers using the same machinery, plus dedicated infrared LEDs around each controller's tracking ring that the headset's cameras can see at high contrast. Quest controllers have an LED constellation; the inside-out cameras recognize the pattern and triangulate it. Hand tracking is similar but uses a learned ML model that estimates a 26-DOF hand skeleton directly from camera images, without LEDs. Both feed into the same tracker stack and produce poses in the same coordinate frame.

Below: a top-down view of a virtual room, with a "headset" following an orbital path. The truth pose is in muted teal. The IMU estimate is in orange. With optical correction off, you can watch the orange ghost drift away from truth as the integrator's noise accumulates. With it on, every camera fix yanks the estimate back into agreement.

[Interactive demo. Legend: truth · IMU estimate · anchors. Readout: drift, in px.]

One thing worth flagging: the choice of outside-in versus inside-out tracking is a real and ongoing design tension. Outside-in (the original HTC Vive lighthouse setup) puts the cameras or laser emitters on the walls and the markers on the headset; inside-out (every modern consumer headset) puts the cameras on the headset and lets it figure out the room itself. Outside-in is more accurate, especially in a fixed setup, but inside-out doesn't require you to mount anything in your living room. Inside-out won the consumer market on convenience alone, even though purists will tell you the tracking is objectively a hair worse.

3.2 The latency budget

There is a number that defines the difference between a comfortable VR session and one that will have you staring at a kitchen wall trying not to throw up. It's called motion-to-photon latency: the time between when you move your head and when the photons reflecting that motion actually hit your retina.

The widely-quoted target is under 20 milliseconds. Above roughly 20ms, your visual system starts to register a lag between proprioception (where your body says your head is) and vision (where the world tells you your head is). That mismatch is one of the strongest known triggers of motion sickness. Above 50ms it's intolerable. Above 100ms, most people are actively ill within a few minutes. The vestibulo-ocular reflex, which we met in section 1.1, is a big part of why: that reflex completes within 10ms in your inner ear, and any visual lag past that puts the eye-movement and visual streams out of sync.

How do engineers actually measure motion-to-photon? The standard technique is to instrument a headset with a small mechanical actuator and a high-speed camera. The actuator yanks the headset by a known amount; the high-speed camera (running at 1000fps or faster) watches both the headset and the panel. The number of frames between "headset starts moving" and "panel image starts shifting" gives you motion-to-photon in milliseconds. Shipping headsets are routinely measured this way, and many of the results are public.

Twenty milliseconds is not a long time. It's one frame at 50fps. The full pipeline that has to fit inside it includes:

  • Reading and fusing the IMU sample.
  • Running the application's simulation step.
  • Rendering the scene from two eye positions.
  • Running the compositor, including distortion correction.
  • Scanning the result out to the panel, line by line.
  • Waiting for the panel's pixels to actually emit photons.

Drag the sliders below and watch the budget add up. The white line at 20ms marks the comfort threshold. Most of these stages are not fully under software control - display scanout in particular is gated by panel hardware and can be the dominant cost on a slow display.

[Interactive demo: latency budget sliders. Stage values shown: 1.0ms, 8.0ms, 2.0ms, 4.0ms, 2.0ms. Total: 17.0ms motion-to-photon (comfortable), against the 20ms comfort threshold.]

Look at how aggressively the application stage gets squeezed. Most VR apps have something like 8-10ms of total CPU+GPU time to do everything the game logic and renderer need to do, twice (once per eye). That is brutal. Console games typically have 16-33ms for a single view. VR apps have less time and twice the work, and they're frequently running on mobile-class hardware. This is why VR scenes look simpler than you'd expect - the budget just isn't there for high-poly art.
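The budget arithmetic is simple enough to write down. The five stage values below are the ones the demo shows by default; how I've assigned them to named stages is my own guess at the split, so treat the labels as hypothetical.

```python
COMFORT_THRESHOLD_MS = 20.0

# Stage durations in milliseconds. The values come from the demo's defaults;
# the stage names attached to them are an assumed mapping.
stages_ms = {
    "IMU read + fusion": 1.0,
    "app simulation + render": 8.0,
    "compositor + distortion": 2.0,
    "scanout": 4.0,
    "pixel response": 2.0,
}


def motion_to_photon_ms(stages):
    """Total latency if the stages run strictly one after another."""
    return sum(stages.values())


def frame_period_ms(refresh_hz):
    """Time available per displayed frame at a given refresh rate."""
    return 1000.0 / refresh_hz


if __name__ == "__main__":
    total = motion_to_photon_ms(stages_ms)
    print("motion-to-photon:", total, "ms")
    print("headroom:", COMFORT_THRESHOLD_MS - total, "ms")
    for hz in (72, 90, 120):
        print(hz, "Hz ->", round(frame_period_ms(hz), 1), "ms per frame")
```

Bump any single stage by a few milliseconds and the headroom is gone, which is why nobody in this pipeline gets to be sloppy.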

One detail buried in the "scanout" stage worth pulling out: panels don't refresh all at once. A real LCD or OLED panel paints pixels top-to-bottom over the scanout duration, typically 8-13ms. This is the same rolling shutter behavior that makes spinning propellers look bent in cellphone photos. In VR it means that the top of the displayed image was painted with the head pose at one moment and the bottom was painted with the head pose ~10ms later. If you don't correct for this, fast head motion shears the image slightly. The fix is per-scanline timewarp: re-sample the rendered frame for each row using the head pose at that row's actual scanout time. We'll see the full timewarp story in 3.3, but rolling scanout is where the "every millisecond counts" pressure starts to bite.

[Interactive demo, two panels: naive scanout (whole frame at render time) · per-scanline late warp (each row at its own time). Sliders: scanout duration, shown at 11ms; head speed, shown at 2.5 rad/s.]

Crank up scanout duration and head speed. The naive panel shears slightly because the top of the image was painted at one head pose and the bottom is being painted ~10ms later at another. The per-scanline corrected panel re-samples the renderer's frame for each row at the head pose appropriate for that row's actual scanout time. Rolling shutter cameras have the same problem; the fix is the same.
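Here is a rough sketch of the per-row correction, assuming a pure yaw at constant angular velocity and a pinhole view with a focal length given in pixels. The row count and focal length are hypothetical; the 11ms and 2.5 rad/s are the demo's slider values. Only the magnitude is modeled, and only near the image center, so read it as an estimate of the shear rather than a warp you could ship.

```python
import math


def row_shift_px(row, rows, scanout_s, yaw_rate, focal_px):
    """Horizontal shift needed for one row, relative to the pose at the top row."""
    t = scanout_s * row / (rows - 1)  # when this row is actually scanned out
    dyaw = yaw_rate * t               # how far the head has turned by then
    return focal_px * math.tan(dyaw)


if __name__ == "__main__":
    ROWS = 1920  # hypothetical panel height in rows
    for row in (0, ROWS // 2, ROWS - 1):
        print(row, round(row_shift_px(row, ROWS, 0.011, 2.5, 1000.0), 2), "px")
```

With these numbers the bottom row wants a correction of a few dozen pixels more than the top row, which is plenty to see as a slant on vertical edges.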

Some panels avoid rolling scanout entirely. Global shutter panels (still rare in consumer VR) flash every pixel simultaneously at the end of the frame. No shear, but worse luminance and more demanding driver silicon. Some flagship headsets are starting to use global-shutter micro-OLED.

3.3 Timewarp and reprojection

Now the trick that, more than any other single piece of engineering, makes modern VR possible.

Suppose your headset is targeting 90Hz. That's 11.1ms per frame. Your renderer is doing its best, but on a complex scene it sometimes overshoots - say, a frame that takes 14ms to render. By any normal accounting, you've now missed the display's frame deadline. The compositor would have to either display the previous frame again (judder) or wait an extra full frame (a stutter the user will instantly feel).

Both are unacceptable. So the compositor cheats. Instead of displaying the frame as-is, it takes the most recent rendered frame, looks at where the head pose was when that frame was rendered, and compares it to where the head pose is right now, microseconds before scanout. Then it applies a final 2D transformation to the rendered image to account for the difference. This is called asynchronous timewarp (ATW), and it comes in several flavors of increasing capability.

The simplest is rotational timewarp: assume the entire rendered scene is at infinity and rotate the framebuffer by the head pose delta. Cheap, fast, works for distant content. Fails for nearby objects, because rotation can't manufacture parallax - if a near object should appear to slide sideways relative to the background as your head translates, rotation alone won't make it do that.
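Rotational timewarp boils down to: turn each pixel into a view ray, rotate the ray by the head-pose delta, and project it back. A minimal sketch, assuming a pinhole camera looking down +z with x to the right, and a yaw-only delta. The function names and the convention that the matrix maps old-frame rays to new-frame rays are my own choices here.

```python
import math


def yaw_matrix(angle):
    """Rotation about the vertical (y) axis; positive angles move +z toward +x."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def reproject_pixel(u, v, focal_px, cx, cy, old_to_new):
    """Where a pixel of the rendered frame lands after a pure head rotation."""
    ray = [(u - cx) / focal_px, (v - cy) / focal_px, 1.0]   # pixel -> ray
    x, y, z = [sum(old_to_new[i][j] * ray[j] for j in range(3)) for i in range(3)]
    return cx + focal_px * x / z, cy + focal_px * y / z     # ray -> pixel


if __name__ == "__main__":
    # The head turned right by 0.01 rad since render time, so content
    # should slide left: rays rotate by -0.01 rad in the new head frame.
    warp = yaw_matrix(-0.01)
    print(reproject_pixel(960.0, 540.0, 1000.0, 960.0, 540.0, warp))
```

Notice that depth never appears anywhere. That is exactly why this warp is cheap, and exactly why it cannot produce parallax.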

Positional timewarp uses the rendered scene's depth buffer to handle parallax correctly. Each pixel knows how far away its content is; the reprojection shifts near pixels by more than far pixels, exactly as real parallax would. The cost is GPU work proportional to the depth-aware sampling pass, plus the risk of disocclusion artifacts at edges - pixels that need to be revealed by the head movement but weren't present in the rendered frame, leaving small black gaps. Most modern compositors fill these by stretching nearby pixels.
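The depth dependence is one line of pinhole geometry: for a small sideways head translation, a point's screen shift is inversely proportional to its depth. A sketch, with a hypothetical focal length and made-up depths:

```python
def parallax_shift_px(depth_m, head_shift_m, focal_px):
    """Screen-space shift of a point when the head translates sideways.

    Moving the head to the right (positive head_shift_m) makes content
    appear to move left, so the result is negative.
    """
    return -focal_px * head_shift_m / depth_m


if __name__ == "__main__":
    focal_px = 1000.0   # hypothetical focal length in pixels
    head_shift = 0.01   # the head moved 1 cm to the right since render time
    for name, depth in (("near bar", 0.5), ("far pillar", 10.0)):
        print(name, round(parallax_shift_px(depth, head_shift, focal_px), 2), "px")
```

A depth-aware warp applies this shift per pixel from the depth buffer. The places where a near pixel moves further than the far pixel behind it are where the disocclusion gaps open up.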

Asynchronous spacewarp (ASW) goes further: it extracts motion vectors from the renderer (per-pixel screen-space motion of moving objects) and reprojects those too. So an animated scene where a ball is flying past you can be reprojected even if the ball would otherwise be frozen in place between renderer frames. ASW2 on Oculus combines positional warp and motion vectors. The result is striking: you can render at 45fps, reproject to 90Hz, and the user generally can't tell.

Late-stage reprojection. The most aggressive timewarp implementations do the reprojection AFTER scanout has begun, per-scanline. Just before each row is sent to the panel, the compositor re-samples the most recent rendered frame using the head pose at that specific moment. This essentially eliminates rolling scanout shear and pushes motion-to-photon latency down to the panel's response time, which is as close to zero as the hardware physically allows.

One important wrinkle: some content shouldn't be timewarped at all. Reticles, head-locked UI, and crosshairs should track the head, not the world; if you timewarp them along with the scene, they swim around during head motion. Modern compositors handle this with head-locked layers: a separate rendering layer submitted by the app, drawn directly onto the panel without reprojection. Quad and cylinder layers (which we'll meet in 4.2) use this mechanism.

Below: a simulation. The left canvas shows naive playback - the renderer emits at a deliberately low 12fps and the display shows whatever the last rendered frame was. The right canvas applies timewarp. Toggle between rotational and positional modes; move your mouse left/right (rotation) and up/down (translation). Notice that rotational mode tracks rotation but the near purple bars don't parallax against the distant pillars when you translate, while positional mode does both correctly.

[Interactive demo, two panels: naive (last rendered frame) · timewarped (positional). Mouse X = head rotation, mouse Y = sideways translation.]

Move mouse left/right (rotation) — both modes track. Move mouse up/down (translation) — only positional warp moves the near purple bars relative to the distant pillars. Rotational warp can't manufacture parallax; positional warp can, because it knows depth.

That difference - between unwatchable and comfortable - is happening in approximately one millisecond of compositor time, every frame, on every VR headset shipping today.

3.4 Foveated rendering

We pick up the thread from Part 1. Your visual acuity falls off a cliff outside the fovea, but every VR headset historically rendered every pixel at uniform quality. That's an enormous amount of wasted GPU work - high-detail shading, textures, and antialiasing being computed for pixels your eye can't even resolve.

Foveated rendering is the idea of varying the rendering quality based on retinal eccentricity. The center of the field of view gets full resolution. The periphery gets increasingly aggressive degradation: lower sample counts, lower texture mip levels, simpler shaders, sometimes outright blurring.

There are two flavors. Fixed foveated rendering (FFR) just assumes you're always looking at the center of the panel and degrades the corners. It's a nice 20-30% performance win for free, and shipping headsets like the Quest have used it for years. The fancier flavor is eye-tracked foveated rendering: cameras inside the headset watch your pupils in real time and the high-detail region follows your gaze. Apple's Vision Pro and Meta's Quest Pro do this. The performance win is dramatic - sometimes 50% or more of shading cost - and you genuinely cannot tell, because by definition, you can never look at the part being degraded.

The hardware feature that makes this practical at consumer scale is variable rate shading (VRS). Modern GPUs (Adreno, Mali, NVIDIA Turing+, Apple silicon) can shade groups of pixels - 1×1, 2×2, 4×4 - together as a single fragment evaluation, and choose the shading rate per screen region. The renderer submits a low-resolution "shading rate" image; the GPU uses it to decide how aggressively to shade each region. Typical VR foveation profiles run full rate in a small central disk, 2×2 in a wider ring, and 4×4 or 8×8 in the corners.
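A shading-rate image is just a small grid of integers. The sketch below builds one from a gaze position using two radius thresholds, then estimates how much fragment work it saves. The tile counts and radii are invented for the example, and real GPUs have their own tile sizes and rate encodings.

```python
import math


def shading_rate_image(tiles_x, tiles_y, gaze_tile, inner_radius, outer_radius):
    """Grid of block sizes per tile: 1 (1x1), 2 (2x2) or 4 (4x4)."""
    gx, gy = gaze_tile
    image = []
    for ty in range(tiles_y):
        row = []
        for tx in range(tiles_x):
            d = math.hypot(tx - gx, ty - gy)  # distance from gaze, in tiles
            row.append(1 if d <= inner_radius else 2 if d <= outer_radius else 4)
        image.append(row)
    return image


def shading_work_fraction(image):
    """Fragment evaluations relative to shading every pixel individually."""
    rates = [rate for row in image for rate in row]
    return sum(1.0 / (rate * rate) for rate in rates) / len(rates)


if __name__ == "__main__":
    image = shading_rate_image(32, 32, (16, 16), inner_radius=4, outer_radius=10)
    print("work relative to uniform shading:", round(shading_work_fraction(image), 3))
```

For fixed foveation the gaze tile is simply the panel center; for eye-tracked foveation it is rebuilt from the latest gaze sample.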

For eye-tracked VR, the shading rate image gets updated every frame from the latest eye-tracker sample. Eye trackers run at around 120Hz, with prediction during saccades to keep the high-detail region ahead of where your gaze is moving. If the rate image lagged, you'd briefly see degradation in the new fixation region - but saccadic suppression (section 1.1) hides exactly this kind of transient. The system exploits a feature of your visual system to mask its own seam.

One related idea worth knowing about is foveated transport: in cloud-rendered or wireless VR, where the headset is just a thin client receiving a video stream, only the foveated regions need to be sent at full quality. The peripheral regions can be heavily compressed, dropped to lower framerate, or both. This makes streaming VR practical at lower bandwidths than a uniformly-encoded stream would require.

The demo below mocks foveated rendering on a 2D image using the actual cone-density function from the eye demo as the falloff curve. Your mouse acts as the gaze tracker; the dashed circle marks the high-detail foveal region. Toggle "show sample blocks" to see the structure. The "work saved" readout reports the fraction of pixel-shader work the foveated pass avoids relative to a uniform pass.

[Interactive demo. Readouts: samples drawn (of 216,000 possible) · work saved (% of pixel-shading load) · falloff curve: cone-matched, same function as the retina demo. Mouse = your gaze. The dashed circle is the high-detail foveal region.]

One subtle detail worth surfacing: foveated rendering is a perceptual exploit that works in part because of saccadic suppression. When your eyes jump from one fixation point to another, the visual system briefly turns down sensitivity during the jump. By the time you've finished the saccade and the new fixation lands, the renderer has already seen the new gaze target and shifted the high-detail region. You don't see the transition, because for those few milliseconds your visual system isn't really seeing anything at all.

Part 4 - Audio, Runtime, Frontiers

4.1 Spatial audio

We've spent three parts on what your eyes are doing. Your ears are doing something analogous, and VR has to fake their input too, with comparable subtlety.

You localize sound in 3D using two main cues. The interaural time difference (ITD) is the gap between when a sound reaches your two ears - a sound on your right arrives at your right ear maybe half a millisecond before it reaches your left, because the path to the far ear is roughly 15-20 centimeters longer (sound covers about 17cm in half a millisecond). The interaural level difference (ILD) is the difference in loudness between the two ears - your head blocks high-frequency sound from reaching the far ear, creating a "head shadow."

ITD dominates at low frequencies (below ~1.5kHz), where the wavelengths are longer than your head and the level difference is small. ILD takes over at higher frequencies, where the head casts a sharper acoustic shadow. This division of labor is called the duplex theory of sound localization, and it's why music engineers think about pan in two regimes.
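To put a number on the ITD, one common textbook approximation is Woodworth's spherical-head formula, ITD = (a / c) · (θ + sin θ), where a is the head radius, c the speed of sound, and θ the source azimuth measured from straight ahead. The article doesn't derive ITD this way; I'm bringing the formula in as an illustration, and the head radius below is an assumed typical value.

```python
import math


def itd_seconds(azimuth_rad, head_radius_m=0.0875, speed_of_sound=343.0):
    """Woodworth spherical-head ITD, for azimuths between 0 and pi/2 radians."""
    return (head_radius_m / speed_of_sound) * (azimuth_rad + math.sin(azimuth_rad))


if __name__ == "__main__":
    for degrees in (0, 15, 45, 90):
        itd_us = itd_seconds(math.radians(degrees)) * 1e6
        print(degrees, "deg ->", round(itd_us), "microseconds")
```

A source directly to one side comes out at a bit over half a millisecond, which matches the figure quoted above.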

Those two cues tell you left from right. They don't tell you front from back, or up from down - the geometry on either side of the median plane is symmetric. The third cue, the one that resolves front/back/up/down, comes from your outer ears. The folds of your pinna filter sound based on the direction it's coming from: certain frequencies get amplified, others attenuated, in patterns that depend on incidence angle. Your brain has built up, over years, a learned mapping between these spectral fingerprints and where the sound originated. It's why you can tell a sound is behind you, even with one ear plugged.

That mapping has a name: the head-related transfer function, or HRTF. Mathematically, an HRTF is a pair of frequency-dependent filters (one per ear) that, applied to a sound, produce the binaural signal that ear would actually receive given the sound's position in space. Capture an HRTF by recording impulses from many directions in an anechoic chamber with microphones in the listener's ear canals; convolve any future audio with the HRTF for the desired direction; play the result through headphones. Done well, the result is uncanny - you can hear a sound move behind you, over you, beneath you, all from a pair of stereo headphones.
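The "convolve with the HRTF" step looks like this in the time domain, where the filter pair is usually called an HRIR (head-related impulse response). The two impulse responses below are toy values I made up to mimic a source on the listener's right: the right ear hears it immediately, the left ear hears it later and quieter. Real HRIRs are hundreds of taps long and come from measurement.

```python
def convolve(signal, kernel):
    """Direct-form convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def binauralize(mono, hrir_left, hrir_right):
    """Apply one impulse response per ear to a mono source."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


if __name__ == "__main__":
    mono = [1.0, 0.5, 0.25]
    hrir_right = [1.0]                 # toy: near ear, no delay, full level
    hrir_left = [0.0, 0.0, 0.0, 0.6]   # toy: far ear, 3 samples late, quieter
    left, right = binauralize(mono, hrir_left, hrir_right)
    print("left :", left)
    print("right:", right)
```

Production engines do the same thing with FFT-based convolution and interpolate between measured directions as the source moves.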

But HRTF alone is direction. It doesn't give you distance, or a sense of place. For that you need room acoustics. A real sound in a real room doesn't arrive at your ears as a single signal. It arrives as:

  • Direct sound, the part that travels straight from source to ear.
  • Early reflections, the parts that bounced off one or two surfaces before reaching you. These arrive within ~80ms of the direct sound. They tell your brain about the room's size and shape.
  • Reverberant tail, the slowly-decaying mush of late reflections that have bounced off many surfaces. The decay rate (RT60: time for the tail to drop 60dB) tells your brain about the room's materials and volume.

VR audio engines synthesize these by convolution with a room impulse response (RIR). Capture the impulse response of a room (fire a short impulse, record what comes back), or compute one geometrically from a virtual room, and convolve every sound source with it. The result has direction (from the HRTF) and place (from the RIR). It's also expensive: full-quality convolution reverb on each source can cost more CPU than the rest of the audio pipeline combined.

A few more pieces of the audio puzzle that show up in VR engines:

  • Head tracking integration. When the user turns their head, the audio scene must rotate the opposite way relative to the listener - the world is fixed, the head moves through it. This rotation has to happen with the same low latency as the video pipeline; lag here causes the same nausea. Most engines apply head pose to audio at scanout time, just like timewarp does for video.
  • The precedence effect (Haas effect). When two sounds arrive within ~30ms of each other, the brain localizes to the first arrival and treats the second as a coloration of the first. This is how the brain disambiguates direct sound from early reflections in a real room. Spatial audio engines must respect it; injecting a "reflection" 100ms after the source breaks the illusion.
  • Ambisonics. A scene-based audio format that encodes a soundfield rather than channels. First-order ambisonics (B-format) uses four channels: an omnidirectional W plus three directional X/Y/Z components. Higher-order ambisonics adds more channels for finer angular resolution. Critically, an ambisonic recording can be rotated after capture - exactly what you want for head-tracked playback. YouTube 360 video uses first-order ambisonics; high-end VR uses third-order or higher.
  • Distance and air absorption. Level falls off with distance, and high frequencies attenuate faster than low ones. Both depend on how much air lies between source and listener. Engines model this with simple distance-attenuation curves and frequency-dependent low-pass filters.
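The "rotate after capture" property of ambisonics is worth seeing in code, because it is just a 2D rotation of two channels. This sketch assumes one traditional B-format convention (W scaled by 1/√2, x forward, y left, azimuth counterclockwise from the front). Other conventions weight and order the channels differently, so take the constants as assumptions.

```python
import math


def encode_first_order(sample, azimuth, elevation=0.0):
    """Encode one mono sample arriving from (azimuth, elevation) as W, X, Y, Z."""
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth) * math.cos(elevation)
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    return w, x, y, z


def rotate_yaw(bformat, angle):
    """Rotate the whole soundfield counterclockwise about the vertical axis."""
    w, x, y, z = bformat
    c, s = math.cos(angle), math.sin(angle)
    return w, x * c - y * s, x * s + y * c, z


def compensate_head_yaw(bformat, head_yaw):
    """The world stays put, so the field turns opposite to the head."""
    return rotate_yaw(bformat, -head_yaw)


if __name__ == "__main__":
    field = encode_first_order(1.0, math.radians(30))
    print(compensate_head_yaw(field, math.radians(30)))  # source now dead ahead
```

W and Z are untouched by a yaw rotation, which is why head-tracked playback of an ambisonic scene costs almost nothing per sample.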

Try the demo below. Two sources, both with HRTF panning. Drag them around the head, then turn them on (use headphones). Sweep reverb up to add a sense of place - at 0% the source is dry and direction-only; at 100% it's drenched in early reflections, like you're in a small stone room. Toggle HRTF off and notice how both sources collapse to flat left-right pan, locked inside your skull.

[Interactive demo: two draggable sources (1 and 2) around a head, with "front" marked; reverb slider, shown at 30%; HRTF toggle. Use headphones.]

One nuance worth being honest about: HRTFs are subtly different for every individual because every individual's head and outer ears are subtly different. The HRTF used by your browser's WebAudio implementation (which is what powers the demo) is a generic one, measured from a mannequin or averaged across many subjects. It works well enough for most people, badly for some. State-of-the-art VR systems are starting to scan your head and personalize the HRTF; Apple's Vision Pro does this with the front cameras. The personalized version sounds noticeably better - not in some abstract audiophile sense, but in the "front/back ambiguity goes away" sense.

4.2 The runtime layer

Up to this point we've been talking about pieces of the stack as if they live in a single application. They don't. On every modern headset, there is an operating-system-level runtime sitting between the app and the hardware, and it does much more than the average graphics API does on a desktop machine.

The standard interface for all of this is OpenXR, a Khronos specification that lets apps target any compliant runtime. Before OpenXR, every headset shipped with its own proprietary SDK and apps had to be ported by hand. OpenXR is the reason modern VR development looks anything like cross-platform. Let's walk through what the runtime actually does for an app, because the abstractions matter.

Reference spaces. When an app asks "where is the headset?", it has to specify relative to what origin. OpenXR defines several:

  • VIEW: relative to the headset itself. Useful for head-locked overlays.
  • LOCAL: a coordinate frame centered roughly where the user was when the app started. Survives for the session but doesn't track room boundaries.
  • STAGE: a coordinate frame fixed to the user's room (the "play area" or "guardian boundary"). Survives across sessions, persistent, tied to the headset's understanding of the physical space.
  • UNBOUNDED (extension): an open-ended frame that can extend across larger spaces, useful for walking experiences that go room-to-room.

All poses returned to the app are in some named reference space. The app never owns the "true" world origin; the runtime does. Two apps running back to back will agree on where the floor is because they both queried the runtime, not because they negotiated it.

Composition layers. The app doesn't render directly to the panel. It submits one or more layers to the compositor, which combines them into the final image. Layer types:

  • Projection layers: the app's main rendered stereo image. The headset's perspective.
  • Quad layers: a flat rectangle floating in the world. Used for UI panels, text, video. The compositor draws them sharp because it knows their exact 2D geometry - no need to render and resample through a perspective camera.
  • Cylinder layers: a curved rectangle wrapped around the user. Common for menus that surround the user.
  • Equirect layers: a 360° spherical image, used for skyboxes or spherical video.
  • Cube layers: a cubemap, mostly used for skyboxes.

Why would you submit text as a quad layer instead of drawing it into the projection layer? Because the compositor samples the quad's texture directly during its final pass, so the text is resampled once on its way to the panel instead of twice (once into the eye buffer, then again through pre-warp and timewarp). Text in a quad layer is consistently sharper than the same text rendered into the main scene. Most VR UI engines use quad layers for everything text-heavy.

The action system. Input doesn't come into the app as raw button presses. The app declares actions it cares about - "select", "grab", "menu", a "throw" pose, a "trigger amount" - and the runtime maps them to whatever physical inputs the user's hardware actually has. A "select" action might be the controller trigger on a Quest, the index-thumb pinch on a Vision Pro, or the system click on a wand. The app never has to know which. Apps that do this right run unchanged across any input modality. Apps that hardcode "trigger button" don't.

Predicted display time. When the app is asked to render a frame, the runtime tells it the time at which that frame will actually be displayed - typically 10-30ms in the future. The app uses that predicted time to ask the runtime for the head pose it should render with, NOT the current head pose. The renderer is essentially extrapolating the user's head motion forward to when the user will actually see the result. Combined with timewarp at scanout time, this two-stage prediction is what keeps motion-to-photon latency hidden.
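The two-stage prediction can be sketched with yaw alone. Real runtimes predict full 6-DOF poses with better motion models; this assumes constant angular velocity, and the function names are mine, not OpenXR's.

```python
def predict_yaw(yaw, yaw_rate, now_s, display_time_s):
    """Stage 1: extrapolate the head pose to the predicted display time."""
    return yaw + yaw_rate * (display_time_s - now_s)


def late_correction(rendered_yaw, latest_yaw, latest_rate, now_s, display_time_s):
    """Stage 2: residual rotation left for timewarp just before scanout.

    Re-predicts with a fresher sample and returns how far the frame's
    render-time guess was off.
    """
    return predict_yaw(latest_yaw, latest_rate, now_s, display_time_s) - rendered_yaw


if __name__ == "__main__":
    display_time = 0.020  # the runtime says this frame is shown 20 ms from now
    rendered_yaw = predict_yaw(0.10, 2.0, now_s=0.0, display_time_s=display_time)
    # 18 ms later the head turned a little faster than expected.
    residual = late_correction(rendered_yaw, 0.137, 2.5, 0.018, display_time)
    print("rendered with yaw:", round(rendered_yaw, 4))
    print("timewarp corrects by:", round(residual, 4), "rad")
```

The first prediction spans the whole pipeline and is allowed to be a bit wrong. The second spans only a couple of milliseconds, so what it leaves uncorrected is tiny.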

Spatial anchors. An app can ask the runtime to "anchor" a virtual object to a point in the physical world, in a way that survives drift correction, loop closure, and even sometimes session restarts. The runtime owns the anchor; the app gets back an anchor ID it can later query for an updated pose. This is how mixed-reality apps make virtual furniture stay put on your real coffee table.

Scene understanding. Modern headsets emit a triangle mesh of your room - walls, floor, ceiling, major furniture. Apps query this mesh through the runtime for things like collision (a virtual ball bounces off a real wall), occlusion (a virtual character walks behind the real couch), and spatial-audio reflections.

Composition modes. Each layer the app submits specifies how it should be combined with the layers below it: alpha-blended, additive, or depth-tested. Passthrough (the camera feed of the real world) usually goes underneath everything as the bottom layer. UI usually goes on top. The runtime handles this composition once per frame, late in the pipeline, after all timewarp and reprojection have happened.
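
For a single pixel, the bottom-to-top combination can be sketched like this. It assumes non-premultiplied colors in the 0 to 1 range and covers only the alpha and additive modes; depth-tested composition is left out, and the (rgb, alpha, mode) tuple layout is a made-up representation for illustration.

```python
def composite_pixel(layers):
    """Combine layers bottom-to-top for one pixel.

    Each layer is (rgb, alpha, mode) where mode is "alpha" or "additive".
    rgb components and alpha are floats in [0, 1], not premultiplied.
    """
    out = (0.0, 0.0, 0.0)  # start from black underneath the bottom layer
    for rgb, alpha, mode in layers:
        if mode == "alpha":
            # Standard "over" blend: the new layer covers what is below it.
            out = tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(rgb, out))
        elif mode == "additive":
            # Adds light on top of what is below, clamped to the displayable range.
            out = tuple(min(1.0, d + alpha * s) for s, d in zip(rgb, out))
        else:
            raise ValueError(f"unknown composition mode: {mode}")
    return out


# Passthrough at the bottom, a half-transparent scene layer, additive UI glow on top.
pixel = composite_pixel([
    ((0.2, 0.4, 0.6), 1.0, "alpha"),
    ((1.0, 0.0, 0.0), 0.5, "alpha"),
    ((0.0, 0.5, 0.0), 1.0, "additive"),
])
```

Order matters for the alpha mode, which is why the app has to submit layers in a defined stacking order rather than as an unordered set.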

One useful mental model: the runtime is the operating system of the headset's spatial domain. The OS-of-the-OS, in a sense. Just as desktop operating systems abstract over different physical screens, keyboards, and disks, the VR runtime abstracts over different physical bodies, rooms, and sensors. Your app doesn't care what tracking algorithm is being used or what kind of lens distortion the headset has, any more than your text editor cares whether your monitor is an LCD or an OLED.

Below: an exploded-view 3D diagram of what the compositor is actually combining for every panel frame. Drag the slider to zero to collapse the layers into the final image; orbit with your mouse.

Drag the slider to zero and the layers collapse into the single panel image the user actually sees. Drag inside the view to orbit it in 3D.

4.3 What's still broken

We've spent a long time on the things that work. Here's a more honest list of the things that don't, ordered roughly from most-likely-to-be-fixed-soon to most-likely-to-take-decades.

Vergence-accommodation conflict. We covered this in Part 1. Varifocal optics are the fix, and at least four flavors are in research: mechanically actuated panels, liquid lenses, multi-focal-plane stacks, and light field displays. Meta's Half Dome prototypes have been shown publicly for years; Magic Leap shipped a two-state varifocal in their first headset and abandoned it in their second. The technology works in the lab; productionizing it at consumer cost and weight has so far defeated everyone. I expect this to ship in some flagship within five years.

Light field displays. The asymptotic answer to the V-A conflict. Instead of emitting a flat 2D image, a light field display emits a directionally varying field of light - many rays per pixel, one for each output direction. The eye reconstructs depth from the ray bundle, naturally and continuously, without any need for varifocal optics. Researchers have built impressive prototypes (Stanford's near-eye light field, NVIDIA's research displays). The problem is that "many rays per pixel" requires many more panel pixels: every direction you sample is paid for out of the same pixel budget, so spatial resolution drops by a factor equal to the angular sampling, often 5-10×. We're not close to a consumer-grade light field VR headset.

Holographic optical elements (HOEs). These use diffractive nanostructures to do the work of a lens in a flat, transparent surface. They could in principle replace the entire bulk of a VR or AR optic with a sheet of patterned glass. Microsoft and several startups have shown impressive lab demos. The challenges are efficiency (they throw away most of the light) and chromatic uniformity (different wavelengths diffract at different angles, so colors separate much as they do through a prism).
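
The chromatic problem falls straight out of the grating equation, sin(θ) = m·λ/d at normal incidence, where d is the grating pitch and m the diffraction order. A small sketch, with an illustrative 1000nm pitch and rough red, green, and blue wavelengths that I picked for the example rather than took from any real device:

```python
import math


def diffraction_angle_deg(wavelength_nm, pitch_nm, order=1):
    """Diffraction angle at normal incidence: sin(theta) = order * wavelength / pitch.

    Returns None when the requested order does not propagate (|sin| > 1).
    """
    s = order * wavelength_nm / pitch_nm
    if abs(s) > 1.0:
        return None
    return math.degrees(math.asin(s))


# Illustrative values only: rough blue, green, red wavelengths and a 1000 nm pitch.
angles = {
    name: diffraction_angle_deg(wl, pitch_nm=1000.0)
    for name, wl in (("blue", 460.0), ("green", 530.0), ("red", 630.0))
}
```

With these numbers red leaves the grating more than ten degrees away from blue, so a single diffractive element steers each color somewhere different unless the design compensates for it.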

Form factor. Every VR headset is too heavy and too warm. Pancake lenses helped a little. Micro-OLED helps a little more. But you are still strapping a small computer to your face, and there are limits to how much you can do about that until either silicon gets dramatically more efficient or the rendering work moves off-device entirely. Remotely rendered VR exists (NVIDIA CloudXR streams frames from a server; Meta's Air Link does a local version of the same thing from a nearby PC) but it introduces a wireless link and another round-trip in the latency budget - which, as we've established, is not where VR has slack.

True optical see-through AR. Most current "mixed reality" headsets use camera-based passthrough - they show you a video feed of the world from cameras mounted on the headset. It's not the real world; it's a low-resolution, slightly delayed video version of the real world, and your visual system knows. Optical see-through, where you look at the actual world through transparent waveguides, is what actually solves this. But today's see-through optics have problems of their own: field of view is dismal (40-60° on the best devices), brightness doesn't compete with sunlight, and the cost is absurd. This may be what the next generation of devices does well, or it may take longer. The Magic Leap 2, HoloLens 2, and various startups are chipping away.

Neural rendering. A more recent thread. Gaussian splatting represents scenes as millions of tiny anisotropic Gaussian blobs, with position, color, and shape parameters learned from input images. The result renders in real time at very high quality and offers continuous view-synthesis from arbitrary new positions. Several VR demos using Gaussian splat scenes have shipped in the last year. Earlier Neural Radiance Field (NeRF) approaches were too slow for VR; Gaussian splatting fixes that. The open question is whether captured-real-world scenes done this way can replace traditional rendered content for some classes of VR experience. Tourism, cinema, telepresence all look possible.
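
To give a feel for what "rendering Gaussians" means, here is a heavily simplified sketch of how one pixel's color is built up. It assumes the splats have already been projected to 2D screen space and sorted near-to-far, and it uses a constant color per splat; real implementations project 3D covariances, use view-dependent color, and do all of this in tiles on the GPU. The tuple layout and function names are hypothetical.

```python
import math


def splat_alpha(pixel, center, cov, opacity):
    """Opacity contributed by one 2D Gaussian at a pixel.

    cov is a symmetric 2x2 covariance ((a, b), (b, d)); its shape is what
    makes the blob anisotropic.
    """
    dx, dy = pixel[0] - center[0], pixel[1] - center[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Mahalanobis distance squared, using the closed-form 2x2 inverse.
    m2 = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return opacity * math.exp(-0.5 * m2)


def render_pixel(pixel, splats):
    """Front-to-back alpha compositing over splats sorted near-to-far.

    Each splat is (center, cov, opacity, rgb).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for center, cov, opacity, rgb in splats:
        alpha = splat_alpha(pixel, center, cov, opacity)
        for i in range(3):
            color[i] += transmittance * alpha * rgb[i]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:
            break  # everything behind this point is effectively hidden
    return tuple(color)
```

Because each pixel is just a sorted sum of cheap exponentials, with no per-sample neural network evaluation, this style of renderer maps well onto rasterization-like GPU work.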

Social presence. All the technology we've discussed is about delivering a convincing world to a single user. Making two users feel like they share that world - eye contact, facial expression, fine motor mimicry, the unconscious body language that makes another person feel present - is a much harder problem, and one I think VR has not really solved at all. The face-tracking cameras in Quest Pro and Vision Pro are a step. The full-body capture systems in research are another. But the gap between "I am in a world" and "I am with another person in a world" is enormous, and most of the stack we've talked about doesn't address it. I suspect this is the problem that defines the next decade of the field, more than any optical or rendering advance.

What I want you to take away

Every layer of the VR stack is shaped by the layer above it, and the thing at the top of the stack is your eye. Foveated rendering exists because your acuity drops off outside the fovea. Variable rate shading exists because foveated rendering is only cheap with hardware support. Timewarp exists because your vestibular system can detect 20ms of mismatch. Lens pre-warp exists because you can't put a thick lens between a 2cm panel and a human face. Spatial audio exists because you localize sound the same way you localize light - by reconciling slightly different signals through learned priors.

None of these are arbitrary engineering choices. They're all consequences of being inside a particular kind of body. The shape of the technology is the shape of human perception, traced backwards into silicon.

That's what I find beautiful about the whole field. We're building hardware whose design constraints come almost entirely from neuroanatomy - from facts about the retina and the inner ear and the visual cortex that are older than fire. The headset on your face is, in a strange way, a fossil of your own perception, made tangible.

And it doesn't quite work yet. That's okay. The point of looking at the stack like this is that you can now see exactly where it doesn't work, and why, and what would have to change. Most of the unsolved problems are not mysterious. They're hard. There's a difference, and it's a hopeful one.

