Table of Contents:
What is color accuracy?
What's wrong with the rendered color matching the baseColor?
How does rendering compare to photography?
How does tone mapping work?
How do we validate a glTF 3D render?
What role does perception play?
How do we validate a glTF model?
What's the takeaway?
Interactive 3D models are the next media type, following images and video, and as such <model-viewer> and other renderers are being used more often to display these 3D models in commercial settings, museums, and many other contexts. These users have a deep interest in ensuring the presented pixels accurately represent the real object the model is based on. As such, a quality assurance process is needed to ensure both that the 3D model itself has been designed accurately, and that the presentation of that model is appropriate for realism. While your process may differ based on your situation, this document is intended to give the background necessary to set up such a process and to set expectations.
Khronos' glTF is the first 3D model format to specify physically-based rendering (PBR), meaning it contains material properties that define, in real-world units, how light should be reflected and refracted in terms of physics. This means renderers are free to innovate in GPU shaders to create ever more accurate approximations of the underlying physics, because glTF does not specify any single approximation. It also means that while different renderers may make different tradeoffs of accuracy vs. speed, you can have confidence that your glTF will look consistent (though not pixel-identical) even across unrelated codebases. We call glTF the JPEG of 3D because it is compressed for efficient web delivery and can be rendered consistently by a large number of viewers.
At first blush, it may appear that different glTF viewers are not consistent, but this is generally due not to rendering differences but to default scene setup. Physically-based rendering means the scene takes environment light as input just like a real camera does, so just as with getting a consistent photo, you need not only the same object but also the same lighting and camera settings. There is no standard for the default settings of these viewers, so it is important to intentionally set them to consistent values. This is precisely what we've done to show the state of glTF rendering convergence across a variety of popular renderers.
The goal of PBR is to create color-accurate images at frame rate as a user interacts with the 3D model. Of course a renderer can only be as accurate as the input 3D model, so a process for deciding whether the 3D model in fact accurately represents the real-world object it's based on is also crucial, so that any errors can be fixed. The most correct way to do this would be to set up a photo shoot of the real object, capture the environment lighting around it in its full dynamic range, record the camera settings and position, then set up an identical rendered scene and compare the output images. Unfortunately this tends to be prohibitively expensive.
Even in this idealized comparison scenario, there is a non-trivial issue: what metric does one use to compare the images? Simple pixel comparison metrics like PSNR tend to give too much weight to differences that are not noticeable. Thus perceptual metrics are better, but also more arbitrary and harder to define.
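To make that tradeoff concrete, here is a minimal sketch of PSNR, assuming 8-bit RGB images held in NumPy arrays; nothing about it is specific to any particular renderer.

```python
# A minimal sketch of PSNR, the kind of pixel-difference metric that can
# over-weight differences a viewer would never notice.
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels; higher means more similar."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# e.g. a barely visible uniform brightness offset of +3 on every channel:
img = np.full((720, 1280, 3), 128, dtype=np.uint8)
print(psnr(img, img + 3))   # ~38.6 dB
```

A metric like this weights every pixel equally: a nearly invisible global exposure shift costs as much as the same error concentrated on the product itself, which is why perceptual metrics, for all their arbitrariness, track what a viewer actually notices more closely.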
Since many products are designed with RGB material specs (like paint), a common idea is to simply reflect this in the baseColor of the glTF. This can be a good approach to color-accurate modeling, provided the RGB value is in the proper color space. The glTF spec says the baseColor factor (normalized between zero and one) is exactly the fraction of the given channel (wavelength) of light reflected by the material. The baseColor texture is the same, but put through the sRGB transfer function to extract linear values first. It should not be assumed that a given paint swatch RGB value is defined the same way.
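As an illustration, here is a minimal sketch of the sRGB decode step the spec describes, assuming 8-bit texture channel values; a paint swatch specified as display sRGB would need exactly this conversion before being treated as a linear reflectance factor.

```python
# A minimal sketch of the sRGB transfer function the glTF spec applies to
# baseColor textures: stored 8-bit values are decoded to linear reflectance
# before any lighting math.
def srgb_to_linear(c8: int) -> float:
    c = c8 / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

# e.g. a "mid-grey" swatch of sRGB 128 is only about 21.6% linear reflectance:
print(srgb_to_linear(128))  # ~0.216
```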
When it comes to verifying that the render is color-accurate, the first idea is often to check that the output rendered image has the same RGB pixel values as the glTF baseColor (or the expected paint swatch RGB). This is a fundamentally incorrect expectation, as it negates the purpose of physically-based rendering! Details and examples follow to support this assertion and to point the way towards a more useful verification scheme.
The most important thing to understand about PBR is that it accurately represents the interplay between incident light and material properties, of which there are several beyond baseColor; the most important are metalness and roughness. However, the rendered output for a given pixel is only RGB, which means that if it matched the baseColor RGB, then by definition the incident light and the other material properties could not in any way affect the resulting image.
Let's start with a simple example: six spheres with uniform materials. The top row are white (baseColor RGB: [1, 1, 1]), while the bottom row are yellow (baseColor RGB: [1, 1, 0]). From left to right they are shiny metal (metalness: 1, roughness: 0), shiny plastic (metalness: 0, roughness: 0), and matte plastic (metalness: 0, roughness: 1). The left-most can be thought of approximately as polished silver and gold.
Note how different materials with the same baseColor render differently. Which pixels match the baseColor RGB? In fact, if you really want the rendered pixels to match the baseColor RGB values, glTF has an extension specifically for this: KHR_materials_unlit. This extension is not physically-based, and so is appropriate for things like labels and 3D scans that produce only RGB textures with all applied lighting baked in as part of the capture process. This is how the above model looks with the unlit extension:
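For reference, here is a minimal sketch of how such materials might be declared in glTF, written as Python dicts mirroring the JSON. The property names follow the core spec and the KHR_materials_unlit extension; a complete asset would also list the extension in extensionsUsed, and the geometry and scene structure are omitted.

```python
# A sketch of two of the sphere materials above, not a complete glTF file.
import json

shiny_gold = {
    "name": "shiny metal (gold-like)",
    "pbrMetallicRoughness": {
        "baseColorFactor": [1.0, 1.0, 0.0, 1.0],  # linear RGBA
        "metallicFactor": 1.0,
        "roughnessFactor": 0.0,
    },
}

unlit_yellow = {
    "name": "unlit yellow",
    "pbrMetallicRoughness": {"baseColorFactor": [1.0, 1.0, 0.0, 1.0]},
    # Lighting is ignored; rendered pixels take the baseColor directly.
    "extensions": {"KHR_materials_unlit": {}},
}

print(json.dumps({"materials": [shiny_gold, unlit_yellow]}, indent=2))
```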
Now that it's clear that lighting is important in making a 3D model look realistic, the next common idea is to choose a nice uniformly neutral lighting scenario to make the output RGB values "close to" the intended baseColor. Well, it's easy to produce a uniform lighting environment, but the results may be surprising for PBR:
It may look like the white spheres have vanished, but they're still present - tilt the view and you can see them occluding the yellow spheres. In fact they are just perfectly camouflaged. This scene is known as a furnace test, which is used to check a renderer's energy conservation (ours passes). It can be shown with physics that under perfectly uniform lighting, these white spheres should each uniformly reflect exactly the same light as is incident from the environment, and hence be indistinguishable from it.
Note this result with the yellow spheres is actually pretty close to the unlit result - there's almost no discernible difference between shiny and matte or metal and plastic. This is in fact accurate; if you really found a place with perfectly uniform lighting, you wouldn't be able to tell the difference between those materials either (assuming you could somehow also hide your own reflection - you are also a part of your lighting environment). The reason this looks so unreal is that it is nearly impossible to find an environment like this. Real environments have texture, and the reflections of that texture are what we use to understand shininess.
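The furnace-test claim is easy to check numerically. Below is a small Monte Carlo sketch, assuming a Lambertian (perfectly matte) surface and a constant environment radiance L; the reflected radiance comes out to albedo times L, so a white surface exactly matches its background.

```python
# A minimal Monte Carlo sketch of the furnace-test argument: a Lambertian
# surface with a given albedo, lit by perfectly uniform environment radiance L,
# reflects exactly albedo * L in every direction.
import numpy as np

rng = np.random.default_rng(0)
albedo, L, n = 1.0, 1.0, 200_000

# Uniform hemisphere sampling: cos(theta) is uniform in [0, 1], pdf = 1 / (2*pi).
cos_theta = rng.uniform(0.0, 1.0, n)
brdf = albedo / np.pi                                      # Lambertian BRDF
estimate = np.mean(brdf * L * cos_theta / (1.0 / (2.0 * np.pi)))

print(estimate)  # ~1.0 = albedo * L: the white sphere matches the environment
```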
Next let's return to the original version, which uses <model-viewer>'s default "neutral" environment-image. Details of how this environment was designed can be found here and here. The lighting is intended to be even (coming from all sides), though not uniform, thus providing enough texture to discern material types. It is purely grayscale, thus not shifting the hues of the materials. This is as opposed to indoor lighting, which might skew yellow, outdoor lighting, which might skew blue, or sunsets, which skew red. PBR will faithfully produce the colors a camera would capture in these scenarios, which of course will look different from the same object under neutral lighting.
Note that the top-right ball is what we might call paper-white: a perfect matte reflector. However, despite the white baseColor (sRGB: [255, 255, 255]), notice that the rendered color varies from [177, 177, 177] to [237, 237, 237], never achieving pure white. Why is this? Notice that the top-middle ball does have some reflections off its shiny surface that are pure white. These specular reflections are in addition to its diffuse reflection (which is all you get from a matte surface), so if the diffuse reflection had already saturated our pixels, it would be impossible to discern a shiny object from a matte one. Try increasing the exposure slider to see the effect, which in photography is referred to as overexposure.
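For orientation, here is a minimal sketch (not <model-viewer>'s actual code) of where that exposure slider acts: it scales the linear HDR radiance before the nonlinear tone map discussed next, and the sRGB transfer function is applied last, purely for display encoding.

```python
# A sketch of the ordering of the final rendering steps for a single channel.
def linear_to_srgb(c: float) -> float:
    """Inverse of the sRGB decode shown earlier; input and output in [0, 1]."""
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

def render_pixel(hdr_radiance: float, exposure: float, tonemap) -> float:
    # Exposure is a plain linear multiplier; tonemap is the nonlinear step
    # discussed in the next section (see the ACES sketch below).
    return linear_to_srgb(tonemap(exposure * hdr_radiance))
```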
You might notice that exposure appears to affect the midrange values more than the blacks and whites (despite exposure being a linear light multiplier). You would be correct, and this is caused by the nonlinear tone mapping step that happens last in the rendering pipeline. Tone mapping is a complicated topic in its own right, but it is vital to understanding PBR. Before we get into that, let's begin by comparing the rendering and photography pipelines.
3D rendering, especially PBR, is designed to mimic photography, and photography in turn is designed to mimic the human eye and brain. The brain part is essential, as the goal of a realistic photo is to evoke the same perception by looking at it as one would have looking at the original scene. This is difficult as the light reflected by a printed photo or emitted by a display is dramatically less intense and has less contrast than the real world. Even HDR displays have orders of magnitude less contrast than your eye sees on a normal outdoor day.
Thankfully, our brains do a lot of adjustment to our perception, including correcting for contrast. This allows us to print photos with a very compressed dynamic range while still giving the perception of e.g. a real sunset. This compression of dynamic range we'll refer to as tone mapping. In photography you can think of this as the conversion from the camera's raw image (which tends to look washed out) to the final image. It becomes even more important in modern photography with exposure stacking, where a higher dynamic range raw image can be produced than the sensor is capable of in a single shot.
In 3D rendering, there is no sensor and computations are done in floating point, which means the raw image is effectively full HDR, with even more range than is generally possible with exposure stacking. Looking at a histogram of this raw image will often show a very long tail, representing the small, shiny glints that are orders of magnitude more intense than the rest of the scene. In order to maintain perception while compressing down to SDR, a nonlinear tone mapping curve is used.
Tone mapping is a general term, which can refer to basically any color conversion function. Even separate operations that a photographer might apply in post-processing, such as gamma correction, saturation, contrast, and even things like sepia, can all be combined into a single resulting function that here we're calling tone mapping. However, we are only interested in hue-neutral functions, and those are the only type we'll be discussing here. Therefore the focus will be primarily on luma, or brightness.
The tone mapping function used by <model-viewer> is ACES, which is a standard developed by the film industry and is widely used
for 3D rendering. Like most tone mapping curves, it is fairly linear in the
for 3D rendering. Like most tone mapping curves, it is fairly linear in the
central focus of its contrast range, then asymptotes out to smoothly
compress the long tails of brights and darks into the required zero to one
output range, the idea being that humans perceive less difference between
over-bright and over-dark zones as compared to the bulk of the scene.
However, since some output range is reserved for these extra-bright
highlights, the range left over to represent the input range of matte
baseColors is also reduced somewhat. This is why the paper-white sphere does
not produce white pixels.
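The full ACES transform is fairly involved, but a widely used single-curve approximation (the Narkowicz fit, shown below as a sketch; not necessarily the exact fit <model-viewer> ships) captures the shape just described.

```python
# A widely used approximation of the ACES filmic tone map (the Narkowicz fit):
# roughly linear through the midrange, smoothly compressing brights and darks.
def aces_approx(x: float) -> float:
    a, b, c, d, e = 2.51, 0.03, 2.43, 0.59, 0.14
    y = (x * (a * x + b)) / (x * (c * x + d) + e)
    return min(max(y, 0.0), 1.0)

print(aces_approx(1.0))  # ~0.80: fully lit paper-white keeps headroom for highlights
print(aces_approx(4.0))  # ~0.97: much brighter specular values compress toward 1
```

With this curve, linear 1.0 sRGB-encodes to roughly 231 out of 255, consistent with the not-quite-white values seen on the matte sphere above (the exact numbers depend on the fit and the lighting).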
Sometimes when working with matte objects and trying to compare output color to baseColor, this tone mapping compression will be noticed and identified as the source of the apparent tone discrepancy. The immediate thought is usually, let's fix this by not applying tone mapping! The problem is there is actually no such thing as "no tone mapping", since somehow the unbounded input range must be converted to the zero-to-one range that the encoder expects. If this step is not done, the encoder simply clamps the values, which amounts to a piecewise-linear tone mapping function with sharp corners that introduce perceptual errors for shiny objects, as shown in the example below.
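To see why, consider a sketch of what the encoder's clamp does when treated as a tone curve:

```python
# "No tone mapping" still maps unbounded input to [0, 1]; the encoder's clamp
# is just a piecewise-linear curve with a hard corner at 1.
def clamp_tonemap(x: float) -> float:
    return min(max(x, 0.0), 1.0)

# A fully lit matte white (1.0) and a specular glint 20x brighter collapse together:
print(clamp_tonemap(1.0), clamp_tonemap(20.0))  # 1.0 1.0
# The ACES fit above keeps them apart (~0.80 vs ~1.0), preserving specular headroom.
```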
Note that once again the shiny and matte white plastic spheres are indistinguishable, even without cranking up the exposure. Since half of the matte white sphere is now rendering pure white, there is no headroom for shiny highlights. Likewise, the top half of the sphere loses its 3D appearance since the shading was removed by clamping the values. Tick the checkbox to go back to ACES tone mapping for a quick comparison. Remember to look away and back again after switching; another trick of human perception is how dependent it is on anchoring. The yellow will look washed-out immediately after switching from saturated yellow, but this perception fades after looking around.
This example also highlights a second key element of good tone mapping functions: desaturating overexposed colors. Look at the golden sphere (lower-left) and compare to the previous version with ACES tone mapping applied. The baseColor of a metal multiplies the incoming light, so a white light on a golden sphere produces a yellow reflection (fully saturated yellow, in this case of a fully saturated baseColor). With clamped tone mapping, the highlight is indeed saturated yellow, but this does not look perceptually right, even though you could make the argument it is physically correct.
Good tone mapping curves like ACES not only compress the luma, but also push colors toward white the brighter they are. This is why the highlights on the golden sphere become white instead of yellow. This follows both the behavior of camera sensors and our eyes when responding to overexposed colored light. You can see this effect simply by looking at a candle's flame or a spark, the brightest parts of which tend to look white despite their color. Nvidia has helpfully provided more details on tone mapping and HDR for the interested reader.
One final way to avoid tone mapping that is sometimes suggested is to choose an exposure such that all pixels are inside the [0, 1] range, such that value clamping is avoided. For matte objects with low-dynamic-range lighting, this can give semi-decent results, but it breaks down completely for shiny objects, as shown in the following screenshot.
The trouble is that the specular highlights are orders of magnitude brighter than the majority of the scene, so to fit them into the output range requires the exposure to be lowered by more than a factor of 50. This kills the brightness and contrast of the majority of the scene, because of just a few small highlights. And this neutral environment does not have very high dynamic range; if you were to use an outdoor environment that includes the sun, the exposure would have to be so low that nearly the entire render would be black.
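A small sketch with hypothetical radiance values makes the problem concrete: the exposure needed to avoid clamping is dictated entirely by the single brightest glint.

```python
# A sketch of "avoid tone mapping by lowering exposure until nothing clips".
import numpy as np

hdr = np.full((256, 256), 0.5)   # hypothetical scene: mostly mid-grey radiance
hdr[:4, :4] = 30.0               # a few specular glints ~60x brighter

exposure = 1.0 / hdr.max()       # fit everything into [0, 1] with no clamping
print(exposure)                  # ~0.033: a >30x reduction in brightness
print(0.5 * exposure)            # the bulk of the scene drops to ~0.017 -- nearly black
```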
Everything shown here is rendered to an 8-bit sRGB output, but HDR displays and formats are getting more common. Might we be able to avoid tone mapping by keeping the HDR of our raw image in an HDR output format? The short answer is no, because HDR displays may be high dynamic range compared to traditional SDR, but they are still orders of magnitude short of what our eyes experience in the real world, so all the same reasons for tone mapping still apply. However, it is important to note that the choice of tone mapping function should be dependent on the output format. Ideally it would even depend on the display's contrast ratio, its brightness settings, and the level of ambient lighting around it, but this data is unlikely to be available.
Hopefully the preceding discussion has convinced you that simply validating some rendered pixel color against the object's "correct" color is not a useful or valid process. If only it were so simple! Instead you need to look at your whole pipeline and consider what your end goals really are.
For the sake of measurability, let's consider the end goal to be minimizing the return rate of products bought online where the shopper's primary visual cue is a 3D model (let's assume we're making a decent enough render that we're not destroying the purchasing rate). To succeed, we must render images that the shopper perceives to be equivalent to their perception of the actual product when it arrives. This is exactly the same goal as a photograph of this product in a magazine or on a website.
There is a much longer history of product photography, so let's start there. What does a product photographer's pipeline look like? Obviously they vary, but the overall steps will tend to look like: 1) lighting setup, 2) camera setup, 3) take the photos, 4) post-processing. Ideally, it is possible to capture this lighting environment, convert the camera setup to matrices and exposure, use the same post-processing, author the glTF according to measured material properties, and output a render that will match the photo quite well (especially with a path-traced renderer).
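As a sketch of what "convert the camera setup" might involve (the numbers below are hypothetical): the vertical field of view follows from focal length and sensor height, and aperture, shutter, and ISO combine into a photographic exposure value; how that EV maps onto a particular renderer's exposure scalar depends on its conventions.

```python
# Turning a recorded camera setup into renderer inputs (illustrative values).
import math

focal_length_mm = 85.0
sensor_height_mm = 24.0          # full-frame sensor height
fov_y = 2.0 * math.degrees(math.atan(sensor_height_mm / (2.0 * focal_length_mm)))
print(f"vertical FOV: {fov_y:.1f} degrees")   # ~16.1

f_number, shutter_s, iso = 8.0, 1.0 / 125.0, 100.0
ev100 = math.log2(f_number**2 / shutter_s * 100.0 / iso)
print(f"EV100: {ev100:.2f}")     # ~12.97; mapping EV100 to an exposure scalar
                                 # is renderer-specific
```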
It is tempting to blame any observed discrepancies on the renderer, especially real-time rasterizers, but in fact a lot of research has gone into making them amazingly physically accurate and consistent, as you can see in our fidelity comparisons. Nearly always the most serious errors are differences in lighting, materials, exposure, and post-processing.
Let's consider for a moment what happens in the photography post-processing step. Some things don't apply to rendering, like masking out shadows to make them semi-transparent - a 3D renderer can do this automatically. In addition to color-neutral tone mapping, sometimes the colors are intentionally "corrected". Why? After all, if the measured light from the actual scene is not correct, what is?
It may be that some of the post-processing color correction is simply a matter of expediency. Before digital photography, to get the right look, the lighting and scene had to be adjusted, which involves manual labor. Digital post-processing can allow the lighting to be less precise. However in 3D rendering, the lighting is equally digital, so it is generally a better practice to keep the post-processing step simple (e.g. just ACES tone mapping) and adjust the environment image if necessary. Since 3D rendering is happening completely automatically in real time, there is no way to make manual bespoke color adjustments for each frame.
As with photography, the goal is the viewer's perception of the image. There is no mathematical metric that will represent human perception as well as the perception of an actual human. Isolated pixel metrics are problematic because much of human perception is affected by the background and surroundings of the object. This is because our brains are effectively trying to remove the effect of lighting and allow us to perceive the underlying material properties consistently. Our brains estimate what lighting to remove based on the background.
When rendering a product in AR using the camera image as the background, to get the most consistent perception of the product, its actual rendered pixels need to change color, ideally according to the actual local lighting environment. This helps ensure that when our brain "corrects" for the lighting, our perception returns to the real object's properties. Of course this is only as good as the AR lighting estimation algorithms, but they are improving. Ideally, the same tone mapping and white balance that's applied to the camera's raw image should be used in the rendering pipeline as well. Unfortunately, it may be difficult to access this data.
Of course human perception is not uniform, which makes all of this even more difficult. An excellent example is the famous dress photo which is actually a blue and black dress in an overexposed photo under yellow lighting. Even though the pixel colors are roughly purple and brown, more than half of people correctly perceived the dress as blue and black, though another 30% perceived it as white and gold. Obviously this kind of ambiguous image is not desirable in a commerce setting, but it is useful for understanding how much perceived color can differ from pixel color. The Scientific Explanations section of that article is full of interesting insights.
Generally speaking, part of the purpose of 3D rendering is to avoid the cost of photography, by making the entire process digital. The 3D model is costly for an artist to create, but it persists for creating many different photo shoots and interactive experiences quickly. So how can this 3D model be verified as accurate to the physical product? Of course dimensions and shape can be measured (glTF is always in meters, so check your units), but we'll be focusing on material properties here, like baseColor and roughness.
The most accurate way to choose material properties is to measure them, since all glTF materials are defined in physical terms. However, these measurements are difficult to make. There are products on the market that will scan an inserted material swatch and fit these properties to it, and hopefully these machines will become more common. Modern 3D scanning algorithms can back out material properties, but these machine learning systems are not perfect and are difficult to benchmark. Until these technologies are further along, we may not have access to properly measured material properties.
The most common workflow for creating materials is that a 3D artist manually adjusts the properties until the render looks right. Ideally the artist has the physical object on their desk and compares to their render, thus focusing on perception rather than pixel values. However, it is of utmost importance that their authoring software is set up to render the same way as the object will be presented to users. This is because when the render doesn't look right, it could be that either the materials or the lighting needs to change. If the materials are measured one can safely focus on changing the lighting, but if not, it is very helpful to have chosen a fixed lighting environment ahead of time to remove that variable. Equally important is that the tone mapping function used is consistent. If you're not sure what is being used, it's probably inconsistent.
Ideally the artist should check the rendering of the model under a variety of lighting environments. Certain types of lighting will tend to hide or expose certain material property errors. At minimum, it's a good idea to test with both a neutral indoor environment and with a sunny outdoor environment. The sun will produce very high dynamic range, as well as colored lighting, which will help with detecting a variety of possible material problems. These scenarios will tend to bracket a wide variety of real lighting situations that might be applied for a virtual photo shoot or in AR, which should give confidence the asset can be reused broadly.
Finally, all the usual caveats apply regarding colorspaces and differences between displays. The glTF format specifies the use of the sRGB colorspace internally, which is also the most common output format, but with HDR displays becoming more common this is likely to become more complicated soon.
Unfortunately there are no easy answers, as color and perception are complex, even more so than physically-based rendering. Hopefully this background gives a framework on which your authoring and validation processes can be built. The most important thing to remember about PBR is that the whole point is for the rendered colors to vary compared to the material baseColor, which is what allows the demo below to be compellingly realistic.