the more accurately the metric predicts human subjective scores. In this fashion, Figure 2 tells us that VMAF is a superior metric.
What’s interesting is that every time a metric
is released, it comes with a scatter graph much
like that shown on the left. SSIMPLUS has one,
TekMOS has one, and Tektronix’s older metrics, PQR and ADMOS, had them as well. This
is not to cast doubt on any of their results, but
to observe that all of these metrics are highly
functional and generally correlate with subjective ratings more accurately than PSNR.
However, accuracy is not the only factor to
consider when choosing a metric. Let’s explore
some of the others.
Referential vs. Non-Referential
One critical distinction between metrics is
referential vs. non-referential. Referential metrics compare the encoded file to the original to
measure quality, while non-referential metrics
analyze only the encoded file. In general, referential metrics are considered more accurate, but they can be used in much more limited circumstances, since the source file must be available.
Non-referential metrics can be applied anywhere the compressed file lives. As an example,
TekMOS is included in the Tektronix Aurora
platform, an automated quality control package that can assess visual quality, regulatory
compliance, packaging integrity, and other errors. Telestream subsidiary IneoQuest developed iQ MOS, a non-referential metric that can
provide real-time quality assessments of multiple streams in the company’s line of Inspector products.
So when choosing a metric, keep in mind
that it might not be available where you actually want to use it. Referential metrics are typically used where encoding takes place, while
non-referential metrics can be applied anywhere the video on demand (VOD) file exists,
or where a live stream can be accessed.
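To make the referential idea concrete, here is a minimal Python sketch (using NumPy on synthetic frames rather than real video) of PSNR, which needs both source and encoded pixels; a non-referential metric would have to work from the second array alone.

```python
# Referential measurement sketch: PSNR requires the source AND the
# encoded frame. Frames here are synthetic 8-bit luma planes.
import numpy as np

def psnr(source: np.ndarray, encoded: np.ndarray, peak: float = 255.0) -> float:
    """Referential: quality is the error between source and encoded."""
    mse = np.mean((source.astype(np.float64) - encoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * np.log10(peak ** 2 / mse)

# Synthetic "source" and a lightly noised "encoded" version.
rng = np.random.default_rng(0)
src = rng.integers(0, 256, size=(720, 1280), dtype=np.uint8)
noise = rng.integers(-2, 3, size=src.shape)
enc = np.clip(src.astype(np.int16) + noise, 0, 255).astype(np.uint8)
print(round(psnr(src, enc), 1))  # roughly 45 dB for this mild noise
```

A non-referential metric, by contrast, would estimate quality from `enc` alone, which is why it can run anywhere the compressed file or stream is accessible.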
When choosing a metric, it’s important to understand exactly what the scores represent and
what they don’t. For example, with the SSIMPLUS metric, which runs from 1–100, a score
from 80–100 predicts that a subjective viewer
would rate the video as excellent. These subjective ratings drop to good, fair, poor, and bad in
20-point increments. Most MOS-based metrics,
including TekMOS, score like their subjective
counterparts, on a scale from 1–5, with 5 being
the best and 1 considered unacceptable. This
type of scoring makes the results very easy to
understand and communicate.
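As a sketch, the 100-point band scheme described above maps to subjective labels like this (the function name and the exact 20-point cut-offs are my own illustration of the scheme, not an official SSIMPLUS API):

```python
# Map a 1-100 score to the subjective rating bands described above:
# 20-point increments from "bad" up to "excellent" at 80-100.
def rating_band(score: float) -> str:
    bands = [(80, "excellent"), (60, "good"), (40, "fair"), (20, "poor")]
    for floor, label in bands:
        if score >= floor:
            return label
    return "bad"

print(rating_band(87))  # excellent
print(rating_band(55))  # fair
```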
In contrast, PSNR is measured in decibels on a
scale from 1–100. Though these numbers are
not universally accepted, Netflix has posited
that values in excess of 45dB yield no perceivable benefits, while values below 30 are almost
always accompanied by visual artifacts. These
observations have proven extremely useful for
my work, but only when comparing full-resolution output to full-resolution source. When
applied to lower rungs in an encoding ladder,
higher numbers are better, but lose their ability to predict a subjective rating. For example,
for 360p video compared to the original 1080p
source, you’ll seldom see a PSNR score higher than 39dB, even if there are no visible compression artifacts.
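Those rules of thumb can be sketched as a simple classifier. Note that it applies only when comparing full-resolution output to full-resolution source, and the function name and wording are illustrative:

```python
# Rough interpretation of a full-resolution PSNR score, per the
# Netflix-derived thresholds cited above (>45 dB and <30 dB).
def interpret_full_res_psnr(db: float) -> str:
    if db > 45:
        return "little or no perceivable benefit beyond this level"
    if db < 30:
        return "visual artifacts almost always present"
    return "indeterminate: verify visually"

print(interpret_full_res_psnr(46.2))
print(interpret_full_res_psnr(28.0))
```

For lower rungs in the ladder, as the text notes, these bands no longer predict subjective quality and only relative comparisons remain meaningful.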
Though SSIM, and particularly Multi-Scale
SSIM (MS SSIM), are more accurate metrics
than PSNR, their scoring system anticipates a very
small range, from −1 to +1, with higher scores
being better. Most high-quality video scores around 0.98
and above, which complicates comparisons.
While you can mathematically calculate how
much better 0.985 is than 0.982, at the end of the
day, the difference still feels irrelevant.
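One common workaround, used by FFmpeg's ssim filter among others, is to also report SSIM in decibels via −10·log10(1 − SSIM), which stretches out the crowded top of the scale:

```python
# Convert an SSIM score to decibels, spreading out the compressed
# top of the -1..+1 scale so near-identical scores become distinguishable.
import math

def ssim_db(ssim: float) -> float:
    return -10 * math.log10(1 - ssim)

print(round(ssim_db(0.985), 2))  # 18.24
print(round(ssim_db(0.982), 2))  # 17.45
```

On this scale the 0.985-vs-0.982 comparison becomes a 0.8 dB gap, which is easier to reason about than a difference in the third decimal place.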
VMAF scores also run from 1–100. While
higher scores are always better, individual
scores, like a rating of 55 for a 540p file, have no
predictive value of subjective quality. You can’t
tell if that means the video is perfect or awful.
That said, when analyzing an encoding ladder,
VMAF scores typically run from the low teens or
lower for 180p streams, to 98+ for 1080p streams,
which meaningfully distinguishes the scores.
In addition, a VMAF difference of 6 points or
more equals a just-noticeable difference (JND),
which is very useful for analyzing a number of
encoding-related scenarios, including codec comparisons.
The scoring range of VMAF over the diverse
rungs of the encoding ladder makes it attractive for choosing the best resolution/data rate
streams in the ladder. In contrast, PSNR might
range from 30–50dB, with the lower four rungs
compressed between 30–37. This reduces its value as a predictor of the perceptible difference
between these rungs.
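As a sketch of that idea, the snippet below walks an encoding ladder and flags which steps between adjacent rungs clear the 6-point JND threshold; the rung scores are invented for illustration, not measured values.

```python
# Apply the 6-point VMAF JND rule across an encoding ladder: a step
# between adjacent rungs under one JND is unlikely to be noticed.
JND = 6.0

# Illustrative (not measured) VMAF scores, top rung first.
ladder = {"1080p": 97.5, "720p": 93.0, "540p": 89.5, "360p": 78.0, "180p": 14.0}

def noticeable_steps(scores: dict[str, float]) -> list[tuple[str, str, bool]]:
    rungs = list(scores.items())
    # Pair each rung with the one below it and test the VMAF gap.
    return [(lo, hi, scores[hi] - scores[lo] >= JND)
            for (hi, _), (lo, _) in zip(rungs, rungs[1:])]

for lower, higher, visible in noticeable_steps(ladder):
    print(f"{lower} -> {higher}: {'>= 1 JND' if visible else '< 1 JND'}")
```

With PSNR's four lower rungs squeezed into a 30–37dB band, the same per-step comparison would rarely clear a meaningful threshold, which is the point made above.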