ERROR-BASED METRICS
The first class of metrics is error-based. These metrics
compare the compressed image to the original
and produce a score that mathematically represents the differences between the two images, also called noise or error. PSNR (peak signal-to-noise ratio) is a
good example. Metrics based upon this approach
are simple and easy to compute, but their scores often don't correlate well with subjective ratings
because human eyes perceive errors differently.
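To make the idea concrete, here is a minimal sketch of a PSNR computation in Python using only NumPy. The frame data is synthetic (random 8-bit grayscale values); a real tool would run this on decoded video frames, but the math is the same: compute the mean squared error between the two frames, then express it on a logarithmic decibel scale.

```python
import numpy as np

def psnr(original, compressed, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means less error."""
    diff = original.astype(np.float64) - compressed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * np.log10(max_val ** 2 / mse)

# Synthetic stand-in frames: an 8-bit "original" and a slightly noisy copy
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(frame.astype(int) + rng.integers(-5, 6, size=(64, 64)),
                0, 255).astype(np.uint8)

print(round(psnr(frame, noisy), 1))  # small uniform noise lands in the 35-40 dB range
```

Note that a uniform shift of every pixel (like the color shift described below) inflates the MSE, and thus tanks the score, even though such a shift is nearly invisible in playback.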
As an example, I was once testing an encoding tool, and the output files produced a dismal PSNR score. I played the compressed video
several times and couldn’t see why. Then I compared the encoded image to the original and
noticed a slight color shift that accounted for
the poor score. During real-time playback without the original to compare, no viewer would
have noticed the shift, so in that case, PSNR was
a poor predictor of subjective performance.
Why do companies, including Netflix and
Mozilla (relating to the AV1 codec), continue
to publish PSNR results? First, because it’s the
best-known metric, so the scores are easy to
understand. Second, despite its age, PSNR continues to provide very useful data in a number
of scenarios, some of which I’ll discuss below.
At a high level, perceptual-based models like
the SSIM attempt to incorporate how humans
perceive errors, or “human visual system models,” to more accurately predict how humans
will actually rate videos. For example, according to Wikipedia, while PSNR estimates absolute errors, “SSIM is a perception-based model
that considers image degradation as perceived
change in structural information, while also incorporating important perceptual phenomena,
including both luminance masking and contrast
masking terms." In other words, perceptual-based metrics measure the errors and attempt to mathematically model how humans perceive them.
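A simplified sketch can show how SSIM folds luminance, contrast, and structure into one score. The version below computes SSIM over the whole frame in a single window (production implementations slide a small window across the image and average); the constants follow the commonly published defaults of 0.01 and 0.03 times the dynamic range.

```python
import numpy as np

def ssim_global(x, y, max_val=255.0):
    """Single-window SSIM: compares luminance (means), contrast
    (variances), and structure (covariance) of two frames."""
    c1 = (0.01 * max_val) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * max_val) ** 2  # stabilizer for contrast/structure
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
print(round(ssim_global(frame, frame), 4))  # identical frames score 1.0
```

Because structural change is measured through covariance rather than raw pixel differences, a mild global shift hurts SSIM far less than it hurts PSNR.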
Perceptual-based models range from simple, like SSIM, to very complex, like SSIMWave’s
SSIMPLUS metric, or Tektronix’s Picture Quality Rating (PQR) and Attention-weighted Difference Mean Opinion Score (ADMOS). All three
of these ratings can incorporate display type into
the scoring, including factors like size, brightness, and viewing distance, which obviously impact how errors are perceived.
ADMOS also offers attention weighting, which
prioritizes quality in the frame regions that viewers will focus on while watching the video. So, a
blurred face in the center of the screen would
reduce the score far more than blurred edges,
while a purely error-based model would likely
rate them the same.
While these metrics take years of research,
trial and error, and testing to formulate, at the
end of the day, they are just math—formulas
that compare two videos, crunch the numbers,
and output the results. They don’t “learn” over
time, as do those metrics in the next category. In
addition, depending upon the metric, they may
or may not incorporate temporal playback quality into the evaluation.
Similarly, most of these metrics were developed when comparisons were full-resolution compressed frame to full-resolution original frame.
The invention of the encoding ladder, and the
decisions relating thereto, create a new type of
analysis. For example, when creating the encoding ladder for a 1080p source video, you may compare the quality of two 1.5Mbps streams, one at
540p, the other at 720p. All metrics can compute
scores for both alternatives; you simply scale each
video up to 1080p and compare it to the source.
But few of these older metrics were designed for
this analysis. (More on this in a moment.)
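The ladder comparison described above can be sketched in a few lines. The example below uses a synthetic "source" frame and toy integer scaling factors as stand-ins for the 720p and 540p rungs (a real workflow would use a proper video scaler and decoder, not a box filter and nearest-neighbor upscale), but the procedure is the one described: downscale, upscale back to source resolution, score against the source.

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

def downscale(img, factor):
    """Box-filter downscale by an integer factor (toy stand-in for encoding a rung)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upscale(img, factor):
    """Nearest-neighbour upscale back to source resolution."""
    return np.kron(img, np.ones((factor, factor)))

rng = np.random.default_rng(1)
source = rng.integers(0, 256, size=(64, 64)).astype(np.float64)  # stand-in source frame

for factor in (2, 4):  # stand-ins for the higher- and lower-resolution rungs
    rung = downscale(source, factor)
    score = psnr(source, upscale(rung, factor))
    print(f"1/{factor} resolution rung: {score:.1f} dB")
```

The less-downscaled rung scores higher here, as expected, but the caveat in the text stands: metrics designed for same-resolution comparison were never validated for scoring upscaled rungs against a source.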
MACHINE LEARNING AND METRIC FUSION
The final category of metrics involves the concept of machine learning, which is illustrated in
Figure 1 (on the next page) from a Tektronix presentation on TekMOS, the company’s new quality metric. Briefly, MOS stands for mean opinion
score, or the results from a round of subjective
testing, typically using a rating from 1 (unacceptable) to 5 (excellent).
In training mode, which is shown in the figure, the metric converts each frame into a set
of numerical datapoints, representing multiple values such as brightness, contrast, and
the like. Then it compares those values to over
2,000 frames with MOS scores from actual subjective evaluations, so that it “learns” the values that produce a good or bad subjective MOS
score. In measurement mode, TekMOS takes
what it learned from those 2,000-plus trials, inputs the numerical datapoints from the frame
it’s analyzing, and outputs a MOS score.
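The learn-then-predict shape of this workflow can be sketched with a toy model. Everything below is invented for illustration: the three "features" stand in for per-frame values like brightness and contrast, the 2,000 MOS labels are synthetic, and the model is a plain least-squares fit rather than whatever Tektronix actually uses. The point is only the two phases: fit feature-to-MOS weights on scored frames, then apply them to an unseen frame.

```python
import numpy as np

# --- Training phase (hypothetical data) ---
# 2,000 frames, each reduced to 3 numerical features, paired with a
# synthetic MOS label on the 1 (unacceptable) to 5 (excellent) scale.
rng = np.random.default_rng(2)
features = rng.random((2000, 3))
hidden_w = np.array([1.5, -2.0, 0.5])  # invented "ground truth" relationship
mos = np.clip(3.0 + features @ hidden_w + rng.normal(0, 0.1, 2000), 1, 5)

# Fit weights mapping features -> MOS (least squares with a bias column).
X = np.column_stack([features, np.ones(len(features))])
w, *_ = np.linalg.lstsq(X, mos, rcond=None)

# --- Measurement phase ---
# Apply the learned weights to a new frame's features (plus bias term).
new_frame = np.array([0.6, 0.2, 0.9, 1.0])
predicted_mos = float(new_frame @ w)
print(round(predicted_mos, 2))
```

A real system would use a far richer model and carefully engineered features, but the split between a training mode and a measurement mode matches the figure's description.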
Like the metrics discussed above, machine
learning algorithms start with a mathematical