We’re all familiar with facial recognition, which has become more
prevalent thanks to Facebook and facial-matching technologies like
those integrated into the Photos app on Apple’s iOS.
Video facial recognition is a bit
more of a black art, though. After
all, video is at least 24 still images
per second, and sometimes up to
60 images per second. The sheer
amount of information available
in a single still image, or frame, is
staggering, which is why most facial recognition systems for single
still images require several seconds to process each frame.
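To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The figure of 2 seconds per frame is an assumption chosen only to stand in for "several seconds"; the function names are hypothetical and do not come from any product.

```python
def frames_in_clip(duration_s: float, fps: float) -> int:
    """Number of still frames in a clip of the given length."""
    return round(duration_s * fps)


def naive_processing_time_s(duration_s: float, fps: float,
                            seconds_per_frame: float) -> float:
    """Time to run a per-frame recognizer over every frame, one at a time."""
    return frames_in_clip(duration_s, fps) * seconds_per_frame


# Assumed, illustrative cost of recognizing faces in one still frame.
SECONDS_PER_FRAME = 2.0

for fps in (24, 60):
    frames = frames_in_clip(10, fps)
    total_s = naive_processing_time_s(10, fps, SECONDS_PER_FRAME)
    print(f"10 s of video at {fps} fps: {frames} frames, "
          f"{total_s / 60:.0f} min to process")
```

Under that assumption, 10 seconds of 24 fps video takes 8 minutes to process frame by frame, and the same clip at 60 fps takes 20 minutes.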
Add to this the way that interframe compression works (codecs like
H.264 use full frames, or I-frames, coupled with differential frames
like P- or B-frames that don’t store the entire image), and the
complexity of decoding and indexing each individual frame rises.
In addition, regardless of whether it’s a still
image or a single frame of video, complexity
also rises if there are several people in the
shot. All in all, the processing required to handle facial recognition for just a few seconds of video is staggering.
On top of that, a professionally edited video
will often cut back and forth among several presenters, the audience, and graphics (e.g., websites or PowerPoint slides).
So the facial recognition portion of an EVP solution needs to identify
not only when a presenter appears on screen, but also when that person
disappears and then reappears within a given threshold of time.
In other words, facial recognition needs to have both a tolerance
threshold and an aggregation function, so users can search for a person
and receive results that group together the sections of a video in
which a particular presenter appears.
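One way to picture a tolerance threshold combined with an aggregation function is the minimal Python sketch below. It assumes the recognizer has already produced a list of timestamps (in seconds) at which a given person was detected. The `Segment` type, the `aggregate_appearances` function, and the 5-second default are all hypothetical and are not taken from any EVP product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    person: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def aggregate_appearances(timestamps, person, max_gap_s=5.0):
    """Merge detection timestamps into segments.

    Two detections belong to the same segment when the gap between
    them is no larger than max_gap_s (the tolerance threshold).
    """
    segments = []
    for t in sorted(timestamps):
        if segments and t - segments[-1].end <= max_gap_s:
            # Close enough to the previous detection: extend that segment.
            segments[-1] = Segment(person, segments[-1].start, t)
        else:
            # Gap is too large (or this is the first detection): new segment.
            segments.append(Segment(person, t, t))
    return segments


# Hypothetical detections: the presenter is on screen, the edit cuts away
# to slides, then cuts back twice more.
detections = [0, 1, 2, 4, 30, 31, 33, 90]
for segment in aggregate_appearances(detections, "presenter_a"):
    print(segment)
```

With a 5-second tolerance, the eight detections collapse into three segments: 0 to 4 seconds, 30 to 33 seconds, and a single appearance at 90 seconds. A larger tolerance would bridge longer cutaways at the cost of less precise results.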
Based on the above, the good news is that this multiface, multiframe
facial recognition might actually help solve the problem of finding the
right video clip to help with our budget problem. If you suddenly
remember that it was two copresenters who were talking about the new
budget and how it impacts your department, might it be possible for the
EVP to search for more than one person at a time?
The answer is yes, although very few solutions offer this option.
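Conceptually, a search for two people at once amounts to finding where their on-screen time ranges overlap. The Python sketch below illustrates that idea under stated assumptions: each person's appearances are given as sorted, non-overlapping `(start, end)` tuples in seconds, and the function name and sample data are hypothetical. It does not describe how any particular EVP implements the feature.

```python
def intersect_intervals(a, b):
    """Return the time ranges present in both interval lists.

    Both inputs must be sorted lists of non-overlapping (start, end)
    tuples, in seconds.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical appearance ranges for two copresenters.
presenter_a = [(0, 40), (100, 160)]
presenter_b = [(30, 120), (150, 200)]
print(intersect_intervals(presenter_a, presenter_b))
```

For the sample data, both presenters are on screen together from 30 to 40, 100 to 120, and 150 to 160 seconds, which are the clips a two-person search would return.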
One that does offer this in a somewhat rudimentary form is the new Microsoft Stream service. Designed to replace the legacy Office 365
Video service, Stream makes it possible to
choose more than one person to search for, at
least for on-demand content.
To do so, Stream offers features like audio transcription and face
detection as ways to find relevant content.
Beyond that, Stream also offers the ability to search text that appears
in a video, “even for specific words or people shown on screen, whether
in a single video or across all your compa-
According to the Stream site, built-in machine learning “intelligence
also drives accessibility features, so every person can engage
according to their need.”
Stream is available for those who have an Office 365 subscription, but
it is not limited to subscribers. Pricing for those without Office 365
is on a per-user, per-month basis. The basic service, for $3 per month
per user, offers a way to aggregate, organize, and search videos. For an
additional $2 per user per month, Stream offers two features key to our
scenario: search “using deep search based on in-content signals like
speech to text,” and search “using face detection and audio transcripts.”
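As a quick worked example of what that pricing means for a department, the Python sketch below uses the two per-user prices quoted above. The headcount of 200 is a hypothetical figure, and the function name is illustrative only.

```python
BASE_USD = 3          # basic service, per user per month
SEARCH_ADDON_USD = 2  # deep search and face detection, per user per month


def monthly_cost_usd(users: int, with_search: bool = True) -> int:
    """Total monthly cost for a given number of users."""
    per_user = BASE_USD + (SEARCH_ADDON_USD if with_search else 0)
    return users * per_user


users = 200  # hypothetical headcount
print(monthly_cost_usd(users, with_search=False))  # basic service only
print(monthly_cost_usd(users, with_search=True))   # with search features
```

For 200 users, that works out to $600 per month for the basic service and $1,000 per month with the search features added.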