Imagine that you’re trying to plan for the upcoming fiscal year, and suddenly remember that a key budget element that impacts your department was mentioned in an all-hands meeting a week or so ago.
You search through the agenda document,
but don’t see anything that fits what you remember, and the C-suite hasn’t yet released
full budget figures for next quarter, so there’s
no way to find that particular detail.
One option is to email the comptroller’s office and ask the team to give you the budget
information, but you don’t remember enough
detail about the particular point the speaker
or speakers made in the all-hands meeting.
What to do?
A few years ago, searching through hours
of all-hands meeting video would’ve been your
only option. But thanks to computer vision, machine learning, and robust metadata extraction,
today’s enterprise video platforms (EVPs) are
able to help you find what you need to know to
appear knowledgeable when you ask for additional details.
Technologies that perform facial recognition
and speech-to-text indexing have been around
for at least two decades—in fact, a research project in Europe that I was part of around 2004 posited the very scenario mentioned above—but the tools were complex and didn’t easily tie into EVP solutions. Still, even those tools were capable of extracting up to 32,000 pieces of metadata per frame.
Today’s solutions offer more accurate facial
recognition and slightly better speech-to-text
transcription, the latter owing in part to work we’d done on both our research project and other projects that focused on virtual beam-forming microphones for use in corporate environments. Yet the true power of these solutions isn’t in the discrete tools but rather in the holistic approach to searching indexed content.
Going back to our scenario, let’s consider
ways to approach the problem.
First, if the presenter was properly recorded—meaning the recording used a good lavalier microphone and didn’t combine that microphone with the audience mics in the final recording—then recent advances in speech-to-text processing should yield results good enough to find the keywords you’re looking for.
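To make the idea concrete, here’s a minimal sketch of what keyword search over a timestamped transcript looks like. The transcript format (a list of timestamped segments) and the search function are my own assumptions for illustration, not any particular EVP’s API.

```python
# Hypothetical timestamped transcript, as a speech-to-text engine
# might emit it: (start_seconds, text) pairs.
transcript = [
    (12.0, "welcome everyone to the quarterly all-hands"),
    (95.5, "the travel budget line item will be frozen next quarter"),
    (180.2, "questions can go to the comptroller's office"),
]

def find_keyword(transcript, keyword):
    """Return (timestamp, text) for every segment containing the keyword."""
    keyword = keyword.lower()
    return [(start, text) for start, text in transcript
            if keyword in text.lower()]

# Each hit gives a timestamp you can jump to in the recording.
hits = find_keyword(transcript, "budget")
```

The point of the timestamps is that a search hit isn’t just a line of text: it’s a seek position in hours of video.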
The same may be true with several presenters, assuming they don’t talk over each other.
Some solutions even offer a way to differentiate
between speakers. While it’s often a fairly rudimentary differentiation (e.g., Speaker 1, Speaker 2), it still makes it possible for the searcher
to filter searches by particular speakers.
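Speaker-filtered search is just the keyword search restricted to one diarization label. A rough sketch, again with an assumed data layout (the generic Speaker 1/Speaker 2 labels and the segment tuples are illustrative, not a real product’s format):

```python
# Hypothetical diarized transcript: (speaker_label, start_seconds, text).
segments = [
    ("Speaker 1", 10.0, "let's review headcount for next year"),
    ("Speaker 2", 42.0, "the budget freeze affects every department"),
    ("Speaker 1", 88.0, "we'll revisit the budget in the spring"),
]

def search_by_speaker(segments, speaker, keyword):
    """Restrict a keyword search to one diarized speaker label."""
    keyword = keyword.lower()
    return [
        (start, text)
        for label, start, text in segments
        if label == speaker and keyword in text.lower()
    ]
```

Even with anonymous labels, this narrows hours of footage to the moments when one particular voice said the word you remember.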
What if you don’t remember who the speaker was, or the audio is unintelligible enough to throw off the speech-to-text transcription engine? Several solutions offer the ability to find video based on a speaker’s image.
By Tim Siglin