Companies are commercializing multimodal AI models to analyze videos and more
Posted on: Mar 22, 2022

Earlier this month, researchers at the Allen Institute for AI (a nonprofit founded by the late Microsoft cofounder Paul Allen) released an interactive demo of a system they describe as part of a “new generation” of AI applications that can analyze, search across, and respond to questions about videos “at scale.” Called Merlot Reserve, the system “watched” 20 million YouTube videos to learn the relationships between images, sounds, and subtitles, allowing it to answer questions such as “What meal does the person in the video want to eat?” or “Has the boy in this video swam in the ocean before?”

Merlot Reserve and its predecessor, Merlot, aren’t the first “multimodal” AI systems of their kind. Systems that can process and relate information from audio, visuals, and text have been around for years, and they continue to improve at connecting those modalities in ways that more closely resemble human understanding. San Francisco research lab OpenAI’s DALL-E, which was released in 2021, can generate images of objects, real or imagined, from simple text descriptions like “an armchair in the shape of an avocado.” A more recent system out of Google called VATT can not only caption events in videos (e.g., “a man swimming”) but classify audio clips and recognize objects in images.

However, until recently, multimodal AI systems were strictly the domain of research. That’s changing — increasingly, they’re becoming commercialized.

“Different multimodal technologies including automatic speech recognition, image labeling and recognition, neural networks and traditional machine learning models [can help to] gain an understanding of text, voice, and images — [especially when paired] with text processing,” Ross Blume, the cofounder of CLIPr, told VentureBeat via email. CLIPr is among the nascent cohort of companies using multimodal AI systems for applications like analyzing video. Tech giants including Meta (formerly Facebook) and Google are represented in the group, as are startups like Twelve Labs, which claims that its systems can recognize features in videos including objects, text on screen, speech, and people.
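One common way to combine the modalities Blume lists — speech, images, and text — is late fusion: embed each modality separately, normalize, concatenate, and search the joint vectors. The sketch below illustrates the idea in pure Python with tiny hand-made toy vectors; the segment names, vector values, and two-modality setup are all hypothetical illustrations, not CLIPr's or Twelve Labs' actual pipeline.

```python
from math import sqrt

# Hypothetical toy embeddings (hand-made for illustration, not from a
# real model): one small vector per modality for three video segments.
segments = {
    "seg1": {"speech": [0.9, 0.1], "visual": [0.8, 0.2]},
    "seg2": {"speech": [0.1, 0.9], "visual": [0.2, 0.8]},
    "seg3": {"speech": [0.5, 0.5], "visual": [0.6, 0.4]},
}

def fuse(modalities):
    """Late fusion: L2-normalize each modality vector, then concatenate."""
    fused = []
    for vec in modalities.values():
        norm = sqrt(sum(x * x for x in vec)) or 1.0
        fused.extend(x / norm for x in vec)
    return fused

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Build a joint index, then rank segments against a fused query vector.
index = {name: fuse(mods) for name, mods in segments.items()}
query = fuse({"speech": [1.0, 0.0], "visual": [1.0, 0.0]})
best = max(index, key=lambda name: cosine(index[name], query))
print(best)  # → seg1, whose speech and visual vectors best match the query
```

Normalizing per modality before concatenating keeps one modality from dominating the similarity score simply because its raw vectors have larger magnitudes — a standard design choice in late-fusion retrieval.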

“[My fellow cofounders and I] sought out a solution to help us easily extract important and relevant clips from videos as an alternative to skipping around at 10-15 second intervals, and when we weren’t able to find a solution, we decided to build one … Our namesake video indexing platform … ingests recorded video and helps make it searchable by transcription, topics, and subtopics,” Blume said. “Analyzing prosody is also critical for us, which is the rhythm, stress and intonation of speech. We leverage it against image analysis, such as meeting presentation slides, to help evaluate the accuracy of these tonal changes or [look] for animated gestures with the participants who are on video.”
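A basic building block of the prosody analysis Blume describes is estimating pitch (fundamental frequency, F0) over time — intonation is the F0 contour, and stress correlates with F0 and energy changes. The following is a minimal sketch of autocorrelation-based pitch estimation on a synthetic tone; it assumes nothing about CLIPr's actual method, and the function name and search band are illustrative choices.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency of a mono audio frame via
    autocorrelation: the strongest repeat lag within the typical
    speaking range (fmin..fmax Hz) gives the pitch period."""
    frame = frame - frame.mean()
    # Autocorrelation at non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag bounds for the band
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(sr) / sr                     # one second of samples
tone = np.sin(2 * np.pi * 200.0 * t)       # synthetic 200 Hz "voice"
f0 = estimate_f0(tone, sr)
print(round(f0))  # → 200
```

A real prosody pipeline would run this frame by frame over speech, track the F0 contour alongside energy and timing, and — as Blume describes — cross-reference those tonal changes against visual signals such as presentation slides or gestures.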

Blume claims that CLIPr has clients in a “variety” of industries, chiefly media publishing, enterprise, and events. In the future, the startup aims to apply its technology to livestream video and create “role-specific” bots that can, for example, take keynote sessions from an event and automatically create a highlight reel.

“It is our belief that video is the most important and underutilized form of modern communication, and our goal is to make video as accessible as written content,” Blume continued.