Twelve Labs is building models that can understand videos at a deep level

Generating text with AI is just the tip of the iceberg. AI models that possess an understanding of both images and text have the potential to unlock a wide array of innovative applications.

Consider Twelve Labs, a San Francisco-based startup. The company trains AI models to tackle complex video-language alignment problems, as its co-founder and CEO, Jae Lee, describes it. Its initial focus is building infrastructure for multimodal video understanding, and its first major project is semantic search, a “CTRL+F for videos.” The overarching vision is to empower developers to build applications that can perceive, listen to, and interpret the world much as humans do.

Twelve Labs’ models aim to bridge the gap between natural language and the content within a video, encompassing actions, objects, and background sounds. This technology enables developers to create applications that can search through video content, classify scenes, extract topics, automatically summarize video clips, split them into chapters, and perform various other tasks.
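To make the developer-facing side of this concrete, here is a minimal Python sketch of what a semantic video search call might look like. The endpoint, field names, and response shape are hypothetical placeholders for illustration, not Twelve Labs’ actual API.

```python
import requests

# Hypothetical endpoint and credentials; placeholders, not Twelve Labs' real API.
API_URL = "https://api.example-video-platform.com/v1"
API_KEY = "your-api-key"

def search_videos(index_id: str, query: str, limit: int = 5) -> list:
    """Run a natural-language ("CTRL+F for videos") query against an indexed video library."""
    response = requests.post(
        f"{API_URL}/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"index_id": index_id, "query": query, "limit": limit},
    )
    response.raise_for_status()
    # Each hit is assumed to reference a clip: video id, start/end timestamps, relevance score.
    return response.json()["results"]

if __name__ == "__main__":
    for hit in search_videos("my-video-index", "goal celebration in the rain"):
        print(hit["video_id"], hit["start"], hit["end"], hit["score"])
```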

Lee notes that Twelve Labs’ technology can be applied to ad insertion and content moderation, for example distinguishing a violent video featuring a knife from an instructional one. It also has uses in media analytics and can automatically generate highlights from videos, as well as headlines and tags for blog posts.

Given the well-documented issue of bias in AI models, I inquired about the steps Twelve Labs takes to address this concern. Lee mentioned the company’s commitment to meeting internal bias and fairness standards for its models before releasing them. Furthermore, they plan to introduce model-ethics-related benchmarks and datasets in the future.

Lee emphasizes how Twelve Labs’ product differs from large language models such as ChatGPT: its AI models are built specifically to process and comprehend video, integrating the visual, audio, and speech components within a clip and pushing the technical boundaries of video understanding.

While companies like Google, Microsoft, and Amazon are developing multimodal models for video understanding, Twelve Labs distinguishes itself with the quality of its models and its platform’s fine-tuning capabilities. These features allow customers to customize the platform’s models with their own data for domain-specific video analysis.

On the model front, Twelve Labs is introducing Pegasus-1, a new multimodal model that can comprehend a wide range of prompts related to holistic video analysis. It can generate detailed reports about videos or provide concise highlights with timestamps, offering valuable tools for enterprise organizations looking to leverage their vast video data for various business opportunities.
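As a rough sketch of the prompt-driven workflow described above, the snippet below shows how an application might request a summary and timestamped highlights from a video-language model. The request format and response fields are assumptions made for illustration, not Pegasus-1’s actual interface.

```python
import requests

API_URL = "https://api.example-video-platform.com/v1"  # hypothetical placeholder endpoint
API_KEY = "your-api-key"

def generate_video_report(video_id: str, prompt: str) -> dict:
    """Send a free-form prompt about an indexed video and return the model's answer."""
    response = requests.post(
        f"{API_URL}/generate",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"video_id": video_id, "prompt": prompt},
    )
    response.raise_for_status()
    return response.json()

report = generate_video_report(
    "match-2023-week-7",
    "Write a short summary of this game and list the top five moments with timestamps.",
)
print(report.get("summary"))
for moment in report.get("highlights", []):
    # Assumed response shape: each highlight carries a timestamp and a short description.
    print(moment["timestamp"], moment["description"])
```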

Since its private beta launch in May, Twelve Labs has attracted 17,000 developers. The company is now working with customers across industries including sports, media and entertainment, e-learning, and security, among them the NFL.

The company is actively raising funds and has recently closed a $10 million strategic funding round with backing from Nvidia, Intel, and Samsung Next. This brings their total funding to $27 million, supporting their mission to advance the field of video understanding and offer powerful models to a wide range of customers and industries.
