I present an AI-driven system for the automatic retrieval and segmentation of video content in which specific artworks are discussed. Given only the title of a work of art, the system identifies and extracts short, relevant video portions in which that artwork is explicitly explained, even when the discussion appears within broader, more general content.
The pipeline follows a multi-step process. First, I perform a keyword-based search across large-scale media archives to retrieve a ranked list of candidate videos—the top-K most likely to contain references to the target artwork. Each selected video is then transcribed using Whisper, with speaker diarization to distinguish different voices.
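As a rough illustration, the transcription and diarization step could be built on the openai-whisper and pyannote.audio packages. The model names and the overlap-based speaker assignment below are assumptions for the sketch, not the system's exact configuration:

```python
# Sketch of transcription + diarization, assuming openai-whisper and
# pyannote.audio; model choices here are illustrative.
import whisper
from pyannote.audio import Pipeline

def transcribe_with_speakers(audio_path: str):
    # Whisper returns segments with start/end timecodes and text.
    model = whisper.load_model("medium")
    result = model.transcribe(audio_path)

    # pyannote assigns a speaker label to each speech turn
    # (requires a Hugging Face access token for the pretrained pipeline).
    diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    diarization = diarizer(audio_path)
    turns = [(turn.start, turn.end, speaker)
             for turn, _, speaker in diarization.itertracks(yield_label=True)]

    # Attach to each Whisper segment the speaker whose turn overlaps it most.
    labeled = []
    for seg in result["segments"]:
        overlaps = {}
        for start, end, speaker in turns:
            ov = min(seg["end"], end) - max(seg["start"], start)
            if ov > 0:
                overlaps[speaker] = overlaps.get(speaker, 0.0) + ov
        speaker = max(overlaps, key=overlaps.get) if overlaps else "UNKNOWN"
        labeled.append({"start": seg["start"], "end": seg["end"],
                        "speaker": speaker, "text": seg["text"]})
    return labeled
```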
Next, I segment the transcription into longer, monologue-style blocks in which a single speaker talks continuously for at least 30 seconds. These blocks, along with the artwork title, are processed by a large language model (LLM), which identifies the portions of speech specifically related to the artwork. All original timecodes are preserved, enabling precise extraction of temporally aligned subclips.
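A minimal sketch of this step follows, assuming the labeled segments produced above. The 30-second threshold comes from the pipeline description; `query_llm` is a hypothetical helper standing in for whatever LLM endpoint the system actually uses:

```python
# Sketch of monologue segmentation and LLM-based filtering.
MIN_BLOCK_SECONDS = 30.0

def build_monologue_blocks(labeled_segments):
    """Merge consecutive same-speaker segments; keep blocks >= 30 s."""
    blocks, current = [], None
    for seg in labeled_segments:
        if current and seg["speaker"] == current["speaker"]:
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]
        else:
            if current:
                blocks.append(current)
            current = dict(seg)
    if current:
        blocks.append(current)
    return [b for b in blocks if b["end"] - b["start"] >= MIN_BLOCK_SECONDS]

def find_relevant_blocks(blocks, artwork_title):
    """Ask the LLM which blocks specifically discuss the artwork."""
    prompt = (
        f'You are given timestamped monologue blocks from a video. '
        f'Return the indices of the blocks that specifically discuss '
        f'the artwork "{artwork_title}".\n\n'
    )
    for i, b in enumerate(blocks):
        prompt += f'[{i}] ({b["start"]:.1f}-{b["end"]:.1f}s) {b["text"]}\n'
    indices = query_llm(prompt)          # hypothetical LLM call returning indices
    return [blocks[i] for i in indices]  # original timecodes preserved
```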
The output is a curated set of “shorts”—concise video segments that explain the chosen artwork—ready for use in educational, curatorial, or commercial settings. Museums can assemble engaging displays, educators can embed authentic expert commentary into lessons, and media organizations can trace and manage rights related to artwork representations across archives.
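Cutting the shorts themselves then reduces to extracting subclips at the preserved timecodes. Continuing the sketch above (where `relevant_blocks` holds the LLM-selected blocks), this could be done with ffmpeg stream copy; the paths and output naming are illustrative:

```python
# Sketch of subclip extraction via ffmpeg stream copy. Stream copy is fast
# but cuts at keyframes; re-encoding would be needed for frame-accurate cuts.
import subprocess

def cut_short(video_path, start, end, out_path):
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", f"{start:.2f}",           # seek to segment start
        "-i", video_path,
        "-t", f"{end - start:.2f}",      # duration of the segment
        "-c", "copy",                    # no re-encode
        out_path,
    ], check=True)

for k, block in enumerate(relevant_blocks):
    cut_short("video.mp4", block["start"], block["end"], f"short_{k:03d}.mp4")
```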
Additionally, the LLM can automatically generate relevant questions based on the content of each segment. This makes it possible to associate specific shorts with the questions they answer, enhancing both discoverability and pedagogical value within the archive.
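This step could look like the following sketch, reusing the hypothetical `query_llm` helper from above; the prompt wording and the output schema are assumptions for illustration:

```python
# Sketch of question generation for a single short.
def generate_questions(block, artwork_title, n=3):
    prompt = (
        f'The following is an expert commentary on "{artwork_title}". '
        f'Write {n} questions that this passage directly answers, '
        f'one per line:\n\n{block["text"]}'
    )
    questions = query_llm(prompt).splitlines()  # hypothetical LLM call
    # Index the short by the questions it answers, e.g. for archive search.
    return {"start": block["start"], "end": block["end"],
            "questions": [q.strip() for q in questions if q.strip()]}
```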