A growing number of conference talks, particularly scientific ones,
are now recorded for later retrieval.
As the amount of recorded data grows, it becomes increasingly
difficult to find the appropriate video or video sequence of a talk. For
instance, current search engines are not able to answer complex queries
such as “Find a sequence of a recorded talk, in an academic training
lecture, in 2007, in Italy, where a colleague of professor X talked
about image indexing after the coffee break”.
This example illustrates the so-called semantic gap, which, as defined in  “is
an important issue in many computer vision systems, but particularly
for indexing. It refers to the lack of coincidence between machine
low-level digital representations of visual data and the human
high-level cognitive understanding of the same data”. This gap is
particularly relevant for information retrieval. Low-level features
(color, slide transitions, slide animations, etc.) can be extracted
automatically during a video indexing stage, while higher-level
features are based on rich human semantics (concepts, topics, people,
etc.) and therefore involve human intervention.