A growing number of conference talks, particularly scientific ones, are now recorded for later retrieval. As the volume of recorded data grows, finding the appropriate video, or the relevant sequence within a talk, becomes increasingly difficult. For instance, current search engines cannot answer complex queries such as “Find a sequence of a recorded talk, in an academic training lecture, in 2007, in Italy, where a colleague of professor X talked about image indexing after the coffee break”.
This example illustrates the so-called semantic gap, which, as defined in [1], “is an important issue in many computer vision systems, but particularly for indexing. It refers to the lack of coincidence between machine low-level digital representations of visual data and the human high-level cognitive understanding of the same data”. The gap is particularly relevant for information retrieval activities. Low-level features (color, slide transitions, slide animations, etc.) can be extracted automatically during a video indexing stage, while higher-level features are based on rich human semantics (concepts, topics, people, etc.) and therefore involve human intervention [2], [3], [4].