This paper introduces an intelligent multimedia information system that exploits machine learning and database technologies. The system automatically extracts the semantic contents of videos using the visual, auditory, and textual modalities, and then stores the extracted contents in an appropriate format so that they can be retrieved efficiently in subsequent requests for information. The semantic contents are first extracted from each of these three modalities separately. The outputs from the modalities are then fused to increase the accuracy of the object extraction process. The semantic contents extracted through information fusion are stored in an intelligent and fuzzy object-oriented database system. To answer user queries efficiently, a multidimensional indexing mechanism is developed that combines the extracted high-level semantic information with the low-level video features. The proposed multimedia information system is implemented as a prototype, and its performance is evaluated on news video datasets for answering content-based and concept-based queries over each modality and over their fused data. The performance results show that the developed multimedia information system is robust and scalable for large-scale multimedia applications.

This work is supported by a research grant from TUBITAK with the grant number “MFAG-114R082”. We thank all the previous researchers of the Multimedia DB Lab. at METU and Ahmet Cosar, who have contributed to this research.
Appendix: Example Screenshots of Developed System
Example screen for the semantic concept extractor: a given video is divided into shots (upper-left table). For the selected shot (shot 18), four keyframes are detected (second table). The selected keyframe is segmented, and the objects in the segmented parts are recognized (image). One of the segmented parts is selected and marked in red. The semantic content extractor identifies this object as a football player with a score of 1.0 (shown in the table under the image).
Example screen for a content-based query-by-example (QBE): an image (at the lower part of the screen) is given as an example, and videos containing similar images are queried. The video shot at the top is one of the answers returned by the system. The query image shows a car accident, and the answer image shows a car crashing into a shop.
An example multimodal query containing the visual, audio, and text modalities: in this query, we search for video shots related to tennis that contain a tennis court and tennis players in the visual modality, applause and crowd events in the audio modality, and the tennis player Federer in the text modality. A number of video shots are selected and displayed in decreasing order of their matching scores. The best-matching result is shown at the top of the screen capture.
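
The ranking step in the multimodal query above can be illustrated with a minimal sketch. It assumes that each shot already carries one matching score per modality in the range [0, 1] and that the scores are combined by a weighted average. The fusion rule used by the developed system is not specified here, so the weighted average, the `ShotScores` type, and the function names are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class ShotScores:
    """Hypothetical per-modality matching scores of one video shot."""
    shot_id: str
    visual: float  # e.g. tennis court and tennis players
    audio: float   # e.g. applause and crowd events
    text: float    # e.g. the name "Federer"


def fused_score(shot, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine the three modality scores with a weighted average (assumed rule)."""
    w_visual, w_audio, w_text = weights
    return w_visual * shot.visual + w_audio * shot.audio + w_text * shot.text


def rank_shots(shots, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return the shots sorted by fused matching score, best match first."""
    return sorted(shots, key=lambda s: fused_score(s, weights), reverse=True)
```

With this sketch, a shot that matches well in all three modalities is ranked above a shot that matches in only one or two of them, which corresponds to the decreasing order of matching scores in the screen capture.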
