Synonyms
Image/Video/Music search; Multimedia information retrieval
Definition
Semantic modeling and knowledge representation is essential to a multimedia information retrieval system for supporting effective data organization and search. Semantic modeling and knowledge representation for multimedia data (e.g., imagery, video, and music) consists of three steps: feature extraction, semantic labeling, and features-to-semantics mapping. Feature extraction obtains perceptual characteristics such as color, shape, texture, salient-object, and motion features from multimedia data; semantic labeling associates multimedia data with cognitive concepts; and features-to-semantics mapping constructs correspondences between perceptual features and cognitive concepts. As with data representation for text documents, improving semantic modeling and knowledge representation for multimedia data leads to better data organization and query performance.
Historical Background
The principal design goal of a multimedia information retrieval system is to return data (images, video clips, or music) that accurately match users’ queries (for example, a search for pictures of a deer). To achieve this design goal, the system must first comprehend a user’s query concept thoroughly, and then find data in the low-level input space (formed by a set of perceptual features) that match the concept accurately. For traditional relational databases, a query concept is explicitly specified by a user using SQL. For multimedia information retrieval, however, articulating a query concept (e.g., a deer) using low level features (e.g., color, shape, texture, and salient-object features) is infeasible. Semantic modeling and knowledge representation thus plays a key role in query-concept formulation and query processing for a multimedia query.
The QBIC system [8], introduced in 1995, was the first query-by-example system. QBIC uses color histograms to represent an image or video clip; two images/clips with similar color histograms are considered similar. Such knowledge representation for multimedia data is clearly inadequate. In the subsequent five years, many researchers in the signal processing and computer vision communities proposed techniques to extract perceptual features, such as textures, shapes, and segments of objects, to improve image representation (see [13] for a survey). At the same time, the query-by-example paradigm was also applied to music retrieval.
Query by just one example was soon found insufficient to represent a query concept. Relevance feedback, a query refinement technique developed by the information retrieval community in the 1970s [12], was then borrowed to provide additional examples and compensate for knowledge under-representation. In 2001, the work of [14] showed that relevance feedback could be much improved by combining kernel methods [1] with active learning. Kernel methods project data from the input space formed by perceptual features into a much higher (possibly infinite) dimensional space, where a linear classifier can be learned to separate data matching the query from the rest. Kernel methods enjoy both rich semantic modeling (a linear class boundary in the high-dimensional space corresponds to a non-linear boundary in the input space) and computational efficiency (the kernel trick avoids explicit computation in the projected space). Active learning selects the most ambiguous and diversified training instances along the class boundary and queries the user for their labels. Once these instances are labeled, maximal information is gained for refining the class boundary. This process continues until the search result is satisfactory. To further improve query-concept learning through active learning, keywords (tagged by users [9] or obtained from query logs [10]) were subsequently integrated into the semantic modeling and knowledge representation framework.
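This relevance-feedback loop can be sketched in a few lines. The kernel perceptron below is a toy stand-in for the SVMs used in [14], and the data, the concept boundary, and the oracle that plays the user are all invented for illustration:

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel: an implicit projection into a high-dimensional space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

class KernelPerceptron:
    """Toy kernel classifier (a hypothetical stand-in for an SVM)."""
    def __init__(self, gamma=1.0, epochs=50):
        self.gamma, self.epochs = gamma, epochs
        self.alphas, self.X, self.y = [], [], []

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        self.alphas = [0.0] * len(self.X)
        for _ in range(self.epochs):
            for i, xi in enumerate(self.X):
                if self.y[i] * self.decision(xi) <= 0:  # misclassified
                    self.alphas[i] += 1.0
        return self

    def decision(self, x):
        """Signed score; its sign is the class prediction."""
        return sum(a * yi * rbf(xi, x, self.gamma)
                   for a, yi, xi in zip(self.alphas, self.y, self.X))

def most_ambiguous(model, pool, k=2):
    """Active learning: ask about the points nearest the class boundary."""
    return sorted(pool, key=lambda x: abs(model.decision(x)))[:k]

# Simulated feedback loop; the oracle plays the user's query concept.
oracle = lambda x: 1 if x[0] + x[1] > 2 else -1
labeled = [((2.5, 2.5), 1), ((0.2, 0.1), -1)]   # initial examples
pool = [(0.9, 1.2), (1.2, 0.9), (3.0, 3.0), (0.1, 0.3), (1.1, 1.1)]

for _ in range(3):                               # feedback rounds
    model = KernelPerceptron().fit(*zip(*labeled))
    for x in most_ambiguous(model, pool):
        labeled.append((x, oracle(x)))           # the user labels it
        pool.remove(x)
```

Each round retrains on the accumulated labels and probes only the boundary region, which is why a handful of feedback rounds can suffice.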
Over a decade of research since QBIC, though productive, has not yielded a large-scale, real-world deployment of a multimedia information retrieval system. The key reason is that semantic modeling and knowledge representation for multimedia data is intrinsically interdisciplinary: its success demands collaborative effort from researchers in signal processing, computer vision, machine learning, and databases. Recent work on perceptual similarity [11] and on scalability in statistical learning [5] takes such interdisciplinary approaches and holds promise for Web-scale deployment. The survey by [6] provides a complementary view of the historical background.
Foundations
Semantic Modeling
There are two realistic ways for users to specify the semantics of a multimedia query: query by keywords and query by examples. To support query by keywords, semantic annotation tags data with semantic labels (for example, landscape, sunset, animals, and so forth). Several researchers (e.g., [2]) have proposed semi-automatic annotation methods to propagate keywords from a small set of annotated images to the others. Although semantic annotation can yield some relevant query results, annotation is often subjective and narrowly construed, and when it is, query performance suffers. To thoroughly understand a query concept, with all of its semantics and subjectivity, a system must obtain the target concept from the user directly via query-concept learning. Semantic annotation can assist, but not replace, query-concept learning.
Both semantic annotation and query-concept learning require mapping features to semantics. This semantic modeling consists of three steps. First, a set of perceptual features (e.g., color, texture, and shape) is extracted from each training instance. Second, each training feature vector xi is assigned semantic labels gi. Third, a classifier f(.) is trained by a supervised (or semi-supervised) learning algorithm on the labeled instances to predict the class labels of a query instance xq. Given a query instance xq represented by its low-level features, its semantic labels gq can be predicted as gq = f(xq). (How multimedia data and knowledge can be represented is discussed in the Knowledge Representation section.)
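As a concrete toy illustration of these three steps, the sketch below assumes step 1 has produced three-bin color histograms, pairs them with semantic labels (step 2), and uses a nearest-centroid classifier as a minimal stand-in for the supervised learner of step 3; all feature values and labels are invented:

```python
def centroid(vectors):
    """Mean vector of a list of feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(labeled):
    """Step 3: learn f(.) from labeled pairs (x_i, g_i)."""
    by_label = {}
    for x, g in labeled:
        by_label.setdefault(g, []).append(x)
    return {g: centroid(xs) for g, xs in by_label.items()}

def f(model, x_q):
    """Predict g_q = f(x_q) as the label of the nearest class centroid."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda g: dist(model[g], x_q))

# Toy 3-bin (red, green, blue) color histograms with semantic labels.
training = [
    ([0.8, 0.1, 0.1], "sunset"),    ([0.7, 0.2, 0.1], "sunset"),
    ([0.2, 0.6, 0.2], "landscape"), ([0.1, 0.7, 0.2], "landscape"),
]
model = train(training)
```

A red-dominated query histogram then maps to "sunset", a green-dominated one to "landscape"; real systems replace the centroid rule with SVMs or similar learners over far richer features.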
At first it might seem that traditional supervised learning methods could be directly applied to perform semantic annotation and query-concept learning. Unfortunately, traditional learning algorithms are not adequate to deal with the technical challenges posed by these two tasks. To illustrate, let D denote the number of low-level features. Let N denote the number of training instances, N+ the number of positive training instances, and N− the number of negative training instances (N = N+ + N−). And let U denote the number of unlabeled instances in the repository. Three major technical challenges arise:
- 1.
Scarcity of training data. The features-to-semantics mapping problem often comes up against the D > N challenge. For instance, in the query-concept learning scenario, the number of low-level features that characterize an image (D) is greater than the number of training instances that a user can provide (N) via her query history or relevance feedback. The theories underlying "classical" data analysis assume that D < N and that N approaches infinity. But when D > N, the basic methodology of the classical setting no longer applies [7].
- 2.
Imbalance of training classes. The target class in the training pool is typically outnumbered by the non-target classes (N− >> N+). When the prior of the non-target class dominates that of the target class, class predictions favor the non-target class. This skew can substantially reduce recall in search performance [16].
- 3.
Scalability. A typical value of D can be on the order of hundreds, and U can be millions or even billions. Scalability challenges arise in at least two areas. First, searching among U instances in a high-dimensional space is inefficient [3]. Second, when U >> N, the training data may under-represent the knowledge required to model semantics.
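The class-imbalance challenge (2) can be illustrated with a toy example: an unweighted k-NN vote is dominated by the abundant negative class, while weighting votes by inverse class frequency (one simple remedy; [16] develops a kernel-based alignment instead) recovers the positive prediction. All data points below are invented:

```python
from collections import Counter

def knn_predict(train, x, k=5, class_weight=None):
    """k-NN vote; per-class weights can counter the N- >> N+ skew."""
    d2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda p: d2(p[0], x))[:k]
    votes = Counter()
    for xi, yi in nearest:
        votes[yi] += (class_weight or {}).get(yi, 1.0)
    return votes.most_common(1)[0][0]

# 2 positives (the target concept) vs. 8 negatives: N- >> N+.
pos = [((2.0, 2.0), 1), ((2.2, 2.0), 1)]
neg = [((i * 0.1, j * 0.1), -1) for i in range(4) for j in range(2)]
data = pos + neg
query = (1.4, 1.4)   # geometrically closer to the positive cluster
```

Without weights, the three nearest negatives outvote the two positives; with the positive class weighted by the 8:2 frequency ratio, the query is labeled positive.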
Effective techniques for addressing the above challenges are inter-disciplinary. The signal processing and computer vision communities devise algorithms to extract useful features to represent multimedia data. The machine learning community develops models that can map features to semantics both effectively and efficiently. The database community improves indexing, metadata fusion, and query processing techniques to deal with scalability issues. All these endeavors may consult experts in neural processing or cognitive science (e.g., [15]) to develop representations and models that fit human perception.
Knowledge Representation
As mentioned, a piece of multimedia data can be represented at two levels: low-level features and high-level semantics/concepts. The low-level representation consists of perceptual features, which can take the form of a vector or a bag. High-level concepts are organized into an ontology, which depicts relationships between concepts. In between, descriptors can be formulated either explicitly or implicitly to provide building blocks for low-level to high-level mapping and reasoning. For instance, a high-level ski concept can be formed from the descriptors snow, ski equipment, and people; each of these descriptors is in turn composed of color, texture, shape, or salient-point features. Text, when available, can be used to augment low-level perceptual features (e.g., using the word "white" to depict the color of a mountain), to label descriptors (e.g., snow), or to directly annotate high-level semantics/concepts (e.g., ski). Statistical methods such as SVMs and latent semantic analysis techniques (e.g., LDA [4]) can be employed to map between the three levels.
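The feature → descriptor → concept layering can be sketched with hand-written rules; in practice the mappings are learned statistically (e.g., with SVMs or LDA), and every rule, threshold, and feature name below is hypothetical:

```python
# Concepts defined over descriptors (the ski example from the text,
# plus an invented beach concept for contrast).
ONTOLOGY = {
    "ski":   {"snow", "ski equipment", "people"},
    "beach": {"sand", "water", "people"},
}

# Hand-written descriptor detectors over invented low-level features;
# a real system would learn these mappings from labeled data.
DESCRIPTOR_RULES = {
    "snow":          lambda f: f.get("white_ratio", 0.0) > 0.4,
    "sand":          lambda f: f.get("yellow_ratio", 0.0) > 0.4,
    "water":         lambda f: f.get("blue_ratio", 0.0) > 0.3,
    "people":        lambda f: f.get("face_count", 0) > 0,
    "ski equipment": lambda f: f.get("elongated_shapes", 0) > 1,
}

def descriptors_of(features):
    """Low-level features -> set of descriptors that fire."""
    return {d for d, rule in DESCRIPTOR_RULES.items() if rule(features)}

def concepts_of(descriptors, min_coverage=0.66):
    """Descriptors -> concepts whose parts are sufficiently covered."""
    return {c for c, parts in ONTOLOGY.items()
            if len(parts & descriptors) / len(parts) >= min_coverage}
```

An image with a large white region, detected faces, and elongated shapes thus maps through the descriptors snow, people, and ski equipment to the ski concept.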
Efforts to standardize knowledge representation have been under way for over a decade in academia and industry. For instance, digital cameras save JPEG files with EXIF (Exchangeable Image File) data, which records camera settings, scene information, and time (and, in the near future, the location where a photo is taken). DOLCE provides a descriptive ontology for linguistic and cognitive engineering. MPEG-7 defines descriptions of multimedia data at multiple levels of granularity. Standard knowledge representation is essential for supporting metadata exchange and system interoperability.
Key Applications
The launches of photo and video sharing sites such as Flickr and YouTube in the mid-2000s renewed interest in multimedia data management. The following applications are in high demand for managing large-scale multimedia data repositories:
- 1.
Content-based Video, Image, Music Search Engines
- 2.
Copyright Infringement Detection
- 3.
Multimedia Digital Libraries
- 4.
Semi-automatic Photo/Video Annotation/Classification
An application scenario illustrates how the aforementioned foundations can improve multimedia information retrieval. Figures 1–3 show an example query using a Perception-based Image Retrieval (PBIR) prototype developed at UC Santa Barbara. The figures demonstrate how the PBIR search engine learns a query concept in an iterative process to improve search results. The user interface shows two frames. The frame on the left-hand side is the feedback frame, on which the user marks images relevant to his or her query concept. On the right-hand side, the search engine returns from the image database what it interprets as matching thus far.
Most images were annotated by users. To query "cat," one first enters the keyword cat in the query box to get the first screen of results in Fig. 1. The right-side frame shows a couple of images containing domestic cats, but also several images containing tigers or lions, because many tiger/lion images were annotated with "wild cat" or "cat." To disambiguate the concept, the user clicks on a couple of domestic-cat images in the feedback frame (left side, in gray/green borders). The search engine refines the class boundary accordingly, and then returns the second screen in Fig. 2. In this figure, the images in the result frame (right side) are much improved: all returned images contain a domestic cat or two. After another couple of rounds of feedback and further refinement, more satisfactory results are shown in Fig. 3.
This example illustrates three critical points. First, keywords alone cannot retrieve images effectively because words may have varied meanings or senses; this is called the word-aliasing problem. Second, the number of labeled instances that can be collected from a user is limited. Through three feedback iterations, it is possible to gather just 16 × 3 = 48 training instances, whereas the feature dimension of this dataset is more than one hundred. Since most users are unwilling to give more than three iterations of feedback, the system encounters the problem of scarcity of training data. Third, the negative instances outnumber the relevant, positive instances being clicked on; this is the problem of imbalanced training data. In addition, the repository contains a large number of images, so efficient indexing schemes are needed to reduce the search space and achieve real-time performance in query refinement and search.
Future Directions
Major advances in three areas are necessary before large-scale multimedia systems become practical: accurate and efficient object segmentation, scalable statistical learning, and high-dimensional indexing. For details, see the Foundations section.
Recommended Reading
Aizerman MA, Braverman EM, Rozonoer LI. Theoretical foundations of the potential function method in pattern recognition learning. Autom Remote Control. 1964;25:821–37.
Barnard K, Forsyth D. Learning the semantics of words and pictures. Int Conf Comput Vision. 2000;2:408–15.
Beyer K, Goldstein J, Ramakrishnan R, Shaft U. When is nearest neighbor meaningful. In: Proceedings of the 7th International Conference on Database Theory; 1999. p. 217–35.
Blei DM, Ng A, Jordan M. Latent Dirichlet allocation. J Machine Learning Res. 2003;3(4/5):993–1022.
Chang EY, et al. Parallelizing support vector machines on distributed computers. In: Advances in Neural Information Processing Systems 20, Proceedings of the 21st Annual Conference on Neural Information Processing Systems; 2007.
Datta R, Joshi D, Li J, Wang JZ. Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv. 2008;40(65).
Donoho DL. Aide-Memoire. High-dimensional data analysis: the curses and blessings of dimensionality (American Math. Society Lecture). In: Mathematical Challenges Of The 21st Century Explored at American Mathematical Society Conference; 2000.
Flickner M, et al. Query by image and video content: the QBIC system. IEEE Comput. 1995;28(9):23–32.
Goh K, Chang EY, Lai W-C. Concept-dependent multimodal active learning for image retrieval. In: Proceedings of the 12th ACM International Conference on Multimedia; 2004. p. 564–71.
Hoi C-H, Lyu MR. A novel log-based relevance feedback technique in content-based image retrieval. In: Proceedings of the 12th ACM International Conference on Multimedia; 2004. p. 24–31.
Li B, Chang EY. Discovery of a perceptual distance function for measuring image similarity. ACM Multimedia Syst J. (Special Issue on Content-Based Image Retrieval). 2003;8(6):512–22.
Rocchio JJ. Relevance feedback in information retrieval. In: Salton G, editor. The SMART retrieval system – experiments in automatic document processing. Englewood Cliffs: Prentice-Hall; 1971. p. 313–23. Chapter 14.
Rui Y, Huang TS, Chang S-F. Image retrieval: current techniques, promising directions and open issues. J Vis Commun Image Represent. 1999;10(1):39–62.
Tong S, Chang EY. Support vector machine active learning for image retrieval. In: Proceedings of the 9th ACM International Conference on Multimedia; 2001. p. 107–18.
Tversky A. Features of similarity. Psychol Rev. 1977;84(4):327–52.
Wu G, Chang EY. KBA: Kernel Boundary Alignment considering imbalanced data distribution. IEEE Trans Knowl Data Eng. 2005;17(6):786–95.
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
Chang, E.Y. (2018). Semantic Modeling and Knowledge Representation for Multimedia Data. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_1038
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9