Abstract
Semantic similarity between pairs of text documents is useful in many applications. Various basic models have been proposed for representing document content and computing document similarity, and each performs differently in different scenarios. Existing model selection or fusion approaches build improved models from these basic models at the granularity of a document collection. The resulting models are static across all document pairs and may be appropriate for only some of them. We propose the idea of dynamic model fusion and an approach based on a Dynamic-Static Fusion Model (DSFM) that operates at the granularity of document pairs, so that the fusion adapts to each pair. The dynamic module in DSFM learns to rank the basic models in order to predict the best basic model for a given document pair. We propose a model categorization method that constructs ideal model labels for document pairs, which this dynamic module uses for learning. The static module in DSFM is based on linear regression. We also propose a model selection method that chooses appropriate candidate basic models for fusion and improves performance. Experiments on public document collections containing paragraph pairs and sentence pairs with human-rated similarity illustrate the effectiveness of our approach.
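The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how ideal model labels are constructed or how the two modules are combined, so the sketch assumes the ideal model for a pair is the basic model whose score is closest to the human rating, and it fits the static module as ordinary least squares over the basic-model scores. All function names and the toy data are hypothetical.

```python
import numpy as np

def ideal_model_labels(scores, ratings):
    """Assumed labeling rule: for each pair, pick the basic model whose
    similarity score is closest to the human rating.
    scores: (n_pairs, n_models), ratings: (n_pairs,)."""
    errors = np.abs(scores - ratings[:, None])
    return errors.argmin(axis=1)

def fit_static_fusion(scores, ratings):
    """Static module sketch: least-squares linear regression that maps
    basic-model scores (plus an intercept) to human ratings."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return weights

def predict_static_fusion(weights, scores):
    """Apply the fitted regression weights to basic-model scores."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return X @ weights

# Toy data (made up): 4 document pairs scored by 2 basic models.
scores = np.array([
    [0.9, 0.4],
    [0.2, 0.6],
    [0.5, 0.5],
    [0.1, 0.3],
])
ratings = np.array([0.8, 0.6, 0.5, 0.1])

labels = ideal_model_labels(scores, ratings)   # training targets for a dynamic module
weights = fit_static_fusion(scores, ratings)   # static fusion weights
fused = predict_static_fusion(weights, scores) # fused similarity per pair
```

In a dynamic module, labels such as these would serve as training targets for a ranker that predicts the best basic model from features of each document pair. That ranker is omitted here because the abstract does not describe its features.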
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, J., Asano, Y., Shimizu, T., Yoshikawa, M. (2015). A Dynamic-Static Approach of Model Fusion for Document Similarity Computation. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science, vol 9418. Springer, Cham. https://doi.org/10.1007/978-3-319-26190-4_24
DOI: https://doi.org/10.1007/978-3-319-26190-4_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26189-8
Online ISBN: 978-3-319-26190-4
eBook Packages: Computer Science (R0)