DOI: 10.1145/2566486.2568037
research-article

A hierarchical Dirichlet model for taxonomy expansion for search engines

Published: 07 April 2014

Abstract

Emerging trends and products pose a challenge to modern search engines, which must adapt to the constantly changing needs and interests of users. For example, vertical search engines such as Amazon, eBay, Walmart, Yelp and Yahoo! Local provide business category hierarchies that let people navigate millions of business listings. The category information also provides important ranking features that can be used to improve the search experience. However, category hierarchies are often manually crafted by human experts and are far from complete. Manually constructed hierarchies cannot keep up with ever-changing and sometimes long-tail user information needs. In this paper, we study how to expand an existing category hierarchy for a search/navigation system so that it accommodates users' information needs more comprehensively. We propose a general framework for this task with three steps: 1) detecting meaningful missing categories; 2) modeling the category hierarchy using a hierarchical Dirichlet model and predicting the optimal tree structure according to the model; 3) reorganizing the corpus using the complete category structure, i.e., associating each webpage with the relevant categories from the complete category hierarchy. Experimental results demonstrate that the proposed framework generates a high-quality category hierarchy and significantly boosts retrieval performance.
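To make step 2 more concrete, here is a minimal sketch of one way a Dirichlet-based model could decide where a newly detected category belongs. It is an illustration under stated assumptions, not the model from the paper: each category is reduced to a bag of word counts, each candidate parent defines a Dirichlet prior centered on its own word distribution, and the new category is attached to the parent under which its counts have the highest Dirichlet-multinomial likelihood. The function names, the `concentration` and `smoothing` parameters, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of word counts under a Dirichlet(alpha) prior.

    The multinomial coefficient is omitted because it is the same for every
    candidate parent and therefore does not affect the comparison.
    """
    a0 = sum(alpha.values())
    n = sum(counts.values())
    score = math.lgamma(a0) - math.lgamma(a0 + n)
    for word, c in counts.items():
        score += math.lgamma(alpha[word] + c) - math.lgamma(alpha[word])
    return score


def parent_prior(parent_counts, vocab, concentration=10.0, smoothing=0.01):
    """Dirichlet pseudo-counts centered on the parent's smoothed word distribution.

    `concentration` and `smoothing` are illustrative assumptions.
    """
    total = sum(parent_counts.values()) + smoothing * len(vocab)
    return {
        w: concentration * (parent_counts.get(w, 0) + smoothing) / total
        for w in vocab
    }


def best_parent(child_counts, candidates, concentration=10.0):
    """Return the candidate parent that best explains the child's word counts."""
    vocab = set(child_counts)
    for counts in candidates.values():
        vocab |= set(counts)
    scores = {
        name: log_dirichlet_multinomial(
            child_counts, parent_prior(counts, vocab, concentration)
        )
        for name, counts in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (invented for illustration): two existing categories, one new one.
candidates = {
    "Restaurants": Counter({"food": 5, "menu": 4, "pizza": 1}),
    "Automotive": Counter({"car": 6, "repair": 3, "tire": 2}),
}
new_category = Counter({"pizza": 3, "menu": 2, "food": 1})

parent, scores = best_parent(new_category, candidates)
print(parent, scores)
```

Choosing each parent independently, as above, is the simplest possible decision rule. A full system would presumably optimize the whole tree jointly rather than one attachment at a time.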

Published In

WWW '14: Proceedings of the 23rd international conference on World wide web
April 2014
926 pages
ISBN:9781450327442
DOI:10.1145/2566486

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Dirichlet distribution
  2. local search
  3. missing categories
  4. taxonomy expansion

Qualifiers

  • Research-article

Conference

WWW '14
Sponsor:
  • IW3C2

Acceptance Rates

WWW '14 Paper Acceptance Rate 84 of 645 submissions, 13%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2022) Hierarchical Entity Resolution using an Oracle. Proceedings of the 2022 International Conference on Management of Data, pages 414-428. DOI: 10.1145/3514221.3526147. Online publication date: 10-Jun-2022
  • (2022) Taxonomy Enrichment. Automated Taxonomy Discovery and Exploration, pages 49-81. DOI: 10.1007/978-3-031-11405-2_4. Online publication date: 22-Sep-2022
  • (2021) All You Need to Know to Build a Product Knowledge Graph. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 4090-4091. DOI: 10.1145/3447548.3470825. Online publication date: 14-Aug-2021
  • (2021) Enquire One’s Parent and Child Before Decision: Fully Exploit Hierarchical Structure for Self-Supervised Taxonomy Expansion. Proceedings of the Web Conference 2021, pages 3291-3304. DOI: 10.1145/3442381.3449948. Online publication date: 19-Apr-2021
  • (2020) NEO: A Tool for Taxonomy Enrichment with New Emerging Occupations. The Semantic Web – ISWC 2020, pages 568-584. DOI: 10.1007/978-3-030-62466-8_35. Online publication date: 2-Nov-2020
  • (2017) Guided HTM. IEEE Transactions on Knowledge and Data Engineering, 29:2, pages 330-343. DOI: 10.1109/TKDE.2016.2625790. Online publication date: 1-Feb-2017
  • (2015) Tackling data sparseness in recommendation using social media based topic hierarchy modeling. Proceedings of the 24th International Conference on Artificial Intelligence, pages 2415-2421. DOI: 10.5555/2832581.2832586. Online publication date: 25-Jul-2015
