ABSTRACT
Understanding intents from search queries can improve a user's search experience and boost a site's advertising profits. Query tagging via statistical sequential labeling models has been shown to perform well, but annotating the training set for supervised learning requires substantial human effort. Domain-specific knowledge, such as semantic class lexicons, reduces the amount of manual annotation needed, but much human effort is still required to maintain these lexicons as search topics evolve over time.
This paper investigates semi-supervised learning algorithms that leverage structured data (HTML lists) from the Web to automatically generate semantic-class lexicons, which are used to improve query tagging performance, even with far less training data. We focus our study on understanding the correct objectives for the semi-supervised lexicon learning algorithms, which are crucial for the success of query tagging. Prior work on lexicon acquisition has largely focused on the precision of the lexicons, but we show that precision is not important if the lexicons are used for query tagging. A more adequate criterion emphasizes a trade-off between maximizing the recall of semantic class instances in the data and minimizing confusability. This ensures that similar levels of precision and recall are observed on both the training and test sets, and hence prevents over-fitting to the lexicon features. Experimental results on retail product queries show that enhancing a query tagger with lexicons learned with this objective reduces word-level tagging errors by up to 25% compared to a baseline tagger that does not use any lexicon features. In contrast, lexicons obtained through a precision-centric learning algorithm even degrade the performance of the tagger compared to the baseline. Furthermore, the proposed method outperforms one in which semantic class lexicons have been extracted from a database.
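To make the recall versus confusability trade-off concrete, the sketch below scores a candidate lexicon against a handful of tagged queries. It is only an illustration under stated assumptions: the abstract does not define confusability, so here it is taken to be the fraction of lexicon matches in the data that fall on words whose gold label is a different class. The function name `lexicon_stats`, the class labels, and the toy queries are all hypothetical, and lexicon entries are restricted to single words for simplicity.

```python
from typing import List, Set, Tuple

# A tagged query is a list of (word, gold semantic class) pairs.
TaggedQuery = List[Tuple[str, str]]


def lexicon_stats(
    queries: List[TaggedQuery], lexicon: Set[str], target: str
) -> Tuple[float, float]:
    """Return (recall, confusability) of a single-word lexicon on tagged data.

    recall: share of words labeled `target` that the lexicon covers.
    confusability (assumed definition, not taken from the paper): share of
    lexicon matches that land on words labeled with some other class.
    """
    class_words = 0  # words whose gold label is the target class
    hits = 0         # target-class words found in the lexicon
    matches = 0      # all words found in the lexicon, any label
    wrong = 0        # lexicon matches whose gold label is not the target
    for query in queries:
        for word, label in query:
            in_lexicon = word.lower() in lexicon
            if label == target:
                class_words += 1
                if in_lexicon:
                    hits += 1
            if in_lexicon:
                matches += 1
                if label != target:
                    wrong += 1
    recall = hits / class_words if class_words else 0.0
    confusability = wrong / matches if matches else 0.0
    return recall, confusability


# Hypothetical toy data: "apple" is a brand in one query and a scent in another.
queries = [
    [("canon", "Brand"), ("camera", "Type")],
    [("apple", "Brand"), ("laptop", "Type")],
    [("sony", "Brand"), ("tv", "Type")],
    [("apple", "Scent"), ("candle", "Type")],
]
recall, confusability = lexicon_stats(queries, {"canon", "apple"}, "Brand")
```

On this toy data the lexicon covers two of the three brand words, and one of its three matches falls on a non-brand word. Both quantities are measured on the query data rather than on the lexicon entries themselves, which is the distinction the abstract draws against precision-centric criteria.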
Index Terms
- Semi-supervised learning of semantic classes for query understanding: from the web and for the web