Abstract
There are many expressive and structural differences between product names and general named entities such as person names, location names and organization names. To date, there has been little research on product named entity recognition (NER), which is crucial and valuable for information extraction in the field of market intelligence. This paper focuses on product NER (PRO NER) in Chinese text. First, we describe our efforts on data annotation, including well-defined specifications, data analysis and development of a corpus with annotated product named entities. Second, a hierarchical hidden Markov model-based approach to PRO NER is proposed and evaluated. Extensive experiments show that the proposed method outperforms the cascaded maximum entropy model and obtains promising results on the data sets of two different electronic product domains (digital and cell phone).




Notes
For purposes of clarity and precision, the singular forms “product named entity” and “named entity” are abbreviated “PRO NE” and “NE”, respectively, while the plural forms “product named entities” and “named entities” are abbreviated “PRO NEs” and “NEs,” respectively.
References
Aberdeen, J., et al. (1995). MITRE: Description of the ALEMBIC system used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6) (pp. 141–155).
Bick, E. (2004). A named entity recognizer for Danish. In Lino et al. (Eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004), Lisbon (pp. 305–308).
Bikel, D. M., Miller, S., Schwartz, R., & Weischedel, R. (1997). Nymble: A high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing (pp. 194–201), ACL.
Borthwick, A. (1999). A maximum entropy approach to named entity recognition. PhD Dissertation. Computer Science Department, New York University.
Carletta, J. (1996). Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics, 22(2), 249–254.
Collier, N., Nobata, C., & Tsujii, J. (2000). Extracting the names of genes and gene products with a hidden Markov model. In Proceedings of the 18th International Conference on Computational Linguistics (COLING’2000), Saarbrücken, Germany (pp. 201–207).
Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1), 41–62.
Jelinek, F., & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In D. Gelsema & L. Kanal (Eds.), Pattern recognition in practice. North-Holland.
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), Stanford, CA (pp. 591–598).
Niu, C., Li, W., Ding, J., & Srihari, R. K. (2003). A bootstrapping approach to named entity classification using successive learners. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo (pp. 335–342).
Pierre, J. M. (2002). Mining knowledge from text collections using automatically generated metadata. In Proceedings of Fourth International Conference on Practical Aspects of Knowledge Management (PAKM2002), Vienna (pp. 537–548).
Sekine, S., Grishman, R., & Shinou, H. (1998). A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora, Canada, http://www.cs.nyu.edu/~sekine/papers/wvlc98.pdf.
Siegel, S., & Castellan, N. J. (1988). Non-parametric statistics for the behavioral sciences (2nd ed.). McGraw-Hill.
Wu, Y., Zhao, J., & Xu, B. (2003). Chinese named entity recognition combining statistical model with human knowledge. In The Workshop attached with 41st ACL for Multilingual and Mix-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo (pp. 65–72).
Xiong, D., Yu, H., & Liu, Q. (2004). Tagging complex NEs with Maxent models: Layered structures versus extended Tagset. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04), Sanya (pp. 638–643).
Yi, E., Lee, G. G., & Park, S.-J. (2004). SVM-based biological named entity recognition using minimum edit-distance feature boosted by virtual examples. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04), Sanya (pp. 22–24).
Yu, S., Duan, H., Zhu, X., Swen, B., & Chang, B. (2003). Word segmentation, POS tagging and phonetic notation. International Journal of The Chinese and Oriental Languages Information Processing Society, 13(2), 121–159.
Acknowledgments
This work is supported by the National High Technology Development 863 Program of China under Grant No. 2006AA01Z144, the National Natural Science Foundation of China under Grant No. 60673042, and the Natural Science Foundation of Beijing under Grants No. 4052027 and 4073043. This research is also carried out as part of a cooperative project with Fujitsu R&D Center Co., Ltd. We would like to thank Dr. Hao YU, Dr. Yingju XIA, and Dr. Fumihito Nishino for helpful conversations and feedback on the corpus. We would like to thank Dr. Yang LIU of the University of Texas at Dallas, Dr. Ying ZHAO of Tsinghua University, and Mr. Matthew Trueman for their useful suggestions for modifying earlier drafts of the paper. We are grateful to the anonymous reviewers for very helpful comments on an earlier draft. Their insights and suggestions have led to many improvements in the paper.
Additional information
This research was conducted under the framework of the Chinese Linguistic Data Consortium (ChineseLDC). In the first phase, ChineseLDC created a series of fundamental Chinese language resources, including Comprehensive Chinese Lexicon, Chinese Grammatical Knowledge Base (frequent words), Word-segmented and POS-tagged Chinese Corpus, Syntactic Treebank, Chinese–English Parallel Corpus, Chinese Semantic Lexicon, etc. Construction of the Product Named Entity Tagged Corpus and development of the Automatic Product Named Entity Recognition Tool are among the tasks of the second phase of ChineseLDC.
Appendix: Peking University’s TagSet for POS Tagging Chinese Texts (Yu et al. 2003)
About this article
Cite this article
Zhao, J., Liu, F. Product named entity recognition in Chinese text. Lang Resources & Evaluation 42, 197–217 (2008). https://doi.org/10.1007/s10579-008-9066-8
DOI: https://doi.org/10.1007/s10579-008-9066-8