Abstract
In this paper, we propose a weakly-supervised occupation detection approach which can automatically detect occupation information for micro-blogging users. The weakly-supervised approach makes use of two types of user information (tweets and personal descriptions) through a rule-based user occupation detection and a MCS-based (MCS: a multiple classifier system) user occupation detection. First, the rule-based occupation detection uses the personal descriptions of some users to create pseudo-training data. Second, based on the pseudo-training data, the MCS-based occupation detection uses tweets to do further occupation detection. However, the pseudo-training data is severely skewed and noisy, which brings a big challenge to the MCS-based occupation detection. Therefore, we propose a class-based random sampling method and a cascaded ensemble learning method to overcome these data problems. The experiments show that the weakly-supervised occupation detection achieves a good performance. In addition, although our study is made on Chinese, the approach indeed is language-independent.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plaintext collections. In: Proceedings of the 5th ACM International Conference on Digital Libraries (2000)
Artiles, J., Gonzalo, J., Sekine, S.: WePS 2 Evaluation Campaign: Overview of the Web People Search Attribute Extraction Task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference (2009)
Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI (2007)
Barandela, R., Sanchez, J., Garcia, V., Rangel, E.: Strategies for Learning in Class Imbalance Problems. Pattern Recognition (2003)
Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research (2002)
Chen, Y., Lee, S.Y.M., Huang, C.: PolyUHK: A Robust Information Extraction System for Web Personal Names. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference (2009)
Gruhl, D., Nagarajan, M., Pieper, J., Robson, C., Sheth, A.: Context and domain knowledge enhanced entity spotting in informal text. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 260–276. Springer, Heidelberg (2009)
Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: Proc. Int’l J. Conf. Intelligent Computing, pp. 878–887 (2005)
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In: Proc. Int’l J. Conf. Neural Networks, pp.1322–1328 (2008)
He, H., Garcia, E.: Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering. Knowledge and Data Engineering 21(9), 1263–1284 (2009)
Kubat, M., Matwin, S.: Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In: Proc. Int’l Conf. Machine Learning (1997)
Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, Inc., Hoboken (2004)
Li, S., Wang, Z., Zhou, G., Lee, S.Y.M.: Semi-supervised Learning for Imbalanced Sentiment Classification. In: Proceedings of IJCAI (2011)
Liu, X., Zhang, S., Wei, F., Zhou, M.: Recognizing Named Entities in Tweets. In: ACL (2011)
Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory Under Sampling for Class Imbalance Learning. In: Proc. Int’l Conf. Data Mining, pp. 965–969 (2006)
Minkov, E., Wang, R.C., Cohen, W.W.: Extracting personal names from emails: Applying named entity recognition to informal text. In: HLT/EMNLP (2005)
Sarawagi, S.: Information Extraction. Foundations and Trends in Databases (2008)
Wang, B.X., Japkowicz, N.: Imbalanced Data Set Learning with Synthetic Samples. In: Proc. IRIS Machine Learning Workshop (2004)
Zhang, J., Mani, I.: KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. In: Proc. Int’l Conf. Machine Learning (ICML 2003), Workshop Learning from Imbalanced Data Sets (2003)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, Y., Pei, B. (2014). Weakly-Supervised Occupation Detection for Micro-blogging Users. In: Zong, C., Nie, JY., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2014. Communications in Computer and Information Science, vol 496. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45924-9_27
Download citation
DOI: https://doi.org/10.1007/978-3-662-45924-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-45923-2
Online ISBN: 978-3-662-45924-9
eBook Packages: Computer ScienceComputer Science (R0)