A probabilistic model with multi-dimensional features for object extraction

  • Research Article
  • Frontiers of Computer Science

Abstract

To identify recruitment information in different domains, we propose a novel model, hierarchical tree-structured conditional random fields (HT-CRFs). First, we discuss the concept of a Web object (WOB) as a description of special Web information. Second, in contrast to traditional methods, we introduce the Boolean model and multi-rule to denote a one-dimensional text feature that better represents Web objects. In addition, we develop a two-dimensional semantic texture feature to discover the layout of a WOB; it emphasizes both the structural attributes and the specific semantic term attributes of WOBs. Third, we perform optimal WOB information extraction (IE) based on HT-CRFs, which addresses the model’s excessive dependence on page structure and improves the efficiency of model training. Finally, we compare the proposed model with existing decoupled approaches for WOB IE. The experimental results show that the accuracy of WOB IE is significantly improved and that time complexity is reduced.
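
As a rough illustration of what a Boolean-model text feature can look like, the sketch below marks each term of a small vocabulary as present (1) or absent (0) in a text fragment. The vocabulary, the function name `boolean_features`, and the sample text are assumptions made for illustration only; the abstract does not specify the actual term set or the multi-rule component, so this is not the authors' implementation.

```python
import re
from typing import Dict, List

# Hypothetical recruitment-related terms; the real term set is not given in the abstract.
VOCABULARY: List[str] = ["salary", "experience", "position", "location"]


def boolean_features(text: str, vocabulary: List[str] = VOCABULARY) -> Dict[str, int]:
    """Map each vocabulary term to 1 if it occurs in the text, else 0."""
    # Lowercase and keep alphabetic runs only, so "Position:" matches "position".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {term: int(term in tokens) for term in vocabulary}


# Illustrative input, not taken from the paper's data set.
features = boolean_features("Position: engineer, 3 years experience required")
print(features)
```

A vector of this kind records only term presence, not frequency, which is what distinguishes the Boolean model from weighted representations such as the vector space model.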

Author information

Corresponding author

Correspondence to Jing Wang.

Additional information

Jing Wang is a lecturer in the School of Computer Science and Technology, Xidian University, China. She received her PhD, MS, and BS in Computer Science from Xidian University in 2011, 2008, and 2003, respectively. Her research interests are data mining and machine learning.

Zhijing Liu is a professor and advisor for doctoral students. He graduated from the Department of Computer Engineering of Northwestern Telecommunications Engineering Institute in 1982. He currently serves as the head of the Research Center of Computer Information Research Application, and is the director of the China and America Associated Laboratory of Key Technologies of Mobile Electronic Commerce. His research focuses on data mining and vision computing.

Hui Zhao is a PhD candidate at the School of Electronic and Information Engineering, Xi’an Jiaotong University. He received his BS and MS from Xidian University in 2008. His research interests include data mining, information retrieval, and mobile learning.

About this article

Cite this article

Wang, J., Liu, Z. & Zhao, H. A probabilistic model with multi-dimensional features for object extraction. Front. Comput. Sci. 6, 513–526 (2012). https://doi.org/10.1007/s11704-012-1093-3

  • DOI: https://doi.org/10.1007/s11704-012-1093-3
