Named Entity Recognition and Identification for Finding the Owner of a Home Page

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7301)

Abstract

Entity-based applications, such as expert search or online social networks where users search for persons, require high-quality datasets of named entity references. Obtaining such high-quality datasets can be achieved by automatically extracting metadata from Web pages. In this work, we focus on the identification of the named entity that corresponds to the owner of a particular Web page, for example, a home page or an organizational staff Web page. More specifically, from a set of named entities that have already been extracted from a Web page, we identify the one which corresponds to the owner of the home page. First, we develop a set of features which are combined in a scoring function to select the named entity of the Web page owner. Second, we formulate the problem as a classification problem in which a pair of a Web page and named entity is classified as being associated or not. We evaluate the proposed approaches on a set of Web pages in which we have previously identified named entities. Our experimental results show that we can identify the named entity corresponding to the owner of a home page with accuracy over 90%.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Plachouras, V., Rivière, M., Vazirgiannis, M. (2012). Named Entity Recognition and Identification for Finding the Owner of a Home Page. In: Tan, PN., Chawla, S., Ho, C.K., Bailey, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2012. Lecture Notes in Computer Science, vol 7301. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30217-6_46

  • DOI: https://doi.org/10.1007/978-3-642-30217-6_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30216-9

  • Online ISBN: 978-3-642-30217-6

  • eBook Packages: Computer Science, Computer Science (R0)
