Profile generation from web sources: an information extraction system

  • Original Article
Social Network Analysis and Mining

Abstract

The Internet holds a vast collection of information, much of it unstructured. Sources such as social media, news articles, blogs, speeches and videos often contain information that could be used to build decision-making tools, for example reports about events and individuals. Extracting this information manually is long and tedious. Over the years, a great deal of research in data mining and natural language processing has aimed to make this volume of data easier to consume. The current work describes ProfileGen, an information extraction system that draws on a variety of these data sources to form a profile of a given person. The application has two parts. The first collects publicly available information from social media sites, news websites and blogs and compiles it into a corpus about the given person. The second ranks that information using machine learning techniques, so that it is presented in order of importance.
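
The two-part design can be illustrated with a small sketch. The abstract does not say which models or features ProfileGen uses, so everything below is an assumption made for illustration: the function names (`build_corpus`, `rank_sentences`) are hypothetical, the input is already-fetched text rather than live web pages, and the ranking step is a simple TF-IDF score with a boost for sentences that name the person, standing in for the unspecified machine learning ranker.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def build_corpus(documents):
    """Part 1 (sketch): pool sentences from all sources, dropping exact duplicates.

    `documents` is a list of (source_label, text) pairs that the caller
    has already collected; no web access happens here.
    """
    seen, corpus = set(), []
    for source, text in documents:
        for sentence in split_sentences(text):
            key = sentence.lower()
            if key not in seen:
                seen.add(key)
                corpus.append((source, sentence))
    return corpus


def rank_sentences(corpus, person):
    """Part 2 (placeholder scorer, not the paper's method).

    Each sentence gets the sum of its TF-IDF term weights, doubled when
    it contains any token of the person's name. Returns a list of
    (score, source, sentence) tuples, highest score first.
    """
    tokenized = [tokenize(sentence) for _, sentence in corpus]
    n = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    name_tokens = set(tokenize(person))
    scored = []
    for (source, sentence), toks in zip(corpus, tokenized):
        if not toks:
            continue
        counts = Counter(toks)
        tfidf = sum(
            (count / len(toks)) * math.log((1 + n) / (1 + doc_freq[tok]))
            for tok, count in counts.items()
        )
        boost = 2.0 if name_tokens & set(toks) else 1.0
        scored.append((tfidf * boost, source, sentence))
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Calling `rank_sentences(build_corpus(documents), "Some Name")` yields the pooled sentences in descending score order. A trained ranker, as the abstract describes, would replace the hand-written score with a learned one while keeping the same corpus-then-rank structure.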




Corresponding author

Correspondence to H. Vathsala.


About this article


Cite this article

Ranjan, R., Vathsala, H. & Koolagudi, S.G. Profile generation from web sources: an information extraction system. Soc. Netw. Anal. Min. 12, 2 (2022). https://doi.org/10.1007/s13278-021-00827-y


  • DOI: https://doi.org/10.1007/s13278-021-00827-y
