Profiling Web users using big data

  • Original Article
  • Social Network Analysis and Mining

Abstract

Profiling Web users is a fundamental issue for Web mining and social network analysis. Its basic tasks include extracting basic information, mining user preferences, and inferring user demographics (Tang et al. in ACM Trans Knowl Discov Data 5(1):2:1–2:44, 2010). Although the methodologies for handling the three tasks differ, they usually share two stages: first identify the relevant pages (data) of a user, and then use machine learning models (e.g., SVM, CRFs, or DL) to extract/mine/infer profile attributes from each page. These methods were successful on the traditional Web, but face growing challenges as the Web rapidly evolves: each person's information is distributed over the Web and changes dynamically. As a result, the available data for a user on the Web is redundant, and some sources may be out-of-date or incorrect. The traditional two-stage method suffers from data inconsistency and error propagation between the two stages. In this paper, we revisit the problem of Web user profiling in the big data era and propose a simple but very effective approach, referred to as MagicFG, for profiling Web users by leveraging the power of big data. To avoid error propagation, the approach processes all the extracting/mining/inferring subtasks in one unified framework. To improve the profiling performance, we present the concept of contextual credibility. The proposed framework also supports the incorporation of human knowledge: it expresses human knowledge as Markov logic statements and formalizes them into a factor graph model. The MagicFG method has been deployed in an online system, AMiner.org, for profiling millions of researchers, e.g., extracting E-mail, inferring Gender, and mining research interests. Our empirical study in the real system shows that the proposed method offers significantly improved (+4–6%; \(p\ll 0.01\), t test) profiling performance in comparison with several baseline methods using rules, classification, and sequential labeling.
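To illustrate why redundant sources can help rather than hurt, the following is a minimal sketch of pooling candidate values for one attribute across several pages and weighting each by a per-source score. It is not the MagicFG model and does not implement the paper's contextual credibility; the function name `weighted_vote`, the example addresses, and the numeric weights are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def weighted_vote(candidates):
    """Return the candidate value with the highest total source weight.

    `candidates` is a list of (value, weight) pairs, one per page on which
    the value was found. The weights are hypothetical per-source scores,
    not the credibility measure defined in the paper.
    """
    totals = defaultdict(float)
    for value, weight in candidates:
        totals[value] += weight
    # Pick the value with the largest accumulated weight; on a tie, the
    # value that was seen first wins (dicts preserve insertion order).
    return max(totals, key=totals.get)


# Hypothetical evidence for one researcher's E-mail, gathered from four pages.
pages = [
    ("jdoe@old-univ.example", 0.2),  # outdated affiliation page
    ("jdoe@new-univ.example", 0.9),  # current homepage
    ("jdoe@new-univ.example", 0.6),  # recent publication
    ("j.doe@typo.example", 0.1),     # noisy third-party listing
]
best = weighted_vote(pages)
```

Deciding over the pooled evidence, instead of committing to a value page by page, is one simple way an out-of-date source can be outweighed by agreeing sources.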

Notes

  1. https://aminer.org. AMiner aims to understand scientific text and networks. The system extracts researchers' profiles automatically from the Web. So far, the system has built more than 130,000,000 researcher profiles and provides a set of unique functions, including expert search, social influence analysis, collaboration recommendation, and community evolution. The system has been in operation since 2006 and has attracted more than 8,320,000 independent IP accesses from over 220 countries/regions.

  2. http://emailbreaker.com.

  3. https://emailhunter.com.

  4. http://www.getsidekick.com.

  5. https://www.gender-api.com/.

  6. https://genderize.io/.

  7. One example of the heuristic rule: “\((([a-z0-9]+)(\backslash .| dot | \backslash . )?)+(@| at | \backslash [at\backslash ] |\backslash [at\backslash ])(([a-z0-9\backslash ]+)(\backslash .| dot | \backslash . \backslash [dot\backslash ] ))+([a-z]+)\)”.

  8. We call the string before “@” in an E-mail candidate the prefix of the E-mail, and the string after “@” its domain.

  9. In our experiments, we use Face++, http://www.faceplusplus.com/.

  10. https://aminer.org/profiling/.

  11. The Gender dataset is larger than the previously released one, with a more balanced distribution of the two Genders.

  12. https://aminer.org/gender.
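As an illustration of the kind of heuristic rule mentioned in note 7 and the prefix/domain split defined in note 8, here is a minimal sketch of matching obfuscated E-mail candidates. The pattern is a simplified variant written for this example, not the authors' exact rule; the names `AT`, `DOT`, `LABEL`, `EMAIL`, `normalize`, and `extract_emails` are hypothetical.

```python
import re

# Simplified, hypothetical variant of the heuristic rule in note 7:
# "@" may be written as " at " or "[at]", and "." as " dot " or "[dot]".
AT = r"(?:@|\s+at\s+|\s*\[at\]\s*)"
DOT = r"(?:\.|\s+dot\s+|\s*\[dot\]\s*)"
LABEL = r"[a-z0-9-]+"

# prefix: one or more labels joined by dots;
# domain: one or more labels joined by dots, ending in an alphabetic label.
EMAIL = re.compile(
    rf"(?P<prefix>{LABEL}(?:{DOT}{LABEL})*)"
    rf"{AT}"
    rf"(?P<domain>{LABEL}(?:{DOT}{LABEL})*{DOT}[a-z]+)",
    re.IGNORECASE,
)


def normalize(part):
    """Rewrite obfuscated dots as '.' and lowercase the result."""
    part = re.sub(r"\s*\[dot\]\s*|\s+dot\s+", ".", part, flags=re.IGNORECASE)
    return part.lower()


def extract_emails(text):
    """Return (prefix, domain) pairs for every E-mail candidate in `text`."""
    return [
        (normalize(m.group("prefix")), normalize(m.group("domain")))
        for m in EMAIL.finditer(text)
    ]
```

For example, `extract_emails("jdoe[at]cs[dot]univ[dot]edu")` yields the prefix `jdoe` and the domain `cs.univ.edu`. A rule this permissive will also match ordinary phrases such as "look at example.com", which is one reason rule-only extraction is treated here as a baseline.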

References

  • Alani H, Kim S, Millard DE, Weal MJ, Hall W, Lewis PH, Shadbolt NR (2003) Automatic ontology-based knowledge extraction from web documents. IEEE Intell Syst 18(1):14–21

  • Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. ACM Press, New York

  • Balog K, Azzopardi L, de Rijke M (2006) Formal models for expert finding in enterprise corpora. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, pp 43–55

  • Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the web. In: Proceedings of the 20th international joint conference on artificial intelligence, pp 2670–2676

  • Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 59–68

  • Bi B, Shokouhi M, Kosinski M, Graepel T (2013) Inferring the demographics of search users: social data meets search queries. In: Proceedings of the 22nd international conference on world wide web, pp 131–140

  • Blanco L, Bronzi M, Crescenzi V, Merialdo P, Papotti P (2010) Redundancy-driven web data extraction and integration. In: Proceedings of the 13th international workshop on the web and databases, pp 7:1–7:6

  • Brajnik G, Guida G, Tasso C (1987) User modeling in intelligent information retrieval. Inf Process Manag 23(4):305–320

  • Chan PK (1999) Constructing web user profiles: a non-invasive learning approach. In: KDD-99 workshop on web usage analysis and user profiling, pp 39–55

  • Collins M (2002) Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In: Proceedings of the 40th annual meeting on association for computational linguistics, pp 489–496

  • Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  • Cox DR (1958) The regression analysis of binary sequences. J Roy Stat Soc Ser B (Methodol) 20(2):215–242

  • Cunningham H, Maynard D, Bontcheva K, Tablan V (2002) GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 168–175

  • Dong Y, Yang Y, Tang J, Yang Y, Chawla NV (2014) Inferring user demographics and social strategies in mobile social networks. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 15–24

  • Downey D, Etzioni O, Soderland S (2005) A probabilistic model of redundancy in information extraction. In: Proceedings of the 19th international joint conference on artificial intelligence, pp 1034–1041

  • Efstathiades H, Antoniades D, Pallis G, Dikaiakos MD (2016) Users key locations in online social networks: identification and applications. Soc Netw Anal Min 6(1):66:1–66:17

  • Eltaher M, Lee J (2015) User profiling of Flickr: integrating multiple types of features for gender classification. J Adv Inf Technol 6(2):84–87

  • Figueiredo F, Ribeiro B, Almeida JM, Faloutsos C (2016) TribeFlow: mining and predicting user trajectories. In: Proceedings of the 25th international conference on world wide web, pp 695–706

  • Finkel JR, Grenager T, Manning C (2005) Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd annual meeting on association for computational linguistics, pp 363–370

  • Ghahramani Z, Jordan MI (1997) Factorial hidden Markov models. Mach Learn 29(2–3):245–273

  • Hammersley JM, Clifford P (1971) Markov fields on finite graphs and lattices

  • Hu J, Zeng HJ, Li H, Niu C, Chen Z (2007) Demographic prediction based on user’s browsing behavior. In: Proceedings of the 16th international conference on world wide web, pp 151–160

  • Ikeda K, Hattori G, Ono C, Asoh H, Higashino T (2013) Twitter user profiling based on text and community mining for market analysis. Knowl Based Syst 51(1):35–47

  • Joseph K, Wei W, Carley KM (2016) Exploring patterns of identity usage in tweets: a new problem, solution and case study. In: Proceedings of the 25th international conference on world wide web, pp 401–412

  • Kristjansson T, Culotta A, Viola P, McCallum A (2004) Interactive information extraction with constrained conditional random fields. In: Proceedings of the 19th national conference on artificial intelligence, pp 412–418

  • Krulwich B (1997) Lifestyle finder: intelligent user profiling using large-scale demographic data. AI Mag 18(2):37–45

  • Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th international conference on machine learning, pp 282–289

  • Li R, Wang S, Deng H, Wang R, Chang KCC (2012) Towards social user profiling: unified and discriminative influence model for inferring home locations. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1023–1031

  • Li J, Ritter A, Hovy E (2014) Weakly supervised user profile extraction from Twitter. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, pp 165–174

  • Makazhanov A, Rafiei D, Waqar M (2014) Predicting political preference of Twitter users. Soc Netw Anal Min 4(1):193:1–193:15

  • McCallum A, Freitag D, Pereira FCN (2000) Maximum entropy Markov models for information extraction and segmentation. In: Proceedings of the 17th international conference on machine learning, pp 591–598

  • Michelson M, Knoblock C (2007) Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web. Int J Doc Anal Recogn 10(3):211–226

  • Pazzani M, Billsus D (1997) Learning and revising user profiles: the identification of interesting web sites. Mach Learn 27(3):313–331

  • Pedro JS, Siersdorfer S, Sanderson M (2011) Content redundancy in YouTube and its application to video tagging. ACM Trans Inf Syst 29(3):13:1–13:31

  • Richardson M, Domingos P (2006) Markov logic networks. Mach Learn 62(1–2):107–136

  • Ritze D, Lehmberg O, Oulabi Y, Bizer C (2016) Profiling the potential of web tables for augmenting cross-domain knowledge bases. In: Proceedings of the 25th international conference on world wide web, pp 251–261

  • Sarawagi S, Cohen WW (2004) Semi-Markov conditional random fields for information extraction. In: Proceedings of the 17th neural information processing systems, pp 1185–1192

  • Sarraute C, Brea J, Burroni J, Blanc P (2015) Inference of demographic attributes based on mobile phone usage patterns and social network topology. Soc Netw Anal Min 5(1):39:1–39:18

  • Soltysiak SJ, Crabtree IB (1998) Automatic learning of user profiles—towards the personalisation of agent services. BT Technol J 16(3):110–117

  • Szell M, Thurner S (2012) How women organize social networks different from men. ArXiv preprint arXiv:1205.4683

  • Tang J, Hong M, Li J, Liang B (2006) Tree-structured conditional random fields for semantic annotation. In: Proceedings of the 5th international conference on the semantic web, pp 640–653

  • Tang J, Hong M, Zhang D, Liang B, Li J (2007a) Emerging technologies of text mining: techniques and applications. Chap. Information extraction: methodologies and applications, pp 1–33. Idea Group Inc.

  • Tang J, Zhang D, Yao L (2007b) Social network extraction of academic researchers. In: Proceedings of the 7th IEEE international conference on data mining, pp 292–301

  • Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z (2008) Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 990–998

  • Tang J, Yao L, Zhang D, Zhang J (2010) A combination approach to web user profiling. ACM Trans Knowl Discov Data 5(1):2:1–2:44

  • Tang W, Zhuang H, Tang J (2011a) Learning to infer social ties in large networks. In: ECML/PKDD’11, pp 381–397

  • Tang C, Ross K, Saxena N, Chen R (2011b) What’s in a name: a study of names, gender inference, and gender behavior in Facebook. In: Proceedings of the 16th international conference on database systems for advanced applications, pp 344–356

  • Tang J, Fang Z, Sun J (2013) Incorporating social context and domain knowledge for entity recognition. In: Proceedings of the 24th international conference on world wide web, pp 517–526

  • Tang J, Lou T, Kleinberg J, Wu S (2016) Transfer learning to infer social ties across heterogeneous networks. ACM Trans Inf Syst 34(2):7:1–7:43

  • Weninger T, Han J (2013) Exploring structure and content on the web: extraction and integration of the semi-structured web. In: Proceedings of the 6th ACM international conference on web search and data mining, pp 779–780

  • Weninger T, Hsu WH, Han J (2010) CETR: content extraction via tag ratios. In: Proceedings of the 19th international conference on world wide web, pp 971–980

  • Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: Proceedings of the 24th international conference on world wide web, pp 1264–1274

  • Wu L, Ge Y, Liu Q, Chen E, Long B, Huang Z (2016) Modeling users’ preferences and social links in social networking services: a joint-evolving perspective. In: Proceedings of the 30th AAAI conference on artificial intelligence, pp 279–286

  • Yedidia JS, Freeman WT, Weiss Y (2000) Generalized belief propagation. In: Proceedings of the 13th neural information processing systems, pp 689–695

  • Yu K, Guan G, Zhou M (2005) Resume information extraction with cascaded hybrid model. In: Proceedings of the 43rd annual meeting on association for computational linguistics, pp 499–506

Acknowledgements

The work is supported by the National Basic Research Program of China (2014CB340506), National Natural Science Foundation of China (61631013, 61561130160), a research fund supported by MSRA, and the Royal Society-Newton Advanced Fellowship Award.

Author information

Corresponding author

Correspondence to Jie Tang.

About this article

Cite this article

Gu, X., Yang, H., Tang, J. et al. Profiling Web users using big data. Soc. Netw. Anal. Min. 8, 24 (2018). https://doi.org/10.1007/s13278-018-0495-0

  • DOI: https://doi.org/10.1007/s13278-018-0495-0
