Profiling Web users using big data

Gu, Xiaotao; Yang, Hong; Tang, Jie; Zhang, Jing; Zhang, Fanjin; Liu, Debing; Hall, Wendy; Fu, Xiao

doi:10.1007/s13278-018-0495-0

Original Article
Published: 22 March 2018

Volume 8, article number 24 (2018)
Social Network Analysis and Mining

Xiaotao Gu¹,
Hong Yang¹,
Jie Tang¹,
Jing Zhang²,
Fanjin Zhang¹,
Debing Liu¹,
Wendy Hall³ &
Xiao Fu⁴

Abstract

Profiling Web users is a fundamental issue for Web mining and social network analysis. Its basic tasks include extracting basic information, mining user preferences, and inferring user demographics (Tang et al. in ACM Trans Knowl Discov Data 5(1):2:1–2:44, 2010). Although the methodologies for handling the three tasks differ, they usually share two stages: first identify the relevant pages (data) of a user, and then use machine learning models (e.g., SVM, CRFs, or DL) to extract/mine/infer profile attributes from each page. These methods were successful on the traditional Web but face growing challenges as the Web rapidly evolves: each person's information is distributed over the Web and changes dynamically. As a result, the available data for a user on the Web is redundant, and some sources may be out-of-date or incorrect. The traditional two-stage method suffers from data inconsistency and error propagation between the two stages. In this paper, we revisit the problem of Web user profiling in the big data era and propose a simple but very effective approach, referred to as MagicFG, for profiling Web users by leveraging the power of big data. To avoid error propagation, the approach processes all the extracting/mining/inferring subtasks in one unified framework. To improve the profiling performance, we present the concept of contextual credibility. The proposed framework also supports the incorporation of human knowledge: it defines human knowledge as Markov logic statements and formalizes them into a factor graph model. The MagicFG method has been deployed in an online system, AMiner.org, for profiling millions of researchers, e.g., extracting E-mail, inferring Gender, and mining research interests. Our empirical study in the real system shows that the proposed method offers significantly improved (+4–6%; $p\ll 0.01$, t test) profiling performance in comparison with several baseline methods using rules, classification, and sequential labeling.

Notes

https://aminer.org. AMiner aims to understand scientific text and networks. The system extracts researchers' profiles automatically from the Web. So far, the system has built more than 130,000,000 researcher profiles and provides a set of unique functions, including expert search, social influence analysis, collaboration recommendation, and community evolution. The system has been in operation since 2006 and has attracted more than 8,320,000 independent IP accesses from over 220 countries/regions.
http://emailbreaker.com.
https://emailhunter.com.
http://www.getsidekick.com.
https://www.gender-api.com/.
https://genderize.io/.
One example of the heuristic rule: “(([a-z0-9]+)(\.| dot | \. )?)+(@| at | \[at\] |\[at\])(([a-z0-9\]+)(\.| dot | \. \[dot\] ))+([a-z]+)”.
We call the string before “@” in an E-mail candidate the prefix of the E-mail, and the string after “@” its domain.
In our experiments, we use Face++, http://www.faceplusplus.com/.
https://aminer.org/profiling/.
The Gender dataset is larger than the previously released one and has a more balanced distribution of the two Genders.
https://aminer.org/gender.
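
The notes above mention a heuristic rule for matching E-mail candidates written in obfuscated forms (e.g., “ at ”, “[at]”, “ dot ”) and define the prefix and domain of a candidate. The following is a minimal Python sketch of that idea, not the rule used in the paper: the pattern is a simplified, hypothetical one, and the names AT, DOT, LABEL, EMAIL_RE, and extract_emails are introduced here for illustration only.

```python
import re

# Hypothetical, simplified building blocks (not the rule quoted in the notes).
# AT:    "@", " at ", or "[at]" (optionally surrounded by spaces)
# DOT:   ".", " dot ", or "[dot]" (optionally surrounded by spaces)
# LABEL: one run of letters, digits, or hyphens
AT = r"(?:\s*@\s*|\s+at\s+|\s*\[at\]\s*)"
DOT = r"(?:\.|\s+dot\s+|\s*\[dot\]\s*)"
LABEL = r"[a-z0-9-]+"

# prefix: labels joined by DOT; domain: at least two labels joined by DOT.
EMAIL_RE = re.compile(
    rf"(?P<prefix>{LABEL}(?:{DOT}{LABEL})*){AT}(?P<domain>{LABEL}(?:{DOT}{LABEL})+)",
    re.IGNORECASE,
)


def extract_emails(text):
    """Return (prefix, domain) pairs for every E-mail candidate in text.

    Obfuscated dots are rewritten to "." and the result is lowercased,
    so "example dot edu" becomes "example.edu".
    """
    candidates = []
    for match in EMAIL_RE.finditer(text):
        prefix = re.sub(DOT, ".", match.group("prefix"), flags=re.IGNORECASE)
        domain = re.sub(DOT, ".", match.group("domain"), flags=re.IGNORECASE)
        candidates.append((prefix.lower(), domain.lower()))
    return candidates


print(extract_emails("Contact: alice.smith [at] example dot edu"))
# prints [('alice.smith', 'example.edu')]
```

A pattern this permissive also matches ordinary phrases such as “look at example.com”, so its output is best treated as a set of candidates to be filtered by a later step rather than as final extractions.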

References

Alani H, Kim S, Millard DE, Weal MJ, Hall W, Lewis PH, Shadbolt NR (2003) Automatic ontology-based knowledge extraction from web documents. IEEE Intell Syst 18(1):14–21
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. ACM Press, New York
Balog K, Azzopardi L, de Rijke M (2006) Formal models for expert finding in enterprise corpora. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval, pp 43–55
Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the web. In: Proceedings of the 20th international joint conference on artificial intelligence, pp 2670–2676
Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 59–68
Bi B, Shokouhi M, Kosinski M, Graepel T (2013) Inferring the demographics of search users: social data meets search queries. In: Proceedings of the 22nd international conference on world wide web, pp 131–140
Blanco L, Bronzi M, Crescenzi V, Merialdo P, Papotti P (2010) Redundancy-driven web data extraction and integration. In: Proceedings of the 13th international workshop on the web and databases, pp 7:1–7:6
Brajnik G, Guida G, Tasso C (1987) User modeling in intelligent information retrieval. Inf Process Manag 23(4):305–320
Chan PK (1999) Constructing web user profiles: a non-invasive learning approach. In: KDD-99 workshop on web usage analysis and user profiling, pp 39–55
Collins M (2002) Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In: Proceedings of the 40th annual meeting on association for computational linguistics, pp 489–496
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Cox DR (1958) The regression analysis of binary sequences. J Roy Stat Soc Ser B (Methodol) 20(2):215–242
Cunningham H, Maynard D, Bontcheva K, Tablan V (2002) GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 168–175
Dong Y, Yang Y, Tang J, Yang Y, Chawla NV (2014) Inferring user demographics and social strategies in mobile social networks. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 15–24
Downey D, Etzioni O, Soderland S (2005) A probabilistic model of redundancy in information extraction. In: Proceedings of the 19th international joint conference on artificial intelligence, pp 1034–1041
Efstathiades H, Antoniades D, Pallis G, Dikaiakos MD (2016) Users key locations in online social networks: identification and applications. Soc Netw Anal Min 6(1):66:1–66:17
Eltaher M, Lee J (2015) User profiling of Flickr: integrating multiple types of features for gender classification. J Adv Inf Technol 6(2):84–87
Figueiredo F, Ribeiro B, Almeida JM, Faloutsos C (2016) TribeFlow: mining and predicting user trajectories. In: Proceedings of the 25th international conference on world wide web, pp 695–706
Finkel JR, Grenager T, Manning C (2005) Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd annual meeting on association for computational linguistics, pp 363–370
Ghahramani Z, Jordan MI (1997) Factorial hidden Markov models. Mach Learn 29(2–3):245–273
Hammersley JM, Clifford P (1971) Markov fields on finite graphs and lattices
Hu J, Zeng HJ, Li H, Niu C, Chen Z (2007) Demographic prediction based on user’s browsing behavior. In: Proceedings of the 16th international conference on world wide web, pp 151–160
Ikeda K, Hattori G, Ono C, Asoh H, Higashino T (2013) Twitter user profiling based on text and community mining for market analysis. Knowl Based Syst 51(1):35–47
Joseph K, Wei W, Carley KM (2016) Exploring patterns of identity usage in tweets: a new problem, solution and case study. In: Proceedings of the 25th international conference on world wide web, pp 401–412
Kristjansson T, Culotta A, Viola P, McCallum A (2004) Interactive information extraction with constrained conditional random fields. In: Proceedings of the 19th national conference on artificial intelligence, pp 412–418
Krulwich B (1997) Lifestyle finder: intelligent user profiling using large-scale demographic data. AI Mag 18(2):37–45
Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th international conference on machine learning, pp 282–289
Li R, Wang S, Deng H, Wang R, Chang KCC (2012) Towards social user profiling: unified and discriminative influence model for inferring home locations. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1023–1031
Li J, Ritter A, Hovy E (2014) Weakly supervised user profile extraction from Twitter. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, pp 165–174
Makazhanov A, Rafiei D, Waqar M (2014) Predicting political preference of Twitter users. Soc Netw Anal Min 4(1):193:1–193:15
McCallum A, Freitag D, Pereira FCN (2000) Maximum entropy Markov models for information extraction and segmentation. In: Proceedings of the 17th international conference on machine learning, pp 591–598
Michelson M, Knoblock C (2007) Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web. Int J Doc Anal Recogn 10(3):211–226
Pazzani M, Billsus D (1997) Learning and revising user profiles: the identification of interesting web sites. Mach Learn 27(3):313–331
Pedro JS, Siersdorfer S, Sanderson M (2011) Content redundancy in YouTube and its application to video tagging. ACM Trans Inf Syst 29(3):13:1–13:31
Richardson M, Domingos P (2006) Markov logic networks. Mach Learn 62(1–2):107–136
Ritze D, Lehmberg O, Oulabi Y, Bizer C (2016) Profiling the potential of web tables for augmenting cross-domain knowledge bases. In: Proceedings of the 25th international conference on world wide web, pp 251–261
Sarawagi S, Cohen WW (2004) Semi-Markov conditional random fields for information extraction. In: Proceedings of the 17th neural information processing systems, pp 1185–1192
Sarraute C, Brea J, Burroni J, Blanc P (2015) Inference of demographic attributes based on mobile phone usage patterns and social network topology. Soc Netw Anal Min 5(1):39:1–39:18
Soltysiak SJ, Crabtree IB (1998) Automatic learning of user profiles—towards the personalisation of agent services. BT Technol J 16(3):110–117
Szell M, Thurner S (2012) How women organize social networks different from men. ArXiv preprint arXiv:1205.4683
Tang J, Hong M, Li J, Liang B (2006) Tree-structured conditional random fields for semantic annotation. In: Proceedings of the 5th international conference on the semantic web, pp 640–653
Tang J, Hong M, Zhang D, Liang B, Li J (2007a) Emerging technologies of text mining: techniques and applications. Chap. Information extraction: methodologies and applications, pp 1–33. Idea Group Inc.
Tang J, Zhang D, Yao L (2007b) Social network extraction of academic researchers. In: Proceedings of the 7th IEEE international conference on data mining, pp 292–301
Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z (2008) Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 990–998
Tang J, Yao L, Zhang D, Zhang J (2010) A combination approach to web user profiling. ACM Trans Knowl Discov Data 5(1):2:1–2:44
Tang W, Zhuang H, Tang J (2011a) Learning to infer social ties in large networks. In: ECML/PKDD’11, pp 381–397
Tang C, Ross K, Saxena N, Chen R (2011b) What’s in a name: a study of names, gender inference, and gender behavior in Facebook. In: Proceedings of the 16th international conference on database systems for advanced applications, pp 344–356
Tang J, Fang Z, Sun J (2013) Incorporating social context and domain knowledge for entity recognition. In: Proceedings of the 24th international conference on world wide web, pp 517–526
Tang J, Lou T, Kleinberg J, Wu S (2016) Transfer learning to infer social ties across heterogeneous networks. ACM Trans Inf Syst 34(2):7:1–7:43
Weninger T, Han J (2013) Exploring structure and content on the web: extraction and integration of the semi-structured web. In: Proceedings of the 6th ACM international conference on web search and data mining, pp 779–780
Weninger T, Hsu WH, Han J (2010) CETR: content extraction via tag ratios. In: Proceedings of the 19th international conference on world wide web, pp 971–980
Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: Proceedings of the 24th international conference on world wide web, pp 1264–1274
Wu L, Ge Y, Liu Q, Chen E, Long B, Huang Z (2016) Modeling users’ preferences and social links in social networking services: a joint-evolving perspective. In: Proceedings of the 30th AAAI conference on artificial intelligence, pp 279–286
Yedidia JS, Freeman WT, Weiss Y (2000) Generalized belief propagation. In: Proceedings of the 13th neural information processing systems, pp 689–695
Yu K, Guan G, Zhou M (2005) Resume information extraction with cascaded hybrid model. In: Proceedings of the 43rd annual meeting on association for computational linguistics, pp 499–506

Acknowledgements

The work is supported by the National Basic Research Program of China (2014CB340506), National Natural Science Foundation of China (61631013, 61561130160), a research fund supported by MSRA, and the Royal Society-Newton Advanced Fellowship Award.

Author information

Xiaotao Gu and Hong Yang have contributed equally to this work.

Authors and Affiliations

Department of Computer Science, Tsinghua University, Beijing, China
Xiaotao Gu, Hong Yang, Jie Tang, Fanjin Zhang & Debing Liu
Department of Computer Science, Renmin University, Beijing, China
Jing Zhang
Electronics and Computer Science, University of Southampton, Southampton, UK
Wendy Hall
Huawei Technologies Co. Ltd., Shenzhen, China
Xiao Fu

Corresponding author

Correspondence to Jie Tang.

About this article

Cite this article

Gu, X., Yang, H., Tang, J. et al. Profiling Web users using big data. Soc. Netw. Anal. Min. 8, 24 (2018). https://doi.org/10.1007/s13278-018-0495-0

Received: 16 January 2017
Accepted: 16 February 2018
Published: 22 March 2018
DOI: https://doi.org/10.1007/s13278-018-0495-0
