Actor Identification in Implicit Relational Data Sources

Farrugia, Michael; Quigley, Aaron

doi:10.1007/978-3-642-13422-7_5

Part of the book series: Studies in Computational Intelligence ((SCI,volume 288))

Abstract

Large-scale network data sets have become increasingly accessible to researchers. While computer networks, networks of webpages and biological networks are all important sources of data, it is the study of social networks that is driving many new research questions. Researchers are finding that the popularity of online social networking sites may produce large dynamic data sets of actor connectivity. Sites such as Facebook and LinkedIn have 250 million and 43 million active users, respectively. Such systems offer researchers potential access to rich large-scale networks for study. However, while data sets can be collected directly from sources that specifically define the actors and the ties between those actors, there are many other data sources that do not have an explicit network structure defined. To transform such non-relational data into a relational format, two facets must be identified: the actors and the ties between the actors. In this chapter we survey a range of techniques that can be employed to identify unique actors when inferring networks from non-explicit network data sets. We present our methods for unique node identification of social network actors in a business scenario where a unique node identifier is not available. We validate these methods on a large-scale real-world case study of over 9 million records.
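
To make the two facets concrete, the following is a minimal sketch of how actors and ties might be derived from records that carry no unique identifier. It is not the chapter's method. The record fields (`name`, `dob`, `booking`), the similarity threshold and the matching rule are all hypothetical choices made for illustration, and the string comparison uses Python's standard `difflib` rather than any specific technique from the chapter.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def same_actor(a, b, threshold=0.85):
    """Hypothetical rule: same date of birth and sufficiently similar names."""
    if a["dob"] != b["dob"]:
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def identify_actors(records, threshold=0.85):
    """Give each record an actor id by merging matching pairs (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison; fine for a toy example only.
    for i, j in combinations(range(len(records)), 2):
        if same_actor(records[i], records[j], threshold):
            parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]


def infer_ties(records, actor_ids):
    """Link two actors when their records share a booking reference."""
    by_booking = {}
    for rec, actor in zip(records, actor_ids):
        by_booking.setdefault(rec["booking"], set()).add(actor)
    ties = set()
    for actors in by_booking.values():
        ties.update(combinations(sorted(actors), 2))
    return ties


# Invented sample data: four records that describe two people.
records = [
    {"name": "Jon Smith", "dob": "1980-01-02", "booking": "B1"},
    {"name": "John Smith", "dob": "1980-01-02", "booking": "B2"},
    {"name": "Mary Jones", "dob": "1975-05-05", "booking": "B1"},
    {"name": "Mary  Jones ", "dob": "1975-05-05", "booking": "B3"},
]
actor_ids = identify_actors(records)
ties = infer_ties(records, actor_ids)
```

In this toy data the two spellings of each name collapse into one actor, and the shared booking `B1` yields a single tie between the two actors. The all-pairs loop grows quadratically with the number of records, so a data set of millions of records would presumably need some way of limiting which pairs are compared.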

Author information

Authors and Affiliations

UCD Dublin, Dublin, Ireland
Michael Farrugia & Aaron Quigley

Editor information

Editors and Affiliations

Department of Information Management, National University of Kaohsiung, No. 700 Kaohsiung University Rd, 811, Kaohsiung, Taiwan
I-Hsien Ting, Tien-Hwa Ho & Hui-Ju Wu

About this chapter

Cite this chapter

Farrugia, M., Quigley, A. (2010). Actor Identification in Implicit Relational Data Sources. In: Ting, IH., Wu, HJ., Ho, TH. (eds) Mining and Analyzing Social Networks. Studies in Computational Intelligence, vol 288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13422-7_5

DOI: https://doi.org/10.1007/978-3-642-13422-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13421-0
Online ISBN: 978-3-642-13422-7
eBook Packages: Engineering, Engineering (R0)
