Abstract
Large-scale network data sets have become increasingly accessible to researchers. While computer networks, networks of webpages and biological networks are all important sources of data, it is the study of social networks that is driving many new research questions. The popularity of online social networking sites can produce large, dynamic data sets of actor connectivity: sites such as Facebook and LinkedIn have 250 million and 43 million active users, respectively. Such systems offer researchers potential access to rich, large-scale networks for study. However, while some data sets can be collected directly from sources that explicitly define the actors and the ties between them, many other data sources have no explicit network structure. To transform such non-relational data into a relational format, two facets must be identified: the actors and the ties between the actors. In this chapter we survey a range of techniques that can be employed to identify unique actors when inferring networks from data sets without an explicit network structure. We present our methods for unique node identification of social network actors in a business scenario where a unique node identifier is not available. We validate these methods through a large-scale real-world case study of over 9 million records.
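As a rough illustration of the two facets named in the abstract, the following minimal Python sketch first groups records into unique actors and then infers ties between them. It is not the chapter's method: the record layout, the `booking` field, the 0.9 similarity threshold and the greedy matching strategy are all assumptions made for the example, and the string comparison simply uses the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: there is no unique actor identifier, only a
# free-text name and the booking each record belongs to.
records = [
    {"name": "John Smith", "booking": "B1"},
    {"name": "Jon Smith", "booking": "B2"},
    {"name": "Mary Jones", "booking": "B1"},
    {"name": "mary  jones", "booking": "B2"},
    {"name": "Alan Turing", "booking": "B3"},
]


def normalise(name):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def same_actor(a, b, threshold=0.9):
    """Treat two names as one actor when their similarity ratio is high enough."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold


def identify_actors(records, threshold=0.9):
    """Assign each record an actor id by greedy comparison with known actors."""
    representatives = []  # one representative name per actor found so far
    actor_ids = []
    for rec in records:
        for actor_id, rep in enumerate(representatives):
            if same_actor(rec["name"], rep, threshold):
                actor_ids.append(actor_id)
                break
        else:
            # No existing actor matched, so this record starts a new one.
            representatives.append(rec["name"])
            actor_ids.append(len(representatives) - 1)
    return actor_ids


def infer_ties(records, actor_ids):
    """Link two distinct actors whenever they appear in the same booking."""
    by_booking = {}
    for rec, actor_id in zip(records, actor_ids):
        by_booking.setdefault(rec["booking"], set()).add(actor_id)
    ties = set()
    for members in by_booking.values():
        ties.update(combinations(sorted(members), 2))
    return ties


actor_ids = identify_actors(records)
ties = infer_ties(records, actor_ids)
```

The greedy pass compares every record against every known actor, which would be too slow for millions of records; the blocking techniques covered in the record linkage literature exist to reduce that number of comparisons.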
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Farrugia, M., Quigley, A. (2010). Actor Identification in Implicit Relational Data Sources. In: Ting, IH., Wu, HJ., Ho, TH. (eds) Mining and Analyzing Social Networks. Studies in Computational Intelligence, vol 288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13422-7_5
DOI: https://doi.org/10.1007/978-3-642-13422-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13421-0
Online ISBN: 978-3-642-13422-7
eBook Packages: Engineering (R0)