Actor Identification in Implicit Relational Data Sources

  • Chapter
Mining and Analyzing Social Networks

Part of the book series: Studies in Computational Intelligence (SCI, volume 288)

Abstract

Large-scale network data sets have become increasingly accessible to researchers. While computer networks, networks of webpages and biological networks are all important sources of data, it is the study of social networks that is driving many new research questions. Researchers are finding that the popularity of online social networking sites can produce large, dynamic data sets of actor connectivity: Facebook has 250 million active users and LinkedIn has 43 million. Such systems offer researchers potential access to rich, large-scale networks for study. However, while data sets can be collected directly from sources that explicitly define the actors and the ties between them, many other data sources have no explicit network structure. To transform such non-relational data into a relational format, two facets must be identified: the actors and the ties between the actors. In this chapter we survey a range of techniques that can be employed to identify unique actors when inferring networks from data sets without an explicit network structure. We present our methods for unique node identification of social network actors in a business scenario where a unique node identifier is not available. We validate these methods on a large-scale, real-world case study of over 9 million records.
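
The two facets named in the abstract, actors and ties, can be illustrated with a minimal sketch. It is not the chapter's method: the record fields (`name`, `group`), the similarity threshold and the greedy matching rule are all assumptions made for illustration, and the string comparison uses Python's standard `difflib` rather than any technique the chapter may evaluate. Records without a unique identifier are first grouped into actors by name similarity, then a tie is inferred between actors that share a group (for example, the same transaction).

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical match rule: similarity ratio of the normalized names."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def identify_actors(records, threshold=0.85):
    """Return one actor id per record; similar names share an id.

    Greedy single pass: each record is compared with the first name
    seen for every actor found so far (an illustrative simplification).
    """
    representatives = []  # first name seen for each actor
    actor_of = []
    for rec in records:
        for actor_id, rep in enumerate(representatives):
            if similar(rec["name"], rep, threshold):
                actor_of.append(actor_id)
                break
        else:
            representatives.append(rec["name"])
            actor_of.append(len(representatives) - 1)
    return actor_of


def infer_ties(records, actor_of):
    """Infer a tie between distinct actors that appear in the same group."""
    by_group = {}
    for rec, actor in zip(records, actor_of):
        by_group.setdefault(rec["group"], set()).add(actor)
    ties = set()
    for actors in by_group.values():
        ties.update(combinations(sorted(actors), 2))
    return ties


# Made-up records: no unique identifier, only a name and a shared group key.
records = [
    {"name": "John Smith", "group": "T1"},
    {"name": "Mary Jones", "group": "T1"},
    {"name": "john  smith", "group": "T2"},
    {"name": "Peter Brown", "group": "T2"},
]
actor_of = identify_actors(records)
ties = infer_ties(records, actor_of)
```

Here the first and third records resolve to the same actor, so that actor ends up tied to both of the others. On millions of records a pairwise comparison like this would be too slow; the sketch is meant only to make the actor and tie steps concrete.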

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Farrugia, M., Quigley, A. (2010). Actor Identification in Implicit Relational Data Sources. In: Ting, IH., Wu, HJ., Ho, TH. (eds) Mining and Analyzing Social Networks. Studies in Computational Intelligence, vol 288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13422-7_5

  • DOI: https://doi.org/10.1007/978-3-642-13422-7_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13421-0

  • Online ISBN: 978-3-642-13422-7

  • eBook Packages: Engineering, Engineering (R0)