Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas

Data Mining and Knowledge Discovery

Abstract

Many application domains, such as intelligence analysis and cybersecurity, require tools for the unsupervised identification of suspicious entities in multi-relational/network data. In particular, there is a need for automated or semi-automated approaches to ‘uncover the plot’, i.e., to detect non-obvious coalitions of entities bridging many types of relations. We cast the problem of detecting such suspicious coalitions and their connections as one of mining surprisingly dense and well-connected chains of biclusters over multi-relational data. With this as our goal, we model the data using the Maximum Entropy principle, so that we can gauge, in a statistically well-founded way, the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally incorporate discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop knowledge discovery.
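To make the idea of gauging surprisingness against prior knowledge concrete, the following is a minimal sketch for a single binary relation. It is not the algorithm of the article. It assumes the only prior knowledge is the overall density of the relation, in which case the Maximum Entropy distribution treats every cell as an independent Bernoulli variable with that density. The function names (`background_model`, `information_content`, `absorb`) and the toy matrix are hypothetical and chosen for illustration only.

```python
import math


def density(matrix):
    """Fraction of cells equal to 1 in a binary matrix given as a list of lists."""
    cells = [value for row in matrix for value in row]
    return sum(cells) / len(cells)


def background_model(matrix):
    """Per-cell probability of a 1 when only the expected density is constrained.

    Assumption for this sketch: under that single constraint the MaxEnt
    distribution is an independent Bernoulli(p) per cell, with p the density.
    """
    p = density(matrix)
    return [[p for _ in row] for row in matrix]


def information_content(model, rows, cols):
    """Bits of surprise in learning that every cell in rows x cols equals 1.

    Requires every involved probability to be strictly positive.
    """
    return sum(-math.log2(model[i][j]) for i in rows for j in cols)


def absorb(model, rows, cols):
    """Return a copy of the model in which the given bicluster is known to hold."""
    updated = [list(row) for row in model]
    for i in rows:
        for j in cols:
            updated[i][j] = 1.0
    return updated


if __name__ == "__main__":
    # Hypothetical toy relation: rows and columns are two entity types.
    relation = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
    ]
    model = background_model(relation)
    print(information_content(model, [0, 1], [0, 1]))
    # After the analyst has seen the bicluster, it is no longer surprising.
    model = absorb(model, [0, 1], [0, 1])
    print(information_content(model, [0, 1], [0, 1]))
```

The design point the sketch illustrates is that the score of a pattern depends on the current background model: once a bicluster has been absorbed, the same cells carry no further information, so overlapping patterns score lower in later iterations.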

Notes

  1. Note that, to save computation, we do not update the MaxEnt model after adding each \(B^*\). However, in line with the local score, we know that adding a bicluster typically changes the distribution only locally, and because we never revisit the same relation \(R\) within a single chain \(C\), the information conveyed by \(B_i\) is unlikely to have much influence on the informativeness of \(B_{i+1}\).

  2. http://dac.cs.vt.edu/projects.

  3. http://www.alchemyapi.com/.

Acknowledgments

Jilles Vreeken is supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.

Author information

Corresponding author

Correspondence to Hao Wu.

Additional information

Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.

About this article

Cite this article

Wu, H., Vreeken, J., Tatti, N. et al. Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas. Data Min Knowl Disc 28, 1398–1428 (2014). https://doi.org/10.1007/s10618-014-0370-1

  • DOI: https://doi.org/10.1007/s10618-014-0370-1
