Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas

Wu, Hao; Vreeken, Jilles; Tatti, Nikolaj; Ramakrishnan, Naren

doi:10.1007/s10618-014-0370-1

Published: 22 July 2014

Volume 28, pages 1398–1428, (2014)

Data Mining and Knowledge Discovery

Hao Wu^1,2,
Jilles Vreeken^3,4,
Nikolaj Tatti^5,6 &
Naren Ramakrishnan^2,7


Abstract

Many application domains such as intelligence analysis and cybersecurity require tools for the unsupervised identification of suspicious entities in multi-relational/network data. In particular, there is a need for automated or semi-automated approaches to ‘uncover the plot’, i.e., to detect non-obvious coalitions of entities bridging many types of relations. We cast the problem of detecting such suspicious coalitions and their connections as one of mining surprisingly dense and well-connected chains of biclusters over multi-relational data. With this as our goal, we model data by the Maximum Entropy principle, such that in a statistically well-founded way we can gauge the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally organize discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop, knowledge discovery.
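To make the notion of surprisingness under a Maximum Entropy background model concrete, the following is a minimal sketch and not the model used in the article. It assumes the only background knowledge is the overall density of a single binary relation, in which case the maximum entropy distribution treats every cell as an independent Bernoulli variable with that density. The function name `bicluster_surprise_bits` and the toy relation `R` are hypothetical and chosen purely for illustration.

```python
import math

def bicluster_surprise_bits(relation, rows, cols):
    """Self-information (in bits) of the cells in rows x cols under a
    density-only MaxEnt background model (independent Bernoulli cells).

    Assumption: the background knowledge is just the overall density of
    the relation; richer constraints would require a fitted model.
    """
    n_cells = sum(len(r) for r in relation)
    density = sum(sum(r) for r in relation) / n_cells
    bits = 0.0
    for i in rows:
        for j in cols:
            # Probability the background model assigns to the observed value.
            p = density if relation[i][j] == 1 else 1.0 - density
            bits += -math.log2(p)
    return bits

# Toy 4x4 binary relation (hypothetical data): rows are entities of one
# type, columns are entities of another; 1 means the pair is related.
R = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

dense = bicluster_surprise_bits(R, [0, 1], [0, 1])   # all-ones block
sparse = bicluster_surprise_bits(R, [2, 3], [2, 3])  # all-zeros block
print(f"dense block: {dense:.2f} bits, sparse block: {sparse:.2f} bits")
```

Under this simplified background model, a fully dense block in a sparse relation carries more bits than an equally sized empty block, which matches the intuition that dense biclusters are the surprising ones when most entity pairs are unrelated.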


Notes

Note that, to save computation, we do not update the MaxEnt model after adding each \(B^*\). However, in line with the local score, we know that adding a bicluster typically changes the distribution only locally, and as we never revisit the same relation \(R\) within a single chain \(C\), the information contributed by \(B_i\) is unlikely to have much influence on the informativeness of \(B_{i+1}\).
http://dac.cs.vt.edu/projects.
http://www.alchemyapi.com/.


Acknowledgments

Jilles Vreeken is supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Virginia Tech, Arlington, VA, USA
Hao Wu
Discovery Analytics Center, Virginia Tech, Arlington, VA, USA
Hao Wu & Naren Ramakrishnan
Max Planck Institute for Informatics, Saarbrücken, Germany
Jilles Vreeken
Cluster of Excellence MMCI, Saarland University, Saarbrücken, Germany
Jilles Vreeken
HIIT, Department of Information and Computer Science, Aalto University, Aalto, Finland
Nikolaj Tatti
Department of Computer Science, KU Leuven, Leuven, Belgium
Nikolaj Tatti
Department of Computer Science, Virginia Tech, Arlington, VA, USA
Naren Ramakrishnan

Corresponding author

Correspondence to Hao Wu.

Additional information

Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.

About this article

Cite this article

Wu, H., Vreeken, J., Tatti, N. et al. Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas. Data Min Knowl Disc 28, 1398–1428 (2014). https://doi.org/10.1007/s10618-014-0370-1

Received: 02 November 2013
Accepted: 21 June 2014
Published: 22 July 2014
Issue Date: September 2014
DOI: https://doi.org/10.1007/s10618-014-0370-1
