Solving submodular text processing problems using influence graphs

  • Review Article
  • Social Network Analysis and Mining

Abstract

Submodular functions arise in many important natural language processing problems, such as text summarization and dataset selection. Current graph-based approaches to these problems pay no special attention to submodularity and do not learn the graph model. Instead, they simply set each edge weight proportional to the similarity of the edge's two endpoints. We argue that such shallow modeling should be replaced by a deeper approach that learns the graph edge weights. To this end, we propose a new method for learning the graph model corresponding to the submodular function that is to be maximized. On a number of real-world networks, our method leads to a 50% error reduction compared to the previously used baseline methods. Furthermore, we apply our proposed method, followed by an influence maximization algorithm, to two NLP tasks: text summarization and k-means initialization for topic selection. Using these case studies, we experimentally show the advantage of our learning method over the previous shallow methods.
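To make the baseline criticized in the abstract concrete, the following is a minimal sketch of a similarity-weighted graph combined with greedy maximization of a monotone submodular objective. It is an illustration under stated assumptions, not the method proposed in the article: the cosine bag-of-words similarity, the facility-location objective, the toy sentences, and all function names (`similarity_graph`, `coverage`, `greedy_select`) are hypothetical choices made here. No edge weights are learned, which is exactly the limitation the abstract points to.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(sentences):
    """Baseline graph: each edge weight is the similarity of its endpoints.

    Assumption: nothing is learned here; this mirrors the 'shallow' setup.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    return [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]


def coverage(weights, selected):
    """Facility-location objective (an assumed stand-in, monotone submodular):
    every node is credited with its similarity to the closest selected node."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in weights)


def greedy_select(weights, k):
    """Greedy maximization: repeatedly add the node with the largest marginal gain."""
    selected = []
    for _ in range(min(k, len(weights))):
        base = coverage(weights, selected)
        candidates = [j for j in range(len(weights)) if j not in selected]
        best = max(candidates, key=lambda j: coverage(weights, selected + [j]) - base)
        selected.append(best)
    return selected


# Toy input (hypothetical): two sentences about a cat, two about markets.
sentences = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets fell as stock prices dropped",
]
weights = similarity_graph(sentences)
summary = greedy_select(weights, 2)
```

With a budget of two, the greedy step should pick one sentence from each topic, because a second sentence from an already covered topic adds little marginal gain. For monotone submodular objectives under a cardinality constraint, this greedy scheme is known to achieve at least a (1 - 1/e) fraction of the optimal value.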

Author information

Corresponding author

Correspondence to Ali Vardasbi.

About this article

Cite this article

Vardasbi, A., Faili, H. & Asadpour, M. Solving submodular text processing problems using influence graphs. Soc. Netw. Anal. Min. 9, 21 (2019). https://doi.org/10.1007/s13278-019-0559-9

  • DOI: https://doi.org/10.1007/s13278-019-0559-9
