DOI: 10.1145/3097983.3098105
Research article, public access

MetaPAD: Meta Pattern Discovery from Massive Text Corpora

Published: 04 August 2017 Publication History

Abstract

Mining textual patterns in news, tweets, papers, and many other kinds of text corpora has been an active theme in text mining and NLP research. Previous studies adopt a dependency parsing-based pattern discovery approach. However, the parsing results lose rich context around entities in the patterns, and the process is costly for a corpus of large scale. In this study, we propose a novel typed textual pattern structure, called meta pattern, which is extended to a frequent, informative, and precise subsequence pattern in certain context. We propose an efficient framework, called MetaPAD, which discovers meta patterns from massive corpora with three techniques: (1) it develops a context-aware segmentation method to carefully determine the boundaries of patterns with a learnt pattern quality assessment function, which avoids costly dependency parsing and generates high-quality patterns; (2) it identifies and groups synonymous meta patterns from multiple facets: their types, contexts, and extractions; and (3) it examines type distributions of entities in the instances extracted by each group of patterns, and looks for appropriate type levels to make discovered patterns precise. Experiments demonstrate that our proposed framework discovers high-quality typed textual patterns efficiently from different genres of massive corpora and facilitates information extraction.
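
To make the idea of a typed textual pattern concrete, the following is a minimal sketch, not the MetaPAD algorithm itself. It assumes a tiny hand-written entity lexicon (`ENTITY_TYPES`) and uses plain frequency counting of contiguous n-grams in place of the learnt quality function and context-aware segmentation described in the abstract. The helper names `type_sentence` and `candidate_patterns`, the `$TYPE` token convention, and the example sentences are all hypothetical.

```python
from collections import Counter

# Hypothetical toy entity-type lexicon (an assumption; not from the paper).
ENTITY_TYPES = {
    "Barack Obama": "$PERSON",
    "Justin Trudeau": "$PERSON",
    "United States": "$COUNTRY",
    "Canada": "$COUNTRY",
}


def type_sentence(sentence, lexicon):
    """Replace each known entity mention with its type token, then tokenize."""
    # Longest mentions first, so multi-word names are replaced before substrings.
    for mention in sorted(lexicon, key=len, reverse=True):
        sentence = sentence.replace(mention, lexicon[mention])
    return sentence.split()


def candidate_patterns(corpus, lexicon, min_support=2, max_len=6):
    """Count contiguous n-grams that contain at least one type token.

    This is simple frequency filtering, a stand-in for the learnt pattern
    quality assessment mentioned in the abstract.
    """
    counts = Counter()
    for sentence in corpus:
        tokens = type_sentence(sentence, lexicon)
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if any(tok.startswith("$") for tok in gram):
                    counts[gram] += 1
    return {" ".join(g): c for g, c in counts.items() if c >= min_support}


corpus = [
    "President Barack Obama of United States spoke today",
    "Prime Minister Justin Trudeau of Canada spoke today",
]
patterns = candidate_patterns(corpus, ENTITY_TYPES)
```

In this toy corpus, typed subsequences such as `$PERSON of $COUNTRY` recur across both sentences, while `President $PERSON` appears only once and is filtered out. Deciding which of the surviving candidates has the right boundaries, and which type level is appropriate, is what the three techniques in the abstract address.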


    Published In

    KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    August 2017
    2240 pages
    ISBN:9781450348874
    DOI:10.1145/3097983

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. attribute discovery
    2. context-aware segmentation
    3. information extraction
    4. meta pattern
    5. natural language processing
    6. pattern discovery
    7. synonymous meta pattern
    8. text corpora
    9. text mining
    10. textual pattern

    Qualifiers

    • Research-article

    Conference

    KDD '17
    Acceptance Rates

    KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
