Facilitating pattern discovery for relation extraction with semantic-signature-based clustering

Published: 24 October 2011

Abstract

Hand-crafted textual patterns have been the mainstay device of practical relation extraction for decades. However, there has been little work on reducing the manual effort involved in the discovery of effective textual patterns for relation extraction. In this paper, we propose a clustering-based approach to facilitate the pattern discovery for relation extraction. Specifically, we define the notion of semantic signature to represent the most salient features of a textual fragment. We then propose a novel clustering algorithm based on semantic signature, S2C, and its enhancement S2C+. Experiments on two real-world data sets show that, when compared with k-means clustering, S2C and S2C+ are at least an order of magnitude faster, while generating high quality clusters that are at least comparable to the best clusters generated by k-means without requiring any manual tuning. Finally, a user study confirms that our clustering-based approach can indeed help users discover effective textual patterns for relation extraction with only a fraction of the manual effort required by the conventional approach.
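The abstract does not spell out how a semantic signature is computed or how S2C works, so the following is only a rough illustration of the general idea of signature-based grouping, not the authors' algorithm. It assumes a hypothetical stand-in signature (the `k` non-stopword tokens of a fragment that occur in the most fragments of the collection) and puts fragments with identical signatures in the same cluster. The names `STOPWORDS`, `tokenize`, `signature`, and `cluster_by_signature` are all invented for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = frozenset({"a", "an", "the", "of", "by", "was", "is", "in", "to"})


def tokenize(fragment):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in fragment.lower().split() if t not in STOPWORDS]


def signature(fragment, doc_freq, k=1):
    """Stand-in signature (an assumption, not the paper's definition):
    the k tokens of the fragment that appear in the most fragments,
    with ties broken alphabetically."""
    tokens = set(tokenize(fragment))
    ranked = sorted(tokens, key=lambda t: (-doc_freq[t], t))
    return frozenset(ranked[:k])


def cluster_by_signature(fragments, k=1):
    """Group fragments that share the same signature.
    Needs one pass to count tokens and one pass to assign clusters."""
    doc_freq = Counter()
    for fragment in fragments:
        doc_freq.update(set(tokenize(fragment)))

    clusters = defaultdict(list)
    for fragment in fragments:
        clusters[signature(fragment, doc_freq, k)].append(fragment)
    return dict(clusters)


fragments = ["acquired by", "was acquired by", "merged with", "has merged with"]
clusters = cluster_by_signature(fragments)
```

Unlike k-means, a grouping of this kind needs no preset number of clusters and no iterative refinement, which is in the spirit of the "no manual tuning" claim above. Whether S2C achieves its speed this way is not stated in the abstract.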

Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN:9781450307178
DOI:10.1145/2063576

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. information extraction
  3. pattern discovery
  4. relation extraction
  5. semantic signature

Qualifiers

  • Research-article

Conference

CIKM '11

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
