DOI: 10.1145/1281192.1281234
Article

Detecting research topics via the correlation between graphs and texts

Published: 12 August 2007

Abstract

In this paper we address the problem of detecting topics in large-scale linked document collections. Recently, topic detection has become a very active area of research due to its utility for information navigation, trend analysis, and high-level description of data. We present a unique approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph where the nodes are limited to the documents containing the term. This tight coupling between term and graph analysis is distinguished from other approaches such as those that focus on language models. We develop a topic score measure for each term, using the likelihood ratio of binary hypotheses based on a probabilistic description of graph connectivity. Our approach is based on the intuition that if a term is relevant to a topic, the documents containing the term have denser connectivity than a random selection of documents. We extend our algorithm to detect a topic represented by a set of terms, using the intuition that if the co-occurrence of terms represents a new topic, the citation pattern should exhibit a synergistic effect. We test our algorithm on two electronic research literature collections, arXiv and Citeseer. Our evaluation shows that the approach is effective and reveals some novel aspects of topic detection.
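
The intuition in the abstract (documents containing a topical term are more densely linked than a random selection) can be illustrated with a small sketch. This is not the paper's actual topic score, which the abstract does not spell out. It is a simplified stand-in that assumes each pair of documents is linked independently, treats citations as undirected, and compares two hypotheses with a binomial log-likelihood ratio: the term's subgraph has its own link density, versus it has only the background density of the whole collection. The names `count_links` and `topic_score` and the toy data are hypothetical.

```python
import math


def count_links(edges, docs):
    """Return (linked_pairs, possible_pairs) for the subgraph induced by `docs`.

    Assumption: citations are treated as undirected links.
    """
    docs = set(docs)
    n = len(docs)
    possible = n * (n - 1) // 2
    linked = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in docs and v in docs
    }
    return len(linked), possible


def topic_score(edges, all_docs, term_docs, eps=1e-9):
    """Illustrative log-likelihood ratio (not the paper's exact measure).

    H1: pairs among term_docs link with their own density p1.
    H0: they link with the background density p0 of the whole collection.
    Returns 0.0 when the term's subgraph is no denser than the background.
    """
    k, m = count_links(edges, term_docs)
    k_all, m_all = count_links(edges, all_docs)
    if m == 0 or m_all == 0:
        return 0.0

    def clamp(p):
        # Keep probabilities strictly inside (0, 1) so the logs stay finite.
        return min(max(p, eps), 1.0 - eps)

    p0 = clamp(k_all / m_all)
    p1 = clamp(k / m)
    if p1 <= p0:
        return 0.0
    return k * math.log(p1 / p0) + (m - k) * math.log((1.0 - p1) / (1.0 - p0))


# Toy collection of 10 documents (hypothetical data).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (4, 8), (5, 9)]
all_docs = range(10)
dense_term = {0, 1, 2, 3}   # documents that cite each other heavily
sparse_term = {4, 5, 6, 7}  # documents with no links among themselves

print(topic_score(edges, all_docs, dense_term))
print(topic_score(edges, all_docs, sparse_term))
```

Under the same assumptions, a set of terms could be scored by passing the intersection of the terms' document sets as `term_docs` and comparing the result with the scores of the individual terms. That comparison is only a rough proxy for the synergistic effect the abstract mentions.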

    Published In

    KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2007
    1080 pages
    ISBN:9781595936097
    DOI:10.1145/1281192

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. citation graphs
    2. correlation of text and links
    3. graph mining
    4. probabilistic measure
    5. topic detection

    Qualifiers

    • Article

    Conference

    KDD07

    Acceptance Rates

    KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
