DOI: 10.1145/2987386.2987399

The Method of Semi-supervised Automatic Keyword Extraction for Web Documents using Transition Probability Distribution Generator

Published: 11 October 2016

Abstract

This paper introduces a semi-supervised method for automatic keyword extraction from web documents. The method uses an unconventional Markov chain with a conditional transition matrix for each distinct feature, with the transition probabilities distributed by a Transition Probability Distribution Generator (TPDG). Because keywords are the most appropriate and relevant words for defining the context of a document precisely and concisely, many applications that derive high-quality information from text, such as text data mining, text analytics, and other natural language processing tasks, can take advantage of them. The conditional transition matrices for each distinct feature are the state-of-the-art element of the model: they rely mostly on the characteristics of keywords and on the probability distribution of each feature over the state space to learn the sequential behavior of keywords across various web documents. According to the experimental results, the proposed method outperforms the baseline keyword extraction methods both in measured performance and in semantic terms.
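
The abstract does not spell out how the TPDG builds its matrices, so the following is only a minimal sketch of the general idea of a Markov chain with one transition matrix per feature value. It is not the authors' method. The two-state space (`other`, `keyword`), the feature values (`title`, `body`), the additive smoothing, and both function names are assumptions made for illustration, and the sketch estimates the matrices from labeled sequences only, leaving out the semi-supervised part.

```python
from collections import defaultdict

# Hypothetical two-state space; the paper's actual state space is not given
# in the abstract.
STATES = ("other", "keyword")


def estimate_transitions(sequences, smoothing=1.0):
    """Estimate one row-stochastic transition matrix per feature value.

    sequences: list of documents, each a list of (feature, state) pairs in
    token order. The matrix used for a transition is selected by the feature
    of the token being entered (an assumption of this sketch).
    """
    n = len(STATES)
    counts = defaultdict(lambda: [[smoothing] * n for _ in range(n)])
    for doc in sequences:
        for (_, prev_state), (feature, state) in zip(doc, doc[1:]):
            i, j = STATES.index(prev_state), STATES.index(state)
            counts[feature][i][j] += 1.0
    # Normalize each row of counts into probabilities.
    return {
        feature: [[c / sum(row) for c in row] for row in rows]
        for feature, rows in counts.items()
    }


def keyword_probabilities(features, matrices, start=(1.0, 0.0)):
    """Propagate a state distribution through an unlabeled token sequence.

    Returns P(state == "keyword") after each token. Features not seen during
    estimation fall back to a uniform transition matrix.
    """
    n = len(STATES)
    uniform = [[1.0 / n] * n for _ in range(n)]
    dist = list(start)
    probs = []
    for feature in features:
        m = matrices.get(feature, uniform)
        dist = [sum(dist[i] * m[i][j] for i in range(n)) for j in range(n)]
        probs.append(dist[STATES.index("keyword")])
    return probs


labeled = [
    [("title", "other"), ("title", "keyword"), ("body", "other"), ("body", "other")],
    [("body", "other"), ("title", "keyword"), ("body", "other")],
]
matrices = estimate_transitions(labeled)
print(keyword_probabilities(["title", "body"], matrices))
```

In this toy data, tokens carrying the hypothetical `title` feature are entered as keywords more often than `body` tokens, so the `title` matrix assigns a higher probability to the `keyword` state. Keeping a separate matrix per feature value lets each feature shape the chain independently, which is the property the abstract attributes to the conditional transition matrices.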




    Published In

    RACS '16: Proceedings of the International Conference on Research in Adaptive and Convergent Systems
    October 2016
    266 pages
    ISBN: 9781450344555
    DOI: 10.1145/2987386


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Automatic keyword extraction
    2. Information retrieval
    3. Transition probability distribution generator
    4. Web documents

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Science, ICT & Future Planning
    • Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Education(

    Conference

    RACS '16

    Acceptance Rates

    RACS '16 Paper Acceptance Rate 40 of 161 submissions, 25%;
    Overall Acceptance Rate 393 of 1,581 submissions, 25%


    Cited By

    • (2021) Automatic Keyword Extraction From Text Documents. Digital Technology Advancements in Knowledge Management, 71-91. DOI: 10.4018/978-1-7998-6792-0.ch004. Online publication date: 2021
    • (2021) Dijital Kütüphanelerde Dokümanlardan Bilgi Geri Kazanımı için Kullanılan Güncel Teknolojiler: Derleme Çalışması [Current Technologies for Information Retrieval of Documents in Digital Libraries: A Survey]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi, 9:1, 79-91. DOI: 10.29130/dubited.796964. Online publication date: 31-Jan-2021
    • (2019) A Novel Text Classification Approach Based on Word2vec and TextRank Keyword Extraction. 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), 536-543. DOI: 10.1109/DSC.2019.00087. Online publication date: Jun-2019
    • (2018) A Flexible Keyphrase Extraction Technique for Academic Literature. Procedia Computer Science, 135, 553-563. DOI: 10.1016/j.procs.2018.08.208. Online publication date: 2018
    • (2018) An improved method of automatic text summarization for web contents using lexical chain with semantic-related terms. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 22:12, 4013-4023. DOI: 10.1007/s00500-017-2612-9. Online publication date: 30-Dec-2018
    • (2017) SwiftRank: An Unsupervised Statistical Approach of Keyword and Salient Sentence Extraction for Individual Documents. Procedia Computer Science, 113, 472-477. DOI: 10.1016/j.procs.2017.08.305. Online publication date: 2017
