The Method of Semi-supervised Automatic Keyword Extraction for Web Documents using Transition Probability Distribution Generator

Authors:

Htet Myet Lynn,

Pankoo Kim

RACS '16: Proceedings of the International Conference on Research in Adaptive and Convergent Systems

Pages 1 - 6

https://doi.org/10.1145/2987386.2987399

Published: 11 October 2016

Abstract

This paper introduces a semi-supervised method for automatic keyword extraction from web documents. The method uses an unconventional Markov chain with a conditional transition matrix for each distinct feature, where the transition probabilities are distributed by a Transition Probability Distribution Generator (TPDG). Because keywords are the most appropriate and relevant words that define the context of a document precisely and concisely, many applications can take advantage of them, including text data mining, text analytics, and other natural language processing tasks that derive high-quality information from text. The conditional transition matrices, which the paper presents as the state-of-the-art component of the model, rely mainly on the characteristics of keywords and on the probability distribution of each feature over the state space, so that the model can learn the sequential behavior of keywords across various web documents. According to the experimental results, the proposed method outperforms the baseline keyword extraction methods both in extraction performance and in the semantic quality of the extracted keywords.
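The abstract does not specify how TPDG builds its matrices, so the following is only a minimal sketch of the general idea of a Markov chain whose transition matrix is conditioned on a feature value. It is not the authors' TPDG. Everything named here is a hypothetical assumption made for illustration: the two-state space (`"K"` for keyword, `"O"` for other), the single categorical feature per token, the add-one smoothing, and the function names `estimate_conditional_transitions` and `keyword_probabilities`.

```python
from collections import defaultdict

# Assumed two-state space: "O" = non-keyword token, "K" = keyword token.
STATES = ("O", "K")


def estimate_conditional_transitions(sequences, smoothing=1.0):
    """Estimate one transition matrix per feature value.

    `sequences` is a list of token sequences; each token is a
    (feature_value, state) pair. The transition prev_state -> state is
    counted under the feature value of the *current* token (an assumption).
    """
    counts = defaultdict(lambda: {p: {n: 0.0 for n in STATES} for p in STATES})
    for seq in sequences:
        for (_, prev_state), (feature, state) in zip(seq, seq[1:]):
            counts[feature][prev_state][state] += 1.0

    matrices = {}
    for feature, table in counts.items():
        matrices[feature] = {}
        for prev_state, row in table.items():
            # Additive smoothing keeps every row a valid distribution.
            total = sum(row.values()) + smoothing * len(STATES)
            matrices[feature][prev_state] = {
                nxt: (cnt + smoothing) / total for nxt, cnt in row.items()
            }
    return matrices


def keyword_probabilities(features, matrices, start=None):
    """Propagate the state distribution along a token sequence.

    Returns the probability of state "K" after each token. Feature values
    never seen in training fall back to a uniform transition matrix.
    """
    dist = dict(start or {"O": 0.5, "K": 0.5})
    uniform = {p: {n: 1.0 / len(STATES) for n in STATES} for p in STATES}
    result = []
    for feature in features:
        matrix = matrices.get(feature, uniform)
        dist = {n: sum(dist[p] * matrix[p][n] for p in STATES) for n in STATES}
        result.append(dist["K"])
    return result


if __name__ == "__main__":
    # Toy labeled data: tokens tagged by a made-up "title"/"body" feature.
    train = [[("title", "K"), ("body", "O"), ("title", "K"), ("body", "O")]]
    model = estimate_conditional_transitions(train)
    print(keyword_probabilities(["title", "body"], model))
```

In this toy setup, tokens carrying the `"title"` feature end up with a higher keyword probability than `"body"` tokens, simply because the labeled example pairs that feature with the keyword state. A real system would combine several features and a far larger labeled set.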

Index Terms

Computing methodologies → Artificial intelligence → Natural language processing

Published In

RACS '16: Proceedings of the International Conference on Research in Adaptive and Convergent Systems

October 2016

266 pages

ISBN:9781450344555

DOI:10.1145/2987386

Copyright © 2016 ACM.

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing
ACCT: Association of Convergent Computing Technology

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Science, ICT & Future Planning
Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Education(

Conference

RACS '16

RACS '16: International Conference on Research in Adaptive and Convergent Systems

October 11 - 14, 2016

Odense, Denmark

Acceptance Rates

RACS '16 Paper Acceptance Rate 40 of 161 submissions, 25%

Overall Acceptance Rate 393 of 1,581 submissions, 25%

Bibliometrics

Article Metrics

Total Citations: 6
Total Downloads: 134
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 23 Feb 2025

Cited By

Goz F, Mutlu A (2021) Automatic Keyword Extraction From Text Documents. Digital Technology Advancements in Knowledge Management, 71-91. Online publication date: 2021.
https://doi.org/10.4018/978-1-7998-6792-0.ch004
Mutlu A, Abdisamad M, Kabasakal O, Göz F, Tüfekçi Ö, Küçük K (2021) Dijital Kütüphanelerde Dokümanlardan Bilgi Geri Kazanımı için Kullanılan Güncel Teknolojiler: Derleme Çalışması [Current Technologies for Information Retrieval of Documents in Digital Libraries: A Survey]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi, 9:1, 79-91. Online publication date: 31-Jan-2021.
https://doi.org/10.29130/dubited.796964
Song S, Wang Z, Xu S, Ni S, Xiao J (2019) A Novel Text Classification Approach Based on Word2vec and TextRank Keyword Extraction. 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), 536-543. Online publication date: Jun-2019.
https://doi.org/10.1109/DSC.2019.00087
Rabby G, Azad S, Mufti Mahmud, Zamli K, Mostafizur Rahman M (2018) A Flexible Keyphrase Extraction Technique for Academic Literature. Procedia Computer Science, 135, 553-563. Online publication date: 2018.
https://doi.org/10.1016/j.procs.2018.08.208
Lynn H, Choi C, Kim P (2018) An improved method of automatic text summarization for web contents using lexical chain with semantic-related terms. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 22:12, 4013-4023. Online publication date: 30-Dec-2018.
https://dl.acm.org/doi/10.1007/s00500-017-2612-9
Lynn H, Lee E, Choi C, Kim P (2017) SwiftRank: An Unsupervised Statistical Approach of Keyword and Salient Sentence Extraction for Individual Documents. Procedia Computer Science, 113, 472-477. Online publication date: 2017.
https://doi.org/10.1016/j.procs.2017.08.305
