ABSTRACT
This paper introduces a semi-supervised method for automatic keyword extraction from web documents. The method uses an unconventional Markov chain with a conditional transition matrix for each distinct feature, distributed by a Transition Probability Distribution Generator (TPDG). Keywords are the set of the most appropriate and relevant words that define the context of a document precisely and concisely, so many applications that derive high-quality information from text, such as text data mining, text analytics, and other natural language processing tasks, can take advantage of keyword extraction. The model's conditional transition matrices, one for each distinct feature, are its state-of-the-art component; they rely mostly on the characteristics of keywords and on the probability distribution of each feature over the state space in order to learn the sequential behavior of keywords in various web documents. According to the experimental results, the proposed method outperforms the baseline keyword extraction methods in terms of both performance and semantic quality.
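The idea of a Markov chain whose transition matrix depends on a feature can be illustrated with a minimal sketch. This is not the paper's TPDG, which the abstract does not specify. It assumes a hypothetical two-state space (`keyword`, `other`), a single categorical feature per token (for example, where the token appears in the page), and add-one smoothing. All names below are illustrative.

```python
from collections import defaultdict

# Hypothetical state space; the abstract does not define the actual states.
STATES = ("keyword", "other")


def estimate_transitions(labeled_docs, smoothing=1.0):
    """Estimate one row-stochastic transition matrix per feature value.

    labeled_docs: list of token sequences, each a list of (feature, state) pairs.
    The transition into a token is conditioned on that token's feature.
    Returns {feature: {prev_state: {next_state: probability}}}.
    """
    counts = defaultdict(
        lambda: {s: {t: smoothing for t in STATES} for s in STATES}
    )
    for doc in labeled_docs:
        for (_, prev_state), (feature, next_state) in zip(doc, doc[1:]):
            counts[feature][prev_state][next_state] += 1.0

    matrices = {}
    for feature, rows in counts.items():
        matrices[feature] = {
            s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in rows.items()
        }
    return matrices


def propagate(features, matrices, start=None):
    """Return the state distribution after each step of a feature sequence."""
    uniform_row = {t: 1.0 / len(STATES) for t in STATES}
    fallback = {s: uniform_row for s in STATES}  # used for unseen features
    dist = dict(start or uniform_row)
    history = []
    for feature in features:
        matrix = matrices.get(feature, fallback)
        dist = {t: sum(dist[s] * matrix[s][t] for s in STATES) for t in STATES}
        history.append(dist)
    return history


if __name__ == "__main__":
    docs = [[("title", "other"), ("title", "keyword"),
             ("body", "keyword"), ("body", "other")]]
    matrices = estimate_transitions(docs)
    for step in propagate(["title", "body"], matrices):
        print(step)
```

Each feature value selects its own matrix at every step, so the chain is non-homogeneous: the same previous state can lead to different next-state probabilities depending on the feature of the current token.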
- The Method of Semi-supervised Automatic Keyword Extraction for Web Documents using Transition Probability Distribution Generator