Extract Knowledge from Web Pages in a Specific Domain

Lu, Yihong; Yu, Shuiyuan; Shi, Minyong; Li, Chunfang

doi:10.1007/978-3-319-99365-2_10

Yihong Lu (ORCID: orcid.org/0000-0003-4767-2202), Shuiyuan Yu, Minyong Shi & Chunfang Li

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11061)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Most NLP tasks rely on large, well-organized corpora in the general domain, while little work has been done in specific domains because qualified corpora and evaluation datasets are lacking. However, domain-specific applications are now widely needed. In this paper, we propose a fast, inexpensive, model-assisted method for training a high-quality distributional model from scattered, unstructured web pages, so that the model captures knowledge from a specific domain. The approach requires neither a pre-organized corpus nor much human effort, and hence suits domains that cannot afford the cost of a manually constructed corpus and complex training. We use Word2vec to assist in creating the term set and the evaluation dataset for the embroidery domain. Next, we train a distributional model on filtered search results for the term set and conduct task-specific tuning with two simple but practical evaluation metrics: word-pair similarity and coverage of in-domain terms. Furthermore, our much smaller models outperform a word embedding model trained on a large, general corpus on our task. This work demonstrates the effectiveness of our method, and we hope it can serve as a reference for researchers who extract high-quality knowledge in specific domains.


Author information

Authors and Affiliations

School of Computer, Communication University of China, Beijing, 100024, China
Yihong Lu, Shuiyuan Yu, Minyong Shi & Chunfang Li

Corresponding author

Correspondence to Yihong Lu.

Editor information

Editors and Affiliations

University of Bristol, Bristol, United Kingdom
Weiru Liu
Università di Trento, Povo, Italy
Fausto Giunchiglia
Jilin University, Changchun, China
Bo Yang

Cite this paper

Lu, Y., Yu, S., Shi, M., Li, C. (2018). Extract Knowledge from Web Pages in a Specific Domain. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science, vol 11061. Springer, Cham. https://doi.org/10.1007/978-3-319-99365-2_10

DOI: https://doi.org/10.1007/978-3-319-99365-2_10
Published: 12 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99364-5
Online ISBN: 978-3-319-99365-2
eBook Packages: Computer Science, Computer Science (R0)
