Abstract
Retrieving relevant educational videos through NLP analysis of their transcripts is a specific information retrieval problem that arises in many systems. Because many indexing techniques are available, choosing the components that make up an efficient data analysis pipeline is a critical task. This paper tackles the problem of retrieving the top-N videos that are relevant to a query written in Spanish at Universitat Politècnica de València (UPV). The main elements of the processing pipeline are clustering, LSI modelling and Wikipedia contextualizing, along with basic NLP processing techniques such as bag-of-words, lemmatization, singularization and TF-IDF computation. Experiments on a real-world dataset of 15,386 transcripts show good results, especially compared with the existing search mechanism, which considers only the title and keywords of the transcripts. Although a live deployment of the application may be needed for further relevance evaluation, we conclude that the current progress represents a milestone towards building a system that retrieves appropriate videos for a given query.
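As an illustration only, the core of such a pipeline (bag-of-words, TF-IDF weighting, LSI projection and top-N ranking) might be sketched as follows. This is a minimal sketch and not the authors' implementation: it assumes transcripts are already tokenized and lemmatized, it omits the clustering and Wikipedia contextualizing steps, it uses a plain NumPy truncated SVD for the LSI step, and the toy corpus and all function names (`tfidf_matrix`, `build_lsi`, `top_n`) are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Build a term-by-document TF-IDF matrix from tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF variant (an assumption; other variants are common).
    idf = np.array([math.log(len(docs) / df[tok]) + 1.0 for tok in vocab])
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok, count in Counter(doc).items():
            matrix[index[tok], j] = count * idf[index[tok]]
    return index, idf, matrix


def build_lsi(matrix, k):
    """Keep the k strongest latent dimensions of the TF-IDF matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u_k = u[:, :k]
    # Rows are documents projected onto the k left singular vectors.
    doc_vectors = (s[:k, None] * vt[:k, :]).T
    return u_k, doc_vectors


def top_n(query_tokens, index, idf, u_k, doc_vectors, n=2):
    """Rank documents by cosine similarity to the query in latent space."""
    q = np.zeros(len(index))
    for tok, count in Counter(query_tokens).items():
        if tok in index:  # terms unseen in the corpus are ignored
            q[index[tok]] = count * idf[index[tok]]
    q_latent = q @ u_k  # fold the query into the same latent space
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_latent)
    sims = doc_vectors @ q_latent / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-sims)[:n]
    return [(int(i), float(sims[i])) for i in order]


# Hypothetical toy corpus: two linear-algebra and two biology transcripts.
docs = [
    ["algebra", "matriz", "vector", "matriz"],
    ["matriz", "determinante", "algebra"],
    ["celula", "proteina", "adn"],
    ["adn", "gen", "celula"],
]
index, idf, matrix = tfidf_matrix(docs)
u_k, doc_vectors = build_lsi(matrix, k=2)
results = top_n(["matriz", "algebra"], index, idf, u_k, doc_vectors, n=2)
print(results)
```

In this sketch the query and the documents are both projected onto the same left singular vectors, so cosine similarity compares them in a shared low-dimensional space. The number of latent dimensions `k` is a tuning parameter and would be far larger than 2 on a real transcript collection.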
Acknowledgement
This work was partially supported by the RTI2018-095390-B-C31-AR project of the Spanish government and by the Generalitat Valenciana (PROMETEO/2018/002) project.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Bleoancă, D.I., Heras, S., Palanca, J., Julian, V., Mihăescu, M.C. (2020). LSI Based Mechanism for Educational Videos Retrieval by Transcripts Processing. In: Analide, C., Novais, P., Camacho, D., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2020. IDEAL 2020. Lecture Notes in Computer Science, vol 12489. Springer, Cham. https://doi.org/10.1007/978-3-030-62362-3_9
DOI: https://doi.org/10.1007/978-3-030-62362-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62361-6
Online ISBN: 978-3-030-62362-3
eBook Packages: Computer Science, Computer Science (R0)