Abstract
Publicly available information now spans many languages, which presents both challenges and opportunities for Open Source Intelligence (OSINT) systems. This paper proposes a novel parallel and distributed architecture for multilingual OSINT. The architecture integrates language identification and translation, so that linguistically diverse data is transformed into a unified format for efficient analysis. It is designed to address the challenges of parallel and distributed processing in OSINT systems and aims to offer scalability and performance benefits when handling massive data volumes. Our primary focus has been on devising strategies and tactics that address these concerns, providing a robust solution for the collection, processing, and analysis of data in various languages. This work is a step towards more globally inclusive OSINT systems.
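The flow summarised in the abstract (identify the language, translate into a unified format, process documents in parallel) might be sketched roughly as follows. This is an illustration under stated assumptions, not the architecture from the paper: the pivot language, the stopword-based `identify_language`, the tagging-only `translate`, and the thread pool are all hypothetical stand-ins for real language-identification, machine-translation, and distributed task-queue components.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Assumption: all analysis runs on a single pivot language.
PIVOT_LANGUAGE = "en"

# Toy stopword lists standing in for a real language-identification model.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "tr": {"ve", "bir", "bu", "için"},
    "de": {"der", "und", "ist", "das"},
}


@dataclass
class Document:
    text: str             # original text as collected
    source_language: str  # language code chosen by identify_language
    unified_text: str     # text in the pivot language


def identify_language(text: str) -> str:
    """Hypothetical identifier: pick the language with the most stopword hits."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def translate(text: str, source: str, target: str) -> str:
    """Hypothetical translator: tags the text instead of translating it."""
    if source == target:
        return text
    return f"[{source}->{target}] {text}"


def normalize(text: str) -> Document:
    """Identify the language, then bring the text into the unified format."""
    lang = identify_language(text)
    return Document(text, lang, translate(text, lang, PIVOT_LANGUAGE))


def process_batch(texts, max_workers=4):
    """Normalize documents concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(normalize, texts))
```

Calling `process_batch(["the cat is here", "bu bir kedi"])` returns one `Document` per input, each carrying its detected language and pivot-language text. In a distributed deployment, the thread pool would presumably give way to workers on separate machines fed by a shared queue, but that choice is outside what the abstract specifies.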
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Karamanlioglu, A., Yurtalan, G., Karatas, Y.B. (2024). Parallel and Distributed Architecture for Multilingual Open Source Intelligence Systems. In: Tekinerdoğan, B., Spalazzese, R., Sözer, H., Bonfanti, S., Weyns, D. (eds) Software Architecture. ECSA 2023 Tracks, Workshops, and Doctoral Symposium. ECSA 2023. Lecture Notes in Computer Science, vol 14590. Springer, Cham. https://doi.org/10.1007/978-3-031-66326-0_27
DOI: https://doi.org/10.1007/978-3-031-66326-0_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-66325-3
Online ISBN: 978-3-031-66326-0
eBook Packages: Computer Science, Computer Science (R0)