Abstract
Enterprise research, industry analysis, and strategic planning all depend on accurate and comprehensive data, which makes effective retrieval of product information from enterprise websites crucial. In this context, a "product" is defined as any tangible item, solution, or service. This paper proposes a novel method for extracting product data (such as product name, category, description, and specifications) directly from company websites. Our approach uses Large Language Models (LLMs) to improve the accuracy and automation of web-page information retrieval. LLMs allow more sophisticated extraction and organization of data, overcoming the limitations of conventional methods. There is currently a notable absence of open-source or commercial databases that comprehensively cover enterprise products, which makes comparative studies difficult. Our proposed method aims to fill this gap by providing a tool for gathering product information that can be used to assess competitive differences.
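The kind of structured output described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the prompt wording, and the names `ProductRecord`, `build_prompt`, and `parse_products` are assumptions introduced here, and the call to an actual LLM is deliberately left out. Only the four fields (name, category, description, specifications) come from the abstract.

```python
import json
from dataclasses import dataclass, field


# Field names follow the abstract; the record layout itself is an assumption.
@dataclass
class ProductRecord:
    name: str
    category: str = ""
    description: str = ""
    specifications: dict = field(default_factory=dict)


def build_prompt(page_text: str) -> str:
    """Build an extraction prompt for one web page (hypothetical wording)."""
    return (
        "Extract every product (tangible item, solution, or service) from the "
        "web page text below. Reply with a JSON list of objects with the keys "
        '"name", "category", "description", and "specifications".\n\n'
        f"Page text:\n{page_text}"
    )


def parse_products(llm_reply: str) -> list[ProductRecord]:
    """Parse the model's JSON reply, skipping entries without a name."""
    try:
        items = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []  # malformed reply: return nothing rather than guess
    if not isinstance(items, list):
        return []
    records = []
    for item in items:
        if not isinstance(item, dict) or not item.get("name"):
            continue
        specs = item.get("specifications")
        records.append(
            ProductRecord(
                name=str(item["name"]),
                category=str(item.get("category") or ""),
                description=str(item.get("description") or ""),
                specifications=specs if isinstance(specs, dict) else {},
            )
        )
    return records


# Stand-in for a real model reply, used here only to exercise the parser.
sample_reply = json.dumps([
    {"name": "Edge Gateway X", "category": "Hardware",
     "description": "Industrial IoT gateway.",
     "specifications": {"ports": "4"}},
    {"category": "Service"},  # no name, so it is skipped
])
products = parse_products(sample_reply)
```

Validating the reply before building records keeps malformed or incomplete model output from entering the product database, which matters if the records are later used for comparisons across companies.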
1. Supported by Shenzhen Science and Technology Program (No. GJHZ20220913144201002).
2. Supported by IER Foundation 2022 (IERF202203).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liao, C., Cheng, G., Huang, S., Yao, L. (2025). LLM-Based Automating Product Information Retrieval for Industry Analysis: A Real-World Application. In: Xu, R., Chen, H., Wu, Y., Zhang, LJ. (eds) Cognitive Computing - ICCC 2024. ICCC 2024. Lecture Notes in Computer Science, vol 15426. Springer, Cham. https://doi.org/10.1007/978-3-031-77954-1_8
DOI: https://doi.org/10.1007/978-3-031-77954-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-77953-4
Online ISBN: 978-3-031-77954-1
eBook Packages: Computer Science, Computer Science (R0)