Enhancing web service clustering using Length Feature Weight Method for service description document vector space representation

https://doi.org/10.1016/j.eswa.2020.113682

Highlights

  • Features are extracted from web services, and NLP is applied for preprocessing.

  • The Length Feature Weight method is proposed for the vector representation of preprocessed services.

  • K-Means clustering is applied to the vector representation of web service documents.

  • Better clustering performance is achieved, as measured by standard evaluation criteria.

Abstract

Due to the rapid growth of web services in repositories, discovering the required web service is becoming an increasingly cumbersome task. This has raised the demand for efficient web service clustering algorithms. When related web services are stored in clusters in service repositories, the discovery process improves because both the search space and the search time are reduced. Many researchers in this field have used the Term Frequency – Inverse Document Frequency (TF-IDF) method to represent web services in vector space. In general, the TF-IDF approach has several limitations: (1) it is not efficient for large documents, (2) it ignores the position of a term and its co-occurrences, and (3) it cannot analyze how terms are dispersed across different documents. In the web service scenario, services are described in short text form. TF-IDF does not work well for web service representation because it cannot effectively determine the importance of a term with respect to its occurrence in other documents. If we compare two service documents, ‘s1’ with a large number of terms and ‘s2’ with a small number of terms, TF-IDF does not show the terms in ‘s1’ as less important than those in ‘s2’. Therefore, it is not possible to assign effective weights to the terms. Without an effective vector space representation, the performance of the clustering algorithm also degrades. In this paper, we propose a new approach, LFW+K, which uses Length Feature Weight (LFW) for the vectorized representation of services, followed by K-Means clustering. The proposed approach helps to find the informative terms in a web service and assigns term weights accordingly by considering parameters such as the dimension of the web service document, the maximum frequency of a term in the document, and the occurrences of a term in other documents.
LFW+K is applied to datasets of real-world web services, and performance is measured using standard criteria (precision, recall, F1-score, and accuracy). The results are compared with K-Means clustering on the TF-IDF representation, i.e. TF-IDF+K. They show that the proposed method outperforms clustering based on the TF-IDF vector space representation of web services.

Introduction

In today’s scenario, various business applications are carried out with the help of web services because of benefits such as data integration, information exchange, code reuse, versatility, and cost savings. Web services are growing rapidly as vendors such as IBM, Amazon, and Microsoft rely on web service standards and provide tools and software to customers according to their requirements. Web services are software components that are based on standards for communication, data transfer, and service description. In general, two types of web services exist: (1) SOAP-based services and (2) REST-based services. The power of SOAP-based services lies in XML and three core technologies: UDDI (Universal Description, Discovery and Integration), WSDL (Web Services Description Language), and SOAP (Simple Object Access Protocol). First, the vendor creates a service and publishes its description file in WSDL format in UDDI, which is a repository for storing web services. The WSDL document includes the functionality of the service, its name, bindings, message types, etc., which a customer needs in order to learn about that service. The customer looks up a web service in UDDI according to its requirements and communicates with the vendor through SOAP messages to use the service’s functionality, as illustrated in Fig. 1 (Bhardwaj & Sharma, 2015). REST-based services (Web APIs) use HTTP for message transmission and URIs for identifying resources. The functionality of these services is described using XML or simple natural language text.

In recent years, text mining and web mining techniques have gained a lot of attention from researchers due to the proliferation of data and the need to manage it. Purely syntactic analysis of web services is not an efficient way to match them against a customer’s needs. Semantic web services have therefore evolved to describe the functionality and capability of a web service more precisely. Researchers have proposed various models for semantically specifying web services, such as the Web Service Modeling Language (WSML) (De Bruijn, Lausen, Polleres, & Fensel, 2006), the Web Service Modeling Ontology (WSMO) (Fensel et al., 2006), and the Web Ontology Language for Services (OWL-S) (Martin et al., 2004). In practice, however, many web services do not carry explicit semantic information in the form of ontological annotations. Manually annotating these web services requires appropriate knowledge of the domain to which they belong. As the number of web services and ontologies grows rapidly every day, and each ontology contains thousands of concepts and relationships, manually finding the appropriate domain and annotating this mass of accessible web services is a tedious and burdensome task (Nisa and Qamar, 2015, Aznag et al., 2013). Given these limitations, many researchers prefer to apply text mining techniques to WSDL documents and Web APIs, and various methods have been proposed for extracting semantic meaning from web services.

With the rapid proliferation of web services, there is a need for an intelligent system that can retrieve relevant results for customers’ web service queries. When services are stored in clusters in repositories, they can be discovered efficiently. For web service clustering, the web service documents are first preprocessed and then represented in vector form so that clusters can be created according to their similarity. Many researchers prefer the TF-IDF (Term Frequency – Inverse Document Frequency) method for this representation (Elshater et al., 2015, Sharma et al., 2014). In WSDL files and Web APIs, services are described in short text form. Because such descriptions contain few frequent terms and TF-IDF cannot determine how terms are distributed across all services, TF-IDF does not work well for web service representation. For web service discovery and clustering, researchers have tried to enhance the TF-IDF method used to represent web services in vector form. An enhanced method for the vector representation of text documents has been proposed in which terms can be discriminated more easily, which improves the performance of text document clustering (Abualigah, 2019).
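
To make the short-text issue concrete, the following is a minimal sketch of textbook TF-IDF weighting in plain Python. The three service descriptions and the function name `tf_idf` are illustrative assumptions, not data or code from the paper, and the variant used here (relative term frequency multiplied by log(N/df)) is only one of several in common use. Because every term occurs exactly once in each short description, the term-frequency factor is the same for all terms, so the weights differ only through the document-frequency factor.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook variant: (count / document length) * log(N / df).
    Libraries often add smoothing or normalization on top of this.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical, already preprocessed service descriptions (not from the paper).
services = [
    ["get", "weather", "forecast", "city"],
    ["get", "currency", "exchange", "rate"],
    ["send", "sms", "message", "phone"],
]

for vector in tf_idf(services):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In this toy input, "weather", "forecast", and "city" all receive the same weight, and "get" is weighted lower only because it appears in two of the three descriptions.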

The main aim of this paper is to overcome the limitations of the TF-IDF method in web service clustering so that the results of clustering algorithms can be improved. We propose the LFW+K method, in which the Length Feature Weight (LFW) method is used to determine the most informative terms of a service, followed by K-Means clustering. The method assigns a weight to each term according to its importance and its distribution across the services. The contribution of the paper is to improve web service clustering performance by enhancing the representation of services in vector space. As a result, web services can be discovered efficiently in large repositories, and the approach can serve as an intelligent system capable of handling service queries.

This paper is organized as follows. Section 2 reviews related work on web service discovery and clustering using various similarity approaches. Section 3 presents the proposed methodology, including the definition of the web service document, feature extraction, preprocessing steps, and the clustering method. Section 4 presents a comparative analysis of the TF-IDF and LFW methods with K-Means clustering. Limitations of the proposed methodology are discussed in Section 5. Section 6 concludes the paper and outlines future work.

Related work

There are three main approaches for discovering web services in large repositories (Bhardwaj and Sharma, 2015, Bukhari and Liu, 2018).

  • i. Discovery in a directory such as UDDI.

  • ii. Discovery in web portals.

  • iii. Generic search engines.

Discovery in UDDI is a traditional approach that is mainly based on keyword searching. It is not efficient, as it provides only syntactic information, and many UDDI registries are no longer available. In the second approach, different web portals are

Proposed methodology

In this section, we describe the proposed methodology i.e. LFW+K which includes the following steps:

  • The extraction of needed features from WSDL/Web API documents of web services.

  • Necessary preprocessing steps to remove irrelevant features from extracted features.

  • LFW method for representing preprocessed features of web service documents in the vector space.

  • K-Means clustering to group similar web services.

The detailed methodology is shown in Fig. 3.
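
The steps listed above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the LFW weighting formula is not given in this excerpt, so raw term counts are used as a stand-in for the vectorization step, and the stopword list, the helper names (`preprocess`, `vectorize`, `k_means`), and the four sample descriptions are all hypothetical.

```python
import random
import re

# Illustrative stopword list only; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "and", "by", "in", "is"}

def preprocess(text):
    """Split identifiers such as 'getWeatherForecast', lowercase, drop stopwords."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)  # camelCase -> camel Case
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

def vectorize(docs):
    """Raw term counts over a shared vocabulary (stand-in for LFW weights)."""
    vocab = sorted({token for doc in docs for token in doc})
    return [[float(doc.count(term)) for term in vocab] for doc in docs], vocab

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical service descriptions (not from the paper's datasets).
descriptions = [
    "getWeatherForecast for a city",
    "Weather forecast by city name",
    "convertCurrency using the exchange rate",
    "Currency exchange rate lookup",
]

docs = [preprocess(d) for d in descriptions]
vectors, vocab = vectorize(docs)
labels = k_means(vectors, k=2)
for description, label in zip(descriptions, labels):
    print(label, description)
```

On this toy input the two weather services end up in one cluster and the two currency services in the other; which cluster receives label 0 depends on the random initialization.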

Experiment setup and results

We performed this experiment in a Windows 10 environment on a machine with an i7 processor and 8 GB RAM. Python 3.7 is used to implement the proposed methodology, LFW+K, discussed in Section 3. To represent web services or text documents in vector space, the TF-IDF and LFW methods are used, and their effectiveness is tested with the K-Means clustering algorithm. For the execution of the K-Means clustering algorithm, we have taken 10000 iterations and the algorithm is

Limitations

The results show that the proposed methodology improves clustering performance. By considering the dimension of the service document, the frequency of terms in other documents, and the highest frequency of a term in the document, we are able to represent services efficiently in vector space. The main limitation of the proposed work is that it cannot find semantic relations among words; for example, it cannot identify synonyms or antonyms.
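
The semantic limitation can be illustrated with a small sketch. In any term-indexed vector space, two descriptions that share no literal terms have a cosine similarity of zero, whatever weights are assigned to the terms. The example below uses raw counts, and the three token lists and the `cosine` helper are hypothetical, chosen only to show the effect.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical preprocessed descriptions (not from the paper).
car_rental = ["car", "rental", "booking"]
auto_hire = ["automobile", "hire", "reservation"]  # same meaning, different words
car_price = ["car", "rental", "price"]             # shares two literal terms

print(cosine(car_rental, auto_hire))  # no shared terms, so the similarity is 0.0
print(cosine(car_rental, car_price))  # two of three terms shared
```

Although the first two descriptions refer to the same kind of service, only the literal overlap in the third one registers as similarity.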

Conclusion and future work

In this paper, a better technique for the vector representation of web service description documents is applied, which overcomes the shortcomings of the basic TF-IDF method. After vector space representation, the performance of the K-Means clustering approach is analyzed. Web service clustering is a challenging task today due to the rapid increase in the number of web services provided by different vendors. Results show that our proposed method, LFW+K, has enhanced the performance

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (29)

  • K. Elgazzar et al. Clustering WSDL documents to bootstrap the discovery of web services

  • Y. Elshater et al. godiscovery: Web service discovery made efficient

  • D. Fensel et al. Enabling semantic web services: The web service modeling ontology (2006)

  • H. Gao et al. Hierarchical clustering based web service discovery
