
1 Introduction

The manufacturing paradigm is rapidly shifting from whole-product to component-integrated production [1, 2]. Large organizations usually possess the capability to manufacture all product components in house, whereas Small and Medium Enterprises (SMEs) must rely heavily on processes such as Request for Information (RFI), Request for Proposal (RFP), costing and tendering. Corporations inherently compete to enhance their respective business values, and this competitive environment forces SMEs to be more agile and dynamic in pursuing new business opportunities. As a result, SMEs face complex situations in which it is difficult to make optimized decisions about costing, trending, reverse costing, marketing recommendations and related activities.

In the electronic age, unstructured data is growing faster than linked data [3]. This phenomenon is significantly more intense in SMEs because of the mass of data generated through fast communication channels [4, 5]. This large amount of data contains explicit and implicit patterns that can help SMEs make essential decisions about their business strategies. Machine learning techniques have enabled researchers to derive real-time marketing and outsourcing trends; such solutions extract useful patterns while simplifying prediction for specific scenarios. It is well established that machine learning techniques perform better when provided with large data sets [6]. However, their inefficiency on such data is three dimensional, the dimensions being volume, velocity and variety (the 3Vs).

Big Data and its analytics have emerged as an important domain of interest for both researchers and industrial practitioners, reflecting the magnitude and impact of the data-related issues to be resolved in contemporary business enterprises. Big Data deals with the three non-trivial challenges mentioned earlier, on which machine learning techniques are usually exhausted. The second challenge, variety, i.e. dealing with heterogeneous data, is abundant in business organizations. The objective of this study is to exploit the diversified features of data to answer questions related to enhancing the business capability of an enterprise.

One interesting challenge in dealing with heterogeneous and large amounts of data is the scarcity of a specific class label in business intelligence applications. There are situations in which the class of interest is in acute shortage. For example, in customer response modeling the number of respondents is usually much smaller than the number of non-respondents [7]. On the other hand, most machine learning algorithms behave well only on roughly balanced classes [8]. This requires a rebalancing technique, such as undersampling or oversampling, to deal with highly imbalanced classes; several such techniques have been outlined in the literature [9]. Business intelligence algorithms are usually designed to cater for linked data while, in most cases, ignoring free text. In fact, the volume of unstructured data is growing far more rapidly than that of structured data [10]. According to one report, the total data generated at enterprises will surpass 1600 exabytes, of which fewer than 100 exabytes will be structured [10]. The overwhelming amount of unstructured data is one of the motivations for focusing on techniques that are better suited to handling it. The distributed nature of the task ensures that Big Data is handled along both the volume and the variety dimension, whereas the conventional, non-distributed strategy of processing data is arguably ineffective.

The remainder of the paper is structured in five further sections. Section 2 briefly discusses the relevant work. Section 3 introduces the problem statement and its branches. Section 4 presents the proposed methodology and discusses each component of the proposed framework. Section 5 presents a case study along with the required Big Data steps. The last section concludes the research with some recommendations.

2 Related Work

In this study, we examine the new challenges posed by untamed Big Data. These challenges include:

  1. Efficient retrieval of specific data out of a large corpus.

  2. Identification of trends and patterns.

  3. Identification of implicit relationships between various parts of evolving data.

  4. Investigation of any opportunistic value culminating in added value for a network of enterprises.

  5. Determination of the useful aspects that only Big Data analysis can provide.

  6. Scalable indexing of the data.

Research has shown that large enterprises have already recognized the hidden but valuable asset represented by their ever-increasing data, as illustrated by the problems, challenges and issues above. Notable examples include the web giants Google, Yahoo and Facebook, which already exploit their data to provide dynamic and pertinent recommendations to their clients and web users. The inevitable research question is whether Small and Medium Enterprises (SMEs) can mine their data in the same fashion. This study illustrates the steps involved in turning "data into assets", with a scope limited to SMEs and to converting structured and semi-structured data into added value. A response to the fifth challenge above lies in the analogy of a serialized flow of functionality: the output of a product recommendation system changes markedly as the size of the input space grows towards Big Data. In particular, given transactions numbering in the millions or billions, useful and persuasive patterns can be ascertained with high precision.

The capacity to measure the value of data is firmly tied to the notion of delay, also referred to as 'latency' or 'throughput'. When processing a large input data space, latency is an unavoidable challenge and remains an open question for the research community. Fortunately, Big Data applications, infrastructure and tools now allow solution architects to address this aspect to a great extent, chiefly by balancing the use of cheaper commodity hardware against the volume of data to be processed.

Chelmis [11] investigated the analytical treatment of Big Data technologies. Their work targeted collaboration and highlighted several interesting research questions concerning users' communication behaviour patterns, statistical characteristics, dynamic properties and the complex correlations between social assemblies and topical structures. In contrast to the present research, the study by Chelmis [11] is limited to the technical and functional internal processes of a single SME; it does not cover the impact of Big Data on product improvement between two or more SMEs. In the commercial world, data repositories play a non-trivial role in the business of an SME, because data by its inherent nature acts as an intangible knowledge asset for any corporation [12]. Treating 'data as an asset' raises an array of issues: the definition of the data (especially for structured or semi-structured data); the formulation of information assets in an enterprise; the discovery of the unique patterns and characteristics of the input space; the definition of the key concepts of the input data; the quality of implicit as well as explicit knowledge and information management; and, most importantly, the business impact of holding low-quality data and information assets. In inter-enterprise collaboration, negotiation plays a critical role [1]. Jardim et al. [1] introduced Negotiations for Sustainable Enterprise Interoperability with Ontologies (NEGOSEIO), a framework that uses MENTOR [13]. MENTOR introduced the mediator ontology concept for the first time, and it was widely adopted by the research community. Jardim et al. [1] also identified the factors that are essential to incorporate, namely information, function and behavioural factors, and discussed how the negotiation aspects leverage ontological modelling.

3 Problem Statement

Enterprises that maintain disparate, complex but essential relationships with various partners can no longer restrict themselves to a single private information model for any particular venture. This constraint ultimately pushes them to interact with different sets of partners in an inter-enterprise environment. An enterprise needs to assess the value of the data it holds. This value has numerous characteristics, such as its degree, which is a function of the related stakeholders, i.e. the potential users of the information within the enterprise. The decline in the value of information is marked by the nature of the underlying information and by how long its accuracy and precision persist over time. Every asset eventually turns into a liability during its life cycle, namely when it is no longer useful to anyone at all. Keeping this problem in view, and to address the issues elaborated in the previous section, we identify the following open question.

Can a high degree of data-centric and knowledge integration in enterprise collaboration be promoted and linked with the perceived benefits of added value towards an increase in the operational efficiency of the enterprise?

The following hypotheses can be formulated based on this research question.

  • If the business intelligence service detects that the data in the document management system exhibits oscillating patterns, then enterprise collaboration provides an appropriate method to create data assets and to capture asset traceability in ontological modelling.

  • If enterprise collaboration yields an effective way to realize added value, then digital preservation based on ontologies serves the purpose of reusability.

4 Proposed Approach

This study adopts a theoretical framework to realize a platform. We first introduce an axiom on the status of SMEs in the context of this research: we propose that small and medium enterprises be considered as repositories of data and knowledge, which can be used to emphasize the role of intangible resources. In pursuit of these aims, a goal-oriented intelligent mechanism is proposed whose target is the improvement of the mobility mechanisms in the acquisition phase. Establishing a globalized standard while pursuing the objective of inter-enterprise collaboration is a non-trivial aspect [14].

From the previous discussion it is evident that a problem involving heterogeneous data cannot be handled by applying a single algorithm or a single business intelligence model. The proposed solution is a layered logical architecture, as shown in Fig. 1. The nature of the problem motivates us to seek the solution in the correct alignment of algorithms, appropriate filtering and cleansing, and the development of a scalable framework that can handle a large amount and variety of data. The topmost layer receives the raw data. Technically, the data at this layer is a meta-base that contains free text as well as meta-attributes. Our system expects data in three forms: unstructured data from video, images, electronic sheets and free text; data related to query refinement and quality procedures (the central component, which also contains unstructured data governed by quality procedures); and, in the leftmost component, typical structured data, namely linked data of the Supply Chain Management (SCM) system gleaned from Supplier Relationship Management (SRM), Enterprise Resource Planning (ERP), Customer Relationship Management (CRM) and Product Lifecycle Management (PLM) systems. A sketch of how these incoming streams might be represented is given below Fig. 1.

Fig. 1 Framework of the proposed platform
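To make the shape of the first layer concrete, the following minimal sketch models the three incoming streams as a single tagged record type. The class, enum and field names are hypothetical and only illustrate how the raw data might be enveloped before it reaches the lower layers.

```java
// Hypothetical envelope for records arriving at the top layer of the framework.
// Class and field names are illustrative, not part of the platform.
public class RawRecord {

    // The three input streams described for the first layer.
    public enum SourceKind {
        UNSTRUCTURED,        // free text, images, video, electronic sheets
        QUALITY_PROCEDURE,   // data governed by query refinement and quality procedures
        LINKED_SCM           // structured data from SRM, ERP, CRM and PLM systems
    }

    private final SourceKind kind;
    private final String sourceSystem;   // e.g. "ERP", "CRM", "social-media"
    private final byte[] payload;        // raw bytes; parsed in later layers
    private final long receivedAtMillis;

    public RawRecord(SourceKind kind, String sourceSystem, byte[] payload) {
        this.kind = kind;
        this.sourceSystem = sourceSystem;
        this.payload = payload;
        this.receivedAtMillis = System.currentTimeMillis();
    }

    public SourceKind kind() { return kind; }
    public String sourceSystem() { return sourceSystem; }
    public byte[] payload() { return payload; }
    public long receivedAtMillis() { return receivedAtMillis; }
}
```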

Machine learning algorithms cannot produce useful patterns over free text and other unstructured data unless these are aggregated to a higher degree of granularity. Natural Language Processing plays a pivotal role here. To summarize free text into meaningful information, the data needs to pass through four levels: the lexical, morphological, syntactical and semantic levels. For the last two levels a variety of similarity measures have been proposed, together with a generic classification of them [15]; the classes comprise measures based on path length, information content, features, and hybrids thereof. In this proposal we suggest incorporating measures belonging to the first two classes. The technique employs the robust accurate statistical parsing graph [16]. Moreover, clustering techniques for free text have proved their significance in producing useful patterns [17]. However, we argue that the result of clustering is not directly usable for inference when the input space and the cluster realm are large; we therefore position clustering in the proposed framework as a producer of intermediate refined data. The second stream of unstructured data consists of images, usually posted by the social media community about the products of the SMEs. We argue that the popularity of a brand can be extrapolated from the postings, free text and images on social media sites. A single image speaks a hundred words and a single video is worth a hundred images; however, perceiving their semantics has always been a challenging task [18]. The research community has taken up this challenge to the point that primitive information, such as basic human features like age group, gender and race, can now be predicted with relatively high recall [18]. A recent solution proposed by Han et al. [18] is useful in this context, as their technique can infer multiple demographic features. Once this primitive data is obtained, typical data preprocessing tasks must be performed: filtering, cleansing, transformation, feature selection and, most importantly, handling the imbalanced classes.
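As an illustration of the first class of measures, the sketch below computes a simple path-length based similarity over a tiny hand-built is-a taxonomy. The toy taxonomy, class and method names are our own assumptions; a real deployment would operate over a full lexical resource rather than this hard-coded hierarchy.

```java
import java.util.*;

// Minimal sketch of a path-length based similarity measure over a small,
// hand-built is-a taxonomy (illustrative only).
public class PathSimilarity {

    // child -> parent edges of the taxonomy
    private final Map<String, String> parent = new HashMap<>();

    public void addIsA(String child, String par) { parent.put(child, par); }

    // Chain of nodes from the given node up to the root.
    private List<String> ancestors(String node) {
        List<String> chain = new ArrayList<>();
        for (String n = node; n != null; n = parent.get(n)) {
            chain.add(n);
        }
        return chain;
    }

    // Similarity = 1 / (1 + edges on the path through the lowest common
    // ancestor); 0 if the terms are unrelated in the taxonomy.
    public double similarity(String a, String b) {
        List<String> chainA = ancestors(a);
        List<String> chainB = ancestors(b);
        for (int i = 0; i < chainA.size(); i++) {
            int j = chainB.indexOf(chainA.get(i));
            if (j >= 0) {
                return 1.0 / (1 + i + j);
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        PathSimilarity sim = new PathSimilarity();
        sim.addIsA("bolt", "fastener");
        sim.addIsA("screw", "fastener");
        sim.addIsA("fastener", "component");
        sim.addIsA("bearing", "component");
        System.out.println(sim.similarity("bolt", "screw"));   // 0.333...
        System.out.println(sim.similarity("bolt", "bearing")); // 0.25
    }
}
```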

To handle the imbalance problem, SMOTE [9] provides a reasonable solution. It employs k-nearest neighbours (KNN) in its heuristics. The time complexity of KNN is O(m² log m) and its space complexity is O(m²), which becomes problematic for large volumes of data. A possible solution is its integration into the Hadoop ecosystem: our initial analysis of reducing KNN to the MapReduce paradigm indicates that, like the simple Bayesian classifier, this problem is also effectively reducible to a Big Data architecture.
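The following single-machine sketch illustrates the SMOTE idea of interpolating synthetic minority samples between an instance and one of its k nearest minority neighbours. It is not the reference implementation of [9]; the class name, parameters and fixed random seed are illustrative assumptions, and the neighbour search is the part that would be distributed under MapReduce.

```java
import java.util.*;

// Sketch of SMOTE-style oversampling: synthesize minority-class samples by
// interpolating between a minority instance and one of its k nearest
// minority neighbours (illustrative only).
public class SmoteSketch {

    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Indices of the k nearest neighbours of minority[idx] among minority.
    private static int[] nearestNeighbours(double[][] minority, int idx, int k) {
        Integer[] order = new Integer[minority.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(
                i -> distance(minority[idx], minority[i])));
        int[] nn = new int[k];
        for (int i = 0; i < k; i++) nn[i] = order[i + 1]; // skip the point itself
        return nn;
    }

    // Generate 'perInstance' synthetic samples for every minority instance.
    public static List<double[]> oversample(double[][] minority, int k, int perInstance) {
        Random rnd = new Random(42);
        List<double[]> synthetic = new ArrayList<>();
        for (int i = 0; i < minority.length; i++) {
            int[] nn = nearestNeighbours(minority, i, k);
            for (int s = 0; s < perInstance; s++) {
                double[] neighbour = minority[nn[rnd.nextInt(k)]];
                double gap = rnd.nextDouble();
                double[] sample = new double[minority[i].length];
                for (int f = 0; f < sample.length; f++) {
                    sample[f] = minority[i][f] + gap * (neighbour[f] - minority[i][f]);
                }
                synthetic.add(sample);
            }
        }
        return synthetic;
    }
}
```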

The output of the second layer is segmented data passed to the Big Data platform. Here the data is provided to the acquisition phase of the Big Data technologies, which may involve an Oracle NoSQL database system and/or an Oracle Online Transaction Processing system. Since much of the incoming data is unstructured, a NoSQL database is the more suitable choice: this type of DBMS handles data line by line, where every line is a complete 'value' to which a corresponding 'key' is allotted, giving freedom from a strict schema. The output of the acquisition is then subjected to the second component of the Big Data technology layer, the organization of the data, where organization means adjusting the data's suitability for the analysis phase. The analysis phase is the final step in this research study and identifies any potential opportunity. One can observe that the granularity level of the data increases at this stage. Logically the information falls into the first three classes; physically, however, it is expected to be represented by dozens of relational entities. The motivation to generate this data is twofold. First, classification algorithms, by virtue of their supervised-learning nature, demand a clear position on the underlying schema. Second, the data is to be analyzed in NoSQL, which is capable of handling Big Data. The layer responsible for data modeling therefore offers numerous modeling architectures, a rationale rooted in the versatile nature of the data. The final layer is composed of patterns, rules, inferences and predictions on trending, costing and recommendations for marketing purposes.
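A minimal sketch of this line-per-record key-value staging is shown below. The key scheme (source identifier plus sequence number) and the in-memory map standing in for the NoSQL store are assumptions made purely for illustration.

```java
import java.util.*;

// Sketch of the line-oriented key-value representation described above:
// each raw input line becomes a complete 'value' stored under a generated
// 'key'. The key scheme is an assumption, not a property of any particular
// NoSQL product, and the in-memory map only stands in for the real store.
public class KeyValueStaging {

    private final Map<String, String> store = new LinkedHashMap<>();
    private long sequence = 0;

    // Assign a key to every incoming line from a given source stream.
    public void ingest(String sourceId, List<String> rawLines) {
        for (String line : rawLines) {
            String key = sourceId + "/" + (sequence++);
            store.put(key, line);   // schema-free: the value is the whole line
        }
    }

    public Map<String, String> snapshot() {
        return Collections.unmodifiableMap(store);
    }

    public static void main(String[] args) {
        KeyValueStaging staging = new KeyValueStaging();
        staging.ingest("crm-feedback",
                Arrays.asList("great bearing, shipped late", "quote too high"));
        staging.snapshot().forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```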

5 Case Study

To exemplify how the proposed platform will work, we give an example related to the functionality of its core part. Consider a manufacturing company that produces finished products by assembly and needs a huge variety of components for any finished product. The possible collections or assemblies of components produce a situation akin to snowflakes, where each snowflake is unique, with roughly a 1 in 10^18 chance of two being identical; likewise, each assembly is unique in its functional or component specifications. All of this information is gleaned into linked data repositories whose collections consist of hundreds of relational tables. Some interesting queries can only be answered by utilizing information from all of the underlying linked entities. One such query is to produce the "best prediction of the price a supplier (of a single component or of multiple components) is likely to quote, given the final assembly to be built". Our methodology suggests that all of the information must be aggregated into a single entity. This poses a size problem in both the vertical and the horizontal dimension, but the combination of Big Data and classification or regression will produce the anticipated result. To predict the price, we need some prior information extracted from the linked database; in this case the following types of tables are required.

  1. Technical specifications

  2. List of parts

  3. Cost/quantity of individual parts

  4. Taxation

  5. Operational cost

  6. Quote price

The quote price needs to be discretized or binned if a minimum or maximum price limit is required. Empty cells are fixed (numeric values set to 0 and strings to NA), and the column names are separated out into a schema map. After this, all of the data is uploaded to HDFS. We then integrate the tables using a replicated left outer join with the quote-price table on the left-hand side, which outputs a single file; the idea is that a single file conveys all of the information. The class to be learned is clearly identified as the estimated quote price. A random forest is applied over this input data given the class, and the model's performance is traced by means of the ROC curve, the confusion matrix or by varying the number of trees in the forest. If the performance is not satisfactory, the less important parameters are removed. These steps are repeated until an optimized model is achieved. The model can then be saved and reloaded whenever required in Mahout using MapReduce. A sketch of the join step is given below.
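The sketch below shows one way the replicated (map-side) left outer join could be written as a Hadoop mapper, assuming the quote-price table is the large left input and the remaining tables have been pre-joined into a small side relation that fits in memory. The field layout, delimiter and configuration key are hypothetical; the single consolidated output of this map-only job is what the random forest is subsequently trained on.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a replicated (map-side) left outer join: the quote-price records
// are the large left input split across mappers; the small side relation
// (parts, costs, taxation, ...) is loaded into memory in setup().
public class ReplicatedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> sideTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // Small relation read once per mapper from HDFS (path is illustrative).
        Path sidePath = new Path(conf.get("join.side.table.path"));
        FileSystem fs = sidePath.getFileSystem(conf);
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(sidePath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // assumed layout: assemblyId \t partAttributes...
                int tab = line.indexOf('\t');
                if (tab < 0) continue;
                sideTable.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // assumed layout of the left (quote price) record: assemblyId \t quoteFields...
        String[] fields = value.toString().split("\t", 2);
        if (fields.length < 2) return;
        String assemblyId = fields[0];
        // Left outer join: keep the quote record even when no side record matches.
        String rightSide = sideTable.getOrDefault(assemblyId, "NA");
        context.write(new Text(assemblyId), new Text(fields[1] + "\t" + rightSide));
    }
}
```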

Figure 2 shows the typical steps required to launch the Big Data pipeline in this case study. The first step is the setup of an in-house Hadoop cluster within the Hadoop ecosystem; in this case study we use an Oracle Virtual Machine (OVM) for the experimental part. The linked data is aggregated via an ETL job. Three popular services are available in the Hadoop ecosystem for the creation of ETL jobs: Hive, Pig and MapReduce.

Fig. 2 Process flow of the steps in the trending, costing and recommendations case study

We restrict ourselves to MapReduce, as it gives the most flexibility by providing a complete programming environment in the high-performance language Java; a minimal driver for this consolidation job is sketched below. The consolidated data is a prerequisite for the application of data analytic tools such as regression analysis, association rule mining and classification. These tools ensure its diffusion and transformation into knowledge repositories. At this point semantic engineering becomes imperative and plays an inevitable role: we provide operators that work on sets of semantic rules. Intuitively, the knowledge perceived in the previous step takes different shapes; some of it is implicative and some non-implicative (association rules versus frequent itemsets), while some is descriptive and some inferential-statistical, obtained via clustering and regression analysis respectively. The operators ensure its diffusion into "Enterprise Capacity" and "Enterprise Integration", leading to the "Collaborative" perspective. The whole operation in the figure is orchestrated by means of an Oozie workflow in the Hadoop ecosystem.
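A minimal driver for such a consolidation job might look as follows, reusing the hypothetical ReplicatedJoinMapper from the previous sketch. The HDFS paths and the job name are illustrative, and in practice the job would be scheduled from the Oozie workflow mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver for the consolidation step (illustrative paths and job name).
public class ConsolidationJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Location of the small side relation read by every mapper.
        conf.set("join.side.table.path", "/sme/etl/parts_costs.tsv");

        Job job = Job.getInstance(conf, "sme-quote-consolidation");
        job.setJarByClass(ConsolidationJob.class);
        job.setMapperClass(ReplicatedJoinMapper.class);
        job.setNumReduceTasks(0);            // map-only join: one consolidated output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/sme/etl/quote_prices"));
        FileOutputFormat.setOutputPath(job, new Path("/sme/etl/consolidated"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```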

6 Conclusions

With the advent of modern computing tools, the service components within enterprises are generating large amounts of data. The proposed platform provides different results according to the role of the end user. If the end user is a supplier in the industrial sector, the platform opens up new chances to reach new markets, and the supplier receives timely, situation-oriented quality feedback for the ranking of its offer. If the end user is a buyer, the system produces benchmark purchase prices; moreover, its recommendation component gives a ranked list of the best suppliers, where the ranking is always qualified by the specifications provided by the potential buyer, such as the dimensions of the product, technical skills and capacity, quality framework and location. The proposed system offers the following array of services:

  1. An improved and robust mechanism for 'Request for Information' and 'Request for Quotation'.

  2. Consistency analysis of the vulnerable costing of long-term as well as everyday products.

  3. Technical and commercial endorsement through the platform's recommendation system.

  4. Reduction in the cost of importing for local small and medium importing companies.

  5. Assistance to SMEs in optimizing investment, designing novel products to cost and redesigning existing products.

The proposed framework illustrates that data analytic technologies are useful for developing assets based on opportunity analysis by exercising Big Data technologies over the ever-accumulating unstructured data.