
1 Introduction

The salesman plays a critical role in a company, since every product needs to find its target customers to achieve sales performance. Therefore, salesmen and marketing staff pay great attention to discovering sales leads. Using a search engine to find leads is a common channel; for instance, a salesman may search for a keyword about a specific product, company, industry or location to find potential customers. However, this is a non-trivial task for the salesman in a varied market environment. First, extracting information from massive web data costs a salesman a large amount of time and energy, which greatly hinders his progress. Second, it is relatively inefficient and insufficient to mine useful and available leads from such data. Third, many leads may be unconsciously missed by the salesman due to the limitation of human attention.

Traditional business intelligence (BI) [3] focuses on offering decision support for enterprises via internal data created in the production procedure. For example, by analyzing the sales data of an enterprise, the user can find patterns and predict future sales trends to improve product sales strategies. However, few works mention how to automatically discover sales leads from the external data environment (web data) using intelligent technologies.

To address these challenges, we present a software robot named LeadsRobot, which can discover and mine leads from web big data according to the requirements of the salesman. It enables real-time leads generation via big data analytics and natural language processing (NLP) [6], and it provides customized leads recommendation based on the salesman's personas and his feedback on historical leads. Specifically, we obtain the keywords of leads by applying semantic understanding technologies to the salesman's natural language input, and then employ these keywords to automatically crawl raw leads from public web data, leveraging TextRank and Word2Vec. Subsequently, the crawled data is automatically analyzed to mine potential leads by Named Entity Recognition (NER) for Chinese company names.

The main contributions of this paper can be summarized as follows:

  • We propose LeadsRobot, a framework for a leads-discovering robot. It achieves automatic mining of sales leads from public web data.

  • Multiple NLP technologies, including TextRank, Word2Vec and NER, are combined to enable raw web data processing and leads data extraction.

  • The robot can make intelligent recommendations according to user personas and the salesman's feedback on leads data.

The remainder of this paper is organized as follows: Sect. 2 introduces related work on sales intelligence and robots. In Sect. 3, we present the framework of the leads robot and its details. Section 4 gives an application of LeadsRobot. In Sect. 5, the performance and evaluation of LeadsRobot are described. Finally, the conclusion is summarized in Sect. 6.

2 Related Work

In recent years, robots have attracted intensive attention in academia and industry with the development of artificial intelligence. The software robot is an emerging direction for service-based applications. In the work of [20], the authors presented human pose tracking for service robot applications using an SDF representation model. The work in [19] proposed an approach to modeling the social common sense of an interactive robot that serves customers in a queue. The work of [12] proposed an indoor navigation service robot system based on vibration tactile feedback. All these service robots are physical robots.

Another type of robot is the software robot, which seeks to provide a virtual service for customers by being embedded into software, such as the chat robot [17], business agent robot [8], shopping robot [5], crawling robot [18], etc. All these works on software robots are very attractive; however, few address sales intelligence, which focuses on using big data analytics and artificial intelligence technologies to provide services that improve the salesman's performance.

Different from traditional BI, sales intelligence advocates using multiple analytics to find prospects from external data sources. Existing BI seeks to provide decision support for company managers based on the existing data of companies. However, little attention is paid to the salesman.

Many sales intelligence products have been developed to bridge the gap. RainKing [14] offers IT sales intelligence solutions that transform data into actionable, forward-looking intelligence and help the salesman speedily discover prospects. It can generate sales leads from a substantial amount of web data. Besides, it can identify the needs of potential customers, such as pain points, new funding, management changes, project scoops, contract scoops, etc.

In real scenarios, many sales intelligence products are similar to RainKing. For instance, LinkedIn Sales Navigator [10] helps users discover prospects and understand the intention of buyers. InsideView [7] collects social and business sales data from more sources than its competitors and delivers it to the sales rep's computer and smartphone, online or inside the CRM. Other sales intelligence products, such as DiscoverOrg [4], SalesLoft [16], ZoomInfo [21], etc., provide promising functions to address the pain points of the salesman.

Even though various sales intelligence products have been developed in the market, there is currently no effective technical framework or down-to-earth application for enabling sales intelligence in the sales leads discovery domain. To address this issue, we use big data analytics and NLP to create a leads robot that gains insight about sales leads from public web data.

3 The Framework of LeadsRobot

To address the issue of automatic leads generation, in this section we present a framework for designing a leads robot. The robot can understand the requirements of the salesman and then automatically crawl public web data to analyze and extract sales leads to recommend to the salesman. The salesman can give feedback to the robot, which improves the accuracy of the leads recommendation.

The framework is demonstrated in Fig. 1. First, the salesman can state his requirements for leads in natural language; for example, he may say he wants to find medical-industry companies and sell ERP products to them. By extracting the keywords from the natural language input, the basic requirement words can be obtained. These are also called leads keywords and are used for further mining from public web data. The requirement descriptions are usually simple and ambiguous, which is insufficient for mining much more information from public web data.

Therefore, the second step is to discover more support words for automatic crawling. The support words are words similar to the leads keywords and are produced by the support words discovery model. The model then generates a knowledge base of semantic support words. All the words in the knowledge base serve as features for the salesman.

Leveraging this knowledge, we can learn the demands of the salesman, realize customized crawling for him, and recommend proper leads via the leads analysis engine. Automatic web crawling obtains raw leads data from public web data, such as news websites, recruitment websites, funding websites, project scoop websites, etc., by putting the words of the knowledge base into an automatic crawling queue. The crawled data is analyzed to create raw leads in real time, and the leads are then segmented into words and indexed for quick querying. The leads analysis engine is responsible for recommending leads to the salesman according to the knowledge base of semantic support words, the user personas, and the historical leads data derived from the raw leads. Meanwhile, it performs automatic contact crawling for each lead so that the salesman can reach the customers.

Fig. 1. The framework of LeadsRobot

3.1 Semantic Understanding Engine

The semantic understanding engine aims to understand the questions posed by the salesman and to extract the keywords included in the text. The result of semantic understanding is the set of leads keywords, which are extracted via NLP approaches.

To conduct keyword extraction, the first step is word segmentation of a sentence or passage, which is the basis of Chinese keyword recognition, because Chinese segmentation is quite different from English word segmentation, where blanks already separate words. In real applications, Chinese word segmentation relies heavily on corpus resources. We use Ansj as the tool for Chinese word segmentation; it supports efficient segmentation through an optimized trie tree while keeping memory usage relatively low. With it, a sentence can be converted into a set of words.
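Ansj itself is a Java library; as an illustration only, the dictionary-based matching idea behind such segmenters can be sketched in Python with a forward-maximum-matching routine (the vocabulary here is a hypothetical toy dictionary, not Ansj's trie):

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position, greedily take the longest
    substring found in the dictionary; unknown single characters pass through."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```

For example, segmenting the string "医疗行业" with the vocabulary {"医疗", "行业"} yields the two words ["医疗", "行业"]. Real segmenters replace the greedy match with statistical models and a much larger dictionary.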

The second step is to recognize keywords from the result of word segmentation. The TF-IDF approach [15] is employed for keyword recognition. It combines term frequency and inverse document frequency to discover the keywords in a document. The basic idea is to filter out commonly used words and retain the vital words of a document. For a given corpus D, word i and document j, the term frequency is defined as \(TF_{ij}=\frac{n_{ij}}{\sum _{k}n_{kj}}\), where \(n_{ij}\) denotes the number of times word i occurs in document j, and \(\sum _{k}n_{kj}\) denotes the total number of word occurrences in document j. The inverse document frequency is \(IDF_{ij}=\log \frac{N}{|\{j \in D: i \in j\}|+1}\), where N is the number of documents in corpus D and \(|\{j \in D: i \in j\}|\) is the number of documents in which word i appears. Therefore \(TFIDF_{ij}=TF_{ij} \times IDF_{ij}\), where \(TFIDF_{ij}\) represents the importance of each word in a document.
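The definitions above can be sketched directly in Python. This is a minimal illustration of the TF-IDF formulas (including the +1 smoothing in the IDF denominator), not the engine's actual implementation:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of documents, each a list of already-segmented words.
    Returns, per document, a dict mapping word -> TF-IDF score."""
    N = len(docs)
    df = Counter()                      # |{j in D : i in j}| per word i
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)           # n_ij
        total = sum(counts.values())    # sum_k n_kj
        scores.append({w: (c / total) * math.log(N / (df[w] + 1))
                       for w, c in counts.items()})
    return scores
```

A word appearing in every document gets a low (even negative, under this smoothing) score, while a word concentrated in one document scores highest there.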

We use TF-IDF to extract keywords from the words given by segmenting the natural language input. These keywords are closely related to the sales leads and refer to the industry, product name, targeted sales district, etc. The leads keywords are the basis for conducting leads mining from public web data; to some extent they express the salesman's expected leads intentions.

3.2 Support Words Creator

Based on the semantic understanding engine, various leads keywords can be extracted from the salesman's expressions. However, it is tricky and insufficient to use these limited keywords to mine leads from public web data. Therefore, the support words creator is devised to discover many more words related to the leads keywords. These generated words are called support words and allow us to find words that are similar and close to the leads keywords.

Fig. 2. The framework of the support words creator

The framework of the support words creator is demonstrated in Fig. 2. A substantial amount of public web data is first gathered from the Internet as a training dataset. This data is unstructured when initially gathered and comes from multiple sources with diverse structures; therefore, the body text is extracted and then processed by word segmentation.

Then the TF-IDF and TextRank algorithms [9] are jointly used to model the segmented data, because the TF-IDF algorithm performs poorly when extracting semantic support words from short text and does not consider the semantic context, while the TextRank algorithm has much higher complexity in this application context. It is therefore suitable to combine TF-IDF and TextRank to extract semantic support words. There are two types of text: short text and long text. The former is more suitable for the TextRank algorithm, while the latter is appropriate for TF-IDF. By combining the two approaches, the framework can guarantee the quality of extraction while ensuring extraction speed.

TextRank is a keyword extraction algorithm suited to short text, and we employ it to identify the keywords within news that is closely related to the leads context. Similar to PageRank, the well-known search engine ranking algorithm successfully used in Google search, TextRank borrows the same idea for keyword extraction. The algorithm introduces the concept of edge weights and determines the relationships among words via a sliding window. A good advantage is that the performance of the algorithm depends only on the input corpus, with no extra training cost.

The model of TextRank can be represented as a directed weighted graph; in the context of a sentence, an edge represents the relationship between words co-occurring within a window. Specifically, we use the algorithm in Fig. 3 to extract keywords from external data.

Fig. 3. Algorithm of keyword extraction by TextRank

For the algorithm in Fig. 3, we first use a trained HMM model to segment the input data and obtain the word list words. Second, we use the POS model [11] to filter the words, keeping only words with the assigned parts of speech. Then each word in words is taken as a node, a window size is set (e.g., 3), and the graph model is built, where the weight of an edge is the sum of the distances between the two words. Furthermore, we use the TextRank procedure (getWeight) to obtain the importance of each word. Finally, we choose the n words with the highest weight as keywords. In the getWeight procedure, given an input word and the graph model word_graph, we output its importance. First, we obtain the set word_in of all words that have a path to word, and the weight d from word to word_in. Then, for each node wi in word_in, we iteratively find the nodes reachable from wi and assign the sum of their distances to od. Subsequently, we calculate the sum of (d/od)*getWeight(wi, words_graph) and assign it to res_you. Lastly, we obtain (1−d)+d*res_you, which is the importance of the input word.
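As an illustrative sketch (not the authors' implementation), the window-based graph construction and the (1−d)+d·Σ iteration can be written in Python roughly as follows, with d as the usual damping factor and simple co-occurrence counts standing in for the distance-sum edge weights described above:

```python
from collections import defaultdict

def textrank(words, window=3, d=0.85, iters=50, n=5):
    """Rank words of a segmented text by TextRank over a co-occurrence graph."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):                 # sliding window builds edges
        for j in range(i + 1, min(i + window, len(words))):
            if w != words[j]:
                graph[w][words[j]] += 1.0
                graph[words[j]][w] += 1.0
    out_sum = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):                        # (1-d) + d * weighted neighbor sum
        scores = {w: (1 - d) + d * sum(graph[v][w] / out_sum[v] * scores[v]
                                       for v in graph[w])
                  for w in graph}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Words that co-occur with many other high-scoring words accumulate weight over the iterations, so frequent, well-connected terms surface as keywords.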

Because the data for training and modelling is quite limited, the support words discovered by the above approaches are insufficient; therefore, a word2vec model is used to generate words that are similar to the discovered semantic support words. The word2vec model is trained on a substantial public dataset consisting of a large number of Chinese words and phrases, which are mapped into fifty-dimensional float vectors. Words that are close in semantics are also close in vector distance, and the similarity between words can be obtained by calculating the cosine value between their vectors. Meanwhile, to accelerate iteration, all the vectors are partitioned so that close words fall into the same partition, which speeds up the discovery of semantic support words. Using the word2vec model to generate semantic support words can discover a large number of them from the existing extracted words, which benefits the mining of external enterprise data.
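The cosine-similarity lookup can be illustrated with a toy sketch, where hypothetical three-dimensional vectors stand in for the fifty-dimensional trained embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest_words(word, vectors, k=3):
    """Return the k words whose vectors are closest to `word`'s by cosine value."""
    target = vectors[word]
    ranked = sorted(((w, cosine(target, v)) for w, v in vectors.items() if w != word),
                    key=lambda p: p[1], reverse=True)
    return [w for w, _ in ranked[:k]]
```

The partitioning mentioned above simply restricts this ranking to vectors in the same partition, trading a little recall for iteration speed.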

3.3 Data Crawler

The data crawler aims to obtain raw leads data from public web data and performs both automatic web crawling and automatic contact crawling. The former gathers raw leads data from many open websites (business news, recruitment, funding, project scoops, etc.), while the latter discovers the contact information for the corresponding leads, such as phone number, email, and address. This helps the salesman directly reach targeted leads and convert them into sales performance.

The semantic support words generated by the support words creator can therefore be used as features describing the leads required by the salesman. Using them, we employ the data crawler to mine more similar data from potential target websites. The architecture of the data crawler is shown in Fig. 4. Many existing websites provide search interfaces for their data, but they usually set strict restrictions for visitors. Even though the result set is limited for each query, we can build different search conditions from the semantic support words and the website search interfaces to mine much more valuable information. Specifically, by examining the update frequency of public websites, a timed scheduler accomplishes reliable and efficient crawling with an IP resource pool. The public web data is parsed into the raw leads database by a data parser that extracts the raw leads data from web page files. We use Beautiful Soup, an open source HTML parser for web pages, to achieve this process.
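The paper uses Beautiful Soup for parsing; as a dependency-free sketch of the same parsing step, the body text of a crawled page can be extracted with Python's standard html.parser (a stand-in for illustration, not the system's actual parser):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text from a page, skipping <script>/<style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks, self._depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    p = BodyTextExtractor()
    p.feed(html)
    return " ".join(p.chunks)
```

The extracted text is what the downstream NER and keyword-extraction services consume.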

Fig. 4. The architecture of the data crawler

3.4 Leads Analysis Engine

The leads analysis engine aims to analyse leads from the raw leads data supplied by the data crawler. The core task is to recognize Chinese company names in a substantial amount of text. Company name recognition is tricky, and it is a typical NER task.

As shown in Fig. 2, the input to the leads analysis engine consists of the raw leads, the user personas, the knowledge base of semantic support words, and the leads data from the salesman's history records. The leads analysis engine carries out two types of tasks. First, it achieves company name recognition from the raw leads data; NER approaches are employed to address Chinese company name recognition. Second, it recommends leads to the salesman according to his customized needs.

To handle company name recognition, an NER service is developed.

To carry out NER, various libraries are needed to assist the recognition of company name entities. Our framework includes three: the company library, the corpus library and the rule library. The company library plays a critical role in recognizing company name entities: all potential company names are compared against it to verify the results. The corpus library helps improve the accuracy of NER via company product and project information, because most news articles are oriented towards specific projects or products of companies; using this information, we can search for a company and find potential companies. The other important component is the rule library. NLP is quite tricky, and NLP algorithms can rarely cover all conditions; to address incorrect recognition results, we establish a rule library to help filter the noise during recognition. Taking Chinese enterprises as an example, our system builds the following rules: (1) we choose several high-quality companies related to a specific location or geography, since more sales leads exist in these positions than in other districts; (2) we set a series of blacklists to filter words that seriously affect recognition accuracy, especially shielding the noise in the company library during recognition; (3) we directly filter unexpected patterns in company names, such as digits or names that are too short or too long; (4) we also delete foreign person names and cartoon names appearing in company names. All these steps reduce the noise as far as possible.

To achieve the recognition of company entities, we use a CRF-based named entity algorithm built on the corpus library. The corpus in our framework is used to assist word segmentation via a two-dimensional hidden Markov model (HMM) [2, 13] trained on a substantial amount of labelled Internet data. Based on this, a word segmentation dictionary is built to support the NER algorithms. After filtering with the rule library, we can calculate the certainty factor and association factor.

Fig. 5. Algorithm of Chinese company name recognition

As the algorithm in Fig. 5 demonstrates, the input is the text to be recognized and the output is the possible company names recognized in the text. First, the cleared two-dimensional word transition matrix and the generated probability matrix are used to train the HMM model, and the training set labelled with name entities is used to train the CRF model. Subsequently, the HMM model segments the input text into a word list. Furthermore, the CRF model marks the word sequences to obtain possible company name entities. Then, all possible name entities are matched against the company list built with an inverted index, and all matched company name entities are retained in company_entities. Finally, each name entity in company_entities is matched against the rules in rule_library in turn; entities hit by a rule are removed, which yields the final results.
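The matching and rule-filtering steps at the end of the algorithm can be sketched as follows; the company set and the two rules are hypothetical examples standing in for the inverted index and the rule library, and the HMM/CRF stages themselves are omitted:

```python
def filter_candidates(candidates, company_index, rules):
    """Keep CRF-proposed names found in the company library, then drop any
    name that triggers a rule predicate from the rule library."""
    matched = [c for c in candidates if c in company_index]
    return [c for c in matched if not any(rule(c) for rule in rules)]

# Hypothetical rules mirroring rule (3) of the paper's rule library:
def too_short(name):
    return len(name) < 4                          # filter names that are too short

def has_digit(name):
    return any(ch.isdigit() for ch in name)       # filter names containing digits
```

A production rule library would add blacklists, location constraints and the foreign/cartoon name filters described above as further predicates in the same list.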

For the convenience of searching, we extract keywords for indexing the leads (company names) using the TextRank approach given in Sect. 3.2.

Fig. 6. The chi-square verification model

However, some recognized company names are inaccurate because of the context of the text. Therefore, to improve the precision of company name recognition, we apply a chi-square verification model [1] to the final results. The basic idea is to calculate the degree of divergence between theoretical values and observed values. With the NER service and the keyword extraction service, we obtain company names and their corresponding keywords. We then manually label a portion of the company names and keywords and take them as observed values, while the generated results are considered theoretical values; the chi-square model matches them and outputs the most significant company names and their corresponding keyword lists. Concretely, each generated result is represented as a two-tuple <company, keyword>, which we mark as a class, and we calculate the chi-square value for each keyword. Furthermore, for each input two-tuple, we calculate its chi-square value to determine its class (Fig. 6).
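The divergence computation itself is the standard chi-square statistic; a minimal sketch follows, where the per-class expected counts are hypothetical values rather than the system's trained ones:

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over paired observed/theoretical counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def best_class(observed, class_expectations):
    """Assign the observation to the class whose expected counts diverge least."""
    return min(class_expectations,
               key=lambda c: chi_square(observed, class_expectations[c]))
```

A <company, keyword> tuple is thus kept under the class whose theoretical keyword distribution it diverges from the least.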

Based on the leads and keywords given by the above approaches, the leads recommendation for the salesman is conducted in the following steps. First, the list of semantic support words is obtained from the knowledge base of semantic words. Then we get the salesman's user personas from the user personas database and find the N users whose actions are most similar to his. Furthermore, we retrieve the leads data of these users from the converted leads data and apply a statistical model to obtain their statistical characteristics. Finally, we find the M companies in the leads database closest to these characteristics and recommend them to the salesman.
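These steps can be sketched roughly as below; using persona-feature overlap as the similarity measure and lead frequency as the statistical characteristic are simplifying assumptions, and all names are hypothetical:

```python
from collections import Counter

def recommend(salesman, personas, history_leads, leads_db, n_users=3, m=5):
    """personas: user -> feature list; history_leads: user -> past lead companies;
    leads_db: set of companies currently in the leads database."""
    def overlap(a, b):
        return len(set(a) & set(b))        # similarity of two persona feature sets
    similar = sorted((u for u in personas if u != salesman),
                     key=lambda u: overlap(personas[u], personas[salesman]),
                     reverse=True)[:n_users]                     # the N similar users
    stats = Counter(c for u in similar for c in history_leads.get(u, []))
    return [c for c, _ in stats.most_common() if c in leads_db][:m]  # top M companies
```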

Table 1. Critical N values

4 Case Study

To demonstrate the feasibility and effectiveness of our proposed robot framework, we verify it with a case about ERP sales in a real scenario. For this case, we collaborate with Kingdee International Software Group Company, a leading ERP solutions supplier in China with multiple ERP product lines for different industry solutions. Selling ERP is a non-trivial problem because it is closely related to the information requirements of enterprises. Sales leads are critical for improving ERP sales performance; however, it is inefficient and insufficient for sales staff to search for valuable leads in a substantial number of web pages.

In our case, we recommend high-quality leads to Kingdee product lines according to their diverse requirements. Take Kingdee Cloud as an example: it is the flagship product of Kingdee ERP, provides ERP service in cloud mode, and is sold in various districts of China. According to their requirements, we recommend leads to the leads management staff, who distribute them to the salesmen of Kingdee Cloud. The leads are prospects for buying ERP and include all the information about the prospects (full company name, industry, staff, registered capital, address, contact email, phone number, etc.). All the leads are distributed to the salesmen for following up (Table 1).

As shown, we use the feedback results of three cities (Shenzhen, Shanghai, Guangzhou) from salesmen in China. The salesmen sell ERP by dialling the phone numbers in the leads we provide. From the results, Shenzhen and Shanghai have much higher valid-leads rates; their average valid-leads rate exceeds 20%, while Guangzhou is relatively lower than the other two cities. According to the salesmen's feedback, the valid-leads rate of leads they collect from web pages themselves is usually only about 1%–2%, which is totally insufficient to satisfy their demands. The leads provided by our service at least triple the valid-leads rate compared with manual web searching, which can significantly improve the conversion rate of ERP product sales.

According to the feedback from their CRM, the sales performance for Cloud ERP using LeadsRobot is greatly improved: in 2017, the number of deals signed from the recommended leads was 135, with 256 contracts and a contract amount of 5.682171 million RMB.

5 Performance and Evaluation

To test the performance of LeadsRobot, we carry out an evaluation of the speed of leads generation and the throughput of leads recommendation. The configuration of the cluster for leads collection and generation is given in Table 2.

Table 2. The configuration for the cluster

The configuration of the computer for the throughput test is given in Table 3. From Oct. 10, 2017 to Feb. 28, 2018, we collected 1,507,681 leads, generating 13,706 leads per day on average. In other words, one lead is generated every 6.3 s on average.

Table 3. The configuration for throughput test

All the leads are generated by crawling, through the automatic raw web data crawling and contact crawling. To test the recommendation throughput of LeadsRobot, we randomly selected 1000 leads from the leads database. As Fig. 7 demonstrates, the average throughput of LeadsRobot is 689 leads per hour. According to the salesmen, they can collect at most 100 leads per day; the leads LeadsRobot generates in one hour can thus equal those collected by two salesmen, which greatly improves the efficiency of the salesman.

Fig. 7. LeadsRobot throughput test

6 Conclusions

Sales leads are critical for the salesman to carry out his business. Currently, it is inefficient for the salesman to blindly search for leads among a substantial number of web pages. Besides, manual leads analysis greatly occupies the salesman's time.

To address the issue of lead acquisition, in this paper we propose LeadsRobot, a software robot that seeks to achieve automatic leads discovery by analyzing web big data on the Internet. Multiple text processing approaches are used to mine the web big data: web mining proceeds with the TextRank and TF-IDF models, and an NER service is presented and developed for Chinese company name recognition. According to the feedback from the ERP sales scenario, the recommendations made by LeadsRobot can greatly help the company improve its sales performance.