1 Introduction

The academic homepage of a researcher usually contains abundant profile information, such as employment status, research interests, contact information and publications. These are essential resources for digital library access portals [8] to collect researcher metadata. In general, there are two frequently-used ways to collect academic homepages. One is to monitor the official websites of known research institutes and apply a binary classifier to each crawled webpage to determine whether it is an academic homepage. The other is to use researcher names and some additional information (such as the researcher's affiliation and research interests) as search engine queries to retrieve related webpages from the web, and choose one retrieval result as the academic homepage. In this paper, we focus on the second way, collecting academic homepages at scale using only the retrieval results of search engines (URLs, titles and snippets), because it can be deployed as an API service and respond rapidly at low resource cost. However, building an accurate module that automatically identifies researcher homepages from retrieval results is not easy, owing to the following technical challenges. First, one researcher may have multiple webpages associated with him/her. Second, search engines may split the query into multiple fragments to obtain more retrieval results, which also introduces noise; as a result, one's homepage may rank very low in the retrieval results.

In this paper, we focus on academic homepage identification. In comparison with previous work, which only extracted some simple statistical features from each retrieval result, our key contributions are as follows: (1) We propose a novel solution to identify researcher homepages via search engines and demonstrate its effectiveness on a publicly-available dataset. It not only obtains remarkable improvements in accuracy, but also performs more stably, as measured by precision and recall computed over different proportions of the test results. (2) We design four types of novel features to help identify homepages among highly similar retrieval results. (3) We present a joint neural network model that allows different kinds of neural networks to be trained synchronously, thus making full use of both hand-crafted feature information and sequence information.

2 Related Work

Relevant work on academic homepage identification using the retrieval results of search engines first appeared in a TREC track [1], an entity-oriented web search task that aims at finding homepages for four types of entities: organization, location, person and product. To identify an academic homepage, many query-dependent features can be effectively utilized. Tang et al. [10] used the researcher's name and affiliation name as search engine queries and selected the best retrieval result as the researcher's homepage, but only hand-crafted features from the URL are used in their work.

From the perspective of feature extraction, there are three shortcomings in previous work. First, whether the researcher's name appears in the URL, title or snippet is a critical factor for homepage identification, but it cannot be judged by simple string matching. Second, the relevance between retrieval results has not been explored: previous work only considers a single retrieval result. Third, the semantic information contained in the URL, title and snippet has not been utilized.

3 Academic Homepage Identification

Given a researcher and his/her affiliation, the query statement Q is formed as "researcher name + affiliation", e.g., "Clark T. C. Nguyen, UC Berkeley Engineering". Each query Q returns N retrieval results; each query-result pair is called a QR pair.

Table 1. Summary of features designed for each QR pair

3.1 Feature Analysis of Academic Homepages

Through the analysis of academic homepages, query statements and retrieval results, we find some typical features and summarize them into four types. These four types of features are described in Table 1.

Query-dependent (QD) Feature. The QR pair's rank in the retrieval results is positively related to being a homepage, so we extract it as a feature. The other features come from previous work [3] and can be divided into two parts: first, the number of researcher-name and affiliation fragments appearing in the URL, title or snippet; second, keywords related to homepages. The keywords contain homepage-related topic words such as "students", "member" and "committee", extracted from the web contents of homepages using the Latent Dirichlet Allocation (LDA) model, together with some high-frequency words.
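The QD features above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact implementation: the keyword list here is a hand-picked stand-in for the LDA-derived topic words, and the fragment-counting scheme is an assumption.

```python
import re

# Illustrative stand-in for the LDA-derived homepage topic words.
HOMEPAGE_KEYWORDS = {"students", "member", "committee", "homepage", "professor"}

def qd_features(rank, name, affiliation, url, title, snippet):
    """Sketch of query-dependent features for one QR pair: its rank, plus
    per-field counts of name fragments, affiliation fragments and
    homepage-related keywords in the URL, title and snippet."""
    text_fields = [url.lower(), title.lower(), snippet.lower()]
    name_frags = [f.lower() for f in name.split()]
    aff_frags = [f.lower() for f in affiliation.split()]
    feats = [rank]  # position of the QR pair in the retrieval results
    for field in text_fields:
        feats.append(sum(field.count(f) for f in name_frags))
        feats.append(sum(field.count(f) for f in aff_frags))
        feats.append(sum(1 for w in re.findall(r"[a-z]+", field)
                         if w in HOMEPAGE_KEYWORDS))
    return feats
```

Each QR pair thus yields a fixed-length vector (1 rank feature plus 3 counts per field).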

Entity-saliency (ES) Feature. Whether the researcher's name and affiliation appear in the retrieval result is quite important for homepage identification. However, a researcher's name has various forms. ES features aim to make entities easier to detect. Hence, some heuristic rules are used to splice name fragments, e.g., from "Jiawei Han" to "JiaweiHan" or from "Charu C. Aggarwal" to "CharuAggarwal".
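A minimal sketch of such splicing rules, assuming (since the paper does not enumerate them) that the useful variants are the fully joined name, the first-plus-last name with middle initials dropped, and a surname-first form:

```python
def splice_name(name):
    """Generate spliced variants of a researcher name, e.g.
    'Charu C. Aggarwal' -> includes 'CharuAggarwal' and 'CharuCAggarwal'.
    The specific rule set is an illustrative assumption."""
    parts = name.replace(".", "").split()
    variants = {"".join(parts)}            # all fragments joined
    if len(parts) > 2:
        variants.add(parts[0] + parts[-1])  # drop middle initials
    variants.add(parts[-1] + parts[0])      # surname-first order
    return sorted(variants)
```

These variants can then be matched against URLs such as `hanj.cs.illinois.edu`, where the name appears without whitespace.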

Local-contextual (LC) Feature. Existing work considers only a single QR pair, ignoring the relationships between QR pairs of the same researcher. LC features establish these relations, and contain: Rank: the rank of a specific QR pair; Mean: the mean of the feature values of the top N QR pairs; Variance: the variance of the feature values of the top N QR pairs; Normalized feature: feature values normalized by the standard deviation.
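The four LC statistics above can be computed in one pass over a researcher's top-N QR pairs; the following NumPy sketch assumes the hand-crafted features are stacked into an N×d matrix:

```python
import numpy as np

def lc_features(feature_matrix):
    """feature_matrix: (N, d) array of feature values for the top-N QR
    pairs of one researcher. Returns per-pair LC features: rank, the
    per-feature mean and variance over the N pairs (broadcast to every
    row), and the standardized feature values."""
    X = np.asarray(feature_matrix, dtype=float)
    n, _ = X.shape
    rank = np.arange(1, n + 1).reshape(-1, 1)
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    std = np.sqrt(var)
    normalized = (X - mean) / np.where(std > 0, std, 1.0)  # guard zero std
    return np.hstack([rank,
                      np.repeat(mean, n, axis=0),
                      np.repeat(var, n, axis=0),
                      normalized])
```

Because mean and variance are shared across the researcher's QR pairs, each row now carries context about the other candidates, not just its own values.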

Fig. 1. Architecture of our joint model.

Semantic-embedding (SE) Feature. We first perform segmentation and an entity-masking operation on the URL, title and snippet. Entity masking encodes names and institutes in a unified way: researcher-name fragments, researcher-institute fragments and numerical fragments are each encoded to a fixed number. The sequences of the URL, title and snippet are padded to fixed lengths and concatenated into a single sequence \(\varvec{S}\), which is fed into a bidirectional LSTM (BiLSTM) [5] model to obtain the semantic information.
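A sketch of the masking-and-padding step for one field; the concrete token ids for the name, institute and number masks, and the pad length, are assumptions not specified in the paper:

```python
# Hypothetical reserved ids for the three masked fragment types and padding.
NAME_ID, INST_ID, NUM_ID, PAD_ID = 1, 2, 3, 0

def mask_and_pad(tokens, name_frags, inst_frags, vocab, max_len):
    """Encode a token sequence: map name/institute/numerical fragments to
    fixed mask ids, look up other tokens in the vocabulary, then truncate
    or pad to max_len."""
    ids = []
    for tok in tokens:
        low = tok.lower()
        if low in name_frags:
            ids.append(NAME_ID)        # researcher-name fragment
        elif low in inst_frags:
            ids.append(INST_ID)        # institute fragment
        elif low.isdigit():
            ids.append(NUM_ID)         # numerical fragment
        else:
            ids.append(vocab.get(low, PAD_ID))
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```

Running this over the URL, title and snippet and concatenating the three padded sequences yields the single input sequence \(\varvec{S}\).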

3.2 Joint Model in Our Work

In order to integrate hand-crafted features and semantic-embedding feature, we employ a joint neural network and adopt a joint training mode. The joint neural network model, as shown in Fig. 1, aims to improve the identification ability of the model by combining hand-crafted features and semantic information.

The joint neural network model has two inputs: the first takes the hand-crafted features, denoted by \(\varvec{v}_\texttt {hc}\); the second takes the sequence information, denoted by \(\varvec{v}_\texttt {seq}\). \(\varvec{v}_\texttt {seq}\) is fed into a one-layer BiLSTM, whose output is a global sentence-level hidden vector \(\varvec{g}\), as detailed in Sect. 3.1. The model also has two outputs, a main output and an auxiliary output. At the training stage, the two outputs share the same label. At the testing stage, each outputs a probability between 0 and 1 representing the score of the QR pair, and the value of the main output is taken as the final score.
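The two-input, two-output architecture can be sketched with the Keras functional API (the paper's implementation framework, per Sect. 4.1). All dimensions, the LSTM width, and the assignment of \(\lambda\) and \(1-\lambda\) to the auxiliary and main outputs are assumptions for illustration:

```python
import numpy as np
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     Dense, Dropout, Concatenate)
from tensorflow.keras.models import Model

SEQ_LEN, VOCAB, EMB_DIM, N_HC = 60, 5000, 256, 20  # hypothetical sizes

# Sequence branch: embedding + one-layer BiLSTM -> global hidden vector g
seq_in = Input(shape=(SEQ_LEN,), name="seq")
g = Bidirectional(LSTM(128))(Embedding(VOCAB, EMB_DIM)(seq_in))
aux_out = Dense(1, activation="sigmoid", name="aux")(g)   # auxiliary output

# Hand-crafted branch: concatenate g with the feature vector v_hc
hc_in = Input(shape=(N_HC,), name="hc")
h = Concatenate()([g, hc_in])
for _ in range(3):                       # 3-layer DNN, as in Sect. 4.1
    h = Dropout(0.25)(Dense(128, activation="relu")(h))
main_out = Dense(1, activation="sigmoid", name="main")(h)  # final score

model = Model([seq_in, hc_in], [main_out, aux_out])
lam = 0.3  # the tunable lambda of the joint loss; this value is arbitrary
model.compile(optimizer="adam", loss="binary_crossentropy",
              loss_weights={"main": 1.0 - lam, "aux": lam})
```

With shared labels on both heads, `loss_weights` realizes the \(\lambda\)-weighted joint loss, and `model.fit` trains the BiLSTM and the DNN branch synchronously.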

The joint training mode has two advantages. One is that it keeps the BiLSTM and embedding layers training smoothly even when the joint loss value is high. The other is that it exploits the semantic information contained in the sequences. For a single QR pair i, the loss is given by Eq. (1), and the batch loss used for mini-batch stochastic gradient descent is given by Eq. (2), where m is the batch size. We set a tunable parameter \(\lambda \) to control the joint loss function Eq. (3). The goal of parameter estimation is to find the optimal \(\theta ^{*}\) that minimizes the joint loss \(\mathcal L_\text {joint}\), where y is the label of the data and \(h_\theta (x)\) represents a series of linear or nonlinear transformations.

$$\begin{aligned} \begin{aligned} \mathcal L(h_\theta (x_i),y_i)=-y_i\log (h_\theta (x_i))-(1-y_i)\log (1-h_\theta (x_i)) \end{aligned} \end{aligned}$$
(1)
$$\begin{aligned} \begin{aligned} \mathcal L(x, y)=\sum _{i=1}^{m}\mathcal L(h_\theta (x_i),y_i) \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned} \mathcal L_\text {joint}(x, y)=\lambda \mathcal L_\text {seq}(x, y)+ (1-\lambda )\mathcal L_\text {hc}(x, y) \end{aligned} \end{aligned}$$
(3)
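Eqs. (1)-(3) transcribe directly into a few lines of NumPy; this sketch clips probabilities for numerical stability (an implementation detail not stated in the paper):

```python
import numpy as np

def bce(h, y, eps=1e-12):
    """Eq. (1): per-example binary cross-entropy; clip to avoid log(0)."""
    h = np.clip(np.asarray(h, dtype=float), eps, 1 - eps)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

def batch_loss(h, y):
    """Eq. (2): sum of per-example losses over a batch of size m."""
    return float(np.sum(bce(np.asarray(h), np.asarray(y, dtype=float))))

def joint_loss(h_seq, h_hc, y, lam=0.3):
    """Eq. (3): lambda-weighted combination of the sequence-branch loss
    and the hand-crafted-branch loss."""
    return lam * batch_loss(h_seq, y) + (1 - lam) * batch_loss(h_hc, y)
```

Setting \(\lambda = 0\) or \(\lambda = 1\) recovers training on a single branch, which makes \(\lambda\) a clean knob for the ablation between HF_DNN-style and SF_BiLSTM-style training.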

4 Experiments

4.1 Experimental Setup

We conduct our experiments on a real-world dataset provided by AMiner. It contains 20,445 researchers and 203,019 corresponding retrieval results, where each researcher has 8 to 11 retrieval results. The dataset is divided into three parts: a training set (6,000 researchers, 59,675 QR pairs, 5,677 homepages), a validation set (2,435 researchers, 24,187 QR pairs, 2,267 homepages) and a test set (12,010 researchers, 119,157 QR pairs, 11,364 homepages). To assess the impact of our approaches, we set up the following six comparison experiments.

(1) Baseline (BL_SVM/BL_RSVM). These experiments use features from previous work [3, 10], as described in the query-dependent feature part of Sect. 3.1, with the SVM [2] and RankSVM [6] models. (2) Hand-crafted features with SVM (HF_SVM). An SVM model that takes the hand-crafted features proposed in this paper as input. (3) Hand-crafted features with deep neural network (HF_DNN). A deep neural network that uses only the hand-crafted features to classify homepages; it can be seen as the baseline for the joint neural network. (4) Semantic-embedding feature with BiLSTM (SF_BiLSTM). The auxiliary classifier trained alone to identify homepages. (5) Combined features with joint neural network (CF_JNN). This experiment combines the hand-crafted features and the semantic-embedding feature and adopts the joint training mode detailed in Sect. 3.2.

To better evaluate these approaches, we use three evaluation criteria: accuracy (a page is considered correct only if the identified page equals the labeled homepage), precision-recall curves [4] and the mean reciprocal rank [9]. For the network configuration, the dropout probability is tuned to 0.25, the deep neural network has 3 layers, the semantic-embedding vector size is set to 256, the batch size is 300, and the optimizer is Adam [7] with a learning rate of 0.001. We implement the neural models with Keras and directly use its default parameter initialization strategy. Since academic homepages make up only about one tenth of the retrieval results, the weight proportion of the positive to negative data is set to 9:1 heuristically to overcome the problem of data imbalance.

Table 2. Accuracies (%) of different approaches.
Fig. 2. P-R curves and MRR values of different approaches on the test set.

4.2 Experimental Results and Discussion

We first report the accuracy of the six experiments in Table 2. Our CF_JNN approach achieves the best results: 71.12% on the validation set and 71.04% on the test set. In comparison with the baseline approaches BL_SVM and BL_RSVM, whose accuracies are 60.31% and 60.17%, our approach performs 10-11% better. There are two main reasons: our more effective features, as shown by the comparison between BL_SVM and HF_SVM, and our joint model, as shown by the comparison between HF_DNN and CF_JNN.

From the accuracies of HF_DNN, SF_BiLSTM and CF_JNN, we observe that the joint neural network significantly outperforms both SF_BiLSTM and HF_DNN, which means the joint model is effective. According to the P-R curves in Fig. 2, our improved features and joint neural network are more stable than previous work. From the P-R curves of BL_SVM and HF_SVM, although the two have similar accuracy, our features yield higher F1 values and behave more stably in most cases. P-R curves cannot be drawn for the RankSVM model, since it compares retrieval results of the same author rather than producing a global score. These results demonstrate the effectiveness of our approach.

5 Conclusion and Further Work

In this paper, we study the problem of academic homepage identification using the retrieval results of search engines. To fully leverage both the structural and the content information in retrieval results, we propose a joint neural network model that identifies academic homepages using both carefully designed features and semantic embeddings. We conduct experiments on a real-world dataset, and the results demonstrate the effectiveness of our approach. Our future direction is to investigate the performance of our approach in identifying related webpages for other types of entities, such as institutes, medicines and weapons.