
1 Introduction

Word embeddings, or distributed representations of words in a vector space [1], have been shown to capture valuable semantic information in many Natural Language Processing (NLP) tasks. They are used as a replacement for lexical dictionaries such as WordNet [2] in semantic meaning expansion, leveraging linguistic background knowledge. Tasks that benefit include, but are not limited to, document classification [3, 4], named entity recognition [5], and word sense disambiguation [6].

A word embedding framework comprises a set of language modeling and feature learning techniques from NLP that transform words or phrases into vectors of real numbers. Popular embedding frameworks include Word2Vec [1, 7], GloVe [8] and the very recent FastText [9, 10]. Generating word embeddings may require a significant amount of time and effort: collecting large-scale data, pre-processing the data, training models, evaluating the results and tuning hyper-parameters for performance improvement [11, 12]. Therefore, using pre-trained word vectors learned from billions of words is a cost-effective solution, with potentially better performance due to the massive amount of data used in obtaining these pre-trained embeddings. Although pre-trained Word2Vec (Footnote 1), GloVe (Footnote 2), and FastText (Footnotes 3 and 4) embeddings are readily available for various NLP tasks, they are trained on the general domain only. Training on a domain corpus containing domain-specific terminology can create embeddings that capture more meaningful semantic relations in the application domain.

The most popular test for demonstrating the effectiveness of distributed representations of words, using real-valued vectors, is the analogy test [13]. The analogy test typically takes the form that the relation between the first pair of words equals that of the second pair. A famous example showcasing the valid semantic relations captured by word embeddings is:

$$king:man\,{:}{:}\,queen:woman$$

In other words, king is related to man as queen is related to woman.

The analogy relations can be broadly classified into morphological relations and semantic relations. For a general-purpose corpus, the semantic relations might include purpose, cause-effect, part-whole, part-part, action-object, synonym/antonym, place, degree, characteristic, and sequence. In a special domain, such as the geological survey domain we are interested in, domain-specific relations arise, such as those between a commodity and its mineralisation environment, locations, and host rock types.

All existing analogy test data are for the general domain only. Assuming word embeddings have been trained sufficiently to capture semantic relations between words, such distributed representations would become an invaluable tool in querying real-world textual data to support knowledge discovery from the wide variety of reports available. Take the Western Australian Mineral Exploration Reports (WAMEX) data for example: a query of Kalgoorlie-Gold+Iron ore=? may help discover a selected number of words related to iron ore in the same way that Kalgoorlie is related to gold, potentially capturing the location relation for iron ore.

In this paper, a framework that includes pre-processing, domain dictionary construction, entity extraction, clustering for exploratory study of text reports and finally analogy queries is developed to support domain-specific information retrieval. 33,824 geological survey reports are used to train and obtain geological word embeddings. Our experiments compare these geological domain-specific embeddings against pre-trained word embeddings in answering analogy queries that are of interest to domain experts.

Geological experts, Wikipedia and even Google Maps have confirmed how meaningful these word embeddings are in capturing domain-specific information. This warrants future work in the direction of developing word-embedding-enabled production information retrieval systems for domain-specific textual data. Our framework is designed in a modular fashion so that it is readily applicable to other domains, even though our experimental results are for geological survey reports.

The initial success of the prototype system built using this framework on geological survey reports still leaves much to be done before it can be applied to real-world document collections at large. Nevertheless, it provides hope for addressing the significant yet challenging problem of learning from the massive amount of “dark corporate data” stored in the less accessible textual format.

2 Literature Review

2.1 Embedding Architectures

Three state-of-the-art neural network architectures for learning word embeddings are discussed here, namely Word2Vec, GloVe and FastText. They have been shown to perform well on various NLP tasks and on large-scale corpora of billions of words [14].

Word2Vec [1, 7] supplies two predictive model architectures to produce a distributed representation of words: Continuous Bag-of-Words (CBOW) and Continuous Skip-Gram (CSG). The CBOW model predicts the target word from a window of surrounding context. The order of surrounding words does not influence prediction, hence the name bag-of-words. The CSG model uses the target word to predict the context in the window, weighing nearby context words more heavily than distant ones. It is reported that CBOW is faster, but CSG performs better for infrequent words and is more accurate on a large corpus. An optimisation method such as Hierarchical Softmax or Negative Sampling reduces the computation of the output layer to speed up the training of a model [15]. Some studies [1, 7] reported that Hierarchical Softmax works better for infrequent words, while Negative Sampling performs better for frequent words and with low-dimensional vectors.
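These architectural choices map directly onto constructor arguments in GenSim's Word2Vec implementation. A minimal sketch, assuming GenSim 4.x and a toy corpus (the example sentences and parameter values below are illustrative only, not those used in our experiments):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
sentences = [["gold", "deposit", "kalgoorlie"],
             ["iron", "ore", "pilbara", "hematite"]]

# CBOW with Hierarchical Softmax: sg=0 selects CBOW, hs=1 enables hierarchical softmax.
cbow_hs = Word2Vec(sentences, sg=0, hs=1, negative=0,
                   vector_size=50, window=5, min_count=1)

# Continuous Skip-Gram with Negative Sampling: sg=1 selects CSG, negative=5 draws 5 noise words.
csg_ns = Word2Vec(sentences, sg=1, hs=0, negative=5,
                  vector_size=50, window=5, min_count=1)
```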

Global Vectors (GloVe) learns by constructing a word co-occurrence matrix that captures how frequently words appear together in a context. GloVe then factorises this large co-occurrence matrix to reduce its dimensionality [8].

While Word2Vec and GloVe treat each word as the smallest unit to train on, FastText uses character n-grams as the smallest unit [9, 10]. For example, one word vector can be composed of several vectors representing its character n-grams. The benefit of FastText is that it can generate word embeddings for words that are rare or unseen during training, because the character n-gram vectors are shared by multiple words.
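A simplified sketch of the subword idea (the real FastText additionally hashes the n-grams into buckets and keeps a vector for the full word; the function name and example word below are ours):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, as used for FastText subwords."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

# 'sandstone' and 'greenstone' share subwords such as 'ston' and 'tone>',
# so an unseen compound word can still be assigned a useful vector.
print(char_ngrams("stone", 3, 4))
# ['<st', 'sto', 'ton', 'one', 'ne>', '<sto', 'ston', 'tone', 'one>']
```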

2.2 Word Relations

Words relate to each other in different ways. Gladkova et al. [16] introduced the Bigger Analogy Test Set, which categorises word relations into inflectional and derivational morphology, and lexicographic and encyclopedic semantics.

Morphological relations can be inflectional or derivational. Inflectional morphology is the modification of a word to express different grammatical categories, including number, tense and comparison; for example, rock:rocks, occur:occurred, and hard:harder. Derivational morphology forms new words by changing the syntactic category or adding new meaning to a word; for example, able:unable, produce:reproduce, and employ:employment, employer, employee, employable.

Semantic relations can be lexicographic or encyclopedic. Lexicographic relations include hypernymy (superordinate relation), hyponymy (subordinate relation), meronymy (a part-of relation), synonymy (same meaning) and antonymy (opposite meanings); for example, apple:fruit, color:blue, member:team, talk:speak, and up:down. Encyclopedic semantics define closely related words without considering grammatical changes; for example, Australia:English (country:language), sky:blue (thing:color) and dog:puppy (adult:young).

2.3 Algorithms for Solving Analogy Test

Analogy tests are performed on morphological or semantic similarities of words including countries and their capital cities, countries and their currencies, modification of words to express different grammatical categories such as tenses, opposites, comparatives, superlatives, plurals, and gender inflections. For example, France:Paris::Italy:Rome; go:gone::do:done; cars:car::tables:table; wife:woman::husband:man and better:good::larger:large.

Performing analogy tests well depends on the embedding method, its parameters, the specific word relations [16], and the method of solving analogies [17]. Analogies solved by one method may not be solved by another method on the same embedding. Therefore, generated embeddings are more useful for exploring the underlying dataset than for evaluating it. Pair-based methods such as 3CosAdd [13] and 3CosMul [18] perform analogical reasoning based on the offset of word vectors. 3CosAdd performs a linear sum over normalised vectors and thus ignores the lengths of the embedding vectors, unlike the Euclidean distance. The 3CosMul method amplifies the differences between small quantities and reduces the differences between larger ones by using multiplication instead of addition. Set-based methods [17] include 3CosAvg and LRCos. 3CosAvg works on the vector offset averaged over multiple pairs. LRCos combines supervised learning of the target class with cosine similarity. The pair–pattern matrix method Latent Relational Analysis (LRA) [19] takes word pairs and constructs a matrix to find the relational similarity between them, deriving patterns automatically from a large corpus augmented with synonyms. The Dual-Space method relies on direction and ignores spatial distance between word vectors [20].
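For concreteness, minimal NumPy sketches of the two pair-based solvers for a query a:b::c:?, assuming `vectors` is a dictionary mapping words to their embedding vectors (the helper names are ours):

```python
import numpy as np

def _cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cos_add(vectors, a, b, c):
    """3CosAdd: argmax_d cos(d,b) - cos(d,a) + cos(d,c) for the query a:b :: c:?."""
    best, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = _cos(vec, vectors[b]) - _cos(vec, vectors[a]) + _cos(vec, vectors[c])
        if score > best_score:
            best, best_score = word, score
    return best

def cos_mul(vectors, a, b, c, eps=1e-3):
    """3CosMul: cosines shifted to [0, 1]; the ratio amplifies small differences."""
    def s(u, v):
        return (_cos(u, v) + 1.0) / 2.0
    best, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = s(vec, vectors[b]) * s(vec, vectors[c]) / (s(vec, vectors[a]) + eps)
        if score > best_score:
            best, best_score = word, score
    return best

# Usage on a well-trained embedding:
# cos_add(vectors, "man", "king", "woman")  -> expected 'queen'
```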

3 Methodology

3.1 Architectural Overview

An overview of our study is presented in Fig. 1. The system consists of a pre-processing module, a dictionary construction module which supports an annotation and filtering module for entity extraction, a word embedding training module to learn embeddings from either pre-processed text or extracted entities, a similarity module that implements multiple similarity tests, and an analogy test module with various solvers.

Fig. 1. An overview

Pre-processing. The pre-processing module includes cleaning, tokenisation and lemmatisation processes using the Natural Language Toolkit (NLTK) [21]. The geological corpus data are cleaned by removing stop words, numbers and delimiter characters. The stop word list includes 353 common words such as the, a, an, and is. After cleaning, we tokenise the text by breaking sentences into lists of words. The remaining words are then lemmatised using the NLTK WordNetLemmatizer to remove inflectional endings and reduce them to their base forms.
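A minimal sketch of this pipeline with NLTK (the example sentence is ours, and NLTK's built-in English stop word list stands in for the 353-word list used in the system; the relevant NLTK resources are assumed to be downloaded):

```python
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = re.sub(r"[\d]+|[^\w\s]", " ", text.lower())    # drop numbers and delimiter characters
    tokens = word_tokenize(text)                          # break sentences into word lists
    tokens = [t for t in tokens if t not in stop_words]   # remove stop words
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce words to their base form

print(preprocess("The dolerites were intruded by 2 quartz veins."))
# ['dolerite', 'intruded', 'quartz', 'vein']
```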

Annotation. To target domain specific entities and avoid noise, a dictionary-based named entity extraction method is implemented. Only the words of interest from the geological corpus are collected for learning embeddings. A geological vocabulary of 5623 terms is created, which contains minerals, commodity names, geological eras, rocks, stratigraphic units, and mineralisation styles, as well as geographical information, such as location names, mines, tectonic setting names and regions. Figure 2 shows some terms in the vocabulary. The sources of these geological terminologies are Wikipedia (Footnote 5), Geographical Locations of Western Australia (WA) (Footnote 6) and the WA Stratigraphic Units Database (Footnote 7). The data are annotated using this domain vocabulary. Once annotated, the textual data are transformed into the format shown in Fig. 3.
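A hypothetical sketch of the dictionary-based annotation step, greedily matching the longest vocabulary phrase first and rewriting it as a single underscore-joined token (the tiny vocabulary and function name are ours, for illustration only):

```python
vocabulary = {"banded iron formation", "iron ore", "kalgoorlie", "hematite"}

def annotate(tokens, vocab=vocabulary, max_len=4):
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):               # try the longest phrase first
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocab:
                out.append(phrase.replace(" ", "_"))  # e.g. 'iron ore' -> 'iron_ore'
                i += n
                break
        else:
            out.append(tokens[i])                     # not a dictionary term, keep as-is
            i += 1
    return out

print(annotate(["massive", "iron", "ore", "near", "kalgoorlie"]))
# ['massive', 'iron_ore', 'near', 'kalgoorlie']
```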

Fig. 2. Domain vocabulary

Fig. 3. Raw text to annotated text

Fig. 4. WAMEX entities for training

Filtering. During the filtering process, all entities (words and phrases) are extracted based on the annotation using our geological dictionary. All documents are filtered so that only dictionary terms are kept for the embedding learning process. Each resulting document contains only dictionary terminology and is then used for embedding training. Figure 4 shows two example sentences as they appear in the filtered text; entities from those two sentences are underlined.
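Under the same assumptions as the annotation sketch above, the filtering step reduces to keeping only the annotated dictionary terms:

```python
def filter_entities(annotated_tokens, vocab_tokens):
    """Keep only the annotated dictionary terms; everything else is discarded."""
    return [t for t in annotated_tokens if t in vocab_tokens]

vocab_tokens = {"iron_ore", "kalgoorlie", "hematite", "banded_iron_formation"}
print(filter_entities(["massive", "iron_ore", "near", "kalgoorlie"], vocab_tokens))
# ['iron_ore', 'kalgoorlie']
```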

Embedding. Embedding models use two types of context: linear context refers to the positional neighbours of the target word, while dependency-based context uses the syntactic neighbours of the target word, derived from a dependency parse tree with part-of-speech labelling. Word representations can be bound or unbound. A bound context representation considers the sequential positions of context words relative to the target word, whereas an unbound representation treats all words within the chosen context window equally, irrespective of their positions. According to Li et al. [14], who compared Word2Vec GSG, Word2Vec GBOW and GloVe models, linear context is sufficient for capturing topical similarity compared with dependency-based context, and word analogies are solved most effectively with unbound representations. Therefore, for this study we choose Word2Vec models with unbound representations and linear context.

3.2 Data Clustering and Visualisation Using t-SNE

t-distributed Stochastic Neighbour Embedding (t-SNE) [22] is a popular dimensionality reduction technique that projects high dimensional vectors onto a low dimensional plane, while preserving the distances and similarities of the data as much as possible. We use t-SNE to visualise semantic closeness of words in the various embeddings obtained in this paper. Two main similarity measures for text clustering are Cosine similarity and Euclidean distance.

The cosine similarity of two given n-dimensional vectors \(\varvec{A} = (a_1, a_2,\ldots , a_n)\) and \(\varvec{B} = (b_1, b_2,\ldots , b_n)\) is calculated as the cosine of the angle between them, where the vectors may represent a pair of words, phrases, sentences, documents or corpora. When two documents have similar contents but one is several times bigger in size, cosine similarity captures how similar they are in terms of content, not of size. For vectors with non-negative components the similarity ranges from 0 to 1, where 0 means completely different, values near 1 mean highly similar, and 1 means identical in direction. Cosine similarity is defined as:

$$\begin{aligned} \cos (\varvec{A},\varvec{B}) = \frac{\varvec{A} \cdot \varvec{B}}{||\varvec{A}|| \, ||\varvec{B}||} = \frac{\sum _{i=1}^{n} a_i \cdot b_i}{\sqrt{\sum _{i=1}^{n} a_i^2} \sqrt{\sum _{i=1}^{n} b_i^2}} \end{aligned}$$
(1)

The Euclidean distance can take any non-negative value. The Euclidean distance of two n-dimensional vectors \(\varvec{A} = (a_1, a_2,\ldots , a_n)\) and \(\varvec{B} = (b_1, b_2,\ldots , b_n)\) is defined as:

$$\begin{aligned} distance(\varvec{A},\varvec{B}) = \sqrt{\sum _{i=1}^{n} (a_i - b_i)^2} \end{aligned}$$
(2)

The cosine similarity captures the relative difference between words rather than the absolute frequency difference. For example, the vector A = (2, 3) has the highest possible similarity with B = (4, 6) because they point in the same direction, although the latter vector is longer, while the Euclidean distance between them is about 3.6 units. We prefer cosine similarity because it reflects the relative difference between words and ignores vector length. t-SNE displays words close together in the visualisation if their high-dimensional vectors are similar, and far apart if they are dissimilar.
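The worked example, in NumPy:

```python
import numpy as np

A = np.array([2.0, 3.0])
B = np.array([4.0, 6.0])

cosine = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))   # 1.0: same direction, length ignored
euclid = np.linalg.norm(A - B)                             # ~3.61: sensitive to vector length
print(round(cosine, 3), round(euclid, 2))                  # 1.0 3.61
```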

3.3 Analogy Investigation

A proportional analogy holds between two word pairs, A:B::C:D, which means A is to B as C is to D. Mikolov et al. [13] first reported that word embeddings capture relational similarities and word analogies. Analogy tasks answer questions such as: what is the word X that is related to woman in the same way that King is related to man? The answer is expected to be Queen if the model is trained well: \(King + (woman - man) = Queen\)

An analogy query is answered by performing algebraic operations over the word vectors and measuring angular distance. For example, in the embedding space, cosine similarity can be used to find a word X whose offset from woman matches the offset of King from man. The embedding vectors are all normalised to unit norm. X is the continuous-space representation of the answer word; if no word is found at that exact position, the word whose vector has the greatest cosine similarity to X is returned as the answer.

The analogical reasoning method 3CosAdd [13], which is based on the offset of word vectors, is used in this study to answer analogy queries and to show how semantically meaningful terms are related. To maintain consistency, we use this cosine similarity measure for all tests on our embeddings.
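With GenSim, such a 3CosAdd query can be expressed directly through `most_similar`, which works on unit-normalised vectors and excludes the query words from the candidates (the embedding file path below is hypothetical):

```python
from gensim.models import KeyedVectors

# Hypothetical path; any embedding saved in word2vec text format will do.
wv = KeyedVectors.load_word2vec_format("wamex_terms.vec")

# Ranks candidates x by cos(x, king) - cos(x, man) + cos(x, woman); the expected answer is 'queen'.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```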

4 Results

The WAMEX dataset contains unannotated geological text reports obtained from the Geological Survey of Western Australia (GSWA) (Footnote 8). The dataset contains 33,824 geological reports with 42.6 million tokens. After filtering using our domain dictionary, the resulting WAMEX terms dataset is about fifteen times smaller than the WAMEX dataset and contains only the words that are valid mineralisation system terms; the number of tokens is reduced to 2.8 million.

Six sets of embeddings, as shown in Table 1, are prepared for this research. Two geological embeddings are trained on the pre-processed WAMEX dataset using the Word2Vec and FastText models, respectively; these are named the Word2Vec raw embedding and the FastText raw embedding. Another two geological embeddings are trained by the Word2Vec and FastText models on the WAMEX terms dataset, which only contains terms representing geological entities that are of interest to the mineralisation process; the resulting embeddings are named the Word2Vec terms embedding and the FastText terms embedding, respectively. These four embeddings are created with the CSG model using the GenSim package (Footnote 9). The following hyper-parameters are used for the training: the dimensionality of vectors is set to 100, the window size is set to 5 (a window of five neighbouring words), the negative sampling size is set to 5, and the minimum count for word frequency is set to 300. In addition, two pre-trained embeddings are downloaded: the Word2Vec pre-trained embedding (see footnote 1) and the FastText pre-trained embedding (see footnote 3). They are pre-trained on Google news of 100 billion words and a web crawl of 600 billion words, respectively. Their vocabularies cover general knowledge, including some geological terminology.
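A sketch of how the domain embeddings can be trained with GenSim 4.x using the stated hyper-parameters (`wamex_terms` stands in for the token lists produced by the pre-processing, annotation and filtering modules; the variable and file names are ours):

```python
from gensim.models import Word2Vec, FastText

params = dict(sg=1,             # Continuous Skip-Gram
              vector_size=100,  # dimensionality of vectors
              window=5,         # window of five neighbouring words
              negative=5,       # negative sampling size
              min_count=300,    # minimum word frequency
              workers=4)

# `wamex_terms` is a list of token lists, e.g. [['gold', 'kalgoorlie', ...], ...].
w2v_terms = Word2Vec(sentences=wamex_terms, **params)   # Word2Vec terms embedding
ft_terms = FastText(sentences=wamex_terms, **params)    # FastText terms embedding

w2v_terms.wv.save_word2vec_format("word2vec_terms.vec")
```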

Table 1. Embeddings
Fig. 5. Clusters related to geological eras and types of rocks

4.1 t-SNE Clustering and Visualisation

We visualise the Word2Vec terms embedding, created by the Word2Vec model on dictionary-based terms, in order to explore geological terms and their relations in the WAMEX dataset. We use t-SNE for visualisation and validation of the trained vectors in a 2D vector space. About 840 unique geological entities are visualised. Figure 5 shows clusters of entities related to geological eras and types of rocks. The geological eras in our visualisation are mentioned in the Wikipedia page on the Geologic time scale (Footnote 10). The rocks appearing in Fig. 5 are mentioned in the Wikipedia page on Sedimentary Rocks (Footnote 11).
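A possible visualisation routine using scikit-learn and matplotlib, assuming `model` is a trained GenSim Word2Vec model and `entities` is the list of dictionary terms to plot (the parameter values are illustrative, not the exact settings behind Fig. 5):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(model, entities, perplexity=30, seed=42):
    words = [w for w in entities if w in model.wv]
    vectors = np.array([model.wv[w] for w in words])       # 100-dimensional vectors
    coords = TSNE(n_components=2, metric="cosine",         # cosine similarity, not Euclidean
                  perplexity=perplexity,
                  random_state=seed).fit_transform(vectors)
    plt.figure(figsize=(12, 12))
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    for (x, y), w in zip(coords, words):
        plt.annotate(w, (x, y), fontsize=6)                # label each entity
    plt.show()
```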

This result shows groupings of different types of geological information: entity groups such as geological eras and rock types are effectively clustered using t-SNE. Similarly meaningful results are obtained for entities related to iron ore, and their relevance is confirmed by Wikipedia. Interestingly, the distances between location names in the t-SNE plot bear a strong resemblance to their distances on Google Maps.

4.2 Similarity Query

To compare and understand what types of semantic information are captured through the representational learning, extensive similarity queries are conducted on all six sets of word embeddings. Table 2 shows the top ten similar tokens given two commodities as query inputs, one for gold and one for iron ore.
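In GenSim terms, such a similarity query is a single `most_similar` call on the chosen embedding (the token spelling `iron_ore` follows our annotation scheme; `w2v_terms` is the hypothetical model handle from the training sketch above):

```python
# Top ten most similar tokens for the two commodity queries.
print(w2v_terms.wv.most_similar("gold", topn=10))
print(w2v_terms.wv.most_similar("iron_ore", topn=10))
```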

Table 2. Similarity test

The query results demonstrate that the geological domain-specific embeddings provide more relevant and useful information than the pre-trained Google or FastText vectors, despite the much smaller training corpus. In particular, the dictionary-based entity embeddings retrieved more relevant terms than the other embeddings. The largest endowment of gold in Western Australia is located in the Kalgoorlie gold camp, whereas, for iron ore, the Hamersley Province in the Pilbara region of Western Australia contains world-class iron ore deposits, with banded iron-formations a dominant host to the iron ores. The similarity query results contain relations between minerals and their associated locations, host rock types and geological eras. Our domain- and region-specific (WA) word vectors thus provide more critical information for mineral explorers operating in WA than the pre-trained vectors from Google and Facebook.

To validate our geological entities corpus further, consider another example, the geological entity ashburton formation, using the Word2Vec terms embedding. The query for the most similar entities to ashburton formation returned the following top five: wyloo group, mount minnie group, mount mcgrath formation, june hill volcanics, and capricorn group. This result is checked using the Explanatory Notes System (ENS), shown in Fig. 6, available from the GSWA, which stores relevant geological descriptions (e.g. formal names, rock compositions and age) and interpretations between major rock groups in Western Australia. These explanatory notes include the stratigraphic unit description of the Ashburton Formation, which contains collected field observations. The use of this data source, independent of the training corpus, provides an unbiased assessment of the embedding analysis results. Stratigraphic information is important for mineral explorers, as most mineral deposits are controlled by structures (e.g. geological faults) and/or stratigraphic relationships (e.g. banded iron-formations that host iron ores).

Domain experts further confirmed these findings: domain-specific embeddings learned from geology-related data provide more targeted knowledge for geological applications.

Fig. 6. A GSWA explanatory note extract for Ashburton Formation

4.3 Analogy Test

As shown in the similarity query results in Table 2, more morphological relations are present in the pre-trained embeddings, for example iron ore and Iron_Ore, whereas more semantic relations are captured by the purposely trained embeddings, for example commodity and geochemical name (e.g. gold:au) and commodity and geological era (e.g. gold:archaean). The domain experts confirmed that most of the relations that are critical to the understanding of mineralisation systems are present in the embeddings learnt from the WAMEX terms dataset.

Taking the most intuitive Commodity:Location relation as an example, we perform an analogy test, with results shown in Table 3. Our geological terms vectors trained on the WAMEX dataset reflect detailed information such as town names associated with iron ore in WA, while the Google pre-trained vectors represent general knowledge. For example, the query Kalgoorlie + (iron ore - gold) should return terms related to iron ore in the same way as Kalgoorlie relates to gold.

The Google news vectors return the location names Pilbara, Port_Hedland and Karratha, which are closely associated commodity-related locations. Our vectors trained on WAMEX reports return hematite (the most important ore of iron), martite (a type of iron ore), marandoo (the Marandoo iron ore mine in the Pilbara region), iron, west_angelas (the West Angelas iron ore mine in the Pilbara region), windarling (the Windarling iron ore mine), and mount_newman_member (an Australian stratigraphic unit in the Hamersley Basin).

The pre-trained Google vectors return entities of the same type, while our embeddings return highly related entities of mixed types. For example, when we query terms associated with a location, the Google vectors return only location names, whereas our results include related locations and minerals. More training data would improve these results by helping to return only entities of the queried type; that is, if the question word is a location or a commodity, more data during training helps to return only locations or commodities.

Table 3. Relation Commodity:Location. Query: Kalgoorlie + (iron ore - gold)

5 Conclusion

In this paper, we investigated how representational learning of words affects entity retrieval results from a large domain corpus. Extensive similarity tests and analogy queries have been performed, which demonstrated the necessity of training domain-specific word embeddings. Pre-trained embeddings are good at capturing morphological relations, but are inadequate for domain-specific semantic relations. This may seem only to confirm the obvious, but we also demonstrated the importance of entity extraction. A dictionary-based entity extraction filter is used to create the entity-only datasets, with the sentence structure completely removed. The embeddings trained over the resulting long sequences of entities using Word2Vec and FastText capture meaningful domain-specific semantic relations better than the embeddings trained on the raw data. All results are confirmed by multiple sources, such as Wikipedia, relevant external datasets (e.g. the GSWA Explanatory Notes), and, more importantly, domain experts.

The success of this initial investigation confirmed the feasibility of using vector representations of words for concept or entity retrieval. Other types of embeddings such as those generated from non-linear context should also be investigated. Different analogy solvers such as knowledge graph models are also currently under investigation.