
What Is in a <unittitle>? Cross-lingual Topic Detection & Information Retrieval in Archives Portal Europe

Published: 26 March 2024


Abstract

Archives Portal Europe (APE, www.archivesportaleurope.net) is the portal of European archives, an aggregator that connects, at a single point of research, the catalogues and digitised archival material of all archives in and about Europe. It currently hosts material from more than 30 countries and from a variety of archival institutions (such as State archives, city archives, university and parish archives, private institutions, and more). It is maintained by the Archives Portal Europe Foundation, an international consortium of State archives and other archival institutions that aim to connect the archival material of single institutions into one digital repository to allow universal access to the archival heritage of Europe, promoting new forms of archival research beyond national or local boundaries. One of the research tools made available by Archives Portal Europe is search by topics; however, these are currently maintained manually by the archivists, and the vast amount of archival material ingested in the portal makes it impossible to have a comprehensive body of topics that describe the whole of the APE repository. Archives are traditionally not organised by their subject content, but around the entity (person, organization, body) that created and/or collected the documents in the course of their activities. While this is an undisputed pillar of archival management, the availability of online digital repositories for archival research requires new tools for digital archival research, particularly when different archival traditions from different countries and different types of institutions are merged into a single research portal. Topic detection becomes a fundamental tool to guide archival research and to allow archives to be accessible to potentially world-wide users in a situation where national and linguistic barriers blur or are re-defined.
This article presents the preliminary results and plan for future iterations of an AI tool for automated topic detection in a multilingual environment, where human-created taxonomies act as bases for the algorithms to aggregate relevant material around a specific topic. The development is based on supervised machine learning, combining human inputs in different languages with the use of Wikipedia pages to model the relevant vocabulary and entities.


1 INTRODUCTION

Archives are traditionally organised not by their subject, but around the entity (person, organisation, body) that created and/or collected the documents composing the archive in the course of their activities. Finding aids in archives start by describing the records creator and later go on describing the content of the collection's components, usually down to the single files and documents.1 In their classic work “Manual for the arrangement and description of archives,” the first manual on the discipline, Muller et al. stated that “the archive arises as a consequence of the activities of the person who formed it [as] the documents can only be understood from the point of view of the task involved” [37]. More recently, Cook [8], Duranti [11], and Duranti and Franks [13] considered the principle of provenance equal to the creation of a record and additionally emphasised the arrangement of the records (according to the “original order”) as an essential part of archival practice.

Consequently, the starting point of archival research is traditionally based around the records creator; searching documents on Napoleon will most likely start with a trip to the “Archives Nationales” in Paris, which holds the documents produced by the French state. While provenance remains an undisputed pillar of archival management,2 the development of online catalogues has caused historical archives operating in a digital environment to require new tools for archival research.3 In particular, searching by keywords, arguably the most intuitive and most often used feature of online search, has made the challenge of searching by subject, rather than simply by records creator, one of the most vivid in the development of online archival catalogues—after all, searching by keywords very often means searching with specific content in mind.4

Already in 1980, Richard H. Lytle [28] and Lytle and Dürr [29] posited that archives in the digital environment would need both a provenance-based retrieval method and what Lytle defined as a “Content Indexing Method” based on subject. In this scenario, topic association becomes a fundamental feature of online archival research, and automated topic detection an essential tool for archivists and researchers working in an online environment. This article presents the results of a proof-of-concept project for an automated topic detection tool developed specifically for Archives Portal Europe (APE), www.archivesportaleurope.net, the portal on archives from and about Europe. APE is an aggregator that connects, at a single point of research, the catalogues and digitised archival material from institutions in more than 30 countries and from a variety of organisations (such as State archives, city archives, university and parish archives, private institutions, etc.). Because of the multi-institution, multi-country, and multilingual characteristics of the portal, APE is the ideal recipient of an automated topic detection tool, which would allow users to browse the vast amount of material much more easily and the portal's curators to organise multi-institution and multi-country collections more consistently. The highly challenging scenario presented by APE is also a benchmark for assessing the current performance of Natural Language Processing (NLP) and Information Retrieval (IR) tools developed for cross-language tasks in a realistic end-user application. With the partial exception of the Europeana 1914/18 Data Science project [17], where automated topic modelling was applied to a dataset specifically created around the Europeana collections related to World War I, automated topic detection methods have not yet been applied to historical archival catalogues, particularly in a multi-institution and multilingual environment.5

Through a proof-of-concept project conducted by the Archives Portal Europe Foundation and supported by King's College London, we developed an approach for automated topic detection in a multilingual environment, where human-created archival descriptions and associated topic labels act as training data for an algorithm that discovers new relevant material yet to be topically annotated. The information retrieval prototype developed as part of this project goes beyond the usual keyword searches by establishing semantic connections between terms that form part of the archival descriptions rather than relying on the user to know the exact phrases used by the records creators and/or archivists. In doing so, the project picks up on approaches that have been on the rise during the last decade, such as linked data [3], knowledge graphs [18], Named Entity Recognition (NER) [20], or machine learning [41]. Furthermore, it does not rely on index terms from controlled vocabularies being part of the archival descriptions to identify relevant subjects, and it allows the identification of topically relevant materials across different languages without employing a machine translation pipeline.

The development of such functions was based on supervised machine learning and cross-lingual word embeddings; domain experts contributed to the information retrieval prototype by developing topic taxonomies in different languages, which, integrated with the usage of Wikipedia pages, allowed the retrieval of concepts and entities in materials’ descriptions across languages. While historical documents have been at the centre of many NLP-based projects, topic detection projects working on archival catalogues (i.e., on metadata), rather than on the actual content of the documents, have not yet been at the centre of any research project, particularly in a highly multilingual scenario such as the one described.

Following a general overview of NLP in Digital Humanities and a description of how Archives Portal Europe currently uses and produces topics, this article focuses on outlining the methodology that has been applied for this proof-of-concept. It also presents an initial analysis of the results, which will act as the starting point for the next steps in bringing the proof-of-concept to pilot, alpha, and beta phases, with the intent to provide an open-source version of the tool during the upcoming years.


2 NATURAL LANGUAGE PROCESSING AND ARCHIVAL RESEARCH

Automatic topic-driven methods for IR, and for archival research in particular, have been studied and applied since at least the 1980s.6 During the past decade in particular, there has been a great focus in Digital Humanities (DH) on the adoption of NLP methods for identifying topics in textual corpora. Such approaches can be divided into two main groups: supervised and unsupervised. Supervised approaches detect topics by relying upon (a) a predefined and (supposedly) comprehensive list of topics and (b) a (potentially large and representative) set of materials manually annotated with the relevant topic. To train an algorithm employing, for instance, a Support Vector Machine (SVM) [21] or a Convolutional Neural Network [23], each document is represented as a single feature-vector, which should capture the “meaning” of its content. To do so, researchers often rely on pre-trained word embeddings [25]; these are vector representations of words, initially obtained from large corpora, that can be combined to model longer sequences [25], such as the description of materials in APE. In recent years, research has shown that such pre-trained word embeddings can be aligned across different languages by generating a common “semantic” space [7, 15]. With each description represented as an embedding and associated with a topic label, the algorithm then learns the relation between the embedding and the associated topic. Supervised approaches generally offer reliable performance for topic detection tasks (see, for instance, the experiments conducted by Merz et al. [33] and Glavas et al. [15] on the Manifesto Corpus). To learn such a vector-label relation, however, they need large amounts of so-called training examples, which are often difficult to collect, because they have to be sets of consistently conducted human annotations.

For this reason, unsupervised approaches for identifying topics in texts, and Latent Dirichlet Allocation (LDA) topic modelling in particular [7], have become highly popular in DH during the past 10 years [32, 39]. These approaches do not require any training data or topic labels in advance and leverage similarities in the materials under study (for instance, in LDA, how words co-occur with each other) to identify underlying patterns in the data, which could correspond to topics. Unsupervised approaches are simple to use and very useful for initial corpus exploration; however, it is very hard to employ them for topic detection, especially in cases in which the user already knows which topics are contained in the collection [40]. This is because results of LDA topic models are often extremely hard to interpret [6], and it is not straightforward to align them with our common notion of topics.

In our research setting, we have a relatively large amount of material at hand, approximately two million documents, that is already annotated with topics across different languages (see Table A). Even though this represents only a small part of Archives Portal Europe's dataset and bears the challenge that topics might not have been assigned consistently in all cases (see Section 3), we decided to make use of it and opted for a supervised approach for topic detection as a first step. Nevertheless, it is important to keep in mind that the annotations in the test dataset will most likely not cover all topics in our collection, and that our sample of annotated materials might not entirely represent the distribution of topics present in the entire APE collection. Moreover, it is important to take into consideration that employing word embeddings pre-trained on a collection (in our case, Wikipedia) will necessarily embed social, gender and ethnic biases that exist in society (and consequently in the corpus), which might have consequences for the modelling and the output produced [5]. For these reasons, readers should consider our experiments only as a first attempt towards a very challenging goal; in future experiments we plan to combine supervised and unsupervised methods to tackle these challenges in a more comprehensive way.


3 AN ARCHIVAL CATALOGUE OF ARCHIVAL CATALOGUES: SEARCHING IN ARCHIVES PORTAL EUROPE

Archives Portal Europe first went online in 2012, with 14 million descriptive units from national and institutional archives; it then expanded to include many different types of private and public archival institutions. As of October 2020, the portal holds 282 million descriptive units, from over a thousand institutions in more than 30 countries that actively provide content.7 The portal is maintained by the Archives Portal Europe Foundation, an international consortium of State archives and other archival institutions that aim to connect the archival material of single institutions into one digital catalogue, allowing researchers to access a myriad of archival institutions across Europe through a single research portal. The portal currently operates in 24 languages (and five different alphabets) and holds descriptions of more than 590,000 archival fonds and collections, making it the largest aggregator of its kind in the world. The main objective of APE is to provide a single point of research for this strongly varied archival material, and to display finding aids describing individual documents or collections along with contextual information. The scope of the portal is to gather archival material from Europe or about Europe, without limitations on the nationality or type of institution holding the archival material.

Approaching an online portal of historical archives means being at the crossroads of the old creator-based way to search an archive and the new Google-like information retrieval way, based on keywords. To retain all the characteristics of traditional archival research while simultaneously making use of the new forms of information retrieval enabled by digital technologies, Archives Portal Europe allows for three forms of research. First, as in any physical institution, it is possible to search by the holding guides (overviews of the collections and fonds) and/or a list of finding aids (structured descriptions of archival materials per collection or fonds, often up to item level) of any given institution. However, the added value of digital search lies in the second form, keyword search, where it is possible to search the descriptive metadata of the collections of every institution present in the portal at the same time. These descriptions are in all the languages represented in the portal, and will be returned as search results provided they match the spelling of the keywords used; to maximise the results and help researchers navigate multilingualism, Boolean operators and wildcards are in place to make searching across languages as comprehensive as possible.8

The third form of research, and the most important for the scope of this article, is search by topic. Assigning an archival collection to a specific topic makes it possible to go beyond simple keyword search and to make larger semantic associations on a specific subject, expanding a query not only across different languages, but across different interpretations of a specific research question. Topics group together archival collections, from different institutions and records creators, that relate to the same subject. It is possible to start with specific content already aggregated from different archival institutions and different countries, without the need to use specific keywords that may not be included in document descriptions that actually refer to a specific topic; for example, the word “slavery” (in any language in which it is used) may not be contained in the descriptions of collections that are highly relevant to the subject. Furthermore, even with Boolean operators and wildcards in place, it may be extremely lengthy and repetitive to search keywords related to slavery in all languages represented in the portal.

By being the single entry-point to the European archival heritage, the portal enables new types of digital archival research, freed from geographical limitations and based on cross-country comparison and multilingualism. The portal entirely respects the principles of provenance and original order, and it allows searching by records creator, through high-level holding guides and through the finding aids of each institution that provides content to the portal [38, 27].

However, the added value of Archives Portal Europe as a new technology for archival research in a digital environment lies in the possibility of searching across multiple archives from different countries and cultures in a comparative perspective. For this reason, APE strongly promotes searching by topic to scrape the portal in a horizontal way, making new connections between archives and the subject of a specific search, ultimately allowing a researcher to find what s/he did not know existed on a specific subject, in archives that were not originally under consideration.


4 TOPIC ASSOCIATION IN ARCHIVES PORTAL EUROPE

While the benefits of topic association are clear for researchers, the implementation of this metadata point in the repository of Archives Portal Europe carries several important obstacles and challenges, which are embedded in the general organisation of the portal. The ingestion and maintenance of the data in Archives Portal Europe is organised through a decentralised approach, in which each single institution providing material to the portal has access to a back-end dashboard to autonomously ingest the material; in the majority of cases, however, a national-level aggregator (e.g., national archives or national-level portals such as Archives Hub in the UK)9 takes care of the ingestion for multiple institutions in its country directly. At the moment, topics are assigned manually by the institutions holding the archival materials or by the national-level aggregator on their behalf; they do so according to their local or national classification systems, their own ways of organising archival material, and their own sensitivity.

There are currently two ways in which topics can be assigned: via the association to source guides or based on the subject headings of finding aids. A source guide (or thematic guide) is a special type of holdings guide based on subject rather than on holding institution: In the portal's dashboard, the single institution (or national-level aggregator) can gather finding aids that refer to the same subject into a source guide. When assigning a source guide to one of the predefined topics in Archives Portal Europe, all components of all the finding aids linked in the source guide are also assigned to this topic. While this is an easy “catch-all” approach, it bears the potential risk of some materials being tagged with a topic that does not specifically apply to these documents per se, but rather to the collection that they are a part of. Furthermore, the source guide approach in itself does not connect the predefined topic terms with related terms that might be present in the archival descriptions.

The latter aspect is addressed by the second way to assign topics, through subject headings. Here, the predefined list of topic terms is put in relation to subject headings that are already part of the descriptive metadata of a collection. This makes it possible, for example, to connect the general topic term “Arts” with more specific subject headings such as “Paintings,” “Sculpture,” “Drawings,” and so on, as they appear in the archival descriptions. These relationships between topic terms and subject headings can be established on an institutional as well as on a national level, and they enable content providers to connect to national vocabularies and ontologies as well as to institutional rules and guidelines for creating subject headings. With these relationships set up once, every component that includes one of the mapped terms in its subject headings will then be assigned to the central topic as defined in APE (see Figure 1) [2].


Fig. 1. Topic assignment workflow.
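
The two assignment routes described above can be illustrated with a minimal Python sketch. It is not APE's dashboard code: the function names, the mapping table, and the identifiers are hypothetical, and the only behaviour taken from the text is that a source guide passes its topic to every component of every linked finding aid, while a subject-heading mapping assigns a topic only to components that carry a mapped heading.

```python
from collections import defaultdict

# Hypothetical mapping from local subject headings to central APE topics,
# following the "Arts" example in the text.
HEADING_TO_TOPIC = {
    "paintings": "Arts",
    "sculpture": "Arts",
    "drawings": "Arts",
}


def topics_from_headings(subject_headings, mapping=HEADING_TO_TOPIC):
    """Return the central topics implied by one component's subject headings."""
    normalised = (heading.strip().lower() for heading in subject_headings)
    return {mapping[heading] for heading in normalised if heading in mapping}


def topics_from_source_guide(guide_topic, finding_aids):
    """Assign the guide's topic to every component of every linked finding aid.

    `finding_aids` maps a finding aid identifier to its component identifiers.
    """
    assigned = defaultdict(set)
    for component_ids in finding_aids.values():
        for component_id in component_ids:
            assigned[component_id].add(guide_topic)
    return dict(assigned)
```

The sketch makes the trade-off visible: the source guide route tags every component regardless of its own description, whereas the heading route depends entirely on subject headings being present in the metadata.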

Topics are currently far from covering comprehensively what is available on Archives Portal Europe on any given subject. On the contrary, several topics are clearly not representative of the whole repository, with only a handful of documents tagged on subjects that are actually strongly present in European history (e.g., the topic “Napoléon I, Emperor of the French” is currently only associated with six descriptive units). This depends on three main factors.

First, while institutions and national aggregators assign topics independently, they can only choose from a predefined list, established by the Archives Portal Europe Foundation at a central level. At the moment, the topic list in use was created by the Archives Portal Europe staff and network, closely following the UK Archival Thesaurus (UKAT), the subject thesaurus created for the archive sector in the United Kingdom [43]. While the UKAT is based on the general UNESCO thesaurus structure [44], it is still rooted in the archival tradition of a specific country, and it does not provide a one-size-fits-all solution.10

Second, the semantic relations between documents and topics are still at the discretion of the single archivist conducting the association, and decisions are often made on the basis of pre-existing archival traditions that may strongly vary from country to country.

Third, there are institutions that specifically decided against including subject headings in the finding aids for reasons of resources in keeping up with the task of archival description in general. This impedes the linking of finding aids to topics through semantic affinities.

The combination of these aspects leads to the current state of play, where only a handful of countries have had their metadata available in a way that would have made it easy to connect to the topics at the time of ingestion in Archives Portal Europe and where most archival institutions do not have the means to conduct topic association manually. At the moment, Archives Portal Europe features 62 topics that vary from very general ones, such as “Economics” or “Education,” to specific entities, such as “German Democratic Republic” or “Napoleon I.” The number of descriptive units associated with a topic varies from as few as 5 to as many as 400k; however, this does not reflect how specific a topic is: something as general as “Statistics” includes only 5 associated records, whereas a specific topic such as “German Democratic Republic” includes more than 100k records. France is the only country that has participated significantly in topic association, contributing to 60 existing topics and strongly shaping the outlook of the existing topic architecture. To a much smaller extent, Germany has contributed to 6 topics, Poland to 2, and Finland and Latvia to 1 topic each. In total, fewer than 2M documents have been assigned to 1 or more topics, out of a total repository of 280M: a mere 0.69% of the data available in Archives Portal Europe (see Table A). Manual topic association, whether carried out through crowdsourcing or by professional archivists, has clear limits to what can be achieved in the presence of such a vast and varied repository. Automated topic detection, however, could open the doors to a complete redesign of the research approach, and new accessibility, of the portal.


5 TOPIC DETECTION: BUILDING A NEW TOOL

The first prototype of an automated topic detection tool was developed on a test sample out of the whole dataset of Archives Portal Europe, consisting of a pool of 9 topics and a total of 457,000 descriptive units (see Table B for a full list). The topics were selected according to the following criteria:

Multilingualism: As the vast majority of documents currently tagged with a topic have been contributed by institutions either from France or from Germany, the first challenge was to find a balance between having languages with a big enough representation in the test data to learn from and finding topics that included documents from more than one country, and hence in more than one language, to address the multilingualism that is central in the context of Archives Portal Europe. In most cases, this meant using topics that included documents in French and German, but this criterion also led to including languages that are less tackled in NLP, such as Finnish, Latvian, and Polish.11

Representation: The pool of selected topics tried to represent the variety of topics in Archives Portal Europe. Topics varied in size and scope, from the very specific (“Napoleon I”; “German Democratic Republic”) to the very generic (“Economics”); from referring to a specific type of record (“Maps”), to a historical practice (“Genealogy”), to a certain group of records creators (“Notaries”); from being barely sketched (6 records on “Napoleon I,” 765 for “Slavery”) to being already a useful tool for research (40,000 on “Genealogy,” from three different countries).

Entities & Concepts: To address two of the main search approaches in archival research, persons/places and subjects/themes, the selected topics also aimed at including entity-based topics such as “Napoleon I” or “German Democratic Republic”; concept-based topics such as “Economics” or “Slavery”; and mixed topics such as “First World War,” which contains both entities and concepts.

Using this sample, we developed a prototype to highlight the potential and the challenges of cross-lingual topic detection, as well as a new approach to information retrieval based on this. The tool has four main components: a supervised topic detection algorithm that works across languages; a topic taxonomy of relevant words; two information retrieval functions that allow users to browse for information across languages on a given concept or entity.

5.1 Cross-lingual Topic Classification

The first step addressed during the development of the prototype was the training of a supervised classifier for topic detection on the collection, considering as “documents” the descriptive units within a finding aid.12 The developed tool could be used both to discover entries that might be relevant to a specific topic beyond the label they are currently associated with (for instance, a document under “First World War” but also relevant for the topic “Economics”) and to enrich materials not already manually labelled. In recent years, the NLP community has intensively focused on distributional semantics and word-embedding methods to move beyond word-frequency approaches to determine text similarity [34]. These methods have become increasingly useful also in cross-lingual scenarios [15], as they make it possible to capture underlying topic similarities without requiring complex machine translation systems. In our experiments, we employed fastText word embeddings from all languages present in our dataset13, aligned in a common cross-lingual “semantic” space by the MUSE project [7]. We represented the description of each descriptive unit in the collection as the averaged vector of all its words, thereby obtaining a single “document embedding” for each description. Then, we trained a supervised topic classifier (a multi-class Support Vector Machine) in a 10-fold cross-validation setting, using these document embeddings as feature-vectors, similar to what has been done already by Glavas et al. [15]. This approach achieved very high performance, identifying the correct topical label for the materials in over 90% of the cases.14 We additionally ensured that the classifier was correctly distinguishing between topics, and not languages, by conducting an in-depth error analysis.

These results highlight the potential of contemporary NLP methods in a complex multilingual scenario in a cultural heritage context and provide a real-world testbed for these computational approaches, outside of the established controlled evaluation scenarios where they are usually examined. For the system deployed at the end of this proof-of-concept phase, we used an SVM trained with cross-lingual document embeddings representing the entirety of the training set. Any newly acquired document or user query would then likewise be represented with a cross-lingual document embedding and be classified by the SVM into one of the possible topics.
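
The document representation used in this section can be sketched in a few lines of standard-library Python. The sketch assumes a lookup table `word_vectors` that maps words from any language to vectors in one shared space; the two-dimensional toy vectors below are invented for illustration and are not fastText or MUSE values. To keep the example free of dependencies, a nearest-centroid rule stands in for the multi-class SVM used in the project, so the sketch shows the averaging step and the general train-then-classify flow rather than the actual classifier.

```python
import math

# Toy aligned vectors (invented values): words with similar meaning in
# different languages sit close together in the shared space.
TOY_VECTORS = {
    "guerre": [1.0, 0.0], "krieg": [0.9, 0.1], "war": [0.95, 0.05],
    "carte": [0.0, 1.0], "karte": [0.1, 0.9], "map": [0.05, 0.95],
}


def document_embedding(text, word_vectors):
    """Average the vectors of all known words in a description."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return None  # no known word, so no embedding can be built
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def train_centroids(labelled_descriptions, word_vectors):
    """Stand-in for SVM training: compute one mean embedding per topic."""
    grouped = {}
    for text, topic in labelled_descriptions:
        embedding = document_embedding(text, word_vectors)
        if embedding is not None:
            grouped.setdefault(topic, []).append(embedding)
    return {
        topic: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for topic, embeddings in grouped.items()
    }


def classify(text, centroids, word_vectors):
    """Return the topic whose centroid is closest to the text's embedding."""
    embedding = document_embedding(text, word_vectors)
    if embedding is None:
        return None
    return min(centroids, key=lambda topic: math.dist(embedding, centroids[topic]))
```

With French and German training descriptions only, an English query such as "war" still lands on the "First World War" centroid in this toy setting, which is the cross-lingual effect that an aligned embedding space is meant to provide.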

5.2 Topic Taxonomies Generation

In parallel, to generate a first list of potentially relevant entities and concepts, we processed each document using the spaCy Python library. We extracted all detected entities (people, locations, organisations, etc.) through a Named Entity Recogniser (NER) and searched for the most relevant corresponding entry in Wikipedia (i.e., the entity in Wikipedia that is most frequently mentioned with such name).15 Then, we provided subject-experts with two lists of potentially relevant topical entities: one list of the overall most frequent entities for each topic and a second list of the most distinctive entities (i.e., entities that are frequent in a topic but not very frequent in others; this was computed using an entity adaptation of Term Frequency–Inverse Document Frequency (TF-IDF), as in Lauscher et al. [24]). Using these lists as starting points, domain experts curated a series of topic taxonomies by adding keywords that they thought were relevant to search for a specific topic in the repository, either because they were frequently used in relation to a topic (e.g., “efficiency” in “Economics”) or because they represented entities that were clearly related to a subject topic (e.g., “Stasi” for “German Democratic Republic”). Domain experts included historians and archivists, thus combining the sensitivity around a topic from the point of view of both researchers and records managers; they were also selected from different countries, thus expanding the pool of languages present in the taxonomies by Maltese, Italian, English, Norwegian, Spanish, and Portuguese.
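
The idea of a "most distinctive" entity list can be illustrated as follows. The exact weighting used in the project is not given here, so this sketch assumes one common TF-IDF variant in which each topic plays the role of a document: an entity's relative frequency within a topic is multiplied by the logarithm of the number of topics divided by the number of topics in which the entity occurs. The function name and the input format are hypothetical.

```python
import math
from collections import Counter


def distinctive_entities(entities_by_topic, top_n=10):
    """Rank each topic's entities by in-topic frequency weighted by rarity.

    `entities_by_topic` maps a topic label to the list of entity mentions
    extracted from the descriptions tagged with that topic.
    """
    n_topics = len(entities_by_topic)
    counts = {topic: Counter(ents) for topic, ents in entities_by_topic.items()}
    # Number of topics in which each entity appears at least once.
    topic_frequency = Counter(e for counter in counts.values() for e in counter)

    ranked = {}
    for topic, counter in counts.items():
        total = sum(counter.values())
        scores = {
            entity: (n / total) * math.log(n_topics / topic_frequency[entity])
            for entity, n in counter.items()
        }
        # Highest score first; ties are broken alphabetically for stability.
        ranked[topic] = sorted(scores, key=lambda e: (-scores[e], e))[:top_n]
    return ranked
```

Under this weighting an entity that occurs in every topic scores zero, however frequent it is, which is why a separate list of plain frequency counts remains useful alongside it.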

5.3 Concept Search

Relying on both the cross-lingual classifier and the generated taxonomies, we then developed a function for browsing the collection for concepts across languages; to further test the functionalities, we added English and Italian as supported languages for user queries. Given a user query in one of the supported languages, for instance, “traité” in French, this would be represented as a cross-lingual “document” embedding, similar to the approach employed above. Then, using the trained topic classifier, we would be able to detect the potentially relevant topics for the query (e.g., “First World War” and “Notaries”). Next, we would rank all materials in the collection by measuring the “semantic” similarity of their vectors with the vector of the query (we used cosine similarity, a common practice in information retrieval, and have employed the Faiss library, developed by Facebook Research, for achieving very fast performance16) [45]. As we use cross-lingual document embeddings, results of this ranking would therefore also be in languages different from the query, but they should supposedly address the same concept; by default, we showed only the first 100 retrieved materials, but this can be easily modified in the interface. Wildcards and Boolean operators are currently not supported; nevertheless, because basic algebraic operations are possible in word-embedding spaces (see, for instance, the famous “King - man + woman = queen” in Mikolov et al. [35]), we will explore their usefulness for cross-lingual information retrieval on the APE collection in the future. We finally enriched the presentation of the results by using the constructed taxonomies and suggesting the most relevant topical words and entities, given the user query, as other possible query terms for further exploring the collection. These are selected as they appear in a topical taxonomy relevant for the user query (for instance, “First World War”) and are semantically similar to the user query. 
We plan to further improve this functionality by additionally suggesting topical words that are extracted from the content of the retrieved materials.
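
The ranking step described above can be illustrated with a minimal sketch. It is not the project's implementation: the three-dimensional vectors are toy stand-ins for the aligned cross-lingual document embeddings, the function names `cosine_similarity` and `rank_by_similarity` are hypothetical, and a brute-force loop takes the place of the Faiss index used in the prototype.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, top_k=100):
    """Return up to top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings (assumed values): two treaty-related descriptions in
# different languages and one unrelated description.
docs = {
    "fr_traite_de_paix": [0.9, 0.1, 0.0],
    "de_friedensvertrag": [0.8, 0.2, 0.1],
    "fr_carte_topographique": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # stand-in for the embedding of the query "traité"
ranking = rank_by_similarity(query, docs, top_k=2)
```

Because the vectors live in a shared space, the German description can rank next to the French one even though the query is in French; the `top_k` parameter plays the role of the default cut-off of 100 materials.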

5.4 Entity Search

As an additional feature, we also offer users the option to search for entities across languages. Instead of relying on cross-lingual embeddings, our retrieval function first maps the entity inserted by the user as a query to its equivalent in Wikipedia (when present). Next, it retrieves name variations in the other languages under study17 and finally searches for their occurrence in the corpus. While this function is an early prototype and does not fully rely upon entity disambiguation approaches, we found it useful as an additional way of exploring the collection. As with concept search, alongside the ranking of potentially relevant documents we provide an additional list of potentially relevant associated topical words and entities from the taxonomies. In this case, however, we return results only when they contain a mention of the entity (in one of the covered languages), so there might be fewer of them than the default cut-off (set at 100 materials).
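
The entity search flow (map the query to its name variants, clean them, then look for mentions) can be sketched as below. This is an illustration under stated assumptions: the `NAME_VARIANTS` dictionary is a hypothetical stand-in for the Wikipedia lookup, its entries are illustrative, and plain substring matching replaces whatever matching the prototype actually performs.

```python
import re

# Hypothetical lookup table standing in for the Wikipedia cross-language mapping.
NAME_VARIANTS = {
    "Wilhelm, German Crown Prince": [
        "Wilhelm, German Crown Prince",
        "Wilhelm von Preußen (1882–1951)",
        "Guillaume de Prusse",
        "Guglielmo di Prussia",
    ],
}

def clean_variant(name):
    """Drop bracketed qualifiers such as life dates, then trim whitespace."""
    return re.sub(r"\s*\([^)]*\)", "", name).strip()

def entity_search(entity, descriptions, cutoff=100):
    """Return ids of descriptions mentioning any cleaned name variant."""
    variants = [clean_variant(v).lower()
                for v in NAME_VARIANTS.get(entity, [entity])]
    hits = [doc_id for doc_id, text in descriptions.items()
            if any(v in text.lower() for v in variants)]
    return hits[:cutoff]

# Toy descriptions (assumed content).
descriptions = {
    "d1": "Correspondance de Guillaume de Prusse, 1916",
    "d2": "Brief an Wilhelm von Preußen",
    "d3": "Plan cadastral de Nice",
}
hits = entity_search("Wilhelm, German Crown Prince", descriptions)
```

Since only descriptions containing a mention are returned, the result list can be shorter than the cut-off, as noted above.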

All of the steps presented in this section are in their early stages; while they will be extended in the future to include other NLP functionalities (e.g., entity disambiguation approaches, pre-trained deep language models), they already show the potential of our strategy for cross-language topic detection and information retrieval. The current status of the prototype, together with future developments, is available on APE's GitHub page [1]; once fully developed, the tool will be released as open source, following the general development principles of APE.

6 TESTING THE TOOL

While the tool was designed to be applied to the complete dataset of Archives Portal Europe, testing on the whole repository would require processing power and memory space beyond the scope of this proof-of-concept. We thus concentrated on a sample of 457,538 descriptive units, corresponding to nine selected topics.

First, several keyword searches were conducted for each topic, according to the following criteria:

The keywords had to be relevant to the topic in question.

One keyword for each topic had to be an entity (ideally a person or a place).

One keyword for each topic had to be a concept.

One search for each topic had to be a combination of two or more keywords (with the caveat that the tool is not yet working with Boolean operators to distinguish between several keywords being used as a complete text string, and several keywords being used in combination with each other).

If a keyword could be both a concept and an entity (e.g., “Keynes” as John Maynard Keynes or as in a Keynesian policy), then the search would be done both as an entity and as a concept.

Taking into account the language distribution of the test dataset, each search was conducted in French and German, by far the most represented languages in the sample.

At least one of the other languages under consideration was used: Polish and Finnish, as well as Italian, English, and Slovenian.

For each query, the tool returned the 100 most relevant results, or fewer if it could not detect at least 100 (this was particularly the case for entity searches, as they are narrower). First, we captured how many results were already tagged with the topic that the keyword suggested and how many were tagged with other topics. As we worked on the assumption that all the material tagged by the archivists operating in the APE dashboard was correctly assigned, we trusted that all results already tagged with the topic in question were relevant. We checked instead how many results not tagged with the topic in question were indeed relevant. As it would have been impossible to test thoroughly by checking all the results provided, we checked up to the first 10 results for each of the other topics.18
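
The sampling rule can be made concrete with a short sketch that reproduces the example given in footnote 18. The function name `select_for_checking` and the document identifiers are hypothetical; only the rule itself (up to the first 10 results per other topic) comes from the text.

```python
from collections import defaultdict

def select_for_checking(results, query_topic, per_topic=10):
    """Pick the results to check by hand.

    results: list of (doc_id, tagged_topic) pairs in ranked order.
    Results already tagged with query_topic are trusted and skipped;
    for every other topic, the first per_topic results are kept.
    """
    by_topic = defaultdict(list)
    for doc_id, topic in results:
        if topic != query_topic:
            by_topic[topic].append(doc_id)
    return {topic: ids[:per_topic] for topic, ids in by_topic.items()}

# The example from footnote 18: 100 results for an "Economics" query.
counts = [("Economics", 36), ("World War I", 34),
          ("German Democratic Republic", 26), ("Notaries", 4)]
results = [(f"{topic}-{i}", topic) for topic, n in counts for i in range(n)]

to_check = select_for_checking(results, "Economics")
checked_total = sum(len(ids) for ids in to_check.values())
```

For this example the rule selects 10 + 10 + 4 = 24 results for manual checking.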

Because the research focussed on topic detection, a result was considered relevant if it was clearly and beyond doubt related to a topic, whether or not the result was also related to the specific search query. For example, when searching “Keynes” with the search interest of the topic “Economics,” a document was considered to be relevant when its description referred to Keynes, and furthermore to an economic context, such as a document tagged under the topic “First World War” that related to Keynes’ role in the Versailles peace conference. However, a document referring to Keynes attending a performance given by his wife, Lydia Lopokova, would have been considered not to be relevant to the topic of “Economics,” even though it was a match for the entity “Keynes.”

Other elements from the results that we captured were the language of the search results; the list of the 10 most relevant topical words suggested by the tool, based on the taxonomies; and how many of the suggested topical words were indeed pertinent to the topic under consideration.

In total, 153 keyword queries were conducted.19

7 RESULTS OF THE FIRST TESTING

The testing of the tool in this proof-of-concept phase aimed to (1) confirm that the tool does what it is expected to do, i.e., identify documents relevant to a specific topic, and (2) evaluate our approaches to the human input in the process of assigning topics and in the creation of the taxonomies. The results gathered should not be treated as conclusive, as the test dataset only represented a very small sample of the repository of Archives Portal Europe (25% of all documents tagged with a topic and 0.16% overall); however, they aim at identifying patterns, evaluating the performance of the tool, and establishing a workflow in view of the future scaling up of the project.
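
The two sample shares quoted above can be checked against the totals reported in Tables A and B of the appendix. The variable names are ours; the figures are taken from those tables.

```python
# Totals as reported in the appendix tables.
sample_size = 457_538             # Table B: documents in the test dataset
tagged_total = 1_792_908          # Table A: all documents tagged with a topic
all_descriptions = 282_110_269    # Table A: all archival descriptions (13 October 2020)

share_of_tagged = 100 * sample_size / tagged_total
share_of_all = 100 * sample_size / all_descriptions
```

Rounded, these come to about 25.5% of the tagged documents and 0.16% of all archival descriptions.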

7.1 Cross-lingual Topics Classification

An important function of the tool is the identification of documents related to a keyword and topic, independently of the language. Initial results show that the tool does perform this task, in spite of several hindrances. One of the problematic aspects of the dataset (which is, however, representative of the overall current portion of tagged documents in the portal) is that the majority of the tagged documents are either in French (37.5%) or in German (32.9%), each representing about a third of all documents, with the final third divided between Polish, Latvian, and Finnish. This meant that the tool was mainly trained on the basis of documents in French or German, which we addressed with the application of fastText word embeddings, and their alignment in a cross-lingual “semantic” space, to broaden the tool's functionalities to the other three languages represented, plus two additional ones that are usually part of such multilingual settings (we chose English and Italian, as both languages are strongly represented in Archives Portal Europe among the non-tagged documents). In terms of testing, we have followed the breakdown of languages when choosing our search terms but have aimed at bringing the percentages down, at least to a certain extent, by including further search terms from other languages. Despite these efforts, the result cap at 100 may have also lowered the chances of a document from the “underrepresented” languages being included in the list. Given these premises, it should not be surprising that the results of the searches were mostly in French or German (see Table E).

It is interesting, however, that there was a higher number of cases where the language of the search term differed from the predominant language of the search results rather than remaining in the same language context. This occurs more often in entity searches than in concept searches:

 | Language of the search term DIFFERS FROM the predominant language of search results | Language of the search term IS THE SAME AS the predominant language of search results
Overall | 56.8% | 43.2%
In concept searches | 51.9% | 48.1%
In entity searches | 62.3% | 37.7%

While more testing is needed, we hypothesise that this is due to archival descriptions largely re-using the administrative language of the records creators, which is neither “standard” nor “natural” and therefore presents additional challenges for NLP.

A second interesting observation is that language families play a role in which language the results shift to: English search terms more often return German results than French ones, while Italian search terms more often return French results than German ones. Finnish, however, which belongs to the North-Eurasian Uralic language family, generates search results in a larger variety of languages (there are instances in which a search in Finnish returns twice as many results in Italian as in Finnish or German).

A third important observation is that connecting documents via semantically similar entities and concepts made it possible to retrieve results in languages other than the official one of the country where the institutions are located, for example, archival descriptions in Italian held in French institutions, which can most likely be explained by the changing borders and administrations of several territories.

7.2 Most Relevant Topical Words

The findings with regard to the word lists, which the tool currently uses as a list of suggested terms that could also be of interest based on the search that has been conducted, provided mixed results that should be investigated further. The assumption is that this has less to do with the initial extraction of entities and the definition of the most frequent and the most distinctive entities for each topic, and more with the extension of these lists. This step mainly relied on the availability and engagement of the partners of the Archives Portal Europe network and of domain experts to provide additional entities to enhance these lists; while this has provided lists in many different languages, it has also led to the status quo that English, which is not represented as a language in the documents of the test dataset, makes up nearly 40% of the total 1,654 taxonomy entries. English, furthermore, is the language with the highest representation in the taxonomies for five out of the nine topics of the test dataset, followed by French and Portuguese, which have the highest representation in the taxonomies for three topics and one topic, respectively. German, however, only accounts for 5% of the taxonomy entries overall; it is ranked second for the topics of “First World War” and “German Democratic Republic,” for which German is also the predominant language represented in the documents, but contributes only between 0% and 2.3% of the taxonomy entries of the seven remaining topics:

Language | Representation in the dataset | Representation in the taxonomies
English | Not represented | Most represented for 5 topics
French | Most represented | Most represented for 3 topics
German | Second-most represented | Second-most represented for 2 topics
Portuguese | Not represented | Most represented for 1 topic

To better illustrate this with one specific example, below are two representations of the topic “First World War.” Figure 2 looks at the languages in relation to:

The topic itself—documents tagged with this topic are predominantly in German (blue), with some smaller part being in French (green);

The search term used—mainly English (grey), French (green), and German (blue), with some searches in Italian (red) and one in Finnish (yellow);

The most relevant topical words from the taxonomies—mostly in English (grey) and with a balanced distribution between French (green) and German (blue) for the remainder.

Fig. 2. Language representation in the topic “World War I” in the search terms used in this context and in the most relevant terms found.

While German and French are the languages of the documents and have been used as often as English as the language of the search term, the graph shows a picture where English dominates the topical words. However, Figure 3, which looks at the topics represented by the most relevant topical words, shows a more homogeneous picture for terms related to “First World War” (blue) with the second predominant topic being “German Democratic Republic” (green), fostered by the fact that this topic also has German as its main language. This again seems to speak positively to the cross-lingual topic classification that the tool provides, despite the improvement potential with regard to the taxonomies themselves.

Fig. 3. Context of most relevant words when searching with keywords for the topic “World War I.”

A second interesting observation with regard to the taxonomies and the most relevant topical words is that there seems to be an inverse relation between the percentage of relevant results in a query and the number of relevant topical words suggested. As shown in Table C, the higher the number of new relevant results detected by the tool, the lower the number of relevant topical words suggested and vice versa. While the picture across all nine topics of the test dataset is not entirely conclusive, this is an area to investigate further. It could especially be interesting to study:

(1)

if there is a certain correlation between the type of topic (general vs. specific, entity vs. concept, etc.) and these results: at the moment, the two topics that show the highest percentage of relevant taxonomy entries are both entity-related: “German Democratic Republic” as a geographic entity as well as a collection of organisational and personal entities and “Notaries” as a combination of a functional entity as well as a personal entity;

(2)

if this inverse relation continues to hold when enlarging the test dataset;

(3)

if this, again, depends on the way in which the taxonomies are created.
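
One way to quantify the inverse relation is a rank correlation over the nine rows of Table C. The sketch below is our own illustration, not an analysis from the project: it uses Spearman's coefficient in its simple rank-difference form, which assumes there are no tied values (true for these two columns), and the helper names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Percentages from Table C, in the table's topic order (Catholicism ... Slavery).
new_relevant_results = [31.48, 24.32, 6.73, 16.6, 0.00, 87.13, 7.72, 1.08, 3.51]
relevant_topic_words = [1.36, 6.67, 60.00, 6.00, 79.38, 51.11, 13.68, 78.33, 45.00]

rho = spearman_rho(new_relevant_results, relevant_topic_words)
```

On these nine rows the coefficient is about -0.7, which is consistent with an inverse relation; with so few data points it should be read as indicative only.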

7.3 Concept Searches and Entity Searches

The testing included a relatively even number of concept (78) and entity (75) searches. Entity searches had an average success rate (that is, the share of relevant new results) of 11.26%; concept searches had a slightly lower average success rate of 10.56%, but with much lower variance. However, the lack of Boolean operators and wildcards in the tool is currently affecting these results, as the tool cannot yet distinguish whether a multi-term search is meant as a combination of words or as a complete string. For example, when searching for the Treaty of Versailles in the context of “First World War,” many irrelevant results belonged to the topic “Notaries,” because they focussed on different types of treaties ratified by notaries, rather than on the specific peace treaty in question. The next phases of the project will include the implementation of Boolean operators and wildcards to allow for the combination of several keywords and to compensate for different spellings.

One unexpected result related to entity searches. The initial assumption was that when searching for an entity, the results offered by the tool would be the same regardless of the language. As shown in Table D, this was often not the case. One hypothesis to explain this is the discrepancies between Wikipedia pages in different languages (e.g., a person might have their life dates included in their Wikipedia name in one language, but not in the other, or might have academic or other honorary titles added in one case, but not in all). Throughout this proof-of-concept phase, we tried to address some of these discrepancies, but others may be playing an important role as well. Furthermore, different entities with the same or very similar names present in Wikipedia are not yet disambiguated by the tool. This is, for example, the case with searching for “Wilhelm von Preußen” (DE), which returned one more result than searches for “Wilhelm, German Crown Prince” (EN), for “Guillaume de Prusse” (FR), or for “Guglielmo di Prussia” (IT), because in German the tool currently also included results relating to his great-grandfather “Wilhelm I. (Deutsches Reich)”.

8 CONCLUSIONS

This first round of testing was aimed at checking the performance of the tool in its proof-of-concept phase, identifying areas of further investigation, and establishing a methodology for developing and testing the tool further. Our initial findings with regard to the tool's predominant function for the time being, the discovery of documents relevant to a topic, gave promising results, in spite of several limitations due to the application of automated topic detection methodologies to a corpus of data that is particularly vast, heterogeneous, and multilingual. The unevenness of the sample, both in terms of topic representation and of languages included—as well as the vastness of the possible keyword searches around each topic, which makes it hard to predict users’ behaviour—constitute important limitations and challenges for the objectives of the tool.

Still, a 10% success rate, particularly considering that it was evenly distributed between concepts and entities, should be considered a positive result. These results indeed show that the tool allowed us to identify relevant documents related to one topic well beyond the narrowness of direct keyword matching. Initial findings have also shown that good results can come from topics with relatively few annotated documents (e.g., “Catholicism” has only about 1,500 tagged documents, but a success rate of 31.5%), which opens the door to smaller-scale projects that extend existing topics and potentially create new ones, once the tool has been developed further through alpha and beta phases. This success rate also confirmed the feasibility of the project and the overall soundness of the tool architecture.

With the proof-of-concept confirmed, a closer look at the results allowed us to elaborate on the next steps of the project, which will continue to focus on the discovery of documents relevant to a topic of interest as a future alternative search approach offered to users of the Archives Portal Europe. To get a better grip on the aforementioned unevenness of the data, the first line of action for the next phase will be to enlarge the sample both in terms of topics under consideration and of available languages. Second, taxonomies will be improved by combining machine-based extraction and human-based input more interactively and more iteratively. Third, new functionalities will be added to the tool, starting with features that improve retrievability:

Boolean operators;

Wildcards;

Disambiguation of entities;

The inclusion of other data from the documents, e.g., dates, that might be useful in determining whether a search result is relevant or not.

Furthermore, we will include features that prepare for the full integration of the tool within the functionalities of the portal, which is envisaged as a final step during beta development in the future:

Designing a more user-friendly graphical user interface;

For each result, the provision of a link to the full description in Archives Portal Europe, so a user could check all details of a result to decide on its relevance.

Finally, the next stage of the project will start to reflect on the potential of a second user scenario for the tool, enabling content providers and users to flag a result in the tool as being part of a topic, when relevant.

In the longer term, and with a more research-oriented approach to the project, combining the tool with Linked Data approaches, such as adding URIs to the current literals used for the topics, may open up whole new lines of research based on human-computer interactions at the service of historical archives. Unlike datasets and corpora created in the digital sphere, historical archives’ catalogues are built through a heterogeneous sedimentation that has continued for many decades (and often many centuries): Automation and Artificial Intelligence in this environment are therefore particularly challenging and require strong human input.

At the same time, the vastness of this material imposes the usage of automation in a digital environment, not only to explore new ways of doing archival research and improve on the work of historians and archives-goers, but also to be able to perform the traditional functions of archival institutions in the enhanced but dispersive web-environment. In spite of the noise to sift through, Archives Portal Europe's automated topic detection tool demonstrates that even one extra document is useful for research in this multilingual environment when it constitutes something relevant to the researchers, who would not have had the possibility to find it otherwise.

A APPENDIX

Table A.
Topic | No. of tagged documents | Countries
 | 35,677 | France
Architecture | 78,145 | France; Germany
Armed forces | 23,068 | France
Arts | 4,093 | France
Buildings | 88,178 | France
Catholicism | 1,499 | France
Charity | 321 | France
Charters | 3,331 | France; Germany
Church records and registers | 2,056 | France
Churches | 721 | France
Colonialism | 1,130 | France
Communism | 433 | France
Concentration camp | 43,016 | France; Germany
Crime | 29,970 | France
Culture | 89,248 | France
Democracy | 29,829 | France; Germany
Early modern period | 58 | France
Economics | 144,157 | France; Germany
Education | 94,914 | France
European Union | 15,277 | France
First World War (1914–1918) | 57,445 | France; Germany
French Revolution (1789–1799) | 615 | France
GDR (German Democratic Republic) | 117,268 | Germany
GDR parties and trade unions | 23,029 | Germany
Genealogy | 43,792 | France; Poland; Latvia
Genealogy archives | 13,763 | France
Health | 53,966 | France
Heresy | 6 | France
Industrialisation | 56,793 | France
Justice | 91,334 | France
Lifestyle | 28,588 | France
Maps | 57,119 | France; Finland
Medical sciences | 21,637 | France
Medieval period | 3,447 | France
Monasteries | 204 | France
Municipal government | 27,088 | France
Music | 11,172 | France
Napoléon I, Emperor of the French, 1769–1821 | 6 | France
Napoléon III, Emperor of the French, 1808–1873 | 4,641 | France
National administration | 45,395 | France
Notaries | 35,487 | France; Poland
Photography | 149,697 | France
Politics | 41,764 | France
Population censuses | 629 | France
Poverty | 13,059 | France
Protestantism | 15 | France
Religion | 7,545 | France
Revolutions of 1848 | 6 | France
Royalty | 658 | France
Schools | 73,112 | France
Science | 94,465 | France
Second World War (1939–1945) | 32,169 | France
Slavery | 765 | France
Social history | 1,093 | France
Socialism | 15 | France
Statistics | 11 | France
Taxation | 30,621 | France
Trade unions | 21,980 | France
Transport | 97,417 | France
Universities | 11,682 | France
Wars (events) | 10,270 | France
Women | 6,390 | France
Total | 1,792,908 |
Total archival descriptions (on October 13, 2020) | 282,110,269 |

Table A. Topics in Archives Portal Europe as of October 2020

Table B.
Topic | No. of tagged documents | Country(ies) | No. of institutions
Catholicism | 1,499 | France | 1
Economics | 144,157 | France | 7
First World War (1914–1918) | 57,445 | France; Germany | 7
Genealogy | 43,792 | France; Poland; Latvia | 7
GDR (German Democratic Republic) | 117,268 | Germany | 1
Maps | 57,119 | France; Finland | 8
Napoléon I, Emperor of the French, 1769–1821 | 6 | France | 2
Notaries | 35,487 | France; Poland | 7
Slavery | 765 | France | 1
Total | 457,538 | |

Table B. Topics Selected for the First Round of Testing

Table C.
Topic | No. of tagged documents | Countries | No. of institutions | No. of results (total) | No. of results (checked) | New relevant results | Already tagged results | No. of topic words (total) | Relevant topic words
Catholicism | 1,499 | France | 1 | 1,674 | 540 | 31.48% | 1.79% | 220 | 1.36%
Economics | 144,157 | France | 7 | 1,656 | 584 | 24.32% | 18.60% | 180 | 6.67%
First World War (1914–1918) | 57,445 | France; Germany | 7 | 1,278 | 223 | 6.73% | 40.53% | 210 | 60.00%
Genealogy | 43,792 | France; Poland; Latvia | 7 | 1,000 | 285 | 16.6% | 1.90% | 100 | 6.00%
GDR (German Democratic Republic) | 117,268 | Germany | 1 | 1,226 | 63 | 0.00% | 81.40% | 160 | 79.38%
Maps | 57,119 | France; Finland | 8 | 705 | 101 | 87.13% | 70.64% | 90 | 51.11%
Napoléon I, Emperor of the French, 1769–1821 | 6 | France | 2 | 1,655 | 531 | 7.72% | 0.00% | 190 | 13.68%
Notaries | 35,487 | France; Poland | 7 | 903 | 93 | 1.08% | 70.10% | 120 | 78.33%
Slavery | 765 | France | 1 | 903 | 370 | 3.51% | 37.21% | 120 | 45.00%
TOTAL | 457,538 | | | | | | | |

Table C. Overview of the Results

Table D.
Topic: “Catholicism”
Number of keyword queries: 22
Searched for the following entities/concepts (translated to English): Solidarność, nicean, pope, Marian, Hyperdulia, Holy Inquisition witches
Number of total results: 1,674
Number of retrieved results already tagged as “Catholicism”: 30
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 170 out of 540 checked (31.5% of the sample)
Number of relevant topical words: 3 out of 220 (1.36%)
Number of times that entity search in different languages did not give the same results: 2 out of 4 (Solidarność; Holy Inquisition witches)

Topic: “Economics”
Number of keyword queries: 20
Searched for the following entities/concepts (translated to English): Keynes, Bank of France, Marxist, Spanish GDP
Number of total results: 1,656
Number of retrieved results already tagged as “Economics”: 308
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 142 out of 584 checked (24.3% of the sample)
Number of relevant topical words: 12 out of 180 (6.7%)
Number of times that entity search in different languages did not give the same results: 2 out of 3 (Keynes, Spanish GDP)

Topic: “First World War”
Number of keyword queries: 21
Searched for the following entities/concepts (translated to English): Great War, Liège, Triple Alliance, Wilhelm German Crown Prince, Treaty of Versailles, mustard gas
Number of total results: 1,278
Number of retrieved results already tagged as “First World War”: 518
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 15 out of 223 checked (6.7% of the sample)
Number of relevant topical words: 126 out of 210 (60%)
Number of times that entity search in different languages did not give the same results: 1 out of 2 (Wilhelm German Crown Prince)

Topic: “Genealogy”
Number of keyword queries: 10
Searched for the following entities/concepts (translated to English): Registry Office, family tree, father
Number of total results: 1,000
Number of retrieved results already tagged as “Genealogy”: 19
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 47 out of 285 checked (16.6% of the sample)
Number of relevant topical words: 6 out of 100 (6%)
Number of times that entity search in different languages did not give the same results: 1 out of 1 (Father)

Topic: “German Democratic Republic (GDR)”
Number of keyword queries: 17
Searched for the following entities/concepts (translated to English): Erich Honecker, Schabowski, Hohenschönhausen, Fall of the Berlin Wall, Stasi Records Agency
Number of total results: 1,226
Number of retrieved results already tagged as “German Democratic Republic (GDR)”: 998
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 0 out of 63 checked (0% of the sample)
Number of relevant topical words: 127 out of 160 (79.4%)
Number of times that entity search in different languages did not give the same results: 1 out of 3 (Hohenschönhausen)

Topic: “Maps”
Number of keyword queries: 17
Searched for the following entities/concepts (translated to English): Ptolemy, Gerardus Mercator, (only) Mercator, topographical map, map AND town
Number of total results: 705
Number of retrieved results already tagged as “Maps”: 498
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 88 out of 101 checked (87.1% of the sample)
Number of relevant topical words: 46 out of 90
Number of times that entity search in different languages did not give the same results: 2 out of 3 (Ptolemy and Mercator)

Topic: “Napoléon I, Emperor of the French, 1769–1821”
Number of keyword queries: 22
Searched for the following entities/concepts (translated to English): Napoleon, Napoleon and France, Napoleon Russia, Empress Joséphine Martinique, Saint Helena, Waterloo battle, Nouveau Régime, Bonapartian
Number of total results: 1,655
Number of retrieved results already tagged as “Napoléon I, Emperor of the French, 1769–1821”: 0
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 41 out of 531 checked (7.7% of the sample)
Number of relevant topical words: 26 out of 190 (13.7%)
Number of times that entity search in different languages did not give the same results: 2 out of 6 (Napoleon; Saint Helena)

Topic: “Notaries”
Number of keyword queries: 12
Searched for the following entities/concepts (translated to English): Rue Saint-Honoré, Notary, Notary AND testament, authentication
Number of total results: 903
Number of retrieved results already tagged as “Notaries”: 633
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 1 out of 93 checked (1.1% of the sample)
Number of relevant topical words: 94 out of 120
Number of times that entity search in different languages did not give the same results: 0 out of 1

Topic: “Slavery”
Number of keyword queries: 12
Searched for the following entities/concepts (translated to English): Spartacus, encomienda, slave, Slave traffic port
Number of total results: 903
Number of retrieved results already tagged as “Slavery”: 336
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics): 13 out of 370 checked (3.5% of the sample)
Number of relevant topical words: 54 out of 120 (45%)
Number of times that entity search in different languages did not give the same results: 0 out of 1

Results are shown as aggregated for each topic. For a complete overview of each single keyword search, please refer to Appendix A, available online <https://docs.google.com/spreadsheets/d/1MWXJkC6EQjPW8wtf9DSmWnlXorTGJMMMz-AWNDWVL1k/edit?usp=sharing>

Table D. Summary of the Keyword Searches by Topic

Table E.
Language of the search | No. of searches | % of overall searches | Language of the majority of search results | No. of times when language dominant in search results | Percentage
English | 17 | 11.11% | English | 0 | 0.00%
Finnish | 8 | 5.23% | Finnish | 2 | 1.31%
French | 49 | 32.03% | French | 66 | 43.14%
German | 48 | 31.37% | German | 66 | 43.14%
Italian | 14 | 9.15% | Italian | 4 | 2.61%
Polish | 13 | 8.50% | Polish | 0 | 0.00%
Slovenian | 4 | 2.61% | Slovenian | 0 | 0.00%
Total | 153 | | Total | 138 |
 | | | (search without results) | 15 | n/a

Table E. Results by Language of Search

Footnotes

  1. 1 For a general overview of archival practices, see Reference [39].

    Footnote
  2. 2 See McKemmish [31], Williams [46], Duranti and Franks [13], and Duchein [10].

    Footnote
  3. 3 See Lytle [28], Lytle and Dürr [29], and Duranti [12].

    Footnote
  4. 4 For a study on keyword-based research in information retrieval, see Williams [46] and Mann [30].

    Footnote
  5. 5 For examples of application to scientific publications on one topic, see Yau et al. [47] and Kee et al. [22]; for examples of application to social media (Twitter), see Lim et al. [26] and Hong and Davison [19].

    Footnote
  6. 6 See, for example, the early works of Doszkocs [9] and Greenberg [16]. For a survey of NER applied to historical documents, see Reference [14].

    Footnote
  7. 7 Please refer to the homepage www.archivesportaleurope.net for the latest figures.

    Footnote
  8. 8 For example, searching for “N?pol?* OR ‘Ναπολέων A’’ OR ‘” will provide many more results on Napoleon Bonaparte than a standard query for “Napoleon,” which only encompasses one form of spelling of this specific query.

    Footnote
  9. 9 https://archiveshub.jisc.ac.uk/

    Footnote
  10. 10 For example, France follows a series of vocabularies set up by the French Ministry of Culture [36].

    Footnote
  11. With regard to the languages used in the archival descriptions, it is important to note that, most of the time, the language of description is the official language (or one of the official languages) of the country where the institution holding the materials is located. This in turn often matches the language of the records themselves, but not necessarily: for example, a medieval charter from the Hungarian National Archives may be described in Hungarian but written in Latin. Furthermore, there are cases where not only the documents but also their archival descriptions are in a language other than the (currently) official language of the country. For example, the city archives of Nice, France, hold many archival descriptions in Italian, for historical reasons. The metadata standard currently used in APE only allows for the identification of the language of the actual document, not the language of the description; furthermore, this element is an optional metadata field, making it difficult to rely on the metadata for detecting a document's language. In future versions of the prototype, the language of each description will be automatically assessed to generate a new metadata field.
  12. A descriptive unit in this context is any unit of archival description that is treated as a potential result by the current search process in Archives Portal Europe. These can be descriptions of the actual records themselves or descriptions of higher levels, including the collection level. The tool used the Solr results in JSON format for each of these “documents,” where some major parts of the archival description are captured in individual fields (e.g., the title of the unit itself or of the upper hierarchical levels that this unit is a part of). However, other parts of the archival descriptions are only included in a placeholder field of the Solr index, capturing all additional metadata that might be part of the original EAD-XML file. This placeholder field is currently not part of the “document” as used by the tool.
  13. Word embeddings, pre-trained on Wikipedia, are available on GitHub at this link: <https://github.com/facebookresearch/MUSE>
  14. We obtained micro and macro F1-scores above 0.9; the F1-score is the harmonic mean of precision and recall. For details, see the documentation of the metrics in Scikit-learn, the library we adopted: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
  15. For the first experiment, we focused on French, German, and Polish. A Named Entity Recogniser (NER) is a tool that automatically identifies mentions of entities (such as people, locations, organisations) in a string of text.
  16. The Faiss library is available on GitHub at this link: <https://github.com/facebookresearch/faiss>
  17. We pre-process name variations by removing life dates for persons and other characteristics that are sometimes included in brackets.
  18. For example, if a search query related to “Economics” returned 100 results of which 36 were already tagged with “Economics,” 34 with “World War I,” 26 with “German Democratic Republic,” and 4 with “Notaries,” then we checked the first 10 results from the results list that were tagged as “World War I,” the first 10 results tagged as “German Democratic Republic,” and all the results tagged as “Notaries.”
  19. Appendix A is available online at this link: https://docs.google.com/spreadsheets/d/1MWXJkC6EQjPW8wtf9DSmWnlXorTGJMMMz-AWNDWVL1k/edit?usp=sharing
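
Footnote 14 refers to micro and macro F1-scores. As a minimal sketch of how the two averages differ, the following pure-Python example computes both for a single-label classifier. The function names and the toy topic labels are illustrative assumptions and are not taken from the project code; scikit-learn's `f1_score` with `average="micro"` or `average="macro"` computes the same quantities.

```python
from collections import Counter


def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    # Macro: compute F1 per label, then take the unweighted mean.
    per_label = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        per_label.append(harmonic_f1(precision, recall))
    macro = sum(per_label) / len(per_label)

    # Micro: pool the counts over all labels, then compute one F1.
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    micro_p = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    micro_r = total_tp / (total_tp + total_fn) if total_tp + total_fn else 0.0
    return harmonic_f1(micro_p, micro_r), macro


# Toy data (hypothetical labels, not results from the article).
truth = ["economics", "economics", "economics", "notaries"]
guess = ["economics", "economics", "notaries", "notaries"]
micro, macro = micro_macro_f1(truth, guess)
print(round(micro, 3), round(macro, 3))  # 0.75 0.733
```

Macro averaging weights every topic equally, so a rare topic with poor recall lowers it more than it lowers the micro average, which is dominated by frequent topics.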

REFERENCES

  [1] Archives Portal Europe Foundation. 2020. ArchivesPortalEuropeFoundation/Topic-Detection. Retrieved from https://github.com/ArchivesPortalEuropeFoundation/Topic-Detection
  [2] Archives Portal Europe Foundation. 2020. How to use topics - Archives Portal Europe Wiki. Retrieved from http://wiki.archivesportaleurope.net/index.php/How_to_use_topics
  [3] Bizer Christian, Heath Tom, and Berners-Lee Tim. 2011. Linked data: The story so far. In Semantic Services, Interoperability and Web Applications: Emerging Concepts, Sheth Amit (Ed.). IGI Global, Hershey, PA, 205–227.
  [4] Blei David M., Ng Andrew Y., and Jordan Michael I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (Mar. 2003), 993–1022.
  [5] Blodgett Su Lin, Barocas Solon, Daumé Hal, and Wallach Hanna. 2020. Language (Technology) Is Power: A Critical Survey of “Bias” in NLP. Retrieved from http://arxiv.org/abs/2005.14050
  [6] Chang Jonathan, Gerrish Sean, Wang Chong, Boyd-Graber Jordan L., and Blei David M. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems Vol. 22, Bengio Y., Schuurmans D., Lafferty J. D., Williams C. K. I., and Culotta A. (Eds.). Curran Associates, Inc., 288–296. Retrieved from http://papers.nips.cc/paper/3700-reading-tea-leaves-how-humans-interpret-topic-models.pdf
  [7] Conneau Alexis, Lample Guillaume, Ranzato Marc'Aurelio, Denoyer Ludovic, and Jégou Hervé. 2018. Word translation without parallel data. Retrieved from http://arxiv.org/abs/1710.04087
  [8] Cook Terry. 1992. The concept of the archival fonds in the post-custodial era: Theory, problems and solutions. Archivaria 35 (Jan. 1992). Retrieved from https://archivaria.ca/index.php/archivaria/article/view/11882
  [9] Doszkocs Tamas E. 1986. Natural language processing in information retrieval. J. Am. Societ. Inf. Sci. 37, 4 (1986), 191–196.
  [10] Duchein Michel. 1983. Theoretical principles and practical problems of Respect des fonds in archival science. Archivaria (Jan. 1983), 64–82.
  [11] Duranti Luciana. 1993. The archival body of knowledge: Archival theory, method, and practice, and graduate and continuing education. J. Educ. Libr. Inf. Sci. 34, 1 (1993), 8–24.
  [12] Duranti Luciana. 1999. Concepts and principles for the management of electronic records, or records management theory is archival diplomatics. Records Manag. J. 9, 3 (Jan. 1999), 149–171.
  [13] Duranti Luciana and Franks Patricia C. (Eds.). 2015. Encyclopedia of Archival Science. Rowman & Littlefield Publishers, Lanham, MD.
  [14] Ehrmann Maud, Hamdi Ahmed, Linhares Pontes Elvys, Romanello Matteo, and Doucet Antoine. 2021. Named entity recognition and classification on historical documents: A survey. Retrieved from http://arxiv.org/abs/2109.11406
  [15] Glavaš Goran, Nanni Federico, and Ponzetto Simone Paolo. 2017. Cross-lingual classification of topics in political texts. In Proceedings of the 2nd Workshop on NLP and Computational Social Science. Association for Computational Linguistics, 42–46.
  [16] Greenberg Jane. 1998. The applicability of natural language processing (NLP) to archival properties and objectives. Am. Archiv. 61, 2 (Jan. 1998), 400–425.
  [17] Hagedoorn B., Iakovleva K., and Tatsi I. 2019. Data Science Contextualization for Storytelling and Creative Reuse with Europeana 1914–1918. Europeana Research Grants Final Report. University of Groningen.
  [18] Hamilton William L., Ying Rex, and Leskovec Jure. 2018. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull. (Apr. 2018). Retrieved from http://arxiv.org/abs/1709.05584
  [19] Hong Liangjie and Davison Brian D. 2010. Empirical study of topic modeling in Twitter. In Proceedings of the 1st Workshop on Social Media Analytics (SOMA’10). Association for Computing Machinery, New York, NY, 80–88.
  [20] Jain Nitisha and Krestel Ralf. 2019. Who is Mona L.? Identifying mentions of artworks in historical archives. In Digital Libraries for Open Knowledge. Springer International Publishing, Cham, 115–122.
  [21] Joachims Thorsten. 2002. Learning to Classify Text Using Support Vector Machines. Springer US.
  [22] Kee Ying Hwa, Li Chunxiao, Kong Leng Chee, Tang Crystal Jieyi, and Chuang Kuo-Liang. 2019. Scoping review of mindfulness research: A topic modelling approach. Mindfulness 10, 8 (Aug. 2019), 1474–1488.
  [23] Kim Yoon. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, 1746–1751.
  [24] Lauscher Anne, Fabo Pablo Ruiz, Nanni Federico, and Ponzetto Simone Paolo. 2016. Entities as topic labels: Combining entity linking and labeled LDA to improve topic interpretability and evaluability. IJCoL - Italian Journal of Computational Linguistics 2, 2 (2016), 67–88.
  [25] Le Quoc V. and Mikolov Tomas. 2014. Distributed Representations of Sentences and Documents. Retrieved from http://arxiv.org/abs/1405.4053
  [26] Lim Kwan Hui, Karunasekera Shanika, and Harwood Aaron. 2017. ClusTop: A clustering-based topic modelling algorithm for Twitter using word networks. In Proceedings of the IEEE International Conference on Big Data (Big Data’17). 2009–2018.
  [27] López Cuadrado Ana María. 2018. Archives Portal Europe: Los trabajos de normalización archivística en el ámbito europeo y su influencia en el acceso e intercambio de información. TRIA 22 (2018), 49–65.
  [28] Lytle Richard H. 1980. Intellectual access to archives: I. Provenance and content indexing methods of subject retrieval. Am. Archiv. 43, 1 (1980), 64–75.
  [29] Lytle Richard H. and Theodore Dürr W. 1980. Intellectual access to archives: II. Report of an experiment comparing provenance and content indexing methods of subject retrieval. Am. Archiv. 43, 2 (1980), 191–207.
  [30] Mann Thomas. 2008. Will Google's keyword searching eliminate the need for LC cataloging and classification? J. Libr. Metad. 8, 2 (June 2008), 159–168.
  [31] McKemmish Sue, Piggott Michael, Reed Barbara, and Upward Frank (Eds.). 2005. Archives: Recordkeeping in Society. National Library of Australia, Wagga Wagga.
  [32] Meeks E. and Weingart S. B. 2012. The digital humanities contribution to topic modeling. J. Digit. Human. 2, 1 (2012), 1–6.
  [33] Merz Nicolas, Regel Sven, and Lewandowski Jirka. 2016. The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis. Res. Polit. 3, 2 (Apr. 2016), 2053168016643346.
  [34] Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado Greg S., and Dean Jeff. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems Vol. 26, Burges C. J. C., Bottou L., Welling M., Ghahramani Z., and Weinberger K. Q. (Eds.). Curran Associates, Inc., 3111–3119. Retrieved from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
  [35] Mikolov Tomas, Wen-tau Yih, and Zweig Geoffrey. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 746–751. Retrieved from https://www.aclweb.org/anthology/N13-1090
  [36] Ministère de la Culture et de la Communication (France). 2024. Les vocabulaires du Ministère de la Culture et de la Communication. Retrieved February 5, 2024 from http://data.culture.fr/thesaurus/
  [37] Muller Samuel, Feith J. A., and Fruin R. 1940. Manual for the Arrangement and Description of Archives (Translation of the 2nd ed.). The Society of American Archivists, Chicago, IL. Retrieved from http://hdl.handle.net/2027/mdp.39015057022447
  [38] Musso Marta and Arnold Kerstin. 2020. An archival repository of archival repositories: Integrating metadata in Archives Portal Europe. Moderna Arhivistika III, 1 (2020), 120–139.
  [39] Nanni Federico, Kümper H., and Ponzetto S. P. 2016. Semi-supervised textual analysis and historical research helping each other: Some thoughts and observations. International Journal of Humanities and Arts Computing 10, 1 (2016), 63–77.
  [40] Owens Trevor. 2012. Discovery and Justification are Different: Notes on Science-ing the Humanities. Retrieved from http://www.trevorowens.org/2012/11/discovery-and-justification-are-different-notes-on-sciencing-the-humanities/
  [41] Randby Teddy and Marciano Richard. 2020. Digital curation and machine learning experimentation in archives. In Proceedings of the IEEE International Conference on Big Data (Big Data’20). IEEE, 1904–1913.
  [42] The National Archives (UK). 2016. Archive Principles and Practice: An Introduction to Archives for Non-archivists. Retrieved from https://www.nationalarchives.gov.uk/documents/archives/archive-principles-and-practice-an-introduction-to-archives-for-non-archivists.pdf
  [43] UK Archival Thesaurus. 2020. Welcome to UKAT. Retrieved October 29, 2020 from https://ukat.aim25.com/
  [44] UNESCO. 2020. UNESCO Thesaurus. Retrieved October 29, 2020 from http://vocabularies.unesco.org/browser/thesaurus/en/
  [45] Vulic Ivan and Moens Marie-Francine. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th Annual ACM SIGIR Conference on Research and Development in Information Retrieval - Full Papers (SIGIR’15). ACM, New York, NY, 363–372.
  [46] Williams Caroline. 2006. Managing Archives: Foundations, Principles and Practice. Chandos Publishing.
  [47] Yau Chyi-Kwei, Porter Alan, Newman Nils, and Suominen Arho. 2014. Clustering scientific documents with topic modeling. Scientometrics 100, 3 (Sep. 2014), 767–786.


          • Published in

            Journal on Computing and Cultural Heritage, Volume 17, Issue 2
            June 2024
            355 pages
            ISSN: 1556-4673
            EISSN: 1556-4711
            DOI: 10.1145/3613557

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 26 March 2024
            • Online AM: 17 January 2024
            • Accepted: 25 October 2021
            • Revised: 8 October 2021
            • Received: 1 November 2020