What Is in a <unittitle>? Cross-lingual Topic Detection & Information Retrieval in Archives Portal Europe
Abstract
1 Introduction
2 Natural Language Processing and Archival Research
3 An Archival Catalogue of Archival Catalogues: Searching in Archives Portal Europe
4 Topic Association in Archives Portal Europe

5 Topic Detection: Building A New Tool
5.1 Cross-lingual Topic Classification
5.2 Topic Taxonomies Generation
5.3 Concept Search
5.4 Entity Search
6 Testing the Tool
7 Results of the First Testing
7.1 Cross-lingual Topics Classification
Searches | Language of the search term DIFFERS FROM the predominant language of search results | Language of the search term IS THE SAME AS the predominant language of search results |
---|---|---|
Overall | 56.8% | 43.2% |
In concept searches | 51.9% | 48.1% |
In entity searches | 62.3% | 37.7% |
7.2 Most Relevant Topical Words
Language | Representation in the dataset | Representation in the taxonomies |
---|---|---|
English | Not represented | Most represented for 5 topics |
French | Most represented | Most represented for 3 topics |
German | Second-most represented | Second-most represented for 2 topics |
Portuguese | Not represented | Most represented for 1 topic |


7.3 Concept Searches and Entity Searches
8 Conclusions
A Appendix
Topic | No. of tagged documents | Countries |
---|---|---|
[…] | 35,677 | France |
Architecture | 78,145 | France; Germany |
Armed forces | 23,068 | France |
Arts | 4,093 | France |
Buildings | 88,178 | France |
Catholicism | 1,499 | France |
Charity | 321 | France |
Charters | 3,331 | France; Germany |
Church records and registers | 2,056 | France |
Churches | 721 | France |
Colonialism | 1,130 | France |
Communism | 433 | France |
Concentration camp | 43,016 | France; Germany |
Crime | 29,970 | France |
Culture | 89,248 | France |
Democracy | 29,829 | France; Germany |
Early modern period | 58 | France |
Economics | 144,157 | France; Germany |
Education | 94,914 | France |
European Union | 15,277 | France |
First World War (1914–1918) | 57,445 | France; Germany |
French Revolution (1789–1799) | 615 | France |
GDR (German Democratic Republic) | 117,268 | Germany |
GDR parties and trade unions | 23,029 | Germany |
Genealogy | 43,792 | France; Poland; Latvia |
Genealogy archives | 13,763 | France |
Health | 53,966 | France |
Heresy | 6 | France |
Industrialisation | 56,793 | France |
Justice | 91,334 | France |
Lifestyle | 28,588 | France |
Maps | 57,119 | France; Finland |
Medical sciences | 21,637 | France |
Medieval period | 3,447 | France |
Monasteries | 204 | France |
Municipal government | 27,088 | France |
Music | 11,172 | France |
Napoléon I, Emperor of the French, 1769–1821 | 6 | France |
Napoléon III, Emperor of the French, 1808–1873 | 4,641 | France |
National administration | 45,395 | France |
Notaries | 35,487 | France; Poland |
Photography | 149,697 | France |
Politics | 41,764 | France |
Population censuses | 629 | France |
Poverty | 13,059 | France |
Protestantism | 15 | France |
Religion | 7,545 | France |
Revolutions of 1848 | 6 | France |
Royalty | 658 | France |
Schools | 73,112 | France |
Science | 94,465 | France |
Second World War (1939–1945) | 32,169 | France |
Slavery | 765 | France |
Social history | 1,093 | France |
Socialism | 15 | France |
Statistics | 11 | France |
Taxation | 30,621 | France |
Trade unions | 21,980 | France |
Transport | 97,417 | France |
Universities | 11,682 | France |
Wars (events) | 10,270 | France |
Women | 6,390 | France |
Total | 1,792,908 | |
Total archival descriptions (on October 13, 2020) | 282,110,269 |
Topic | No. of tagged documents | Countries | No. of institutions |
---|---|---|---|
Catholicism | 1,499 | France | 1 |
Economics | 144,157 | France | 7 |
First World War (1914–1918) | 57,445 | France; Germany | 7 |
Genealogy | 43,792 | France; Poland; Latvia | 7 |
GDR (German Democratic Republic) | 117,268 | Germany | 1 |
Maps | 57,119 | France; Finland | 8 |
Napoléon I, Emperor of the French, 1769–1821 | 6 | France | 2 |
Notaries | 35,487 | France; Poland | 7 |
Slavery | 765 | France | 1 |
Total | 457,538 |
Topic | No. of tagged documents | Countries | No. of institutions | No. of results (total) | No. of results (checked) | New relevant results | Already tagged results | No. of topic words (total) | Relevant topic words |
---|---|---|---|---|---|---|---|---|---|
Catholicism | 1,499 | France | 1 | 1,674 | 540 | 31.48% | 1.79% | 220 | 1.36% |
Economics | 144,157 | France | 7 | 1,656 | 584 | 24.32% | 18.60% | 180 | 6.67% |
First World War (1914–1918) | 57,445 | France; Germany | 7 | 1,278 | 223 | 6.73% | 40.53% | 210 | 60.00% |
Genealogy | 43,792 | France; Poland; Latvia | 7 | 1,000 | 285 | 16.6% | 1.90% | 100 | 6.00% |
GDR (German Democratic Republic) | 117,268 | Germany | 1 | 1,226 | 63 | 0.00% | 81.40% | 160 | 79.38% |
Maps | 57,119 | France; Finland | 8 | 705 | 101 | 87.13% | 70.64% | 90 | 51.11% |
Napoléon I, Emperor of the French, 1769–1821 | 6 | France | 2 | 1,655 | 531 | 7.72% | 0.00% | 190 | 13.68% |
Notaries | 35,487 | France; Poland | 7 | 903 | 93 | 1.08% | 70.10% | 120 | 78.33% |
Slavery | 765 | France | 1 | 903 | 370 | 3.51% | 37.21% | 120 | 45.00% |
TOTAL | 457,538 |
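The percentage columns in the table above appear to be simple ratios of the counts reported in the per-topic tables that follow. The sketch below reproduces them for two topics. The choice of denominators (checked sample, total results, total topic words) is an assumption inferred from the reported figures, and the names `TopicCounts`, `pct`, and `summarise` are hypothetical helpers, not part of the tool described in the paper.

```python
from dataclasses import dataclass


@dataclass
class TopicCounts:
    total_results: int   # "No. of results (total)"
    checked: int         # "No. of results (checked)"
    new_relevant: int    # relevant results found in the checked sample
    already_tagged: int  # retrieved results already tagged with the topic
    topic_words: int     # "No. of topic words (total)"
    relevant_words: int  # topic words judged relevant


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2) if whole else 0.0


def summarise(c: TopicCounts) -> dict:
    # Assumed denominators: checked sample, total results, total topic words.
    return {
        "new_relevant_pct": pct(c.new_relevant, c.checked),
        "already_tagged_pct": pct(c.already_tagged, c.total_results),
        "relevant_words_pct": pct(c.relevant_words, c.topic_words),
    }


# Counts copied from the per-topic tables in this appendix.
catholicism = TopicCounts(1674, 540, 170, 30, 220, 3)
maps = TopicCounts(705, 101, 88, 498, 90, 46)

print("Catholicism:", summarise(catholicism))
print("Maps:", summarise(maps))
```

Under these assumptions the sketch yields 31.48%, 1.79%, and 1.36% for "Catholicism" and 87.13%, 70.64%, and 51.11% for "Maps", matching the corresponding table rows.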
Topic: “Catholicism” | |
---|---|
Number of keyword queries | 22 |
Searched for the following entities/concepts (translated to English) | Solidarność, nicean, pope, Marian, Hyperdulia, Holy Inquisition witches |
Number of total results | 1,674 |
Number of retrieved results already tagged as “Catholicism” | 30 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 170 out of 540 checked (31.5% of the sample) |
Number of relevant topical words | 3 out of 220 (1.36%) |
Number of times that entity search in different languages did not give the same results | 2 out of 4 (Solidarność; Holy Inquisition witches) |
Topic: “Economics” | |
Number of keyword queries | 20 |
Searched for the following entities/concepts (translated to English) | Keynes, Bank of France, Marxist, Spanish GDP |
Number of total results | 1,656 |
Number of retrieved results already tagged as “Economics” | 308 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 142 out of 584 checked (24.3% of the sample) |
Number of relevant topical words | 12 out of 180 (6.7%) |
Number of times that entity search in different languages did not give the same results | 2 out of 3 (Keynes, Spanish GDP) |
Topic: “First World War” | |
Number of keyword queries | 21 |
Searched for the following entities/concepts (translated to English) | Great War, Liège, Triple Alliance, Wilhelm German Crown Prince, Treaty of Versailles, mustard gas |
Number of total results | 1,278 |
Number of retrieved results already tagged as “First World War” | 518 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 15 out of 223 checked (6.7% of the sample) |
Number of relevant topical words | 126 out of 210 (60%) |
Number of times that entity search in different languages did not give the same results | 1 out of 2 (Wilhelm German Crown Prince) |
Topic: “Genealogy” | |
Number of keyword queries | 10 |
Searched for the following entities/concepts (translated to English) | Registry Office, family tree, father |
Number of total results | 1,000 |
Number of retrieved results already tagged as “Genealogy” | 19 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 47 out of 285 checked (16.6% of the sample) |
Number of relevant topical words | 6 out of 100 (6%) |
Number of times that entity search in different languages did not give the same results | 1 out of 1 (Father) |
Topic: “German Democratic Republic (GDR)” | |
Number of keyword queries | 17 |
Searched for the following entities/concepts (translated to English) | Erich Honecker, Schabowski, Hohenschönhausen, Fall of the Berlin Wall, Stasi Records Agency |
Number of total results | 1,226 |
Number of retrieved results already tagged as “German Democratic Republic (GDR)” | 998 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 0 out of 63 checked (0% of the sample) |
Number of relevant topical words | 127 out of 160 (79.4%) |
Number of times that entity search in different languages did not give the same results | 1 out of 3 (Hohenschönhausen) |
Topic: “Maps” | |
Number of keyword queries | 17 |
Searched for the following entities/concepts (translated to English) | Ptolemy, Gerardus Mercator, (only) Mercator, topographical map, map AND town |
Number of total results | 705 |
Number of retrieved results already tagged as “Maps” | 498 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 88 out of 101 checked (87.1% of the sample) |
Number of relevant topical words | 46 out of 90 (51.1%) |
Number of times that entity search in different languages did not give the same results | 2 out of 3 (Ptolemy and Mercator) |
Topic: “Napoléon I, Emperor of the French, 1769–1821” | |
Number of keyword queries | 22 |
Searched for the following entities/concepts (translated to English) | Napoleon, Napoleon and France, Napoleon Russia, Empress Joséphine Martinique, Saint Helena, Waterloo battle, Nouveau Régime, Bonapartian |
Number of total results | 1,655 |
Number of retrieved results already tagged as “Napoléon I, Emperor of the French, 1769–1821” | 0 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 41 out of 531 checked (7.7% of the sample) |
Number of relevant topical words | 26 out of 190 (13.7%) |
Number of times that entity search in different languages did not give the same results | 2 out of 6 (Napoleon; Saint Helena) |
Topic: “Notaries” | |
Number of keyword queries | 12 |
Searched for the following entities/concepts (translated to English) | Rue Saint-Honoré, Notary, Notary AND testament, authentication |
Number of total results | 903 |
Number of retrieved results already tagged as “Notaries” | 633 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 1 out of 93 checked (1.1% of the sample) |
Number of relevant topical words | 94 out of 120 (78.3%) |
Number of times that entity search in different languages did not give the same results | 0 out of 1 |
Topic: “Slavery” | |
Number of keyword queries | 12 |
Searched for the following entities/concepts (translated to English) | Spartacus, encomienda, slave, Slave traffic port |
Number of total results | 903 |
Number of retrieved results already tagged as “Slavery” | 336 |
Number of new relevant results in the checked sample (that is, relevant to the topic but tagged under other topics) | 13 out of 370 checked (3.5% of the sample) |
Number of relevant topical words | 54 out of 120 (45%) |
Number of times that entity search in different languages did not give the same results | 0 out of 1 |
Results are shown aggregated by topic. For a complete overview of each individual keyword search, please refer to Appendix A, available online: <https://docs.google.com/spreadsheets/d/1MWXJkC6EQjPW8wtf9DSmWnlXorTGJMMMz-AWNDWVL1k/edit?usp=sharing>
Language of the search | No. of searches | % of overall searches | Language of the majority of search results | No. of times when language dominant in search results | Percentage |
---|---|---|---|---|---|
English | 17 | 11.11% | English | 0 | 0.00% |
Finnish | 8 | 5.23% | Finnish | 2 | 1.31% |
French | 49 | 32.03% | French | 66 | 43.14% |
German | 48 | 31.37% | German | 66 | 43.14% |
Italian | 14 | 9.15% | Italian | 4 | 2.61% |
Polish | 13 | 8.50% | Polish | 0 | 0.00% |
Slovenian | 4 | 2.61% | Slovenian | 0 | 0.00% |
Total | 153 | | Total | 138 | |
| | | | (search without results) | 15 | n/a |
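The percentages in the language table above can be recomputed from its counts. The sketch below assumes that both percentage columns use the total number of searches (153) as the denominator, which is consistent with the reported values. The variable and function names are illustrative only.

```python
# Counts copied from the language table above.
searches = {
    "English": 17, "Finnish": 8, "French": 49, "German": 48,
    "Italian": 14, "Polish": 13, "Slovenian": 4,
}
dominant = {
    "English": 0, "Finnish": 2, "French": 66, "German": 66,
    "Italian": 4, "Polish": 0, "Slovenian": 0,
}
no_results = 15  # searches that returned no results

total_searches = sum(searches.values())
total_dominant = sum(dominant.values())


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, to two decimals."""
    return round(100 * count / total, 2)


# Share of all searches issued in each language.
search_share = {lang: share(n, total_searches) for lang, n in searches.items()}
# Share of all searches whose results were dominated by each language
# (assumption: the denominator is again the total number of searches).
dominant_share = {lang: share(n, total_searches) for lang, n in dominant.items()}

for lang in searches:
    print(f"{lang}: searched {search_share[lang]}%, dominant {dominant_share[lang]}%")
```

Read this way, searches with a dominant result language (138) plus searches without results (15) account for all 153 searches. French and German each dominate the results of 66 searches, although only 49 and 48 searches were issued in those languages.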
Footnotes

References
Publisher
Association for Computing Machinery
New York, NY, United States