1 Introduction

In recent years, more and more data in various domains has become publicly available free of charge. The usefulness of such data varies widely and depends on its structure and on whether a standardized representation exists. In the legal domain, information is usually provided in legal documents such as laws or court decisions, which contain the respective information as well as links to related documents. Laws link to each other (e.g. exceptions defined in another law), and court decisions link to laws and to previous court decisions that were taken into account in the particular case. All of these links are well suited for representation as linked data in RDF. Although the information is available, looking up legal information can be difficult for someone interested in the legal situation of specific circumstances, or for a legal professional investigating a particular case. The desired information is spread over various data sources, which follow different access and pricing policies.

Information provided by legal information systems (LIS) is often accessed through simple keyword-based search interfaces and presented as a plain list of hits for the given search terms, possibly enhanced with meta information about the document (law, court decision,...) to allow a first assessment of whether a result might be relevant. This manual process of information retrieval is very time consuming, and when the chosen keywords are wrong or suboptimal, the number of results may be overwhelming.

Research in the area of legal semantics shows that this is an active topic. We approach it from both an information systems and a legal perspective, the latter owing to the author’s legal background. Previous research is typically tailored to a particular subdomain of the law or to a particular country [23, 33, 34]. Because specialized laws often reference other laws, it is appropriate to focus on a whole jurisdiction, in our case Austria. A legal information system that provides related information in a common knowledge graph enhanced with semantics would benefit legal and non-legal professionals when searching for information, preparing arguments at court or trying to understand the evolution of legislation over time.

The remainder of this research proposal is structured as follows: In Sect. 2 we will describe the state of the art and related work. The problem statement and contributions follow in Sect. 3. The methodology and approach to this problem are described in Sect. 4. First results are shown in Sect. 5 and the evaluation plan is outlined in Sect. 6. Section 7 concludes the paper.

2 State of the Art

Applying semantics in the legal domain is not new; it was a hot topic from around the turn of the millennium until 2008 (e.g. [1, 5, 7, 9, 15, 18, 28, 29]). However, given the advances in information technology and semantic technologies since then, we think that the topic is still highly relevant and deserves renewed attention.

Representing legal information as natural language text is not optimal; therefore, other forms of representation have been proposed, for instance legal ontologies, which can be seen as explicit specifications of conceptualizations [19]. Regarding the creation of legal ontologies, Gangemi investigated ontology design patterns that are typical for the legal domain and gives some examples [18]. For exchanging information between legal knowledge systems, formats such as the Legal Knowledge Interchange Format (LKIF) [21] or LegalRuleML [2] have been suggested. Using the Resource Description Framework (RDF) as a means to represent legal information has been investigated by Ebenhoch, who describes the challenges of legal resource description and points out that the key approach to enriching legal data is enhancing it with metadata [15]. Saias et al. describe the problem of missing semantics in legal information retrieval systems and propose an ontology, based on the Portuguese legal system, to enrich legal data with semantics [28]. Winkels et al. describe the need for semantics in a legal context from the practical point of view of the Dutch Tax and Customs Administration, which has to deal with legal information from various sources and in various formats, and developed a parser to automatically detect the identity of sources and references to other legal documents [32]. RDF is also used by Rodríguez-Doncel et al. to describe a particular subdomain of law, namely to express software licenses [26].

A summary of existing legal ontologies is provided by Breuker et al. [11], listing 23 ontologies and categorizing them by application (general language for expressing legal knowledge, information retrieval,...), type (knowledge representation), roles (understanding a domain, searching,...), character (general vs domain-specific), ontology construction (manual, automatic,...) and language.

Legal data not only needs to be stored in an appropriate way but also needs to be searchable. Graphs are a very suitable way to represent relations between data, which has already been investigated in the context of semantic technologies [10]. Furthermore, graphs allow the application of well-known and well-researched graph algorithms for search and traversal. Mimouni et al. present a solution for querying legal documents as a graph, based not only on their intertextual relationships but also on content descriptors [25].

Recommender systems provide the user with related information based on particular metrics, for instance similarity. They can also be used in the legal domain to show which information is related or similar to the information currently displayed. Drumond et al. describe the requirements and architectural design for such a system [14]. Zeleznikow et al. apply game theory from the economics domain to Australian family law in order to provide negotiation support, as litigation is usually a zero-sum game [34]. A legal recommender system has also been developed by Winkels et al. for Dutch case law (footnote 1); it provides the user with related information for the searched case [33].

The related work clearly shows that research is being done in this field. Because legal systems vary from country to country, and also with the prevalent type of legal system, there is no common ontology or recommender system available. Research is tailored to the specifics of a particular legal area, a particular legal system or both. Nonetheless, whereas the axiomatization of laws and norms has received considerable attention, case law and a semantic, graph-based representation of court decisions, cases and their links have not yet been tackled systematically. We believe that a systematic approach to filling this gap could complement the aforementioned efforts to enhance law with semantics and enable new research directions.

3 Problem Statement and Contributions

In Austria, legal information is provided free of charge by the legal information system (RIS) (footnote 2), which is operated by the Austrian Federal Chancellery and contains information about legislation, law gazettes and case law, the latter limited to decisions by the respective supreme courts (Supreme, Constitutional, Administrative, etc.). In addition to RIS, information about the law, comments on decisions and additional information contained in legal commentaries are offered by some companies on a paid subscription basis. These platforms have in common that they allow searching for keywords, specific laws or court decisions and, depending on the data provider, may also offer some related information. However, searching and assessing the results takes a long time and requires a lot of manual browsing and reading before a legal professional can conclude whether or not a particular search result is relevant for a specific case.

Although legal professionals usually know the law and the most important decisions, in non-trivial cases the look-up of additional information is essential. Although the RIS can be accessed by everybody with internet access, it is mainly used by legal professionals. For them, time matters because it generates costs for their clients, so the information retrieval process should be kept as short as possible without a negative impact on the quality of the outcome. Therefore, a software-supported search and assessment system would be beneficial.

Figure 1 shows a generic use case. All information is contained in a LIS and can be queried. The search results are marked as applicable/positive or contradicting/negative for a case. Furthermore, clustering the results would enhance their explanatory power, shorten the search process and allow the user to focus on the matters of fact.

Fig. 1. Interlinked legal documents, laws and decisions, contributing to a legal case.

To outline the problem with an example: A client of a legal professional had a car accident and is involved in legal proceedings. Searching the RIS for judgments containing the term “Auto” (car) returns results from several unrelated application areas. The results contain court decisions dealing with: (i) the auto-completion function of search engines, (ii) the observation of a suspect with a GPS unit attached to his car, (iii) several cases dealing with the assignment of rights and defects liability involving a car, (iv) court judgments having an accident as a fact of the case, and many others. This example illustrates the problem for a specific case; of course, refining the search terms would be the first approach to obtaining a more fine-grained result. Enhancing the results with semantics and relations to the search terms would be very beneficial for all users of such legal information systems: classifying the results would save a lot of time and facilitate argumentation before a court. This requires contextualised, unambiguous entity recognition and appropriate linkage of similar cases, which is not present in the current LIS.

The legal domain has some special properties; for instance, the focus is not just on information retrieval but also on question answering [4], which implies having the semantics of the texts available [28]. Dealing with legal documents requires a transformation from natural language text into a more structured format. In the last decade, several efforts were made to represent legal documents more formally, focusing on XML [7, 8, 9, 30] and also on RDF [1, 15, 24]. Moreover, several research projects have been carried out, for instance on legal documents for ship certification (CLIME) [31] or tax law (E-POWER) [6].
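
To make the idea of a more structured format concrete, the following minimal Python sketch encodes a court decision and its links as subject-predicate-object triples and serializes them in N-Triples syntax. It is an illustration under stated assumptions only: the `example.org` namespace, the predicate names (`appliesLaw`, `cites`, `keyword`) and the document identifiers are hypothetical and are not taken from RIS or from any of the cited ontologies.

```python
# Hypothetical namespace; not an identifier scheme used by RIS.
BASE = "http://example.org/legal/"


class Literal(str):
    """Marks a plain string value (as opposed to a URI)."""


def uri(local_name):
    """Build a full URI for a hypothetical local name."""
    return BASE + local_name


# Each triple is (subject, predicate, object).
triples = [
    (uri("decision/D1"), uri("vocab/appliesLaw"), uri("law/L1")),
    (uri("decision/D1"), uri("vocab/cites"), uri("decision/D0")),
    (uri("law/L1"), uri("vocab/keyword"), Literal("Eigentum")),
]


def to_ntriples(statements):
    """Serialize triples in N-Triples syntax, one statement per line."""
    lines = []
    for s, p, o in statements:
        if isinstance(o, Literal):
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def objects_of(statements, subject, predicate):
    """Return all objects linked to a subject via a predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


print(to_ntriples(triples))
```

With such a representation, a question like "which earlier decisions does decision D1 cite?" becomes a lookup over explicit links, e.g. `objects_of(triples, uri("decision/D1"), uri("vocab/cites"))`, rather than a keyword search over text.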

Our research aims to propose a system for legal and non-legal professionals who search a legal information system for information and for the relations between the pieces of information found. This requires the available information to be represented in a suitable way that allows it to be linked properly.

Legal information is very diverse, differs from country to country and might be specific to a particular legal domain. Moreover, it is typically available as natural language text and not in a structured data format that can be processed out of the box without further restrictions. Therefore, we derive our main hypothesis:

Legal information available in systems such as the Austrian legal information system RIS or European legal databases can be structured and enhanced by semantics to support unambiguous and useful interlinking of legal cases in a legal knowledge graph that, along with suitable graph traversal and summarization techniques, helps legal professionals.

Because the legal domain is very broad and can be divided into many highly specialized subdomains, we split the hypothesis into two problem areas. Both have to be tackled to achieve the ultimate goal: a legal information system that provides legal and non-legal professionals with all required case-related information and turns information retrieval into a “one stop shop”. The two problem areas and our planned contributions are:

Fig. 2. Excerpt of the law on rights of legal owners in Austria from RIS

Fig. 3. Court decisions for search term “Auto” (car) in the Austrian RIS

  • P1 (Research Problem 1) Representation of legal information. Legal information is distributed over various sources and typically represented as natural language text. Although all required information about a law or court decision is mentioned in the text, there is no metadata about this specific source of information available. The metadata is expected to be different for different kinds of legal information (for instance laws and court decisions) and will require semantic alignment in case the same metadata fields are used to represent information differently.

    In general, a legal system can be classified in several ways, e.g. into areas of law (civil, criminal,...), but this classification might not always be satisfactory, for instance because it is too fine- or too coarse-grained. Therefore, all related legal information has to be enhanced with additional information. Figure 2 shows how a law is displayed in the RIS, in particular the rights of the legal owner. It contains structured information (printed in bold), for instance “Abkürzung” (abbreviation), the actual law text or “Schlagworte” (keywords). This kind of information can be parsed easily as it is already available in a structured format. Additional information might still be contained in the actual law text and not be covered by the provided keywords.

    Figure 3 shows the search results for the keyword “Auto” (car), which presents all court decisions containing the search term. A legal professional knows the circumstances of the case and for what court (“Gericht”) to look and it is also possible to restrict the search to a particular court, but the results still have to be browsed and their usefulness evaluated manually.

  • C1 Contribution to P1. A legal system can be classified according to several aspects. Usually, legal systems use different codes of law, which can serve as a basis for classification. However, the same term might be used in several different codes of law, so that searching for a keyword contained in more than one code returns results from all of them. For instance, Fig. 2 shows an article in the civil code about “Eigentum” (legal ownership). The same term also appears several times in the criminal code.

    Besides classifying a legal norm by the code it is contained in, we can also use the keywords provided by the RIS to assign legal norms and judicial decisions to appropriate categories. For instance, the example shown in Fig. 2 can be assigned to the civil code or to the categories mentioned in the keyword section, which might then also contain other legal norms and court decisions with the same keyword. Therefore, we will be able to classify the legal norms and decisions based on the information they provide.

  • P2 The search of legal information is difficult. Graphs are a convenient and intuitive way to display relations between information. In the legal domain, the relation and the importance of a relation is case-dependent. The meaning of words and expressions is often ambiguous. Therefore, search engines for legal information need to be context-aware and consider this fact in the applied search algorithm. The search for a keyword and all related documents will result in an unmanageable list of results which has to be analyzed manually. Current LIS and legal databases partly provide the option to sort the results based on relevance, date or different types of legal documents, for instance, decisions (with subcategories of different courts), legal norms, commentaries, journals etc. However, the assessment of the result is still up to the user of such an information system.

  • C2 Contribution to P2. The additional benefit of a graph-based, semantics-enhanced information system lies in the availability of graph metrics to adjust the search process. Graph metrics describe the structure of a graph; for instance, the degree distribution can be used to find the nodes with the highest number of incoming or outgoing links. The betweenness centrality indicates how central a particular node is within the network, based on the number of shortest paths going through this node. The closeness centrality relates to the length of the shortest paths and therefore to how close a node is to the other nodes. These and other graph metrics are useful for finding important nodes in a graph, for optimizing the search algorithm and for finding the most important related documents, as already shown by Hulpus et al. [22]. Given a proper representation of an information system, these metrics would help to build summary graphs, which enable the user to capture the structure intuitively and to choose which subgraph seems worth further inspection.

    The number of nodes in a graph varies widely. Depending on the context, nodes can be classified based on certain properties and merged into parent categories that contain all the information of the merged nodes. Reducing the size of the graph makes it more comprehensible; in the case of a LIS, the user is not overwhelmed by an unmanageable number of results but is given a summary and can choose which category seems most promising for further investigation. As a starting point, we will use a breadth-first search (BFS) algorithm to find all related information before building the summary graph, because BFS follows all related information step by step and is expected to provide a more general overview. A depth-first search algorithm, on the other hand, can be used when a source and a target node in the graph are already known, which is not the case when building summary graphs. Therefore, a breadth-first search is expected to be more suitable for this problem.
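
The combination of breadth-first search and category-based summarization described under C2 can be sketched as follows. This is a minimal illustration and not the planned implementation: the citation graph, the document names and the category assignment are hypothetical, and merging nodes by a single category label is only one of many possible summarization strategies.

```python
from collections import deque, defaultdict

# Hypothetical citation graph: document -> documents it references.
links = {
    "decision_A": ["law_1", "decision_B"],
    "decision_B": ["law_1", "law_2"],
    "decision_C": ["law_3"],
    "law_1": ["law_2"],
    "law_2": [],
    "law_3": [],
}

# Hypothetical category per document (e.g. derived from keywords or code of law).
category = {
    "decision_A": "civil decisions",
    "decision_B": "civil decisions",
    "decision_C": "criminal decisions",
    "law_1": "civil code",
    "law_2": "civil code",
    "law_3": "criminal code",
}


def bfs(graph, start):
    """Return {node: hop distance} for all nodes reachable from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def summarize(graph, nodes, node_category):
    """Merge the given nodes by category and count links between categories."""
    summary = defaultdict(int)
    for source in nodes:
        for target in graph.get(source, []):
            if target in nodes:
                summary[(node_category[source], node_category[target])] += 1
    return dict(summary)


reached = bfs(links, "decision_A")
summary = summarize(links, reached, category)
print(reached)
print(summary)
```

Starting from `decision_A`, the search reaches four of the six documents, and the summary graph collapses them into two category nodes connected by weighted edges. The unrelated criminal-law documents never enter the summary, which is the kind of reduction the user would inspect before drilling into a category.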

4 Research Methodology and Approach

We decided to apply the iterative research methodology Action Research (AR) as described by Checkland and Holwell [12]. This means that we start with a literature review and evaluate the ontologies already proposed in the legal domain. We will then either choose an appropriate one and determine which adaptations are needed to meet our requirements or, if no appropriate ontology can be found or the adaptation process is not successful, develop and apply our own ontology. The adaptation and/or development process will go hand in hand with continuous checks of whether the ontology is still applicable when new data sources are added; it will continue as long as new and different legal data sources are added, which leads to a continued assessment and adaptation process. All iterations targeted at a specific task are continued until a satisfactory result has been found or we conclude that the task cannot be solved. The advantage of an iterative research methodology is that we can return to a previous step in a sequence of tasks and are not restricted by decisions taken earlier.

Regarding the graph-based search algorithms, we will start with a small subset of the legal domain and apply a breadth-first search (BFS) to navigate through the graph. Furthermore, we will calculate several graph metrics such as the number of nodes and edges, graph density and centrality scores. The content of nodes with high scores will be compared to the keywords mentioned in the related documents. Subgraphs of DBpedia and a financial transaction network serve as comparison graphs, as we have already gained knowledge of how a BFS algorithm performs on these graphs.
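
As an illustration of the metrics mentioned above, the following Python sketch computes the number of nodes and edges, the density, the in-degree and a closeness score for a small citation graph. The edge list is hypothetical, and the closeness variant used here (number of reachable nodes divided by the sum of hop distances, computed on the undirected view of the graph) is one common definition chosen for brevity; other definitions and other centrality measures may turn out to be more appropriate.

```python
from collections import deque

# Hypothetical directed citation edges (source cites target).
edges = [
    ("decision_A", "law_1"),
    ("decision_A", "decision_B"),
    ("decision_B", "law_1"),
    ("decision_B", "law_2"),
    ("law_1", "law_2"),
]

nodes = sorted({node for edge in edges for node in edge})
n, m = len(nodes), len(edges)

# Density of a directed graph without self-loops: m / (n * (n - 1)).
density = m / (n * (n - 1))

# In-degree: how often a document is referenced by others.
in_degree = {node: 0 for node in nodes}
for _, target in edges:
    in_degree[target] += 1

# Undirected adjacency, used here only for the closeness score.
adjacency = {node: set() for node in nodes}
for source, target in edges:
    adjacency[source].add(target)
    adjacency[target].add(source)


def closeness(node):
    """Reachable nodes divided by the sum of hop distances (0.0 if isolated)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


closeness_scores = {node: closeness(node) for node in nodes}
print(n, m, density)
print(in_degree)
print(closeness_scores)
```

In this toy graph, `law_1` and `law_2` are the most frequently referenced documents, while `decision_B` and `law_1` are closest to all other nodes. Comparing such high-scoring nodes with the keywords of related documents is the check described in the paragraph above.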

Furthermore, this thesis is related to two research projects, which will serve as a source of ideas for approaching arising problems as well as a basis for comparing results. The law-related project is called DALICC and deals with the machine-readable representation and comparison of software licenses. The GraphSense project focuses on the processing and analysis of large graphs. Both projects will therefore serve as input, combined with the author’s mainly legal background.

5 Preliminary Results

We surveyed suitable graph search algorithms [13, 16] as well as summarization methods [22, 27], which we have already applied in different use cases, for instance finding the k-shortest paths in different datasets [17, 20]. We plan to leverage this knowledge in the proposed PhD, combined with the author’s legal background. Furthermore, our research group has already gained preliminary expertise in the advanced search features of the Austrian RIS and their limitations [3] as well as in the formats and structure of legal data.

6 Evaluation Plan

The results of this research project will be evaluated continuously. The evaluation of the representation of legal data as linked data will be based on existing ontologies and their capability to represent the Austrian legal system. Suitable ontologies will be identified by evaluating existing ontologies and how likely they are to be applicable to our specific research project. We will start with a small subset of legal documents and try to represent it with the chosen ontology. Throughout the research project we will continuously add different types of legal documents to check whether the new information can still be represented. In terms of performance and scalability, we will have to investigate which sizes of the input and output graphs can be processed within feasible time. Furthermore, the size of a summary graph must be small enough that it can be captured intuitively. We will compare this with the sizes and complexity of the legal graphs we can construct from the case law knowledge extracted from the LIS. The performance of the algorithm will be evaluated by the time it needs to perform the search and by its memory consumption.
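
How search time and memory consumption might be recorded can be sketched with Python's standard library. The snippet below is only an assumption about one possible measurement setup: the helper `measure`, the synthetic chain graph and its size are hypothetical, and `tracemalloc` only tracks memory allocated by the Python interpreter, not the total memory of the process.

```python
import time
import tracemalloc
from collections import deque


def bfs_order(graph, start):
    """Return the nodes reachable from start in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


def measure(func, *args):
    """Return (result, elapsed seconds, peak bytes allocated by Python)."""
    tracemalloc.start()
    begin = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - begin
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


# Hypothetical synthetic input: a chain of 10,000 nodes.
graph = {i: [i + 1] for i in range(9999)}
graph[9999] = []

visited, seconds, peak_bytes = measure(bfs_order, graph, 0)
print(len(visited), seconds, peak_bytes)
```

Because memory tracing itself slows execution, timing and memory would in practice probably be measured in separate runs and repeated over graphs of increasing size to observe how both grow.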

7 Conclusion

Our research addresses the problem of domain-specific data representation, exemplified by the legal domain. We outlined the problem of properly representing domain-specific legal data involving laws, regulations and court decisions, as well as their associated metadata and the related graph-based search problems. The motivation, research problems and approach are described in this paper. Future work consists of an analysis of existing ontologies and an evaluation of whether they can be taken as a basis for our work.