1 Introduction

A large number of football games are played around the globe every day, and writing a news report immediately after a match, or a session of a match, has finished is a challenge. Mobile applications and web services that automate this task have the potential to change sports journalism. Short text snippets that highlight the salient events of a match, rather than long news articles, are also likely to attract many sports fans. Large amounts of unstructured live text commentary are available on the web. In this paper, we consider the specific use case of generating a short news article, or snippet, from football live text commentary. Producing grammatically correct and human-readable summaries, however, remains a challenging task.

Existing extractive summarization approaches, both supervised and unsupervised, have been surveyed extensively [5]. LexRank and TextRank are unsupervised techniques that rank sentences by graph centrality and sentence similarity and pick the top-ranked sentences as the summary [2]. Micro-opinions, concise and readable phrases that represent the important opinions in a text, have been generated with an unsupervised learning approach [4]. An automated method that implicitly crowdsources summaries of events, using only status updates posted to Twitter as a source, has also proved successful [8]. A supervised learning framework for constructing sports news from live text commentary has been studied in [9] and serves as the foundation for the supervised learning approach that we study. An ontology-based approach was proposed in [1].

In this paper we show that an entity graph based model produces better results than the supervised learning model. We compare our results with other existing techniques using a standard metric as well as human evaluators. We discuss our approach in Sect. 2. In Sect. 3, we describe the proposed Football Match Concept Graph (FMCG) model, followed by the snippet synthesis algorithm in Sect. 4. Experimental results are described in Sect. 5, while Sect. 6 concludes the paper.

Fig. 1. Framework diagram of the proposed smart football story snippet synthesis.

2 Methodology

Figure 1 depicts the overall framework of our proposed algorithm. We first obtain the football match commentary by scraping it from a website that hosts live match commentaries (www.sportsmole.com). We scraped the commentaries for 60 matches spanning two leagues and various teams. The raw data is then cleaned and processed: the text is tokenized into sentences, each sentence is assigned a time-stamp based on the match time it is attributed to in the commentary, and sentences with only one word are removed. From the cleaned data we generate an entity map (concept graph) that holds all the relevant and important information present in the commentary; the detailed algorithm is discussed in Sect. 3. The graph is constructed on the basis of domain knowledge and cross-referencing against a game information database. Named entity recognition and regular expressions are used to identify entities such as team names, venues, player names, and time-stamps. The goal is a snippet that is as close as possible to the available gold standard summaries and that matches the expectations of a common reader. The last part is the synthesis step: once the graph is obtained, we traverse it to generate a summary of the match. The traversal algorithm and the synthesis method are discussed in Sect. 4.
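The cleaning step can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each scraped commentary entry begins with a minute marker such as "50 min:"; the actual layout of the scraped pages is not specified here, and the function name `preprocess` and both regular expressions are illustrative only.

```python
import re

# Assumed entry format: an optional leading minute marker such as "50 min:".
TIMESTAMP = re.compile(r"^\s*(\d{1,3})\s*min\b[:\s-]*", re.IGNORECASE)
# Naive sentence splitter: break after ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def preprocess(entries):
    """Return (minute, sentence) pairs, dropping one-word sentences."""
    cleaned = []
    for entry in entries:
        match = TIMESTAMP.match(entry)
        minute = int(match.group(1)) if match else None
        text = entry[match.end():] if match else entry
        for sentence in SENTENCE_END.split(text.strip()):
            sentence = sentence.strip()
            # Sentences with a single word carry no useful information.
            if len(sentence.split()) > 1:
                cleaned.append((minute, sentence))
    return cleaned


entries = [
    "50 min: Goal! Sansone fires Villarreal in front.",
    "Kick-off at the Estadio de la Ceramica.",
]
for minute, sentence in preprocess(entries):
    print(minute, sentence)
```

A production pipeline would likely use a trained sentence tokenizer rather than the regular expression above, which mis-splits abbreviations.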

3 Football Match Concept Graph (FMCG) Creation

We propose a novel Football Match Concept Graph (FMCG) to capture the relationships among the various entities in a football match. The nodes of the graph represent the entities involved in the match: team names, players, goals scored, leagues, venue, the players scoring the goals, full-time and half-time scores, and the final result. The edges define the different relationships between the entities [6]. We observe that a football match summary written by a sports journalist follows a structured flow of events. Owing to its structure, our FMCG model provides consistent results that are close to human-readable summaries. Domain knowledge of football is used in the construction of the FMCG.
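One possible in-memory representation of such a graph is sketched below in Python. The class name `ConceptGraph`, the node type strings, and the relation labels are assumptions made for illustration; they are not the exact schema of the FMCG.

```python
from collections import defaultdict


class ConceptGraph:
    """Undirected graph with typed nodes and labelled edges (illustrative)."""

    def __init__(self):
        self.node_type = {}             # node name -> entity type
        self.edges = defaultdict(dict)  # node name -> {neighbour: relation}

    def add_node(self, name, kind):
        self.node_type[name] = kind

    def add_edge(self, a, b, relation):
        # Store the relation in both directions so traversal is symmetric.
        self.edges[a][b] = relation
        self.edges[b][a] = relation

    def neighbours(self, name, kind=None):
        """Neighbours of a node, optionally filtered by entity type."""
        return [n for n in self.edges[name]
                if kind is None or self.node_type[n] == kind]


graph = ConceptGraph()
for name, kind in [
    ("Match", "match"), ("La Liga", "league"),
    ("Villarreal", "team"), ("Barcelona", "team"),
    ("Estadio de la Ceramica", "venue"),
    ("Goal 1", "goal"), ("Sansone", "player"),
]:
    graph.add_node(name, kind)

# Relation labels below are hypothetical.
graph.add_edge("Match", "La Liga", "part_of")
graph.add_edge("Match", "Villarreal", "home_team")
graph.add_edge("Match", "Barcelona", "away_team")
graph.add_edge("Villarreal", "Estadio de la Ceramica", "home_stadium")
graph.add_edge("Match", "Goal 1", "has_goal")
graph.add_edge("Goal 1", "Sansone", "scored_by")
graph.add_edge("Goal 1", "Villarreal", "scored_for")

print(graph.neighbours("Goal 1", kind="player"))
```

Keeping the entity type on the node and the relation on the edge lets a query such as "which player is attached to this goal" be answered with a single filtered neighbour lookup.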

Fig. 2. Structure of the graph for football match entities.

We have adopted a two-phase approach to extract specific entities from a football commentary. In the first phase, named entities are recognized using the recognizer of [3]. Since this recognizer is not designed for football matches, it produces many false alarms. For example, given the sentence “the newly-named Estadio de la Ceramica and it is a real cracker of an affair as an exciting Villarreal team”, it labels “Villarreal” as a person and “Estadio de la Ceramica” as an organization, whereas the former is a team and the latter is a stadium. To rectify these errors, in the second phase we parse the results using regular expressions and a database, compiled from external sources, of team names, leagues, and stadiums, and cross-reference this database with the commentary of the specific match. This helps us to achieve a correct name-entity mapping. Figure 2 depicts a typical FMCG.
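The second phase can be sketched in Python as a gazetteer lookup that overrides the generic labels. The dictionary `GAZETTEER`, the label strings, and the function names are hypothetical stand-ins for the external database and for the output of the recognizer of [3]; the first-phase output is hard-coded here rather than produced by a real recognizer.

```python
import re

# Hypothetical domain database: lower-cased name -> football-specific label.
GAZETTEER = {
    "villarreal": "TEAM",
    "barcelona": "TEAM",
    "estadio de la ceramica": "STADIUM",
    "la liga": "LEAGUE",
}

# Longest names first so multi-word entries win over shorter overlaps.
_PATTERN = re.compile(
    "|".join(r"\b" + re.escape(name) + r"\b"
             for name in sorted(GAZETTEER, key=len, reverse=True)),
    re.IGNORECASE,
)


def correct_entities(ner_output, gazetteer=GAZETTEER):
    """Replace a generic NER label when the mention is in the database."""
    return [(text, gazetteer.get(text.lower(), label))
            for text, label in ner_output]


def find_mentions(sentence):
    """Find database entries in raw text, independent of the NER output."""
    return [(m.group(0), GAZETTEER[m.group(0).lower()])
            for m in _PATTERN.finditer(sentence)]


# Assumed phase-1 output for the example sentence above.
ner_output = [("Villarreal", "PERSON"),
              ("Estadio de la Ceramica", "ORGANIZATION")]
print(correct_entities(ner_output))
```

Mentions that are absent from the database keep their original label, so the correction is only as good as the coverage of the database.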

4 Snippet Synthesis Algorithm

Once the graph is obtained, we synthesize a snippet, or summary, of the match by answering a set of specific questions:

1. What is the name of the league or tournament?
2. Which teams are participating?
3. What is the venue, and which team's home stadium is it?
4. Who was the winner?
5. Who were the goal scorers, and when were the goals scored?
6. What was the score at half-time and at full-time?

We obtain the answers to these questions by traversing the graph and searching for the relevant information, using the Depth First Search (DFS) algorithm. For example, to find who scored Goal 1 (shown in Fig. 2) and which team that player belonged to, we first locate the node Goal 1 through DFS and then look up the player and team nodes connected to it. Once the information is obtained, we pass it on to the synthesis algorithm, which fills predefined sentence formats with the information to obtain the snippet. To synthesise a summary that is coherent and concise, we have defined a logical sequence and format in which the information is presented. An example of the summary is as follows:

Villarreal tied the match against Barcelona in the la liga league with 1 goal(s) each. Villarreal was playing at home in their stadium Estadio de la Ceramica. The score at half time was Villarreal 0-0 Barcelona. Sansone scored a magnificent goal for Villarreal at 50 min. Messi scored a magnificent goal for Barcelona at 90 min. The score at the end of the match was Villarreal 1-1 Barcelona.
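The goal lookup and template filling described in this section can be sketched in Python as follows. The adjacency lists, the node types, and the single template are assumptions that mirror the example summary; the full set of sentence formats is not reproduced here.

```python
# Hypothetical adjacency lists mirroring the example match.
GRAPH = {
    "Match": ["La Liga", "Villarreal", "Barcelona", "Goal 1", "Goal 2"],
    "La Liga": ["Match"],
    "Villarreal": ["Match", "Estadio de la Ceramica", "Goal 1"],
    "Barcelona": ["Match", "Goal 2"],
    "Estadio de la Ceramica": ["Villarreal"],
    "Goal 1": ["Match", "Sansone", "Villarreal", "50 min"],
    "Goal 2": ["Match", "Messi", "Barcelona", "90 min"],
    "Sansone": ["Goal 1"],
    "Messi": ["Goal 2"],
    "50 min": ["Goal 1"],
    "90 min": ["Goal 2"],
}
NODE_TYPE = {
    "Match": "match", "La Liga": "league",
    "Villarreal": "team", "Barcelona": "team",
    "Estadio de la Ceramica": "venue",
    "Goal 1": "goal", "Goal 2": "goal",
    "Sansone": "player", "Messi": "player",
    "50 min": "time", "90 min": "time",
}
GOAL_TEMPLATE = "{player} scored a magnificent goal for {team} at {time}."


def dfs_find(graph, start, target):
    """Iterative depth-first search; returns the target node or None."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return node
        if node in seen:
            continue
        seen.add(node)
        # Reverse so neighbours are visited in their listed order.
        stack.extend(reversed(graph[node]))
    return None


def neighbour_of_type(graph, node, kind):
    """First neighbour of the given entity type."""
    return next(n for n in graph[node] if NODE_TYPE[n] == kind)


def describe_goal(graph, goal):
    """Locate a goal node by DFS and fill the goal sentence template."""
    node = dfs_find(graph, "Match", goal)
    if node is None:
        return None
    return GOAL_TEMPLATE.format(
        player=neighbour_of_type(graph, node, "player"),
        team=neighbour_of_type(graph, node, "team"),
        time=neighbour_of_type(graph, node, "time"),
    )


print(describe_goal(GRAPH, "Goal 1"))
```

Because the templates are separate from the graph, the wording of the snippet can be changed without rebuilding the graph.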

5 Experiments and Results

We used a data set of 60 matches with an average of 223 sentences per commentary. Each commentary carries time-stamps from 0 to 90 min that place every sentence within the match. Knowing where a piece of information occurs narrows the search: for example, the names of the teams and the league are mentioned at the start of the commentary in all the matches, so we only need to search for them in the first 10 sentences. Sentences with only one word do not hold any useful information and were removed during processing of the data.
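The restricted search over the opening sentences can be sketched in Python as below. The set `KNOWN_TEAMS` stands in for the team database, and plain substring matching is an assumed simplification of the actual matching procedure.

```python
# Hypothetical stand-in for the team database.
KNOWN_TEAMS = {"Villarreal", "Barcelona", "Real Madrid"}


def teams_in_opening(sentences, known_teams, window=10):
    """Collect known team names mentioned in the first `window` sentences."""
    found = []
    for sentence in sentences[:window]:
        for team in sorted(known_teams):
            if team in sentence and team not in found:
                found.append(team)
    return found


sentences = (
    ["Welcome to La Liga coverage of Villarreal against Barcelona."]
    + ["Both sides are warming up."] * 10
    + ["Real Madrid play tomorrow."]
)
print(teams_in_opening(sentences, KNOWN_TEAMS))
```

The mention of a third team later in the commentary falls outside the window and is ignored, which is the purpose of the restriction.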

5.1 Evaluation of Summaries

For evaluation, we have used two methods: the ROUGE score and human evaluation of the summaries. We use the ROUGE-N [7] metric to compare the generated snippets against manually written gold standards. The gold standards [8] are game recap articles from a reliable website source; they serve as the reference for evaluating the algorithm-generated summaries with both ROUGE and human evaluation. Although studies have suggested that the ROUGE metric correlates with manual evaluation of summaries [7], the metric is not perfect, so we have used both techniques. Three human evaluators each used a 5-level Likert scale to score the generated summaries on three dimensions: readability, syntax correctness, and interpreted meaning of the summary. To provide a baseline, the three evaluators also scored the gold standard recap articles.
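For reference, ROUGE-N precision, recall, and F-score can be computed from clipped n-gram overlap as in the minimal Python sketch below. It tokenizes on whitespace and omits the stemming and stop-word options of the official ROUGE toolkit, so its numbers are not directly comparable to those in Table 2; the two example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


candidate = "Messi scored for Barcelona"
reference = "Messi scored a late goal for Barcelona"
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

In the unigram case all four candidate words appear in the seven-word reference, so precision is 1 and recall is 4/7.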

Table 1. Human evaluation of the summaries on a 5-level Likert scale. M = Mean, Mdn = Median, SD = Standard Deviation.

Table 2. Comparison of F-scores of the models that we implemented.

5.2 Quantitative Results

Table 1 reports the human evaluation of the summaries for readability, syntax correctness, and semantic meaning, compared to the gold standards. It indicates comparatively higher readability, fewer English syntax errors, and better-interpreted content for the entity graph based summaries than for the Stanford NER model using supervised learning (see Table 2). These results can be attributed to the inclusion of domain knowledge of football and of news article writing in the entity graph model, as compared to the supervised learning model.

5.3 Successful and Failure Scenarios

Our framework is effective on a use case with sufficient domain knowledge. Here, we used domain knowledge of football to create a graph that stores the information of a match. The algorithm works as discussed when the teams are in the database and the commentary contains all the information needed to answer the questions. If the information is insufficient, the algorithm returns incomplete snippets. To create the graph, we use a database to extract information from the commentaries; if a team or a player is mentioned in a manner different from that in the database, we cannot extract that information successfully. For example, players and teams are often mentioned in different ways: “Ronaldo” is sometimes mentioned as “CR7”, while “Barcelona” is mentioned as “Barca”. If these variations are not accounted for, they cause errors. Our algorithm also fails when the team names are not present in the database.
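One way to account for such variations is an alias table consulted before the database lookup, sketched in Python below. The table `ALIASES`, the set `KNOWN_NAMES`, and the function `normalise` are hypothetical and are not part of the implemented system.

```python
# Hypothetical alias table: lower-cased variant -> canonical database name.
ALIASES = {
    "cr7": "Ronaldo",
    "barca": "Barcelona",
}
# Hypothetical canonical names held in the database.
KNOWN_NAMES = {"Ronaldo", "Barcelona", "Villarreal"}


def normalise(mention, aliases=ALIASES, known=KNOWN_NAMES):
    """Map a mention to its canonical name, or None if it is unknown."""
    key = mention.strip().lower()
    if key in aliases:
        return aliases[key]
    for name in known:
        if name.lower() == key:
            return name
    return None


for mention in ["CR7", "Barca", "villarreal", "Unknown FC"]:
    print(mention, "->", normalise(mention))
```

A mention that resolves to None signals the failure case described above, where the snippet would otherwise be silently incomplete.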

5.4 Discussion

Our proposed model produces better results than the previous approaches on both the human evaluation (Table 1) and the F-score comparison (Table 2). The unsupervised LexRank and TextRank approaches are good general-purpose methods, but they do not take the domain information of the commentary into account and thus give poor summaries. The supervised method is a good approach for automating summaries and does take domain knowledge into account. However, the summaries it generates are not coherent, since it extracts important sentences rather than synthesizing new ones, and therefore they do not score high on human readability. We have used a database to extract information rather than relying on named entity recognizers [3] alone, which perform poorly on football commentary. To generate information-rich snippets that are coherent and score high on the human readability test, we therefore propose the above algorithm. The synthesis algorithm can be further improved to generate more organic summaries that more closely resemble human-written ones. The graph creation algorithm can also be improved so that it relies less on domain knowledge and extracts important information automatically.

6 Conclusion

Our proposed algorithm scored high on the human readability evaluation and better than the previous models on the F-score. It generates coherent and information-rich summaries, and the format of the summaries can be changed easily while keeping the entity graph the same. The algorithm can be adapted to other use cases and sports, given sufficient domain knowledge. In the future, we plan to extend this work to provide instant news summaries by summarising commentary streams, thus saving journalists the repetitive work of summarising matches. The proposed technique can also be combined with video summarization techniques to identify key events in sports videos.