Abstract
Sports commentaries offer sparse and redundant information in a lengthy format, whereas news articles written by sports journalists follow observable patterns. In this paper, we propose a graphical method to synthesise a story snippet from football match commentaries. Our model effectively extracts important information from lengthy text documents. An experimental study shows that the generated snippets closely match human expectations, and both qualitative and quantitative analyses demonstrate the effectiveness of the proposed method.
1 Introduction
A large number of football games are played around the globe every day. Writing a news report immediately after a match, or a session of a match, has finished is a challenge. Mobile applications or web services that address this have the potential to revolutionise journalism and sports news. Moreover, short text snippets that highlight the salient events of a match, rather than long news articles, are likely to attract many sports fans. Huge amounts of unstructured live text commentary are available on the web. In this paper, we consider the specific use case of generating a short news article, or snippet, from football live text commentary. Producing grammatically correct and human-readable summaries, however, is still a challenging task.
Existing extractive summarization techniques, comprising supervised and unsupervised learning approaches, have been surveyed extensively [5]. LexRank and TextRank are unsupervised techniques that rank sentences using graph centrality and sentence similarity and pick the top-ranked sentences as the summary [2]. Micro-opinions, which are concise and readable phrases that represent the important opinions in a text, have been generated with an unsupervised learning approach [4]. An automated method that implicitly crowdsources summaries of events, using only status updates posted to Twitter as a source, has also proved successful [8]. A supervised learning framework for constructing sports news from live text commentary has also been studied [9], and it serves as the foundation for the supervised learning approach in our study. An ontology-based approach was proposed in [1].
In this paper, we show that an entity-graph-based model produces better results than the supervised learning model. We compare our results with existing techniques using a standard metric as well as human evaluators. We outline our approach in Sect. 2. In Sect. 3, we describe the proposed Football Match Concept Graph (FMCG) model, followed by the snippet synthesis algorithm in Sect. 4. Experimental results are described in Sect. 5, while Sect. 6 concludes the paper.
2 Methodology
Figure 1 depicts the overall framework of our proposed algorithm. We first obtain the football match commentary by scraping it from a website that hosts live match commentaries (www.sportsmole.com). We scraped the commentaries for 60 matches spanning two leagues and various teams. The raw data is then processed and cleaned. The text is first tokenized into sentences, each of which is assigned a time-stamp based on the minute it is attributed to in the commentary. Sentences with only one word are removed. From the cleaned and processed data we generate a concept map (an entity graph) that holds all the relevant and important information present in the commentary. The detailed algorithm is discussed in Sect. 3. The graph is constructed on the basis of domain knowledge and cross-referencing with a game information database. Named entity recognition and regular expressions are used to identify entities such as team names, venue, player names, and time-stamps. The goal is to produce a snippet that is as close as possible to the available gold standard summaries and that closely matches the expectations of a common reader. The last part is the synthesis step: once the graph is obtained, we parse it to generate a summary of the match. The parsing algorithm and the synthesis method are discussed in Sect. 4.
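The cleaning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each commentary entry has already been scraped into a hypothetical `(minute, text)` pair, and it uses a naive regular-expression sentence splitter in place of whatever tokenizer was actually used. The sample entries are invented for illustration.

```python
import re
from typing import List, Tuple

def preprocess(entries: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Split each entry into sentences, tag each sentence with the entry's
    minute, and drop sentences that contain only one word."""
    sentences = []
    for minute, text in entries:
        # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if len(sentence.split()) > 1:
                sentences.append((minute, sentence))
    return sentences

# Hypothetical commentary entries, for illustration only.
entries = [
    (50, "GOAL! Sansone puts Villarreal ahead."),
    (90, "Messi equalises with a free kick. Incredible!"),
]
cleaned = preprocess(entries)
```

Here the one-word sentences "GOAL!" and "Incredible!" are discarded, and each remaining sentence keeps the minute of the entry it came from.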
3 Football Match Concept Graph (FMCG) Creation
We propose a novel Football Match Concept Graph (FMCG) to capture the relationships among the various entities in a football match. The nodes of the graph represent the entities involved in the match: team names, players, goals scored, leagues, venue, the players scoring the goals, full-time and half-time scores, and the final result. The edges define the different relationships between the entities [6]. We observe that a football match summary written by a sports journalist follows a structured flow of events. Owing to its structure, our FMCG model provides consistent results that are close to human-written summaries. Domain knowledge of football is used in the construction of the FMCG.
We have adopted a two-phase approach to extract specific entities from a football commentary. First, named entities are recognized using [3]. Since [3] is not designed for football matches, it produces many false alarms. For example, given the sentence “the newly-named Estadio de la Ceramica and it is a real cracker of an affair as an exciting Villarreal team”, [3] detects “Villarreal” as a person and “Estadio de la Ceramica” as an organization. However, the former is a team, while the latter is a stadium. To rectify these errors, in phase 2 we parse the results using regular expressions and a database, built from external sources, of team names, leagues, and stadiums, and cross-reference this database with the commentary data of the specific match. This helps us achieve a correct named-entity mapping. Figure 2 depicts a typical FMCG.
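One way to hold such a graph in memory is a typed node table plus a labelled adjacency list. The sketch below is an assumption about representation only: the class name `ConceptGraph`, the node type strings, and the relation labels (`played_in`, `scored_by`, and so on) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

class ConceptGraph:
    """Undirected graph whose nodes carry an entity type and whose edges
    carry a relation label (all labels here are illustrative)."""

    def __init__(self):
        self.node_type = {}              # node name -> entity type
        self.edges = defaultdict(list)   # node name -> [(relation, neighbour)]

    def add_node(self, name, kind):
        self.node_type[name] = kind

    def add_edge(self, a, relation, b):
        self.edges[a].append((relation, b))
        self.edges[b].append((relation, a))

    def neighbours(self, name, kind=None):
        """Neighbours of a node, optionally filtered by entity type."""
        return [n for _, n in self.edges.get(name, [])
                if kind is None or self.node_type[n] == kind]

g = ConceptGraph()
for name, kind in [("Villarreal", "team"), ("Barcelona", "team"),
                   ("la liga", "league"), ("Goal 1", "goal"),
                   ("Sansone", "player")]:
    g.add_node(name, kind)
g.add_edge("Villarreal", "played_in", "la liga")
g.add_edge("Barcelona", "played_in", "la liga")
g.add_edge("Goal 1", "scored_by", "Sansone")
g.add_edge("Goal 1", "scored_for", "Villarreal")
```

Filtering neighbours by type, for example `g.neighbours("Goal 1", kind="player")`, is what lets a later traversal ask questions such as who scored a given goal.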
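The second phase can be illustrated with a small lookup-based correction. This is a sketch under stated assumptions: the `GAZETTEER` dictionary stands in for the external database, the label names `TEAM`, `STADIUM`, and `LEAGUE` are hypothetical, and the recognizer output is hard-coded rather than produced by an actual call to [3].

```python
import re

# Stand-in for the external database of teams, leagues, and stadiums (assumption).
GAZETTEER = {
    "Villarreal": "TEAM",
    "Barcelona": "TEAM",
    "Estadio de la Ceramica": "STADIUM",
    "La Liga": "LEAGUE",
}

def relabel(sentence, ner_spans):
    """Override generic NER labels with domain labels for every database
    entry that appears in the sentence as a whole phrase."""
    labels = dict(ner_spans)
    for name, kind in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            labels[name] = kind
    return labels

sentence = ("the newly-named Estadio de la Ceramica and it is a real cracker "
            "of an affair as an exciting Villarreal team")
# Phase-1 output as reported in the text above, hard-coded for illustration.
ner_spans = [("Villarreal", "PERSON"), ("Estadio de la Ceramica", "ORGANIZATION")]
corrected = relabel(sentence, ner_spans)
```

Entities that the database does not cover keep the label assigned in the first phase, so the correction only ever narrows the generic labels.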
4 Snippet Synthesis Algorithm
Once the graph is obtained, we answer a set of specific questions to synthesize a snippet, or summary, of the match:
1. What is the name of the league or tournament?
2. Which teams are participating?
3. What is the venue, and which team’s home stadium is it?
4. Who was the winner?
5. Who scored the goals, and when?
6. What was the score at half-time and full-time?
We obtain the answers to these questions by traversing the graph and searching for the relevant information. We implemented the Depth First Search (DFS) algorithm to search for information in the graph. For example, to find who scored Goal 1 (shown in Fig. 2) and which team that player belonged to, we first locate the node Goal 1 through DFS and then read the player and team nodes connected to it. Once the information is obtained, we pass it on to the synthesis algorithm. The synthesis algorithm has predefined sentence formats that are filled in with the information to obtain the snippet. To synthesise a summary that is coherent and concise, we have defined a logical sequence and format in which the information is presented. An example of the summary is as follows:
Villarreal tied the match against Barcelona in the la liga league with 1 goal(s) each. Villarreal was playing at home in their stadium Estadio de la Ceramica. The score at half time was Villarreal 0-0 Barcelona. Sansone scored a magnificent goal for Villarreal at 50 min. Messi scored a magnificent goal for Barcelona at 90 min. The score at the end of the match was Villarreal 1-1 Barcelona.
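The traversal and template-filling steps can be sketched as below. The graph layout (`NODE_TYPE`, `EDGES`), the root node `Match`, and the helper names are assumptions made for illustration; only the goal sentence format is taken from the example summary above.

```python
# Hypothetical graph for the example match: node types and adjacency lists.
NODE_TYPE = {
    "Match": "match", "Villarreal": "team", "Barcelona": "team",
    "Goal 1": "goal", "Goal 2": "goal",
    "Sansone": "player", "Messi": "player", "50": "minute", "90": "minute",
}
EDGES = {
    "Match": ["Villarreal", "Barcelona"],
    "Villarreal": ["Goal 1"],
    "Barcelona": ["Goal 2"],
    "Goal 1": ["Sansone", "Villarreal", "50"],
    "Goal 2": ["Messi", "Barcelona", "90"],
}
GOAL_TEMPLATE = "{player} scored a magnificent goal for {team} at {minute} min."

def dfs_find(edges, start, target):
    """Iterative depth-first search; returns the path to target, or None."""
    stack, visited = [(start, [start])], set()
    while stack:
        node, path = stack.pop()
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        # Reverse so neighbours are explored in their listed order.
        for nxt in reversed(edges.get(node, [])):
            if nxt not in visited:
                stack.append((nxt, path + [nxt]))
    return None

def describe_goal(goal):
    """Locate a goal node by DFS, then fill the template from its neighbours."""
    if dfs_find(EDGES, "Match", goal) is None:
        return None
    linked = {NODE_TYPE[n]: n for n in EDGES[goal]}
    return GOAL_TEMPLATE.format(**linked)

sentences = [describe_goal("Goal 1"), describe_goal("Goal 2")]
```

Because the wording lives in the template rather than in the graph, the phrasing of the snippet can be changed without rebuilding the graph.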
5 Experiments and Results
We used a data-set of 60 matches with an average of 223 sentences per commentary. In each commentary, time-stamps from 0–90 min mark when every sentence occurs in the match. This structure lets us narrow the search for specific information: for example, the names of the teams and the league are mentioned at the start of the commentary in all the matches, so we only need to search for them in the first 10 sentences. Sentences with only one word do not hold any useful information, so they were removed during processing of the data.
5.1 Evaluation of Summaries
For evaluation, we used two methods: the ROUGE score and human evaluation of the summaries. We used the ROUGE-N [7] metric to compare the generated snippets with manually written gold standards. The gold standards [8] are game recap articles from a reliable website source. The recap articles serve as the reference for evaluating the algorithm-generated summaries using both ROUGE and human evaluation. Although studies have suggested that the ROUGE metric correlates with manual evaluation of summaries [7], this evaluation is not perfect. Hence, we used both techniques. Each of the three human evaluators used a 5-level Likert scale to score the generated summaries on three dimensions: readability, syntax correctness, and interpreted meaning of the summary. To provide a baseline, the three human evaluators also evaluated the gold standard recap articles.
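For readers unfamiliar with ROUGE-N, the metric counts the n-grams shared by a candidate and a reference, with each n-gram's count clipped to its count in the reference. The sketch below is a simplified illustration that assumes whitespace tokenization and lowercasing; it is not the official ROUGE package [7], and the two sentences are invented examples rather than data from the experiments.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())   # Counter & keeps the minimum counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Invented sentences, for illustration only.
candidate = "Messi scored for Barcelona"
reference = "Messi scored a late goal for Barcelona"
p1, r1, f1_uni = rouge_n(candidate, reference, n=1)
p2, r2, f1_bi = rouge_n(candidate, reference, n=2)
```

In this example all four candidate unigrams appear in the seven-word reference, so unigram precision is 1 and recall is 4/7; only two of the three candidate bigrams match, which lowers the ROUGE-2 scores.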
5.2 Quantitative Results
Table 1 reports the human evaluation of the summaries for readability, syntax correctness, and semantic meaning, compared with the gold standards. It indicates comparatively higher readability, fewer syntax errors in English, and more interpretable content in the entity-graph-based summaries than in those of the supervised learning model based on Stanford NER (see Table 2). These results can be attributed to the inclusion of domain knowledge of football and of news article writing in the entity graph model, which the supervised learning model lacks.
5.3 Successful and Failure Scenarios
Our framework is effective for use cases with sufficient domain knowledge. Here, we used domain knowledge of football to create a graph that stores the information of a match. The algorithm works as discussed when the teams are in the database and the commentary contains all the information needed to answer the questions. If the information is insufficient, the algorithm returns incomplete snippets. To create the graph, we use a database to extract information from the commentaries; if a team or a player is mentioned in a manner different from that in the database, we cannot extract that information successfully. For example, players and teams are often mentioned in various ways: “Ronaldo” is sometimes mentioned as “CR7”, while “Barcelona” is mentioned as “Barca”. If these variations are not accounted for, they cause errors. Our algorithm also fails when the team names are not present in the database.
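One possible way to account for such variations, offered here as an illustrative sketch rather than as part of the evaluated system, is to normalise known aliases to their database names before the lookup. The `ALIASES` table and the `normalise` helper are hypothetical, and the alias list would have to be curated by hand.

```python
import re

# Hypothetical alias table mapping lowercase variants to database names.
ALIASES = {"cr7": "Ronaldo", "barca": "Barcelona"}

# Match any alias as a whole word, ignoring case.
ALIAS_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ALIASES) + r")\b",
    re.IGNORECASE,
)

def normalise(sentence):
    """Replace each known alias with its canonical database name."""
    return ALIAS_PATTERN.sub(lambda m: ALIASES[m.group(0).lower()], sentence)

normalised = normalise("CR7 fires wide as Barca press high.")
```

Whole-word matching keeps canonical names such as "Barcelona" untouched, but any alias missing from the table would still go unrecognised.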
5.4 Discussion
Our proposed model generates substantially better results than previous approaches. The unsupervised LexRank and TextRank approaches are good general-purpose methods, but they do not take the domain information of the commentary into account and thus give poor summaries. The supervised method is a good approach for automating summaries and takes domain knowledge into account as well. However, the summaries it generates are not coherent, since it extracts important sentences rather than synthesizing new ones, and so they do not score high on human readability. We have used a database to extract information rather than relying on a named entity recognizer [3] alone, which performs poorly on football commentary. We therefore propose the above algorithm to generate information-rich snippets that are coherent and score high on the human readability test. The synthesis algorithm can be further improved to generate more natural summaries that more closely resemble human-written ones. The graph creation algorithm can also be improved so that it relies less on domain knowledge and extracts important information automatically.
6 Conclusion
Our proposed algorithm scored high on the human readability index and better than the previous models on the F-score. It generates coherent and information-rich summaries, and the summaries can be easily changed while keeping the entity graph the same. The algorithm can be adapted to different use cases and sports, given sufficient domain knowledge. In future, we plan to extend this work to provide instant news summaries by summarising commentary streams, thus saving journalists from the repetitive work of summarising matches. The proposed technique can also be combined with video summarization techniques to identify key events in sports videos.
References
1. Bouayad-Agha, N., Casamayor, G., Mille, S., Wanner, L.: Perspective-oriented generation of football match summaries: old tasks, new challenges. ACM Trans. Speech Lang. Process. 9(2), 3:1–3:31 (2012)
2. Erkan, G., Radev, D.R.: LexRank: graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res. 22, 457–479 (2004)
3. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370, ACL 2005 (2005)
4. Ganesan, K., Zhai, C., Viegas, E.: Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions. In: Proceedings of the 21st International Conference on World Wide Web, pp. 869–878. ACM (2012)
5. Gupta, V., Lehal, G.S.: A survey of text summarization extractive techniques. J. Emerg. Technol. Web Intell. 2(3), 258–268 (2010)
6. Jiang, Z., Li, P., Zhang, Y., Li, X.: Generating semantic concept map for MOOCs. In: International Conference on Educational Data Mining (2016)
7. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-2004 Workshop, vol. 8, Barcelona, Spain (2004)
8. Nichols, J., Mahmud, J., Drews, C.: Summarizing sporting events using Twitter. In: Proceedings of the 2012 ACM International Conference on Intelligent User Interfaces, pp. 189–198. ACM (2012)
9. Zhang, J., Yao, J.G., Wan, X.: Toward constructing sports news from live text commentary. In: Proceedings of ACL (2016)
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Vyas, A., Gaikwad, S., Chattopadhyay, C. (2017). A Graphical Model for Football Story Snippet Synthesis from Large Scale Commentary. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_61
DOI: https://doi.org/10.1007/978-3-319-69900-4_61
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4
eBook Packages: Computer Science, Computer Science (R0)