Abstract
Decision making increasingly relies on information found in various Internet sources. Texts written in an encyclopedic style, which contain mostly factual information, are preferred. We propose combining the logic-linguistic model with the Universal Dependencies treebank to extract facts of various quality levels from texts. Using Random Forest as the classification algorithm, we identify the types of facts and the types of words that most strongly affect the encyclopedic style of a text. We evaluate our approach on four corpora based on Wikipedia, social media, and mass media texts. Our classifier achieves an F-measure of over 90%.
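The F-measure reported above is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function name `f_measure` and the label lists are hypothetical, chosen only for illustration, with 1 standing for an encyclopedic-style text and 0 for any other text.

```python
def f_measure(y_true, y_pred, positive=1):
    """Return the F-measure (harmonic mean of precision and recall)."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = encyclopedic style, 0 = other (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

score = f_measure(y_true, y_pred)
print(round(score, 3))  # 3 TP, 1 FP, 1 FN -> precision = recall = 0.75
```

With these made-up labels, precision and recall are both 0.75, so the F-measure is also 0.75.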
Notes
- 2. We use ‘Predicate’, ‘Subject’ and ‘Object’ with an initial capital letter to emphasize the semantic meaning of words in a sentence.
- 6. A description of the groups is available at http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Khairova, N., Lewoniewski, W., Węcel, K., Orken, M., Kuralai, M. (2018). Comparative Analysis of the Informativeness and Encyclopedic Style of the Popular Web Information Sources. In: Abramowicz, W., Paschke, A. (eds) Business Information Systems. BIS 2018. Lecture Notes in Business Information Processing, vol 320. Springer, Cham. https://doi.org/10.1007/978-3-319-93931-5_24
DOI: https://doi.org/10.1007/978-3-319-93931-5_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93930-8
Online ISBN: 978-3-319-93931-5
eBook Packages: Computer Science, Computer Science (R0)