Abstract
In this paper, we propose models, algorithms, and implementation details of an approach that extracts the most relevant entity rankings from large datasets. The approach is fully automated, because manual solutions do not scale to large amounts of structured data that go beyond well-understood database schemas. Its core task is to decide which categorical constraints, ranking order (descending or ascending), and ranking length together form an interesting ranking. We use a model based on information entropy to find interesting and relevant categorical constraints, and we devise pruning conditions to avoid generating too many irrelevant rankings. We further investigate the skewness of the value distributions of ranking criteria to find suitable ranking dimensions and ranking order, and we present an overall scoring model to assess the meaningfulness of a ranking. For each individual step of our approach, we discuss iterative MapReduce-based algorithms. Finally, we report an experimental evaluation on real-world data in which users manually assess the rankings generated by our approach.
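To make the two statistical ingredients named above concrete, the following is a minimal single-machine sketch of Shannon entropy over a categorical attribute and of skewness over a numeric ranking criterion. It is not the paper's scoring model or its MapReduce implementation: the function names, the `suggest_order` heuristic, and its threshold-free sign rule are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sample_skewness(xs):
    """Biased (population) skewness: third central moment divided by
    the second central moment raised to the power 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    if m2 == 0:
        # All values identical: no asymmetry to measure.
        return 0.0
    return m3 / m2 ** 1.5


def suggest_order(xs):
    """HYPOTHETICAL heuristic, not taken from the paper: a long tail of
    large values (positive skew) suggests a descending ranking, a long
    tail of small values (negative skew) suggests an ascending one."""
    return "descending" if sample_skewness(xs) >= 0 else "ascending"


if __name__ == "__main__":
    # Toy data, invented for this example.
    countries = ["DE", "DE", "FR", "US", "US", "US"]
    populations = [1, 1, 1, 10]
    print(shannon_entropy(countries))
    print(sample_skewness(populations), suggest_order(populations))
```

In this sketch, an attribute whose values are spread evenly yields high entropy, while a numeric criterion with a few extreme values yields a skewness far from zero; how such quantities are combined into a ranking score is described only in the paper itself.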




Additional information
This work has been partially supported by the German Research Foundation (DFG) in project MI 1794/1-1.
Cite this article
Pal, K., Reinartz, F. & Michel, S. Mining Entity Rankings. Datenbank Spektrum 16, 27–38 (2016). https://doi.org/10.1007/s13222-015-0205-2
DOI: https://doi.org/10.1007/s13222-015-0205-2