ABSTRACT
Typical Big Data frameworks do not consider the architecture of the servers that make up the cluster. However, these servers are increasingly heterogeneous and are based on a ccNUMA architecture. In such architectures, main memory access times differ depending on which core issues the access. Hence, in addition to locality of data access across a cluster of servers, locality of memory access within individual servers can affect performance.
Java is a commonly used language for Big Data applications (through the popularity of Hadoop), and the newly released Java 8 introduces streams to simplify data-parallel programming. However, this paper argues that none of the built-in parallel stream sources can both operate efficiently on very large datasets and take data locality into account. This paper details recent work from the JUNIPER project, an EU Framework 7 Project, which is investigating how the Java 8 platform (augmented by the Real-Time Specification for Java) can be used for real-time Big Data applications. JUNIPER introduces architecture-aware stream sources that are suitable for Big Data systems and that preserve locality of data. Our results show that, when reading data from disk, thread affinity can seriously degrade the performance of standard Java streams, whereas JUNIPER's architecture-aware streams maintain their performance.
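For readers unfamiliar with the Java 8 stream model, the following minimal sketch shows the kind of standard, built-in parallel stream pipeline discussed above. It is an illustration only and is not JUNIPER's API: the class name `ParallelStreamSketch` and its methods are hypothetical, the workload (counting words) is an assumption chosen for brevity, and no thread affinity or NUMA placement is applied. The built-in source (`Files.lines`) hands work to the common fork/join pool without any knowledge of where threads or data reside.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of JUNIPER or the JDK.
final class ParallelStreamSketch {

    // Counts whitespace-separated words in the given lines.
    // parallel() lets the stream framework split the source and run
    // the pipeline on the common fork/join pool; the framework has no
    // notion of which core or memory node each chunk ends up on.
    static long countWords(Stream<String> lines) {
        return lines.parallel()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .mapToLong(line -> line.split("\\s+").length)
                    .sum();
    }

    // Same pipeline fed from a built-in file-backed stream source.
    // Files.lines reads the file lazily (UTF-8 by default).
    static long countWordsInFile(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return countWords(lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> sample =
            Arrays.asList("big data", "java 8 streams", "", "locality");
        System.out.println(countWords(sample.stream()));
    }
}
```

The pipeline itself is independent of the source, which is why a locality-aware source can, in principle, be substituted without changing the operations applied to the data.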
Index Terms
- On the Locality of Java 8 Streams in Real-Time Big Data Applications