ABSTRACT
Multicore is now the dominant processor trend, and the number of cores is rapidly increasing. The paradigm shift to multicore forces a redesign of the software stack, which includes dynamic analysis. Dynamic analyses provide rich features to software in various areas, such as debugging, testing, optimization, and security. However, these techniques often suffer from excessive overhead, which makes them less practical. Previously, this overhead was overcome by improved processor performance as each generation got faster, but the performance requirements of dynamic analyses in the multicore era cannot be fulfilled without redesigning for parallelism.
Scalable design of dynamic analysis is a challenging problem. Not only must the analysis itself be parallel, but it must also be decoupled from the application and run concurrently. A typical method of decoupling the analysis from the application is to send the analysis data from the application to the core that runs the analysis thread via buffering. However, buffering can perturb application cache performance, and the cache coherence protocol may not be efficient, or even implemented, with large numbers of cores in the future.
This paper presents our initial effort to explore the hardware design space and software approach that will alleviate the scalability problem for dynamic analysis on multicore. We choose to make use of explicit inter-core communication that is already available in a real processor, the TILE64 processor, and evaluate the opportunity for scalable dynamic analyses. We provide our model and implement concurrent call graph profiling as a case study. Our evaluation shows that pure communication overhead from the application's point of view is as low as 1%. We expect that our work will help design scalable dynamic analyses and will influence the design of future many-core processors.
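The decoupling described above can be illustrated with a minimal sketch. It is not the paper's implementation: a bounded `queue.Queue` stands in for the explicit inter-core channel (the TILE64 hardware interface is not shown), and the names `analysis_worker`, `run_profiled`, and `toy_app` are hypothetical. The application only sends call-edge events, while a separate analysis thread builds the call graph profile.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # marks the end of the event stream


def analysis_worker(channel, edge_counts):
    """Drain call-edge events and count each (caller, callee) pair."""
    while True:
        event = channel.get()
        if event is SENTINEL:
            break
        edge_counts[event] += 1


def run_profiled(app, channel_capacity=1024):
    """Run `app` while a separate thread builds its call graph profile.

    The bounded queue is an assumed stand-in for an explicit
    inter-core message channel; it is not a hardware interface.
    """
    channel = queue.Queue(maxsize=channel_capacity)
    edge_counts = Counter()
    worker = threading.Thread(target=analysis_worker,
                              args=(channel, edge_counts))
    worker.start()
    try:
        # The application's only analysis cost is sending the event.
        app(lambda caller, callee: channel.put((caller, callee)))
    finally:
        channel.put(SENTINEL)
        worker.join()
    return edge_counts


def toy_app(emit):
    """Hypothetical instrumented application emitting call edges."""
    for _ in range(3):
        emit("main", "parse")
        emit("parse", "lex")
    emit("main", "report")


counts = run_profiled(toy_app)
```

In this sketch the application thread never touches the profile data structure, which mirrors the goal of keeping analysis state out of the application's caches.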
Index Terms
- Opportunities for concurrent dynamic analysis with explicit inter-core communication