Abstract
Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors. Programming tricks are typically devised to shorten the long sequences of instructions needed to emulate these complicated bit operations. Because these bit manipulation operations are relevant to applications that are becoming increasingly important, we propose direct support for them in microprocessors. In particular, we propose fast bit gather (or parallel extract), bit scatter (or parallel deposit) and bit permutation instructions (including group, butterfly and inverse butterfly). We show that all of these instructions can be implemented efficiently using the fast butterfly and inverse butterfly network datapaths. Specifically, we show that parallel deposit can be mapped onto a butterfly circuit and parallel extract can be mapped onto an inverse butterfly circuit. We define static, dynamic and loop-invariant versions of the instructions, with the static versions using a much simpler functional unit. We show how a hardware decoder can be implemented for the dynamic and loop-invariant versions to generate, dynamically, the control signals for the butterfly and inverse butterfly datapaths. The simplest functional unit we propose is smaller and faster than an ALU. We also show that these instructions yield significant speedups over a basic RISC architecture for a variety of application kernels taken from application domains including bioinformatics, steganography, coding, compression and random number generation.
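To make the emulation cost concrete, the following is a minimal software sketch of bit gather and bit scatter, assuming the usual semantics: parallel extract packs the source bits selected by the 1s of a mask into the low-order end of the result, and parallel deposit places the low-order source bits at the positions marked by the 1s of the mask. The function names `pex_emul` and `pdep_emul` and the 64-bit word size are illustrative assumptions, and the bit-serial loop is a generic emulation, not the butterfly-network implementation proposed in the article.

```c
#include <stdint.h>

/* Bit gather (parallel extract), emulated one mask bit at a time.
 * Hypothetical helper name; assumes a 64-bit word. */
uint64_t pex_emul(uint64_t x, uint64_t mask)
{
    uint64_t result = 0;
    unsigned out = 0;                          /* next free low-order result position */
    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate the lowest set mask bit */
        if (x & lowest)
            result |= (uint64_t)1 << out;      /* copy the selected bit downward */
        out++;
        mask &= mask - 1;                      /* clear the lowest set mask bit */
    }
    return result;
}

/* Bit scatter (parallel deposit), emulated one mask bit at a time.
 * Hypothetical helper name; assumes a 64-bit word. */
uint64_t pdep_emul(uint64_t x, uint64_t mask)
{
    uint64_t result = 0;
    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1);  /* next destination position */
        if (x & 1)
            result |= lowest;                  /* place the next low-order source bit */
        x >>= 1;
        mask &= mask - 1;                      /* clear the lowest set mask bit */
    }
    return result;
}
```

For example, `pex_emul(0xABCD, 0x0F0F)` gathers the nibbles `0xB` and `0xD` into `0xBD`, and `pdep_emul(0xBD, 0x0F0F)` scatters them back to `0x0B0D`. Each call loops once per set mask bit, which is the kind of instruction sequence a single-instruction hardware version would replace.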

































Acknowledgements
This work was supported in part by the Department of Defense and a research gift from Intel Corporation. Hilewitz is also supported by a Hertz Foundation Graduate Fellowship and an NSF Graduate Fellowship. The authors would also like to thank Roger Golliver of Intel Corporation for suggesting some applications that might benefit from bit manipulation instructions.
Cite this article
Hilewitz, Y., Lee, R.B. Fast Bit Gather, Bit Scatter and Bit Permutation Instructions for Commodity Microprocessors. J Sign Process Syst Sign Image Video Technol 53, 145–169 (2008). https://doi.org/10.1007/s11265-008-0212-8
DOI: https://doi.org/10.1007/s11265-008-0212-8