
10.1 A 280mV-to-1.1V 256b Reconfigurable SIMD Vector Permutation Engine With 2-Dimensional Shuffle in 22nm CMOS

Steven Hsu, Amit Agarwal, Mark Anders, Sanu Mathew, Himanshu Kaul, Farhana Sheikh, Ram Krishnamurthy

Intel, Hillsboro, OR

Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1-3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write-ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep sub-threshold operation at 240mV, 10MHz consuming 19µW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gb/s measured at 1.8GHz, 0.9V.

The reconfigurable SIMD vector permutation engine, comprising the multi-ported register file and permute crossbar, is shown in Fig. 10.1.1. The register file is divided into 32 byte-wide banks to enable reconfigurable operand bit-widths of 8b/16b/32b/64b, with single-cycle read/write latency and throughput. Each bank implements a fully static one-hot 5b address decoder to generate read/write select word lines and a static local bitline (LBL). This LBL merges two memory cells with a static AOI multiplexer, followed by a balanced tree of clustered 2-input NAND and NOR gates to complete the global merge (Fig. 10.1.2). Unique addresses for each bank decoder, stored within the register file or an external control register, enable simultaneous byte accesses across multiple entries, implementing a 32-entry read/write vertical shuffle. The register file transmits and receives data/control operands from the 256b permute crossbar, consisting of an array of 32:1 byte-wise multiplexers, to perform single-cycle horizontal permute/shuffle/broadcast operations across multiple bit-width boundaries (Fig. 10.1.3). The permute crossbar utilizes a folded layout for 50% reduction in critical wire length and interleaved input data wires in opposite directions for 50% lower line-to-line capacitive coupling, improving interconnect and repeater delay by 2.6×.
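As a rough behavioral illustration of the two shuffle dimensions described above, the following Python sketch models the register file as 32 entries of 32 bytes, a vertical-shuffle read in which each byte bank uses its own entry address, and a crossbar in which each output byte has its own 32:1 select. The function names, the list-based data layout, and the example contents are assumptions made for illustration; they say nothing about how the fabricated circuits are built.

```python
NUM_ENTRIES = 32   # register file entries
NUM_BANKS = 32     # byte-wide banks: 32 x 8b = 256b per entry


def vertical_read(regfile, bank_addr):
    """Vertical shuffle read: bank b returns its byte from entry bank_addr[b]."""
    assert len(bank_addr) == NUM_BANKS
    return [regfile[bank_addr[b]][b] for b in range(NUM_BANKS)]


def horizontal_permute(data, select):
    """Byte-wise any-to-any permute: output byte i takes input byte select[i]."""
    assert len(data) == NUM_BANKS and len(select) == NUM_BANKS
    return [data[select[i]] for i in range(NUM_BANKS)]


# Hypothetical contents: byte b of entry e holds (e * 32 + b) mod 256.
regfile = [[(e * NUM_BANKS + b) % 256 for b in range(NUM_BANKS)]
           for e in range(NUM_ENTRIES)]

# Conventional read: every bank decoder receives the same address (entry 3).
row = vertical_read(regfile, [3] * NUM_BANKS)

# Diagonal read: bank b reads entry b, gathering one byte from each entry.
diagonal = vertical_read(regfile, list(range(NUM_BANKS)))

# Horizontal operations applied to the row that was just read.
reversed_row = horizontal_permute(row, [NUM_BANKS - 1 - i for i in range(NUM_BANKS)])
broadcast = horizontal_permute(row, [5] * NUM_BANKS)  # replicate byte 5 everywhere
```

In this model a wider operand, for example 32b, would correspond to select and address patterns that move groups of four adjacent bytes together.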
A permute accumulate circuit enables 256b shuffles across two entries with two-cycle latency and 50% crossbar area reduction, scalable to complex multi-entry shuffles. Byte-wise enables (RdEn/WrEn/PEn) prevent data switching in un-accessed banks/multiplexers during data rearrangement operations, resulting in up to 49% power reduction.

Robust operation at ultra-low voltages is limited by the register file and logic VMIN across PVT variations [4]. Static register file read circuits eliminate keeper contention present in dynamic bitlines, improving read VMIN by 200mV across fast/slow systematic variations, 0°C-85°C, and 6σ random variation (Fig. 10.1.4a). Inherent transistor-level redundancy in DETG cells compensates for systematic/random variations, improving write delay by 24% and write VMIN by 150mV compared to a conventional dual-ended (DE) write cell. Shared P/N on virtual supplies of DETG memory cells limits the strength of cross-coupled inverters across variations, reducing write contention by 22% with an additional 125mV write VMIN reduction. Read/write circuit optimizations improve the overall register file VMIN by 250mV. Vector flip-flops across two adjacent cells with shared local min-sized clock inverters average the variation, reducing low-voltage hold time violations and improving VMIN by 175mV (Fig. 10.1.4b). Stacked min-delay buffers limit variation-induced transistor speed-up, improving hold time margin at low voltage by 7%-30%. Ganged NANDs and vector multiplexers average the variation effect of min-sized devices by sharing transistors across gates. Permutation engine operation at low supply voltages requires level shifters to communicate with circuits at the higher I/O voltage. The ULVS level shifter decouples the CVSL stage from the output driver stage and interrupts contention devices, improving VMIN by 125mV (Fig. 10.1.4c). For equal fan-in/out, the ULVS level shifter weakens contention devices, thereby reducing power by 25%-32%. Circuit optimizations for ultra-low-voltage operation provide 150mV reduction in logic VMIN, resulting in iso-VMIN for the register file and logic circuits (Fig. 10.1.4d).

The vector register file operates at a maximum frequency of 1.8GHz consuming 106mW (measured at nominal 0.9V, 50°C) while executing three simultaneous 256b reads and one write in a single clock cycle, with an active leakage power component of 193µW or 0.2% of total power (Fig. 10.1.5). Performance scales up to 2.5GHz with 227mW power consumption at 1.1V. Register file circuit optimizations for ultra-low-voltage operation enable robust functionality measured down to 280mV consuming 109µW at 16.8MHz, with peak energy efficiency of 154GOPS/W (9× higher than nominal). The byte-wise any-to-any permute crossbar is functional across a wide supply voltage range of 240mV-1.1V, with measured scalable performance of 10MHz-2.9GHz and power consumption of 19µW-69mW. Nominal permute crossbar performance at 0.9V, 50°C is 2.3GHz with 36mW of total power.
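The energy-efficiency figures quoted above can be cross-checked from the reported frequencies and powers. The sketch below assumes that one operation completes per clock cycle, so that efficiency in GOPS/W reduces to frequency divided by power. That definition is an assumption rather than something the text states, although it is consistent with the reported 154GOPS/W for the register file at 280mV. The helper name and the dictionary layout are hypothetical.

```python
def gops_per_watt(freq_hz, power_w, ops_per_cycle=1):
    """Energy efficiency in GOPS/W; ops_per_cycle=1 is an assumption."""
    return freq_hz * ops_per_cycle / power_w / 1e9


# (frequency in Hz, power in W) taken from the measurements quoted in the text.
register_file = {
    "1.1V": (2.5e9, 227e-3),
    "0.9V": (1.8e9, 106e-3),
    "280mV": (16.8e6, 109e-6),
}
crossbar = {
    "1.1V": (2.9e9, 69e-3),
    "0.9V": (2.3e9, 36e-3),
    "240mV": (10e6, 19e-6),
}

rf_eff = {v: gops_per_watt(f, p) for v, (f, p) in register_file.items()}
xbar_eff = {v: gops_per_watt(f, p) for v, (f, p) in crossbar.items()}

# Ratio of the 280mV register file efficiency to its nominal 0.9V efficiency.
rf_gain = rf_eff["280mV"] / rf_eff["0.9V"]

# Active leakage share of register file power at 0.9V (193uW out of 106mW).
leakage_share = 193e-6 / 106e-3

for name, table in (("register file", rf_eff), ("crossbar", xbar_eff)):
    for vdd, eff in table.items():
        print(f"{name} @ {vdd}: {eff:.1f} GOPS/W")
print(f"register file gain vs nominal: {rf_gain:.1f}x")
print(f"leakage share: {leakage_share:.2%}")
```

Under that assumption the 280mV register file point evaluates to about 154GOPS/W, roughly 9× the 0.9V value, and the leakage share rounds to 0.2%, in line with the figures in the text.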
Peak energy efficiency of 585GOPS/W (9× higher than nominal) is measured at an ultra-low supply voltage of 260mV with a maximum frequency of 15MHz and total power of 25.6µW.

The permutation engine with 2-dimensional shuffle and permute accumulate capabilities accelerates wide bit-width read/write patterns, such as column, diagonal, chessboard, and stride accesses, which are critical data rearrangement operations in multimedia, graphics, and signal processing workloads (Fig. 10.1.6a). Two-entry blend operations are integrated into a single-cycle register file read, bypassing the permute crossbar and reducing the total energy by 66%. A 64b 4×4 matrix transpose mapped onto the 256b reconfigurable SIMD permutation engine, with both register file and permute crossbar operating at 1.8GHz, 0.9V, 50°C, requires 36%-63% fewer register file reads/writes and permutes, resulting in measured throughput of 263Gb/s consuming 352pJ, with 53% energy reduction compared to a conventional 256b shuffle-based implementation (Fig. 10.1.6b).

Acknowledgments: The authors thank V. De, R. Forand, S. Borkar, W. H. Wang, G. Taylor, C. Webb, J. Maiz, F. Merchant, S. Wijeratne, P. Newman, V. Erraguntla, S. L. Lu, L. Peake, D. Davis, and C. Placek for encouragement and discussions. This research was, in part, funded by the U.S. Government under contract number HR0011-10-3-0007. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

References:
[1] B. Flachs et al., “A Streaming Processing Unit for a Cell Processor,” ISSCC Dig. Tech. Papers, pp. 134-135, 2005.
[2] S. Arakawa et al., “A 512GOPS fully-programmable digital image processor with full HD 1080p processing capabilities,” ISSCC Dig. Tech. Papers, pp. 312-313, 2008.
[3] H.-J. Stolberg et al., “An SoC with Two Multimedia DSPs and a RISC Core for Video Compression Applications,” ISSCC Dig. Tech. Papers, pp. 330-331, 2004.
[4] A. Wang et al., “A 180mV FFT Processor Using Sub-threshold Circuit Techniques,” ISSCC Dig. Tech. Papers, pp. 292-293, 2004.


Figure 10.1.1: Organization of 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle.

Figure 10.1.2: Ultra-low voltage 32-entry × 256b 3-read/1-write register file with PVT-tolerant techniques and vector flip-flops.

Figure 10.1.3: 256b byte-wise any-to-any permute crossbar with interleaved folded layout, permute accumulate circuit, and ULVS level shifter.
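One way to picture the permute accumulate circuit named in this caption is as a two-pass shuffle through a single-input crossbar: the first cycle permutes bytes from one entry, the second cycle permutes bytes from the other, and a per-byte enable decides which pass updates each output byte. The sketch below is a behavioral assumption built on that reading; the function names, the enable encoding, and the accumulator list are hypothetical and are not taken from the circuit.

```python
NUM_BYTES = 32  # 256b datapath, byte granularity


def permute_accumulate(acc, data, select, enable):
    """One crossbar pass: enabled output bytes take data[select[i]];
    disabled bytes keep the value accumulated in earlier passes."""
    return [data[select[i]] if enable[i] else acc[i] for i in range(NUM_BYTES)]


def two_entry_shuffle(entry_a, entry_b, source, select):
    """Shuffle across two entries in two passes.
    source[i] is 0 to take output byte i from entry_a, 1 to take it from entry_b."""
    acc = [0] * NUM_BYTES
    acc = permute_accumulate(acc, entry_a, select, [s == 0 for s in source])  # cycle 1
    acc = permute_accumulate(acc, entry_b, select, [s == 1 for s in source])  # cycle 2
    return acc


# Example: interleave the low 16 bytes of two entries.
a = list(range(0, 32))        # bytes 0..31
b = list(range(100, 132))     # bytes 100..131
source = [i % 2 for i in range(NUM_BYTES)]    # even outputs from a, odd from b
select = [i // 2 for i in range(NUM_BYTES)]   # output pair (2j, 2j+1) reads byte j
result = two_entry_shuffle(a, b, source, select)
```

A single-pass two-entry shuffle would need each output byte to choose among 64 candidate bytes; reusing a 32:1 selection over two cycles is one plausible way to read the reported area saving, though the text does not spell out the mechanism.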

Figure 10.1.4: 22nm CMOS VMIN at 0°C-85°C, 3σ systematic, 6σ random variation (a) register file read/write; (b) flip-flop; (c) level shifter; (d) register file and logic co-optimization.

Figure 10.1.5: 22nm CMOS permutation engine maximum frequency, total power, energy efficiency, and active leakage measurements vs. supply voltage.

Figure 10.1.6: (a) Register file vertical shuffle access patterns and measurements; (b) 64b 4×4 matrix transpose measurements and comparisons.
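To make the combination of vertical and horizontal shuffles concrete, the following Python sketch transposes a 4×4 matrix of 64b elements, with one matrix row per 256b entry, using diagonal reads, a lane rotation, and scattered writes. This is a hypothetical mapping chosen for clarity and is not claimed to be the authors' algorithm; its operation counts should not be compared against the measured savings. All function names are assumptions.

```python
LANES = 4  # four 64b elements per 256b entry


def vertical_read(regfile, lane_addr):
    """Lane l is read from entry lane_addr[l]."""
    return [regfile[lane_addr[l]][l] for l in range(LANES)]


def vertical_write(regfile, lane_addr, data):
    """Lane l of data is written to entry lane_addr[l]."""
    for l in range(LANES):
        regfile[lane_addr[l]][l] = data[l]


def rotate_lanes(data, k):
    """Horizontal permute: output lane l takes input lane (l - k) mod LANES."""
    return [data[(l - k) % LANES] for l in range(LANES)]


def transpose_4x4(src):
    dst = [[None] * LANES for _ in range(LANES)]
    for k in range(LANES):
        # Diagonal read: lane c comes from entry (c + k) mod 4.
        diag = vertical_read(src, [(c + k) % LANES for c in range(LANES)])
        # Move the element in lane c to lane (c + k) mod 4.
        rotated = rotate_lanes(diag, k)
        # Scattered write: lane l goes to entry (l - k) mod 4.
        vertical_write(dst, [(l - k) % LANES for l in range(LANES)], rotated)
    return dst


# Element at row r, column c is encoded as 10 * r + c for easy inspection.
matrix = [[10 * r + c for c in range(LANES)] for r in range(LANES)]
transposed = transpose_4x4(matrix)
```

Each loop iteration touches one generalized diagonal of the matrix, so four read/permute/write rounds cover all sixteen elements; the first round needs no lane movement because its rotation amount is zero.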

