IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 42, NO. 11, NOVEMBER 2007

480-GMACS/mW Resonant Adiabatic Mixed-Signal Processor Array for Charge-Based Pattern Recognition

Rafal Karakiewicz, Student Member, IEEE, Roman Genov, Member, IEEE, and Gert Cauwenberghs, Senior Member, IEEE

Abstract: A resonant adiabatic mixed-signal VLSI array delivers 480 GMACS ($10^9$ multiply-and-accumulates per second) throughput for every mW of power, a 25-fold improvement over the energy efficiency obtained when the resonant clock generator and line drivers are replaced with static CMOS drivers. Losses in resonant clock generation are minimized by activating switches between the LC tank and the DC supply with a periodic pulse signal, and by minimizing the variability of the capacitive load to maintain resonance. We show that minimum energy is attained for relatively wide pulse width, and that typical load distribution in template-based charge-mode computation implies almost constant capacitive load. The resonantly driven 256×512 array of 3-T charge-conserving multiply-accumulate cells is embedded in a template matching processor for image classification and validated on a face detection task.

Index Terms: Adiabatic low-power techniques, resonant clock supply, computational memory, pattern recognition.

R. Karakiewicz was with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada. He is now with SNOWBUSH Microelectronics, Toronto, ON M5G 1Y8, Canada (e-mail: email@example.com).
R. Genov is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: firstname.lastname@example.org).
G. Cauwenberghs is with the University of California San Diego, La Jolla, CA 92093-0357, USA (e-mail: email@example.com).

I. INTRODUCTION

Low power dissipation is a critical objective in the design of portable and implantable microsystems supporting the use of a miniature battery power supply, wireless power harvesting, or other low-energy power sources. Typical power budgets are in the low milliwatts for wearable devices and low microwatts for implantable systems. Despite the shrinking power budgets, there is an ever greater need for high-throughput computing and embedded signal processing. Future generations of wearable and implantable devices call for the integration of complex signal extraction and coding functions along with sensing and communication. Portable real-time pattern recognition systems, such as wearable face detection and recognition systems for the blind, are examples of such applications. The energy efficiency, defined as computational throughput per unit power (or, equivalently, the reciprocal of the energy per unit computation), thus has to be maximized. In this work the adiabatic charge-recycling principle is applied to mixed-signal charge-based computing to reduce power dissipation below $f C V_{dd}^2$, while high computational throughput is maintained by employing an array-based parallel computing architecture.

When a CMOS inverter, in Fig. 1(a), charges a load capacitance $C$ to voltage $V_{dd}$, the total energy taken from the voltage supply source is $C V_{dd}^2$. Half of it is used to charge $C$, and the other half is dissipated in the pull-up network. When the output is driven low, the pull-down network discharges the energy stored in $C$, $\frac{1}{2} C V_{dd}^2$, to ground. The resistances of the pull-up and pull-down networks affect the minimum charging and discharging times, but not the dynamic energy dissipated. The dynamic energy dissipation can be lowered by reducing supply voltage, load capacitance, or both.

Fig. 1. Dynamic dissipation and resonant adiabatic energy recovery. (a) CMOS logic modeled as inverter driving a capacitive load; (b) CMOS dynamic logic equivalent; (c) Adiabatic logic modeled as transmission gate driving a capacitive load from a 'hot clock' $V_{HC}$; and (d) Adiabatic mixed-signal multiply-accumulation (MAC). A single cell in the MAC array is shown, with the charge-coupled MOS pair comprising a variable capacitive load.

Dynamic energy dissipation has a quadratic dependence on the supply voltage. This makes the reduction of supply voltages the most effective way to reduce dynamic energy dissipation. Dynamic voltage scaling has become a standard approach for reducing power dissipation when performance requirements vary in time. In modern processors the voltage and frequency are controlled in a feedback loop to maintain operation within a target power and temperature budget. Local voltage dithering, which toggles the supply between a small number of voltage levels to locally optimize energy consumption based on the workload of each circuit block, has been reported. Subthreshold circuits operate with the supply voltage below the threshold voltage of devices to further reduce dynamic energy dissipation. A subthreshold static random access memory (SRAM) and a fast Fourier transform (FFT) processor have recently been reported with optimal supply voltages of 300 mV and 350 mV respectively. By applying forward body bias the threshold voltage can be shifted lower to allow further voltage scaling and thus energy reduction.

The dynamic energy dissipation is reduced linearly with the load capacitance. If the speed is not critical, minimum device sizing reduces the capacitance at the cost of non-optimal propagation delay times. Dynamic logic, in Fig. 1(b), can be used to eliminate most of the PMOS capacitance. Finally, capacitance can be lowered by migrating to a new technology process with smaller minimum feature size at the cost of increased static power dissipation due to transistor leakage.

As opposed to static or dynamic CMOS logic drivers, adiabatic drivers slowly ramp the supply voltage from 0 V during the pull-up phase to reduce the voltage drop across the pull-up network. The voltage drop is made arbitrarily small by keeping the ramp period sufficiently longer than the time constant of the driver. Generation of the ramp signal implies that power dissipation is reduced at the system level, not only the gate level. For long ramp periods, the voltage across $C$ is approximately equal to the supply ramp voltage and the energy taken from the voltage source is $\frac{1}{2} C V_{dd}^2$, the minimum required to charge $C$ to $V_{dd}$. In general, a linear increase in the ramp charging time results in a linear decrease in the voltage drop across the pull-up network, and thus a linear decrease in dynamic energy dissipation. In the pull-down phase the energy stored on $C$ is slowly discharged back into the supply voltage source by slowly ramping $V_{dd}$ back to 0 V, again keeping resistive losses at a minimum. A number of adiabatic logic families utilizing voltage ramps have been developed, such as adiabatic dynamic logic (ADL), efficient charge recovery logic (ECRL), 2N2P logic, pass-transistor adiabatic logic (PAL), clocked adiabatic logic (CAL), and true single-phase energy-recovery logic (TSEL).

Generating ideal linear voltage ramps to provide constant charging and discharging currents incurs power dissipation in the supply generator, defeating the savings by adiabatic energy recovery. An oscillatory waveform, or hot clock (HC), from a resonator is typically used instead. The increased energy dissipation in the pull-up network, due to the non-optimal sinusoidal shape, is offset by the low energy dissipation and simplicity of resonant hot clock generation. Resonant adiabatic computing, in Fig. 1(c), recycles energy in an oscillating LC tank where the total on-chip load capacitance, $C$, is utilized as the tank capacitor. The inductor can be implemented externally or can be distributed over the chip. In each period, the charge stored on $C$ is shifted back into the inductor and is reused in subsequent computations, decreasing the dynamic energy dissipated well below $C V_{dd}^2$. In principle, dynamic energy consumption per unit computation in adiabatic circuits approaches zero with increasing oscillation period. In practice, the energy gain is limited by resistive losses in the tank and variability of load capacitance, which depends on signal activity. Resonant adiabatic arithmetic units and line drivers have been reported with up to seven-fold energy efficiency gains over their non-adiabatic mode.

Some existing adiabatic digital circuits rely on reversible logic to minimize non-adiabatic energy losses. Fully adiabatic circuits require a backward path, where computations are reversed, to recover the energy. The need to reverse information flow places great constraints on what can be computed. For example, an AND gate requires an auxiliary output in order to make the architecture reversible.

Instead of implementing digital adiabatic logic, we perform reversible computing adiabatically in the mixed-signal domain. Reversibility is inherent to the reversal of charge flow between two coupled MOS transistors, shown in Fig. 1(d). Transistors M1 and M2 comprise a charge-injection device (CID) which performs a one-bit multiply-and-accumulate (MAC) operation as detailed in Section II. To maintain high computational throughput, the mixed-signal adiabatic computing is performed on a charge-mode array. This work demonstrates that simple adiabatic techniques such as a resonant single-phase clock generator can be utilized effectively in parallel signal processing applications where array power dissipation often dominates.

There are a number of benefits in the choice of a parallel mixed-signal architecture. While parallel digital processors offer high throughput and energy efficiency with high accuracy, parallel analog processors often allow for further increases in integration density, computational throughput, and energy efficiency at the expense of reduced accuracy. High integration density is achieved by compact analog circuits such as those operating in the charge domain. Computational throughput is enhanced by larger dimensions of computing arrays with compact cells and by the low-cost nature of some analog operations such as zero-latency addition in the charge domain. Energy efficiency is increased as clocking is reduced or performed adiabatically, as in the case of the presented architecture. Lower accuracy of computation is a result of non-idealities of analog components such as inherent non-linearity and mismatches, and is typically only weakly dependent on the dissipated power for a given implementation. A detailed quantitative analysis of the analog-versus-digital trade-off has been given elsewhere. In targeted applications such as pattern recognition and data classification a modest accuracy of under 8 bits is often sufficient.

The charge-mode computing array presented here is embedded in a processor which performs general-purpose vector-matrix multiplication (VMM), the computational core of any template-matching linear transform. The combination of resonant power generation and mixed-signal adiabatic computing on a massively parallel charge-mode array yields a 25-fold gain in energy efficiency relative to the same array operated with static CMOS logic line drivers.

The paper is organized as follows. Section II describes the architecture and circuit implementation of the charge-mode template-matching array. In Section III, a resonant adiabatic clock generator is introduced in order to achieve high energy efficiency of the array-based computation. Limitations of the resonant clock generator are formulated and analyzed. Section IV describes the circuits and VLSI implementation of the resonant adiabatic charge-mode array processor overcoming these limitations. Section V presents experimental results from the adiabatic array processor prototyped in 0.35-µm CMOS technology, and Section VI concludes with final remarks.

II. CHARGE-MODE TEMPLATE-MATCHING ARRAY

The charge-mode array supports general analog multiplication of a digital matrix by a digital vector, by using reversible charge flow between coupled transistors. As shown in Fig. 1(d), each cell in the array performs a multiply-accumulation (MAC) operation by selectively transferring charge between two charge-coupled transistors M1 and M2, where the gate of the first transistor M1 connects to the input line, and the gate of the second transistor M2 connects to the output line. Hence M1 implements multiplication by selectively performing or not performing the charge transfer, and M2 implements the accumulation by capacitive coupling onto the output line. The charge transfer is non-destructive, and therefore the computation performed is intrinsically reversible, returning the transferred charge after deactivation of the input. The adiabatic mixed-signal principle outlined here exploits the lossless nature of reversible charge flow in an array of MAC cells, with inputs supplied by adiabatic line drivers from a hot clock supply. The multiplication and accumulation are performed in parallel in a single cycle of the resonant clock, with the energy recycled upon recovery of the charge at the end of the cycle.

The resonant generator is critical in achieving high energy efficiency, and is described in Section III.

A. Array Architecture and Circuit Implementation

The array performs general-purpose vector-matrix multiplication (VMM), the computational core of a variety of linear transform based algorithms in signal processing and pattern recognition. The VMM operation is defined as:

$$Y_m = \mathbf{W}_m \cdot \mathbf{X} = \sum_{n=0}^{N-1} W_{mn} X_n \tag{1}$$

with $N$-dimensional input vector $\mathbf{X}$, $M$-dimensional output vector $\mathbf{Y}$, and $M \times N$ matrix elements $\mathbf{W}$.

Fig. 2. Array processor architecture (left), circuit diagram of CID computational cell with integrated DRAM storage (right, top), and charge transfer diagram for active write and compute operations (right, bottom). A 1-bit binary data example is shown.

Fig. 2 (left) depicts a simplified architecture of the array processor for one-bit binary input vector and matrix coefficients, and matrix dimensions of $M = N = 4$. The analog array is interfaced with a bank of on-chip row-parallel analog-to-digital converters (ADCs) to provide convenient digital outputs as needed in some applications as well as in the array experimental testing and demonstration.

The unit cell in the analog array shown in Fig. 2 (top, right) combines a charge injection device (CID) computational element with a DRAM storage element. During the write operation the data to be stored is broadcast on the vertical bit-lines (BLs), which extend across the array. A row to be written to is selected by activating its word-line (WL), turning transistor M3 on (e.g., the second row in Fig. 2). The output match-line (ML) is held at $V_{dd}$ during the write phase, creating a potential well under the gate of transistor M2. This potential well is filled with electrons or emptied depending on whether the BL is logic-one or logic-zero respectively. Logic-one on BLs corresponds to 0 V, while logic-zero corresponds to $V_{dd}$. During the compute operation, the input data is broadcast on the compute-lines (CLs) while MLs, previously precharged to $V_{dd}$, are now left floating. A logic-one CL bit corresponds to voltage $2V_{dd}$, while logic-zero corresponds to 0 V. Each cell performs a one-quadrant binary-binary multiplication of its stored logic value and its CL logic value. An active charge transfer from M2 to M1 can occur only if there is a non-zero charge stored (e.g., first, second, and third cells in the second row in Fig. 2), and if the potential on the gate of M1 rises above that of M2, to $2V_{dd}$ (e.g., second, third, and fourth columns in Fig. 2). In this case, the high-impedance gate of M2 couples to its channel and rises above $V_{dd}$ by a fixed voltage depending on the charge and capacitance of M2 and the number of active cells in that row (e.g., second and third cells in the second row in Fig. 2). The output of a row is a discrete analog quantity reflecting the number of active cells coupling into the ML of that row (e.g., two cells, second and third, corresponding to the output of the second row equal to two in Fig. 2). In the numerical example given in Fig. 2, the correlation of the binary vector "1110" stored in the second row of the array with the binary input vector "0111" computed by the method described above yields the correct output equal to two.

As noted above, the cell performs non-destructive computation since the transferred charge is sensed capacitively on the MLs. Once computation is performed, the charge is shifted back from M1 into the DRAM storage transistor M2. Capacitive coupling of all cells in a single row into a single ML implements zero-latency analog accumulation along each row. An array of cells thus performs analog multiplication of a binary matrix with a binary vector. The architecture easily extends to multi-bit data.
B. Accuracy and Power Considerations

Sizing of transistors in the cell is of importance. The switch transistor M3 is of minimum size as needed to lower its parasitic capacitance and charge injection. Transistor M2 is 30 times larger than M3 in order to avoid DRAM soft errors, as dictated by the DRAM BL capacitance and by subthreshold leakage in the storage cell. Transistor M1 is sized such that the output dynamic range of the array is large, yielding sufficient noise margins. It can be shown that the voltage on MLs is a monotonically increasing saturating function of the area of transistor M1. The area of M1 is chosen to be 50 percent of that of M2. Increasing the area of M1 beyond this value does not yield a substantial increase in the dynamic range but reduces the density of the array and the resonant frequency of the LC tank.

When the computational array is integrated with high-speed digital CMOS circuits on the same chip, excessive interference due to crosstalk may affect the operation of charge-mode cells. The resulting noise may be correlated for many cells and thus may not be averaged out during row-wise accumulation. One way to remove the effect of interference is by utilizing one row in the array as a reference row. This dedicated row has all logic-zero bits stored in it and has the same inputs as all the other rows. The output of the reference row is subtracted from the outputs of all rows in a differential fashion in the digital domain, rejecting any common-mode signals.

Most of the power in the computational array is dissipated in driving CLs. If CLs are driven by conventional CMOS inverters, the power dissipated in the array is proportional to the frequency, array capacitance and the square of the supply voltage. As described in Section I, this power is lost and cannot be recovered. To reduce the energy dissipated in the array, instead of being driven by CMOS inverters, all CLs are selectively coupled to an off-chip inductor such that the energy needed for computing can be adiabatically recycled by means of resonance, as described next.

III. RESONANT POWER GENERATION

The array capacitance together with an external inductor form an LC resonator, driven by an external clock CLK at resonance frequency to generate the hot clock power supply waveform $V_{HC}(t)$ in Fig. 1(d). Resistive losses in the adiabatic line drivers of the charge-mode array are minimized by keeping the hot clock oscillation frequency sufficiently low. High computational throughput is nevertheless maintained by a fine-grain parallel architecture of the processor. The massive parallelism also allows the on-chip load capacitance to be maintained at or near its mean value, tuned at resonance where the energy dissipation in the tank is lowest.

The efficiency of resonant power generation is thus limited by resistive losses in the tank and variability of on-chip load capacitance. Each of these limitations is analyzed next.

Fig. 3. Lossy LC oscillator and switch with fixed load capacitance $C$.

A. Tank Resistive Losses

A simple model of a constant-capacitance LC oscillator used to generate the hot clock voltage, $V_{HC}(t)$, is shown as an RLC circuit in Fig. 3, where $C$ is the load capacitance implied by the charge-mode array, $L$ is the tank inductor, and $R$ represents parasitic resistive losses in the tank. The tank resistance $R = R_L + R_C$ decomposes into two contributions: parasitic resistance in the inductor $R_L$ due to its finite quality factor $Q_L = \omega L / R_L$; and parasitic resistance in the capacitor $R_C$ accounting for non-zero on-resistance of the adiabatic line drivers represented by the IN switch in Fig. 1(d). The parasitic shunt resistance $R_S$ accounts for non-zero on-resistance of the switch S, when it is activated. The switch S is used to initiate and maintain oscillations by periodically discharging $C$ to ground. It is activated by a narrow pulse pullHC. The step response of the RLC circuit with small damping factor (in the limit for small $R$) is of the form:

$$V_{HC}(t) = V \left[ 1 - e^{-\frac{R}{2L} t} \cos\!\left( \frac{t}{\sqrt{LC}} \right) \right]. \tag{2}$$

Switch S dissipates energy if the voltage across the capacitor is non-zero when it goes active. This energy is minimized by pulsing pullHC at the minima of the LC tank voltage, and thus at the resonant frequency $f = (2\pi\sqrt{LC})^{-1}$, as shown in Fig. 3 for $T = 1/f$. Resistive losses in $R$ cause minima of the voltage $V_{HC}(T)$ at the next pullHC pulse to be non-zero as described by the exponential envelope:

$$V_{HC}(T) = V \left( 1 - e^{-\pi R \sqrt{C/L}} \right).$$

Assuming a constant value of $C$, the dynamic energy, $\frac{1}{2} C V_{HC}^2(T)$, dissipated in each computation cycle is thus given by:

$$E|_{C=\mathrm{const}} = \frac{1}{2} C \left[ V \left( 1 - e^{-\pi R \sqrt{C/L}} \right) \right]^2. \tag{3}$$

The choice of a minimum capacitance value is obvious. As for inductance, in theory, for a given load capacitance $C$, the dynamic energy dissipation can be made arbitrarily small by increasing $L$, as is evident from (3). In practice, the dynamic energy dissipation asymptotically approaches a finite value, determined by the quality of the inductor, as the parasitic resistance of a wire-wound inductor $R_L \propto \sqrt{L}$ dominates the total resistance $R$ for large $L$. (Footnote 1: For an integrated, spiral-wound inductor, the dynamic energy dissipation increases for large $L$.) Thus increasing $L$ beyond a certain level may not be justifiable, as it yields diminishing reduction in energy dissipation as shown in Fig. 4 but results in a lower oscillation frequency and thus lower throughput.

[Plot: energy (pJ) versus inductance $L$ (mH).]
Fig. 4. Energy dissipation asymptotically approaching a finite non-zero value determined by the quality of inductor.

B. Switch Resistance and Pulse Width

The shunt resistance of the switch $R_S$, to first order, does not contribute losses and does not affect the efficiency of the hot clock supply generator, provided that $R_S$ is sufficiently small. During the pullHC pulse, the load capacitance $C$ is discharged through the series resistor combination $R_C + R_S$. Incomplete settling in this RC network implies incomplete compensation of the exponential decay in the sinusoidal hot clock waveform, leading to a reduced-amplitude hot clock and further resistive losses. A sufficiently small value of $R_C + R_S$, and a sufficiently large pulse duration of pullHC, ensure that $V_{HC}$ settles close to zero.

As the sine wave of $V_{HC}(t)$ is approximately quadratic around its minima, the energy in (3) is insensitive to the pulse width variation of pullHC near its minima. This allows for a relatively large pulse width resulting in small energy losses. For larger pulse widths, the current through the inductor significantly affects the resonant clock waveform, which extends outside the $2V_{dd}$ interval and incurs extra energy losses.

C. Load Capacitance Variability

Fig. 5. Lossless LC oscillator and switch with variable load capacitance $C$.

A simple model of a variable-capacitance LC oscillator is shown in Fig. 5, where $C$ has a mean value of $\bar{C}$, and resistive losses as modeled in Section III-A are here assumed zero for simplicity. Signal pullHC is pulsed at the LC tank mean-capacitance resonant frequency $\bar{f} = (2\pi\sqrt{L\bar{C}})^{-1}$. However, the instantaneous LC tank resonant frequency, $(2\pi\sqrt{LC})^{-1}$, depends on the load capacitance $C$, causing the pulse pullHC activating S to miss the minima of the oscillations when $C$ deviates from $\bar{C}$, as shown in Fig. 5.
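The two limitations analyzed in this section can be explored numerically. The sketch below evaluates the resistive-loss expression (3) for a wire-wound inductor with $R_L \propto \sqrt{L}$, and counts how many tank oscillations fit into one pullHC period when the load capacitance differs from its mean. All component values, the proportionality constant `K_WIRE`, and the function names are illustrative assumptions, not figures from the paper.

```python
import math


def resonant_frequency(inductance, capacitance):
    """f = 1 / (2*pi*sqrt(L*C)) for an ideal LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))


def resistive_energy(capacitance, inductance, resistance, v_step):
    """Energy per cycle from Eq. (3), constant load capacitance."""
    decay = math.exp(-math.pi * resistance * math.sqrt(capacitance / inductance))
    residual = v_step * (1.0 - decay)  # V_HC(T) left on C at the next pulse
    return 0.5 * capacitance * residual ** 2


def cycles_before_pulse(c_mean, c_actual):
    """Tank oscillations completed in one period T = 2*pi*sqrt(L*c_mean)."""
    return math.sqrt(c_mean / c_actual)


# Illustrative values only; none of these numbers are taken from the paper.
C_LOAD = 100e-12   # farads
V_STEP = 1.0       # volts
R_FIXED = 10.0     # ohms, driver and switch contribution (assumed)
K_WIRE = 50.0      # ohms per sqrt(henry): assumed R_L = K_WIRE * sqrt(L)

inductances = [1e-3, 4e-3, 16e-3, 64e-3]
energies = [
    resistive_energy(C_LOAD, L, R_FIXED + K_WIRE * math.sqrt(L), V_STEP)
    for L in inductances
]
# Limit of Eq. (3) for large L, where R is dominated by K_WIRE * sqrt(L).
floor_decay = math.exp(-math.pi * K_WIRE * math.sqrt(C_LOAD))
energy_floor = 0.5 * C_LOAD * (V_STEP * (1.0 - floor_decay)) ** 2

for L, energy in zip(inductances, energies):
    freq_khz = resonant_frequency(L, C_LOAD) / 1e3
    print(f"L = {L * 1e3:5.1f} mH  f = {freq_khz:7.1f} kHz  E = {energy:.3e} J")
print(f"asymptotic floor = {energy_floor:.3e} J")
```

Under these assumptions the energy falls with each fourfold increase in $L$ but stays above a non-zero floor, while the oscillation frequency halves each time. A non-integer value of `cycles_before_pulse` means the pulse arrives away from a voltage minimum.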
Substituting $t = T = 2\pi\sqrt{L\bar{C}}$ into (2) and ignoring resistive losses yields the instantaneous voltage on $C$ just before it is discharged to ground by S:

$$V_{HC}(T) = V \left[ 1 - \cos\!\left( 2\pi \sqrt{\bar{C}/C} \right) \right].$$

The dynamic energy $\frac{1}{2} C V_{HC}^2(T)$ dissipated each computation cycle,

$$E|_{R=0} = \frac{1}{2} C \left[ V \left( 1 - \cos\!\left( 2\pi \sqrt{\bar{C}/C} \right) \right) \right]^2, \tag{4}$$

is plotted in Fig. 6(a), along with the corresponding hot clock waveforms in Fig. 6(b). When $C = \bar{C}$ (case K) or $C = \bar{C}/4$ (case I), $V_{HC}(t)$ completes one or two full oscillation(s), respectively, before pullHC is pulsed, so no energy is dissipated in S. At the minimum point with the widest concavity region, the dynamic energy dissipation approaches zero as the load capacitance $C$ approaches its mean $\bar{C}$ (point K in Fig. 6).

[Plot (a): energy per cycle (pJ) versus $C/\bar{C}$, with points I, J and K marked. Plot (b): $V_{HC}(t)$ and pullHC waveforms over one period $T$ for cases I, J and K.]
Fig. 6. (a) Dynamic energy dissipation in switch S of a lossless varying-capacitance LC oscillator, and (b) corresponding $V_{HC}(t)$ waveforms.

Adding an external bypass capacitor, $C_B$, in parallel with $C$ increases the total load capacitance to $C + C_B$. The addition of a sufficiently large $C_B$ attenuates the effect of capacitive variations in the array on oscillation frequency and hence energy dissipation. In theory, without resistive losses, such an oscillator would always operate at its ideal point, point K in Fig. 6, and dissipate zero energy. In practice, the energy dissipation due to both resistive losses and $C$ variation must be considered:

$$E = \frac{1}{2} C \left[ V \left( 1 - e^{-\pi R \sqrt{\bar{C}/L}} \cos\!\left( 2\pi \sqrt{\bar{C}/C} \right) \right) \right]^2. \tag{5}$$

As shown in Fig. 7, adding $C_B$ desensitizes the dynamic energy dissipation to $C$ variation at the cost of increasing the resistive energy dissipation. Thus an external capacitor was not utilized in this design.

[Plot: energy per cycle (pJ) versus $(C + C_B)/(\bar{C} + C_B)$ for increasing $C_B$.]
Fig. 7. Increasing the bypass capacitor $C_B$ desensitizes dynamic energy dissipation to $C$ variation.

D. Parallel Architecture

As shown in Section III-A and in Fig. 4, the resistive losses can be reduced by increasing inductance, which also reduces oscillation frequency. In order to maintain high computational throughput, a parallel array-based architecture is needed to perform large numbers of operations each clock cycle. Furthermore, as shown in Section V, data-dependent load statistics over large numbers of inputs in the array allow the array load capacitance to be maintained at or near a constant value (at point K in Fig. 6) with approximately half of all cells active at any time. This minimizes dynamic losses not only in the array, but also in the resonant clock generator as shown in Section III-C. Next, we present a massively-parallel array, first introduced in Section II, that implements mixed-signal resonant adiabatic computing over large numbers of charge-coupled transistor pairs as shown in Fig. 1(d).

IV. ADIABATIC ARRAY PROCESSOR

The hot clock supply generator sees the array of MAC cells as a variable load capacitance $C$ as in Fig. 5, where the variation in $C$ is implied by variations in input. As demonstrated further below, these variations are kept at a minimum by virtue of the parallel nature of the computation. The architecture, circuits and implementation of the processor are described next.

A. Circuits

Fig. 8. Circuit diagram of functions peripheral to the cell including store, refresh and charge-recycling adiabatic compute.

Fig. 8 shows the block diagram of the array peripheral functions with signal paths for store, refresh, compute and charge recycle functions marked. Two columns, the $n$-th and the $(n+1)$-th, of the first row are shown. Matrix coefficients are loaded into the dynamic random access memory (DRAM) from a shift register in the store phase. The CID/DRAM cells on folded bit-lines (BLs) are periodically refreshed after several compute cycles, alternating between even and odd columns with separate word-lines (WLs). In the compute cycle the input data, $X_n$, $n = 0, \ldots, N-1$, enable adiabatic energy recovery logic (ERL) drivers. They conditionally connect the off-chip off-the-shelf inductor to the on-chip capacitance of active compute-lines (CLs) to enable charge recycling through resonance.

The capacitance of all active CLs is utilized to perform adiabatic computing on the full array as schematically shown in Fig. 9(a). The LC tank is replenished with external energy from the DC voltage source $V_{dd}$ by pulsing pullHC at the minima of the voltage waveform $V_{HC}(t)$. A doubled dynamic range of $2V_{dd}$ is thus obtained. Signal pullHC also serves to synchronize the hot clock waveform to other circuits in the processor.

The choice of the frequency of signal pullHC is important. As discussed in Section III-C, variations in the total CL capacitance cause the frequency of tank oscillation to deviate from that of signal pullHC, $f = 1/T$, resulting in additional energy losses. One solution to this problem is to use differential coding of data, with complementary inputs and complementary stored coefficients. This ensures that exactly half of all CLs are connected to the inductor. The capacitance of each CID/DRAM cell is approximately identical, regardless of whether charge is stored as determined by the binary matrix element value. This invariance is due to the fact that transistor M1 operates either in strong inversion or in accumulation mode, with approximately the same gate capacitance. Thus, by ensuring that always half of all CLs are active, the array capacitance is kept constant. This approach, however, requires twice the number of cells and thus doubles the silicon area.
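The differential-coding argument above can be checked with a short sketch. It appends the complement of every input bit, so the number of active compute-lines is fixed, and evaluates the lossless mismatch energy of (4) for a single-ended and a differential drive. The sketch assumes that the load capacitance is proportional to the number of active CLs and uses normalized units; the function names and the random test vector are hypothetical.

```python
import math
import random


def differential_code(bits):
    """Append the complement of every bit: x -> (x, not x)."""
    return bits + [1 - b for b in bits]


def mismatch_energy(c_ratio, c_mean=1.0, v_step=1.0):
    """Eq. (4): switch loss of a lossless tank with C = c_ratio * c_mean."""
    c_actual = c_ratio * c_mean
    phase = 2.0 * math.pi * math.sqrt(c_mean / c_actual)
    residual = v_step * (1.0 - math.cos(phase))  # V_HC(T) at the pulse
    return 0.5 * c_actual * residual ** 2


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1024
x = [rng.randint(0, 1) for _ in range(n)]

# Single-ended drive: the active-line count follows the data; the mean
# load is taken as n/2 active lines (assumed proportional capacitance).
ratio_single = sum(x) / (n / 2)

# Differential drive: exactly n of the 2n lines are active for any x.
coded = differential_code(x)
ratio_differential = sum(coded) / n

print(ratio_single, mismatch_energy(ratio_single))
print(ratio_differential, mismatch_energy(ratio_differential))
```

With differential coding the capacitance ratio is exactly one, so the lossless model predicts no mismatch loss, at the cost of twice as many lines.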
Instead, we observe that in typical data, such as images, the probability of a binary coefficient being zero or one is approximately half for most of the coefficients. This implies that the number of logic-one bits in the input vector is typically approximately half. In the Central Limit, the number of logic-one bits in an $N$-dimensional binary vector follows a binomial distribution approximated by a Gaussian distribution with mean $N/2$ and standard deviation $\sqrt{N}/2$. Hence the relative width of the distribution tends to zero for large $N$. This property of typical data is exploited here in order to minimize energy losses in the LC tank due to array capacitance variability, as validated in Section V. For applications where many binary coefficients are non-Bernoulli, we have developed a simple stochastic data modulation scheme to pseudo-randomize input data with any statistics at the expense of a small modulation and demodulation overhead.

[Schematic (a): inductor $L$ between $V_{dd}$ and $V_{HC}(t)$, ERL drivers for inputs $X_0$ to $X_{1023}$ with loads $C_{CL0}$ to $C_{CL1023}$, and a 102.4µm/0.6µm switch driven by pullHC; $V_{HC}(t)$ swings between 0 and $2V_{dd}$. Schematic (b): pass gate between $L$ and $CL_n$, all devices 1.2µm/0.6µm.]
Fig. 9. (a) Resonant clock generator for adiabatic power supply, and (b) input-enabled energy recovery logic (ERL) driver.

The circuit diagram of a modified ERL driver is shown in Fig. 9(b). When the input vector component bit, $X_n$, is logic-one, the corresponding compute-line, $CL_n$, is connected to the inductor through a pass gate. A pass gate is utilized in order to realize an energy-efficient fully-adiabatic driver. As the maximum voltage on the inductor is $2V_{dd}$, while the logic-one level of $X_n$ is $V_{dd}$, a cross-coupled PMOS transistor pair ensures that the pass gate is turned off completely when $X_n$ is low. High-voltage devices are used to accommodate the doubled dynamic range. The signal pullHC is synchronized with the clocks for all peripheral circuits, generated from the same master clock. Tuning of the resonance condition is achieved either by tuning the master clock frequency, or by adjusting the value of the external inductance. Integrating adaptive mechanisms for tuning may further reduce power dissipation, especially for highly variable data or for off-the-shelf inductors with a large spread of values, but with a potentially significant overhead.

The transistor sizes shown in Figs. 9(a) and 9(b) are determined as follows. Referring to Fig. 3, $R_S$ corresponds to the resistance of the nMOS switch driven by signal pullHC in Fig. 9(a), and $R_C$ corresponds to the parallel combination of the ERL driver on-resistances in Figs. 9(a) and 9(b). The value of $R_S$ is chosen such that the circuit operates at an optimum point where the resistance of the switch driven by pullHC is small enough to keep resistive losses in it small, and the capacitance of its gate is small enough to keep the energy needed to drive it small. In this design, $R_S = 10\,\Omega$ and the gate capacitance of the switch is 270 fF. To minimize resistive losses, $R_C$ has to be small, but there is little benefit in making it much smaller than $R_S$. The sizing shown in Fig. 9(b) yields an average pass-gate resistance in an ERL driver of less than 5 kΩ. With approximately half of the ERL drivers active for typical inputs (see below), the corresponding value for $R_C$ is less than 10 Ω, as needed to balance losses in $R_C$ and $R_S$ under silicon area constraints.

B. Implementation

The integrated prototype of the mixed-signal adiabatic vector-matrix multiplication (VMM) processor depicted in Fig. 10 occupies 4×4 mm² in 0.35-µm CMOS. The processor consists of four self-contained cores. Each core contains 128×256 CID/DRAM computational storage elements, a row-parallel bank of 128 8-bit ΔΣ algorithmic analog-to-digital converters (ADCs), pipelined input shift registers, sense amplifiers, refresh logic, and scan-out logic. All of the supporting digital clocks and control signals are generated on the chip. The modular architecture allows the four cores to operate in 1×4, 2×2, and 4×1 configurations to compute 128×1024, 256×512, and 512×256 dimensional binary vector-matrix products respectively. This flexibility is necessary in implementing linear transforms with various input and output dimensions.

Fig. 10. Adiabatic VMM processor micrograph and floorplan. Fabricated in a standard 0.35-µm CMOS process, the processor occupies 4×4 mm².

Fig. 11. Examples of faces and non-faces correctly classified by the prototyped VMM processor from a face detection experiment.

V. EXPERIMENTAL RESULTS

The processor functionality was validated in a template-based face detection application. Real-time detection of objects such as faces on a low-power wearable platform makes it possible to implement miniature visual aids for the blind.

... implies that the sum of logic-one bits in a fragment of a typical image (the same as the number of logic-one bits) follows a normal (Gaussian) distribution with low variance. Over 95 percent of the data in Fig. 12 fall within less than 18 percent of the entire input range, within two standard deviations of the mean. Points labeled µ, −2σ, and +2σ are fitted parameters based on an ideal normal distribution and mark the mean and two standard deviations spread, evaluated over the face data set. The corresponding experimentally measured hot clock waveforms are shown on the top of Fig. 12. The hot clock oscillates at a frequency determined by the number of compute-lines (CLs)
Template-based pattern recog- connected to the external inductor plus all the parasitic nition is computationally expensive as it requires capacitance in the hot clock path. The pullHC signal matching of each input with a set of characteristic frequency and its duty-cycle are tuned to coincide with templates. The parallel processing architecture lends the minima of the hot clock oscillations when half itself naturally to such an application. A pattern recog- of the inputs, Xn , are active. As input data deviates nition engine was trained off-line on a face recogni- from this mean, pullHC misses the minimum voltage tion data set distributed by the Center for Biological point in discharging the tank capacitor, increasing the and Computational Learning (CBCL) at MIT. 2 The dynamic energy dissipation. classiﬁer was then programmed on the processor with Fig. 13(a) shows the experimental setup utilized for visual templates stored in the CID/DRAM array. Inner- measuring power consumption of the array. The array product based similarities between each input and all is conﬁgured to operate in one of the two modes, templates were computed on the array. Both inputs adiabatic and static, for comparative purposes. DC and templates are 11 × 11 pixel image segments. Ex- current delivered by the DC power supply is measured perimentally, we validated that the processor produces in each case. In the adiabatic mode, each active CL classiﬁcation results on an out-of-class test set that are is driven by the hot clock through the pass gate of identical to those obtained by emulation in software, an energy recovery logic (ERL) driver. The product testifying to the robustness of the architecture and of measured average current through the DC supply circuit implementation. For this task, perfect classi- and its voltage V dd represents the total measured ﬁcation was obtained. 
A few examples of the correct power which includes the losses in the resonant tank classiﬁcations of faces and non-faces by the processor supply generator, implemented using an external wire- are given in Fig. 11. wound inductor, as well as in the ERL drivers and Fig. 12 shows typical statistics of images from the CID/DRAM array. Power dissipated to generate and MIT CBCL face data set. Most of natural scene images drive the signal pullHC in the adiabatic mode is have binary coefﬁcients which are equally probable small compared to the power dissipated in the clock (Bernoulli distributed, P (0) = P (1) = 0.5). This generator and the array . 2 http://cbcl.mit.edu/software-datasets/index.html In the static mode, the CLs are driven by an external IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 42, NO. 11, NOVEMBER 2007 11 VHC(t) VHC(t) VHC(t) shown in Figs. 6(a) and 7. The probability density distribution of the number of active inputs for the MIT CBCL face data set is also shown in Fig. 13(b). For the MIT CBCL face data set the adiabatic processor yields experimentally measured 0.06 computational energy efﬁciency of 480 GMACS/mW. µ This number is obtained by multiplying the measured • 0.05 energy efﬁciency of the array by the corresponding Data Probability Density MIT CBCL face data probability density function for 0.04 each number of active inputs and adding the results together. For the same data, the processor yields en- frag replacements 0.03 ergy efﬁciency of 19 GMACS/mW when conﬁgured in the static mode. This corresponds to a 25-fold 0.02 improvement in energy efﬁciency. The processor per- forms 128×256 binary multiply-and-accumulate op- 0.01 erations on each of the four arrays corresponding to +2σ 1.8 GMACS computational throughput at 13.7 kHz hot 95% of Inputs −2σ • 0 • clock frequency. 
128 256 384 512 640 768 896 1024 Contributions of subthreshold leakage, junction Number of Active Inputs leakage or gate tunneling to overall power dissipated in the array are insigniﬁcant. Scaling the design to deep Fig. 12. Probability density of the number of active inputs for submicron technologies may require additional design the MIT CBCL face data (bottom) and corresponding experimen- tally measured hot clock waveforms (top). The nominal hot clock considerations such as negative voltage gate biasing frequency is 13.7kHz. The peak-to-peak voltage amplitude is 3.3V and low-voltage junction biasing. In general, compared with 1.65V power supply. to high-speed digital designs, low power dissipation of the array maintains lower temperature of the die and thus lower leakage currents. digital signal CLK through CMOS inverters (only one Not included in the MAC array and supply generator inverter is shown for simplicity), with ERL drivers power is the power dissipated in the ADCs, and functioning as static CMOS buffers, as shown in other peripheral functions such as shift registers which Fig. 13(a). In this mode the inductor is shorted and can be efﬁciently implemented using conventional the value of the supply voltage is increased to 2V dd to digital adiabatic design techniques. The bank of 512 yield the same voltage swing. The power in the static ADCs  including non-adiabatic clock generators mode is measured as the product of the average current measures 6.3mW of power dissipation from a 3.3V through the CMOS inverters as shown, multiplied by supply, at 15kHz parallel sample rate. Even though the DC voltage 2V dd supplying this current. Power this ADC design yields adequate energy efﬁciency of dissipated to generate and drive the signal CLK in the 3.2 pJ per sample per quantization level, this power static mode is similar to that for the signal pullHC level is orders of magnitude larger than that of the in the adiabatic mode. 
Both are small and thus are adiabatic array and resonant supply. In the present omitted from the comparative analysis. prototype the ADCs were included for convenience Fig. 13(b) shows energy consumption per computa- of characterization. For applications requiring quan- tion of the CID/DRAM array in the static mode and tized outputs, the challenge is to extend the mixed- in the adiabatic mode as a function of the number signal adiabatic VMM principle to implement adia- of active inputs (number of logic-one bits in the in- batic analog-to-digital conversion. Possible directions put vector). Theoretical, simulated and experimentally for adiabatic ADC design are charge-redistribution measured results are plotted. The experimental data ADCs  or charge-based folding ADCs . Other were measured utilizing the testing setup depicted in applications in pattern classiﬁcation, such as vector Fig. 13(a). In the static mode, as expected the energy quantization or nearest neighbor classiﬁcation, call for consumption per computation is a linear function of winner-take-all (WTA) or rank-ordered selection of the total capacitance of active CLs. In the adiabatic best template matches. WTA selection is efﬁciently mode, the energy consumption per computation is a implemented using a cascade of comparators, and non-monotonic function of the the total capacitance of potentially adiabatically implemented in the charge active CLs matching that described by Eqn. (5) and domain . PSfrag replacements 12 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 42, NO. 
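The headline figures quoted in the Circuits, Implementation and Experimental Results text can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check under stated assumptions, not part of the authors' measurement flow: all function and constant names are hypothetical, the pass-gate resistance is taken at its quoted 5 kΩ upper bound, and the input statistics are idealized as 1024 independent equiprobable bits (the measured face data in Fig. 12 are wider than this ideal).

```python
from math import ceil, comb, floor, sqrt

# Values quoted in the text; the names are illustrative, not from the paper.
CORES, ROWS, COLS = 4, 128, 256        # four cores of 128x256 cells
HOT_CLOCK_HZ = 13.7e3                  # nominal hot clock frequency
EFF_ADIABATIC = 480e9 / 1e-3           # 480 GMACS/mW expressed in MAC/J
EFF_STATIC = 19e9 / 1e-3               # 19 GMACS/mW expressed in MAC/J
PASS_GATE_OHMS = 5e3                   # quoted upper bound per ERL driver
ADC_COUNT, ADC_POWER_W = 512, 6.3e-3   # ADC bank size and measured power
ADC_RATE_HZ, ADC_BITS = 15e3, 8        # parallel sample rate, resolution


def throughput_macs_per_s() -> float:
    """One MAC per cell per hot clock cycle."""
    return CORES * ROWS * COLS * HOT_CLOCK_HZ


def energy_per_mac_fj(efficiency_mac_per_joule: float) -> float:
    """Energy per MAC in femtojoules (reciprocal of the efficiency)."""
    return 1e15 / efficiency_mac_per_joule


def parallel_driver_resistance(n_active: int) -> float:
    """R_C for n_active identical pass gates connected in parallel."""
    return PASS_GATE_OHMS / n_active


def adc_energy_per_level_pj() -> float:
    """ADC energy per sample per quantization level, in picojoules."""
    samples_per_s = ADC_COUNT * ADC_RATE_HZ
    return ADC_POWER_W / (samples_per_s * 2 ** ADC_BITS) * 1e12


def mass_within_k_sigma(n: int, k: float) -> float:
    """Exact probability that a Binomial(n, 1/2) count of active inputs
    lies within k standard deviations (sigma = sqrt(n)/2) of n/2."""
    sigma = sqrt(n) / 2
    lo, hi = ceil(n / 2 - k * sigma), floor(n / 2 + k * sigma)
    return sum(comb(n, j) for j in range(lo, hi + 1)) / 2 ** n


if __name__ == "__main__":
    macs = throughput_macs_per_s()
    print(f"throughput        : {macs / 1e9:.2f} GMACS")
    print(f"adiabatic energy  : {energy_per_mac_fj(EFF_ADIABATIC):.2f} fJ/MAC")
    print(f"static energy     : {energy_per_mac_fj(EFF_STATIC):.2f} fJ/MAC")
    print(f"improvement       : {EFF_ADIABATIC / EFF_STATIC:.1f}x")
    print(f"array power       : {macs / EFF_ADIABATIC * 1e6:.2f} uW")
    print(f"R_C at 512 active : {parallel_driver_resistance(512):.2f} ohm")
    print(f"ADC energy        : {adc_energy_per_level_pj():.2f} pJ/level")
    print(f"mass within 2sigma: {mass_within_k_sigma(1024, 2):.3f}")
```

Under these assumptions the quoted numbers are mutually consistent: the array power implied by the throughput and the adiabatic efficiency is in the microwatt range, which is why the 6.3 mW ADC bank is described as orders of magnitude larger.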
Fig. 13. (a) Experimental setup for measuring supply current, and corresponding power consumption of the array in adiabatic and static modes. (b) Theoretical, simulated and experimentally measured energy consumption per computation cycle of the array as a function of input data statistics in the adiabatic mode and in the static mode. The MIT CBCL face data set statistics are shown in gray. [Panels in (a): adiabatic mode, static mode. Axes in (b): energy (fJ/MAC) and data probability density versus number of active inputs.]

The measured adiabatic VMM processor characteristics are summarized in Table I.

TABLE I
MEASURED CHARACTERISTICS

Technology                 0.35 µm CMOS
Supply Voltage             1.65 V
Die Area                   4×4 mm²
Array Area                 2.7×1.8 mm²
CID/DRAM Cell Area         9.9×3.6 µm²
CID/DRAM Cell Count        131,072
Throughput                 1.8 GMACS at 13.7 kHz
Array Energy Efficiency    19 GMACS/mW (static mode)
                           480 GMACS/mW (adiabatic mode)
Output Resolution          8 bits
Column Mismatch            ±1 LSB for 97% of columns

VI. CONCLUSION

We have shown that an array of simple multiply-and-accumulate cells, consisting of charge-coupled transistor pairs, constitutes a virtually lossless capacitive load to a resonant hot clock generator, leading to a significant (25-fold) improvement in energy efficiency over a lossy driven system where adiabatic line drivers are replaced with CMOS logic drivers. The 4mm×4mm, 512×256-cell array in 0.35-µm CMOS delivers 480 GMACS ($4.8 \times 10^{11}$ multiply-and-accumulates per second) for every mW of power. Minimum energy dissipation requires low-resistance line drivers, but does not require low-resistance switching in the resonant supply for a reasonably shaped, low duty cycle clock signal. Minimum energy also requires minimum variability in the capacitive load, which is ensured owing to the statistics of inputs controlling charge transfer in a large array of MAC cells.

The adiabatic array and resonant supply generator were embedded in a vector-matrix multiplication processor and demonstrated on a face detection task, with stored coefficients obtained by off-line training over example data. Further research is directed towards implementing ADC quantization or WTA selection in the adiabatic domain for a complete adiabatic mixed-signal system-on-chip. Applications include pattern recognition, data compression and CDMA matched filters.

REFERENCES

R. McGowen, C. A. Poirier, C. Bostak, J. Ignowski, M. Millican, W. H. Parks, and S. Naffziger, “Power and temperature control on a 90nm Itanium family processor,” IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 229–237, Jan. 2006.
B. H. Calhoun and A. P. Chandrakasan, “Ultra-dynamic voltage scaling (UDVS) using sub-threshold operation and local voltage dithering,” IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 238–245, Jan. 2006.
B. H. Calhoun and A. P. Chandrakasan, “A 256kb sub-threshold SRAM in 65nm CMOS,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2006, pp. 628–629.
A. Wang and A. P. Chandrakasan, “A 180mV subthreshold FFT processor using a minimum energy design methodology,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 310–319, Jan. 2005.
J. Kao, M. Miyazaki, and A. P. Chandrakasan, “A 175mv multiply-accumulate unit using an adaptive supply voltage and body bias architecture,” IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1545–1554, Nov. 2002.
W. C. Athas, J. G. Koller, and L. J. Svensson, “An energy-efficient CMOS line driver using adiabatic switching,” in Proc. IEEE Fourth Great Lakes Symposium on Design Automation of High Performance VLSI Systems, Mar. 1994, pp. 196–199.
A. G. Dickinson and J. S. Denker, “Adiabatic dynamic logic,” IEEE J. Solid-State Circuits, vol. 30, no. 3, pp. 311–315, Mar. 1995.
Y. Moon and D.-K. Jeong, “An efficient charge recovery logic circuit,” IEEE J. Solid-State Circuits, vol. 31, no. 4, pp. 514–522, Apr. 1996.
V. G. Oklobdzija, D. Maksimovic, and L. Fengcheng, “Pass-transistor adiabatic logic using single power-clock supply,” IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 44, no. 10, pp. 842–846, Oct. 1997.
D. Maksimovic, V. G. Oklobdzija, B. Nikolic, and K. Current, “Clocked CMOS adiabatic logic with integrated single-phase power-clock supply,” IEEE Trans. VLSI Systems, vol. 8, no. 4, pp. 460–463, Aug. 2000.
K. Suhwan and M. C. Papaefthymiou, “True single-phase adiabatic circuitry,” IEEE Trans. VLSI Systems, vol. 9, no. 1, pp. 52–63, Feb. 2001.
W. C. Athas, L. J. Svensson, and N. Tzartzanis, “A resonant signal driver for two-phase, almost-non-overlapping clocks,” in Proc. IEEE Int. Symp. on Circuits and Systems, vol. 4, May 1996, pp. 129–132.
W. C. Athas, N. Tzartzanis, W. Mao, L. Peterson, R. Lal, K. Chong, J.-S. Moon, L. J. Svensson, and M. Bolotski, “The design and implementation of a low-power clock-powered microprocessor,” IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1561–1570, Nov. 2000.
S. C. Chan, K. L. Shepard, and P. J. Restle, “Distributed differential oscillators for global clock networks,” IEEE J. Solid-State Circuits, vol. 41, no. 9, pp. 2083–2094, 2006.
E. Amirante, J. Fischer, M. Lang, A. Bargagli-Stoffi, J. Berthold, C. Heer, and D. Schmitt-Landsiedel, “An ultra low-power adiabatic adder embedded in a standard 0.13-um CMOS environment,” in Proc. IEEE European Solid-State Circuits Conf., Sept. 2003, pp. 599–602.
K. Suhwan, C. H. Ziesler, and M. C. Papaefthymiou, “A true single-phase 8-bit adiabatic multiplier,” in Proc. IEEE Design Automation Conf., 2001, pp. 758–763.
H. Yamauchi, H. Akamatsu, and T. Fujita, “An asymptotically zero power charge-recycling bus architecture for battery-operated ultra-high data rate ULSI’s,” IEEE J. Solid-State Circuits, vol. 30, no. 4, pp. 423–431, Apr. 1995.
M. Amer, M. Bolotski, P. Alvelda, and T. Knight, “160x120 pixel liquid-crystal-on-silicon microdisplay with an adiabatic DAC,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 1999, pp. 212–213.
C. H. Bennett and R. Landauer, “The fundamental physical limits of computation,” Scientific American, vol. 253, no. 1, pp. 38–46, 1985.
J. Lim, K. Kwon, and S.-I. Chae, “Reversible energy recovery logic circuit without non-adiabatic energy loss,” Electronics Letters, vol. 34, no. 4, pp. 344–346, Feb. 1998.
R. Genov, G. Cauwenberghs, G. Mulliken, and F. Adil, “A 5.9 mW 6.5 GMACs CID/DRAM array processor,” in Proc. IEEE European Solid-State Circuits Conf., Sept. 2002.
R. Karakiewicz, R. Genov, A. Abbas, and G. Cauwenberghs, “175 GMACS/mW charge-mode adiabatic mixed-signal array processor,” in Proc. IEEE Symposium on VLSI Circuits, June 2006, pp. 126–127.
A. Nakada, T. Shibata, M. Konda, T. Morimoto, and T. Ohmi, “A fully parallel vector-quantization processor for real-time motion-picture compression,” IEEE J. Solid-State Circuits, vol. 34, no. 6, pp. 822–830, 1999.
A. Kramer, “Array-based analog computation,” IEEE Micro, vol. 16, no. 5, pp. 40–49, Oct. 1996.
T. Shibata, T. Nakai, N. M. Yu, Y. Yamashita, M. Konda, and T. Ohmi, “Advances in neuron-MOS applications,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 1996, pp. 304–305.
T. Yamasaki, T. Nakayama, and T. Shibata, “A low-power and compact CDMA matched filter based on switched-current technology,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 926–932, Apr. 2005.
R. Sarpeshkar, “Analog versus digital: Extrapolating from electronics to neurobiology,” Neural Computation, vol. 10, pp. 1601–1638, 1998.
C. Neugebauer and A. Yariv, “A parallel analog CCD/CMOS neural network IC,” in Proc. IEEE Int. Joint Conference on Neural Networks, vol. 1, Seattle, WA, 1991, pp. 447–451.
V. Pedroni, A. Agranat, C. Neugebauer, and A. Yariv, “Pattern matching and parallel processing with CCD technology,” in Proc. IEEE Int. Joint Conference on Neural Networks, vol. 3, June 1992, pp. 620–623.
D. Maksimovic and V. G. Oklobdzija, “Integrated power clock generators for low energy logic,” in Proc. IEEE Power Electronics Specialists Conference, 1995, pp. 61–67.
R. Karakiewicz, R. Genov, and G. Cauwenberghs, “1.1 TMACS/mW load-balanced resonant charge-recycling array processor,” in Proc. IEEE Custom Integrated Circuits Conference, 2007.
R. Suarez, P. Gray, and D. Hodges, “Charge redistribution analog-to-digital conversion techniques – Part II,” IEEE J. Solid-State Circuits, vol. SC-10, no. 6, pp. 379–385, Dec. 1975.
R. Genov and G. Cauwenberghs, “Dynamic MOS sigmoid array folding analog-to-digital conversion,” IEEE Trans. Circuits and Systems I, vol. 51, no. 1, pp. 182–186, Jan. 2004.
K. Kotani and T. Ohmi, “Feedback charge-transfer comparator with zero static power,” in IEEE Solid-State Circuits Conference, Feb. 1999, pp. 328–329.

Rafal Karakiewicz (SM’03) received the B.A.Sc. and the M.A.Sc. degrees in Electrical Engineering from the University of Toronto, ON in 2003 and 2006 respectively. He is currently a design engineer at SNOWBUSH Microelectronics, Toronto, ON, Canada.

Roman Genov (SM’96-M’02) received the B.S. degree in Electrical Engineering from Rochester Institute of Technology, NY in 1996 and the M.S.E. and Ph.D. degrees in Electrical and Computer Engineering from Johns Hopkins University, Baltimore, MD in 1998 and 2003 respectively.
Dr. Genov held engineering positions at Atmel Corporation, Columbia, MD in 1995 and Xerox Corporation, Rochester, NY in 1996. He was a visiting researcher in the Laboratory of Intelligent Systems at Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland in 1998 and in the Center for Biological and Computational Learning at Massachusetts Institute of Technology, Cambridge, MA in 1999. He is presently an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Toronto, Canada. Dr. Genov’s research interests include analog and digital VLSI circuits, systems and algorithms for energy-efficient signal processing with applications to electrical, chemical and photonic sensory information acquisition, biosensor arrays, neural interfaces, parallel signal processing, adaptive computing for pattern recognition, and implantable and wearable biomedical electronics.
He received the Canadian Institutes of Health Research (CIHR) Next Generation Award in 2005, and the Dalsa Corporation Componentware Award in 2006. He served as a technical program co-chair of the IEEE Conference on Biomedical Circuits and Systems in 2007. He serves on the Advisory Board of the Department of Electrical and Computer Engineering at Rochester Institute of Technology, Rochester, NY. He is an Associate Editor of IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS.

Gert Cauwenberghs (SM’89-M’94-S’04) received the M.Eng. degree in applied physics from University of Brussels, Belgium, in 1988, and the M.S. and Ph.D. degrees in electrical engineering from California Institute of Technology, Pasadena, in 1989 and 1994.
He is Professor of Biology at University of California San Diego, La Jolla, where he directs the Integrated Systems Neuroscience Laboratory. Previously, he held positions as Professor of Electrical and Computer Engineering at Johns Hopkins University, Baltimore, Maryland, and as Visiting Professor of Brain and Cognitive Science at Massachusetts Institute of Technology, Cambridge. Dr. Cauwenberghs’ research aims at advancing silicon adaptive microsystems toward the understanding of biological neural systems and the development of sensory and neural prostheses and brain-machine interfaces. His activities include design and development of micropower analog and mixed-signal systems-on-chips performing adaptive signal processing and pattern recognition.
He is a Francqui Fellow of the Belgian American Educational Foundation, and received the National Science Foundation Career Award in 1997, Office of Naval Research Young Investigator Award in 1999, and Presidential Early Career Award for Scientists and Engineers in 2000. He serves on the Technical Advisory Board of GTronix, Inc., Fremont, CA. He was Distinguished Lecturer of the IEEE Circuits and Systems Society in 2003-2004, and chaired its Analog Signal Processing Technical Committee in 2001-2002. He currently serves as Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, and IEEE SENSORS JOURNAL.