# A 16-Bit Barrel-Shifter Implemented in Data-Driven Dynamic Logic $(D^3L)$

Ramin Rafati, Sied Mehdi Fakhraie, Member, IEEE, and Kenneth Carless Smith, Life Fellow, IEEE

Abstract—Data-driven dynamic logic $(D^3 L)$ uses local data instead of a global clock to maintain correct precharge and evaluation phases. Eliminating the clock from dynamic gates yields less power consumption and faster gate operation. Two 16-bit barrel shifters are implemented in a 5-V 0.6- $\mu$ m CMOS technology: one in normal Domino logic and the other in our proposed $D^3 L$ . Separate power leads are used on the chip to measure the power consumption of separate sections. Post-layout simulations show that, depending on input patterns, a $D^3 L$ shifter consumes 8% to 61% less power and is 29% faster than the Domino circuit. It also provides a 9% area advantage over its Domino rival. Experimental measurements confirm the post-layout simulation results and demonstrate the feasibility of the proposed logic.

Index Terms—Barrel shifter, data-driven dynamic logic  $(D^3L)$ , Domino logic, dynamic logic, low power design.

#### I. INTRODUCTION

IN CONVENTIONAL CMOS circuits, the required logic function is implemented twice, both in a pull-down network (PDN) and a pull-up network (PUN). To increase speed, dynamic logic normally replaces the PUN with a single transistor that is controlled by a global clock signal [1]. Compared to static CMOS logic, the input capacitance of every dynamic gate can be reduced by 50% or more. However, because an additional transistor (the footer transistor) must usually be placed in series with the remaining logic block, the speed generally does not double. Some designers have managed to remove the footer at the cost of making their circuits delay dependent [2]. This seriously damages logic portability among different generations of integrated circuit (IC) processing.

The other disadvantage of using dynamic logic is the excessive load on the clock signal that must be connected to every dynamic gate. Correspondingly, the increasing frequency of today's circuits also results in greater power consumption when logic is implemented in dynamic fashion. For example, in the Alpha 21164 microprocessor, the clock-distribution system consumes 20 W, which is 40% of the total dissipation of the processor [3]. As a result, the scope of dynamic logic is limited to those places, such as in data-path logic, where speed is a critical factor, and the power penalty is acceptable.

Digital Object Identifier 10.1109/TCSI.2006.883171

One solution for reducing the excessive load of the usual clock-tree network is to use local data instead of a global clock. The idea was first briefly introduced in [4], where dynamic gates precharged by a combination of clock and data are used to implement a binary look-ahead carry function. Using data for precharging a dynamic node decreases clock load and eliminates the need for a footer transistor. However, the resulting circuit has unequal data-pin capacitance loads, because the nodes involved in precharging drive heavier loads than they would in a normal dynamic gate precharged by a clock. Also, in these circuits, different input selections lead to different speed-power tradeoffs, an issue which is explored in Section IV of this paper. Following the basic idea of data-associated precharging, we have introduced the concept of data-driven dynamic logic $(D^{3}L)$ , in which a local combination of input data is used instead of a global clock signal. As a result, both the clocking signal and the associated transistors driven by the clock are removed from the dynamic gates.

The remaining sections of this paper are organized as follows. After introducing  $D^3L$  in the next section, methods for finding how to implement arbitrary functions in  $D^3L$  are discussed and then, the  $D^3L$  technique is demonstrated in implementation of a 16-bit barrel shifter. Next, experimental results are used to compare power, area, and speed of  $D^3L$  circuits against conventional Domino logic. Overall, we will show how a proper selection of precharging signals can create  $D^3L$  circuits which operate faster, yet consume less power than their Domino counterpart.

#### II. $D^3L$

To implement a specific function in conventional static CMOS logic, both the 0's and the 1's in the truth table must be covered. pMOS devices in a PUN and nMOS devices in a PDN combine to realize both the 0's and the 1's of the truth table (Fig. 1). By contrast, in dynamic logic, one of the output states of the truth table is established initially using a single transistor driven by a global clock. Correspondingly, dynamic circuit operation is divided into two distinct parts, the precharge and the evaluate phases. In the precharge phase, the output node is precharged to a particular level. Upon the start of the evaluation phase, depending on the state of the inputs, the output node will either be allowed to maintain the precharged state, or will be forced to the opposite level. The transition between the two values must be glitch-free, since dynamic gates rely on dynamic capacitive storage, in contrast to static gates, which provide continuous dc restoration.

Manuscript received February 3, 2005; revised August 24, 2005 and April 29, 2006. This paper was recommended by Associate Editor M. Stan.

R. Rafati is with SINA Microelectronics Inc., Technology Park of Tehran University, Tehran 14398-17435, Iran (e-mail: rrafati@sinamicro.com).

S. M. Fakhraie is with the School of Electrical and Computer Engineering, University of Tehran, Tehran 14395-515, Iran (e-mail: fakhraie@ut.ac.ir).

K. C. Smith is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON L5L 106, Canada.

Fig. 1. (a) Logic implementations of the NAND truth table. (b) Static. (c) Dynamic. (d) $D^3L$ .

A $D^3L$ gate operates in two phases, precharge and evaluate, nominally the same way as dynamic logic, but with the exception that a combination of inputs plays the role of the clock signal. In creating conventional dynamic logic gates, in which either the PDN or the PUN of static logic is removed, a set of conditions must be imposed on the circuit inputs. For example, in a Domino logic block, all of the inputs must be held low during the precharge phase. This suggests that if we can precharge the corresponding gate with a combination of input data, then the need for a clock signal can be eliminated. We call circuits using data precharging (rather than clock precharging) $D^3L$ . While maintaining the usual conditions enforced on the inputs of Domino and NP-CMOS circuits, in $D^3L$ we replace the clock signal by one input or a combination of inputs. An example of this replacement process in the transformation of a NAND gate is shown in Fig. 1(c) and (d). Suppose both inputs A and B are held at the low level in the precharge phase (the Domino condition). Awareness of this usual restriction enables us to eliminate the clock signal as shown in Fig. 1(d). During the precharge phase, when input A is low, node Out is precharged high. When signal A makes a possible transition from low to high, the evaluation phase begins. At this time, depending on the value of B, node Out conditionally discharges. Note that for this particular circuit, employing B instead of A leads to a similar final result.
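The precharge and evaluate behavior described for the NAND gate of Fig. 1(d) can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the gate is taken to be one pMOS driven by A and two series nMOS devices driven by A and B, the output node is assumed to hold its last value when neither path conducts, and the names `D3LNand`, `apply`, and `evaluate` are hypothetical.

```python
class D3LNand:
    """Behavioral model of a two-input D3L NAND gate precharged by input A.

    Assumption: one pMOS (gate = A) pulls Out high, two series nMOS
    devices (gates = A and B) pull Out low, and Out stores its last
    value when both paths are off.
    """

    def __init__(self):
        self.out = 1  # assume the node starts in the precharged state

    def apply(self, a: int, b: int) -> int:
        if a == 0:
            self.out = 1   # pMOS conducts: precharge phase
        elif b == 1:
            self.out = 0   # A = B = 1: the series nMOS path discharges Out
        # a == 1 and b == 0: both paths are off, Out keeps its stored value
        return self.out


def evaluate(a: int, b: int) -> int:
    """Precharge with both inputs low (the Domino condition), then evaluate."""
    gate = D3LNand()
    gate.apply(0, 0)
    return gate.apply(a, b)


# A glitch on B during evaluation is not recovered, which is why the
# inputs of a dynamic gate must be glitch-free.
glitchy = D3LNand()
glitchy.apply(0, 0)                 # precharge
glitchy.apply(1, 1)                 # B glitches high: Out discharges
stuck_value = glitchy.apply(1, 0)   # B returns low, but Out stays at 0
```

With a clean precharge followed by evaluation, `evaluate(a, b)` matches the NAND truth table, whereas `stuck_value` stays at 0 even though NAND(1, 0) is 1.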

Note that a variant of Domino logic that eliminates the clock transistor from the PDN is also presented in [1, p. 299]. However, since the complement of the precharging clock is not present in the pull-down logic, short-circuit power dissipation can occur during precharge. By contrast, this cannot happen in $D^3L$ , because the complements of the precharging signals are present as product terms in the pull-down network, which prevents any short-circuit current flow during the precharge phase. That Domino variant is therefore a less desirable option than $D^3L$ for a scalable, delay-independent design style.



Fig. 2. (a) Domino and (b)  $D^3L$  implementations of function  $F = G \cdot (A+B)$ .

#### III. IMPLEMENTATION OF VARIOUS FUNCTIONS IN $D^3L$

In general, whenever we have a function F in the product-of-sums form,  $F = \prod_{i=1}^{n} S_i$ , then the minimum  $S_i$  (the  $S_i$  with the minimum number of literals) in which all inputs have a low value during the precharge phase (the Domino condition), is used to replace the clock. This replacement procedure results in a minimum number of series transistors that must be placed in the PUN. Examples of this process are shown in Fig. 2(a) and (b).

The best case occurs when one of the $S_i$ terms has only one literal. In that case, only one transistor is used in the clock-replacement process. Note, also, that when F has only a single term in its product (that is, F is just one sum $S_1$ ), the need is for a static *n*-input OR gate. To obtain more speed than a static OR gate provides, one can use a Domino OR gate, which has less delay than a static one.
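The selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' tool: the representation of a product-of-sums function as a list of tuples of input names, and the function name `pick_precharge_term`, are assumptions made here.

```python
def pick_precharge_term(sum_terms, low_in_precharge):
    """Return the smallest sum term whose literals are all low in precharge.

    sum_terms: list of tuples of input names; F is the AND of the ORs.
    low_in_precharge: set of input names that satisfy the Domino condition.
    Returns None if no term qualifies (a clocked gate would then be needed).
    """
    candidates = [
        term for term in sum_terms
        if all(name in low_in_precharge for name in term)
    ]
    if not candidates:
        return None
    return min(candidates, key=len)


# F = G . (A + B), the function of Fig. 2, written as two sum terms.
f_terms = [("G",), ("A", "B")]

best = pick_precharge_term(f_terms, {"G", "A", "B"})   # -> ("G",)
fallback = pick_precharge_term(f_terms, {"A", "B"})    # -> ("A", "B")
```

The length of the returned tuple is the number of series pMOS devices that the rule places in the PUN, so a one-literal term such as `("G",)` is the best case described above.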

For longer chains of logic gates, we can always start a  $D^3L$  design from a Domino logic chain and then convert the individual gates using the above procedure. However, the first stage still requires a clock-driven gate to initiate proper precharge and evaluate sequences.

Using the above conversion techniques, a Domino barrel shifter was converted to a $D^3L$ one in [5], where an 18% power reduction was achieved. In [6], a technique similar to NP-CMOS was used for cascading a chain of N-logic-implemented gates followed by P-logic-implemented gates. This demonstrated the advantages of $D^3L$ in comparison to NP-CMOS logic, where a 35% reduction in power was observed.

Certain logic structures, such as multipliers, which contain inverting gates like XOR, cannot be easily implemented by the usual dynamic techniques. For those circuits, dual-rail dynamic implementation is the remedy. Such circuits can be transformed into dual-rail $D^3L$ (or $D^4L$ ), in much the same way as has been demonstrated for single-rail logic. The concept of $D^4L$ is used to implement a multiplier in [7], where its characteristics are compared against dual-rail Domino logic.

Thus, one can see that $D^3L$ covers a wide range of logic implementations, from static to dynamic, with flexibility of choice over power and speed. This tradeoff is illustrated by Fig. 3, which shows that by selecting different input combinations to control precharging, $D^3L$ behavior can extend from that of low-power static logic to become even faster than usual dynamic logic. If one's main concern is power, then static is the logic of choice, but for speed performance, the regions near (and above) the dynamic point are the better choice. $D^3L$ brings the advantage of flexible movement within the speed-power design space under the designer's control. Note that in Fig. 3, $D^3L$ can operate at a higher speed since it does not use the footer transistor required by usual dynamic logic. Of course, to reduce delay in regular dynamic design, one can eliminate the footer transistor through a technique like those used in clock-delayed Domino logic [8]. However, this leads to a delay-dependent design. Moreover, the same concept is applicable in $D^3L$ as well. In fact, knowing the delays of the signals, we can further reduce the $D^3L$ switch network. This results in the concept of delay-dependent $D^3L$ $(D^5L)$ , which is discussed in [9]. In the next section, we illustrate speed-power tradeoffs in the design and implementation of a barrel-shifter circuit.

Fig. 3. $D^3L$ relative position in our barrel shifter design.

#### IV. DESIGN OF 16-BIT BARREL SHIFTER

To investigate the advantages of $D^3L$ design at a system-building-block level, we have implemented a 16-bit barrel shifter in both dynamic logic and $D^3L$ styles, and have then compared their characteristics.

#### A. Design Specification

The basic operation of the desired barrel shifter is based on the logarithmic shifter architecture described in [1, p. 596], with additional right-shift and rotate capabilities [10]. It can shift/rotate 16-bit input data from 0 to 15 bits to the left/right, and send the result to the output. The shift operation is controlled by 6 bits: four bits for the length, one bit for direction, and one bit for type (shift/rotate). The shift-and-rotate array (*SARA*) and the control logic are the two distinct blocks of the barrel shifter. The former performs the actual shift-and-rotate task on the available data, while its controlling signals come from the control logic [10].

SARA occupies most of the chip area, and determines the critical path delay of the barrel shifter, whereas only a small percentage of the chip is occupied by the control logic. For this reason, only SARA is implemented in the dynamic and  $D^3L$  alternatives, and the control logic is purely static.

#### B. Shift-and-Rotate Array (SARA)

This module has been designed using five stages, each with sixteen cells, as illustrated in Fig. 4. The basic cell used in this array is an AO22 gate called *qmux*, whose symbolic representation is shown in Fig. 5(a). It implements the function $F = Ci_1 * In_1 + Ci_2 * In_2$ . Here $Ci_1$ and $Ci_2$ come from the control logic, whereas $In_1$ and $In_2$ are driven either by external inputs or by outputs of the previous stage of the shifter.

The first stage of the array is used for shifting or rotating data to the right. The next four stages of the array are used for shifting or rotating data from 0 to 15 positions to the left. The first of these four shifts/rotates data 0 or 1 position, the second 0 or 2 positions, the third 0 or 4 positions, and, finally, the fourth 0 or 8 positions.
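The four left-shifting stages can be sketched as a bit-level model built from the *qmux* function. This is a hedged illustration only: the wiring of each cell (unshifted bit on $In_1$, shifted bit on $In_2$, complementary controls, zero fill for a plain shift) is an assumption, the right-direction first stage is omitted because its wiring is not detailed here, and the names `WIDTH`, `left_stage`, and `shift_left` are hypothetical.

```python
WIDTH = 16


def qmux(ci1, in1, ci2, in2):
    """AO22 cell on single bits: F = Ci1*In1 + Ci2*In2."""
    return (ci1 & in1) | (ci2 & in2)


def left_stage(bits, k, enable, rotate):
    """One 16-cell stage that moves data 0 or k positions to the left.

    bits[i] is bit i (LSB first). Assumed wiring: In1 is the unshifted
    bit and In2 is the bit k positions lower; for a plain shift the
    wrapped-around positions receive 0 instead.
    """
    out = []
    for i in range(WIDTH):
        src = i - k
        if src >= 0:
            in2 = bits[src]
        else:
            in2 = bits[src % WIDTH] if rotate else 0
        out.append(qmux(1 - enable, bits[i], enable, in2))
    return out


def shift_left(value, length, rotate=False):
    """Shift or rotate a 16-bit value left by 0..15 via stages of 1, 2, 4, 8."""
    bits = [(value >> i) & 1 for i in range(WIDTH)]
    for stage, k in enumerate((1, 2, 4, 8)):
        bits = left_stage(bits, k, (length >> stage) & 1, rotate)
    return sum(b << i for i, b in enumerate(bits))


rotated = shift_left(0x8001, 1, rotate=True)   # 0x0003
shifted = shift_left(0x8001, 1)                # 0x0002
```

Each bit of the 4-bit length enables exactly one stage, so any amount from 0 to 15 is composed from the shifts of 1, 2, 4, and 8 positions.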

#### C. Domino Implementation of the SARA

Since SARA has non-inverting properties, the Domino style can be used directly in its dynamic implementation. In the precharge phase of Domino logic, the inputs of each gate must be set to the inactive state. This means that in the precharge phase, all four inputs of each *qmux* cell must be set to a low level. This is easily done by using the clock signal to force the outputs of the control logic to the low level during precharge time, as shown in Fig. 5(b). In the precharge phase (Domino\_CLK = 0), both $Ci_1$ and $Ci_2$ are forced to zero, whereas in the evaluation phase, they take on their actual values. The $In_i$ inputs of each *qmux* cell are also set to the low level through the previous cell's output inverter. Such a Domino cell configuration is shown in Fig. 6. As shown in the figure, a small keeper transistor is included to prevent possible charge-sharing problems and to deliver static clock-speed-independent outputs.

#### D. $D^{3}L$ Implementation of the SARA

In order to eliminate the clock signals from the *qmux* cells, we must substitute them with suitable combinations of inputs. Each of the four groups $(In_1, In_2), (Ci_1, Ci_2), (In_1, Ci_2)$ and $(In_2, Ci_1)$ can be considered in a replacement strategy. We note that one literal from each product term is required to implement the substitute control logic for $F = Ci_1 * In_1 + Ci_2 * In_2$ . Among the various clock-replacement options, the $(Ci_1, Ci_2)$ pair presents the lowest input-output capacitances, and it is used for implementing the corresponding $D^3L$ gate shown in Fig. 7. Employing the logic shown in Fig. 5(b), the outputs of the control logic are set low in the precharge phase ( $D^3L$ \_CLK = 0) to precharge the entire circuit. We note that $D^{3}L$ \_CLK is used only in the interface section. This mode of operation is more similar to that of the Domino circuit, as every *qmux* cell drives only one nMOS switch. Also, since all control signals are forced to zero at the same time, there is no precharge wave inside the circuit: all the nodes are precharged simultaneously once the control logic outputs are driven low. Moreover, due to the elimination of the clock-controlled footer transistor, this design is faster than its Domino rival. We have selected this method as our $D^3L$ candidate for the physical implementation, which is discussed in the next section.
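The requirement of one literal from each product term can be checked by exhaustive enumeration. The sketch below assumes an idealized switch-level view in which the PUN conducts when every literal of the chosen group is low and the PDN conducts when $F = 1$; the helper names `f` and `short_circuit_free` are hypothetical.

```python
from itertools import product

INPUTS = ("Ci1", "In1", "Ci2", "In2")


def f(v):
    """qmux function F = Ci1*In1 + Ci2*In2; the PDN conducts when F = 1."""
    return (v["Ci1"] & v["In1"]) | (v["Ci2"] & v["In2"])


def short_circuit_free(group):
    """True if the PUN (all literals in `group` low) never conducts
    at the same time as the PDN, for any input combination."""
    for values in product((0, 1), repeat=len(INPUTS)):
        v = dict(zip(INPUTS, values))
        pun_on = all(v[name] == 0 for name in group)
        if pun_on and f(v):
            return False
    return True


candidates = [("In1", "In2"), ("Ci1", "Ci2"), ("In1", "Ci2"), ("In2", "Ci1")]
results = {group: short_circuit_free(group) for group in candidates}

# A group that takes both literals from the same product term fails:
# with Ci1 = In1 = 0 the PUN is on, yet Ci2 = In2 = 1 turns the PDN on.
bad_group_ok = short_circuit_free(("Ci1", "In1"))
```

All four groups listed in the text pass this check, while `("Ci1", "In1")` does not, which is consistent with the rule that each product term must contribute one precharging literal.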

Note that, as another choice, the $(In_1, In_2)$ group can be selected for clock replacement, so that the stages of the barrel shifter are precharged with the external inputs $(I_0 - I_{15})$ . For this purpose, the inputs of the barrel shifter must be set low in the precharge phase. As an alternative, we can construct the first stage of the barrel shifter in the usual Domino style. In either case, the resulting low values at the inputs of the internal logic create a precharge wave, which is transferred to the outputs through the second stage, then the third stage, and so on. For each *qmux* cell, whenever the condition $In_1 = In_2 = '0'$ is satisfied, the corresponding cell is precharged. A possible high transition on either the $In_1$ or the $In_2$ input initiates the evaluation phase. The configuration of the resulting *qmux* cell in $D^3L$ design is shown in Fig. 8. The advantage of this configuration over the Domino implementation is its conditional evaluation: unlike the Domino gate, it does not enter the evaluation phase if both inputs remain at a low level in that phase. For randomized inputs, this configuration brings an 18% power advantage over a Domino implementation, as reported in [5]. On the other hand, the inputs $In_i$ are part of the critical path, and each $In_i$ drives both a pMOS and an nMOS device; hence, the total input-output delay of the barrel shifter increases compared to the Domino style, in which each *qmux* output drives only one nMOS device.

#### V. PHYSICAL IMPLEMENTATION

In order to show the advantages of $D^3L$ over Domino logic, SARA has been implemented in two different logic styles, using a 5-V 0.6- $\mu$ m CMOS technology. The chip block diagram shown in Fig. 9 contains $D^3L$ and Domino implementations of SARA. Control logic, which is statically implemented, prepares the controlling signals needed for both arrays, while, at its output, *interface* logic converts the signals into the appropriate forms to precharge $D^3L$ gates and satisfy the Domino condition. Since the outputs of the $D^3L$ interface have a higher load than those for the Domino logic, proper gate sizing has been applied to provide equal rise/fall times for both cases. Both implementations share inputs from external pins, and a *select* signal connects the selected *SARA* to the output pads. The chip microphotograph is shown in Fig. 10. The chip has been successfully tested up to 15 MHz (the maximum frequency of our test device), and its power consumption has been measured at various frequencies. Different power connections have been used to measure the power consumption of each individual block separately. The test results agree well with the post-layout simulations. In the following sections, power, area, and speed of $D^3L$ and Domino circuits are compared.

Fig. 4. SARA block diagram.

Fig. 6. *qmux* cell implementation in Domino.

Fig. 7. *qmux* cell implementation in the $D^3L$ methodology in which precharge is done by control signals.

Fig. 8. *qmux* cell implementation in the $D^3L$ methodology in which precharge is done by the inputs.

Fig. 9. Block diagram of the barrel shifter chip.

Fig. 10. Microphotograph of the barrel-shifter chip.

| | Input = random (average consumption) | | Input = 0000 (minimum consumption) | | Input = FFFF (maximum consumption) | |
|---|---|---|---|---|---|---|
| | $D^3L$ | Domino | $D^3L$ | Domino | $D^3L$ | Domino |
| $V_{DD}$-Logic | 1.78 | 1.53 | 0.67 | 0.34 | 3.37 | 2.91 |
| Domino\_Clk buffers | 0 | 0.74 | 0 | 0.74 | 0 | 0.74 |
| Total | 1.78 | 2.27 | 0.67 | 1.08 | 3.37 | 3.65 |
| Domino/$D^3L$ ratio | 1.28 | | 1.61 | | 1.08 | |

TABLE I

Average Power Consumption of $D^3L$ and Domino Logic (mW)

#### A. Power

The implemented chip has separate power sources for the SARA blocks and control logic, the clock tree, the output buffers and multiplexers, and the PADs. Having two different clock sources, one for $D^3L$ and the other for Domino, enables us to measure the power consumption of each core separately. For example, by setting Domino\_CLK to zero, the Domino SARA will not consume any power, and the $D^3L$ SARA and control logic are the only sources of consumption from $V_{DD}$-Logic. In the same way, by setting $D^3L$ \_CLK to zero and toggling Domino\_CLK, we can measure the power consumption of the Domino implementation only.

The lowest power consumption occurs when every single node of the *qmux* cells in the *SARA* remains at the precharge-mode value for two consecutive phases. This happens when inputs $In_{15} - In_0$ of the barrel shifter are set to 0. In this case, all of the *qmux* cells' outputs retain their precharged values, and only the control logic, plus its interface [Fig. 5(b)] and, in the Domino style, the clock buffers (Fig. 9), consume power. By setting the shift length to a constant value, and removing control-block consumption from the list, we can compare the pure consumption of the precharging logic for both $D^3L$ and Domino when there is no activity inside the *SARA*.

On the other hand, the highest power consumption occurs when the entire collection of *qmux* cells within the array lose their charges in the evaluation phase. This is arranged by setting inputs $In_{15}-In_0$ to all 1's. Table I presents post-layout simulation results in the average, best, and worst cases of power consumption for the two logic styles considered.

For 65 out of 80 *qmux* cells within the SARA block, $Ci_1$ and $Ci_2$ are complements of each other. This means that a series combination of them in Fig. 7 acts as a single clock signal from an activity-factor point of view. However, since each Domino *qmux* cell has one extra nMOS switch (the footer transistor), its consumption is higher than that of the equivalent $D^3L$ circuit. From Table I, it can be concluded that Domino-circuit power consumption can be 8% to 61% higher than that for $D^3L$ , depending on the input patterns. This result shows that where there are no input changes, and correspondingly, no event is expected to propagate in the circuit, Domino logic is least efficient from a power point of view. In such a case, a static equivalent circuit would consume zero power, and $D^3L$ can effectively fill the gap in the power spectrum between static and Domino logic.
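The 8% to 61% range follows directly from the Table I entries, as the short calculation below reproduces. The dictionary layout and the names `table_i` and `domino_to_d3l_ratio` are illustrative choices made here; the numbers are copied from Table I (in mW).

```python
# Post-layout simulation power from Table I, in mW.
table_i = {
    "random": {"d3l_logic": 1.78, "domino_logic": 1.53, "domino_clk": 0.74},
    "0000":   {"d3l_logic": 0.67, "domino_logic": 0.34, "domino_clk": 0.74},
    "FFFF":   {"d3l_logic": 3.37, "domino_logic": 2.91, "domino_clk": 0.74},
}


def domino_to_d3l_ratio(row):
    """Total Domino power (logic plus clock buffers) over D3L power."""
    domino_total = row["domino_logic"] + row["domino_clk"]
    return domino_total / row["d3l_logic"]


ratios = {name: round(domino_to_d3l_ratio(row), 2) for name, row in table_i.items()}
# ratios == {"random": 1.28, "0000": 1.61, "FFFF": 1.08}
```

Note that the Domino core alone draws less from $V_{DD}$-Logic than the $D^3L$ core in every column of Table I; the Domino total is higher only once the clock-buffer power is included.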

#### B. Area

The barrel-shifter-chip layout was designed using a full-custom approach. There is a single pMOS switch in the PUN of Domino *qmux* (Fig. 6), while there are two series pMOS transistors in the  $D^3L$  PUN (Fig. 7). However, Domino *qmux* has one extra nMOS switch (the footer) in the PDN. This cascaded transistor creates a gap in the active area of the PDN of the Domino *qmux* [Fig. 11(b)]. Based on the concept of branch-based design [11], this implies a greater diffusion capacitance, greater cell area, and also irregularity inside the cell's layout. As shown in Table II, the Domino cell's area is 9% more than its  $D^3L$  counterpart. The area consumption of the Domino logic becomes much worse when clock buffers and their corresponding routing are considered.

The other observation that we have made during layout preparation is that, normally, a standard cell or custom-based cell is designed by having $V_{\rm DD}$ and $V_{\rm SS}$ rails at the top and bottom of the cell, and, thereby, arranging the pMOS and nMOS switches close to these two rails. A Domino cell possesses a few pMOS switches in the PUN, and a large number of nMOS switches in the PDN. This creates an unbalanced area requirement for the two sections, and demands a very careful layout to reduce wasted area. By contrast, $D^3L$ tends to have a more balanced area requirement for the PUN and PDN, and its layout is more straightforward, particularly when considering the fact that pMOS-switch widths are nearly twice those of the nMOS ones.

#### C. Speed

The critical path of the barrel shifter is constructed from five stages of *qmux* arrays. Therefore, all input patterns pass through these same five stages. The control path is a small amount of logic and does not contribute to the critical path delay. Since the outputs are always precharged to zero, the critical path delay is measured for the case in which all inputs are set to one, with different shift values. In order to perform a fair comparison, all transistors in the $D^3L$ and Domino SARAs were constructed using $W_n = 2.7 \ \mu m$, $W_p = 1 \ \mu m$ (for the keeper), and $W_p = 4 \ \mu m$ (for other pMOS), with $L_p = L_n = 0.6 \ \mu m$. Post-layout simulation results, measured at the SARA's outputs, are shown in Table III.

Since only one pMOS device precharges a Domino cell, its precharge time has a lower value. In the evaluation phase of the Domino cell, there are three series nMOS devices between the output and GND, whereas for $D^3L$ , there are only two. As a result, $D^3L$ is expected to have a faster evaluation time than Domino. However, there are cases where this speed advantage is less than expected. For example, consider the case where $Ci_1$ is one and $Ci_2$ is zero in the evaluation phase. During the precharge time, a small charge is stored at node "s" (Fig. 7) and has to be discharged in the evaluation phase along with node q. This situation arises, for instance, during a right shift. On the other hand, when $Ci_1$ and $Ci_2$ have the reverse values, "01", in the evaluation phase, the circuit delay is reduced by 80 ps. In other words, the left shift operation is faster than the right shift in the barrel shifter. The results in Table III show the worst-case timing for circuit operation. Note that although the Domino precharge time is lower, this is not beneficial for most systems, which use a clock with a 50% duty cycle. Generally speaking, in Domino logic, it is the evaluation time that limits the maximum frequency of the clock.

Fig. 11. Qmux layouts for (a) $D^3L$ and (b) Domino.

TABLE II

Area Comparison Between $D^3L$ and Domino

| | $D^3L$ | Domino | Domino/$D^3L$ ratio |
|---|---|---|---|
| qmux area | 231.14 $\mu$m$^2$ | 251.64 $\mu$m$^2$ | 1.09 |

TABLE III

Post-Layout Simulation Results for SARA (Times in Picoseconds)

| | Precharge time | Evaluation time |
|---|---|---|
| $D^3L$ | 418 | 1,843 |
| Domino | 302 | 2,380 |
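As a quick check on the speed claim, the Table III evaluation times can be compared directly. The clock-period bound that follows is only an illustrative estimate: it assumes a 50% duty-cycle clock in which each phase must fit in half a period, and it ignores the control logic, pads, and timing margins. The symbols $t_{\mathrm{eval}}$ and $T_{\min}$ are introduced here for illustration.

$$
\frac{t_{\mathrm{eval,Domino}}}{t_{\mathrm{eval},D^3L}} = \frac{2380\ \mathrm{ps}}{1843\ \mathrm{ps}} \approx 1.29
$$

$$
T_{\min} \ge 2\,t_{\mathrm{eval}}: \qquad T_{\min,\mathrm{Domino}} \approx 4.76\ \mathrm{ns}\ (\approx 210\ \mathrm{MHz}), \qquad T_{\min,D^3L} \approx 3.69\ \mathrm{ns}\ (\approx 271\ \mathrm{MHz})
$$

Under these assumptions the shorter Domino precharge time (302 ps versus 418 ps) does not raise the achievable clock rate, because in both styles the evaluation time is the longer of the two phases.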

#### VI. EXPERIMENTAL RESULTS

The fabricated chip has been tested at various frequencies, and the results have been compared against HSPICE post-layout simulations using BSIM3v3 level-49 transistor models. For this purpose, we have prepared a test board with appropriate switches to perform comparative measurements of the $D^3L$ and Domino implementations. As our chip possesses separate $V_{DD}$ pins for different sections of the circuit, we have been able to accurately measure the current drain and power consumption of each block at various speeds. Table IV shows some of our power measurements, along with the corresponding post-layout simulation results. Post-layout simulation power levels are slightly higher than the experimental ones. This could be the result of voltage drops on the wiring and PADs, since our extraction file does not include any such connection resistance. Note that the "Clk-buf" power indicated in the last row has been added to the Domino power for a fair comparison against $D^3L$ power consumption.

| Frequency | 1 MHz | | 5 MHz | | 10 MHz | | 15 MHz | | Comments |
|---|---|---|---|---|---|---|---|---|---|
| | Test | Sim | Test | Sim | Test | Sim | Test | Sim | |
| $D^3L$ | 0.06 | 0.07 | 0.28 | 0.36 | 0.59 | 0.72 | 0.86 | 1.08 | Input = 0000, RGHT = 0, TYP = 0, LGTH = 0 |
| Domino | 0.09 | 0.11 | 0.39 | 0.55 | 0.83 | 1.11 | 1.25 | 1.65 | Input = 0000, RGHT = 0, TYP = 0, LGTH = 0 |
| $D^3L$ | 0.21 | 0.26 | 1.01 | 1.33 | 2.35 | 2.66 | 3.49 | 3.99 | Input = FFFF, RGHT = 0, TYP = 0, LGTH = 0 |
| Domino | 0.26 | 0.32 | 1.23 | 1.62 | 2.63 | 3.23 | 4.04 | 4.84 | Input = FFFF, RGHT = 0, TYP = 0, LGTH = 0 |
| Clk-buf | 0.06 | 0.07 | 0.27 | 0.37 | 0.55 | 0.74 | 0.83 | 1.11 | Only in Domino |

TABLE IV

$D^3L$ and Domino Core Power Consumption (in mW)

Fig. 12. Measured waveforms with a 10-MHz clock for (a) $D^3L$ and (b) Domino.

Measured waveforms for one of the outputs of the barrel shifter, operating with a 10-MHz clock, are shown in Fig. 12(a) and (b). The noise on the waveforms does not appear in the simulations and is attributed to the measurement setup. Despite its presence, the circuit operates properly.

#### VII. CONCLUSION

$D^3L$ is an improved type of synchronous dynamic logic, in which the precharge and evaluation phases are performed under the control of input data, and without an explicit clock. This logic style eliminates the need for a global clock signal, as well as the need for a footer transistor in series with the evaluation nMOS network of conventional dynamic logic gates. Moreover, it does so without making the design delay dependent. In this paper, we have compared a barrel shifter implementation in two logic styles: $D^3L$ and Domino. Experiments with the fabricated chip, and post-layout simulation results, show that the $D^3L$ shifter consumes 8% to 61% less power than the Domino shifter, depending on the input data pattern. Also, $D^3L$ is 29% faster and 9% smaller than its Domino counterpart.

#### ACKNOWLEDGMENT

The authors acknowledge the support and facilities provided by Emad Semiconductor Inc. and the Nano Electronics Center of Excellence, University of Tehran. They would also like to thank G. R. Chaji, A. Charaki, and A. Khakifirooz for their help during the testing of this chip, and Prof. G. Gulak and Mr. Y. Eslami, Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada, for their assistance.

#### REFERENCES

- [1] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, *Digital Integrated Circuits*. Englewood Cliffs, NJ: Prentice-Hall, 2003.
- [2] J. Silberman et al., "A 1.0-GHz single-issue 64-Bit PowerPC integer processor," *IEEE J. Solid-State Circuits*, vol. 33, no. 11, pp. 1600–1608, Nov. 1998.
- [3] B. J. Benschneider *et al.*, "A 300-MHz 64-b quad-issue CMOS RISC microprocessor," *IEEE J. Solid-State Circuits*, vol. 30, no. 11, pp. 1203–1214, Nov. 1995.
- [4] J. R. Yuan, C. Svensson, and P. Larsson, "New Domino logic precharged by clock and data," *Electron. Lett.*, vol. 29, no. 25, pp. 2188–2189, Dec. 1993.
- [5] R. Rafati, S. M. Fakhraie, and K. C. Smith, "Low-power data-driven dynamic logic," in *Proc. ISCAS 2000*, vol. 1, pp. 752–755.
- [6] R. Rafati, A. Z. Charaki, S. M. Fakhraie, and K. C. Smith, "Data-driven dynamic logic versus NP-CMOS logic, a comparison," in *Proc. ICM 2000*, pp. 57–60.
- [7] R. Rafati, A. Z. Charaki, R. Z. Chaji, S. M. Fakhraie, and K. C. Smith, "Comparison of a 17b multiplier in dual-rail Domino and in dual-rail D<sup>3</sup>L (D<sup>4</sup>L) logic styles," in *Proc. ISCAS 2002*, vol. 3, pp. 257–260.
- [8] G. Yee and C. Sechen, "Clock-delayed Domino for dynamic circuit design," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 8, no. 4, pp. 425–430, Aug. 2000.
- [9] R. Rafati, "Data-driven dynamic logic (D<sup>3</sup>L)," M.Sc. thesis, School of Elect. and Comput. Eng., Univ. of Tehran, Tehran, Iran.

- [10] R. Pereira, J. A. Michell, and J. M. Solana, "Fully pipelined TSPC barrel shifter for high-speed applications," *IEEE J. Solid-State Circuits*, vol. 30, no. 6, pp. 686–690, Jun. 1995.
- [11] A. Gerson and S. Machado, *Low-Power HF Microelectronics: A Unified Approach*. London, U.K.: IEE, 1996, ch. 15, pp. 535–579.





Laboratory, School of Electrical and Computer Engineering, University of Tehran, to develop a digital signal processor for mobile communication devices. From 2000 to 2003, he was a Senior Digital Designer with Valence Semiconductor Inc., Markham, Canada. He is now a Co-Founder of SINA Microelectronics Inc., Tehran, Iran, which is a fabless company focusing on design and development of application-specific integrated circuit and system-on-chip (ASIC/SoC) solutions for various networking applications. His research interests include novel techniques for high-speed digital circuit design, low-power logic design, and system integration for networking devices.



**Sied Mehdi Fakhraie** (M'89) was born in Dezfoul, Iran, in 1960. He received the M.Sc. degree in electronics from the University of Tehran, Tehran, Iran, in 1989, and the Ph.D. degree in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada in 1995.

Since 1995, he has been with the School of Electrical and Computer Engineering, University of Tehran, where he is now an Associate Professor and Associate Dean for Graduate Studies. He is also the Director of Silicon Intelligence and the VLSI Signal Processing Laboratory. From September 2000 to April 2003, he was with Valence Semiconductor Inc., where he worked in the Dubai, UAE, and Markham, Canada, offices as Director of Application-Specific Integrated Circuit and System-on-Chip (ASIC/SoC) Design and as technical lead for the Integrated Broadband Gateway and Family Radio System baseband processors. During the summers of 1998, 1999, and 2000, he was a Visiting Professor at the University of Toronto, where he continued his work on efficient implementation of artificial neural networks. He is coauthor of the book *VLSI-Compatible Implementation of Artificial Neural Networks* (Boston, MA: Kluwer, 1997). He has also published more than 80 reviewed conference and journal papers.

He has worked on many industrial IC design projects including design of network processors and home gateway access devices, digital subscriber line (DSL) modems, pagers, and one- and two-way wireless messaging systems, and digital signal processors for personal and mobile communication devices. His research interests include system design and ASIC implementation of integrated systems, novel techniques for high-speed digital circuit design, and system-integration and efficient VLSI implementation of intelligent systems.



**Kenneth Carless Smith** (F'78–LF'96) received the B.A.Sc. degree in engineering science, the M.A.Sc. degree in electrical engineering, and the Ph.D. degree in physics from the University of Toronto, Toronto, ON, Canada, in 1954, 1956, and 1960, respectively.

Following his academic appointment in 1961 at the University of Illinois, Urbana, where he reached the rank of Associate Professor, he rejoined the University of Toronto in 1965, where he was appointed to the rank of Full Professor in 1970 and served as the Chairman of the Department of Electrical Engineering from 1976 to 1981. Upon retirement in 1997, he was also a Professor of Electrical and Computer Engineering, of Computer Science, of Mechanical and Industrial Engineering, and of Information Sciences. For the period 1993 to 1998, he served part-time as a Visiting Professor in the Department of Electrical and Electronic Engineering at the University of Science and Technology, Hong Kong, where he was the Founding Director of Computer Engineering. He is also an Advisory Professor at the Shanghai Tiedao University, Shanghai, China. Upon retirement from the University of Toronto in 1997, he was appointed as Professor Emeritus of the University. He has served for many years as an advisor to various electronics companies throughout the world. He was a founding member of Z-Tech (Canada), a Toronto-based medical instrumentation company, for which he now serves in an advisory capacity as Principal Scientist. He has extensive industrial experience in the design and application of computers, medical instrumentation, and electronic circuits generally, as Administrator, Manager, Designer, and Consultant. His interests include analog VLSI, multiple-valued logic, sensor systems, instrumentation, human-factors engineering, flexible manufacturing, and reliability. He is widely published in these and other areas, with well over 200 journal and proceedings papers, books, and book contributions. His textbook with Adel S. Sedra, *Microelectronic Circuits*, now in its Fifth Edition and published by Oxford University Press with 1360 pages, has been translated into many languages and adopted by many hundreds of universities around the world.

Dr. Smith has held a variety of posts in Societies of IEEE, most notably and currently on the Executive Committee of the International Solid-State Circuits Conference (ISSCC), as Press-Relations Chair and as Awards Chair. Among his numerous affiliations with professional associations is his former directorship and presidency of the Canadian Society for Professional Engineers.