Multi-Gb/s Bit-by-Bit Receiver Architectures for 1 − D Partial Response Channels

Masum Hossain, Anthony Chan Carusone, Senior Member, IEEE

Abstract—Low-complexity bit-by-bit detection techniques for 1-D partial response channels are presented. First, a full-rate detection technique is presented which operates at 3.3 Gb/s consuming 40 mA from 1.8 V supply with a sensitivity of 40 mV differential. The speed of the full-rate architecture is limited by the settling time of a latch circuit which has to be less than 1 UI. To eliminate this limitation a novel demuxing technique is introduced. Using the proposed technique, a second architecture achieves 5 Gb/s data rate with the same sensitivity and consuming 62 mA (including output buffer) from 1.8 V supply. Both half-rate and full-rate architectures are also studied in 90 nm CMOS targeting chip to chip applications. The implemented full-rate architecture operates at 10 Gb/s consuming 32 mW whereas the simulated half-rate architecture consumes 50 mW and operates at 16.67 Gb/s.

Index Terms—Peak detection, Dicode Channel, AC Coupling, Clock-less demuxing, Half-rate, DFE

I. INTRODUCTION

There are many new and emerging applications for dicode (1-D) partial response signaling. Dicode partial response signaling was applied to magnetic storage channels [1]. More recently, a similar channel response has been observed in multi-Gbps wireline communication applications such as Passive Optical Networks (PON) and AC coupled chip-to-chip links that have spectral nulls at DC. The speed of receivers for these applications is generally limited by the settling time of a latch circuit. This shortcoming is addressed in this work with two novelties: first, an improved latch circuit provides faster settling time; second, a parallel architecture permits the positive- and negative-going pulses to be detected separately, thus alleviating the feedback settling time requirements on the latches. One interesting area where partial response signaling has been applied is chip-to-chip links. For example, it was used for a high-speed multi-drop bus with magnetically coupled receivers [2]. Capacitive coupling has also been used in chip-to-chip links within a package [3] and over PCB traces up to 20 cm in length [4]. AC coupling has also been used for bidirectional signaling [5], as a wireless link for modulated data [6] and for power transfer [7].

Another area of interest is burst mode applications. In a PON system, the receiver at the Optical Line Terminal (OLT) needs to recover data from different Optical Network Units (ONU). The packets of data from ONUs arrive in bursts at the OLT end and their signal strength varies significantly. For high data rates such as 10 Gb/s, the receiver at the OLT end needs to recover the DC information in less than 1 ns. To avoid the difficulty associated with fast DC extraction, a 1-D channel is used to suppress the DC content [8].

Partial response channel receivers can be broadly classified in two categories: sequence detectors and bit-by-bit detectors. Sequence detectors, such as those using the Viterbi algorithm, make a decision based on a sequence of observations spanning several symbol intervals [9]. Sequence detectors generally outperform bit-by-bit detectors and are therefore now dominant in magnetic storage applications. However, they demand sophisticated signal processing and a power consumption that is generally intolerable for multi-Gb/s wireline communication applications of (1 − D) partial response signaling. The remainder of the paper will, therefore, focus on bit-by-bit detectors.

All of the multi-Gbps wireline applications shown in fig. 1 have behaviorally similar channel responses. The capacitively-coupled link in [4], the inductively coupled link in [10],[11], and the burst mode link in [8] are all dominated by a first-order highpass characteristic with a cutoff frequency of 1 to 5 times the bit rate, $f_{Bit}$. As a result, transitions in the transmitted data appear as narrow electrical pulses at the receiver while consecutive identical bits result in no signal at the receiver.

Measured and modeled responses of such an AC coupled channel are shown in fig. 2 for a 50-fF coupling capacitor and 50-Ohm termination resistor. The channel suffers from 40 dB of loss at 0.05$f_{Bit}$ and more than 15 dB of loss at 2.5$f_{Bit}$. The measured capacitively coupled channel response closely follows an ideal 1-D channel within the band of interest (DC-6 GHz). Corresponding time domain signals with NRZ transmitted data are shown in fig. 3. Note that unlike modern magnetic storage channels, only a small amount of ISI is introduced. However, the sensitivity of the receiver has to be sufficient to capture the small received pulses. In this case, for a 50 fF coupling capacitor and 10 Gb/s data, the signal suffers more than 20 dB of loss, which means that the receiver must detect pulses of only a few tens of mV. Since the received pulse width is only a fraction of the bit period, receiver circuits will generally require higher gain and bandwidth than an NRZ receiver at the same data rate without ac coupling. Furthermore, the received signal is a 3-level signal, so additional decoding is needed to recover the transmitted data.
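The low-frequency roll-off quoted above can be checked against an ideal first-order RC high-pass. The sketch below is only an illustration under simplifying assumptions that are not stated in the text: the source impedance, the cable and all parasitics are ignored, so it says nothing about the loss observed at 2.5$f_{Bit}$. The function name `highpass_gain_db` is hypothetical.

```python
import math

def highpass_gain_db(f_hz, r_ohm=50.0, c_farad=50e-15):
    """Gain in dB of an ideal first-order RC high-pass at frequency f_hz."""
    f_c = 1.0 / (2.0 * math.pi * r_ohm * c_farad)  # -3 dB corner frequency
    x = f_hz / f_c
    return 20.0 * math.log10(x / math.sqrt(1.0 + x * x))

bit_rate = 10e9  # 10 Gb/s, as in the text

# Corner frequency of the assumed 50 fF / 50 Ohm network, in GHz.
corner_ghz = 1.0 / (2.0 * math.pi * 50.0 * 50e-15) / 1e9

# Loss (positive dB) at 0.05 * f_bit, i.e. 500 MHz for 10 Gb/s data.
loss_low = -highpass_gain_db(0.05 * bit_rate)

print(f"corner = {corner_ghz:.1f} GHz, loss at 0.05 f_bit = {loss_low:.1f} dB")
```

Under these assumptions the ideal model lands close to the 40 dB figure quoted for 0.05$f_{Bit}$; any remaining difference from the measured response would come from elements the model leaves out.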

Different applications of 1-D signaling have different requirements. Chip-to-chip links require low power and area-efficient circuits with moderate sensitivity and dynamic range. On the other hand, a burst mode application requires higher sensitivity and fast recovery. Without targeting a specific application, we first explore 1-D receiver architectures which can be adapted to meet the requirements of either application. We identify the bottleneck that limits the maximum achievable speed of this type of receiver. We then propose an improved receiver architecture that obviates this speed limitation.

These receiver architectures are implemented and compared for two particular applications. In the first implementation we target 40 mV sensitivity and >10 dB dynamic range, which is required for G-PON systems [12]. The two receivers implemented in 0.18 um CMOS serve as experimental validation of the theoretical discussion regarding full-rate and half-rate architectures. Their relatively high sensitivity requires large pre-amp gain and hence increases power consumption. Thus the implemented full-rate and half-rate prototypes in 0.18 um CMOS consume 72 mW and 110 mW at 3.33 Gb/s and 5 Gb/s, respectively. Although this power consumption is comparable to other existing burst mode receivers [13], chip-to-chip links require much lower power consumption.

For chip-to-chip interconnects, we target 80 mV sensitivity and a higher bit rate (10+ Gb/s). To improve power efficiency and achieve higher speed we implement the proposed receivers in 90 nm CMOS. An implemented full-rate architecture consumes 32 mW and operates up to 10 Gb/s without any equalization. On the other hand, the simulated half-rate architecture consumes 50 mW and operates up to 16.67 Gb/s. The achieved power efficiencies of 3.2 mW/(Gb/s) and 3.0 mW/(Gb/s) are comparable to those of DC coupled receivers at these speeds.

We begin our discussion with general bit-by-bit decoding techniques for dicode channels in section II. The proposed half-rate architecture, introduced at the end of section II, relaxes the speed bottleneck introduced by feedback in full-rate architectures. In section III, a full-rate implementation in 0.18-um CMOS is described. It is based on the architecture introduced in [14] but modified to accommodate threshold adjustability and improve sensitivity. Section IV describes the circuit level implementation of the half-rate architecture introduced in section II. Using this technique the receiver can potentially improve the speed by a factor of 2, at the expense of increased power consumption. The 0.18-um CMOS prototype operates up to 5 Gb/s, 50% faster than the full-rate architecture. Targeting chip-to-chip applications, 90 nm implementations of these architectures are presented in section V. Finally, these two architectures are compared in the conclusion in section VI.

II. 1-D RECEIVER ARCHITECTURE

In this section, dicode (1-D) partial response bit-by-bit receiver architectures are reviewed.

A. Decision Feedback Equalization (DFE)

For un-coded binary data transmitted over a 1-D channel, the data can be recovered using a 1-tap decision feedback equalizer, as shown in fig. 4 [15][16]. In this architecture the received signal, \( s(n) = z(n) - z(n-1) \), is compared to a threshold level that is updated based on the immediate previous bit.

\[
v_{th}(n) = \beta V(n - 1)
\]
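As a behavioural illustration of this 1-tap feedback, the sketch below decodes a noiseless dicode sequence. Several details are assumptions, since Fig. 4 is not reproduced here: the previous decision is mapped to a bipolar value $V(n-1) \in \{-1, +1\}$, the feedback term $\beta V(n-1)$ is summed with the input ahead of a zero-threshold slicer (the sign convention in the figure may differ), and $\beta = 0.5$ for unit-amplitude pulses. The function name `dfe_decode` is hypothetical.

```python
import random

def dfe_decode(s, beta=0.5, v_init=0):
    """Behavioural 1-tap DFE for a 1-D channel; returns decided bits (0/1)."""
    decisions = []
    prev = v_init
    for sample in s:
        v_bipolar = 1 if prev else -1  # map previous bit to +/-1
        # Assumed convention: the bit is 1 when s(n) + beta * V(n-1) > 0.
        prev = 1 if sample + beta * v_bipolar > 0 else 0
        decisions.append(prev)
    return decisions

random.seed(0)
z = [random.randint(0, 1) for _ in range(1000)]  # transmitted bits
z_prev = [0] + z[:-1]                            # channel starts from z(-1) = 0
s = [a - b for a, b in zip(z, z_prev)]           # s(n) = z(n) - z(n-1)
v = dfe_decode(s, beta=0.5, v_init=0)
print("bit errors:", sum(a != b for a, b in zip(v, z)))
```

Because each decision depends on the previous one, the loop in `dfe_decode` is the software counterpart of the feedback path that has to settle within one bit period.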

A hardware efficient implementation of this technique is discussed in [3] [4] where this functionality can be achieved at high speed utilizing a hysteresis latch. Compared to a conventional DFE, this architecture provides several advantages: (i) Since there is no clock required, this architecture can be implemented with less complexity and lower power consumption. (ii) Since there is no D-FF in the feedback path, it will settle faster than a clocked 1 tap DFE. However, there is still a feedback path that must settle and the highest...
chosen so that the received signal amplitude, $|S(t)|$, is excessive to the receiver side as shown in Fig. 6. The thresholds, $|V_{th}|$, are chosen to enable the pre-coding functionality, the pre-coder can be moved to the receiver side as shown in Fig. 5. It may be shown that, in the absence of noise causing decision errors, $v(n) = z(n)$. Hence, the output of the first XOR operation is,

$$w(n) = u_1(n) \oplus u_2(n) = z(n) \oplus z(n-1) \quad (7)$$

and the decoder output $v(n)$ is

$$v(n) = v(n-1) \oplus w(n) = v(n-2) \oplus w(n-1) \oplus w(n) \quad (8)$$

Using (7) and (8), the decoded output $v(n)$ can be expressed as a function of transmitted symbol $z(n)$:

$$v(n) = z(n) \oplus v(n-2) \oplus z(n-2) \quad (9)$$

This equation can be iteratively extended back in time to the first transmitted symbol, which we shall assign to time $n = 0$:

$$v(n) = z(n) \oplus v(0) \oplus z(0) \quad (10)$$

Thus if the initial transmitted symbol $z(0)$ and initial decoder output $v(0)$ are the same, equation (10) reduces to $v(n) = z(n)$ and the decoder output is indeed equal to the transmitted data. Comparing the transceiver architectures in Fig. 4, Fig. 5 and Fig. 6, the highest achievable speed is always limited by the delay of a feedback loop which must be one bit period or less.
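The recursion above can be exercised with a short behavioural model. This is a sketch under stated assumptions rather than the implemented circuit: the channel is noiseless with unit-amplitude pulses, $u_1$ and $u_2$ are taken to be the outputs of two comparators at $+V_{th}$ and $-V_{th}$, and the names `full_rate_decode`, `v_th` and `v_init` are hypothetical.

```python
import random

def full_rate_decode(s, v_th=0.5, v_init=0):
    """Peak detection followed by v(n) = v(n-1) xor w(n)."""
    v, out = v_init, []
    for sample in s:
        u1 = sample > v_th    # positive peak (rising edge of z)
        u2 = sample < -v_th   # negative peak (falling edge of z)
        w = u1 ^ u2           # w(n) = z(n) xor z(n-1)
        v ^= int(w)           # one-bit feedback loop
        out.append(v)
    return out

random.seed(1)
z = [random.randint(0, 1) for _ in range(500)]
s = [z[n] - z[n - 1] for n in range(1, len(z))]  # s(n) for n >= 1

matched = full_rate_decode(s, v_init=z[0])       # v(0) = z(0)
inverted = full_rate_decode(s, v_init=1 - z[0])  # v(0) != z(0)
print(matched == z[1:], inverted == [1 - b for b in z[1:]])
```

With $v(0) = z(0)$ the decoded stream equals the transmitted one, and with the opposite initial state every decoded bit is complemented, as the final expression for $v(n)$ predicts.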

C. Receiver with half-rate Decoder

To ease the settling-time requirements of all the above architectures, we introduce the half-rate decoder shown in Fig. 7. This architecture is a natural progression from the one shown in Fig. 6, where the feedback loop is shifted before the XOR operation. The operation of the half-rate receiver in Fig. 7 is best understood by recognizing that the top path, through $u_1$ and $w_1$, is responsible for receiving positive peaks in $s$, whereas the bottom path through $u_2$ and $w_2$ receives only the negative peaks in $s$. Every positive peak in $s$ (corresponding to a rising edge of $z$) must be followed by a negative peak (corresponding to a falling edge of $z$). Hence, the top path can never be active in two consecutive bit periods. Similarly, all negative peaks are followed by a positive peak, so the negative path is never active for two bit periods in a row. Hence, the feedback loops have twice as long to settle: 2 UI. The front-end of the receiver is unchanged from Fig. 6, so equations (5) and (6) are still valid.

Now the decoder outputs $w_1(n)$ and $w_2(n)$ can be written as:

$$w_1(n) = w_1(n-1) \oplus (z(n)\,\overline{z(n-1)}) \quad (11)$$

$$w_2(n) = w_2(n-1) \oplus (\overline{z(n)}\,z(n-1)) \quad (12)$$

The full-rate decoded output $v(n)$ is then related to $w_1(n)$ and $w_2(n)$ as follows:

$$v(n) = w_1(n) \oplus w_2(n) \quad (13)$$

$$= (w_1(n-1) \oplus (z(n)\,\overline{z(n-1)})) \oplus (w_2(n-1) \oplus (\overline{z(n)}\,z(n-1))) \quad (14)$$

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Note that,

\[
(z(n)\,\overline{z(n-1)}) \oplus (\overline{z(n)}\,z(n-1)) = z(n) \oplus z(n-1) \quad (15)
\]

Substituting (15) into (14),

\[
v(n) = w_1(n-1) \oplus w_2(n-1) \oplus z(n) \oplus z(n-1) \quad (16)
\]

\[
= v(n-1) \oplus z(n) \oplus z(n-1) \quad (17)
\]

Similarly, \( v(n-1) \) can be written as:

\[
v(n-1) = v(n-2) \oplus z(n-1) \oplus z(n-2) \quad (18)
\]

Substituting (18) into (17) and iterating back to the first transmitted symbol at \( n = 0 \) results in,

\[
v(n) = z(n) \oplus v(0) \oplus z(0) \quad (19)
\]

Thus the decoder correctly recovers the transmitted symbol \( z(n) \) if the initial decoder output and initial transmitted symbol are the same, \( v(0) = z(0) \), which is the same requirement obtained for the full-rate architecture in fig. 6.
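The half-rate decoding argument can also be checked numerically. The following sketch assumes ideal, noiseless edge detection, so the slicer outputs are computed directly from $z$ (a rising edge drives the top path and a falling edge drives the bottom path), and it sets $w_1(0) = z(0)$, $w_2(0) = 0$ so that $v(0) = z(0)$. The function name `half_rate_decode` is hypothetical.

```python
import random

def half_rate_decode(z):
    """Two toggling paths; returns v(n) for n >= 1 plus both edge streams."""
    w1, w2 = z[0], 0  # initial state chosen so that w1 xor w2 = z(0)
    v, u1_seq, u2_seq = [], [], []
    for n in range(1, len(z)):
        u1 = z[n] & (1 - z[n - 1])    # rising edge: positive peak in s
        u2 = (1 - z[n]) & z[n - 1]    # falling edge: negative peak in s
        w1 ^= u1                      # top path toggles on positive peaks
        w2 ^= u2                      # bottom path toggles on negative peaks
        v.append(w1 ^ w2)             # full-rate output
        u1_seq.append(u1)
        u2_seq.append(u2)
    return v, u1_seq, u2_seq

random.seed(2)
z = [random.randint(0, 1) for _ in range(500)]
v, u1_seq, u2_seq = half_rate_decode(z)

# Neither path is ever active in two consecutive bit periods.
no_back_to_back_u1 = all(not (a and b) for a, b in zip(u1_seq, u1_seq[1:]))
no_back_to_back_u2 = all(not (a and b) for a, b in zip(u2_seq, u2_seq[1:]))
print(v == z[1:], no_back_to_back_u1, no_back_to_back_u2)
```

The two consecutive-activity checks correspond to the observation that each toggling loop has 2 UI to settle.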

Notice that pulses are generated at \( u_1 \) (\( u_2 \)) whenever a rising (falling) edge is observed on the channel data, \( z \). Hence, these signals can be used as inputs to a phase detector in a conventional clock recovery loop. Alternatively, we use them to injection-lock an oscillator in the simulations of section V.

In summary, present low complexity 1-D decoders generally include feedback loops which must settle in less than 1 UI. However, this requirement can be relaxed by using the half-rate architecture proposed in fig. 7. The remainder of this paper describes a prototype of the half-rate decoder and compares it to a full-rate decoder in the same technology. Error propagation in such receivers is the same as in a conventional DFE. Just as in DFE-based partial response receivers, so long as a sufficiently low Bit Error Rate (BER) is maintained there is no observable performance degradation.

III. FULL-RATE BIT-BY-BIT DETECTION

The receiver architecture shown in fig. 8(a) was introduced in [14]. Notice the linear amplifier in parallel with the hysteresis latch, which improves the receiver’s overall speed. With this in place, the receiver’s speed is determined by the settling time of the hysteresis latch and the bandwidth of the pre-amplifier.

This circuit exhibits hysteresis if the following condition is satisfied:

\[
g_m R_L > 1 \quad (20)
\]

where, \( g_m \) is the small signal transconductance of the feedback differential pair. In practice, to ensure operation in the presence of noise and process variations, \( g_m R_L \) is made nominally greater than 2. In addition, to accommodate both large and small inputs, the threshold levels should be adjustable. However, this simple circuit suffers from two main challenges. First, the critical output node is heavily loaded by the capacitance of \( g_m \) and the following stages which limits its settling time.

To reduce the time constant at the critical nodes, cascode devices were used in [13]. Due to this transistor stacking, VDD was increased to 2.5 V in a 0.13-um CMOS process. In this work the time constant is improved within the process nominal VDD. The second challenge with using the hysteresis circuit in fig. 8(b) is that adjusting the threshold level will also affect other aspects of the design, such as its settling time. Hence, to provide programmability several copies of the circuit were operated in parallel in [13]. The proposed circuit is shown in fig. 8(c). An additional differential pair, \( g_{m2} \), is introduced in the latch which provides several advantages: First, note that the condition for hysteresis is now,

\[
g_{m1} R_1 g_{m2} R_H > 1 \quad (21)
\]

Compared with the condition for the previous circuit in (20), there is additional flexibility to choose the gain of each stage, \( R_1 \) and \( R_H \), to minimize the settling time. Second, \( g_{m2} \) also works as a buffer between the critical node and the following stages. Finally, this architecture also allows adjustment of the threshold levels, as shown in fig. 9(a).


Also note the use of a “split-load” at the output [19] so that the feedback is taken from the fast-settling node with low impedance while the output is taken from the node with larger swing. Hence, the feedback loop settling time is dominated by the time constant $R_H C_H$ whereas the “output settling time” is dependent on the time constant $(R_1 + R_2) C_L$. This allows design flexibility and relaxes the tradeoffs between speed, sensitivity and noise immunity. Simulations of the hysteresis latch in fig. 9(b) indicate that the split-load improves the settling time by >20% in this circuit. The targeted sensitivity of the receiver is 40 mV differential input. The hysteresis comparator thresholds can be adjusted from 150 mV to 400 mV differential input. In PON applications, the receiver threshold can be adjusted based on a training preamble which precedes each data burst.
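To illustrate why the threshold needs to track the input amplitude, the following behavioural sketch models the hysteresis latch as an ideal comparator with symmetric thresholds at $\pm V_{th}$ that holds its state between pulses. It ignores settling time, noise and the pre-amplifier, and the pulse amplitude and threshold values are hypothetical numbers chosen only for the example.

```python
def hysteresis_latch(samples, v_th, state=0):
    """Ideal hysteresis comparator: set above +v_th, reset below -v_th, else hold."""
    out = []
    for x in samples:
        if x > v_th:
            state = 1
        elif x < -v_th:
            state = 0
        out.append(state)
    return out

z = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]  # transmitted NRZ bits
amplitude = 0.1                      # hypothetical 100 mV pulse amplitude
s = [amplitude * (z[n] - z[n - 1]) for n in range(1, len(z))]

recovered = hysteresis_latch(s, v_th=0.07, state=z[0])  # threshold below the peaks
missed = hysteresis_latch(s, v_th=0.18, state=z[0])     # threshold above the peaks
print(recovered == z[1:], missed)
```

With the threshold below the pulse amplitude the NRZ data is recovered directly from the three-level signal; with the threshold above it, no pulse is detected and the output never leaves its initial state.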

A 5-stage pre-amplifier providing 24 dB differential gain is used in front of the hysteresis latch. Budgeting 30 mW of power for the pre-amplifier, without inductive peaking the achieved bandwidth is only 2.5 GHz, resulting in excessive data-dependent jitter. Hence, inductive peaking was used to extend the bandwidth to 3.5 GHz.

B. Experimental Results

A die photo of the receiver front end is shown in fig. 10. Measurements were made with a channel comprising an approximately 3-ft long SMA cable and a 50-fF ac-coupling capacitor on-chip which, together with the 50-Ohm on-chip termination, forms the high-pass filter characterized in figures 2 and 3. The measured results are obtained with single-ended excitation only. The receiver’s dynamic range was tested by varying the input amplitude from 40 mV to 200 mV. For a 40 mV input, the threshold level was adjusted to 70 mV, and for a 180 mV input, the threshold level was adjusted to 180 mV. The receiver demonstrated error-free data recovery at 3.3 Gb/s for a PRBS $2^{10}−1$ pattern at both signal amplitudes (fig. 11(a) and fig. 11(b)).

IV. HALF-RATE DETECTION

The speed of the architecture in section III is limited by the finite bandwidth of the pre-amplifier and the threshold settling time. To further increase the speed of $(1−D)$ partial response receivers, the parallel half-rate architecture described in section II is used. The block diagram of a CMOS implementation of this architecture is shown in fig. 13. The front end is comprised of two major circuit blocks: a slicer and a toggle flip-flop (T-FF). The T-FF provides the feedback and XOR operation shown in each path of fig. 7. The circuit outputs Demux1 and Demux2 correspond to $u_1$ and $u_2$ in fig. 7. These may be XORed to recover the full-rate data or further demultiplexed for digital decoding at a much slower rate. One possible implementation is shown in fig. 13, where the recovered half-rate clock is used to further demultiplex Demux1 and Demux2. These demuxed bit streams are then XORed at half rate to decode the even and odd bit streams. The half-rate clock can be recovered from the transition information provided by $u_1$ and $u_2$.

Fig. 12. Results for a PRBS $2^{10} - 1$ pattern: (a) a segment of the transmitted and recovered sequences and (b) BER bathtub plot.

Fig. 13. (a) Proposed half-rate receiver architecture (b) Transition detector circuit (c) building block of the 5-stage pre amp (d) T-FF

A. Implementation

The first stage of the slicer is a differential difference amplifier that compares the input to $V_{th}$. The detected pulses are then passed through 5 inductively-peaked amplifier stages providing 26 dB gain. Fortunately, due to the half-rate architecture lower bandwidth can be tolerated here than in the full-rate pre-amplifier. Hence, the total current consumption is only 9 mA from a 1.8 V supply for each amplifier chain providing 2.2 GHz bandwidth.

High speed T-FFs have been widely used as dividers in both wireline and wireless applications. Conventional CML T-FFs employ two back-to-back D-latches as shown in fig. 13(d). A typical implementation of the D latch is shown in fig. 14(a). This type of T-FF exhibits self oscillation which allows it to operate as a high-frequency divider. A typical sensitivity curve is shown in fig. 14(a). Unfortunately, noise around the self-oscillation frequency can cause the output to toggle erroneously during periods when there is no transition in the received data, resulting in bit errors in the decoded sequence. Thus self oscillation in the T-FF must be avoided to use it as a decoder in this application. In addition, the buffer $A_L$ is needed to drive a capacitive load without loading the latch nodes.

To alleviate both of these problems, we bring the buffer within the feedback loop as shown in fig. 14(b). The gain of $A_L$ is easily made adjustable to allow variable latching strength and, hence, T-FF sensitivity. The modified architecture provides frequency-independent sensitivity characteristics, as shown in fig. 14(b). Furthermore stage $A_L$ effectively buffers the critical latch node and eliminates the requirement of an additional buffer.

Thus the proposed latch circuit does not consume additional power compared to a conventional latch implementation.

B. Experimental Results

A prototype receiver in 0.18 um CMOS is shown in fig. 15. In this implementation we used the same 50-fF coupling capacitor and 50-Ohm resistance as a highpass filter to provide the 1-D partial response. The receiver provided error-free operation for a PRBS $2^{10} - 1$ pattern up to 5 Gb/s. Eye diagrams of the demultiplexed outputs at 3.33 Gb/s and 5 Gb/s are shown in fig. 16 and fig. 17 respectively. Portions of actual...
PRBS $2^{10} - 1$ transmitted and recovered sequences are shown in Fig. 18. Note that none of the eye diagrams show any ringing which indicates that the proposed T-FF implementation strongly suppresses self oscillation.

V. 90NM IMPLEMENTATION

For chip-to-chip applications, the sensitivity and dynamic range requirements are relaxed. We target 80 mV sensitivity in a 90-nm CMOS process resulting in greatly improved power-efficiency. No inductors were used in these designs as chip-to-chip applications demand compact circuitry.

A. Full-rate Bit-by-Bit detection

The full-rate architecture of section III achieves 80 mV sensitivity with a 5-stage preamplifier that consumes only 10 mW. The hysteresis latch consumes 7 mW and operates up to 10 Gb/s. For experimental study we used the implemented full-rate receiver in [14]. Similar to section III, only the non-linear path is used for NRZ recovery. Due to the relaxed dynamic range, we can use a fixed threshold in the latch. Recovered NRZ data measured with the same channel as in sections III and IV is shown in Fig. 19 at 10 Gb/s. The total power consumption is 32 mW from a 1.2-V supply.

B. Half-rate Receiver

Simulations of the half-rate architecture in 90-nm CMOS are used to demonstrate: (a) that a speed improvement over the full-rate architecture, commensurate with that observed in 0.18-um CMOS, is possible; and (b) that clock recovery and 1:2 demultiplexing are readily feasible within this architecture.

Table I

<table>
<thead>
<tr>
<th></th>
<th>[13]</th>
<th>[4]</th>
<th>This work</th>
<th>This work</th>
<th>This work</th>
<th>This work</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Architecture</strong></td>
<td>Full-rate</td>
<td>Full-rate</td>
<td>Full-rate</td>
<td>Half-rate</td>
<td>Full-rate</td>
<td>Half-rate (Simulated)</td>
</tr>
<tr>
<td><strong>Technology</strong></td>
<td>0.13um</td>
<td>0.18um</td>
<td>0.18um</td>
<td>0.18um</td>
<td>90nm</td>
<td>90nm</td>
</tr>
<tr>
<td><strong>Channel</strong></td>
<td>On-chip L-R</td>
<td>Proximity coupled</td>
<td>On-chip C-R</td>
<td>On-chip C-R</td>
<td>On-chip C-R</td>
<td>On-chip C-R</td>
</tr>
<tr>
<td><strong>Pre-amp Gain</strong></td>
<td>26 dB</td>
<td>—</td>
<td>23 dB</td>
<td>22 dB</td>
<td>17 dB</td>
<td>17 dB</td>
</tr>
<tr>
<td><strong>Bit-rate</strong></td>
<td>10 Gb/s</td>
<td>3 Gb/s</td>
<td>3.33 Gb/s</td>
<td>5 Gb/s</td>
<td>10 Gb/s</td>
<td>16.67 Gb/s</td>
</tr>
<tr>
<td><strong>VDD</strong></td>
<td>2.5 V</td>
<td>1.8 V</td>
<td>1.8 V</td>
<td>1.8 V</td>
<td>1.2 V</td>
<td>1.2 V</td>
</tr>
<tr>
<td><strong>Receiver</strong></td>
<td>500 mW</td>
<td>10 mW</td>
<td>72 mW</td>
<td>110 mW</td>
<td>32 mW</td>
<td>50 mW</td>
</tr>
<tr>
<td><strong>Power Consumption</strong></td>
<td>Includes Logic+ Buffer power</td>
<td>Includes buffer power</td>
<td>Includes buffer power</td>
<td></td>
<td></td>
<td>(With Clock Recovery 110 mW)</td>
</tr>
<tr>
<td><strong>Area</strong></td>
<td>2.72mm²</td>
<td>—</td>
<td>0.97mm²</td>
<td>1.08mm²</td>
<td>0.45mm²</td>
<td>—</td>
</tr>
<tr>
<td><strong>Sensitivity</strong></td>
<td>40 mV</td>
<td>120 mV</td>
<td>40 mV</td>
<td>40 mV</td>
<td>80 mV</td>
<td>80 mV</td>
</tr>
<tr>
<td><strong>FoM(pJ/Bit)</strong></td>
<td>50</td>
<td>3.33</td>
<td>21.8</td>
<td>22</td>
<td>3.2</td>
<td>3.0</td>
</tr>
</tbody>
</table>

Fig. 19. Recovered 10 Gb/s NRZ eye from full-rate receiver implemented in 90 nm CMOS

The same circuits described in Fig. 13 are ported to a 90-nm process. Following the T-FF, all remaining circuitry is implemented using full-swing CMOS logic. The spectra of the signals \( u_1 \) and \( u_2 \) contain tones at the baud rate which can be utilized for clock recovery using a PLL. Phase locking can also be done using an injection locked oscillator or gated VCO which provides the fast locking required for burst mode applications. Injection locking a half-rate clock relaxes the VCO design compared to a full-rate VCO. In the proposed architecture the signals \( u_1 \) and \( u_2 \) are used to injection-lock a half-rate ring oscillator which operates at 8.33 GHz. The recovered half-rate clock is then used to demultiplex and retime the data. Proper recovery of the even and odd data is demonstrated using this technique in simulations at 16.67 Gb/s in Fig. 20. This represents a 67% increase in data rate over the full-rate measurements, which is comparable to the measurement results from the 0.18-um prototypes where the half-rate architecture offered a 50% increase in data rate, from 3.3 Gb/s to 5 Gb/s. The total simulated power consumption, including clock recovery, demultiplexing, and required logic, is 110 mW from a 1.2-V supply.

VI. CONCLUSION

In recent years, there has been significant effort to improve the sensitivity and speed of AC coupled receivers. This trend is driven by the desire to use small ac-coupling capacitors, achieve higher data rates, and/or accommodate lossy channels. Thus high preamp gain and bandwidth are required at the cost of additional power and area. A more detailed comparison of the proposed receivers and state-of-the-art receivers with high sensitivity is given in Table I. Sensitivity is measured by the minimum signal amplitude required for error-free detection at the receiver. For comparison of different implemented receivers, we used a Figure of Merit (FoM), which is defined as:

\[
FoM(\text{pJ/Bit}) = \frac{\text{Power consumption}}{\text{Bit rate}}
\]  
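As a quick arithmetic check of this definition, the sketch below recomputes the FoM from the power and bit-rate entries of Table I. The dictionary keys are labels chosen here for readability, and the 0.18-um full-rate entry uses the 3.3 Gb/s figure quoted in the measurements.

```python
def fom_pj_per_bit(power_mw, bit_rate_gbps):
    """Energy per bit in pJ: mW divided by Gb/s equals pJ/bit."""
    return power_mw / bit_rate_gbps

# (power in mW, bit rate in Gb/s), taken from Table I.
receivers = {
    "[13] full-rate, 0.13 um": (500, 10),
    "[4] full-rate, 0.18 um": (10, 3),
    "this work full-rate, 0.18 um": (72, 3.3),
    "this work half-rate, 0.18 um": (110, 5),
    "this work full-rate, 90 nm": (32, 10),
    "this work half-rate, 90 nm (sim.)": (50, 16.67),
}

fom = {name: round(fom_pj_per_bit(p, r), 2) for name, (p, r) in receivers.items()}
for name, value in fom.items():
    print(f"{name}: {value} pJ/bit")
```

The computed values agree with the FoM row of Table I to within rounding.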

Compared to the 10 Gb/s receiver in [13], the presented full-rate and half-rate receivers achieve similar sensitivity with significant power reduction. On the other hand, power and area efficiency can be further improved in a 90-nm implementation. Compared to full-rate architectures, the proposed half-rate architecture can potentially achieve twice the speed at the cost of additional hardware complexity and power. In this work, a 50% improvement in speed is achieved at the cost of a 30% increase in power. Clearly, the half-rate architecture is particularly useful when the targeted speed is not achievable using full-rate architectures.

Another potential application of the half-rate architecture is for clock-less demultiplexing which was proposed for burst mode applications in [20] to relax the lock time requirements of the subsequent clock and data recovery circuitry. In [20], a finite state machine performs sophisticated processing resulting in high power consumption. On the other hand, the half-rate receiver proposed in this work demultiplexes the bit stream based on edge detection which is actually performed by the passive 1-D channel, thus providing reduced power consumption.

ACKNOWLEDGMENT

The authors would like to thank Intel Corp. and Broadcom for funding this research and CMC for providing fabrication facilities.

REFERENCES


