1. Introduction
Modern trends in chip-to-chip communication have moved toward GHz-speed serial links that use clock recovery techniques. Clock recovery architectures require dc-balanced data streams. As a result, system architects [1],[2] have developed encoding algorithms that convert a regular 8-bit data stream into a 10-bit, dc-balanced, run-length-limited data pattern. In this project we implement a single-channel 8B/10B encoder following the PCI-Express® encoding requirements.
2. High-level General Design Flow
Figure 2.1 of the appendix shows our high-level design flow. The design process is divided among three teams: Front end, Back end and Custom. Dashed lines represent work done by the same team or person. In this flow, all three teams can start work at roughly the same time, which reduces development time. The next sections describe in detail how each team iterated through this flow.
3. Architecture, RTL, Synthesis.
As indicated in Figure 2.1, the first step in the design process is defining the top-level architecture. This includes the core block functionality and the I/O-to-core interface, taking into account system-level timing, area and technology limits. Figure 3.1 summarizes the proposed architecture. The encoder is implemented using the CMC 0.35 µm wcells standard-cell library, targeting a clock frequency of 250 MHz.
3.1 RTL Code
The encoder core logic is composed of five blocks. Verilog RTL code was first written and verified for each block individually. Once all blocks passed low-level verification, the entire core RTL netlist was assembled and functionally verified again. The 8-bit input word is divided into a 3-bit set and a 5-bit set. This pair is then encoded into a 4-bit/6-bit pair, depending on the running disparity and run length of the previously encoded data. Table 3.1 describes the functionality of each block in detail:
Block | Description |
---|---|
enc_k | Determines whether the current data word is a valid command code when the k input is asserted. Also outputs the encoded 4B/6B pair for valid input commands. |
enc_d | Encodes the 8-bit input word into 4B/6B data streams. |
enc_flip | Determines whether the encoding algorithm allows DBI (data bit inversion). Determines whether the running disparity (RD) needs to be flipped after transmission of the data stream. Outputs other control signals to reset, current RD value. |
S3 | State machine stage. Keeps track of the current RD. Muxes between the enc_k and enc_d 4B/6B outputs. Determines whether data needs to be inverted when DBI is enabled. Determines whether the invalid_k output needs to be asserted. |
S4 | Output flop stage. Sends out data or inverted data depending on the S3 output. Allows output data retiming in case it is necessary to meet system-level timing. |
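The 3-bit/5-bit split and the running-disparity bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's Verilog RTL: the function names `split_byte`, `disparity` and `next_rd` are hypothetical, the bit ordering (the five least-significant bits form one set, the three most-significant bits the other) is assumed from the usual 8B/10B convention, and the actual code tables are deliberately omitted.

```python
def split_byte(byte):
    """Split an 8-bit word into (x, y): x = 5 LSBs, y = 3 MSBs (assumed ordering)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("input must be an 8-bit value")
    return byte & 0x1F, byte >> 5


def disparity(code, width=10):
    """Number of ones minus number of zeros in a code word of the given width."""
    ones = bin(code).count("1")
    return ones - (width - ones)


def next_rd(rd, code):
    """Update the running disparity (rd is -1 or +1) after sending a 10-bit code.

    A balanced code word leaves RD unchanged. An unbalanced one must have
    disparity +2 or -2 and must push RD toward the opposite sign.
    """
    d = disparity(code)
    if d == 0:
        return rd
    if d not in (-2, 2) or (d > 0) == (rd > 0):
        raise ValueError("code word is not legal for the current running disparity")
    return -rd


# 0xBC is the byte behind the K28.5 command code: x = 28, y = 5.
x, y = split_byte(0xBC)
# 0b0011111010 is the commonly tabulated K28.5 code word for negative RD.
rd_after = next_rd(-1, 0b0011111010)
```

In this sketch a legal code word either keeps the running disparity or flips its sign, which is the decision the enc_flip and S3 blocks make in hardware.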
3.2 IO Clocking Architecture
Since the 8B/10B encoder ASIC operates synchronously, the clock/data relationship at its input and output ports, relative to the respective input and output clocks, has to be considered at an early design stage. For ease of system integration, the data hold time was chosen to be zero at the IO interface. As a result, the only requirement for integration with other integrated circuits is that the input data edge never arrives earlier than the clock edge. Figure 3.2 shows the core-to-pad interface.
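The zero-hold requirement can be expressed as a simple window check. The sketch below is illustrative only: the function name `meets_input_timing` is hypothetical, the 4000 ps period follows from the 250 MHz target, and the setup value is a parameter because the report does not state one.

```python
PERIOD_PS = 4000  # clock period at the 250 MHz target: 1 / 250e6 s = 4 ns


def meets_input_timing(data_edge_ps, setup_ps, hold_ps=0, period_ps=PERIOD_PS):
    """Return True if an input data transition is safe at the pad.

    data_edge_ps is measured from a clock edge at the pad; a negative value
    means the data changed before that edge. The transition must not occur
    earlier than hold_ps after the edge (zero here), and it must occur at
    least setup_ps before the next edge.
    """
    return hold_ps <= data_edge_ps <= period_ps - setup_ps
```

With a hypothetical 500 ps setup requirement, `meets_input_timing(0, 500)` passes while `meets_input_timing(-100, 500)` fails, because the data changed before the clock edge.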
3.3 Synthesis
Once the architecture is solidified, the RTL code is synthesized with preliminary constraints. These constraints are further tuned during synthesis to obtain the best design with respect to area, timing and power requirements. A flat (rather than hierarchical) synthesis approach was chosen because it met our targets with more margin and made the design constraints easier to set. Once synthesis is complete, a physical Verilog netlist is exported from the Synopsys Design Environment. This netlist, along with the synthesis constraints file, is then used to place and route the design in First Encounter.
4. PR flow: placement, clock tree, routing and timing closure
This section describes the basic place and route flow used to design the encoder. The flow is detailed in the following seven subsections, from placement of the IO pads to generation of the GDS for DRC/LVS. Figure 4.1 illustrates our P&R flow. There are four main inputs: the gate-level netlist, the design timing constraints, the physical library of all standard cells and I/Os (LEF) and the timing library (lib). The final output is the placed and routed GDS. The Padout spreadsheet (Figure 5.1) maintained by the IO team was used to place the IO pads around the ring as indicated.
4.2 Power Grid Design
The power grid structure of our design consists of Metal1, Metal2 and Metal3. The number of stripes is sufficient to have power consistently distributed throughout the entire core area and have minimal IR drop at the chip center. The following table describes the layout of the power grid.
Metal | Orientation | Width | Power-to-ground stripe spacing | Set-to-set spacing |
---|---|---|---|---|
M1 | Horizontal | 5.8 µm | 15.8 µm | 43.2 µm |
M2 | Vertical | 10 µm | 1.5 µm | 72 µm |
M3 | Horizontal | 10 µm | 11.6 µm | 86.4 µm |
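As a rough cross-check of the table values, the stripe pitch and the fraction of each layer occupied by power stripes can be computed as below. This sketch rests on assumptions that the report does not state: each set is taken to contain one power and one ground stripe, the stripe spacing is taken as edge-to-edge, and the set-to-set spacing is taken as the repeat pitch of a set. The names `GRID`, `stripe_pitch` and `coverage` are hypothetical.

```python
# Values in micrometres, copied from the power grid table.
GRID = {
    "M1": {"width": 5.8, "pg_spacing": 15.8, "set_pitch": 43.2},
    "M2": {"width": 10.0, "pg_spacing": 1.5, "set_pitch": 72.0},
    "M3": {"width": 10.0, "pg_spacing": 11.6, "set_pitch": 86.4},
}


def stripe_pitch(layer):
    """Centre-to-centre distance between the power and ground stripe of one set."""
    g = GRID[layer]
    return g["width"] + g["pg_spacing"]


def coverage(layer):
    """Fraction of the layer taken by power stripes (two stripes per set pitch)."""
    g = GRID[layer]
    return 2 * g["width"] / g["set_pitch"]


for name in GRID:
    print(name, round(stripe_pitch(name), 2), round(coverage(name), 3))
```

Under these assumptions the M1 and M3 stripe pitches both come out at 21.6 µm, and the set pitches of 43.2 µm and 86.4 µm are two and four times that value.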
4.3 Standard Cell Placement
The encoder consists of 878 instances. The design was placed with the First Encounter placement engine, achieving 65% placement utilization. Figure 4.2 is a snapshot of our cell placement.
4.4 Clock Tree Synthesis
The clock signal comes from a clock pad and then branches in two: one branch (clk_in) drives the flops and the other (clk_in_delay) goes to all input pads. The clk_in_delay clock is a delayed version of clk_in, used to satisfy the IO timing constraints. First Encounter Clock Tree Synthesis (CTS) is used to generate both clock trees. The following table summarizes the characteristics of the two clock trees. Figure 4.3 shows our clock tree structures after CTS.
Clock Name | Flops/buffers | Insertion delay | Skew | Transition |
---|---|---|---|---|
clk_in | 54/96 | 1.4 -> 1.6 ns | 200.6 ps | 392 ps |
clk_in_delay | 10/91 | 3.25 -> 3.35 ns | 100 ps | 217 ps |
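The insertion delays in the table imply how far clk_in_delay lags clk_in. The short sketch below works this out in integer picoseconds; the names `TREES`, `insertion_spread` and `relative_delay_range` are hypothetical, and the 4000 ps period is derived from the 250 MHz target rather than quoted from the CTS report.

```python
PERIOD_PS = 4000  # 250 MHz target

# (minimum, maximum) insertion delay in picoseconds, from the clock tree table.
TREES = {
    "clk_in": (1400, 1600),
    "clk_in_delay": (3250, 3350),
}


def insertion_spread(name):
    """Difference between the latest and earliest arrival within one tree."""
    lo, hi = TREES[name]
    return hi - lo


def relative_delay_range(late, early):
    """Smallest and largest lag of tree `late` behind tree `early`."""
    late_lo, late_hi = TREES[late]
    early_lo, early_hi = TREES[early]
    return late_lo - early_hi, late_hi - early_lo


lag_min, lag_max = relative_delay_range("clk_in_delay", "clk_in")
print(lag_min, lag_max, lag_max / PERIOD_PS)
```

The spreads computed this way (200 ps and 100 ps) agree with the skew column to within rounding, and the lag of clk_in_delay behind clk_in falls between 1650 ps and 1950 ps, a little under half of the clock period.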
4.5 Routing
First Encounter NanoRoute is used to route the encoder. The design has minimal congestion, so routing completed with zero violations.
4.6 Timing Closure and ECO
Since there is no timing model for our IO pads, setup and hold timing can only be verified by simulation on the full-chip GDS. The design is extracted and checked for clock and data transition times as follows:
Clock transition | 392 ps (worst case) |
Data transition | 807 ps (worst case) |
During the ECO stages, some buffers were upsized and others inserted to speed up slow transitions. With these changes, the overall timing was acceptable.
4.7 GDS Generation
The physical database is converted to GDS from Encounter. This GDS contains all routing metals, vias and references to the standard cells. It is then streamed into Cadence Virtuoso along with the standard-cell GDS for final DRC/LVS verification.
5. IO design: high speed IO
5.1 Full-Custom IO Design Flow
The design of the input/output circuit for the 8B/10B encoder ASIC began with the target specifications, and layout planning at the chip level. Figure 5.1 in the appendix illustrates the planning of the IO pad out, the chip power/ground to signal ratio, and signal ordering.
5.3 IO Drivers
The IO drivers are specified to drive 2 kΩ, 20 pF loads, using the LVTTL signaling standard as specified in JEDEC JESD-8B (Interface Standard for Nominal 3V/3.3V Supply Digital Integrated Circuits). The output drivers are required to drive a resistive load in order to accommodate subsequent stages that may have bi-directional IO pads, which can present resistive loading. Driving the specified loads, the final design demonstrates rise and fall times of 646 ps and 469 ps respectively.
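For a rough feel of what these edge rates imply, the driver can be approximated as a resistance charging the 20 pF load. The sketch below is a first-order estimate only: it assumes a single-pole RC response, assumes the rise and fall times are measured between 10% and 90% of the swing (the report does not say), and ignores the 2 kΩ resistive load. The function name `effective_drive_resistance` is hypothetical.

```python
import math


def effective_drive_resistance(t_edge_s, c_load_f, lo=0.1, hi=0.9):
    """Resistance R such that an RC step response takes t_edge_s to go from
    the lo fraction to the hi fraction of its final value:
    t = R * C * ln((1 - lo) / (1 - hi)).
    """
    return t_edge_s / (c_load_f * math.log((1 - lo) / (1 - hi)))


C_LOAD = 20e-12  # 20 pF load from the driver specification
r_rise = effective_drive_resistance(646e-12, C_LOAD)
r_fall = effective_drive_resistance(469e-12, C_LOAD)
print(round(r_rise, 1), round(r_fall, 1))
```

Under these assumptions the equivalent pull-up and pull-down resistances are on the order of 10 to 15 ohms, far below the 2 kΩ load, which is why the resistive load is neglected in this estimate.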
5.4 IO Receivers
The clock receiver and the data receiver differ: the clock receiver uses a Schmitt trigger for noise immunity, whereas the data receiver uses a master-slave flip-flop to resynchronize data from the off-chip clock to the on-chip clock.
5.5 ESD Structures
All IO pads have p+/nwell p-diodes and n+/psub n-diodes to protect against positive and negative ESD hits. In addition, the input pads have series resistors of approximately 200 Ω (constructed as a parallel combination of a pdiff resistor and an ndiff resistor) for additional input protection.
5.6 IO Ring layout design
For a modular design, each inline-bonded IO pad shares the same basic structure in a common instance, consisting of the bond site, horizontal VDD/VSS rails, ESD diodes, and guard ring isolation between the core logic and the output driver/input receiver logic. A common corner pad connects the four sides of the IO ring seamlessly. The IO ring instances and layout can be found in Figures 5.2 and 5.3 respectively.
The LEF (library exchange format) for each individual IO pad was exported from its abstract view, to be used by the back-end team in performing automated place and route. The IO ring was made fully DRC and LVS clean prior to full-chip integration.
6. Top-level Integration
Chip integration is the final stage of the design, merging the auto-routed core logic, the full-custom IO ring, and any additionally required blocks such as a seal ring. The chip-level instance view and layout view are shown in Figures 6.1 and 6.2.
In order to perform LVS, the Verilog netlist for the core logic was imported into Cadence as schematics constructed from standard cells. The signals interfacing with the IO ring were identified and IO ports were created for them. The physical view of the core logic was imported from a GDS, with a stream layer map table providing the layer definitions.
Manual power tap connections were inserted to tap the IO power grid onto the chip core power grid. Finally, top level pin stamps were drawn to identify top level IO signals.
The IO ring and the core were each made LVS clean individually prior to top-level integration; the LVS reports are shown in the appendix.
Because the IO ring and the core logic require two different LVS methodologies (flat LVS vs. macro LVS), a complete full-chip LVS could not be performed. However, the respective IO and core interfacing signals were checked to be connected properly.
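The pass criteria read off the LVS reports in the appendix (a "The net-lists match." line and zero counts in the error rows) can be checked mechanically. The following is a minimal sketch that assumes the report layout shown in the appendix; the function name `lvs_is_clean` and the choice of which rows count as errors are assumptions made for illustration, not part of the Cadence tool.

```python
# Summary rows whose counts must all be zero for a clean comparison.
ERROR_ROWS = ("un-matched", "rewired", "size errors", "different type")


def lvs_is_clean(report_text):
    """Return True if the report says the netlists match and no error row
    carries a non-zero count in either the layout or the schematic column."""
    matched = False
    for raw in report_text.splitlines():
        line = raw.strip()
        if line == "The net-lists match.":
            matched = True
        for label in ERROR_ROWS:
            if line.startswith(label):
                counts = line[len(label):].split()
                if any(c != "0" for c in counts):
                    return False
    return matched


sample = """The net-lists match.
instances
un-matched 0 0
rewired 0 0
size errors 0 0
active 834 834
terminals
un-matched 0 0
matched but
different type 0 0
total 37 37
"""
print(lvs_is_clean(sample))
```

A check like this could be run separately on the IO ring report and the core report, since the two were verified with different LVS methodologies.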
7. Appendix
7.1 Figures
Figure 2.1 - Top Level Design Flow
Figure 3.1 - Encoder Core Blocks
Figure 3.2 - I/O to Core interface
Figure 4.1 - Place and Route Flow
Figure 4.3 - Placed and routed clock tree
Figure 5.1 - Snapshot of Padout Spreadsheet
Figure 5.2 - I/O Ring instance view
Figure 5.3 - I/O Ring Layout View
Figure 6.1 - Full Chip instance View
Figure 6.2 - Full Chip Layout View
7.2 I/O Ring LVS Reports
Running simulation in directory: "/proj/fc/devel_rfung/meng/ECE1388/LVS".
*WARNING* Attached technology library _wcells does not exist.
Design library has been temporarily attached to default technology library.
Please attach an existing technology library to the design library cmosp35diode.
Or add the attached technology library _wcells in cds.lib.
*WARNING* techOpenTechFile: unable to open file techfile.cds in library cdsDefTechLib in r mode
*WARNING* techPcellEvalTrigger: Internal error since tfCnt is equal to 0
*WARNING* techPcellEvalTrigger: Internal error since tfCnt is equal to 0
*WARNING* techPcellEvalTrigger: Internal error since tfCnt is equal to 0
*WARNING* techPcellEvalTrigger: Internal error since tfCnt is equal to 0
Begin netlist: Dec 20 00:42:25 2004
view name list = ("auLvs" "extracted" "gate.sch" "cmos.sch")
stop name list = ("auLvs")
library name = "Encoder8B10BLib"
cell name = "IORING"
view name = "extracted"
globals lib = "basic"
Running Artist Flat Netlisting ...
End netlist: Dec 20 00:42:41 2004
Moving original netlist to extNetlist
Removing parasitic components from netlist
presistors removed: 0
pcapacitors removed: 0
pinductors removed: 0
pdiodes removed: 0
trans lines removed: 0
6203 nodes merged into 6203 nodes
Begin netlist: Dec 20 00:42:43 2004
view name list = ("auLvs" "schematic" "gate.sch" "cmos.sch")
stop name list = ("auLvs")
library name = "Encoder8B10BLib"
cell name = "IORING"
view name = "schematic"
globals lib = "basic"
Running Artist Flat Netlisting ...
End netlist: Dec 20 00:42:54 2004
Moving original netlist to extNetlist
Removing parasitic components from netlist
presistors removed: 0
pcapacitors removed: 0
pinductors removed: 0
pdiodes removed: 0
trans lines removed: 0
7303 nodes merged into 7303 nodes
Running netlist comparison program: LVS
Begin comparison: Dec 20 00:42:56 2004
@(#)$CDS: LVS version 5.0.0 01/31/2004 20:15 (intelibm12) $
Warning: Devices on a command "permuteDevice" that are not present in netlist:
"capacitor".
Warning: Devices on a command "parameterMatchType" that are not present in netlist:
"capacitor".
1328 net-list ambiguities were resolved by random selection.
The net-lists match.
layout schematic
instances
un-matched 0 0
rewired 0 0
size errors 0 0
pruned 0 0
active 14018 11878
total 14018 11878
nets
un-matched 0 0
merged 0 0
pruned 0 0
active 6203 6203
total 6203 6203
terminals
un-matched 0 0
matched but
different type 0 0
total 69 69
End comparison: Dec 20 00:43:05 2004
Comparison program completed successfully.
7.3 Core LVS Reports
@(#)$CDS: LVS version 5.0.0 01/31/2004 20:15 (intelibm12) $
Command line: /proj/stfs1_vol9/local_user/vendor_tools/cadence.5033-1/tools.lnx86/dfII/bin/32bit/LVS -dir /proj/fc/devel_rfung/meng/ECE1388/LVS -l -s -f -t /proj/fc/devel_rfung/meng/ECE1388/LVS/layout /proj/fc/devel_rfung/meng/ECE1388/LVS/schematic
Like matching is enabled.
Net swapping is enabled.
Fixed device checking is enabled.
Using terminal names as correspondence points.
Net-list summary for /proj/fc/devel_rfung/meng/ECE1388/LVS/layout/netlist
count
934 nets
37 terminals
1 wdkrp_2
3 wand2_1
2 wdkrp_4
1 wand2_2
2 wdtkrp_2
20 wand2_4
1 wdtkrp_4
2 wnor2_2
4 wnor2_4
13 wbuf_1
16 wdp_4
11 wbuf_2
13 wand3_4
74 winv_1
154 wbuf_4
37 winv_2
4 wnand2_1
10 wnand2_2
252 winv_4
23 wnand2_4
1 wand4_1
15 wxor2_2
3 wand4_4
3 wdtp_2
24 wdtp_4
1 wnand3_2
1 wnand3_4
5 wor2_1
29 wcd_8
21 wor2_2
56 wor2_4
5 wmux2_4
1 wnand4_4
4 wor3_1
6 wor3_4
4 wcd_12
1 wdtkrsp_2
4 wcd_16
2 wdtkrsp_4
2 wdkrsp_2
1 wor4_1
2 wor4_2
Net-list summary for /proj/fc/devel_rfung/meng/ECE1388/LVS/schematic/netlist
count
934 nets
37 terminals
1 wdkrp_2
3 wand2_1
2 wdkrp_4
1 wand2_2
2 wdtkrp_2
20 wand2_4
1 wdtkrp_4
2 wnor2_2
4 wnor2_4
13 wbuf_1
16 wdp_4
11 wbuf_2
13 wand3_4
74 winv_1
154 wbuf_4
37 winv_2
4 wnand2_1
10 wnand2_2
252 winv_4
23 wnand2_4
1 wand4_1
15 wxor2_2
3 wand4_4
3 wdtp_2
24 wdtp_4
1 wnand3_2
1 wnand3_4
5 wor2_1
29 wcd_8
21 wor2_2
56 wor2_4
5 wmux2_4
1 wnand4_4
4 wor3_1
6 wor3_4
4 wcd_12
1 wdtkrsp_2
4 wcd_16
2 wdtkrsp_4
2 wdkrsp_2
1 wor4_1
2 wor4_2
Terminal correspondence points
1 VDD!
2 VSS!
3 clk_in
4 clk_in_delay__L40_N0
5 clk_in_delay__L40_N1
6 clk_in_delay__L40_N2
7 clk_in_delay__L40_N3
8 clk_in_delay__L40_N4
9 clk_in_delay__L40_N5
10 clk_in_delay__L40_N6
11 clk_in_delay__L40_N7
12 clk_out__L14_N0
13 clk_out__L14_N1
14 clk_out__L14_N2
15 clk_out__L14_N3
16 clk_out__L14_N4
17 data0_net
18 data1_net
19 data2_net
20 data3_net
21 data4_net
22 data5_net
23 data6_net
24 data7_net
25 invalid_k_net
26 k_net
27 rst_net
28 tx_data0_net
29 tx_data1_net
30 tx_data2_net
31 tx_data3_net
32 tx_data4_net
33 tx_data5_net
34 tx_data6_net
35 tx_data7_net
36 tx_data8_net
37 tx_data9_net
The net-lists match.
layout schematic
instances
un-matched 0 0
rewired 0 0
size errors 0 0
pruned 0 0
active 834 834
total 834 834
nets
un-matched 0 0
merged 0 0
pruned 0 0
active 934 934
total 934 934
terminals
un-matched 0 0
matched but
different type 0 0
total 37 37
Probe files from /proj/fc/devel_rfung/meng/ECE1388/LVS/schematic
devbad.out:
netbad.out:
mergenet.out:
termbad.out:
prunenet.out:
prunedev.out:
audit.out:
Probe files from /proj/fc/devel_rfung/meng/ECE1388/LVS/layout
devbad.out:
netbad.out:
mergenet.out:
termbad.out:
prunenet.out:
prunedev.out:
audit.out:
8. References
[1] Actel Corporation. "Implementing an 8b/10b Encoder/Decoder for Gigabit Ethernet in the Actel SX FPGA Family". Actel Corporation, 955 East Arques Avenue, Sunnyvale, California 94086, USA. Web: www.actel.com
[2] Widmer, Albert X. "8B/10B Encoding and Decoding for High Speed Applications". IBM T.J. Watson Research Center, 1101 Kitchawan Rd., Route 134, Yorktown Heights, NY 10598-0218, USA.