# Clock-Routing Driven Layout Methodology for Semi-Synchronous Circuit Design<sup>\*</sup>

Atsushi TAKAHASHI, Wataru TAKAHASHI, and Yoji KAJITANI

Department of Electrical and Electronic Engineering, Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku, Tokyo 152, Japan E-mail: {atushi, wata, kajitani}@ss.titech.ac.jp

#### Abstract

The most expensive part in modern VLSIs is the clockdistribution network where the clock is assumed to be distributed periodically and simultaneously. While, in the semi-synchronous system, the clock is assumed to be distributed periodically, but not necessarily simultaneously. In this framework, we propose a new design methodology which maximizes the performance of a circuit subject to the minimum cost clock-distribution network. The clock-delay map is calculated in advance and then the circuit placement procedure maximizes the performance according to it. This methodology reduces the clock-distribution cost significantly. The experiments show that the performance of a circuit obtained by our methodology is comparable with that of the complete-synchronous circuit in most cases.

#### 1 Introduction

In layout synthesis, distribution of the clock is critical to the performance of sequential circuits. In the *complete-synchronous* system, the clock is assumed to be distributed periodically and simultaneously to every register. Therefore, the *clock-skew*, the maximum difference of delays to the clock pins on registers from the clock source is a negative effect against speeding up a sequential circuit. Thus efforts have been towards its elimination. Surveys are found in [4, 1].

While, in the *semi-synchronous* system [9], the clock is assumed to be distributed periodically to every register, but not necessarily simultaneously. A *clock-timing* of a register is the difference between clock-delays to the register and to a reference register. A *clock-schedule* is a set of clock-timings of registers. Given signaldelays between registers, a clock-schedule that realizes the minimum clock-period is called an *optimum*  clock-schedule. The minimum clock-period and an optimum clock-schedule in the semi-synchronous framework are determined by a graph theoretical algorithm [9]. A remarkable feature of the semi-synchronous system compared with the complete-synchronous system is that the clock-period can be shorter than the maximum signal-delay between registers. The experiments in [9] show that the reduction of the clock-period to the minimum clock-period in the complete synchronous framework, which is the maximum signal propagation delay, is 16.1% on average. An optimum clock-schedule is also obtained by a linear programming [3], or by using the decision version of the problem with binary search strategy [2]. Similar discussions are found in the multiphase clock-schedule [6, 5, 7].

We call a clock-tree that realizes a given clockschedule a *schedule-clock-tree*. It is shown that a schedule-clock-tree for an arbitrary clock-schedule can be realized using the Elmore-delay model [8]. However, the experiments in [8] show that the total wire length of schedule-clock-tree for randomly generated clock-pin locations and clock-schedule was about 1.8 times larger than that of zero-skew clock-tree.

Since the most expensive part in modern VLSIs is the clock-distribution network, we should attach great importance to the cost of the clock-distribution as well as the clock-period of a circuit. It is possible to reduce the cost of the clock-distribution by some extent under the control of clock-schedule. However it seems impossible to reduce the cost to the level attained by the zeroskew clock-tree as far as we are devoted to optimize the clock-schedule for given signal-delays between registers and locations of registers which are determined without taking the advantage of the semi-synchronous system.

We propose a new design methodology which minimizes the clock-period of a circuit subject to the given clock-distribution network. A minimum cost clockdistribution network is assumed to be prefabricated on the chip. The clock-delay to each clock pin which depends on its location is calculated from the clockdistribution network. This provides us with the *clock*-

<sup>\*</sup>This work was supported in part by CAD21 (TIT) and in part by Grant-in-Aid for Scientific Research of the Ministry of Education, Science and Culture of Japan.

delay map. The clock-timing of each register is determined automatically by its location. Thus, the clockschedule is determined if a layout of the circuit is given. The signal-delays between registers, which consist of the gate-delay and routing-delay, are also calculated from the layout of the circuit. Accordingly, the minimum clock-period of the layout in the semisynchronous framework is determined. Thus the target of a placement procedure is the minimization of the clock-period of the circuit by optimizing layout under the clock-delay map.

This methodology enables us to reduce the wire length of the clock-distribution network in practical size by 30% compared with the H-tree based clock-distribution network. This reduction is confirmed when more than 400 clock-pins are distributed uniformly on a unit length 2-dimensional mesh. In this case, the wire length of an H-tree, which will be a zero-skew clock-tree, is  $1.5(n - n^{0.5})$  and that of a minimum spanning tree is n - 1 where n is the number of clock-pins.

Experiments using a simplified model show that the performance of a semi-synchronous circuit obtained by our methodology is comparable with the performance of the complete-synchronous circuit in most cases.

The rest of paper is organized as follows. After giving the basic definitions in Section 2, we present a proposed methodology in Section 3. In Sections 4 and 5 the model for experiments and experimental results are presented. Section 6 is the conclusion.

#### 2 Preliminaries

Sequential circuit N under consideration to be designed in the semi-synchronous framework consists of registers and gates, and wires connecting them. Every register of N is ticked equi-period but not necessarily simultaneously by the clock, which is distributed by a clockdistribution network C.

A clock from clock-source  $s_0$  arrives at clock-pin of each register v through C with some delay which is called the *clock-delay* of v. Our concern is not with the clock-delay itself but their relative difference. To convert the clock-delay to the relative difference, we take an arbitrary (maybe an imaginary) register as the reference register such that it is ticked by a clock-edge with the *reference clock-delay*. Then register v is ticked d(v) time after the reference register is ticked. d(v)is called the *clock-timing* of v. The *signal-delay* from register u to register v consists of the *gate-delay* and routing-delay. The signal-delay is not unique because of signal propagations through various paths on N, different rise and fall gate delays, etc. Let  $w_{min}(u, v)$  and  $w_{max}(u, v)$  be the minimum and maximum signal-delay from register u to register v, respectively.

There are two types of constraints, called No-

Double-Clocking Constraint and No-Zero-Clocking Constraint, to guarantee the circuit to function correctly in the semi-synchronous framework [3].

#### **No-Double-Clocking Constraint:**

$$d(v) \leq d(u) + w_{min}(u, v) \quad (1)$$

No-Zero-Clocking Constraint:  

$$d(x) + ay = (x, y) \leq -t + d(y)$$
(2)

 $d(u) + w_{max}(u, v) \leq t + d(v)$ (2)

where t is the clock-period of the circuit concerned. Given clock-timings and signal-delays, the circuit is feasible if  $d(u) - d(v) + w_{min}(u, v) \ge 0$  for any register pair u and v, and the minimum clock-period is the maximum  $d(u) - d(v) + w_{max}(u, v)$  over all register pairs uand v with signal propagation.

#### 3 Clock-Routing Driven Layout

The proposing design methodology is outlined as follows: First, a clock-distribution network C with minimum cost is assumed to be fabricated on the chip independent of the circuit N; Second, the layout of the circuit N is optimized under the assumed clockdistribution network C; Third, the clock-distribution network is fabricated on the chip without or with minor modification from the assumption. If the clockdistribution network C is modified in the third step, the repetition of the second and third steps may be required.

The cost of a clock-distribution network C may depend on the wiring area, power consumption, and reliability. But it has nothing to do with the clock-skew, the difference of clock-delays to clock-pins, in this methodology.

The clock-timing d(v) of a register v is calculated from the assumed clock-distribution network C which depends on the location of the register. The signaldelay w(u, v) between registers u and v is calculated from the layout of the circuit N. If d(u) - d(v) + $w_{min}(u,v) < 0$  for some register pair u and v, the layout of the circuit N is not compatible with the clockdistribution network C, that is, the circuit does not work correctly for any clock-period. Otherwise, the minimum feasible clock-period for the layout is the maximum  $d(u) - d(v) + w_{max}(u, v)$  over all register pairs u and v with signal propagation. The layout of the circuit is optimized by a placement procedure with respect to the minimum clock-period. Of course, we can take other evaluations other than the clock-period, for example, the total wire length of the circuit.

Note that the semi-synchronous circuit obtained by this methodology is better than the completesynchronous circuit even if the layout evaluations, for example clock-period, are comparable, since the clockdistribution cost can be significantly reduced.



Figure 1: Chip model

## 4 Layout Model for Validation of the Methodology

For the validity of our methodology, we synthesize the layout of the circuit using a simple model.

The chip model is the one as shown in Fig. 1. There are slots for registers on the 2-dimensional mesh in the chip and slots for IOs on the boundary of the chip. The clock-source  $s_0$  is at the middle of the top boundary of the chip.

The clock-delay of register v is proportional to the Manhattan distance from  $s_0$  to the slot which v is assigned. Regarding the clock-source as the reference register, we have that  $d(v) = \alpha L_2(s_0, v)$  where  $\alpha$  is a certain coefficient that transforms the length to delay and  $L_2(s_0, v)$  denotes the Manhattan distance between  $s_0$  and v. While, the clock-delay of IO pins v is fixed to  $d(v) = \alpha (W/4 + H/2)$  where W and H are the width and height of the chip, respectively. This value is selected to minimize the maximum clock-delay difference between the adjacent register slot and IO slot. These definitions will be understood by the clock-delay map shown in Fig 2.

For the signal-delays of N, each gate has a unit delay and the routing-delay from registers u to v is proportional to the Manhattan distance between u and v. The assingment of the gates does not affect the routing-delays. Thus we ignore the assingment of gates. Let  $g_{min}(u,v)$   $(g_{max}(u,v))$  be the minimum (maximum) number of gates over all paths from u to v. Then  $w_{min}(u,v) = g_{min} + \beta L_2(u,v)$  and  $w_{max}(u,v) =$  $g_{max} + \beta L_2(u,v)$  where  $\beta$  is a certain coefficient that transforms the length to delay.

The placement procedure used for layout synthesis is based on a simulated-annealing (SA) strategy. The



Figure 2: Clock-delay map

procedure starts with a random assignment of registers and IOs to their slots. Then, the procedure repeats a pairwise exchange of the contents in either register slots or IO slots, until the assingment is infeasible or the predefined number of pairwise exchanges are done. If a feasible assignment is obtained, the procedure continues a pairwise exchange of the contents of slots using SA method, and improves a layout. If an exchange leads to the infeasible assignment, the exchange is rejected without Boltzmann test in SA. Otherwise the exchange is tested by the evaluation function. An assignment of registers and IOs to their slots is evaluated by the minimum clock-period T and the summation Zof  $d(u) - d(v) + w_{max}(u, v)$  over all register pairs u and v with signal propagation. The evaluation function is f = aT + bZ where a and b are the weight constants.

#### 5 Experiments

The above procedure was applied to LGSynth93 benchmarks using Sun Sparcstation 10. Each circuit is designed in the complete-synchronous framework and semi-synchronous framework. In the completesynchronous framework, a flat clock-delay map is used, that is, the ideal zero clock-skew is assumed. For each circuit, the width W and height H of the chip, the number of slots, and the distance between adjacent slots are determined to fit the circuit. The parameters used in the experiments are as follows.  $\alpha = 0.7$ ,  $\beta = 0.7$ , a = 10, and b = 1.

The results of the best layout with respect to the clock-period among 10 layouts for each circuit in both frameworks are shown in Table 1. In Table 1, "Regs" and "IOs" are the numbers of registers and IO terminals, respectively. " $\tau_N$ " is the clock-period. "len" is the total wire length which is estimated by the sum of the distances between registers with signal propaga-

|       |               | complete-sync       |                                  | semi-sync                 |             |
|-------|---------------|---------------------|----------------------------------|---------------------------|-------------|
| name  | Regs<br>(IOs) | $	au_N[\mathbf{u}]$ | $\operatorname{len}[\mathbf{u}]$ | $	au_N[\mathrm{u}]\ (\%)$ | len[u] (%)  |
| s27   | 3(5)          | 8.8                 | 46                               | 8.4 (-4.5)                | 49 ( 6.5)   |
| s208  | 8(13)         | 16.2                | 443                              | 16.1(-0.6)                | 451 (1.7)   |
| s298  | 14 (9)        | 14.3                | 443                              | 13.3(-7.0)                | 468(5.6)    |
| s344  | 15(20)        | 22.7                | 767                              | 19.0(-16.3)               | 794(3.6)    |
| s349  | 15(20)        | 22.7                | 755                              | 19.0(-16.3)               | 778 ( 3.1)  |
| s382  | 21 (9)        | 15.6                | 1116                             | 15.8(1.3)                 | 1128 (1.1)  |
| s386  | 6(14)         | 19.1                | 1163                             | 20.5(7.3)                 | 1200(3.2)   |
| s400  | 21 ( 9)       | 15.9                | 1088                             | 15.8 (-0.6)               | 1152(5.8)   |
| s420  | 16(21)        | 24.0                | 1210                             | 25.4(5.8)                 | 1289(6.5)   |
| s444  | 21 (9)        | 16.0                | 1107                             | 15.0 (-6.2)               | 1205 ( 8.8) |
| s510  | 6(26)         | 20.7                | 977                              | 25.5(23.2)                | 972 (-0.4)  |
| s526  | 21 ( 9)       | 15.5                | 1066                             | 14.9 (-3.9)               | 1068(0.2)   |
| s526n | 21 (9)        | 15.5                | 1066                             | 14.9(-3.9)                | 1066(0.0)   |
| s641  | 19(59)        | 73.2                | 6074                             | 68.9(-5.9)                | 6622 ( 9.0) |
| s713  | 19(58)        | 69.9                | 5916                             | 66.4(-5.0)                | 5785(-2.2)  |
| s820  | 5(37)         | 20.5                | 2651                             | 27.3(33.2)                | 2499(-5.7)  |
| s832  | 5(37)         | 20.5                | 2636                             | 27.3 (33.2)               | 2552(-3.2)  |
| s838  | 32(37)        | 34.6                | 2936                             | 33.4 (-3.5)               | 3038(3.5)   |
| s953  | 29(39)        | 22.3                | 3952                             | 22.7(1.8)                 | 4660(17.9)  |
| s1196 | 18 (28)       | 39.0                | 7101                             | 39.0(0.0)                 | 7152(0.7)   |
| s1238 | 18(28)        | 37.4                | 6928                             | 37.4 ( 0.0)               | 7155 ( 3.3) |
| s1423 | 74 (22)       | 99.3                | 118713                           | 96.5 (-2.8)               | 121361(2.2) |
| s1488 | 6(27)         | 35.2                | 5181                             | 43.3 (23.0)               | 5122(-1.1)  |
| s1494 | 6(27)         | 35.2                | 5193                             | 43.3 (23.0)               | 5103 (-1.7) |
| ave.  | —             | -                   | —                                | -(3.1)                    | -(2.9)      |

Table 1: Layout Results. (u: unit delay or unit length)

tion. The percentages in a semi-synchronous column are the differences from a complete-synchronous column. The computational times of both frameworks are almost same for each circuit and at most 250[s] in each design.

For most of circuits, the clock-period and total wire length of the circuit in the semi-synchronous framework are comparable with those in the complete-synchronous framework. Note that the total wire length is evaluated in the complete-synchronous framework since the summation Z is the total wire length, but not evaluated directly in the semi-synchronous framework. For several circuits, especially in cases that the number of registers is small in contrast with the number of IOs, the layout results in the semi-synchronous framework are inferior to that in the complete-synchronous framework. It seems that the number of registers is too small to work the placement procedure well under the simple model or the circuit is not fit for the assumed clockdelay map.

### 6 Concluding Remarks

We propose a clock-routing driven layout methodology for semi-synchronous circuit design. The experiments show that the performance of a semi-synchronous circuit obtained by our methodology is comparable with the performance of the complete-synchronous circuit in most cases. The estimated average length of clock-distribution network is 72.1[u] in the semisynchronous framework to 89.4[u] in the completesynchronous framework in the experiments.

However, the experiments in this paper are not enough to appeal the effect of the proposed methodology from many points of view. The very simple chip model and delay model are assumed. The number of registers in each circuit is too small to reduce the wire length of the clock-distribution network substantially. The modification of the clock-distribution network is not considered. There are large spaces for improvement of the layout procedure used for experiments to convince the potential ability of the methodology.

#### References

- J. Cong, L. He, C. K. Koh, and P. H. Madden, "Performance optimization of VLSI interconnect layout," *INTEGRATION*, the VLSI journal, vol. 21, pp. 1–94, 1996.
- [2] R. B. Deokar and S. S. Sapatnekar, "A graphtheoretic approach to clock skew optimization," in *Proc. ISCAS* '94, vol. 1, pp. 407–410, 1994.
- [3] J. P. Fishburn, "Clock skew optimization," *IEEE Trans. on Computers*, vol. 39, no. 7, pp. 945–951, 1990.
- [4] E. G. Friedman, ed., Clock Distribution Networks in VLSI Circuits and Systems: A Selected Reprint Volume. IEEE Press, 1995.
- [5] D. A. Joy and M. J. Ciesielski, "Placement for clock period minimization with multiple wave propagation," in *Proc. 28th DAC*, pp. 640–643, 1991.
- [6] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, "Analysis and design of latch-controlled synchronous digital circuits," in *Proc. 27th DAC*, pp. 111–117, 1990.
- [7] T. G. Szymanski, "Computing optimal clock schedules," in *Proc. 29th DAC*, pp. 399–404, 1992.
- [8] A. Takahashi, K. Inoue, and Y. Kajitani, "Clocktree routing realizing a clock-schedule for semisynchronous circuits." to appear in Proc. ICCAD '97, 1997.
- [9] A. Takahashi and Y. Kajitani, "Performance and reliability driven clock scheduling of sequential logic circuits," in *Proc. ASP-DAC '97*, pp. 37–42, 1997.