Abstract

We describe our prototyping of a functioning DLX microprocessor, based on the 32-bit instruction set architecture developed by Patterson and Hennessy. This architecture is an emerging academic standard, but to our knowledge has yet to be successfully prototyped. Our implementation of DLX is a 12" x 15" 2-layer circuit board, containing 59 chips and running on a 2 MHz clock. Our machine was developed at the Thayer Rapid Prototyping Facility, a laboratory for the rapid construction and evaluation of digital systems.

1.0 Introduction

The DLX microprocessor is a 32-bit RISC CPU, designed by David Patterson and John Hennessy and described in detail in [1]. This description includes instruction set formats, opcode mnemonics, and a basic datapath. A compiler and simulator are also available, greatly simplifying the task of developing a complete working system. Despite this, while DLX has been simulated extensively at many different sites (see for example [2] and [3]), it has to our knowledge never been implemented.

This paper describes our implementation of the DLX microprocessor. The resulting system was produced entirely on site, and correctly executes C programs compiled with the DLX compiler. We have learned a great deal from building a working system, and here share our insights for others interested in implementing DLX and other complex microsystems.

We begin with an overview of the Thayer Rapid Prototyping Facility, our laboratory for the rapid production of digital systems. We then briefly discuss the DLX architecture, and describe our implementation in considerable detail. We discuss how Thayer DLX programs are executed, identify the significant problems we encountered, and discuss how they were solved. Conclusions are presented, along with directions for future work.

2.0 The Thayer Rapid Prototyping Facility

Approximately three years ago, we began a concerted research effort into providing an enhanced rapid prototyping capability for our research in digital system design. We had identified several problems with VLSI-based prototyping approaches [4], particularly for sites where local fabrication was not possible, and decided to adopt a PCB-based approach. We envisioned a facility where users could walk in with an idea and walk out with hardware, all without ever having to leave the lab. The result was the Thayer Rapid Prototyping Facility.

The Thayer RPF is designed to provide hardware and software assistance for each step of the design process. Our view of this process is shown in Figure 1. The design begins with functional specification and a high-level block diagram. Hardware components are selected, and the design is entered with schematic capture tools. The design is simulated, and a netlist is generated. After placement, the design is then used as input to a PCB layout program. Layout and routing tools produce a board description file, which is turned into a board using a PCB prototyper. The resulting board is populated with ICs and tested.

The goals of the Thayer RPF are 1) to have all steps of this process performed in the same laboratory, and 2) to produce working prototypes as quickly as possible. To achieve these goals, we have constructed an integrated environment of commercial products. We have chosen the Sun Sparcstation as the main workstation for the Thayer RPF. Currently, we have four color workstations in the laboratory. The Workview® tool package, from Viewlogic Incorporated, is used to accelerate the schematic capture and simulation stages of the design process. Workview provides a complete schematic capture and simulation package, including back-annotation, hierarchical schematics, an extensive parts library, and support for device modeling. We also have an extensive array of field programmable gate array support, including Actel, Altera, and Xilinx development systems.

For PCB layout, we are using the Racal-Redac PCB system, running on an IBM PC. We have written our own software in house that assists in the translation of Workview netlists to the format Racal-Redac requires.

The RPF is perhaps unique as an academic laboratory in that it has the capability of producing printed circuit boards on site, in the same room in which systems are designed. The RPF employs a PCB prototyping system developed by Direct Imaging Incorporated, in which a resistive ink is sprayed on copper sheets, which are then etched with sodium persulfate. The ink is then scrubbed off, and the sheets are tin plated, automatically drilled, and assembled into a finished prototype.

System testing is the final stage of the design process. This requires a sophisticated pattern generator and logic analyzer, one that interfaces easily with the simulation tools of the CAD system being used and permits rapid comparison of simulation output vectors with observed output vectors. The Thayer RPF uses the HP 16500A logic analyzer for system bringup and test. This device has proven extremely effective in the final stages of the system prototyping process.

The Thayer Rapid Prototyping Facility has been involved in several successful experiments in rapid digital system design. In addition to the DLX microprocessor, other projects include the design of a computer for gene sequence analysis, an FHT transform engine, and a real-time data processor for rocket telemetry. For further information on these and other projects, the reader is referred to [5], [6], [7], and [8].

The problems of producing working systems in a university environment are well known [9]. These problems include continuity of personnel, publication pressures, and resource availability. We have experienced all these difficulties in our efforts to develop a rapid prototyping laboratory. Nonetheless, our experience with the RPF confirms the positive experience of other researchers in this area; rapid prototyping capabilities make valuable contributions to both teaching and research that simulation cannot. Our prototyping of the DLX microprocessor is a case in point.

3.0 DLX Architecture

For readers unfamiliar with the DLX architecture, we give a brief overview of it here. Other readers may skip to the next section. For a more detailed presentation, the reader is referred to [1].

DLX is a 32-bit microprocessor architecture, with 32 general purpose registers and a hard-wired zero in R0. Memory is Big Endian byte addressable (i.e. byte 0 is in the most significant position of the word), and all instruction accesses are aligned.
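As a small illustration of the byte ordering described above, the following Python sketch lays a 32-bit word out in memory with byte 0 in the most significant position. It is not part of the DLX tool chain; the function names and the use of a bytearray to stand in for memory are our own assumptions.

```python
def store_word_big_endian(memory, address, value):
    """Store a 32-bit word so that byte 0 holds the most significant bits."""
    memory[address:address + 4] = (value & 0xFFFFFFFF).to_bytes(4, "big")


def load_word_big_endian(memory, address):
    """Read back a 32-bit word stored in Big Endian order."""
    return int.from_bytes(memory[address:address + 4], "big")


# Hypothetical 8-byte memory, used only for illustration.
memory = bytearray(8)
store_word_big_endian(memory, 4, 0x12345678)
print(memory.hex())  # bytes 4..7 hold 12 34 56 78, in that order
```

Reading the byte at the lowest address of the word returns 0x12, the most significant byte, which is the Big Endian convention.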

The DLX integer instruction set is shown in Table 1. (A floating point extension of the architecture is also described in [1]; we did not implement it.) There are three basic classes of instruction: data transfer, arithmetic/logical, and control flow. Instruction formats are shown in Figure 2. We note that the DLX instruction set is highly streamlined. The number of instructions and instruction formats is small, and instruction decoding is simple.

One suggested DLX datapath is shown in Figure 3. Consistent with the instruction set, the datapath has two source busses and a destination bus. A 32-bit ALU is responsible for basic arithmetic and logical functions, with interaction to memory handled by a Memory Address Register and a Memory Data Register. Interrupt addresses are stored in the Interrupt Address Register, while instructions are fetched with a PC. We see from the datapath that most instructions can execute in one cycle.

In general, we note that the DLX microprocessor has a streamlined instruction set, with a few simple instruction formats and operations that are easy to decode. The architecture is carefully described, and comes with a publicly available C compiler and simulator. All these features make it an ideal candidate for microsystem prototyping.

4.0 Implementation

Our implementation of DLX is a 2-layer 12" x 15" printed circuit board, shown in Figure 4. This board was manufactured in the RPF with the PCB prototyper discussed earlier. The board contains 59 chips and consumes 12.5 watts of power.

A block diagram of the Thayer DLX datapath is shown in Figure 5. Comparing with Figure 3, we see that the principal differences are 1) the adoption of a 2-bus architecture, and 2) the use of a so-called "universal unit", or UU.

We chose a 2-bus architecture for three reasons: 1) to match our available register files, which shared input and output pins, 2) to improve the routability of the board, and 3) to simplify the machine. This decision reflects a consistent willingness to sacrifice performance for an increased probability of producing a working prototype under time constraints. We expect that others interested in prototyping DLX will face similar tradeoffs.

The UU is a field programmable gate array, initially adopted to implement a 32-bit barrel shifter. As the design progressed, however, we discovered that more and more logic could be added to it without increasing chip count or power consumption. Thus the shifter became the UU, containing both the (nontrivial) sign-extension logic required by the DLX instruction set and the memory alignment circuitry. The ability to incorporate new logic into our design quickly and easily was crucial to its success; the use of a field programmable gate array was absolutely essential.
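To make the sign-extension and alignment functions concrete, the sketch below models them in Python. It is a behavioral illustration only, not a description of the logic actually placed in the FPGA. We assume the 16-bit immediates and 8-bit byte loads of the DLX formats in [1]; the function names are hypothetical.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a 32-bit result."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # sign bit set: value is negative
        value -= 1 << bits
    return value & 0xFFFFFFFF


def load_byte(word, byte_offset, signed=True):
    """Extract one byte from a 32-bit word (Big Endian: byte 0 is the MSB)."""
    shift = 8 * (3 - byte_offset)
    byte = (word >> shift) & 0xFF
    return sign_extend(byte, 8) if signed else byte


print(hex(sign_extend(0xFFF0, 16)))                 # 16-bit immediate -16
print(hex(load_byte(0x12F45678, 1)))                # signed byte load (LB)
print(hex(load_byte(0x12F45678, 1, signed=False)))  # unsigned byte load (LBU)
```

The same byte yields different register contents for signed and unsigned loads, which is why the sign-extension logic must know which instruction is executing.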

4.1 Program Execution

The Thayer DLX communicates with a Macintosh computer using a UART; binary files are downloaded over an RS-232 link into 32K of on-board SRAM and then executed.

To run programs on the Thayer DLX, the processor board must first be powered up and reset. A boot ROM initializes the register file, interrupt vectors and the UART, and then waits for incoming files over the RS-232 link.

To execute programs, the user begins by creating a C program and compiling it with the public domain DLX compiler (dlxcc). This produces an ASCII file of DLX instructions. The first few lines of a compiler output file are shown in Figure 6.

The DLX instruction file is then assembled using a modified version of the DLX simulator (dlxsim), producing an ASCII hex file of addresses and data. A portion of this file is shown in Figure 7. This file is then downloaded to the DLX board over the serial line; the boot ROM program reads the incoming characters and stores them in SRAM. Only the first two fields of each line are processed. When an address of 0xFFFFFFFF is read, the board stops the download and begins executing the program.
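The download procedure can be sketched in a few lines of Python. This is a host-side model of the behavior described above, not the boot ROM code itself, which is written in DLX instructions. The function name, the dictionary used to stand in for SRAM, the treatment of blank and comment-only lines, and the exact form of the terminating line are all assumptions made for illustration.

```python
END_OF_DOWNLOAD = 0xFFFFFFFF


def load_hex_lines(lines):
    """Return {address: word} from lines of the form '<addr> <data> ...'."""
    memory = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith(";"):
            continue  # assumption: blank and comment-only lines are skipped
        address = int(fields[0], 16)
        if address == END_OF_DOWNLOAD:
            break  # the board would stop the download and start executing
        memory[address] = int(fields[1], 16)  # later fields are ignored
    return memory


# Lines in the style of Figure 7; the terminator line is hypothetical.
sample = [
    "00000100 24000000 trap #0 ; This trap has no return",
    "00000104 24000008 trap #8",
    "ffffffff 00000000",
    "00000108 2be00000 jr r31",
]
image = load_hex_lines(sample)
print({hex(a): hex(w) for a, w in image.items()})
```

Only the address and data fields are used; the mnemonic and comment on each line are for the human reader and are discarded by the loader.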

In addition to board initialization code, the DLX boot ROM contains UNIX library functions. Current system calls include printf, putc, and getc, which perform simple output on the Macintosh. We use memory-mapped I/O, allocating a certain portion of the address space to the UART. Other functions include integer multiplication and division, which are performed in software. All functions and library routines are written with DLX instructions. Our experience indicates that having students write ROM code for library functions teaches them lessons about hardware/software tradeoffs more effectively than any classroom exercise.
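As an illustration of the hardware/software tradeoff involved, the sketch below shows one common way to multiply in software using only shifts, adds, and tests. We do not claim that this is the algorithm used in the Thayer DLX boot ROM; shift-and-add is simply a representative technique, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def multiply_shift_add(a, b):
    """Return the low 32 bits of a * b using only shifts, adds, and tests."""
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                       # add the multiplicand if this bit is set
            product = (product + a) & MASK32
        a = (a << 1) & MASK32           # next bit carries twice the weight
        b >>= 1
    return product


print(multiply_shift_add(6, 7))
print(hex(multiply_shift_add(-3 & MASK32, 5)))  # two's complement -15
```

Because only the low 32 bits are kept, the same routine gives the correct result for two's complement signed operands. The loop runs once per significant bit of the multiplier, which indicates why a software multiply costs many instructions.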

4.2 Errors Found

A computer system must be considered as a complete whole; the compiler, assembler, instruction set, hardware, and other components interact in subtle ways. The implementation and bringup of DLX require the ability to find and correct bugs virtually anywhere. As expected, the majority of errors occurred at subsystem interfaces; subsystem components, on the whole, worked correctly. Typical examples were 1) human errors in the conversion of the DLX schematic netlist to the PCB netlist, resulting in unrouted traces, and 2) the inability of certain instructions to access the UU properly, due to a simulation error at the interface between the FPGA and TTL parts of the design. Human error was, of course, also a factor. Seventeen connections had to be wire-wrapped manually; some of these were later found to be incorrect. Even the more mundane problems did not escape us; we had chips inserted incorrectly, poor solder connections, and improperly wired components due to misread documentation.

Of greater significance were errors identified in the DLX software and documentation in the course of debugging our board. We obtained our software via anonymous ftp in November 1990, and based our implementation of DLX on the 1990 edition of [1]. The distribution includes source code and examples.

It is remarkable that we found as few errors as we did, since to our knowledge DLX has never been completely implemented. Nonetheless, since these problems will be of interest to others working with DLX, we mention them below:

1) Unsigned set instructions. In addition to SETxx instructions, which set the condition code based on a signed comparison, the compiler generates unsigned SETxxU instructions. We discovered this only after our board was designed, and were surprised to find them as they were not mentioned in the documentation. We redesigned the DLX finite state machine to support these instructions without much effort, and in the process discovered a design error in the setting of the overflow condition code.

2) Compiler errors in shift expressions. Our version of the DLX compiler would not compile C shift expressions of the form "a << b", with a and b variables. We corrected this problem by modifying the machine description file.

3) Compiler errors in logical negation. The compiler would accept logical negation operations (e.g. ~a), but the resulting code would not be accepted by the simulator. Changing the machine description file also fixed this problem.

4) Formats of SLLI, SRLI, and SRAI instructions. The SLLI, SRLI, and SRAI instructions were encoded by the assembler as R-format instructions. We believed this to be incorrect, and recoded them as I-format instructions.

5) Address calculations of labels. Forward references in the assembler were not handled properly, resulting in incorrect address values for labels in assembly files.
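The difference between the signed and unsigned set instructions mentioned in item 1 is easy to overlook, so we illustrate it with a short Python model. The function names slt and sltu are chosen by analogy with the SETxx and SETxxU forms above and are assumptions; the model describes the comparison only, not our finite state machine.

```python
def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def slt(a, b):
    """Signed set-on-less-than: 1 if a < b as two's complement values."""
    return int(to_signed32(a) < to_signed32(b))


def sltu(a, b):
    """Unsigned set-on-less-than: 1 if a < b as unsigned 32-bit values."""
    return int((a & 0xFFFFFFFF) < (b & 0xFFFFFFFF))


# The same bit patterns compare differently under the two interpretations.
print(slt(0xFFFFFFFF, 1), sltu(0xFFFFFFFF, 1))
```

The pattern 0xFFFFFFFF is -1 when treated as signed and the largest 32-bit value when treated as unsigned, so a machine that implements only the signed comparison will produce wrong results for compiled code that uses unsigned variables.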

Other modifications were made to the software to suit our particular implementation of DLX. We cannot overemphasize the importance of source code access to our implementation efforts.

5.0 Performance

Our prototype is a slow machine, running at 2 MHz. This is slower than any of the academically developed prototypes described in [9]. We believe this to be due to two factors: 1) our willingness to choose less aggressive technology and trade off performance for increased probability of the production of a functioning prototype, and 2) the fact that DLX is the first major project completed at the RPF. We note that a later RPF project, the Gene Sequence Processor, runs at 10 MHz [8], although readers should use caution in comparing the two devices.

Additionally, DLX CPI figures are high for a streamlined instruction set. Of the 66 instructions we implemented, 32 require 5 cycles, 23 require 6, and 11 require 7, giving a static average CPI of 5.7. Many well-known techniques could be employed to reduce CPI, including pipelining, prefetching, and delayed memory accesses. Similarly, using different ICs (in particular, eliminating the FPGA) could yield a significantly faster clock.
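The static average quoted above can be reproduced directly from the cycle counts. The short Python calculation below uses only the figures given in the text; the variable names are our own. Note that a static average weights every instruction equally, so the CPI observed for a particular program will depend on its instruction mix.

```python
# Cycles per instruction -> number of implemented instructions needing them.
cycle_counts = {5: 32, 6: 23, 7: 11}

total_instructions = sum(cycle_counts.values())
total_cycles = sum(cycles * count for cycles, count in cycle_counts.items())
static_cpi = total_cycles / total_instructions

print(total_instructions, total_cycles, round(static_cpi, 1))  # prints: 66 375 5.7
```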

These figures represent a tradeoff between two design goals: making the system fast versus making the system quickly. Virtually all our design decisions, the 2-bus architecture, the use of an FPGA, the simple control strategy, and others, reflect a willingness to trade off performance for a working system.

It seems clear that if building working hardware means consistently losing performance, then much of the motivation for building hardware is lost. We believe, however, that technology advances the functionality/performance curve, just as it advances cost/performance. (For example, as FPGA technology progresses, more functionality can be included in designs with faster clocks). As the technology for rapid prototyping improves, and as our familiarity with DLX increases, we anticipate building faster and faster versions of both DLX and other digital systems in less and less time. Plans for the next iteration of DLX include pipelining and floating point processing.

6.0 Conclusions

With a few notable exceptions (see for example [10]), universities have been difficult places to build functioning hardware. The translation of ideas from simulations to working prototypes is believed to be unnecessary and/or insurmountably difficult.

Our work suggests that advances in rapid prototyping technology force a reevaluation of this position. The emergence of open architectures, field programmable gate arrays, and PCB prototypers suggests that working hardware can be developed where previously simulation was all that could be expected. This means that students learn more; they find the design experience more rewarding when they build something that works.

But in addition to pedagogical benefits, rapid prototyping has significant scientific advantages. Results obtained from working hardware are inherently more credible than those from simulation; simulation results are virtually impossible to reproduce reliably, while hardware results are much more likely to be confirmed elsewhere. Constructing hardware is in a very real sense an experiment that tests a hypothesis, permitting strong inferential techniques and reproducibility of results to be employed in a manner closer to that of the physical sciences.

We have offered evidence in support of these conclusions in this paper, discussing the development of a functioning DLX microprocessor in an academic laboratory. This effort makes extensive use of rapid prototyping technology, embodied in an on-site laboratory for digital system construction. The resulting project has benefited both our teaching and research efforts; we have learned a great deal about microsystem prototyping and the use of field programmable gate arrays.

Future work will include faster pipelined versions of DLX, the addition of floating point capability, and the prototyping of more advanced digital systems. We are currently prototyping a pipelined version of DLX, using a multilayer PCB design, and expect implementation results shortly. We hope others will join us in our efforts to move universities away from simulation and towards the increased production of working hardware. We believe the rewards will prove worth the effort.

7.0 Acknowledgements

The authors gratefully acknowledge the contributions of Professor Charles Hitchcock, Todd Thayer and Evan Gewirtz in bringing up Thayer DLX. The Thayer Rapid Prototyping Facility is supported by a variety of sources, including the Whitaker Foundation, Actel, Altera, Xilinx, Sun Microsystems, Viewlogic, National Semiconductor, and Direct Imaging Incorporated. Additional support was provided by the National Science Foundation, award #CDA-8921062.

8.0 References

[1] Patterson, David and Hennessy, John, "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers Inc., San Mateo, CA, 1990.

[2] Reese, Bob and Harden, Jim, "Efficient Use of a Behavioral Simulator in an Introductory Computer Architecture Course", Proceedings of the 4th Microelectronics System Education Conference and Exposition, San Jose, CA, 1991, pp 107-116.

[3] Siewiorek, Daniel et al., "The Use of Verilog in an Introductory Computer Architecture Course", Proceedings of the 3rd Microelectronics System Education Conference and Exposition, San Jose, CA, 1991, pp 139-148.

[4] Fagin, Barry and Hitchcock, Charles, "Rapid Prototyping Without MOSIS: A Minority View", Proceedings of the 2nd Annual VLSI Education Conference, San Jose, CA, 1991, pp 59-67.

[5] Erickson, Adam and Fagin, Barry, "Calculating the FHT in Hardware", IEEE Transactions on Signal Processing, June 1992, pp 1341-1353.

[6] Fagin, Barry, "The Effects of Field Programmable Gate Arrays on the Digital System Design Process", Technical Report, Thayer School of Engineering, Dartmouth College, Hanover NH 03755.

[7] Fagin, Barry, "Using Antifuse-Based FPGAs in Performance-Critical Digital Designs", Proceedings of the 4th Microelectronic Systems Education Conference and Exposition, San Jose, CA, 1991.

[8] Fagin, Barry, "FPGA Utility in Special and General Purpose Processors", special issue of the Journal of VLSI Signal Processing on Field Programmable Gate Arrays, to appear.

[9] Dollas, Apostolos and Chi, Vernon, "Rapid System Prototyping in Academic Laboratories of the 1990's", Proceedings of the 1st International Workshop on Rapid System Prototyping, Research Triangle Park, North Carolina, 1990, pp 38-45.

[10] Poulton, John, "Building Microelectronic Systems in a University Environment", Proceedings of Advanced Research in VLSI 1991, Santa Cruz, CA, pp 387-400.

9.0 Figures and Tables

Figure 1: Digital System Design at the Thayer RPF

Figure 2: DLX Instruction Formats

Figure 3: Integer DLX Datapath [1]

Figure 4: Thayer DLX Board

Figure 5: Thayer DLX Datapath

        .text
        .align 2
        .global _fib
_fib:
;; Save the old frame pointer
        sw -4(r14),r30
;; Save the return address
        sw -8(r14),r31
;; Establish new frame pointer
        add r30,r0,r14
;; Adjust Stack Pointer
        add r14,r14,#-16
;; Save Registers
        sw 0(r14),r3
        sw 4(r14),r4
        lw r4,0(r30)
        addi r3,r0,#2
        sgt r1,r4,r3
        bnez r1,L2
        ...

Figure 6: Compiler output

00000000 0bff7ffc   j 0xffff8000   ; trap #0 (warm start)
00000004 0bff7ffc   j 0xffff8004   ; trap #4 (mult and div)
00000008 0bff7ffc   j 0xffff8008   ; trap #8 (UART putc)
0000000C 0bff7ffc   j 0xffff800c   ; trap #12 (UART getc)
00000010 0bff7ffc   j 0xffff8010   ; trap #16 putch( char c)
00000014 0bff7ffc   j 0xffff8014   ; trap #20 getch( char c)
00000018 0bff7ffc   j 0xffff8018   ; trap #24 getcc( char c)
                                   ; (no wait getch)
0000001C 0bff7ffc   j 0xffff801c   ; trap #28 printf
00000020 0bff7ffc   j 0xffff8020   ; trap #32 sprintf
00000024 0bff7ffc   j 0xffff8024   ; trap #36 gets
00000100 24000000   trap #0        ; This trap has no return
00000104 24000008   trap #8
00000108 2be00000   jr r31
0000010C 2400000c   trap #12
...

Figure 7: Assembler output
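Every jump in the trap table of Figure 7 carries the same data word, 0bff7ffc, even though each has a different target. The Python sketch below shows why, assuming the J format of [1]: a 6-bit opcode followed by a 26-bit signed offset that is added to the address of the following instruction. The function name is hypothetical.

```python
def decode_jump(address, word):
    """Split a J-format word into its opcode and absolute target address."""
    opcode = (word >> 26) & 0x3F
    offset = word & 0x03FFFFFF
    if offset & 0x02000000:            # sign bit of the 26-bit offset
        offset -= 1 << 26
    target = (address + 4 + offset) & 0xFFFFFFFF
    return opcode, target


# The first and last jumps of the trap table in Figure 7.
for address in (0x00000000, 0x00000024):
    opcode, target = decode_jump(address, 0x0BFF7FFC)
    print(hex(address), opcode, hex(target))
```

Because each table entry and its target both advance by four bytes, the PC-relative offset is identical for every entry, so the assembled word repeats.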

Table 1: DLX Instructions [1]

LB,LBU,SB              load byte, load byte unsigned, store byte
LH,LHU,SH              for halfword
LW,SW                  for word
MOVI2S,MOVS2I          special purpose register access
ADD,ADDI,ADDU,ADDUI    signed and unsigned add, add immediate
SUB,SUBI,SUBU,SUBUI    for subtraction
MULT,MULTU,DIV,DIVU    signed and unsigned 32-bit multiply and divide
AND,ANDI               logical AND, AND immediate
OR,ORI                 for OR
XOR,XORI               for XOR
LHI                    load high immediate; loads upper half of register
SLL,SLLI               shift left logical, immediate
SRL,SRLI               shift right logical, immediate
SRA,SRAI               shift right arithmetic, immediate
Sxx,SxxI               set conditional, set conditional immediate;
                       xx indicates test: LT,GT,LE,GE,EQ,NE
BEQZ,BNEZ              branch if register equal/not equal to zero
J,JR                   jump (PC offset), jump (register target)
JAL,JALR               jump and link, PC relative or register target
TRAP                   OS call
RFE                    return from exception