A Special-Purpose Processor

for Gene Sequence Analysis

Barry Fagin*

(barry.fagin@dartmouth.edu)

J. GIll Watt**

Thayer School of Engineering

Robert Gross

(bob.gross@dartmouth.edu)

Department of Biology

Dartmouth College

Hanover, NH 03755

KEYWORDS: dynamic programming, sequence comparison, special purpose computer.

*To whom reprint requests should be sent

**Current address: Digital Equipment Corporation, Hudson MA.

ABSTRACT

Advances in computational biology have occurred primarily in the areas of software and algorithm development; new designs of hardware to support biological computing are extremely scarce. This is due, we believe, to the presence of a non-trivial knowledge gap between molecular biologists and computer designers. The existence of this gap is unfortunate, as it has long been known that for certain problems, special-purpose computers can achieve significant cost/performance gains over general purpose machines.

We describe one such computer here: a custom accelerator for gene sequence analysis. The accelerator implements a version of the Needleman-Wunsch algorithm for nucleotide sequence alignment. Sequence lengths are constrained only by available memory; the product of sequence lengths in the current implementation can be up to 2²². The machine is implemented as two NuBus boards connected to a Mac II f/x, using a mixture of TTL and FPGA technology clocked at 10 MHz. The boards are completely functional, and yield a 15-fold performance improvement over an unassisted host.

Introduction

Efforts to use novel architectures for computational biology are relatively scarce. One of the first attempts was made by Collins and Coulson (1984), who suggest using a special-purpose Single Instruction Multiple Data machine for sequence processing. Mukherjee (1989) discusses efficient hardware algorithms for finding the longest common subsequence of two strings, although no performance results are reported. Many supercomputers have received careful attention from researchers, including the Hitachi S810-20 (Goto and Tagashira, 1986), and the iPSC/1 and the CM-1 (Core et. al., 1989). Lopresti (1989) has discussed the performance of the Multiflow Trace, Convex C1, Cray-2, and CM-2 on gene sequence comparison problems.

One of the first successful efforts in the area of special purpose processing was the Splash highly parallel programmable logic array, developed by Lopresti and colleagues at the Supercomputing Research Center (see Gokhale 1991). This machine evolved out of the Princeton Nucleic Acid Comparator, or P-NAC, developed by Lopresti and Lipton (1987). Both of these machines use a highly parallel architecture to calculate multiple similarity values in parallel. Like the Gene Sequence Processor, Splash makes extensive use of field programmable gate arrays to permit rapid prototyping and flexible implementation. Both Splash and P-NAC were prototyped successfully. Neither machine, however, is alignment-preserving, That is, neither machine can produce the alignments corresponding to the similarity scores they generate (although Splash can in theory be modified to do so, as described by Hoang [1991]).

More recently, we have identified three projects concerned with special purpose hardware for sequence comparison: the Biological Information Signal Processor (Chow 1991), the BioSCAN system (White 1991), and the B-SYS programmable systolic array (Hughey 1991). Both the BISP and BioSCAN systems are full custom VLSI implementations of systolic arrays: BISP uses a custom chip for dynamic programming calculations, while BioSCAN implements linear similarity algorithms using lookup tables stored in an on-chip RAM. B-SYS also uses full custom ASICs, but is a general purpose programmable systolic array, and hence considerably more flexible in the algorithms it can support.

The principal difference between the GSP and these devices is its employment of user-obtainable, off-the-shelf parts. Our goal was to employ semi-custom ASICs in a special purpose processor to produce a machine that could be reproduced and upgraded without the expense of a full custom approach. The GSP also preserves alignments and calculates similarity scores. We believe both these features are essential components of a biological processor: similarity scores are necessary to rank results from library searches, while alignments assist in visualizing homologous regions and constructing evolutionary relationships between sequences.

System and Methods

The Gene Sequence Processor was designed at the Thayer Rapid Prototyping Facility (RPF), a laboratory for the rapid design and construction of digital systems. The RPF has been involved in several successful experiments in rapid digital system design. In addition to the Gene Sequence Processor, other projects include a completely functional 32-bit microprocessor, a Fast Hartley Transform engine, and a real-time data processor for rocket telemetry. For information on these and other projects, the reader is referred to (Erickson and Fagin 1992), (Fagin 1991) and (Fagin and Chintrakulchai 1992).

The GSP itself consists of two 11.7" x 4" 2-layer printed circuit boards, consisting of the following components:

5 Actel Field Programmable Gate Arrays

3 TI NuBus chips

8 1Mb DRAM SIMMs

4 AMD 22V10 PALs

23 miscellaneous TTL IC's

The total cost for these parts is approximately $750: $400 for the DRAM chips, $250 for the FPGA's, and $100 for the remaining IC's. Board fabrication costs are relatively modest, due to the use of an in-house printed circuit board prototyping system. We estimate these at around $100.

The GSP board is clocked at 10 MHz, and requires 6 clock ticks = 600 ns to calculate a dynamic programming array entry.

Algorithm

The algorithm implemented by our processor is the well-known dynamic programming algorithm for determining the similarity of two strings (see for example Sellers, 1974 and Kruskal, 1983). For readers unfamiliar with this technique, we provide an overview of it here.

The Gene Sequence Processor computes the similarity between two sequences A and B by computing an edit distance: the minimum number of insertions, deletions, and changes needed to transform sequence A into sequence B. Our system permits insertions, deletions, and changes to each have their own associated cost; thus the GSP calculates the least costly sequence of operations that transforms sequence A into sequence B.

Consider the minimum cost sequence of operations needed to transform a sequence A of length m into a sequence B of length n. Let A’ be the sequence A minus its last character, similarly for B’. A can be transformed into B in one of the following ways:

1) Delete the last character of A and transform A’ into B.

2) Transform A into B’ and add the last character of B.

3) Delete the last character of A, transform A’ into B’, and add the last character of B.

At least one of these methods will yield a transformation of A to B of minimum cost. But then each of the sequences of operations in the first part of the definitions above can be defined similarly, by letting A’’ and B’’ denote the sequences A’ and B’ minus their last characters and reapplying the above definitions. The process can be repeated until null sequences are reached, for which the associated costs are zero.

We may mathematically define this process in the following way. Let A be a sequence of length m, B a sequence of length n, and let [i] denote the i^th character of a sequence. Let the minimum distance between A and B be the total cost of the least costly series of insertions, deletions, and substitutions needed to transform A into B. Then D[m,n ], the minimum distance between A and B, is given by:

D[0,0] = 0 (I)

D[i,0] = D[i-1,0] + del (A[i]) 1 _ i _ m (II)

D[0,j] = D[0,j-1] + ins (B[j]) 1 _ j _ n (III)

D[i,j] = minimum{D[i-1,j] + del (A[i]), D[i,j-1] + ins (B[j]), D[i-1,j-1] + subs (A[i],B[j])}

1 _ i _ m, 1 _ j _ n (IV)

where the functions ins, del, and subs denote the cost of insertion, deletion, and substitution respectively. Lines II, III, and IV correspond to definitions 1, 2, and 3 above.

Let us number the matrix D with increasing subscripts from top to bottom and left to right. The task of computing the minimum distance between two sequences of lengths m and n requires the calculation of each entry in this array, in which the value of a cell is dependent on the values of its left, upper left, and upper neighbors. An example matrix and cost functions for m = 3, n = 5 is shown in Figure 1. The minimum distance to completely transform sequence A into sequence B, 8, appears in the lower right hand corner of the array; the cost functions used are shown at the top of the picture.

Notice that if information is kept with each cell recording which sum in (IV) is the minimum, then the entire sequence of operations corresponding to the minimum distance can be reconstructed. This gives an optimal alignment, a way of positioning the sequences relative to one another so that similarities can be easily seen. The example of Figure 1, modified with special information to reconstruct the alignment, is shown in Figure 2.

The special information is kept in the form of back pointers, numbers that indicate which entry in the array corresponds to the smallest neighbor. In general, all cells will have back pointers; for clarity only those pointers in the optimal alignment are shown. Notice that more than one optimal alignment may exist, corresponding to multiple paths of back pointers from the lower right corner to the upper left.

The processor we have constructed implements the calculations described by (I) through (IV), using back pointers to save the final alignment and return it to the host. We note that although (I) through (IV) are used to calculate global alignments, our processor can also be used to implement local alignment calculation. The initial sequences A and B are transferred from the host, along with the values for the ins, del, and subs functions. The dynamic programming calculation then begins, using special hardware to accelerate the comparison operation in (IV).

In addition to the operations specified in (I)-(IV), the GSP permits the incorporation of initial costs for gaps between sequences. Thus models with linear gap costs, which incorporate both the number of nucleotides in the gap and some initial constant penalty can be accommodated by the GSP.

We note that the global alignment calculated in Figures 1 and 2 may not in and of itself be biologically significant. Global alignments involving complete sequence conversion which may contain multiple insertions or deletions at the ends of a sequence, for example, clearly contain superfluous information. Significant alignment information, however, is contained in the dynamic programming matrix computed by the GSP, which can then be traversed by software to identify alignments of importance. For more information on alignment extraction from a dynamic programming matrix, the reader is referred to Gribskov and Devereux (1991).

Implementation

A block diagram of the Gene Sequence Processor is shown in Figure 3. The GSP consists of five basic subsystems: alignment memory, sequence memory, a counter/comparator, the interface with the host computer, and the Gene Arithmetic Logic Unit. It implements a dynamic programming algorithm to compute the minimum cost of insertions, deletions, and substitutions necessary to transform one nucleotide sequence into another, based on a set of cost functions supplied by the user. The total cost of this sequence of operations is the edit distance between A and B.

The GSP contains 8 megabytes of alignment memory, used to hold the dynamic programming array in Figure 1. This memory is organized as 2²² 16-bit words of dynamic random access memory, or DRAM, limiting the product of the lengths of the sequences to be compared to 2²². The format of a DRAM word is shown in Figure 4. Each byte is broken into 6 bits of data and a 2 bit field for the back pointer.

Back pointer information was made identical in each byte to simplify the design of the processor, leaving 12 bits of data in each array entry. Thus, the maximum edit distance the GSP can compute is 2¹² = 4096. Currently, cost function values are constrained to be integers. This has not proven to be a limiting factor in our experiments.

The sequences to be compared, cost function values, and sequence lengths are downloaded into sequence memory, organized as 4096 bytes of static random access memory, or SRAM. The counter/comparator is used to generate the necessary addresses for the dynamic programming calculation, and to halt the computation when the last entry in the dynamic programming array has been calculated. The processor interface is responsible for all interaction with the host computer.

A block diagram of the Gene Arithmetic Logic Unit, or ALU, is shown in Figure 5. The relevant cost coefficients and array values are added in parallel, and sent to a customized 3-way comparator that determines the minimum value. This value is then written back into the dynamic programming array; the letters “DP” and subscripts in Figure 5 refer to the relevant array entries.

The nucleotides as compared by the Gene ALU are encoded in the 4-bit representation shown in Table I. Each bit in the 4-bit representation corresponds to a specific nucleotide. This arrangement allows for the use of the standard IUPAC nucleotide code as shown in the table. A match occurs if any bits in a given position are '1' for both nucleotides. For example, the Gene ALU will detect a match between "C" and "Y", because both have a '1' in their second bit position.

Discussion

The performance of the GSP is shown in Figure 6, in which the sequence comparison time of the GSP is compared with that of the unaided host. Host execution times were measured running a C implementation of the dynamic programming algorithm. We deliberately made this program as fast as possible, to avoid biasing the results in favor of the GSP. Execution time is shown on the vertical axis, plotted against the number of nucleotides. For these measurements, both sequences are assumed to be of the same length.

Figure 6 also shows the execution time of the same C code on an IBM RS6000 workstation and a Sun Sparc. The results are real time measurements, running under UNIX with a single user. Both object files were produced with the same compiler. We see that the results for both workstations are comparable to the f/x.

Figure 6 shows an average performance improvement of about 15-fold. Note that despite the appearance of the graph, GSP performance scales quadratically, due to the quadratic nature of the algorithm. The resulting execution times are very close to 600(n+1)²/10⁹, since the processor runs at 10Mhz and requires 6 clock cycles to calculate a dynamic programming array entry.

For a performance comparison of the GSP with other machines, and a more detailed discussion of the architecture and technology, the reader is referred to Fagin and Watt (1992).

Biologists wishing to use the GSP need only be familiar with the look and feel of the Macintosh; no other special skills are necessary. The software interface to the GSP uses the standard Macintosh tools, and is quite flexible. One can use the GSP and existing software to view color comparisons of sequences, examine long alignments using scroll boxes, and compute alternative alignments with different cost functions. When constructing a special-purpose processor, the danger of the “architect's trap” is especially strong: one must be careful to not put so many features into hardware that the flexibility and cost-effectiveness of the system are compromised. By incorporating a considerable amount of system flexibility in the GSP user interface, we believe we have avoided this problem.

Examples of GSP/user interaction are shown in Figures 7 and 8. When the processor is connected to the host and powered on, the user double-clicks on a standard Macintosh software icon to obtain the dialog box in Figure 7. The user then specifies the names of two files on her disk, containing the two sequences to be compared. She can then specify insertion, deletion, and substitution costs, as well as insertion and deletion gap penalties; the values shown in the figures are merely defaults, and can be changed with the mouse and keyboard. Other features include advanced debugging output and a post-processing analysis of the complete dynamic programming array. We note that all of these tasks are accomplished using the standard Macintosh user interface and mouse.

Once the parameters of the run have been entered, the user clicks “Done”. Some timing information will be printed, followed by the display of the optimal alignment and corresponding edit distance as shown in Figure 8. Users can view large alignments in their entirety, scrolling to the left or right one character at a time (small arrows) or one window at a time (large arrows).

Conclusions and Future Work

The initial goal of this project was to construct an attached processor for gene sequence analysis that would offer an order of magnitude improvement in performance over personal computers at a minimal cost. Our efforts were successful: the Gene Sequence Processor has met or exceeded design specifications.

Nonetheless, much can be done to improve it. The main performance limitations on the GSP are the 10MHz clock and the DRAM interface. The present design requires one read and write per cell, each of which requires three clock cycles to implement. The 10MHz clock is taken directly from the NuBus; employing a design with a separate clock could in principle reduce the cell calculation time from 600ns to 320ns. The critical path through the system would then be through the FPGA's, which could be reduced to about 225ns through the use of a carry-lookahead scheme for the Gene ALU. Using a static RAM cache and DRAM page mode access could reduce memory access time even further, at the cost of increased system complexity.

Sequence lengths can be extended through the addition of more DRAM; up to 32Mb can be accommodated with minimal modifications to the existing design. Currently, only nucleotide sequences can be accommodated; future versions will support amino acid comparisons. We believe this can be accomplished by simply reprogramming the FPGA's, and are currently investigating this possibility.

Furthermore, the basic dynamic programming algorithm described here, while optimal, is known to be unnecessarily slow. More heuristically-oriented algorithms like those used in the FASTA (Wilbur and Lipman, 1983) and BLAST (Altschul et. al. 1990) software packages should be investigated. These algorithms are less regular than Needleman-Wunsch, and hence may pose more of a challenge to a custom processor.

It is also known that the full calculation of all cells in the dynamic programming array is inefficient (Ukkonen 1983). By restricting cell calculations to those with distances less than a predetermined upper limit, significant performance improvement can be obtained (Spouge 1991). We believe that this type of optimization can be accommodated by the GSP with minimal effort, requiring only modifications to the control logic to permit addresses to be skipped if a critical distance has been exceeded. We intend to support this feature in the next GSP prototype.

Finally, we believe a 15-fold improvement in performance, while acceptable for a preliminary prototype, is insufficient for a truly effective biological computer. Special purpose devices generally require two or three orders of magnitude of effectiveness to warrant their employment over general purpose devices, due to the natural advantages of products developed in an active commercial environment and the rapid rate of improvement of general purpose computing technology.

We hope, therefore, to improve both the performance and the software capabilities of the machine by at least another order of magnitude, to produce a truly significant research tool of general interest to molecular biologists. This will require the use of more aggressive technology in the machine itself, a more sophisticated user interface, and deeper study into the specific computational requirements of important biological problems.

It is our hope that by combining insights from computer engineering and molecular biology, we may be able to produce machines better suited to computational biology than those that currently exist. The ultimate goal of our research is the production of a machine that can reduce the time of significant tasks of computational biology, such as database searching and similarity detection, from minutes or hours to seconds. Such computational capability, we believe, will permit biologists to more rapidly and efficiently evaluate the implications of nucleotide and protein sequence data, and to generate new insights into the nature of genomic information. This will become possible when sequence comparisons can be performed in an interactive mode rather than in a batch mode. Work on this topic is currently in progress.

Acknowledgements

The authors gratefully acknowledge the assistance of the Whitaker Foundation, whose generosity made the construction of the GSP possible. Support for the Rapid Prototyping Facility was provided by a variety of sources, including Actel, Altera, Xilinx, Sun Microsystems, Viewlogic, National Semiconductor, and Direct Imaging Incorporated. Additional support was provided by the Thayer School of Engineering, and the National Science Foundation, award #CDA-8921062.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990). Basic Local Alignment Search Tool. J. Mol. Biol. 215: 403-410.

Chow, E. T. et. al. (1991) A Systolic Array Processor For Biological Information Signal Processing. Proceedings of the 1991 International Conference on Supercomputing, 1991, pp 216-223.

Collins, J.F. and Coulson, A.F.W. (1984) Applications of Parallel Processing Algorithms for DNA Sequence Analysis. Nucleic Acids Res., 12, 181-192.

Core, N. et. al. (1989) Supercomputers and Biological Sequence Comparison Algorithms. Computers and Biomedical Research, 22, 497-515.

Erickson, A. and Fagin, B. (1992) Calculating the FHT in Hardware, IEEE Transactions

on Signal Processing, June 1992 (in press).

Fagin, B. (1991) Using Antifuse-Based FPGAs in Performance-Critical Digital Designs, Proceedings of the 4th Microelectronics System Education Conference and Exposition, San Jose, CA, 225-234.

Fagin, B. and Watt, G. (1992), FPGA and Rapid Prototyping Technology Use in a Special Purpose Computer for Molecular Genetics, Proceedings of the 1992 International Conference on Computer Design, October 1992 (in press).

Fagin, B. and Chintrakulchai, P. (1992), Prototyping the DLX Microprocessor, Proceedings of the 1992 IEEE Workshop on Rapid System Prototyping, Research Triangle Park NC, (in press).

Gribskov, M. and Devereux, J. (1991) Sequence Analysis Primer, Stockton Press, Macmillan Inc, New York, NY.

Gokhale, M. et. al. (1991) Building and Using a Highly Parallel Programmable Logic Array. Computer, 24, 81-89.

Gotoh, O. and Tagashira, Y. (1986) Sequence Search on a Supercomputer. Nucleic Acids Res., 14, 57-64.

Hoang, D. (1991) Systolic Array for DNA Sequence Alignment. Brown University Department of Computer Science Technical Report, December 1991.

Hughey, R. (1991) Programmable Systolic Arrays. Ph. D. Dissertation, Department of Computer Science, Brown University, Providence RI 02912.

Karlin, S. and Altschul, S. (1990) Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes. Proc. Natl. Acad. Sci., 87, 2264-2268.

Kruskal, J. B. (1983) An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules. SIAM Review, 25, 201-237.

Lander, E. S. et. al. (1991) Computing in Molecular Biology: Mapping and Interpreting Biological Information. Computer, 24 #11, 6-13.

Lopresti, D. (1987) P-NAC: A Systolic Array For Comparing Nucleic Acid Sequences. Computer, 20 #7, 98-99.

Lopresti, D. (1989) Sequence Comparison on Commercial Supercomputers. Supercomputing Research Center Technical Report SRC-TR-89-010.

Mukherjee, A. (1989) Hardware Algorithms for Determining Similarity Between Two Strings. IEEE Trans. Comp., 38, 600-603.

Sellers, P. H. (1974) On the Theory and Computation of Evolutionary Distances.

SIAM J. Appl. Math., 26, 787-793.

Spouge, J.L. (1991). Fast optimal alignment. Comput. Appl. Biosci. 7:1-7.

Ukkonen, E. (1983). On approximate string matching. Proc. Int. Conf. Found. Comp. Theory, Lecture Notes in Comp. Sci. 158:487-496, Springer-Verlag, Berlin.

White, C. et. al. (1991). BioSCAN: A VLSI-Based System for Biosequence Analysis, Proceedings of the 1991 International Conference on Computer Design, Cambridge MA, 1991, pp 504-509.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proc. Natl. Acad. Sci., 80, 726-730.