Calculating the FHT in Hardware
Adam C. Erickson
Sequoia Systems Inc.
Marlboro, Massachusetts
{...seqp4!sequoia!adame}
Barry S. Fagin
Thayer School of Engineering
Dartmouth College
Hanover, NH 03755
(603) 646-3060
barry.fagin@dartmouth.edu
EDICS 4.11, 7.1.3, 7.2
ABSTRACT
We have developed a parallel, pipelined architecture for calculating the Fast Hartley Transform (FHT). Hardware implementation of the FHT introduces two challenges: retrograde indexing and data scaling. We propose a novel addressing scheme that permits fast computation of FHT butterflies, and describe a hardware implementation of conditional block floating-point scaling that reduces error due to data growth at little extra cost. Simulations reveal a processor capable of transforming a 1K-point sequence in 170 microseconds using a 15.4 MHz clock.
1.0 Introduction
The fast Hartley transform has attracted considerable research interest as an alternative to the FFT. A complete analogy exists between the two transforms. All the properties applicable to the FFT, such as the convolution and shifting theorems, apply to the FHT. In fact, the FHT is easily extracted from the real and imaginary components of the FFT (and vice versa). These properties have been well-documented in [Sore85], [Brac86], [Hou87], and [Onei88].
The computational advantage of the FHT lies in its use of real-valued data. As the vast majority of applications involve only real-valued data, significant performance gains are realized without much loss of generality. If one performs the complex FFT on an N-point real-valued sequence, the product is a complex sequence of 2N real points, N of which are redundant.^{[1]} The FHT, on the other hand, will yield a real sequence of the same length with no redundant terms, using only half the time and half the memory resources.
Real-valued FFT algorithms have been derived [Duha86, Sore87, Mou88] whose speed and memory requirements are comparable to those of the FHT, but they require the implementation of a different algorithm to compute the inverse real-valued FFT. This added complexity is undesirable; a single algorithm serving both the forward and inverse transforms is preferred.
The FHT is obtained from the Discrete Hartley Transform (DHT) pair of a real function x(t), shown below:
H(f) = Σ_{t=0}^{N−1} x(t) cas(2πtf/N),  f = 0, 1,..., N−1 (1)
x(t) = (1/N) Σ_{f=0}^{N−1} H(f) cas(2πtf/N),  t = 0, 1,..., N−1 (2)
where cas(x) = cos(x) + sin(x). Note that the inverse DHT differs from the forward DHT only in a scaling factor, lending itself well to hardware implementation.
Like the DFT, naive implementations of the above equations imply O(N^{2}) operations. Using techniques similar to the derivation of the FFT, the number of operations required to perform the DHT can be reduced to O(N log N)^{[2]}.
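For concreteness, the O(N^{2}) computation implied by the transform pair can be sketched directly; this Python fragment is illustrative only and is not part of the processor's tooling:

```python
import math

def dht_direct(x):
    """Direct O(N^2) DHT: H(f) = sum over n of x(n) * cas(2*pi*n*f/N)."""
    N = len(x)
    cas = lambda a: math.cos(a) + math.sin(a)
    return [sum(x[n] * cas(2 * math.pi * n * f / N) for n in range(N))
            for f in range(N)]

def idht_direct(H):
    """Inverse DHT: the same kernel, scaled by 1/N."""
    N = len(H)
    return [h / N for h in dht_direct(H)]
```

Because the forward and inverse transforms share the same kernel, a round trip through `dht_direct` and `idht_direct` recovers the input to within rounding error.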
Hardware implementations of the FFT and similar algorithms fall into two categories: programmable DSP processors, and dedicated FFT processors. Most of those in the first class [Marr86] are extremely efficient when compared to general-purpose von Neumann architectures, but suffer in performance when compared with dedicated processors.^{[3]}
Academic efforts in dedicated FFT processors include a wafer-scale VLSI multiprocessor array capable of computing the FFT, proposed in [Kore81]. [Choi88] describes an extension of [Kore81] with improved fault tolerance, although performance figures are not reported. The CUSP dedicated-FFT processor [Lind85] has a very simple host interface and an elegant solution to controlling the overflow associated with fixed-point arithmetic. But the decision to implement bit-serial multipliers results in a single-processor system that is easily outperformed by ours and others [Shen88, Burs88].
One of the most successful commercial efforts in this area is Honeywell’s HDSP66110/220 chipset [Shen88]. Capable of performing a 1024-point real-valued transform in 63 microseconds, this product combines a number of multiprocessing techniques to implement the fastest single-processor, radix-4 FFT engine known to the authors. Several HDSP66110/220’s can be cascaded to gain even further performance, with a corresponding increase in cost.
Hardware realizations of the FHT are scarce. We conjecture that this is due not only to its relatively new stature in the ranks of fast algorithms, but also to the peculiar quirks of its implementation.
Software implementations of the FHT on Intel’s iPSC hypercube [Lin88] and the Connection Machine [Land90] have demonstrated that architectures of this type do not yet possess the interprocessor communication speeds necessary to compute the FHT efficiently. Here, 1K-point transform execution times are measured on the order of seconds. Adding processors can actually degrade performance, as communication overhead between the processors begins to offset any realized performance gains. LeNgoc and Vo offer another description of the implementation and performance of the FHT in [LeNg89]. They publish estimated performance figures, of which only the software implementations on the 80x86 microprocessor family reflect actual measurements. Estimates on other architectures are obtained by halving published FFT calculation times.
A summary of performance benchmarks for a number of general-purpose and dedicated-FFT processors is given in a later section. In the next section, we discuss the two principal challenges to hardware implementation of the FHT: retrograde indexing and scaling.
2.0 Implementation Challenges
The two principal difficulties with implementing the FHT are retrograde indexing and data scaling. Retrograde indexing refers to the peculiar addressing pattern of the FHT, while scaling arises from our use of fixed-point arithmetic to calculate the FHT.
Addressing for the FHT exhibits a unique asymmetry. Figure 1 depicts the flow diagram for a decimation-in-time (DIT), radix-2, 16-point FHT [Kwon86]. Note the regular structure of the FHT: it is identical to a radix-2 FFT except for the ‘T’ blocks, which implement retrograde indexing. Visualizing larger transforms is straightforward: to construct a 32-point transform flow graph, we duplicate the 16-point transform diagram below for stages 1–4 and add a fifth stage with T5 in its lower half^{[4]}.
Figure 1 Butterfly Diagram for 16-point FHT (reproduced from [Kwon86])
For the first two stages, FHT butterflies are similar to FFT butterflies, involving only two terms and no multiplications. For stage 3 and beyond, however, an increasing number of butterflies involve nontrivial multiplications. As T3 and T4 show, these butterflies require three input points to produce two output points (heavy lines, stage 3), introducing computational asymmetry and an I/O ratio of 3:2. The I/O efficiency can be improved by noting that two of the three terms used by any one butterfly are shared by a second butterfly in the same iteration. If the two associated butterflies are processed simultaneously, only four terms (not six) are required (heavy lines, stage 4). Thus if we calculate two FHT butterflies at once, we use four input points to produce four output points, giving an I/O ratio of 1:1 [Sore85]. The concurrent calculation of two related butterflies will be referred to as a dual radix-2 butterfly. The equations of a dual radix-2 butterfly are shown below:
x_{i+1}(k+f) = x_{i}(k+f) + x_{i}(k+N_{s}/2+f)cos θ + x_{i}(k+N_{s}−f)sin θ (3)
x_{i+1}(k+N_{s}/2−f) = x_{i}(k+N_{s}/2−f) + x_{i}(k+N_{s}−f)cos θ − x_{i}(k+N_{s}/2+f)sin θ (4)
x_{i+1}(k+N_{s}/2+f) = x_{i}(k+N_{s}/2−f) − x_{i}(k+N_{s}−f)cos θ + x_{i}(k+N_{s}/2+f)sin θ (5)
x_{i+1}(k+N_{s}−f) = x_{i}(k+f) − x_{i}(k+N_{s}/2+f)cos θ − x_{i}(k+N_{s}−f)sin θ (6)
where θ = 2πf/N_{s}, i = 1,...,log_{2}N (stage counter), N_{s} = 2^{i} (group size), f = 0,...,N_{s}/4 − 1 (butterflies within each group), and k = 0, N_{s}, 2N_{s},..., N − N_{s} (base index for each group). Whenever f evaluates to zero, Equations (3)–(6) reduce to one addition or subtraction. This is the case for all of stages 1 and 2, half the butterflies in stage 3, and so on, decreasing to only one butterfly in the last stage. All the other butterflies involve general multiplication with the twiddle factors and two additions or subtractions. The next section discusses the consequences these operations can have on data integrity.
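A dual radix-2 butterfly can be sketched in Python using the simplified A0/A1/B0/B1 form that appears later as Equations (9)–(12); the routine below is an illustration, not the BCU's actual datapath:

```python
import math

def dual_radix2_butterfly(a0, a1, b0, b1, theta):
    """Four inputs -> four outputs with a single twiddle angle theta.
    The two products are shared between the paired butterflies."""
    t0 = b0 * math.cos(theta) + b1 * math.sin(theta)
    t1 = b1 * math.cos(theta) - b0 * math.sin(theta)
    return a0 + t0, a1 + t1, a1 - t1, a0 - t0
```

With theta = 0 (the f = 0 case) the products vanish and the butterfly reduces to additions and subtractions, as noted above.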
3.0 Data Scaling
Additions and subtractions using fixed-point arithmetic can result in data overflow, a situation which yields erroneous results and must be avoided during FHT calculation. In butterflies requiring three terms, two addition operations are involved. Studying Equations (3)–(6), we observe that the condition producing the largest data growth is when f/N_{s} = 1/8 and the data are equal and at their maximum. Normalizing the maximum to 1, the result of a worst-case computation is described by:
1 + cos(π/4) + sin(π/4) = 1 + √2 ≈ 2.4142
Because the result can grow by a factor of 2.4142, the data can experience up to ⌈log_{2} 2.4142⌉ = 2 bits of overflow. To prevent this, the data can be shifted right by two bits before performing the butterfly calculations. In addition to reducing the data’s dynamic range, this data scaling introduces truncation error if any of the discarded bits are nonzero. As the FHT calculations proceed from stage to stage and further data shifts are required, the error can propagate and grow enough to severely distort or completely obliterate the true output signal.
The technique used to scale the data can likewise degrade the signaltonoise ratio. The easiest solution is to unconditionally shift the data right two bits at each stage. This would eliminate the possibility of overflow, at the cost of unnecessary shifting of data and a large loss of accuracy. Clearly other solutions are preferable.
Our processor implements a scaling technique known as conditional “block floating-point” (BFP), whereby the data are shifted only when an overflow occurs [Welc69]. Instead of providing an exponent for each data word, as in floating-point, the entire sequence is associated with a single exponent. Should an overflow occur during an iteration, the entire array is shifted right by one bit and the exponent incremented.
How and when the sequence is shifted can affect system performance. [Welc69] suggests that the sequence be shifted immediately after an overflow. In this method, the sequence may be traversed twice during an iteration for the sole purpose of overflow adjustment. [Ande88] limits the size of the incoming data to two bits less than full precision. The data are then allowed to grow into these extra bits during an iteration, after which an explicit subroutine is called to shift the sequence by the maximum amount of overflow that occurred. These schemes are not optimal because a significant portion of the FHT calculation is dedicated to adjusting the sequence. Our implementation improves upon existing methods by shifting the data as they arrive in the computational element during the next iteration, thereby precluding any time penalties. We discuss this in detail in a later section.
The situation resulting in the maximum possible bit growth arises when the input data are equal and at full scale; that is, x(n) = [2^{B−1}−1, 2^{B−1}−1,..., 2^{B−1}−1] or [−2^{B−1}, −2^{B−1},..., −2^{B−1}], where B is the processor’s word length. The FHT of a constant input is an impulse whose amplitude is given by the following equation:
H(f) = N x(0) δ(f) (7)
where δ(f) denotes the unit impulse (δ(0) = 1, δ(f) = 0 otherwise).
This result is intuitive, as a constant time-domain signal concentrates all its power into the DC component of its transform. Equation 7 shows that the data can grow by a factor of N, or log_{2}N bits, so when x is at full scale, log_{2}N shifts are necessary to prevent overflow.
We note that the inverse transform would likewise produce log_{2}N bits of overflow given the same maximum DC input sequence. However, the inverse transform requires that we divide by N, equivalent to subtracting log_{2}N from the data’s common exponent. From equation 8, we see that the data, overall, does not experience any growth, although it experiences maximum data scaling. For a detailed analysis of this and other examples showing the effects of data scaling, the reader is referred to [ERIC90].
x(n) = (1/N) N (2^{B−1}−1) δ(n) = (2^{B−1}−1) δ(n) (8)
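Equation (7)'s growth behavior is easy to check numerically; the following Python fragment (illustrative only, using a small N) applies a direct DHT to a full-scale constant input:

```python
import math

def dht(x):
    """Direct DHT: H(f) = sum over n of x(n) * cas(2*pi*n*f/N)."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * f / N)
                        + math.sin(2 * math.pi * n * f / N))
                for n in range(N))
            for f in range(N)]

N, full_scale = 8, 2 ** 15 - 1   # 16-bit full-scale constant input
H = dht([full_scale] * N)
# All power lands in the DC bin: H[0] = N * full_scale, i.e. log2(N)
# bits of growth; every other bin is (numerically) zero.
```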
Knowing that a sequence can grow by one bit during each of the first two stages and by two bits in all subsequent stages, unconditional fixed-point scaling techniques would shift the data a total of 2log_{2}N − 2 bits, almost twice the maximum possible bit growth. For this reason, block-floating-point schemes can generally be expected to perform better than unconditional schemes. Detailed error analyses for the fixed-point scaling techniques discussed here can be found in [Welc69], [Rabi73], and [Oppe88]. A method of analysis specific to our implementation is presented in a later section.
4.0 Architecture of the FHT Processor
This section describes an architecture capable of computing the FHT very efficiently. For 1024point transforms, 98 percent of the execution time is spent performing FHT calculations. We first discuss the system architecture, and then present solutions to the aforementioned implementation challenges.
4.1 System Architecture
The processor architecture embodies a number of techniques designed to optimize system performance and functionality. The butterfly computational unit (BCU) utilizes a five-stage pipeline, keeping the critical path as short as possible. Additionally, the processor incorporates tightly coupled, high-speed buffer memories. Organized to hold up to 1024 16-bit words, each memory is arranged to allow simultaneous, single-cycle access to four 16-bit data points. Employing several of these memories allows for many such concurrent data transfers, boosting the data I/O rate significantly.
The processor can be programmed to transform sequence lengths of any power of two from 4 to 1024 points, thereby overcoming an inflexibility found in some dedicated hardware processors. Larger transforms require only a simple extension of the processor’s address generation unit and deeper memories.
A block diagram of the system is given in Figure 2. The arithmetic element accepts four data elements (and the two sine and cosine terms) and outputs four processed elements every clock cycle. Each of the four buffer memories shown contains two independently addressable arrays 32 bits wide by 256 words deep. Two of these memories alternate as source and destination as the FHT calculations proceed from stage to stage. The other two memories not involved in the transform can be used to load newly-sampled data and unload recently processed data asynchronously while transform calculations of a third sequence are in progress, accommodating high-speed, real-time FHT processing.
Interface to a host machine is accommodated by 3 control lines and a 4-bit word indicating transform size, which can easily be mapped into the host’s address space. To begin a transform, the host places the value log_{2}N on the data bus and asserts start. Calculations can be put on hold or halted altogether by asserting pause or reset, respectively. When a transform is complete, done is asserted. The host can then recover the processed data and their common exponent.
Figure 2 System Block Diagram
We have mentioned previously the two primary challenges to implementing the FHT: retrograde indexing and scaling. We now present our solutions to these problems.
4.2 Memory Organization
Because parallel access to all four elements of a dual radix-2 butterfly is desired, the processor employs an algorithm different from the one originally proposed in [Brac84]. Given a sequence of length 2^{v}, v ∈ {2, 3,..., 10}, the sequence is broken into four equal segments and arranged as depicted in Figure 3. Each column in the array is a separate memory bank: the first quarter of the sequence is loaded into the first bank, the second quarter into the second, and so on. Thus, a 16-point sequence would be stored as shown in Figure 3.
When calculations begin, the top data value from each of the memories can be fetched simultaneously (via memory address 0 being sent to all four memories), providing the BCU with the necessary data to calculate the first butterfly. Comparing Figures 1 and 3 shows that bit-reversal, an operation required before performing DIT butterfly calculations, is already partially accomplished (the destination addresses for the first stage will still need to be bit-reversed). A simple binary “butterfly” counter extending from 0 to N/4 − 1 can serve as the address to fetch the input data.
Figure 3 Arrangement of N-point Sequence into four separate memory banks
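The memory arrangement of Figure 3 can be modeled in a few lines of Python (an illustrative sketch; `load_banks` is a name of our own choosing):

```python
def load_banks(x):
    """Split an N-point sequence into four banks, one quarter per bank:
    bank j holds x[j*N/4 : (j+1)*N/4]."""
    N = len(x)
    q = N // 4
    return [x[j * q:(j + 1) * q] for j in range(4)]

x = list(range(16))
banks = load_banks(x)
# A single butterfly-counter address fetches one word from every bank:
first_fetch = [bank[0] for bank in banks]   # elements 0, N/4, N/2, 3N/4
```

Fetching address 0 from all four banks thus delivers x[0], x[4], x[8], and x[12] in a single cycle.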
Once fetched, the data enter the Butterfly Computational Unit (BCU), pictured in Figure 4. The dotted lines in the figure show where clocked buffers are placed in order to implement the five-stage pipeline. Arriving from memory, the data are first scaled according to the amount of overflow that occurred during the previous stage. Next, the data may need to be reordered, owing to the way they were written during the previous iteration. After reordering, the butterfly computations of Equations (3)–(6) are performed. More post-calculation reordering may be necessary before the results are finally written to the destination memory.
As Figure 4 shows, butterfly calculation involves multipliers, adders and subtracters. Disposing of most of the subscripts and letting m denote the current stage number, Equations (3)  (6) simplify to:
A0_{m+1} = A0_{m} + (B0_{m}cos θ + B1_{m}sin θ) (9)
A1_{m+1} = A1_{m} + (B1_{m}cos θ − B0_{m}sin θ) (10)
B0_{m+1} = A1_{m} − (B1_{m}cos θ − B0_{m}sin θ) (11)
B1_{m+1} = A0_{m} − (B0_{m}cos θ + B1_{m}sin θ) (12)
These equations are similar to those used to calculate a single radix-2 complex FFT butterfly.
Figure 4 BCU Block Diagram
Although a sequence is stored in four separate memory banks, writing the results to the destination memory does not require the generation of four separate addresses. The BCU result data are written to the destination memory in pairs, requiring only two addresses, A and B (see Figure 4). Data must be written in this way to ensure simultaneous access to four elements using a single source address during the next iteration.
4.3 Address Generation and Control
The correct computation of the FHT using the datapath of Figure 4 requires the specification of input and output reordering control signals. For some butterflies, the data emerging from the A and B sides of the pipeline must be swapped. This swapping must then be remembered by the processor during the next butterfly stage. Butterflies with 2 input points must also be handled differently.
Output and input reordering are indicated by the boolean variables o and i. These are used to reroute data coming from or going to the A and B memories, and are implemented as control signals in the processor. The i control line governs the reordering of data before a butterfly calculation: when i is active in a given stage, the data coming from memory A are swapped with those of memory B. When o is active, the processed data to be written to memory A are swapped with those of memory B. If o is active for a butterfly in any stage, i will be active for the corresponding butterfly in the next stage.
The c control line accommodates butterflies requiring 2 (instead of 4) input terms. Note that in the first two stages c is always high. The BCU computes the correct results, but they are not in the proper order for writing out to memory. When c becomes active, results from the first adder and second subtracter (in the second tier of ALUs) are written to destination memory A0 and A1, respectively, and results from the second adder and first subtracter are written to destination memory B0 and B1, respectively.
To analyze the operation of the control lines, it is convenient to number each butterfly with an address b, 0 ≤ b ≤ N/4 − 1, where N is the input size, and to label the stages of the FHT with a number s, 0 ≤ s ≤ log_{2}N − 1. We divide the set of butterflies into groups of size 2^{s−1}, numbered g = 0,..., N/2^{s+1} − 1 (for s = 0 the group size is defined to be 0). I, O, and C can then be specified as follows:
s = 0: I = 0, O = 1, C = 1
s = 1: I = 1, O = bit 0 of b, C = 1
s = 2,..., log_{2}N − 2: I = bit s−2 of b, O = bit s−1 of b, C = NOR(lower s−1 bits of b)
s = log_{2}N − 1: I = bit s−2 of b, O = bit s−1 of b, C = 1
Bits are numbered from 0, and are taken to be zero if the indicated position is greater than log_{2}N − 3. Output reordering is always required for the first stage, and thereafter for all the odd-numbered groups. Input reordering is not required for butterflies in stage 0, but is needed at stage s for all butterflies whose outputs were reordered in stage s − 1. The C signal, for two-input butterflies, is always active during the first and second stages. C is then active for the first butterfly in each group until the last stage, when it is always asserted.
The destination A and B addresses of a butterfly b in group g for stage s can be specified in the following way:
IF s = 0
    A := bitreversed(b), B := bitreversed(b)
ELSE IF g is even
    A := b
    IF NOR(lower s−1 bits of b)
        B := A + 2^{s−1}
    ELSE
        B := 2^{s}(g+1) − A
ELSE
    B := b − 2^{s−1}
    IF NOR(lower s−1 bits of b)
        A := B + 2^{s−1}
    ELSE
        A := 2^{s}g − B
All addresses are interpreted modulo N/4. For the first stage of the FHT, the A and B addresses are simply the bit-reversed equivalents of the butterfly addresses. Otherwise, for even groups, A is the butterfly address, and B is A plus the group size if the butterfly is the first in its group, or 2^{s}(g+1) − A otherwise. An odd group uses the addresses of the corresponding butterflies in the preceding even group with the A and B values switched.
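The control rules and destination-address rules can be modeled in a few lines of Python (a sketch checked against Table 1; we assume destination addresses wrap modulo N/4, which is what the last-stage entries of Table 1 imply):

```python
def controls(b, s, logN):
    """(I, O, C) for butterfly b in stage s, per the rules above."""
    bit = lambda x, i: (x >> i) & 1
    if s == 0:
        return 0, 1, 1
    if s == 1:
        return 1, bit(b, 0), 1
    I, O = bit(b, s - 2), bit(b, s - 1)
    C = 1 if s == logN - 1 else int(b & ((1 << (s - 1)) - 1) == 0)
    return I, O, C

def dest_addresses(b, s, logN):
    """Destination A and B addresses for butterfly b in stage s."""
    npairs = 1 << (logN - 2)            # N/4 address pairs per bank
    if s == 0:
        rev = int(format(b, '0{}b'.format(logN - 2))[::-1], 2)
        return rev, rev                 # bit-reversed butterfly address
    half = 1 << (s - 1)                 # group size 2^(s-1)
    g = b >> (s - 1)                    # group number
    first = b & (half - 1) == 0         # first butterfly in its group?
    if g % 2 == 0:
        A = b
        B = A + half if first else 2 * half * (g + 1) - A
    else:                               # odd group: partner's pair, swapped
        B = b - half
        A = B + half if first else 2 * half * g - B
    return A % npairs, B % npairs
```

For a 64-point transform (logN = 6), `dest_addresses(1, 0, 6)` gives (8, 8) and `dest_addresses(13, 3, 6)` gives (15, 9), matching Table 1.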
Our processor uses four memory banks, with two alternating as source and destination with each stage of the FHT. Alternatively, one may reduce system memory to two banks if special logic is added to stall the pipeline when the input and output stages attempt to access the same address. In a processor with k stages, this will occur when the destination address for butterfly b is equal to b + k. It is not difficult to show that for k odd, accesses for s > 0 are conflict-free.
Destination addresses and control-signal calculations for a 64-point FHT are shown in Table 1. This transform has 16 dual radix-2 butterflies per stage. '•' denotes an asserted signal, b a butterfly number, and g a group number. A and B indicate the A and B destination addresses, while I, O, and C denote the corresponding control signals.
   |      stage 0       |      stage 1       |      stage 2
 b |  g  I C O   A   B  |  g  I C O   A   B  |  g  I C O   A   B
---+--------------------+--------------------+--------------------
 0 |  -    • •   0   0  |  0  • •     0   1  |  0    •     0   2
 1 |  -    • •   8   8  |  1  • • •   1   0  |  0  •       1   3
 2 |  -    • •   4   4  |  2  • •     2   3  |  1    • •   2   0
 3 |  -    • •  12  12  |  3  • • •   3   2  |  1  •   •   3   1
 4 |  -    • •   2   2  |  4  • •     4   5  |  2    •     4   6
 5 |  -    • •  10  10  |  5  • • •   5   4  |  2  •       5   7
 6 |  -    • •   6   6  |  6  • •     6   7  |  3    • •   6   4
 7 |  -    • •  14  14  |  7  • • •   7   6  |  3  •   •   7   5
 8 |  -    • •   1   1  |  8  • •     8   9  |  4    •     8  10
 9 |  -    • •   9   9  |  9  • • •   9   8  |  4  •       9  11
10 |  -    • •   5   5  | 10  • •    10  11  |  5    • •  10   8
11 |  -    • •  13  13  | 11  • • •  11  10  |  5  •   •  11   9
12 |  -    • •   3   3  | 12  • •    12  13  |  6    •    12  14
13 |  -    • •  11  11  | 13  • • •  13  12  |  6  •      13  15
14 |  -    • •   7   7  | 14  • •    14  15  |  7    • •  14  12
15 |  -    • •  15  15  | 15  • • •  15  14  |  7  •   •  15  13
   |      stage 3       |      stage 4       |      stage 5
 b |  g  I C O   A   B  |  g  I C O   A   B  |  g  I C O   A   B
---+--------------------+--------------------+--------------------
 0 |  0    •     0   4  |  0    •     0   8  |  0    •     0   0
 1 |  0          1   7  |  0          1  15  |  0    •     1  15
 2 |  0  •       2   6  |  0          2  14  |  0    •     2  14
 3 |  0  •       3   5  |  0          3  13  |  0    •     3  13
 4 |  1    • •   4   0  |  0  •       4  12  |  0    •     4  12
 5 |  1      •   7   1  |  0  •       5  11  |  0    •     5  11
 6 |  1  •   •   6   2  |  0  •       6  10  |  0    •     6  10
 7 |  1  •   •   5   3  |  0  •       7   9  |  0    •     7   9
 8 |  2    •     8  12  |  1    • •   8   0  |  0  • •     8   8
 9 |  2          9  15  |  1      •  15   1  |  0  • •     9   7
10 |  2  •      10  14  |  1      •  14   2  |  0  • •    10   6
11 |  2  •      11  13  |  1      •  13   3  |  0  • •    11   5
12 |  3    • •  12   8  |  1  •   •  12   4  |  0  • •    12   4
13 |  3      •  15   9  |  1  •   •  11   5  |  0  • •    13   3
14 |  3  •   •  14  10  |  1  •   •  10   6  |  0  • •    14   2
15 |  3  •   •  13  11  |  1  •   •   9   7  |  0  • •    15   1
Table 1: Address Generation and Control for 64-point FHT
4.4 An Example
Table 2 shows the addressing, control, and data values generated during the computation of a length-16 FHT. Note that there are log_{2}N = 4 stages, each with N/4 = 4 butterflies. In this example, we assign data points x[i] = i, for i = 0,...,N−1. Three views of the data are given as they are processed within the BCU. The box labelled “Memory” represents the contents of the source memory at the start of each iteration (and what arrives at the inputs to the BCU). Additionally, the butterfly results at the outputs of the second tier of adders are given, followed by the data after output reordering. Symbols used are similar to those in Table 1.
Table 2 Addressing and DataReordering Control for 16point FHT
Figure 5 illustrates an example butterfly calculation. In the second butterfly of the second stage, the data [20, 8, 12, 8] are retrieved from the source memory. The i line is active for this butterfly, so the data are reordered as [12, 8, 20, 8]. Calculations on these data now proceed as described in Equations (9)–(12) to produce [32, 16, 0, −8]. If no output reordering were needed, the data [32, 16] would be written to memory A and [0, −8] to memory B. However, both c and o are active here. The c line causes the data to be reordered as [32, −8, 16, 0], and the o line, also being active, subsequently swaps the first two elements with the second two before releasing the data to memory. Thus, [32, −8] is written to location 0 of destination memory B, and [16, 0] is written to location 1 of destination memory A.
The three control lines, i, c, and o, are wired to the input and output multiplexers in the BCU. The i line is wired directly to the 2:1 muxes at the inputs of the BCU, and the c and o lines are used to generate the data-select lines for each of the four 3:1 output muxes.
A B
[ 20 8 ][ 12 8 ] Retrieve Data, addrA=addrB=1
[ 12 8 20 8 ] I Active: Swap A,B
[ 32 16 0 −8 ] Perform Calculations
[ 32 −8 16 0 ] C Active: A1=B1, B0=A1, B1=B0
[ 16 0 32 −8 ] O Active: Swap A with B
[ 16 0 ][ 32 −8 ] Write Data, addrA=1, addrB=0
Figure 5 Example Butterfly Illustrating the Operation of Address and Control Lines
We now turn our discussion to fixedpoint versus floatingpoint arithmetic, pointing out the consequences of choosing one over the other.
4.5 FixedPoint versus FloatingPoint
Implementing a 16-bit fixed-point processor has important advantages over floating-point implementations. Standard single-precision floating-point arithmetic operates on 32-bit data. This increased data width would have an enormous impact on the amount of hardware required to implement our processor: memory width would double, as would all hardware used to compute the dual radix-2 butterfly. Furthermore, to avoid any performance penalty, additional pipeline hardware would be required to offset the higher latency of floating-point operations.
Floating-point hardware itself is relatively expensive. Not only is floating-point hardware functionally more complex than integer hardware, but, because of the larger data paths, the hardware must be housed in costly pin-grid-array packages. For these reasons, we felt that a fixed-point processor was better suited for our purposes.
In addition to aliasing, quantization, leakage, and other sources of error inherent to discrete transform calculations, the fixed-point data will experience both roundoff and truncation error. Roundoff error is introduced at the multipliers, where only the 16 most-significant bits of the 32-bit product are used. In [Zakh87], a theoretical analysis of fixed-point and floating-point roundoff error is developed. The theoretical results are supported by subsequent experiments on white-noise input sequences. Given a zero-mean, white-noise, fixed-point input sequence, Zakhor found that the noise-to-signal ratio increases by about 1.1 bits per stage. This relatively high error can be partially attributed to her limiting the dynamic range of the input sequence in order to insure against overflow.
Truncation error occurs after an adder overflow, where the least-significant bit is shifted out.^{[5]} We attempt to minimize these effects by implementing block floating-point scaling.
Among the components needed to implement BFP scaling are overflow-detection logic, a maximum-overflow register (maxovl) reflecting the largest growth (0, 1, or 2 bits) of data that occurred during the current stage, a scale register (scale), reflecting the largest growth of data that occurred during the previous stage, and the common exponent register (expo). Figure 6 illustrates their interconnection.
Figure 6 Scale Control Block Diagram
The input data are constrained to lie in the interval [−2^{14}, 2^{14} − 1], leaving an extra bit available for the possible overflow in the first stage. The overflow occurring during each butterfly calculation is compared to maxovl, which is initialized to zero at the beginning of each stage. If maxovl is exceeded, it is updated with the new maximum. At the end of the first stage, maxovl is stored into scale, added to expo, and then reset for the next stage.
As the data arrive during the second stage, they are scaled down by the value in scale before being processed. This will give back the extra bit needed to guard against a secondstage overflow.
Data during the third and later stages can experience two bits of overflow. Preparations for this occur during the second stage, where the overflow incurred during each butterfly calculation plus one is compared to maxovl. Thus, if no overflow occurs during the second stage, maxovl = 1. If an overflow does occur, maxovl = 2. In effect, an extra, unconditional shift is made to ensure that two guard bits are present for the third stage. A special case arises when none of the data occupies more than 14 bits of precision after the second stage. In this case, the two required guard bits are already present, and the unconditional shift does not take place.
Scale control at the end of the third and subsequent stages ensures that the data starts with two guard bits. This proceeds with no special cases until the last stage, where the data need not be adjusted after an overflow. Here, the data are allowed to grow to the full 16 bits of precision and are written to the output buffer memory unmodified.^{[6]} The exponent register will reflect the total amount of shifts the data underwent.
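The scale-control bookkeeping can be emulated in software. In the sketch below the "butterfly" is only a pairwise add/subtract standing in for the dual radix-2 computation, so only the scale/maxovl mechanism, not the arithmetic, mirrors the hardware:

```python
def bfp_stage(data, scale, width=16):
    """One stage of conditional block-floating-point bookkeeping:
    scale inputs by the previous stage's growth, compute, and record
    the largest overflow (in bits) seen this stage."""
    limit = 1 << (width - 1)                 # signed full scale
    out, maxovl = [], 0
    for a, b in zip(data[0::2], data[1::2]):
        a, b = a >> scale, b >> scale        # apply previous stage's growth
        s, d = a + b, a - b                  # stand-in butterfly
        ovl = max((abs(v) // limit).bit_length() for v in (s, d))
        maxovl = max(maxovl, ovl)
        out.extend([s, d])
    return out, maxovl                       # maxovl -> next stage's scale
```

Across a whole transform, each stage's `maxovl` becomes the next stage's `scale` and is accumulated into the common exponent, so the sequence is never traversed solely to adjust for overflow.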
Everything said about the forward FHT applies equally as well to the inverse FHT; the only difference is that the exponent will be log_{2}N greater than the true value. A simple subtraction takes care of this.
Statistical models have been formulated which place an upper bound on the output noise-to-signal ratio for FFTs employing unconditional BFP scaling [Welc69], [Oppe72]. Both authors assume that one bit of overflow occurs during each stage, and both obtain the same result: that the output noise-to-signal ratio increases by half a bit per stage. The output noise-to-signal ratio for conditional BFP scaling depends strongly on how many overflows occur, and at what points during the FFT calculation they occur. As the positions and timing of the overflows are correlated with the input signal, one needs to know the signal statistics in order to analyze the output noise-to-signal ratio. Instead of analyzing the problem theoretically, both authors perform experiments using white-noise inputs and compare the results to the upper bound obtained for unconditional BFP scaling. As expected, the experiments show that conditional BFP performs somewhat better than unconditional BFP, as fewer shifts of the data occur, and hence less error propagates to the outputs.
We outline here a procedure similar to that used in [Welc69] and [Oppe71] for empirically determining the output signal-to-noise ratio for an FHT using conditional BFP scaling. We begin with an input sequence, x(n), and compute its FHT using two programs. The first program uses double-precision floating-point in its FHT computations, which we take as exact. The second program emulates our FHT processor's scale-control hardware. Using fixed-point arithmetic, it will produce an output sequence equal to that of the first program plus an error sequence. The error sequence can be extracted by subtracting the outputs of the two programs, as shown below:
Figure 7 Procedure for BFP Scaling Error Analysis
The signal-to-noise ratio is determined by taking the ratio of the root-mean-square value of the floating-point sequence to that of the error sequence, where the RMS values are given by Equations (13) and (14). Note that each sequence's mean value must be subtracted from its terms before they are squared.
RMS_{z} = [(1/(N-1)) Σ_{n=0}^{N-1} (z(n) - avg(z))^{2}]^{1/2}    (13)

RMS_{e} = [(1/(N-1)) Σ_{n=0}^{N-1} (e(n) - avg(e))^{2}]^{1/2}    (14)
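In code, the mean-removed RMS ratio of Equations (13) and (14) amounts to a few lines; this is a minimal sketch, with the function name our own.

```python
import math

def snr_db(z, e):
    """SNR per Equations (13)/(14): RMS of the mean-removed exact sequence
    over RMS of the mean-removed error sequence, expressed in dB."""
    N = len(z)
    zbar = sum(z) / N
    ebar = sum(e) / N
    rms_z = math.sqrt(sum((v - zbar) ** 2 for v in z) / (N - 1))
    rms_e = math.sqrt(sum((v - ebar) ** 2 for v in e) / (N - 1))
    return 20 * math.log10(rms_z / rms_e)

# With the appendix sample variances s2_z = 8.3118E+08 and s2_e = 65.506,
# the same ratio, 10*log10(s2_z/s2_e), comes out near 71 dB.
ratio_db = 10 * math.log10(8.3118e8 / 65.506)
```

Since the squared RMS values are just the sample variances, the ratio can equally be computed from the s^{2} row of Table A.2.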
The results can be used to predict the SNR for several classes of input sequences and can be compared to theoretical and experimental results for unconditional fixed-point implementations in [Oppe89] and [Welc69]. An example calculation of the SNR for a maximum-amplitude, truncated cosine wave is provided in the Appendix.
5.0 Simulation of the Architecture and Performance Estimates
The two primary tools used to develop the FHT processor were the Workview^{®} design system, for schematic capture and simulation, and the XACT CAD package, for gate array layout. Workview is an integrated schematic capture and simulation package; our version ran on a Sun 3/60 platform. XACT is a field-programmable gate array die editor which gives the designer complete control over logic placement and routing resources. We have shown in [FAGI90] that user control at this level was essential to meeting our design spec of a 65 ns clock period, corresponding to a 15.4 MHz system clock. For a more thorough description of the processor hardware, the reader is referred to [FAGI90] and [ERIC90].
Simulation has verified that we meet our system spec of a 65 ns clock period. We may therefore calculate the performance of the FHT processor as follows. There are log_{2}(N) stages in an N-point transform and N/4 dual radix-2 butterflies in each stage. Given a throughput rate of one dual radix-2 butterfly per clock cycle, the total number of cycles required to transform an N-point sequence is (N/4 + 5)·log_{2}(N) + 3: the pipeline adds five cycles for each stage in the calculation, and three additional cycles are consumed to initiate the transform. With the 15.4 MHz system clock, a 1024-point transform can be performed in 170 µs, allowing one to perform real-time signal processing using a 6.04 MHz input sampling rate. Table 3 lists the various transform sizes and their execution times as given by the above equation.
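The cycle-count relation above is easy to check numerically. The function below is a sketch, with the clock constant taken from the simulation results:

```python
import math

CLOCK_HZ = 15.4e6  # system clock established by simulation

def fht_time_us(n):
    """Execution time for an n-point transform, in microseconds.

    Cycles = (n/4 + 5) * log2(n) + 3: one dual radix-2 butterfly per cycle,
    five pipeline-fill cycles per stage, three start-up cycles.
    """
    cycles = (n // 4 + 5) * int(math.log2(n)) + 3
    return cycles / CLOCK_HZ * 1e6

# 1024 points: (256 + 5)*10 + 3 = 2613 cycles, about 170 µs at 15.4 MHz,
# i.e. a real-time input sampling rate near 1024 / 170 µs ≈ 6.04 MHz.
```

Evaluating the function at each power of two reproduces the entries of Table 3.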
Transform Size    Time to Complete
       4               0.97 µs
       8               1.56 µs
      16               2.53 µs
      32               4.42 µs
      64               8.38 µs
     128              17.01 µs
     256              36.04 µs
     512              77.92 µs
    1024             169.68 µs
Table 3 Performance Benchmarks for FHT Processor
Table 4 compares our 1K-point benchmark with other implementations [LeNg89]. Most of the times given are one-half the time to calculate a 1K-point complex FFT, which is an accurate approximation to the time required to compute a 1K-point real transform. We limit our comparisons to DSP processors and chip sets, systems of comparable complexity to the FHT processor. More complex systems (for example [Swar84] and [TMC89]) may exhibit improved performance, but at considerably increased cost.
Processor                          Time to Calc. 1K-pt Transform    System Clock
Honeywell HDSP66110/66210                    63 µs                    25 MHz
FHT Processor                               170 µs                    15.4 MHz
TRW TMC2310                                 514 µs                    20 MHz
CUSP                                        665 µs                    50 MHz
Plessey PDSP1600 Family                    1000 µs                    40 MHz
Texas Instruments TMS320C30                1880 µs                    16 MHz
Motorola DSP56000                          2500 µs                    10 MHz
Philips/Signetics DSP1                     3430 µs                     8 MHz
Analog Devices ADSP-2100                   3800 µs                     8 MHz
NEC µPD77230                               5400 µs                     6.7 MHz
Texas Instruments TMS320C25                6600 µs                    10 MHz
National Semiconductor LM32900             6700 µs                    10 MHz
Table 4 Performance Comparison of Several DSP & Transform Processors
We caution against an unduly enthusiastic interpretation of Table 4. The reported time for the FHT processor is based on simulation, while the other results are from functioning chips. Additionally, the FHT processor is a special-purpose device, and should be expected to perform well on the chosen application when compared with more general-purpose devices. Nonetheless, even when these factors are taken into account, we believe the performance difference to be significant.
6.0 Conclusions and Future Work
We have described a high-performance FHT architecture, discussing our solutions to the principal computational challenges of the FHT. Simulations reveal a processor capable of computing a real-valued, 1K-point transform in 170 µs. The processor is among the fastest transform processors to date and is the only dedicated FHT hardware processor known to the authors.
The architecture has been fully simulated and has thus far met system specifications. The next steps are PCB fabrication and prototyping. While problems inevitably arise when constructing an experimental system, our previous experience constructing digital systems under similar conditions has been quite successful [KUSK89]; we anticipate comparable results here. We note that the final design need only perform within 300% of specifications to offer superior performance to its closest rival.
Additional work should include the development of a host interface to a standard bus, with supporting software. We recommend the NuBus, Multibus, or VMEbus, as these are well-supported industry standards. If support for loading sequences directly from A/D converters is added as well, the FHT processor could become an extremely powerful signal processing tool.
Our processor can easily be extended to support a number of DSP-related algorithms other than the FHT. General multiplication support for windowing and convolution applications requires little modification, as the multiplier hardware is already present. Multiply-accumulate operations for FIR filtering and other iterative DSP applications could likewise be supported with little modification. In fact, it would not be too difficult to generalize the arithmetic element to accommodate the FFT and other transforms without loss of performance, as the BCU structures of the two algorithms are very similar. Given the flexibility reprogrammable gate arrays provide, the control unit could be reprogrammed dynamically to implement the FFT, FHT, or any other DSP-related algorithm without additional hardware. We propose this as a topic for future investigation.
7.0 Acknowledgements
The FHT processor was designed at the Thayer Rapid Prototyping Facility, a laboratory for the rapid production of high-performance digital systems. Parts were ordered through the FAST parts broker, funded by the Department of Defense Advanced Research Projects Agency. We gratefully acknowledge Viewlogic Inc. for their provision and support of the Workview^{®} CAD package, and the Whitaker Foundation for their support of the Thayer RPF. Finally, the authors would like to thank Xilinx Incorporated for their support in the form of software, equipment, and documentation.
8.0 Bibliography
[ANDE88] Analog Devices, ADSP-2100 Applications Handbook, Volume I, and ADSP-2100 User's Manual, 1988, Analog Devices, Signal Processing Division, One Technology Way, P.O. Box 9106, Norwood, Mass. 02062.
[Blah85] Blahut, R., Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985.
[Brac84] Bracewell, R. N., The Fast Hartley Transform, Proc. IEEE, Vol. 72, No. 8, August 1984.
[Brac86] Bracewell, R. N., The Hartley Transform, Oxford University Press, 1986.
[Brig88] Brigham, E. Oran, The Fast Fourier Transform and its Applications, Englewood Cliffs, New Jersey, Prentice Hall, Inc., 1988.
[Burs88] Bursky, D., One-Chip FFT Processor Rips Through Signal Data, Electronic Design, September 8, 1988, pp. 127-129.
[Choi88] Choi, Y. & Malek, M., A Fault-Tolerant FFT Processor, IEEE Transactions on Computers, Vol. 37, No. 5, pp. 617-621, May 1988.
[Duha86] Duhamel, P., Implementation of "Split-Radix" FFT Algorithms for Complex, Real, and Real-Symmetric Data, IEEE Transactions on ASSP, Vol. 34, No. 2, April 1986.
[ERIC90] Erickson, A., A High Performance Processor to Compute the Fast Hartley Transform Using Field-Programmable Gate Arrays, Master's Thesis, Thayer School of Engineering, Dartmouth College, March 1990.
[FAGI90] Fagin, B., Using Reprogrammable Gate Arrays in Performance Critical Designs, 3rd Microelectronics Conference, Santa Clara, CA, 1990.
[Gane82] Ganesan, K., Govardhanarajan, T. S., Dhurkadas, A. & Veerabhadraiah, S., Hardware Realization of a General Purpose FFT Processor in a Distributed Processing Configuration, Defense Science Journal, Vol. 32, No. 1, January 1982, pp. 41-46.
[Gold73] Gold, B. & Bially, T., Parallelism in Fast Fourier Transform Hardware, IEEE Transactions on Audio and Electroacoustics, Vol. 21, No. 1, 1973.
[Hou87] Hou, Hsieh S., The Fast Hartley Transform Algorithm, IEEE Transactions on Computers, Vol. 36, No. 2, February 1987.
[Kaba86] Kabal, P. & Sayar, B., Performance of Fixed-Point FFT's: Rounding and Scaling Considerations, IEEE ICASSP 1986.
[Kane70] Kaneko, T. & Liu, B., Accumulation of Round-Off Error in Fast Fourier Transforms, J. Assoc. Computing Machinery, Vol. 17, No. 4, pp. 637-654, 1970.
[Kore81] Koren, I., A Reconfigurable and Fault-Tolerant VLSI Multiprocessor Array, Proceedings of the 8th International Symposium on Computer Architecture, Minneapolis, Minn., 1981.
[Kuma86] Kumaresan, R. & Gupta, P. K., Vector Radix Algorithm for a 2-D Discrete Hartley Transform, Proc. IEEE, Vol. 74, No. 5, May 1986.
[KUSK89] Kuskin, J., A Hardware Monitor for the MC68000, B.E. Thesis, Thayer School of Engineering, Dartmouth College, December 1989.
[Kwon86] Kwong, C. P. & Shiu, K. P., Structured Fast Hartley Transform Algorithms, IEEE Trans. ASSP, Vol. 34, No. 4, August 1986.
[Lamb86] Lamb, K., CMOS DSP building blocks adjust precision dynamically, Electronic Design, October 30, 1986, pp. 96-103.
[Land89] Landaeta, Dave & Menezes, Melwin, The Fast Hartley Transform on the Connection Machine, Thayer School of Engineering, Dartmouth College.
[LeNg89] Le-Ngoc, Tho & Vo, Minh Tue, Implementation and Performance of the Fast Hartley Transform, IEEE Micro, Vol. 9, No. 5, October 1989, pp. 20-27.
[Lin88] Lin, X., Chan, T. F. & Karplus, W. J., The Fast Hartley Transform on the Hypercube Multiprocessors, University of California, Los Angeles, No. CSD-880011, February 1988.
[Lind85] Linderman, Richard W., Chau, Paul M., Ku, Walter H. & Reusens, Peter P., CUSP: A 2-µm CMOS Digital Signal Processor, IEEE Journal of Solid-State Circuits, Vol. 20, No. 3, June 1985.
[Marr86] Marrin, Ken, Six DSP processors tackle high-end signal processing applications, Computer Design, March 1, 1986, pp. 21-25.
[Mou88] Mou, Z. J. & Duhamel, P., In-Place Butterfly-Style FFT of 2-D Real Sequences, IEEE Trans. ASSP, Vol. 36, No. 10, October 1988, pp. 1642-1650.
[Onei88] O'Neill, Mark A., Faster Than Fast Fourier, Byte, April 1988, pp. 293-300.
[Oppe72] Oppenheim, A. V. & Weinstein, C. J., Effects of Finite Register Length in Digital Filtering and the Fast Fourier Transform, Proc. IEEE, Vol. 60, No. 8, August 1972, pp. 957-976.
[Oppe89] Oppenheim, Alan V. & Schafer, Ronald W., Discrete-Time Signal Processing, Englewood Cliffs, New Jersey, Prentice Hall, Inc., 1989.
[Rabi75] Rabiner, Lawrence R. & Gold, Bernard, Theory and Application of Digital Signal Processing, Englewood Cliffs, New Jersey, Prentice Hall, Inc., 1975.
[Schw86] Schwartz, M., Schlappacasse, J. & Baskerville, G., Signal processor's multiple memory buses shuttle data swiftly, Electronic Design, February 20, 1986, pp. 147-153.
[Shen88] Shen, S., Magar, S., Aguilar, R., Luikuo, G., Fleming, M., Rishavy, K., Murphy, K. & Furman, C., A High Performance CMOS Chipset for FFT Processors, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1988, pp. 578-581.
[Sore85] Sorensen, H. V., Jones, D. L., Burrus, C. S. & Heideman, M. T., On Computing the Discrete Hartley Transform, IEEE Trans. ASSP, Vol. 33, No. 4, October 1985.
[Sore87] Sorensen, H. V., Jones, D. L., Heideman, M. T. & Burrus, C. S., Real-Valued Fast Fourier Transform Algorithms, IEEE Trans. ASSP, Vol. 35, No. 6, June 1987.
[Swar84] Swartzlander, E. E. Jr., Young, W. K. W. & Joseph, S. J., A Radix-4 Commutator for Fast Fourier Transform Processor Implementation, IEEE Journal of Solid-State Circuits, Vol. 19, No. 5, October 1984.
[TMC89] Thinking Machines Corporation, Connection Machine Technical Summary, Cambridge, Massachusetts, 1989.
[Welc69] Welch, Peter D., A Fixed-Point Fast Fourier Transform Error Analysis, IEEE Transactions on Audio and Electroacoustics, Vol. 17, No. 2, pp. 151-157, 1969.
[Welt85] Welten, F. P. J. M., Delaruelle, A., Van Wyk, F. J., Van Meerbergen, J. L., Schmid, J., Rinner, K., Van Eerdewijk, K. J. E. & Wittek, J. H., A 2-µm CMOS 10-MHz Microprogrammable Signal Processing Core With an On-Chip Multiport Memory Bank, IEEE Journal of Solid-State Circuits, Vol. 20, No. 3, June 1985.
[Zakh87] Zakhor, A. & Oppenheim, A. V., Quantization Errors in the Computation of the Discrete Hartley Transform, IEEE Trans. ASSP, Vol. 35, No. 11, November 1987.
Appendix
The following pages give simulated error analysis results for a 32-point, maximum-amplitude truncated cosine wave. We employ two programs: Program A, a double-precision FHT algorithm, and Program B, the FHT processor hardware emulator. Table A.1 shows the output of Program A for a 32-point, truncated cosine wave input sequence. Figure A.1 graphs the final sequence (last column).
Bit Reversal   Stage 1      Stage 2      Stage 3      Stage 4      Final Sequence
16383.0000  16383.0000  16383.0000  32766.0000      0.0000       0.0000
    0.0000  16383.0000  16383.0000  39552.0608   3258.7813    3258.7813
    0.0000      0.0000  16383.0000  32766.0000   9596.9392    9596.9392
    0.0000      0.0000  16383.0000  16383.0000  16383.0000   16383.0000
16383.0000  16383.0000  16383.0000      0.0000      0.0000       0.0000
    0.0000  16383.0000  16383.0000   6786.0608  24518.8922   24518.8922
    0.0000      0.0000  16383.0000      0.0000  23169.0608   23169.0608
    0.0000      0.0000  16383.0000  16383.0000  16383.0000   16383.0000
16383.0000  16383.0000  16383.0000  32766.0000  65532.0000   65532.0000
    0.0000  16383.0000  16383.0000  39552.0608  82362.9029   82362.9029
    0.0000      0.0000  16383.0000  32766.0000  55935.0608   55935.0608
    0.0000      0.0000  16383.0000  16383.0000  16383.0000   16383.0000
16383.0000  16383.0000  16383.0000      0.0000      0.0000       0.0000
    0.0000  16383.0000  16383.0000   6786.0608  10946.7706   10946.7706
    0.0000      0.0000  16383.0000      0.0000  23169.0608   23169.0608
    0.0000      0.0000  16383.0000  16383.0000  16383.0000   16383.0000
    0.0000      0.0000      0.0000      0.0000      0.0000       0.0000
    0.0000      0.0000      0.0000      0.0000      0.0000    3258.7813
    0.0000      0.0000      0.0000      0.0000      0.0000    9596.9392
    0.0000      0.0000      0.0000      0.0000      0.0000   16383.0000
    0.0000      0.0000      0.0000      0.0000      0.0000       0.0000
    0.0000      0.0000      0.0000      0.0000      0.0000   24518.8922
    0.0000      0.0000      0.0000      0.0000      0.0000   23169.0608
    0.0000      0.0000      0.0000      0.0000      0.0000   16383.0000
    0.0000      0.0000      0.0000      0.0000      0.0000   65532.0000
    0.0000      0.0000      0.0000      0.0000      0.0000   82362.9029
    0.0000      0.0000      0.0000      0.0000      0.0000   55935.0608
    0.0000      0.0000      0.0000      0.0000      0.0000   16383.0000
    0.0000      0.0000      0.0000      0.0000      0.0000       0.0000
    0.0000      0.0000      0.0000      0.0000      0.0000   10946.7706
    0.0000      0.0000      0.0000      0.0000      0.0000   23169.0608
    0.0000      0.0000      0.0000      0.0000      0.0000   16383.0000
Table A.1 FHT of Maximum-Amplitude Truncated Cosine Wave: Program A
Figure A.1 FHT of 32-Point Truncated Cosine Wave
Table A.2 lists the output of Program B using the same input sequence. Note the order in which the output sequence must be read and the value of the exponent register.
Assuming that the double-precision results of Program A are exact, we can calculate the signal-to-noise ratio of the processor's fixed-point output for a maximum-amplitude, truncated cosine wave. Following the procedure outlined in previous sections, we find that the SNR of our output sequence is 71 dB. Given that the data began with 14 bits of precision (excluding the sign bit), this corresponds to an average decrease in SNR of 0.44 bits per stage.
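The 0.44 bits-per-stage figure can be reproduced with a back-of-the-envelope check, using the common 6.02 dB-per-bit rule of thumb; this is one plausible reading of the arithmetic, not a quotation of the authors' method.

```python
import math

# 14 magnitude bits correspond to roughly 14 * 6.02 dB of dynamic range;
# the measured 71 dB output SNR, spread over log2(32) = 5 stages, gives the
# average per-stage precision loss in bits.

bits_in = 14           # data precision, excluding the sign bit
snr_out_db = 71.0      # measured output SNR from the appendix example
stages = int(math.log2(32))

loss_bits = (bits_in * 6.02 - snr_out_db) / 6.02 / stages
```

The result lands at roughly 0.44 bits per stage, matching the figure quoted above and sitting below the half-bit-per-stage upper bound for unconditional BFP scaling.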
 n   z(n)+e(n)     z(n)        e(n)     [z(n)-avg(z)]^{2}  [e(n)-avg(e)]^{2}
 0      16         0.0000     16.0000      2.6840E+08         64.00000
 1    3280      3258.7813     21.2187      3.8580E+08        174.73403
 2    9584      9596.9392     12.9392      4.6051E+07         24.39570
 3   16368     16383.0000     15.0000      3.3087E-24         49.00000
 4       0         0.0000      0.0000      2.6840E+08         64.00000
 5   24528     24518.8922      9.1078      1.6730E+09          1.22722
 6   23168     23169.0608      1.0608      1.5644E+09         82.09810
 7   16368     16383.0000     15.0000      3.3087E-24         49.00000
 8   65520     65532.0000     12.0000      2.4156E+09         16.00000
 9   82368     82362.9029      5.0971      4.3533E+09        171.53403
10   55920     55935.0608     15.0608      1.5644E+09         49.85490
11   16384     16383.0000      1.0000      3.3087E-24         81.00000
12       0         0.0000      0.0000      2.6840E+08         64.00000
13   10944     10946.7706      2.7706      2.9553E+07         27.34662
14   23168     23169.0608      1.0608      4.6051E+07         48.15250
15   16368     16383.0000     15.0000      3.3087E-24         49.00000
16      16         0.0000     16.0000      2.6840E+08         64.00000
17    3280      3258.7813     21.2187      3.8580E+08        174.73403
18    9584      9596.9392     12.9392      4.6051E+07         24.39570
19   16368     16383.0000     15.0000      3.3087E-24         49.00000
20       0         0.0000      0.0000      2.6840E+08         64.00000
21   24528     24518.8922      9.1078      1.6730E+09          1.22722
22   23168     23169.0608      1.0608      1.5644E+09         82.09810
23   16368     16383.0000     15.0000      3.3087E-24         49.00000
24   65520     65532.0000     12.0000      2.4156E+09         16.00000
25   82368     82362.9029      5.0971      4.3533E+09        171.53403
26   55920     55935.0608     15.0608      1.5644E+09         49.85490
27   16384     16383.0000      1.0000      3.3087E-24         81.00000
28       0         0.0000      0.0000      2.6840E+08         64.00000
29   10944     10946.7706      2.7706      2.9553E+07         27.34662
30   23168     23169.0608      1.0608      4.6051E+07         48.15250
31   16368     16383.0000     15.0000      3.3087E-24         49.00000
Avg.:           16375         16383.0000   8.0000
Sum/(N-1) (s^{2}):                         8.3118E+08         65.50601
Table A.2 FHT of Maximum-Amplitude Truncated Cosine Wave: Program B
^{[1]} The complex FFT can be used to compute the transforms of two real-valued sequences simultaneously. However, the two resulting transforms are scrambled together, requiring a fair amount of post-processing to separate them.
^{[2]}Here and throughout the paper, logN denotes log_{2}N.
^{[3]} Performance benchmarks for several DSP processors and dedicated FFT processors are given in section 5.
^{[4]} It is interesting to note that the real-valued FFT algorithm also exhibits the retrograde addressing property and the same butterfly structure as Figure 1.5. The inverse transform, however, has a different butterfly structure.
^{[5]} The effects of scaling may offset any error introduced by round-off, as the "noisy" bits are shifted out and discarded. Some implementations round the scaled data according to the value of the discarded bit: if the shifted-out bit was '0', no error occurs and the bit is simply discarded; if it was '1', one is added to the data after shifting. Thus 3 shifted by 1 bit and rounded gives 2. Scaling the data in this way produces a mild improvement in the output signal-to-noise ratio at the cost of a more complex BCU.
^{[6]} If the data are to be involved in another transform (i.e., an inverse FHT), they must first be readjusted to within [-2^{14}, 2^{14}-1].