Calculating the FHT in Hardware

 

Adam C. Erickson

Sequoia Systems Inc.

Marlboro, Massachusetts

{...seqp4!sequoia!adame}

 

Barry S. Fagin

Thayer School of Engineering

Dartmouth College

Hanover, NH 03755

603-646-3060

barry.fagin@dartmouth.edu

 

 

EDICS 4.11, 7.1.3, 7.2

 

 

 

ABSTRACT

 

We have developed a parallel, pipelined architecture for calculating the Fast Hartley Transform.  Hardware implementation of the FHT introduces two challenges: retrograde indexing   and data scaling.    We propose a novel addressing scheme that permits the fast computation of FHT butterflies,  and  describe a hardware implementation of conditional block floating point scaling that reduces error due to data growth with little extra cost.  Simulations reveal a processor capable of transforming a 1K-point sequence in 170 microseconds using a 15.4 MHz clock. 

 

 

 

 

 

 


 

 

 

1.0 Introduction

The fast Hartley transform has attracted considerable research interest as an alternative to the FFT.  A complete analogy exists between the two transforms.  All the properties applicable to the FFT, such as the convolution and shifting theorems, apply to the FHT.  In fact, the FHT is easily extracted from the real and imaginary components of the FFT (and vice versa). These properties have been well-documented in [Sore85], [Brac86], [Hou87],  and [Onei88].

The computational advantage of the FHT lies in its use of real-valued data.  As the vast majority of applications involve only real-valued data, significant performance gains are realized without much loss of generality.  If one performs the complex FFT on an N-point real-valued sequence, the product is a complex sequence of 2N real points, N of which are redundant.[1] The FHT, on the other hand, will yield a real sequence of the same length with no redundant terms, using only half the time and half the memory resources.

Real-valued FFT  algorithms have been derived [Duha86, Sore87, Mou88] which compare in speed and memory requirements of the FHT, but they require the implementation of a different algorithm to compute the inverse real-valued FFT.  This added complexity is undesirable, as algorithms for both the forward and inverse transform are preferred.

The FHT is obtained from the Discrete Hartley Transform (DHT) pair of a real function x(t), shown below:

 

                                                                                            (1)

 

                                                                                             (2)

 

where cas(x) = cos(x) + sin(x).  Note that the inverse DHT differs from the forward DHT only in a scaling factor, lending itself well to hardware implementation.

Like the DFT, naive implementations of the above equations imply O(N2) operations.  Using techniques similar to the derivation of the FFT, the number of operations required to perform the DHT can be reduced to O(NlogN)[2].

 

2.0 Previous Work

Hardware implementations of the FFT and similar algorithms fall into two categories:  programmable DSP processors, and dedicated FFT processors.  Most of those in the first class [Marr86], are extremely efficient when compared to general purpose Von Neumann  architectures, but suffer in performance when compared with  dedicated processors.[3] 

Academic efforts in dedicated FFT processors include a wafer-scale VLSI multiprocessor array capable of computing the FFT, proposed in [Kore81].   [Choi88] describes an extension of [Kore81] with improved fault tolerance, although performance figures are not reported.  The CUSP dedicated-FFT processor [Lind85] has a very simple host interface and an elegant solution to controlling the overflow associated with fixed-point arithmetic.  But the decision to implement bit-serial multipliers results in a single-processor system that is easily outperformed by ours and others [Shen88, Burs88].

One of the most successful commercial efforts in this area is Honeywell’s HDSP66110/220 chip-set [Shen88].  Capable of performing a 1024-point real-valued transform in 63 microseconds, this product combines a number of multiprocessing techniques to implement the fastest single-processor, radix-4 FFT engine known to the author.  Several HDSP66110/220’s can be cascaded to gain even further performance, with a corresponding increase in cost. 

Hardware realizations of the FHT are scarce.  We conjecture that this is due not only to its relatively new stature in the ranks of fast algorithms, but also to the peculiar quirks of its implementation.

Software implementations of the FHT on Intel’s iPSC hypercube [Lin88] and the Connection Machine [Land90] have demonstrated that architectures of this type do not yet possess the interprocessor communication speeds necessary to compute the FHT efficiently.  Here, 1K-point transform execution times are measured on the order of seconds.   Adding processors can actually degrade performance, as communication overhead between the processors begins to offset any realized performance gains.  Le-Ngoc and Vo offer another description of the implementation and performance of the FHT in [LENG89].  They publish estimated performance figures, of which only software implementations on the 80x86 microprocessor family reflect actual measurements.  Estimates on other architectures are obtained by halving published FFT calculation times.

A summary of performance benchmarks for a number of general-purpose and dedicated-FFT processors is given in a later section.  In the next section, we discuss the two principal challenges to hardware implementation of the FHT: retrograde-indexing and scaling.

 

3.0 The Challenges of Implementing the FHT

The two principal difficulties with implementing the FHT are retrograde indexing and data scaling.   Retrograde indexing refers to the peculiar addressing pattern of the FHT, while scaling arises from our use of fixed-point arithmetic to calculate the FHT.

 

3.1 Retrograde Indexing

Addressing for the FHT exhibits a unique asymmetry.  Figure 1 depicts the flow diagram for a decimation-in-time (DIT), radix-2, 16-point FHT [Kwon86].  Note the regular structure of the FHT.  It is identical to a radix-2 FFT except for the ‘T’ blocks, which implement retrograde indexing. Visualizing larger transforms is straightforward:  to construct a 32-point transform flow graph,  we duplicate the 16-point transform diagram below for stages 1-4 and add a fifth stage with T5 in its lower half[4]. 

 

 

Figure 1   Butterfly Diagram for 16-point FHT

[reproduced from [kwon86])

 

For the first two stages, FHT butterflies are similar to FFT butterflies, involving only two terms and no multiplications.  For stage 3 and after, however, an increasing number of butterflies involve non-trivial multiplications.  As T3 and T4 show, these butterflies require three input points to produce two output points (heavy lines, stage 3), introducing computational asymmetry and an I/O ratio of  3:2.  The I/O efficiency can be improved by noting that two of the three terms used for any one butterfly are shared by a second butterfly in the same iteration.   If the two associated butterflies are processed simultaneously, only four terms (not six) would be required (heavy lines, stage 4).  Thus if we calculate two FHT butterflies at once, we may use four input points to calculate four output points, giving an I/O ratio of 1:1 [SORE85].  The concurrent calculation of two related butterflies will be referred to as a dual radix-2 butterfly.  The equations of a dual radix-2 butterfly are shown below:

 

                                                (3)

                                (4)

                                        (5)

                                  (6)

 

where i = 1,...,log2N (stage counter), Ns = 2i (group size), f = 0,...,Ns/4 - 1 (butterflies within each group), and k = 0,N/Ns,...,(N/Ns - 1)*N/Ns  (base index for each group).  Whenever f evaluates to zero, Equations (3) - (6) are reduced to one addition or subtraction.  This is the case for all of stages 1 and 2, half the butterflies in stage 3, and so on, decreasing to only 1 butterfly in the last stage.  All the other butterflies involve general multiplication with the twiddle factors and two additions or subtractions.    The next section discusses the consequences these operations can have on data integrity.

 

3.2 Preventing Overflow

Additions and subtractions using fixed-point arithmetic can result in data overflow, a situation which yields erroneous results and must be avoided during FHT calculation.  In butterflies requiring three terms, two addition operations are involved.  Studying Equations (3) - (6), we observe that the condition which results in the largest data growth is when f/Ns = 1/8 and the data are equal and at their maximum.  Normalizing the maximum to be 1, the result of a worst-case computation is described by:

 

 

 

 

Because the result can grow by a factor of 2.4142, the data can experience up to élog22.4142ù  = 2 bits of overflow.  To prevent this, the data can be shifted right by two bits before performing the butterfly calculations.  In addition to reducing the data’s dynamic range, this data scaling  introduces truncation error if any of the discarded bits are non-zero.  As the FHT calculations proceed from stage to stage and further data shifts are required, the error can propagate and grow enough to severely distort or completely obliterate the true output signal.  

The technique used to scale the data can likewise degrade the signal-to-noise ratio.  The easiest solution is to unconditionally shift the data right two bits at each stage.  This would eliminate the possibility of overflow, at the cost of unnecessary shifting of data and a large loss of accuracy.  Clearly other solutions are preferable.

Our processor implements a scaling technique known as conditional “block-floating-point” (BFP), whereby the data are shifted only   when an overflow occurs [Welc69].   Instead of providing an exponent for each data word, as in floating-point, the entire sequence is associated with a single exponent.  Should an overflow occur during an iteration, the entire array is shifted right by one and the exponent incremented.

How and when the sequence is shifted can effect system performance.  [Welc69] suggests that the sequence be shifted immediately after an overflow.  In this method, the sequence may be traversed twice during an iteration for the sole purpose of overflow adjustment.  [ANDE88] limits the size of the incoming data to be two bits less than full-precision.  The data are then allowed to grow into these extra bits during an iteration, after which an explicit subroutine is called to shift the sequence by the maximum amount of overflow that occurred.  These schemes are not optimal because a significant portion of the FHT calculation is dedicated to adjusting the sequence.  Our implementation improves upon existing methods  by shifting the data as it arrives in the computational element during the next iteration, thereby precluding any time penalties.   We discuss this in detail in a later section.

The situation resulting in the maximum possible bit growth arises when the input data are equal and at full-scale; that is, x(n) = [2B-1-1,2B-1-1,...,2B-1-1] or [-2B-1,-2B-1,...,-2B-1], where B is the processor’s word-length.  The FHT of a constant input is an impulse whose amplitude is given by the following equation:

 

                                                                                                                               (7)

                                                                   

 

This result is intuitive, as a constant time-domain signal concentrates all its power into the DC component of its transform.   Equation 7 shows that the data can grow by a factor of N, or log2N bits, so when x is at full-scale, log2N shifts are necessary to prevent overflow.

We note that the inverse transform would likewise produce log2N bits of overflow given the same  maximum DC input sequence.  However, the inverse transform requires that we divide by N, equivalent to subtracting log2N from the data’s common exponent.  From equation 8, we see that the data, overall, does not experience any growth, although it experiences maximum data scaling.  For a detailed analysis of this and other examples showing the effects of data scaling, the reader is referred to [ERIC90].

 

                                                                                                                      (8)

 

Knowing that a sequence can grow by one bit during the first two stages and by two bits in all subsequent stages, unconditional fixed-point scaling techniques would shift the data a total of 2log2N - 2 bits,  almost twice  the maximum possible bit-growth.  For this reason, block-floating-point schemes can generally be expected to perform better than unconditional schemes.  Detailed error analyses for the fixed-point scaling techniques discussed here can be found in [welc69], [rabi73], and [oppe88].  A method of analysis specific to our implementation is presented in a later section.

 

 

4.0 Architecture of the FHT Processor

This section describes an architecture capable of computing the FHT very efficiently.  For 1024-point transforms, 98 percent of the execution time is spent performing FHT calculations.  We first discuss the system architecture, and then present solutions to the aforementioned implementation challenges.

 

4.1 System Architecture

The processor architecture embodies a number of techniques designed to optimize system performance and functionality.  The butterfly computational unit (BCU) utilizes a 5-stage pipeline, keeping the critical path as short as possible.  Additionally, the processor incorporates tightly-coupled, high speed buffer memories.  Organized to hold up to 1024 16-bit words, each memory is arranged to allow simultaneous, single-cycle access to four 16-bit data points.  Employing several of these memories allows for many such concurrent data transfers, boosting the data I/O rate significantly.

The processor can be programmed to transform sequence lengths of any power of two from 4 to 1024 points, thereby overcoming an inflexibility found in some dedicated hardware processors.   Larger transforms are a simple extension of the processor’s address generation unit and an increase in the depth of the memories.

A block diagram of the system is given in Figure 2.   The arithmetic element accepts four data elements (and the two sine and cosine terms) and outputs four processed elements every clock cycle.  Each of the four buffer memories shown contains two independently addressable arrays 32 bits wide by 256 words deep.  Two of these memories alternate as source and destination as the FHT calculations proceed from stage to stage.  The other two memories not involved in the transform can be used to load newly-sampled data and unload recently processed data asynchronously while transform calculations of a third sequence are in progress, accommodating high-speed, real-time FHT processing. 

Interface to a host machine is accommodated by 3 control lines and a 4-bit word indicating transform size which can easily be mapped into the host’s address space.  To begin a transform, the host places the value log2N on the data bus and asserts start.  Calculations can be put on hold or halted altogether by asserting pause or reset, respectively.  When a transform is complete, done is asserted.  The host can then recover the processed data and their common exponent.

 

 

Figure 2   System Block Diagram

 

We have mentioned previously the two primary challenges to implementing the FHT: retrograde indexing and scaling.  We now present our solutions to these problems.

 

4.2 Algorithm Control and Butterfly Computation

Because parallel access to all four elements of a dual radix-2 butterfly is desired, the processor employs an algorithm different from the one originally proposed in [Brac84].  Given a sequence of length 2v, v Î [2,3,...,10], the sequence is broken into 4 equal segments and arranged as depicted in Figure 3.  Each column in the array is a separate memory bank, where the first 1/4th of the sequence is loaded in the first bank, the second 1/4th in the second, and so on. Thus, a 16-point sequence would be stored as shown in Figure 3.

When calculations begin, the top data value  from each of the memories can be fetched simultaneously (via memory address 0 being sent to all four memories), providing the BCU with the necessary data to calculate the first butterfly.   Comparing Figures 1 and 3 will show that bit-reversal, an operation required before performing DIT butterfly calculations, is already partially accomplished (the destination addresses for the first stage will still need to be bit-reversed).  A simple binary “butterfly” counter extending from 0 to N/4 - 1 can serve as the address to fetch the input data. 

 

 

Figure 3   Arrangement of N-point Sequence into four separate memory banks

 

Once fetched, the data enter the Butterfly Computational Unit, pictured in Figure 4. The dotted lines in the figure show where clocked buffers are placed in order to implement the five-stage pipeline.  Once the data arrive from memory, the data are first scaled according to the amount of overflow that occurred during the previous stage.  Next, reordering of the data may be necessary due to the way data was written during the previous iteration.  After reordering, the butterfly computations from equations (3) - (6) are performed.  More post-calculation reordering may be necessary before the results are finally written to the destination memory.

As Figure 4 shows, butterfly calculation involves multipliers, adders and subtracters.  Disposing of most of the subscripts and letting  m denote the current stage number, Equations  (3) - (6)  simplify to:

 

A0m+1 = A0m + (B0mcosq + B1msinq)                                                                     (9)

A1m+1 = A1m + (B1mcosq  - B0msinq)                                                                   (10)

B0m+1 = A1m  - (B1mcosq  - B0msinq)                                                                    (11)

B1m+1 = A0m  - (B0mcosq  + B1msinq)                                                                   (12)

 

These equations are similar to those used to calculate just one radix-2 complex FFT butterfly .

 

Figure 4   BCU Block Diagram

        Although a sequence is stored in four separate memory banks, writing the results to the destination memory does not require the generation of four separate addresses.  The BCU result data are written to the destination memory in pairs, requiring only two addresses, A and B (see Figure 4).   Data must be written in this way to insure simultaneous access to four elements  using a single source address during the next iteration.

 

4.3 Address Generation and Control

        The correct computation of the FHT using the datapath of Figure 4 requires the specification of input and output reordering control signals.  For some butterflies, the data emerging from the A and B sides of the pipeline must be swapped.  This swapping must then be remembered  by the processor during the next butterfly stage.  Butterflies with 2 input points must also be handled differently. 

Output and input reordering are indicated by the boolean variables O and I.  These are used to re-route data coming from or going to the A or B memories, and are implemented as control signals in the processor.  The i control line governs the reordering of data before a butterfly calculation.  When the i control line is active in a given stage, the data coming from memory A is swapped with that of memory B.  When o is active, the processed data to be written to  memory A is swapped with that of memory B.  If o is active for a butterfly in any stage, i will be active in the corresponding butterfly in the next  stage.

        The c control line accommodates butterflies requiring 2  (instead of 4) input terms.  Note that in the first two stages c is always high.  The BCU computes the correct results, but they are not in the proper order for writing out to memory.  When c becomes active, results from the first adder and second subtracter (in the 2nd tier of ALUs) are written to destination memory A0 and A1, respectively, and results from the second adder and first subtracter are written to destination memory B0 and B1, respectively. 

        To analyze the operation of the control lines, it is convenient to number each butterfly with an address b,  0 <= b  < N/4 - 1, where N is the input size, and to label the stages of the FHT with a number s,  0 <=  s  < log N.  We divide the set of butterflies into groups  of size 2s -1, numbered from g   = 0 .. N/2s + 1-1 (if s  = 0 the group size is defined to be 0).  I, O, and C can then be specified as follows:

 

        s  = 0                                               I = 0, O = 1, C = 1

        s  = 1                                               I = 1, O = bit 0 of b , C = 1

        s  = 2 .. log N-2                              I = bit s-2  of b , O = bit s-1  of b , C = NOR(lower s-1   bits of b )

        s  = logN - 1                                   I = bit s-2  of b , O = bit s-1  of b , C = 1

 

        Bits are numbered from 0, and are assumed to be equal to zero if the indicated position is greater than logN - 3.  Output reordering is always required for the first stage, and then for all the odd numbered groups.  Input reordering is not required for  butterflies in stage 0, but is needed at stage s  for all butterflies with reordered outputs in the stage s-1.    The C signal, for two-input butterflies, is always active during the first and second stages.  C is then active for the first butterfly in each group until the last stage, when it is always asserted.

 

        The destination A and B addresses of a butterfly b  in group g   for stage s  can be specified in the following way:

 

                IF s  = 0

                        A := bit-reversed(b ), B := bit-reversed(b )

                ELSE IF g   is even

                        A := b

                        IF NOR(lower s-1   bits of b )

                                B := A plus 2s-1

                        ELSE

                                B := 2s (g+1) - A

                ELSE

                        B := b

                        IF NOR(lower s-1  bits of b )

                                A := B plus 2s-1

                        ELSE

                                A := 2s (g +1) - B

 

                For the first stage of the FHT, the A and B addresses are simply the bit-reversed equivalents of the butterfly addresses.  Otherwise, for even groups A is the butterfly address.  B is equal to A plus the group size if the butterfly is the first in a group, otherwise it is twice the group size times (g  + 1 ) minus A.  For odd groups, the A and B values are switched.

        Our processor uses four memory banks, with two alternating as source and destination with each stage of the FHT.  Alternatively, one may reduce system memory to two banks if special logic is added to stall the pipeline when the input and output stages attempt to access the same address.  In a processor with k  stages, this will occur when the destination address for butterfly b ,  is equal to b+k .  It is not difficult to show that for k   odd, accesses for s   > 0 are conflict-free.

        Destination addresses and control signal calculation for a 64-point FHT are shown in Table 1.  This transform has 16 dual-radix butterflies.  '•' denotes an asserted signal, 'b' a butterfly number, and 'g' a group number.  A and B indicate A and B destination addresses, while I, O, and C denote the appropriate control signals.  Groups are indicated by dotted lines.

 

b

g

I

C

O

A

B

 

g

I

C

O

A

B

 

g

I

C

O

A

B

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

-

 

0

0

 

0

 

0

1

 

0

 

 

0

2

1

-

 

8

8

 

1

1

0

 

0

 

 

1

3

2

-

 

4

4

 

2

 

2

3

 

1

 

2

0

3

-

 

12

12

 

3

3

2

 

1

 

3

1

4

-

 

2

2

 

4

 

4

5

 

2

 

 

4

6

5

-

 

10

10

 

5

5

4

 

2