UA WebZine "Ask the Doctors!" July 03 | Drs. David P. Berners and Jonathan S. Abel Answer Your Signal Processing Questions

Doctors David P. Berners
& Jonathan S. Abel

Q: "I thought you guys were saying that floating point systems are better than fixed point. Now your plugins run on ProTools on the 56k. Do they really sound the same?"
--blinky, via email

A: We have used a combination of techniques to ensure that the ProTools versions of our plugins have the fidelity we need. Here's a look at how it was done:

To understand the tradeoffs between fixed-point and floating-point systems, we must first understand the mechanics of each system. Let's begin by reviewing the way that fixed-point and floating-point numbers are stored:

Fixed-point
Here, the decimal point is always in the same place (hence the name). The most common format for fixed-point systems is to have one sign bit (a one or a zero, which indicates whether the number is positive or negative), followed by the decimal point. The rest of the bits are fractional bits. This gives an allowed range of about plus-one to minus-one.
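
To make the format concrete, here is a minimal sketch of a one-sign-bit, all-fractional format. We assume a 24-bit word (1 sign bit plus 23 fractional bits) purely for illustration; the names FRAC_BITS, to_fixed and to_float are ours and do not come from any particular DSP toolchain.

```python
# Illustrative sketch: a 24-bit fractional format (1 sign bit, 23 fractional bits).
# The word length and all names here are assumptions made for the example.
FRAC_BITS = 23
SCALE = 1 << FRAC_BITS      # 2**23 steps per unit
Q_MIN = -SCALE              # integer code for -1.0
Q_MAX = SCALE - 1           # integer code for 1.0 - 2**-23 (largest positive value)


def to_fixed(x):
    """Quantize x to the nearest representable code, saturating at the range limits."""
    code = round(x * SCALE)
    return max(Q_MIN, min(Q_MAX, code))


def to_float(code):
    """Convert an integer code back to the real value it represents."""
    return code / SCALE


if __name__ == "__main__":
    for x in (0.5, -1.0, 1.0, 1.5):
        code = to_fixed(x)
        print(x, "->", code, "->", to_float(code))
```

Note that plus-one itself is not representable in this sketch: it saturates to the largest positive code, which is why the range is "about" plus-one to minus-one.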

You may ask, "Why not move the decimal point over a few places so that the range can be extended to plus-or-minus two or four?" The biggest advantage of limiting the range of absolute values to one is that with this system, whenever two numbers are multiplied, the product will always have an absolute value less than one, and will thus stay within the allowed numerical range. If the maximum representable value is bigger than one, we can have products which are greater than their multiplicands, which could cause saturation when doing multiplication. Furthermore, if the total number of bits used to represent a number stays the same, moving the decimal point does nothing to increase the dynamic range of the number system. The maximum allowed value goes up, but the step size between adjacent representable numbers grows by the same factor, so the ratio between the largest and smallest representable magnitudes is unchanged.
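
The dynamic-range argument can be checked numerically. The sketch below assumes a 24-bit word and computes the ratio of the largest representable value to the step size for several positions of the decimal point; the function name dynamic_range_db is ours, chosen for the example.

```python
import math


def dynamic_range_db(total_bits, integer_bits):
    """Ratio (in dB) of the largest positive value to the step size.

    total_bits includes one sign bit; integer_bits is how many places the
    decimal point has been moved to the right (0 = all-fractional format).
    """
    frac_bits = total_bits - 1 - integer_bits
    step = 2.0 ** -frac_bits
    max_value = 2.0 ** integer_bits - step
    return 20 * math.log10(max_value / step)


if __name__ == "__main__":
    # Same word length, different decimal-point positions: the ratio does not change.
    for integer_bits in (0, 1, 2):
        print(integer_bits, dynamic_range_db(24, integer_bits))
```

In every case the ratio is 2**23 - 1, or roughly 138 dB for the assumed 24-bit word.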

Fixed-point number systems have several advantages. Processors using fixed-point math can perform adds and multiplies more simply than processors using floating-point math. Also, if a signal is known to have a small dynamic range, fixed-point systems give the best numerical precision for a given number of bits. Among the drawbacks of fixed-point systems are that large signal values will saturate or clip, and that smaller numbers will have less relative numerical precision.

Floating-point
For floating-point systems, each number is represented by two separate fields: the mantissa and the exponent. The mantissa is very similar to a fixed-point number. However, for floating-point systems, the value of the mantissa always has an absolute value between 0.5 and 1.0, which gives us one extra bit for free; since the leading bit will always be one, it can be left out. The exponent field is used to tell by how many bits the mantissa must be shifted---i.e. what power of two multiplies the mantissa---to obtain the actual value being represented. The exponent can usually take positive or negative values, and allows for systems with huge dynamic range.
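
The mantissa/exponent split can be seen directly with Python's standard math.frexp, which happens to use the same convention described here (a mantissa with absolute value between 0.5 and 1.0). This is only an illustration of the idea, not of any particular hardware format; the helper name decompose is ours.

```python
import math


def decompose(x):
    """Split x into (mantissa, exponent) with x == mantissa * 2**exponent.

    For nonzero x, math.frexp returns a mantissa with 0.5 <= |mantissa| < 1.0.
    """
    return math.frexp(x)


if __name__ == "__main__":
    for x in (6.0, 0.1, -0.15625):
        mantissa, exponent = decompose(x)
        # math.ldexp reassembles the value: mantissa * 2**exponent
        print(x, "=", mantissa, "* 2 **", exponent, "->", math.ldexp(mantissa, exponent))
```

Because the mantissa's magnitude is always at least 0.5, its leading bit is always one, which is the "free" bit mentioned above.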

Compared to a fixed-point system with the same total number of bits, the floating-point system will have fewer mantissa bits than the fixed-point system. This means that for numbers which are very close to plus- or minus-one, the fixed-point system will have superior precision. However, for signals with a wide dynamic range, floating-point has the advantage that the "noise floor" or quantization level is always the same number of dB down from the signal, as long as the signal stays within the (large) representable range of the exponent.
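
A small experiment illustrates the comparison. As an assumption for the example, we pit a 32-bit fixed-point word (1 sign bit, 31 fractional bits) against IEEE single-precision float, which also occupies 32 bits but has fewer mantissa bits. The function names are ours.

```python
import struct


def quantize_fixed(x, frac_bits=31):
    """Round x to the nearest multiple of 2**-frac_bits (saturation ignored here)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def quantize_float32(x):
    """Round x to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def relative_error(quantizer, x):
    return abs(quantizer(x) - x) / abs(x)


if __name__ == "__main__":
    loud = 0.9                 # close to full scale
    quiet = 0.9 * 2.0 ** -20   # about 120 dB below full scale
    for name, x in (("loud", loud), ("quiet", quiet)):
        print(name,
              "fixed:", relative_error(quantize_fixed, x),
              "float:", relative_error(quantize_float32, x))
```

Near full scale the fixed-point word has the smaller relative error; 120 dB down, the float's relative error is unchanged while the fixed-point error is several orders of magnitude larger.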

Floating-point number systems have the advantage that algorithms can be more easily designed for floating-point, with less attention paid to the dynamic range of internal variables. Also, for numbers with large dynamic range, numerical precision is more consistent. Disadvantages are that the hardware required to perform multiplies and adds becomes more complex (and more expensive), and for signals with very small dynamic range, precision can be worse than for fixed-point systems with the same number of bits.

Floating-Point Math on Fixed-Point Processors
Porting an algorithm to a fixed-point architecture requires attention to details which are not necessarily relevant in floating-point systems. To ensure robust performance, any saturation that takes place within the algorithm (say, due to an extremely loud input) must occur at the output of the algorithm, rather than at some internal node. This makes it obvious to the user that clipping is taking place and a level reduction is required. Also, for fixed-point systems, care must be taken to make the dynamic range of internal nodes as small as possible to best preserve numerical precision. There is more freedom than may be thought possible in trying to achieve this. For example, there are many filter structures having the same transfer function (e.g. direct form, transposed, lattice) which have vastly different numerical properties. Careful selection of filter structures can help numerical performance significantly.
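
As a toy illustration of why the location of saturation matters, consider two hypothetical signal paths with the same overall (unity) gain. The gains, the function names and the use of Python floats to model a range of plus-or-minus one are all assumptions made for the example; this is not one of our plugin algorithms.

```python
def saturate(x, limit=1.0):
    """Model fixed-point saturation: clamp x to the range [-limit, +limit]."""
    return max(-limit, min(limit, x))


def boost_then_cut(x):
    """Gain of 4 followed by gain of 0.25: the internal node clips for |x| > 0.25."""
    internal = saturate(4.0 * x)
    return saturate(0.25 * internal)


def cut_then_boost(x):
    """Gain of 0.25 followed by gain of 4: clipping can only happen at the output."""
    internal = saturate(0.25 * x)
    return saturate(4.0 * internal)


if __name__ == "__main__":
    x = 0.5
    print("boost then cut:", boost_then_cut(x))   # distorted, yet far from full scale
    print("cut then boost:", cut_then_boost(x))   # passes through unchanged
```

In the first ordering the input is silently distorted at the internal node, and the output level gives the user no hint that clipping occurred. The second ordering avoids that, but in a real fixed-point implementation the early attenuation would discard low-order bits, which is exactly the kind of tradeoff that has to be weighed node by node.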

One common misconception about various processors is that they will only run in one numerical system (fixed or float). This is not the case. It is possible, for example, to use floating-point numbers with a Motorola 56000. However, conversion between the (native) fixed-point and floating-point must be done by hand, and multiplies and adds must be performed using multiple instructions (and multiple clocks) per operation.
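
To give a feel for why software floating-point costs multiple instructions per operation, here is a rough sketch of a floating-point multiply built only from integer operations. The representation (a signed integer mantissa with 23 fractional bits plus a separate integer exponent) and every name below are assumptions made for illustration; this is not the code that runs on the 56000.

```python
import math

MANT_BITS = 23
ONE = 1 << MANT_BITS          # integer mantissa m represents the value m / 2**23


def normalize(m, e):
    """Shift m until 2**22 <= |m| < 2**23, adjusting the exponent to compensate."""
    if m == 0:
        return 0, 0
    while abs(m) >= ONE:
        m >>= 1
        e += 1
    while abs(m) < (ONE >> 1):
        m <<= 1
        e -= 1
    return m, e


def encode(x):
    """Convert a Python float to a (mantissa, exponent) pair, truncating extra bits."""
    m, e = math.frexp(x)
    return normalize(int(m * ONE), e)


def decode(value):
    m, e = value
    return math.ldexp(m / ONE, e)


def soft_mul(a, b):
    """Multiply two (mantissa, exponent) pairs using integer operations only."""
    (ma, ea), (mb, eb) = a, b
    m = (ma * mb) >> MANT_BITS     # integer multiply, then rescale the product
    return normalize(m, ea + eb)   # add exponents, then renormalize


if __name__ == "__main__":
    product = soft_mul(encode(3.0), encode(0.5))
    print(product, "->", decode(product))
```

Even in this stripped-down form, a single multiply involves an integer multiply, a shift, an exponent add and a normalization loop, with no rounding or overflow handling yet.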

When porting the UA compressors to ProTools, we discovered that, in order to achieve the numerical precision we wanted, it was necessary to keep some of our code running in floating-point. This hurt the instance count of the algorithms, but allowed the algorithms to be ported in a way that did not compromise the numerical precision.

There are some operations which lead to problems regardless of number system. The most common example is the addition of a very small number to a much larger one. Both fixed-point and floating-point systems will have problems with this unless there are enough bits to provide the necessary precision. The solution to this problem is to go to double-precision, where two words are used to store each data element. This provides more numerical precision, regardless of the number system used. However, double-precision operations can be up to three times as expensive in terms of DSP as single-precision.
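
The small-plus-large problem is easy to reproduce. In the sketch below, single-precision float is emulated by rounding through Python's struct module, and the fixed-point case assumes 24-bit words (23 fractional bits in one word, 47 in two). The values and names are ours, chosen only to illustrate the effect.

```python
import struct


def f32(x):
    """Round x to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_f32(a, b):
    """Add two numbers with the operands and the result held in single precision."""
    return f32(f32(a) + f32(b))


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (fixed-point storage)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


if __name__ == "__main__":
    big, small = 1.0, 1e-8

    # Floating-point: the small term vanishes in single precision, survives in double.
    print("float single:", add_f32(big, small) == big)   # small term lost
    print("float double:", big + small == big)           # small term kept

    # Fixed-point: the small term rounds to zero in one word, survives in two.
    print("fixed single word:", quantize(small, 23))
    print("fixed double word:", quantize(small, 47))
```

In both number systems the small term disappears entirely at single precision and is recovered once two words are used.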

To achieve a good port it is necessary to look at the required dynamic range and precision for each internal node, and to choose number systems and precision accordingly. The ProTools ports of our algorithms make use of both fixed- and floating-point systems, as well as single- and double-precision storage. With proper care any algorithm can be made to run on any DSP, provided enough horsepower.

Do you have a question for the Doctors?