Ask the Doctors! Drs. David P. Berners and Jonathan S. Abel Answer Your Signal Processing Questions.

Doctors David P. Berners
& Jonathan S. Abel

Q: "I thought you guys were saying that floating point systems are better than fixed point. Now your plugins run on ProTools on the 56k. Do they really sound the same?"
blinky, via email
A: We have used a combination of techniques to ensure that the ProTools versions of our plugins have the fidelity we need. Here's a look at how it was done:
To discover the tradeoffs between fixed-point and floating-point systems, we must first have an understanding of the mechanics involved with each system. Let's begin by reviewing the way that fixed-point and floating-point numbers are stored:
Fixed-point
Here, the decimal point (strictly speaking, the binary point) is always in the same place, hence the name. The most common format for fixed-point systems is to have one sign bit (a one or a zero, which indicates whether the number is positive or negative), followed by the decimal point. The rest of the bits are fractional bits. This gives an allowed range of about minus one to plus one.
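A minimal sketch of this format in Python, assuming a 24-bit word (one sign bit plus 23 fractional bits). The word length and the helper names `to_fixed` and `to_real` are chosen here for illustration only:

```python
FRAC_BITS = 23            # assumed: 24-bit word = 1 sign bit + 23 fractional bits
SCALE = 1 << FRAC_BITS    # 2**23 integer codes per unit
Q_MAX = SCALE - 1         # largest code, represents 1 - 2**-23
Q_MIN = -SCALE            # smallest code, represents exactly -1.0

def to_fixed(x):
    """Quantize a real value to a signed integer code, saturating at the range limits."""
    code = int(round(x * SCALE))
    return max(Q_MIN, min(Q_MAX, code))

def to_real(code):
    """Convert an integer code back to the real value it represents."""
    return code / SCALE

for value in (0.5, -1.0, 1.0, 3.7):
    print(value, "->", to_real(to_fixed(value)))
```

Note that minus one is exactly representable but plus one is not, which is why the range is "about" minus one to plus one; anything larger simply saturates.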
You may ask, "Why not move the decimal point over a few places so that the range can be extended to plus or minus two or four?" The biggest advantage of limiting the range of absolute values to one is that, whenever two numbers are multiplied, the product will never have an absolute value greater than one, and will thus stay within the allowed numerical range. If the maximum representable value is bigger than one, we can have products which are greater than their multiplicands, which could cause saturation when doing multiplication. Furthermore, if the total number of bits used to represent a number stays the same, moving the decimal point does nothing to increase the dynamic range of the number system. The maximum allowed value goes up, but the step size between adjacent representable values (the granularity) goes up by the same factor.
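The arithmetic behind both points can be checked with a short Python sketch. The 24-bit word length and the helper name `fixed_format` are assumptions made for the example:

```python
def fixed_format(total_bits, int_bits):
    """Return (largest value, step size) for a signed fixed-point word
    with int_bits bits to the left of the binary point."""
    frac_bits = total_bits - 1 - int_bits   # one bit is the sign
    step = 2.0 ** -frac_bits
    largest = 2.0 ** int_bits - step
    return largest, step

for int_bits in (0, 1, 2):
    largest, step = fixed_format(24, int_bits)
    ratio = (largest + step) / step         # full scale divided by step size
    print(f"range +/-{2 ** int_bits}: step {step:.3e}, full scale / step = {ratio:.0f}")

# With a range of +/-1, a product cannot leave the range...
print(abs(0.9 * -0.8) <= 1.0)
# ...but with a range of +/-2, it can.
print(abs(1.5 * 1.5) <= 2.0)
```

The ratio of full scale to step size is the same in every row, which is the sense in which moving the point leaves the dynamic range unchanged.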
Fixed-point number systems have several advantages. Processors using fixed-point math can perform adds and multiplies more simply than processors using floating-point math. Also, if a signal is known to have a small dynamic range, fixed-point systems give the best numerical precision for a given number of bits. Among the drawbacks of fixed-point systems are that large signal values will saturate or clip, and that smaller numbers will have less relative numerical precision.
Floating-point
For floating-point systems, each number is represented by two separate fields: the mantissa and the exponent. The mantissa is very similar to a fixed-point number. However, for floating-point systems, the value of the mantissa always has an absolute value between 0.5 and 1.0, which gives us one extra bit for free; since the leading bit will always be one, it can be left out. The exponent field is used to tell by how many bits the mantissa must be shifted (i.e., what power of two multiplies the mantissa) to obtain the actual value being represented. The exponent can usually take positive or negative values, and allows for systems with huge dynamic range.
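Python's standard `math.frexp` happens to use the same convention (a mantissa magnitude between 0.5 and 1.0), so the mantissa/exponent split can be sketched directly. The helpers `decompose` and `quantize` are illustrative names, and the 8-bit mantissa is just an example width:

```python
import math

def decompose(x):
    """Split x into (mantissa, exponent) with 0.5 <= |mantissa| < 1
    and x == mantissa * 2**exponent."""
    return math.frexp(x)

def quantize(x, mant_bits):
    """Round the mantissa of x to mant_bits fractional bits, leaving the exponent alone."""
    mantissa, exponent = math.frexp(x)
    scale = 1 << mant_bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

for value in (6.0, 0.15625, -0.1):
    mantissa, exponent = decompose(value)
    print(value, "=", mantissa, "* 2**", exponent)

print(quantize(0.1, 8))   # 0.1 as seen by a toy float with an 8-bit mantissa
```

Because the mantissa magnitude is at least 0.5, its first bit after the point is always one in binary, which is the bit that can be left out of storage.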
Compared to a fixed-point system with the same total number of bits, the floating-point system will have fewer mantissa bits than the fixed-point system. This means that for numbers which are very close to plus or minus one, the fixed-point system will have superior precision. However, for signals with a wide dynamic range, floating-point has the advantage that the "noise floor" or quantization level is always the same number of dB down from the signal, as long as the signal stays within the (large) representable range of the exponent.
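To put rough numbers on this comparison, the sketch below measures how far the quantization step sits below the signal at three levels. Both formats are assumptions made for the example: a 24-bit fixed-point word with 23 fractional bits, and a hypothetical 24-bit float with 16 stored mantissa bits plus the implied leading bit.

```python
import math

FIXED_FRAC_BITS = 23   # assumed: 24-bit fixed-point word, 1 sign bit + 23 fractional bits
FLOAT_MANT_BITS = 17   # assumed toy float: 16 stored mantissa bits + 1 implied leading bit

def fixed_step(x):
    """Quantization step of the fixed-point format: the same at every level."""
    return 2.0 ** -FIXED_FRAC_BITS

def float_step(x):
    """Quantization step of the toy float near x: it scales with the exponent of x."""
    _, exponent = math.frexp(x)
    return 2.0 ** (exponent - FLOAT_MANT_BITS)

def db_below_signal(x, step):
    """How far the quantization step sits below a signal of magnitude x, in dB."""
    return 20.0 * math.log10(abs(x) / step)

levels = [0.9, 0.9e-3, 0.9e-6]   # roughly full scale, 60 dB down, 120 dB down
fixed_db = [db_below_signal(x, fixed_step(x)) for x in levels]
float_db = [db_below_signal(x, float_step(x)) for x in levels]

for x, f, g in zip(levels, fixed_db, float_db):
    print(f"level {x:g}: fixed {f:.1f} dB, float {g:.1f} dB")
```

Near full scale the fixed-point figure is the larger one; as the level drops, the fixed-point figure falls dB for dB with the signal while the float figure stays within about 6 dB of where it started.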
Floating-point number systems have the advantage that algorithms can be more easily designed for floating-point, with less attention paid to the dynamic range of internal variables. Also, for numbers with large dynamic range, numerical precision is more consistent. Disadvantages are that the hardware required to perform multiplies and adds becomes more complex (and more expensive in DSP resources), and for signals with very small dynamic range, precision can be worse than for fixed-point systems with the same number of bits.
Floating-Point Math on Fixed-Point Processors
Porting an algorithm to a fixed-point architecture requires attention to details which are not necessarily relevant in floating-point systems. To ensure robust performance, any saturation that takes place within the algorithm (say, due to an extremely loud input) should occur at the output of the algorithm, rather than at some internal node. This makes it obvious to the user that clipping is taking place and a level reduction is required. Also, for fixed-point systems, care must be taken to make the dynamic range of internal nodes as small as possible to best preserve numerical precision. There is more freedom than may be thought possible in trying to achieve this. For example, there are many filter structures having the same transfer function (e.g. direct form, transposed, lattice) which have vastly different numerical properties. Careful selection of filter structures can help numerical performance significantly.
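A toy illustration of internal versus output saturation, assuming a signal range of plus or minus one and a two-stage gain chain whose overall gain is 2. The gains and function names are invented for the example and are not taken from any particular plugin:

```python
def saturate(x, limit=1.0):
    """Clamp x to the representable range, as a saturating fixed-point unit would."""
    return max(-limit, min(limit, x))

def boost_then_cut(x):
    """Gain of 4 followed by gain of 0.5: the internal node can clip."""
    internal = saturate(4.0 * x)      # clips whenever |x| > 0.25
    return saturate(0.5 * internal)

def cut_then_boost(x):
    """Gain of 0.5 followed by gain of 4: only the output can clip."""
    internal = saturate(0.5 * x)      # never clips for |x| <= 1
    return saturate(4.0 * internal)

for x in (0.1, 0.3, 0.8):
    print(x, boost_then_cut(x), cut_then_boost(x))
```

With an input of 0.3 the first ordering returns 0.5 instead of 0.6, and nothing at the output looks clipped. The second ordering only saturates at the output, where a full-scale reading makes the problem visible. The price is that the attenuated internal signal uses fewer of the available bits, which is the precision tradeoff described above.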
One common misconception about various processors is that they will only run in one numerical system (fixed or float). This is not the case. It is possible, for example, to use floating-point numbers with a Motorola 56000. However, conversion between the (native) fixed-point and floating-point must be done by hand, and multiplies and adds must be performed using multiple instructions (and multiple clocks) per operation.
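To give a feel for why a software floating-point multiply costs several operations, here is a sketch that uses only integer arithmetic for the multiply itself. The storage layout (a signed 24-bit mantissa code plus a separate integer exponent) and all function names are hypothetical; this is not the code used in any actual 56000 port:

```python
import math

FRAC_BITS = 23            # assumed mantissa format: 1 sign bit + 23 fractional bits
ONE = 1 << FRAC_BITS      # integer code for 1.0
HALF = ONE >> 1           # integer code for 0.5

def normalize(mant, exp):
    """Shift the mantissa until its magnitude lies in [0.5, 1), adjusting the exponent."""
    if mant == 0:
        return 0, 0
    sign = -1 if mant < 0 else 1
    mag = abs(mant)
    while mag >= ONE:     # too large: shift right, raise the exponent
        mag >>= 1
        exp += 1
    while mag < HALF:     # too small: shift left, lower the exponent
        mag <<= 1
        exp -= 1
    return sign * mag, exp

def sf_from_float(x):
    """Pack a Python float into a (mantissa code, exponent) pair."""
    m, e = math.frexp(x)
    return normalize(int(round(m * ONE)), e)

def sf_to_float(value):
    """Unpack a (mantissa code, exponent) pair into a Python float."""
    mant, exp = value
    return math.ldexp(mant / ONE, exp)

def sf_mul(a, b):
    """Multiply two software floats: integer multiply, shift, add exponents, renormalize."""
    (ma, ea), (mb, eb) = a, b
    product = (ma * mb) >> FRAC_BITS   # drop the extra fractional bits of the product
    return normalize(product, ea + eb)

print(sf_to_float(sf_mul(sf_from_float(1.5), sf_from_float(-0.375))))
```

Even in this simplified form, one multiply involves an integer multiply, a shift, an exponent add, and a renormalization loop, which is where the extra instructions and clocks come from.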
When porting the UA compressors to ProTools, we discovered that, in order to achieve the numerical precision we wanted, it was necessary to keep some of our code running in floating-point. This hurt the instance count of the algorithms, but allowed the algorithms to be ported in a way that did not compromise the numerical precision.
There are some operations which lead to problems regardless of number system. The most common example is the addition of a number to another number which is relatively very small. Both fixed-point and floating-point systems will have problems with this unless there are enough bits to provide the necessary precision. The solution to this problem is to go to double-precision, where two words are used to store each data element. This provides more numerical precision, regardless of the number system used. However, double-precision operations can be up to three times as expensive in terms of DSP as single-precision.
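The effect is easy to reproduce on a desktop machine. The sketch below uses IEEE 754 single and double precision as a stand-in for one-word and two-word storage (an analogy, not the 56000's own formats), emulating single precision with the standard `struct` module:

```python
import struct

def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

big, small = 1.0, 1e-8

# Single precision: the small term is below half a step at 1.0 and is lost entirely.
single_sum = to_single(to_single(big) + to_single(small))
# Double precision: the same addition keeps the small term.
double_sum = big + small

print("single precision lost the small term:", single_sum == big)
print("double precision lost the small term:", double_sum == big)
```

The same thing happens in fixed-point whenever the term being added is smaller than one step of the word it is added to.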
To achieve a good port it is necessary to look at the required dynamic range and precision for each internal node, and to choose number systems and precision accordingly. The ProTools ports of our algorithms make use of both fixed-point and floating-point systems, as well as single-precision and double-precision storage. With proper care any algorithm can be made to run on any DSP, provided enough horsepower.
Do you have a question for the Doctors?