Floating Point Performance of Common Lisp
Christopher J. Vogt
631 N. 164th Circle, Omaha, NE 68118-2530
vogt@computer.org

"Round up the usual suspects" [6]

When a programming task entails a significant amount of floating point (FP) operations, the FP processing speed becomes the primary consideration in selecting a language to be used for implementation. As a computing professional, you round up the usual suspects: FORTRAN, C, Java, Ada, any static typed language, and depending on other considerations, you select a language, confident in your due diligence. Your confidence is misplaced. With a little more investigation you might be shocked, shocked to learn that a dynamic language like Common Lisp can deliver similar FP performance as your favorite static typed language, and opening up to you all the advantages that a dynamic language like Common Lisp has to offer, but without the historical side effect of poor FP performance.

"For a price Ugarte, for a price." [ 6]

Lisp has always been considered the leading language for AI research, and Common Lisp (CL) has become the preferred Lisp dialect since the 1994 ANSI standard was released [1]. While it has never been considered a number crunching language, recent CL implementations may change that perception. ANSI CL defines a set of type specifications, so the programmer can tell the compiler what type a variable is, just as you do in a static typed language like C. However, it is not required that you tell the compiler this information, nor is it required that the compiler pay any attention to your declarations, so they are more of a hint than a requirement. Many CL implementations ignore the type specifications if they occur, or they do not take full advantage of the information. The price you pay in CL for getting FP performance equivalent to that of a static typed language like C, is to include type declarations in the CL code. This can reduce the flexibility of the code, but allows the programmer the freedom to chose to specify the types with the reward of significantly improved performance, similar to that of static typed language implementations.

"I was misinformed." [6]

In evaluating FP performance, the Gabriel FFT benchmark [4] was used (all the code is available at http://members.home.com/vogt/fft/). This benchmark was converted to C and Java (The Java code is very similar to the C code, in fact the inner loops of the Java and the C code are identical). Appendix A has the CL code for the FFT benchmark that is modified to include variable type declarations. The programs were run on a Micron PC with dual 133MHZ Pentiums which have an 8K data cache on chip, with a1 MB secondary cache, and 64MB RAM, running Windows NT 4.0 SP3. The CL implementation used was a beta copy of Franz Allegro Common Lisp (ACL) 5.0. Microsoft Visual C++ version 4.2 is the C implementation used. The Java used included a byte code compiler by Symantec version 210.065. The results are shown in the table on the next page.
 
Data Set Size 1024  8192
Symbolics 3600 + FPU 3.87
ACL 5 Beta 3.33  54.3 
ACL 5 Beta w/declarations 0.046  0.765
Java 0.033 0.83 
MS VC++ 4.2 0.013 0.57 
The original FFT benchmark operates on two arrays with 1024 elements. Before running the benchmarks, it was anticipated that the Java and C versions would have similar performance, and when the results showed that the C version was 2.5 times faster, some investigation was called for. After some thought and research, it was determined that this was the result of cache affects. The C version was able to keep most of the data in the on chip cache, since the two 1024 element arrays are exactly 8K bytes of data. Increasing the arrays to 8192 seemed to get rid of the cache affects, and showed the C code to be only about 45% faster than the Java code, much more in line with expectations. The ACL code was slightly faster than the Java code, and only about 35% slower than the C code.

The CL code with declarations bears a little explanation. The changes to the CL code include not only the type declarations for variables, but also the type declarations for the values being returned by functions. Line 43 of fft-with-declarations gives an example of the former, and line 48 gives an example of the latter. Both lines are shown below:

     (declare (fixnum l))
     (the single-float (cos (f/ (single-float pi) (single-float le1))))

The first case above is not much different from the way in which most static typed languages declare the type of variables. In this case the variable "l" is being declared to be of type fixnum, which would be similar to the C code: "long l;". The second example is necessary because CL does not provide a method of specifying the type returned from a function, because functions can return different types. For example, the type returned from "sqrt" can be a floating point number or a complex number, in the case of (sqrt -1). So the use of "the" is a method of specifying to the compiler the expected type of the returned value. One downside to this is that if the function returns a value of a different type than what you told the compiler, the results are undefined. Often times your program will crash, as it does in C when a variable is declared a short, and it gets incremented past its maximum value. However, by promising the compiler that you know the type of the returned value, the compiler can produce code optimized for that particular case. To facilitate the use of "the" we use macros like "f-" to specify a floating point minus operation with both the arguments and the returned value is of type single-float. This use of macros helps to keep the code readable.

"That's so long ago I don't remember" [6]

There is another surprise in the FFT benchmark, and that is the revelation that without type declarations, CL FP implementations still have a long way to go in delivering acceptable performance. The performance of ACL without type declarations is almost two orders of magnitude slower than the version with type declarations. Additionally, it is about the same speed as the Symbolics, which is 10 year old hardware technology (note that ACL is not unique it its relatively poor FP performance without type declarations, and it was among the fastest CL implementations tested). The above tests were run on 1995 hardware technology. The Symbolics 3600+FPU is 1985 hardware technology, it runs at 5MHZ, has 6MB RAM that is clocked at 600ns. As compared to the Symbolics, the Micron has a clock that is 26 times as fast, the memory is 10 times as fast (not counting the caches), and none of this takes into account the efficiencies gained in the last 10 years in processor architecture. With hardware that is almost 100 times faster, ACL is only a scant 16% faster than Symbolics on this benchmark.

To be sure, there are significant differences in the hardware of the 3600 [5] and modern day microprocessors, and it is these differences that allowed for more efficient CL FP performance. The 3600 had custom hardware that performed type checks in parallel with the execution of the operation, and operated on 36-bit memory. Whereas a modern microprocessor like the Pentium operates on 32-bit memory and has no type checking hardware so the type checking must be done sequentially. As a consequence of this, ACL must box the FP data, consuming more memory than the C version, and spending many extra memory cycles checking the types of each and every memory reference.

However, there are type inferencing methods [3] that could be applied more frequently than they seem to be, that would speed up the FP performance of untyped CL code. This actually applies to more than just FP performance [2].

"I think this is the beginning of a beautiful friendship." [6]

CL implementations of FP operations on stock hardware have been traditionally much slower than static type languages. Utilizing type declarations, modern CL implementations can deliver similar FP performance to static typed languages like C and Java. Software developers no longer have to forego the benefits that CL has to offer in order to get good FP performance.

References

1. ANSI, Programming Language Common Lisp, 1994.

2. Baker, Henry G., The Nimble Type Inferencer for Common Lisp-84, unpublished memo, 1990.

3. Fateman, Richard J., Kevin A. Broughan, Diane K. Willcock and Duane Rettig, Fast Floating-Point Processing in Common Lisp, ACM Transactions on Mathematical Software, March 1995.

4. Gabriel, Richard P., Performance and Evaluation of Lisp Systems, The MIT Press 1985.

5. Moon, D. A., Symbolics Architecture, IEEE Computer, January 1987.

6. Warner Bros. Pictures Inc., Casablanca, 1943.
 

APPENDIX A - FFT Gabriel benchmark with type declarations:

;;;
;;; All modifications to the original Gabriel FFT benchmark are in bold
;;;

(defvar re (make-array 8193 :element-type 'single-float ':initial-element 0.0))
(defvar im (make-array 8193 :element-type 'single-float ':initial-element 0.0))

(defmacro f- (one two)
  `(the single-float (- (the single-float ,one) (the single-float ,two))))

(defmacro f+ (one two)
  `(the single-float (+ (the single-float ,one) (the single-float ,two))))

(defmacro f* (one two)
  `(the single-float (* (the single-float ,one) (the single-float ,two))))

(defmacro f/ (one two)
  `(the single-float (/ (the single-float ,one) (the single-float ,two))))

(defmacro i+ (one two)
  `(the fixnum (+ (the fixnum ,one) (the fixnum ,two))))

(defmacro i- (one two)
  `(the fixnum (- (the fixnum ,one) (the fixnum ,two))))

(defmacro i> (one two)
  `(> (the fixnum ,one) (the fixnum ,two)))

(defun fft-with-declarations (areal aimag) ;fast fourier transform
  (declare (optimize (safety 0) (speed 3))
           (type (simple-vector single-float) areal aimag))
  (prog (ar ai i j k m n le le1 ip nv2 nm1 ur ui wr wi tr ti)
    (declare (type (simple-vector single-float) ar ai)
             (fixnum i j k m n le le1 ip nv2 nm1)
             (single-float ur ui wr wi tr ti))
    (setq ar areal    ;initialize
          ai aimag
     n (array-dimension ar 0)
     n (1- n)
     nv2 (floor n 2)
     nm1 (1- n)
     m 0     ;compute m = log(n)
     i 1)
    l1 (cond ((< i n)
         (setq m (1+ m)
          i (+ i i))
         (go l1)))
    (cond ((not (equal n (expt 2 m)))
      (princ "error ... array size not a power of two.")
      (read)
      (return (terpri))))
    (setq j 1     ;interchange elements
     i 1)     ;in bit-reversed order
    l3 (cond ((< i j)
         (setq tr (aref ar j)
          ti (aref ai j))
         (setf (aref ar j) (aref ar i))
         (setf (aref ai j) (aref ai i))
         (setf (aref ar i) tr)
         (setf (aref ai i) ti)))
    (setq k nv2)
    l6 (cond ((< k j)
         (setq j (- j k)
          k (floor k 2))
         (go l6)))
    (setq j (+ j k)
     i (1+ i))
    (cond ((< i n)
      (go l3)))
    (do ((l 1 (i+ 1 l))) ((i> l m))   ;loop thru stages
      (declare (fixnum l))
      (setq le (the fixnum (expt 2 l))
            le1 (the fixnum (floor le 2))
            ur 1.0
            ui 0.0
            wr (the single-float (cos (f/ (single-float pi) (single-float le1))))
            wi (the single-float (sin (f/ (single-float pi) (single-float le1)))))
      (do ((j 1 (i+ 1 j))) ((i> j le1))  ;loop thru butterflies
        (declare (fixnum j))
        (do ((i j (i+ i le))) ((i> i n))  ;do a butterfly
          (declare (fixnum i))
          (setq ip (i+ i le1)
                tr (f- (f* (aref ar ip) ur)
                       (f* (aref ai ip) ui))
                ti (f+ (f* (aref ar ip) ui)
                       (f* (aref ai ip) ur)))
          (setf (aref ar ip) (f- (aref ar i) tr))
          (setf (aref ai ip) (f- (aref ai i) ti))
          (setf (aref ar i) (f+ (aref ar i) tr))
          (setf (aref ai i) (f+ (aref ai i) ti))))
      (setq tr (f- (f* ur wr) (f* ui wi))
            ti (f+ (f* ur wi) (f* ui wr))
            ur tr
            ui ti))
    (return t)))