Note that the LabVIEW portion of the benchmarks is missing because I was unable to complete the tests successfully. I hope to work around this soon.
This is the square and signed square of a real vector. Double precision takes much more than twice as long as single precision, but the non-vector "FOR LOOP" calculation is 5 times worse! The vector calculation shows almost no difference between square and signed square. Of course, the signed square is much slower than the regular square for the non-vector operations. We can see the effects of cache size on the veclib, with the optimum size being 2^12, or about 32K bytes of input vector. Double precision does not change much with vector length and in fact surpasses single precision in performance at the longest vectors.
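For reference, the non-vector versions of these two operations can be sketched as plain loops. This is a minimal sketch, not the benchmark source: the function names are hypothetical, and it assumes that "signed square" means x·|x| (the square with the sign of the input kept), which the page does not spell out.

```c
#include <math.h>
#include <stddef.h>

/* Plain "FOR LOOP" square: out[i] = in[i]^2.
   Hypothetical name; not taken from the benchmark code. */
void square_f(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* Signed square, assumed here to mean out[i] = in[i] * |in[i]|,
   so negative inputs give negative outputs. */
void signed_square_f(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * fabsf(in[i]);
}
```

The extra absolute value per element is one plausible reason the scalar signed square is slower than the plain square.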
These are the dot products of Real, Complex, Inner Complex and Real⋅Complex vectors. The fastest is of course the Real single precision calculation, but the double is not far behind. The simple "For Loop" is an order of magnitude slower, but that difference is smaller at the larger problem sizes. Igor is way behind on this, probably due to the overhead of the package.
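The simple "For Loop" versions of these dot products might look like the sketch below. The function names are hypothetical, C99 `<complex.h>` is used for brevity, and it assumes that the "inner" complex dot product conjugates the first vector; the page does not state which convention the benchmark uses.

```c
#include <complex.h>
#include <stddef.h>

/* Real dot product: sum of a[i] * b[i]. */
float dot_real(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Complex dot product without conjugation: sum of a[i] * b[i]. */
float complex dot_complex(const float complex *a,
                          const float complex *b, size_t n)
{
    float complex sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Inner complex dot product, assumed to be sum of conj(a[i]) * b[i]. */
float complex inner_dot_complex(const float complex *a,
                                const float complex *b, size_t n)
{
    float complex sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += conjf(a[i]) * b[i];
    return sum;
}
```

Each loop carries a single running sum, so every iteration depends on the previous one; a vector library is free to keep several partial sums in flight.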
This is the addition, subtraction, multiplication and the multiply-add function on single and double precision real numbers. Again the differences get smaller at higher vector lengths.
This is a similar set of graphs of addition, subtraction, multiplication and conjugate multiplication-addition, but using complex vectors.
The third set of graphs shows addition, subtraction, and multiplication of a real vector with a complex vector.
It is odd that the convolution for the veclib calculations seems to grow linearly with problem size. Since one factor of the size is already divided out, it seems as if the number of calculations grows as N^2 and not as N log N, which would be more like an FFT. It should grow at the same rate as the FFT timing but does not seem to. For many basic calculations the veclib is very good. It is highly optimized for dot products and vector-vector operations. It is odd that there is no vector-scalar operation for the case where a vector needs to be rotated, that is, Complex Vector * Complex Scalar. One could do the computation by making a constant vector and multiplying, but it could be implemented faster as a basic routine, since the constant would not need to be repeatedly loaded into the Altivec registers, avoiding memory bottlenecks etc.
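To make the missing operation concrete, here is a minimal scalar sketch of Complex Vector * Complex Scalar. It is not a veclib routine: the function name is hypothetical, and the separate real and imaginary arrays are an assumption made for illustration. The point is only that the scalar stays in two local variables for the whole loop instead of being read from a constant vector.

```c
#include <stddef.h>

/* Multiply a complex vector, stored as separate real and imaginary
   arrays (an assumed layout), in place by the scalar (sr + i*si).
   Hypothetical helper for illustration only. */
void cvec_scale(float *re, float *im, size_t n, float sr, float si)
{
    for (size_t i = 0; i < n; i++) {
        float xr = re[i];
        float xi = im[i];
        re[i] = xr * sr - xi * si;  /* real part of (xr + i*xi)(sr + i*si) */
        im[i] = xr * si + xi * sr;  /* imaginary part */
    }
}
```

With `sr = cos(theta)` and `si = sin(theta)` this rotates every element by the angle `theta`.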
The Numerical Recipes algorithm is about a factor of 10 slower and Igor is about half that behind in performance. Double precision is about a factor of 3 slower, but at the best performance, at a vector size of 2^10, the real-time bandwidth would be about 160 MHz in single precision and 40 MHz in double precision (800 MHz CPU assumed). Igor seems to be somewhat erratic in performance, and it may just be the time sampling on the system. I hope to improve that test in the near future.
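The bandwidth figures can be related to the normalized timings with a one-line conversion. This sketch assumes that "real time bandwidth" means the sustained number of points processed per second, i.e. the CPU clock divided by the cycles spent per point; the function name is hypothetical. Under that reading, the quoted 160 MHz and 40 MHz correspond to about 5 and 20 cycles per point at 800 MHz (values back-computed from the text, not read from the data files).

```c
/* Sustained points per second, assuming bandwidth = CPU clock / cycles per point.
   Hypothetical helper; the interpretation is an assumption. */
double realtime_bandwidth_hz(double cpu_hz, double cycles_per_point)
{
    return cpu_hz / cycles_per_point;
}
```

For example, `realtime_bandwidth_hz(800e6, 5.0)` gives 160e6 points per second.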
The functions tested are:
| Single Vector operations: 4 tests | | | Veclib | Non-Vector | Igor | LabVIEW |
|---|---|---|---|---|---|---|
| Square | Single/Double | Real | Y | Y | N | |
| Signed Square | Single/Double | Real | Y | Y | N | |
| Vector→Scalar: 8 tests | | | Veclib | Non-Vector | Igor | LabVIEW |
| Dot | Single/Double | Real/Complex | Y | Y | Y | |
| Inner Dot | Single/Double | Complex | Y | Y | Y | |
| Dot | Single/Double | Real * Complex | Y | Y | Y | |
| Vector-Vector: 22 tests | | | Veclib | Non-Vector | Igor | LabVIEW |
| Add | Single/Double | Real/Complex/Real*Complex | Y | Y | Y | |
| Subtract | Single/Double | Real/Complex/Real*Complex | Y | Y | Y | |
| Multiply | Single/Double | Real/Complex/Real*Complex | Y | Y | Y | |
| Add-Multiply | Single/Double | Real | Y | Y | Y | |
| Conjugate Multiply-Add | Single/Double | Complex | Y | Y | Y | |
| Convolution: 4 tests | | | Veclib | Non-Vector | Igor | LabVIEW |
| Convolution | Single/Double | Real/Complex | Y | N | Y | |
| FFT: 12 tests | | | Veclib | Non-Vector | Igor | LabVIEW |
| In place w/translation | Single/Double | Real | Y | Y | Y | |
| In place | Single/Double | Real/Complex | Y | N | N | |
| Out of Place w/translation | Single/Double | Real | Y | Y | Y | |
| Out of Place | Single/Double | Real/Complex | Y | N | N | |
| 2D FFT: 8 tests | | | Veclib | Non-Vector | Igor | LabVIEW |
| In place | Single/Double | Real/Complex | Y | N | Y | |
| Out of Place | Single/Double | Real/Complex | Y | N | N | |
That is a total of 58 tests, of which 38 also have non-vector versions and 42 have Igor versions.
The "non-vector" calculations are simple "FOR LOOPS" and code from Numerical Recipes. No attempt was made to optimize them; they represent what a low-effort calculation would be. No comments about the reliability, accuracy or general coding style of Numerical Recipes, please.
The calculations were repeated numerous times and the total time for the repetitions was taken. The times were normalized by the vector length and a single CPU speed. No adjustment is made for multiple CPUs, so multi-CPU enhancements should show up in the graphs.
CPU Cycles/Problem Size = (Total Time * CPU Speed)/(Repetitions * Vector Length)
For the veclib tests the sizes, repetitions and run times were the following (sizes and repetitions are given as Log2):
| Size | C code (vector and non-vector) repetitions | C code times (96 tests) | Igor repetitions | Igor times (42 tests) | LabVIEW repetitions | LabVIEW times |
|---|---|---|---|---|---|---|
| 6 | 21 | 6:24:14 | 19 | 0:13:21 | | |
| 7 | 20 | 3:15:12 | 18 | 0:09:30 | | |
| 8 | 19 | 1:41:10 | 17 | 0:07:19 | | |
| 9 | 18 | 0:54:45 | 16 | 0:06:47 | | |
| 10 | 19 | 2:10:40 | 17 | 0:06:13 | | |
| 11 | 18 | 1:48:17 | 16 | 0:05:56 | | |
| 12 | 17 | 2:07:55 | 15 | 0:05:51 | | |
| 13 | 16 | 3:13:21 | 14 | 0:05:56 | | |
| 14 | 15 | 5:51:55 | 13 | 0:06:04 | | |
| 15 | 14 | 11:44:33 | 12 | 0:06:19 | | |
| 16 | 8 | 0:48:42 | 11 | 0:06:44 | | |
| 17 | 7 | 1:45:53 | 10 | 0:07:27 | | |
| 18 | 6 | 4:28:44 | 9 | 0:08:48 | | |
| 19 | 5 | 12:12:30 | 8 | 0:10:07 | | |
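The normalization formula given before the table can be written as a small helper. This is a sketch with a hypothetical function name, not the benchmark's own code; it takes the size and repetition count as Log2 values, as they are listed in the table.

```c
#include <math.h>

/* CPU cycles per element:
   (total time * CPU speed) / (repetitions * vector length),
   with repetitions and length passed as log2 values.
   Hypothetical helper, not taken from the benchmark source. */
double cycles_per_element(double total_seconds, double cpu_hz,
                          int log2_reps, int log2_len)
{
    /* ldexp(1.0, k) is 2^k, so this is repetitions * vector length. */
    double work = ldexp(1.0, log2_reps + log2_len);
    return (total_seconds * cpu_hz) / work;
}
```

For example, a test of size 2^10 repeated 2^10 times that took one second on an 800 MHz CPU would come out at about 763 cycles per element; lower is better.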
In summary, the veclib operations are good and easy to implement. Double precision is slower than single precision, but it is still much better than non-vector solutions. Packages such as Igor and LabVIEW have extra overhead that shows up as a speed hit. It is interesting that the performance for Igor does not get better for longer problems, since the overhead per unit length should be smaller. The Igor and LabVIEW packages tend to do all calculations in double precision, for the obvious reason that it makes things simpler, but then they can't take advantage of the extra speed when single precision is sufficient.
All calculations were done on a dual 800 MHz Quicksilver Macintosh, either in single-user mode or with as many background processes killed as possible. The gcc version 3.1 compiler with maximum optimizations for speed was used. For those who want the gory details to make their own comparisons, the raw data files are: veclib, Igor, and LabVIEW. All numbers are (time * CPU Speed)/(problem size * iterations); lower is better. For copies of the implementation of the benchmarks, see the veclib Project Builder files, Igor Experiment (including these graphs), and LabVIEW code. Please send me any comments or corrections.