It does not make sense that the ratios would be smaller for double and larger for float. For both types division is implemented via an iterative process and starts with a HW approximation to the reciprocal of roughly the same accuracy (for technical reasons, slightly less accurate in the double case, actually). It therefore stands to reason that the double computation should require one additional iteration, and its cost measured in multiples of the cost of a multiply should therefore be higher.

I wonder whether the float computation hits the slow path of the computation due to overflow or underflow? An alternative hypothesis would be that the float division code is less optimized. Or maybe some third explanation that I am currently failing to take into account because it has been many years since I last looked at the emulation sequences in detail.

I built my own test framework independently, and the resulting data matches up quite well with Robert Crovella's experiments. On my Quadro RTX 4000 (Turing), I find the following throughput:

Single precision, throughput ratio of 15.7x between multiplies and divides.

Double precision, throughput ratio of 8.8x between multiplies and divides.

The multiplication throughput matches what one would expect based on published specifications (within noise caused by dynamic clocking), giving me some confidence that the framework works correctly.

On further reflection, the initially surprising lower efficiency of the single-precision division relative to multiplies may be caused by function call overhead and lower SFU throughput (for the starting approximation of the reciprocal) relative to FP32 operations.

Call overhead indeed seems to be the reason for the comparatively slow single-precision division.
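To make the "HW approximation to the reciprocal, then iterate" argument concrete, here is a minimal host-side C++ sketch of division by Newton-Raphson refinement. It is not NVIDIA's actual emulation sequence: `rough_reciprocal`, `refine_reciprocal`, `newton_divide`, and the `seed_bits` parameter are hypothetical names chosen for illustration, and the seed is simulated by truncating an exact reciprocal rather than coming from the SFU. It also ignores the overflow, underflow, and special-case handling that a real slow path would cover.

```cpp
#include <cmath>

// Hypothetical stand-in for a hardware reciprocal approximation:
// computes 1/b, then keeps only `seed_bits` bits of the mantissa.
// Assumes b is finite and nonzero.
double rough_reciprocal(double b, int seed_bits) {
    int exp = 0;
    double m = std::frexp(1.0 / b, &exp);        // |m| in [0.5, 1)
    double scale = std::ldexp(1.0, seed_bits);   // 2^seed_bits
    m = std::floor(m * scale) / scale;           // drop the low mantissa bits
    return std::ldexp(m, exp);
}

// Newton-Raphson refinement of r ~ 1/b: r <- r + r * (1 - b*r).
// Each step roughly squares the relative error, i.e. doubles the
// number of correct bits.
double refine_reciprocal(double b, double r, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        double e = std::fma(-b, r, 1.0);         // residual 1 - b*r
        r = std::fma(r, e, r);
    }
    return r;
}

// Quotient a/b from a refined reciprocal plus one residual correction.
double newton_divide(double a, double b, int seed_bits, int iterations) {
    double r = refine_reciprocal(b, rough_reciprocal(b, seed_bits), iterations);
    double q = a * r;                            // first quotient estimate
    double rem = std::fma(-b, q, a);             // remainder a - b*q
    return std::fma(rem, r, q);                  // corrected quotient
}
```

In this toy setup, with an 8-bit seed and the operands 1 and 3, a single refinement step plus the final correction leaves a relative error of about 2^-32, which is below single-precision resolution but far from double-precision resolution; `newton_divide(1.0, 3.0, 8, 2)` with a second refinement step gets to within rounding error of `1.0 / 3.0`. That is the sense in which the wider type would be expected to need one more iteration when both start from a seed of similar accuracy.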
Kernel signature and timing output from t1837.cu:

    __global__ void k(T * __restrict__ d, const int N, const T denom)

    std::cout << "elapsed time: " << dt << "us" << std::endl;

Build commands for the test variants:

    $ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DIV
    $ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DOUBLE
    $ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DOUBLE -DUSE_DIV