> In this report, we describe the decoupled-lookback method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives
The CUB library documentation also states:
https://nvlabs.github.io/cub/structcub_1_1_device_scan.html
>> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]
Where [1] is a footnote pointing at the exact paper you just linked.
-----------
> It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.
That certainly sounds spinlock-ish. At least that gives me something to look for in the code.
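
To make the "repeated polling" concrete, here is a rough CPU-side sketch of what I understand the look-back step to be. It is not CUB's code: it uses `std::thread` and `std::atomic` in place of thread blocks and device memory, and every name in it (`TileDescriptor`, `process_tile`, `single_pass_inclusive_scan`) is made up for illustration. The flag states X / A / P follow the naming in the quoted step, which I am assuming mean "nothing published yet", "tile aggregate available", and "inclusive prefix available".

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical status flags: X = nothing published, A = tile aggregate
// published, P = inclusive prefix published.
enum class Flag : int { X, A, P };

struct TileDescriptor {
    std::atomic<Flag> flag{Flag::X};
    long aggregate = 0;         // sum of this tile only; written before flag becomes A
    long inclusive_prefix = 0;  // sum of tiles 0..this; written before flag becomes P
};

// One worker per tile (standing in for one thread block per tile).
void process_tile(std::size_t tile, std::size_t tile_size,
                  const std::vector<int>& in, std::vector<long>& out,
                  std::vector<TileDescriptor>& desc) {
    const std::size_t first = tile * tile_size;
    const std::size_t last = std::min(first + tile_size, in.size());
    const long aggregate =
        std::accumulate(in.begin() + first, in.begin() + last, 0L);

    long exclusive = 0;  // sum of all elements in earlier tiles
    if (tile == 0) {
        // The first tile has no predecessors, so its prefix is known at once.
        desc[0].inclusive_prefix = aggregate;
        desc[0].flag.store(Flag::P, std::memory_order_release);
    } else {
        // Publish the local aggregate so successors are not blocked on us.
        desc[tile].aggregate = aggregate;
        desc[tile].flag.store(Flag::A, std::memory_order_release);

        // Look back over predecessors, nearest first.
        for (std::size_t pred = tile; pred-- > 0;) {
            Flag f;
            // The spin: poll until the predecessor has published something.
            while ((f = desc[pred].flag.load(std::memory_order_acquire)) == Flag::X) {
                std::this_thread::yield();
            }
            if (f == Flag::P) {
                exclusive += desc[pred].inclusive_prefix;
                break;  // everything before pred is already included
            }
            exclusive += desc[pred].aggregate;  // flag A: keep walking back
        }
        desc[tile].inclusive_prefix = exclusive + aggregate;
        desc[tile].flag.store(Flag::P, std::memory_order_release);
    }

    // Local scan of the tile, seeded with the exclusive prefix.
    long running = exclusive;
    for (std::size_t i = first; i < last; ++i) {
        running += in[i];
        out[i] = running;
    }
}

// Assumes tile_size > 0. Spawns one thread per tile, so keep inputs small.
std::vector<long> single_pass_inclusive_scan(const std::vector<int>& in,
                                             std::size_t tile_size) {
    const std::size_t num_tiles = (in.size() + tile_size - 1) / tile_size;
    std::vector<long> out(in.size());
    std::vector<TileDescriptor> desc(num_tiles);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_tiles; ++t) {
        workers.emplace_back([&, t] { process_tile(t, tile_size, in, out, desc); });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

The `while (... == Flag::X)` loop is the spinlock-ish part: a tile only waits on a predecessor that has published nothing at all, and it stops walking back as soon as it meets a P. On a CPU the OS scheduler guarantees the predecessor eventually runs; whether the same holds for GPU thread blocks depends on how tiles are assigned to blocks, which this sketch does not model.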