3rd paragraph:

> In this report, we describe the decoupled-lookback method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives

The CUB library documentation also states:

https://nvlabs.github.io/cub/structcub_1_1_device_scan.html

> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]

Where [1] is a footnote pointing at the exact paper you just linked.

-----------

> It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

That certainly sounds spinlock-ish. At least that gives me something to look for in the code.
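
To make concrete what I'd expect that polling to look like, here is a rough CPU-side sketch using C++ threads and atomics. It is not CUB's code and not the paper's pseudocode: the names `Descriptor`, `look_back` and `inclusive_scan_lookback` are made up for illustration, one `std::thread` stands in for each thread block, and only the X / A / P status flags are taken from the paper. The `while` loop in `look_back` is the "repeated polling while the flag is X" part.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Status flags as named in the paper:
// X = invalid, A = aggregate available, P = inclusive prefix available.
enum Status : std::uint32_t { X = 0, A = 1, P = 2 };

// Hypothetical per-partition descriptor (illustrative, not CUB's layout).
struct Descriptor {
    std::atomic<std::uint32_t> status{X};
    long aggregate = 0;         // written before status becomes A
    long inclusive_prefix = 0;  // written before status becomes P
};

// Walk backwards over predecessors, accumulating their aggregates until
// one of them has published a full inclusive prefix.
long look_back(std::vector<Descriptor>& desc, std::size_t id) {
    long exclusive = 0;
    for (std::size_t j = id; j-- > 0;) {
        std::uint32_t s;
        // The spin: poll the predecessor's flag for as long as it is X.
        while ((s = desc[j].status.load(std::memory_order_acquire)) == X) {
            std::this_thread::yield();
        }
        if (s == P) {
            exclusive += desc[j].inclusive_prefix;
            break;  // everything before partition j is already included
        }
        exclusive += desc[j].aggregate;  // s == A: keep looking further back
    }
    return exclusive;
}

// One thread per partition; a real GPU kernel would use thread blocks.
std::vector<long> inclusive_scan_lookback(const std::vector<long>& in,
                                          std::size_t tile) {
    assert(tile > 0);
    const std::size_t n = in.size();
    const std::size_t parts = (n + tile - 1) / tile;
    std::vector<long> out(n);
    std::vector<Descriptor> desc(parts);
    std::vector<std::thread> workers;
    for (std::size_t id = 0; id < parts; ++id) {
        workers.emplace_back([&, id] {
            const std::size_t lo = id * tile;
            const std::size_t hi = std::min(n, lo + tile);
            long sum = 0;
            for (std::size_t i = lo; i < hi; ++i) sum += in[i];
            // Publish the local aggregate first so successors can progress.
            desc[id].aggregate = sum;
            desc[id].status.store(A, std::memory_order_release);
            const long exclusive = look_back(desc, id);
            // Publish the inclusive prefix so later look-backs can stop here.
            desc[id].inclusive_prefix = exclusive + sum;
            desc[id].status.store(P, std::memory_order_release);
            long running = exclusive;
            for (std::size_t i = lo; i < hi; ++i) {
                running += in[i];
                out[i] = running;
            }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Calling `inclusive_scan_lookback({1, 2, 3, 4, 5, 6, 7}, 3)` splits the input into three partitions. Every partition publishes its aggregate before it starts polling, so the wait on X is bounded as long as each predecessor eventually gets to run. Whether that holds on a GPU depends on how blocks are scheduled, which this sketch says nothing about.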