The SPEs only had cache! (And, a ton of registers.) What they were missing was RAM ;D
But, even that wasn’t as bad as it was made out to be. People rightly moan about the awkwardness of asynchronously moving data between main RAM and the SPE memory. What they don’t often mention is that the latency of those moves was about 500 cycles, the same latency as a cache miss on the PPE CPUs!
So, which was worse: implicitly waiting 500 cycles all over the place? Or, explicitly scheduling 500-cycle waits at specific points? Unsurprisingly, everyone preferred the first option :P
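
To put rough numbers on that trade-off, here is a toy cost model in plain C. It is not SPE code and uses no real DMA API: the function names, the chunked workload, and the one-stall-per-chunk simplification are all assumptions made for illustration. The only figure taken from above is the roughly 500-cycle latency. The second function models the usual double-buffering trick, where the next transfer is kicked off before working on the current chunk.

```c
#include <stdint.h>

/* Hypothetical cost model, not SPE code: every number here is an assumption. */
enum { FETCH_LATENCY = 500 }; /* cycles per transfer or cache miss (figure from the text) */

/* Implicit waits: assume each chunk stalls for the full latency before
   its compute can start, with nothing overlapping the stall. */
uint64_t cycles_implicit(uint64_t chunks, uint64_t compute_per_chunk)
{
    return chunks * (FETCH_LATENCY + compute_per_chunk);
}

/* Explicit waits with double buffering: the fetch of chunk i+1 is issued
   before computing chunk i, so only the part of the latency that the
   compute does not cover is actually spent waiting. */
uint64_t cycles_double_buffered(uint64_t chunks, uint64_t compute_per_chunk)
{
    if (chunks == 0)
        return 0;

    uint64_t total = FETCH_LATENCY; /* the first fetch has nothing to hide behind */
    for (uint64_t i = 0; i < chunks; i++) {
        total += compute_per_chunk;
        /* If another chunk follows and compute was shorter than the
           latency, wait out the exposed remainder of its fetch. */
        if (i + 1 < chunks && compute_per_chunk < FETCH_LATENCY)
            total += FETCH_LATENCY - compute_per_chunk;
    }
    return total;
}
```

Under these assumptions, four chunks at 500 cycles of compute each cost 4000 cycles with implicit stalls and 2500 with scheduled transfers. A real cache would hide some of those misses with prefetching and a real SPE program has to pay for the bookkeeping, so treat this as a sketch of the argument, not a benchmark.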