What do you mean "so slow"? It's by far the fastest framework covered by the paper in scenarios where threads don't outnumber CPU cores.
Taken from the article itself:
"However, Torch still achieves the best performance in our experiments in which Torch has nearly 12x speed up compared with TensorFlow under 4-thread setting."
Why can't Torch use more threads than there are CPU cores?
Taken from the article itself:
"both of them cannot run normally when threads usage is set to be bigger than the number of CPU cores on desktop CPU."
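For what it's worth, the safe pattern is simply to clamp the requested thread count to the core count before handing it to the framework. A minimal sketch of that guard (`clamp_threads` is a made-up helper, not anything from the paper or either framework):

```python
import os

def clamp_threads(requested: int) -> int:
    """Clamp a requested worker count to the number of CPU cores.

    Running more compute threads than cores forces context switches,
    so a sane setup caps the count at the core count.
    """
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, cores))
```

Whether the crash the paper saw is a framework bug or a configuration error, a clamp like this sidesteps the question entirely.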
Did the authors set up the system correctly?
You're right that Torch is faster than TensorFlow on the RNN benchmark. But Torch is slower than TensorFlow on AlexNet and ResNet.
There is a broader set of benchmarks covering many DL frameworks at https://github.com/soumith/convnet-benchmarks
Context switching is expensive. Thread #1's working set has to be evicted from the caches and thread #2's pulled in, so you end up bottlenecked by memory bandwidth and latency rather than by raw compute.
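A rough way to see this for yourself (a sketch, not the paper's setup: processes stand in for threads to sidestep Python's GIL, and `touch_memory`/`run` are made-up names; the exact numbers will vary by machine):

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor

def touch_memory(n: int) -> int:
    # Stand-in workload: materialize a list so each worker drags a
    # working set through the caches, then sum it.
    data = list(range(n))
    return sum(data)

def run(workers: int, tasks: int = 4, n: int = 500_000) -> float:
    # Time `tasks` copies of the workload on a pool of `workers`.
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(touch_memory, [n] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # Same total work both times; only the worker count changes.
    # With one worker per core each working set tends to stay cached;
    # oversubscribing forces switches that evict and refill it.
    print(f"matched:        {run(cores, tasks=cores * 2):.3f}s")
    print(f"oversubscribed: {run(cores * 2, tasks=cores * 2):.3f}s")
```

On a quiet machine the matched run usually wins or ties; the gap grows as the per-task working set gets bigger than the caches.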