It actually works on top of an image captioning model. SD also responds to keywords like "artstation" and "octane render", which standard captioning doesn't cover, and that is the difference between using an off-the-shelf captioning model and this.
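
To make that concrete, here is a rough sketch of the idea, not the tool's actual code: take the caption from a captioning model and append the style keywords that score highest for the image. The function name `build_prompt`, the keyword list, and the scores are all made up for illustration; in practice the scores would have to come from some image-text similarity model, which is an assumption on my part.

```python
from typing import Dict


def build_prompt(caption: str, keyword_scores: Dict[str, float], top_k: int = 2) -> str:
    """Append the top_k highest-scoring style keywords to a base caption.

    keyword_scores is a hypothetical mapping of keyword -> similarity to the
    image; how those scores are computed is outside this sketch.
    """
    # Rank keywords from highest to lowest score.
    ranked = sorted(keyword_scores, key=keyword_scores.get, reverse=True)
    # Join the caption and the selected keywords into one comma-separated prompt.
    return ", ".join([caption] + ranked[:top_k])


# Made-up scores, for illustration only.
scores = {"artstation": 0.31, "octane render": 0.27, "watercolor": 0.12}
prompt = build_prompt("a castle on a hill at sunset", scores)
print(prompt)  # a castle on a hill at sunset, artstation, octane render
```

An off-the-shelf captioner would stop at "a castle on a hill at sunset"; the extra keywords are the part it doesn't give you.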