Are there any technical innovations here over Moshi, which invented some of the pieces they use for their model? The only difference I see is that they split the model into a temporal transformer and a depthwise transformer at the zeroth RVQ codebook, whereas Moshi has a special zeroth-level vector quantizer distilled from a larger audio model, with the intent of preserving semantic information.
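For anyone unfamiliar with that split, here's a rough PyTorch sketch of the general idea (sizes, names, and masking details are my own guesses, not the authors' code): a temporal transformer runs across frames and predicts only the zeroth RVQ codebook, and a smaller depthwise transformer runs across the remaining codebook levels within each frame, conditioned on the temporal state.

```python
import torch
import torch.nn as nn

K, V, D = 8, 1024, 256  # codebooks, codebook size, model dim (all illustrative)

class SplitRVQDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # one embedding table per RVQ level, stored flat and indexed with offsets
        self.embed = nn.Embedding(K * V, D)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=4)
        self.depthwise = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        self.head0 = nn.Linear(D, V)      # zeroth codebook: from the temporal stack
        self.head_rest = nn.Linear(D, V)  # levels 1..K-1: from the depthwise stack

    def forward(self, codes):             # codes: (B, T, K) integer RVQ indices
        B, T, _ = codes.shape
        offsets = torch.arange(K, device=codes.device) * V
        emb = self.embed(codes + offsets)                   # (B, T, K, D)
        # temporal pass over frames (in training you'd shift inputs one frame
        # so position t only conditions on frames < t)
        tmask = nn.Transformer.generate_square_subsequent_mask(T)
        frame = self.temporal(emb.sum(dim=2), mask=tmask)   # (B, T, D)
        logits0 = self.head0(frame)                         # (B, T, V)
        # depthwise pass within each frame: start from the temporal state, then
        # the embeddings of levels 0..K-2, predicting levels 1..K-1 causally
        depth_in = torch.cat([frame.unsqueeze(2), emb[:, :, :-1]], dim=2)
        dmask = nn.Transformer.generate_square_subsequent_mask(K)
        depth_out = self.depthwise(depth_in.reshape(B * T, K, D), mask=dmask)
        logits_rest = self.head_rest(depth_out[:, 1:]).reshape(B, T, K - 1, V)
        return logits0, logits_rest

model = SplitRVQDecoder()
l0, lrest = model(torch.randint(0, V, (2, 10, K)))
print(l0.shape, lrest.shape)  # (2, 10, 1024) and (2, 10, 7, 1024)
```

The point of the split is that the expensive temporal stack only runs once per frame, while the cheap depthwise stack handles the remaining RVQ levels; Moshi instead gets semantic grounding from its distilled zeroth-level quantizer.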
EDIT: also, Moshi started from a pretrained traditional text LLM.