Are there any technical innovations here over Moshi, which invented some of the pieces they use for their model? The only difference I see is that they split the model into a temporal transformer and a depthwise transformer at the zeroth RVQ codebook, whereas Moshi has a special zeroth-level vector quantizer distilled from a larger audio model, with the intent of preserving semantic information.
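For anyone unfamiliar with that split, here's a rough PyTorch sketch of the general idea (sizes, names, and masking details are my own guesses, not the authors' code): a temporal transformer runs across frames and predicts only the zeroth RVQ codebook, and a smaller depthwise transformer runs across the remaining codebook levels within each frame, conditioned on the temporal state.

```python
import torch
import torch.nn as nn

K, V, D = 8, 1024, 256  # codebooks, codebook size, model dim (all illustrative)

class SplitRVQDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # one embedding table per RVQ level, stored flat and indexed with offsets
        self.embed = nn.Embedding(K * V, D)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=4)
        self.depthwise = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        self.head0 = nn.Linear(D, V)      # zeroth codebook: from the temporal stack
        self.head_rest = nn.Linear(D, V)  # levels 1..K-1: from the depthwise stack

    def forward(self, codes):             # codes: (B, T, K) integer RVQ indices
        B, T, _ = codes.shape
        offsets = torch.arange(K, device=codes.device) * V
        emb = self.embed(codes + offsets)                   # (B, T, K, D)
        # temporal pass over frames (in training you'd shift inputs one frame
        # so position t only conditions on frames < t)
        tmask = nn.Transformer.generate_square_subsequent_mask(T)
        frame = self.temporal(emb.sum(dim=2), mask=tmask)   # (B, T, D)
        logits0 = self.head0(frame)                         # (B, T, V)
        # depthwise pass within each frame: start from the temporal state, then
        # the embeddings of levels 0..K-2, predicting levels 1..K-1 causally
        depth_in = torch.cat([frame.unsqueeze(2), emb[:, :, :-1]], dim=2)
        dmask = nn.Transformer.generate_square_subsequent_mask(K)
        depth_out = self.depthwise(depth_in.reshape(B * T, K, D), mask=dmask)
        logits_rest = self.head_rest(depth_out[:, 1:]).reshape(B, T, K - 1, V)
        return logits0, logits_rest

model = SplitRVQDecoder()
l0, lrest = model(torch.randint(0, V, (2, 10, K)))
print(l0.shape, lrest.shape)  # (2, 10, 1024) and (2, 10, 7, 1024)
```

The point of the split is that the expensive temporal stack only runs once per frame, while the cheap depthwise stack handles the remaining RVQ levels; Moshi instead gets semantic grounding from its distilled zeroth-level quantizer.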
EDIT: also, Moshi started from a pretrained traditional text LLM.