Hacker News

Are there any technical innovations here over Moshi, which invented some of the pieces used in this model? The only difference I see is that they split the temporal and depthwise transformers on the zeroth RVQ codebook, whereas Moshi uses a special zeroth-level vector quantizer, distilled from a larger audio model, with the intent of preserving semantic information.
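For anyone unfamiliar with the term: in residual vector quantization (RVQ), each codebook quantizes the residual left over by the previous level, so codebook 0 carries the coarsest information — which is why both models give it special treatment. A minimal numpy sketch (codebook shapes and names here are illustrative, not either model's actual implementation):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x level by level: each codebook encodes the residual
    left by the previous one, so level 0 is the coarsest code."""
    residual = x.astype(float).copy()
    codes = []
    for cb in codebooks:  # cb has shape (num_entries, dim)
        dists = ((residual[None, :] - cb) ** 2).sum(axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]  # pass the remainder down a level
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entry from each level's codebook."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))
```

The point of the distilled zeroth-level quantizer in Moshi is that this coarsest code is trained to match a semantic audio representation rather than just minimizing reconstruction error.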

EDIT: also, Moshi started from a pretrained traditional text LLM.


