I've heard the Kenya and Nigeria story, but has anyone backed it up with quantitative evidence that the vocabulary LLMs overuse coincides with the vocabulary that is more common in Kenyan and Nigerian English than in American English?
The newer Claude models constantly use the word "genuinely" because Anthropic seems to have forcibly trained them to claim to be "genuinely uncertain" about anything it doesn't want them being too certain about, like whether or not they're sentient.
Interesting. Does this apply to all subjects? From what I understood, a major cause of hallucination is that models are inadvertently discouraged during training from saying "I don't know." So it sounds like encouraging them to express uncertainty could improve that situation.
That's not a major issue. Any newer model with reasoning/web search has to be able to tell when it doesn't know something; otherwise it doesn't know when to search for it.
Interestingly, because perplexity (the exponential of the cross-entropy loss) is the pretraining objective, the pretrained base models should produce the least surprising outputs of all.
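For concreteness: perplexity is just the exponential of the average per-token cross-entropy, so minimizing one minimizes the other. A minimal sketch in Python, using made-up token probabilities rather than anything from a real model:

```python
import math

# Hypothetical probabilities a model assigned to the tokens that actually came next.
token_probs = [0.40, 0.05, 0.60, 0.20]

# Cross-entropy: average negative log-probability of the observed tokens.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is exp(cross-entropy); lower means the text was less "surprising" to the model.
perplexity = math.exp(cross_entropy)
print(perplexity)  # ~4.5 for these made-up numbers
```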