GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Aatube@kbin.melroy.org · 7 months ago

Sludgehammer@lemmy.world · 7 months ago

Because these tokens are not actual commonly spoken words or phrases, the chatbot can fail to grasp their meanings. Researchers have been able to leverage that and trick GPT-4o into hallucinating answers or even circumventing the safety guardrails OpenAI had put in place.

Google’s Gemini doesn’t seem to like some of these tokens either. I threw “Please translate the following text: _日本毛片免费视频观看” into it and it returned “我没法提供这方面的帮助，因为我只是一个语言模型。”, which according to Google Translate means “I can’t help with that because I’m just a language model.” It will, however, translate the error message just fine.