Yes. Absolutely.
The running joke in the research community is that current LLMs are trained directly on benchmarks and on the common prompts people try in LM-Arena, like the "how many r's in strawberry" question. This isn't speculation: Meta got caught red-handed doing exactly this, running a separate finetune just to look good on LM-Arena. And some benchmarks like MMLU contain erroneous answer keys that many LLMs nonetheless "answer correctly" by reproducing the wrong labels.
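For reference, the strawberry question became a meme precisely because the ground truth is trivially checkable in one line of code, yet models kept getting it wrong until it was (presumably) trained in:

```python
# Count occurrences of 'r' in "strawberry" — the famous test question.
count = "strawberry".count("r")
print(count)  # → 3
```

Once a question like this circulates widely enough, its appearance in a model's answers says more about training data than about reasoning.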
It’s not that any single person is collecting all of these, though.