Like the “how many r’s in strawberry” question. It took off as an Internet meme and got fixed, but how did that fix happen?

  • brucethemoose@lemmy.world
    16 hours ago

    Yes. Absolutely.

    The meme in the research community is that current LLMs are literally trained on benchmarks and on the common stuff people test in LM-Arena, like the “how many r’s in strawberry” question. I’m not speculating here: Meta literally got caught red-handed doing this. They ran a separate finetune just to look good on LM-Arena. And some benchmarks, like MMLU, contain errors that many LLMs still answer “correctly”, matching the flawed answer key.

    It’s not as if a single person is collecting all of these, though.