Benchmarks sit between the Data and the Models, and they are the least obvious and least glamorous part of the stack, yet among the most influential to language model development. They're also an area of intense specialization.
The easiest way I know to run the benchmarks yourself is EleutherAI's LM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness
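As a sketch of what using the harness looks like (the exact flags below reflect the `lm_eval` CLI in recent releases; the model name and task choice here are just illustrative examples, and the run requires a working install plus model weights downloaded from the Hugging Face Hub):

```shell
# Install the harness from PyPI
pip install lm-eval

# List the available benchmark tasks
lm_eval --tasks list

# Evaluate a Hugging Face model (gpt2 as an example) on HellaSwag
lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks hellaswag \
    --batch_size 8
```

The harness handles downloading the benchmark data, formatting the prompts, and scoring, so results are comparable to the numbers reported in papers that use the same task configuration.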