How do large language models work beyond benchmark scores? | EducationPals.ai