The pursuit of developing AI agents capable of human-like thinking, planning, and decision-making is a thriving area of research. Large Language Models (LLMs) stand as the foundational technology in this endeavor.

As we delve deeper into advancing their capabilities, several recurring questions emerge:

  • Does the model possess sufficient knowledge to perform tasks accurately and efficiently?
  • How do we effectively activate this knowledge when needed?
  • Can the model emulate complex cognitive behaviors such as reasoning, planning, and decision-making to a satisfactory degree?

This article examines these questions through a recent mini-experiment conducted on the MMLU-Pro benchmark. The findings offer insights into cognitive flexibility and its implications for AI agent development and prompt engineering strategies.

Background: Introducing MMLU-Pro

MMLU-Pro, the latest iteration of the Massive Multitask Language Understanding benchmark, challenges AI models with a diverse and rigorous set of tasks across 14 knowledge domains. Unlike its predecessor, MMLU, MMLU-Pro focuses heavily on reasoning-based multiple-choice questions that span various disciplines. These tasks demand not only broad knowledge but also the ability to apply it contextually—a hallmark of cognitive flexibility in human cognition.

Sample Question from MMLU-Pro

Consider this example from the “business” category:

Given annual earnings per share with a mean of $8.6 and a standard deviation of $3.4, what is the probability of observing an EPS less than $5.5?

  • A: 0.3571
  • B: 0.0625
  • C: 0.2345
  • D: 0.5000
  • E: 0.4112
  • F: 0.1814
  • G: 0.3035
  • H: 0.0923
  • I: 0.2756
  • J: 0.1587

Although categorized under ‘business’, this question requires statistical knowledge, specifically calculating the Z-score:

Z = (X − μ) / σ

Substituting the values gives Z = (5.5 − 8.6) / 3.4 ≈ −0.9118. Consulting the standard normal distribution table, the probability of Z being less than −0.9118 is approximately 18.14%, corresponding to answer “F”.
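The table lookup above can be verified in code. Here is a minimal sketch using only the standard library; the `phi` helper is ours (the standard normal CDF expressed via the error function), not part of the benchmark:

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean, std, x = 8.6, 3.4, 5.5
z = (x - mean) / std   # ≈ -0.9118
p = phi(z)             # ≈ 0.181

print(f"z = {z:.4f}, P(EPS < {x}) = {p:.4f}")
```

The exact CDF gives ≈ 0.181; the standard table value at z = −0.91 is 0.1814, matching option “F”.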

Leveraging Prompt Engineering

Addressing such problems with LLMs necessitates considering:

  • Does the model possess the requisite statistical knowledge?
  • How do we effectively activate this knowledge?
  • Can the model replicate the logical steps required to arrive at the correct answer?

Prompt engineering strategies like “Chain-of-Thought” (CoT) become pertinent. CoT involves guiding the model through reasoning steps either with or without prior examples, aiming to simulate human-like reasoning.

Mini-Experiment Methodology

To explore these dynamics, we ran a mini-experiment with ChatGPT-4, sampling 10 questions from each MMLU-Pro knowledge domain. The experiment aimed to assess:

  • The efficacy of various prompt engineering techniques.
  • The impact of constraining reasoning and cognitive flexibility on model accuracy.

Techniques tested included:

  • Direct Question: {Question}. Select the correct answer from {Answers}. Respond with the chosen letter.
  • Chain-of-Thought (CoT): {Question}. Let’s think step by step and select the correct answer from {Answers}. Respond with the chosen letter.
  • Knowledge Domain Activation: {Question}. Consider the necessary knowledge and concepts and select the correct answer from {Answers}. Respond with the chosen letter.
  • Contextual Scaffolds: {Question}. My expectation is that you will answer correctly. Set up a context to maximize fulfillment and select the correct answer from {Answers}. Respond with the chosen letter.

Experimental Findings

Results indicated that, when reasoning was left unconstrained, the four techniques performed similarly, with the simple Direct Question approach matching the more elaborate prompts. When reasoning was constrained, however, all techniques exhibited a comparable decline in accuracy, from an average of 66% to 51%.

This underscores the effectiveness of simpler prompt strategies and the importance of giving models room to exhibit cognitive flexibility and reasoning.

Considering Compute Efficiency

Token efficiency emerged as another crucial factor, especially as LLMs are integrated into diverse applications. Comparing the accuracy and token generation rates of different prompt strategies revealed that the Direct Question approach, which generated an average of 180 tokens per answer, was notably more efficient than CoT, which averaged 339 tokens per answer.
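The trade-off can be made concrete with an accuracy-per-token comparison. A sketch using the averages reported above; assigning the same average accuracy to both strategies is an approximation based on their similar unconstrained performance:

```python
# Reported averages from the mini-experiment; identical accuracy for both
# strategies is an approximation (the article reports similar performance).
strategies = {
    "direct": {"accuracy": 0.66, "tokens_per_answer": 180},
    "cot":    {"accuracy": 0.66, "tokens_per_answer": 339},
}

for name, s in strategies.items():
    # Accuracy bought per 100 generated tokens: a crude efficiency proxy.
    efficiency = s["accuracy"] / s["tokens_per_answer"] * 100
    print(f"{name}: {efficiency:.2f} accuracy points per 100 tokens")
```

On these numbers, the Direct Question prompt delivers roughly 1.9× the accuracy per generated token of CoT (339 / 180 ≈ 1.88).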

Conclusion: Fostering Cognitive Flexibility in AI Agents

The insights gleaned from this mini-experiment underscore the critical role of cognitive flexibility in LLMs and AI agent development. Cognitive flexibility, akin to human adaptability in switching between concepts and adjusting responses to varying demands, is pivotal for enhancing model proficiency across complex tasks.

Continued advancements in prompt engineering and model training could unlock greater capabilities in AI agents, enabling them to navigate and adapt to real-world complexities effectively.

As AI technology continues to evolve, fostering cognitive flexibility will likely prove instrumental in developing models that not only perform reliably but also possess a deeper understanding of and adaptability to diverse contexts and challenges.