Creating a LLM-as-a-Judge
Anthropic released details on Claude 3.5 SWEBench+SWEAgent, while OpenAI introduced SimpleQA and DeepMind launched NotebookLM. Apple announced new M4 Macbooks, and a new SOTA image model, Recraft v3, emerged. Hamel Husain presented a detailed 6,000-word treatise on creating LLM judges using a method called critique shadowing to align LLMs with domain experts, addressing the problem of untrusted and unused data in AI teams. The workflow involves expert-reviewed datasets and iterative prompt refinement. Additionally, Zep introduced a temporal knowledge graph memory layer to improve AI agent memory and reduce hallucinations. Anthropic also integrated Claude 3.5 Sonnet with GitHub Copilot, expanding access to Copilot Chat users.