
Anthropic launches the MCP Apps open spec in Claude.ai

Rich generative UI is all you need.

AI News for 1/23/2026-1/26/2026. We checked 12 subreddits, 544 Twitters and 24 Discords (206 channels, and 14285 messages) for you. Estimated reading time saved (at 200wpm): 1208 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

3 months after OpenAI floated a trial balloon with ChatGPT Apps and the Apps SDK at Dev Day 2025, Anthropic has now officially absorbed the independent MCP UI project and, working with OpenAI, Block, VS Code, Antigravity, JetBrains, AWS, and others, has released both the open MCP Apps spec and first-party support for it in Claude.ai.

It's fair to say that ChatGPT Apps haven't exactly taken the world by storm since their announcement, but the need for a standard format that lets applications return rich UI is hard to deny. Now that MCP Apps has been ratified by all the important players, it can serve as the basis for a rich ecosystem of open-source support and interoperable applications, and perhaps one day tame the never-ending pile of $20/month subscriptions on your credit card bill.


AI Twitter Recap

Agent Orchestration, RLMs, and “Clawdbot/Clawd” as a UX pattern

  • NVIDIA ToolOrchestra + Orchestrator-8B: NVIDIA’s ToolOrchestra frames agentic systems as a small “conductor” model that alternates reasoning with calls to tools and larger “expert” models (search, code execution, specialist LLMs, frontier generalists). The claim is that an 8B orchestrator can reach frontier-level outcomes via delegation at materially lower cost, trained end-to-end with scalable RL using automatically synthesized tool-use environments and multi-turn tasks (summary, link). Closest technical implication: “controller scale” matters less than policy quality + tool/model routing if you can train it with realistic tool-call rollouts.
  • RLMs / recursion-first agent stacks: Several posts converge on a Recursive Language Model (RLM) pattern: pass files and context by reference and iteratively pull the minimum slices needed (shell/grep/AST), rather than stuffing everything into context à la ReAct. Dan B illustrates this with file references vs @file expansion as deliberate context management (thread). Daytona is positioning RLMs as “unlimited recursion depth” via per-(sub)agent sandboxes (guide, integration).
  • “Clawd/Clawdbot” meme → product signal: The dataset contains a large “Clawdbot” wave (often with Mac mini jokes), but the technically relevant throughline is outcome-first assistant UX + tight context/tool integration. Kimmonismus explicitly calls this a shift from “more chat” to “more outcome,” suggesting incumbents will scramble to match it (tweet). Others push a cloud-first counterpoint (no local Mac mini) (MiniMax reply). There’s also an emerging security backlash as soon as “powerful mode” exists: prompt injection remains a system-level blocker for browser/desktop agents (dilemma, follow-up, Miessler warnings).
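The reference-passing idea in the RLM bullet above is easy to make concrete: instead of expanding a whole file into context, the agent holds a reference (path + pattern) and pulls only the matching slices on demand. A minimal sketch, assuming a plain-text file and a fixed window of surrounding lines (`grep_slice` and its parameters are illustrative, not any project's API):

```python
import re

def grep_slice(path, pattern, context_lines=2, max_hits=5):
    """Return only the matching slices of a file, not the whole file.

    This keeps the agent's context small: the model keeps a reference
    (path + pattern) and pulls minimal evidence when it needs it.
    """
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    rx = re.compile(pattern)
    slices = []
    for i, line in enumerate(lines):
        if rx.search(line):
            lo = max(0, i - context_lines)
            hi = min(len(lines), i + context_lines + 1)
            slices.append("\n".join(f"{n + 1}: {lines[n]}" for n in range(lo, hi)))
            if len(slices) >= max_hits:
                break
    return "\n...\n".join(slices)
```

The same shape generalizes to AST queries or shell tools: the contract is always "reference in, minimal slice out."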

Reasoning model releases & eval dynamics (Qwen, Tencent, ARC, etc.)

  • Alibaba Qwen3-Max-Thinking: Alibaba positions Qwen3-Max-Thinking as a flagship reasoning+agent model trained with “massive scale and advanced RL,” emphasizing adaptive tool-use (Search/Memory/Code Interpreter) and test-time scaling/self-reflection. They cite strong math and agentic search metrics (e.g., 98.0 on HMMT Feb, 49.8 on HLE) (launch). The model is immediately pushed into public eval channels: LM Arena Text Arena (Arena) and Yupp (Yupp). Community reaction highlights the tool-enabled evaluation regime—claims of outperforming multiple SOTA models on HLE with search tools (commentary).
  • Tencent HunyuanImage 3.0-Instruct (image editing): Tencent releases an image-editing-focused multimodal model built on an 80B MoE (13B active), using a “Thinking” schema with native CoT and their MixGRPO algorithm; focus is on precise edits that preserve non-target regions and multi-image fusion (announcement). LM Arena reports it entering the top-10 image edit leaderboard (rank #7) (Arena).
  • ARC-AGI cost/perf hacks: A notable optimization claim: “Recursive Self-Aggregation (RSA) + Gemini 3 Flash” reaching 59.31% on ARC-AGI-2 at ~1/10 cost vs Gemini Deep Think (tweet). This points to a broader theme: meta-inference strategies (aggregation, recursion, pruning) are becoming as important as base model choice.
  • Open models in arenas: Molmo 2 (Apache 2.0) appears in Arena as a new open model entrant (Arena). Separately, Hugging Face Inference Endpoint notes GLM-4.7-Flash via llama.cpp with a low hourly price point (Q4_K_M, 24k context) (ngxson)—underscoring a continued commoditization of fast open-weight inference.
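The RSA result above is easier to parse with the shape of the method in front of you: keep a population of candidate answers and repeatedly ask the model to merge random subsets into improved candidates. A minimal sketch under my reading of the technique (the prompt wording and the `model` callable are placeholders, not the paper's implementation):

```python
import random

def recursive_self_aggregation(model, prompt, population=8, subset=3,
                               rounds=2, seed=0):
    """model(text) -> answer string. Sample a population of answers, then
    iteratively replace it with aggregations of random subsets of itself."""
    rng = random.Random(seed)
    candidates = [model(prompt) for _ in range(population)]
    for _ in range(rounds):
        candidates = [
            model(
                prompt
                + "\n\nCandidate solutions:\n"
                + "\n---\n".join(rng.sample(candidates, subset))
                + "\n\nMerge these into a single improved solution."
            )
            for _ in range(population)
        ]
    return candidates[0]  # in practice, pick by a verifier or majority vote
```

Note the cost structure: `population * (1 + rounds)` model calls, which is where the "~1/10 cost vs Deep Think" framing comes from when the base model is cheap.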

RL everywhere: test-time training, GRPO stabilization, RL-as-pretraining, and compute savings

  • Test-Time Training (TTT) + RL breakthroughs: A widely shared result claims a Stanford/NVIDIA-style TTT+RL approach that: beats AlphaEvolve, finds a new upper bound for an Erdős overlap problem, produces A100 kernels ~2× faster than best human kernels, and beats both best AI+human attempts on AtCoder (rronak_). This cluster also includes meta-discussion about correctly crediting related approaches (EvoTune) (Yejin Cho).
  • GRPO training stability knobs: A small but actionable engineering tip: INTELLECT-2 reports a delta=4.0 parameter that improves GRPO stability (QGallouedec).
  • RL in pretraining (RLP): NVIDIA authors announce RLP (Reinforcement as a Pretraining Objective) accepted to ICLR 2026, framing RL not as “post-training only” but as integrated into pretraining (ahatamiz1).
  • Compute reduction via curriculum-like filtering: AI21’s “Dynamic Data Snoozing” claims up to 3× compute reduction for RLVR by snoozing examples that are too easy for the current policy (DanielGissin). If validated, this is a practical recipe: make the sampler policy-aware instead of static.
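AI21's description suggests a policy-aware sampler that temporarily drops examples the current policy already solves. A minimal sketch of that idea, assuming a running solve rate per example (the thresholds and snooze period are made-up knobs, not AI21's recipe):

```python
class SnoozingSampler:
    """Skip RLVR examples the policy solves too reliably, for a while."""

    def __init__(self, easy_threshold=0.9, snooze_steps=1000):
        self.easy_threshold = easy_threshold
        self.snooze_steps = snooze_steps
        self.solve_rate = {}   # example_id -> running mean of rewards
        self.wake_at = {}      # example_id -> step when it re-enters the pool

    def update(self, example_id, reward, step, momentum=0.9):
        rate = self.solve_rate.get(example_id, 0.0)
        rate = momentum * rate + (1 - momentum) * reward
        self.solve_rate[example_id] = rate
        if rate > self.easy_threshold:
            self.wake_at[example_id] = step + self.snooze_steps

    def is_active(self, example_id, step):
        return step >= self.wake_at.get(example_id, 0)
```

The compute saving comes from never rolling out snoozed examples; re-waking them later guards against the policy regressing on skills it stopped practicing.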

Inference infrastructure & dev tooling: vLLM’s “day-0 model support,” VS Code MCP Apps, Cursor subagents

  • vLLM’s governance and commercialization pressure: A long Zhihu-derived summary argues vLLM’s “open-source project → startup” shift was driven by the hidden cost of day-0 support (weeks or months of confidential pre-integration per new model), the rise of MoE and heterogeneous inference (fp8/int4/sparse attention), and the mismatch between PyTorch Foundation-style testing and vLLM’s multi-node CI needs. It claims the maintainers founded Inferact Inc to fund full-time maintainers while keeping vLLM open-source (thread). Related: vLLM shares a practical flag for avoiding OOM on long-context models: --max-model-len auto (vLLM tip).
  • MCP Apps: tool calls return interactive UI: The MCP ecosystem announces MCP Apps as the first official MCP extension: tool calls can return interactive UI components rendered in-chat. VS Code is first major editor shipping support (Insiders now, stable soon) (VS Code, alexalbert__). Anthropic simultaneously ships “interactive work tools in Claude” (Slack drafting, Figma diagrams, Asana timelines) (Claude). Net: we’re seeing the “tool interface layer” move from raw JSON to native UI primitives inside agent loops.
  • Cursor: multi-browser subagents: Cursor adds multi-browser support via subagents (Cursor), echoing the same direction: parallelized tool execution + better context isolation.

Kernel LLMs, chip stacks, and “AI for hardware” loops

  • GPU MODE 2026: post-training Kernel LLMs in public: GPU MODE outlines a 2026 plan to post-train a Kernel LLM and get generated kernels merged into real repos (PyTorch/vLLM), emphasizing “de-slopify kernels” (determinism, reviewer-mergeable PRs), profiler-guided optimization + memory work, and competitions as evals (marksaroufim).
  • Microsoft Maia 200: Microsoft announces Maia 200 as a custom inference accelerator; Mustafa Suleyman claims it’s the most performant first-party hyperscaler silicon, with 3× FP4 performance vs Trainium v3 and FP8 above TPU v7 (Mustafa, follow-up). Yusuf Mehdi frames this as infra that makes AI “dependable” (thread).
  • Ricursive Intelligence (AI for chip design): Ricursive raises a $300M Series A aiming at end-to-end chip design as a recursive self-improvement loop between AI and hardware (company, Anna Goldie).

Safety, misuse, and societal impact (selected items with direct technical relevance)

  • Elicitation attacks via benign chemistry data: Anthropic reports that fine-tuning open models on “benign” chemical synthesis content generated by frontier models can significantly increase capability on chemical weapons tasks—an “elicitation attack” that scales with frontier model strength (AnthropicAI, paper link).
  • Dario Amodei’s “Adolescence of Technology” essay: A major, highly engaged post argues AI is entering an accelerating feedback loop (AI building AI), with risks spanning misuse, power-seeking autonomy, and economic disruption; it also explicitly frames wealth concentration as a society-breaking failure mode (Dario). Reaction ranges from strong endorsement to critique of how “takeover risk” framing is presented (Ryan Greenblatt).
  • Agent security in practice: Multiple posts treat desktop/browser agents as inherently high-risk until prompt injection and sandboxing mature, reinforcing the need for strict isolation, least privilege, and careful handling of credentials (Miessler).
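The "least privilege" point above can be made concrete with a tool-call allowlist: anything outside a small read-only set requires explicit human confirmation. A minimal sketch (the tool names and `confirm` callback are illustrative, not any agent framework's API):

```python
READ_ONLY_TOOLS = {"read_file", "grep", "list_dir"}  # least-privilege default

def gate_tool_call(tool, args, confirm=lambda msg: False):
    """Return True if the agent may execute this tool call.

    Read-only tools pass automatically; anything that can write, spend,
    or reach the network must be confirmed by a human out-of-band, so a
    prompt-injected instruction cannot silently escalate.
    """
    if tool in READ_ONLY_TOOLS:
        return True
    return confirm(f"Agent requests {tool}({args!r}). Allow?")
```

The key design choice is that `confirm` is a side channel the model never sees, which is exactly what the prompt-injection threat model requires.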

Top tweets (by engagement)


AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Local LLM Hardware and Benchmarking

  • 216GB VRAM on the bench. Time to see which combination is best for Local LLM (Activity: 366): The post discusses the use of secondhand Tesla GPUs, which offer substantial VRAM at a lower cost, for local large language model (LLM) testing. The author has developed a GPU server benchmarking suite to evaluate the performance of these GPUs when used in parallel. The image shows a technical setup with multiple NVIDIA GPUs, highlighting the focus on maximizing VRAM capacity. The discussion centers around the feasibility and efficiency of using these older GPUs compared to modern devices, particularly in terms of bandwidth and cooling challenges. Commenters express skepticism about the performance of these GPUs, noting potential issues with bandwidth and cooling. One commenter shares personal experience, comparing different GPU models and highlighting the challenges of using older hardware.

    • HugoCortell raises a technical concern about the potential bandwidth limitations when connecting multiple GPUs to a single PC, noting that most affordable server motherboards support only a few GPUs. This could impact the performance of local LLMs if not addressed properly.
    • dc740 shares insights from personal experience with different GPUs, highlighting that the P40 outperforms the M10 despite both being older models. However, they prefer using AMD Instinct Mi50 GPUs due to their performance, even though support for these was recently dropped from ROCm, indicating a trade-off between hardware capability and software support.
    • FullOf_Bad_Ideas critiques the gpu_box_benchmark for not testing scenarios where large models are split across multiple GPUs, which is a primary use case for setups with extensive VRAM. This points to a gap in current benchmarking practices that may not fully reflect real-world applications of multi-GPU systems.
  • I just won an Nvidia DGX Spark GB10 at an Nvidia hackathon. What do I do with it? (Activity: 724): The image shows a terminal window on a Linux system running the 'top' command, which is used to monitor system processes and resource usage in real-time. The user has won an Nvidia DGX Spark GB10, a high-performance computing device designed for machine learning and data-intensive tasks. The terminal indicates a Python process consuming significant CPU resources, suggesting active computational tasks, possibly related to machine learning or data processing. The user is considering using the device to run multiple NextJS applications simultaneously, leveraging its powerful capabilities. One commenter suggests running three NextJS applications simultaneously, indicating the device's capability to handle multiple high-memory tasks. Another commenter provides a link to Nvidia's DGX Spark playbooks, which could be useful for the user to explore the full potential of their new hardware.

    • Fit-Produce420 highlights the capabilities of the Nvidia DGX Spark GB10, noting that with 128GB of memory, it can fine-tune models up to 70 billion parameters. Additionally, it can handle larger models like the 120 billion parameter gpt-oss-120b using techniques like QLoRA, which optimizes memory usage for large-scale models. However, running dense models like devstral 2 may be slow due to their computational demands.
    • randomfoo2 suggests utilizing the NVIDIA DGX Spark playbooks as a resource for getting started with the DGX Spark GB10. These playbooks provide structured guidance and best practices for deploying and managing workloads on the DGX platform, which can be particularly useful for users new to this hardware.
    • LicensedTerrapin humorously suggests selling the DGX Spark GB10 to purchase 8GB of DDR5 RAM, implying a trade-off between high-end specialized hardware and more general-purpose upgrades. This comment reflects a common debate in tech communities about the value of specialized versus general-purpose hardware investments.
  • Using a high-end MacBook Pro or a beefy RTX 5090 laptop (with 24 GB of RAM) for inference. (Activity: 29): The post discusses the feasibility of using a high-end MacBook Pro with Apple Silicon (M-series Max) versus a Windows/Linux laptop with an RTX 5090 GPU for running large local LLMs (70B+ parameters) for inference and fine-tuning. The MacBook Pro offers 128–192 GB of unified memory, while the RTX 5090 laptop provides 24 GB of VRAM and at least 64 GB of system RAM. The primary use case is local LLM inference with a target of ≥15 tokens/sec, emphasizing portability. The post queries whether the larger unified memory of Apple Silicon outweighs the CUDA performance of the RTX laptop for inference, and how Apple MLX compares to CUDA for fine-tuning tasks like LoRA/QLoRA. It also seeks insights on thermal performance and sustained inference capabilities of both setups. One commenter suggests using the laptop as a terminal to a more powerful desktop, indicating a preference for leveraging remote resources over local hardware. Another commenter is experimenting with both setups, using a MacBook Pro M2 Max for inference, and is curious about the performance differences.

    • racerx509 shares their experience using a Lenovo laptop with a 3070ti, a custom desktop with a 5070, and a MacBook Pro M2 Max with 96GB RAM for inference tasks. They note that they have been primarily using the MacBook Pro for inference, suggesting it may offer better performance or convenience for their needs.
    • No-Concern-8832 raises a concern about the VRAM limitations of RTX laptops, suggesting that they may not be sufficient for running large models like 70B parameters. This highlights a potential limitation in using high-end RTX laptops for certain deep learning tasks that require substantial VRAM.
    • Tired__Dev discusses their experience with an Asus M16 equipped with a 4090 GPU, noting that it struggled with a 7B parameter model. They express a preference for a MacBook Pro with 128GB RAM, citing its high memory bandwidth and potential performance advantages over even high-end GPU setups like the DGX Spark.
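The Mac-vs-RTX debate in this thread mostly comes down to decode being memory-bandwidth-bound: generating each token reads roughly every weight once. A back-of-envelope sketch, with illustrative bandwidth numbers rather than measured ones:

```python
def est_decode_tps(model_size_gb, mem_bandwidth_gbs):
    """Rough upper bound on single-stream decode tokens/sec for a dense
    model: memory bandwidth divided by bytes read per token (~model size)."""
    return mem_bandwidth_gbs / model_size_gb

# A 70B model at 4-bit is roughly 40 GB of weights (illustrative):
# - M-series Max class (~400 GB/s unified memory): est_decode_tps(40, 400) -> ~10 t/s
# - An RTX laptop GPU has far higher bandwidth, but 40 GB cannot fit in
#   24 GB of VRAM, so layers spill to system RAM and throughput collapses.
```

This is why large unified memory can beat a faster-but-smaller GPU for 70B-class inference, even against the ≥15 tokens/sec target in the post.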

2. Multi-Agent Systems and AI Assistants

  • I built a "hive mind" for Claude Code - 7 agents sharing memory and talking to each other (Activity: 313): The post describes a multi-agent orchestration system for Claude Code, featuring seven specialized agents (e.g., coder, tester, reviewer) that coordinate tasks, share persistent memory using SQLite + FTS5, and communicate via a message bus. The system runs as an MCP server and integrates with Anthropic, OpenAI, or Ollama. It uses a task queue for priority-based coordination, allowing agents to pass context and collaborate effectively. The implementation stack includes TypeScript, better-sqlite3, MCP SDK, and Zod. The project is experimental, open-source under the MIT license, and available on GitHub. A comment questions the system's uniqueness compared to the BMAD method, suggesting similarities. Another comment humorously questions whether the agents agree with each other, hinting at potential coordination challenges.

    • The user robiinn inquires about the differences between the 'hive mind' system and the bmad method, suggesting a potential similarity. This indicates a need for clarification on the unique aspects or improvements of the 'hive mind' approach over existing methods, such as how memory sharing and inter-agent communication are implemented differently.
    • No_Afternoon_4260 raises a critical point about the consensus among the agents in the 'hive mind'. This touches on the technical challenge of ensuring that multiple agents can not only share memory but also reach agreement or consensus, which is a significant aspect of distributed systems and multi-agent frameworks.
    • JellyBean504 draws a parallel between the 'hive mind' and Steve Yegge's Gastown, suggesting that there might be conceptual similarities. This comparison could be valuable for understanding the architectural or functional parallels between the two systems, potentially offering insights into design choices or performance characteristics.
  • Clawdbot: the AI assistant that actually messages you first (Activity: 214): Clawdbot is an open-source AI assistant with over 9K GitHub stars, designed to proactively message users, unlike traditional AI assistants that wait for prompts. It integrates with locally hosted LLMs via Ollama and supports messaging apps like WhatsApp, Telegram, and Discord. Key features include sending automated briefings and reminders, local storage of conversations as Markdown files, and the ability to control browsers and run scripts. The software is free under the MIT license but requires terminal proficiency for setup, as there is no GUI installer. Read more. Users report challenges with setup, particularly with obtaining and using OAuth keys for authentication, and difficulties in connecting local LLMs without relying on API keys. Some users express frustration with the complexity of setup, especially when using remote machines.

    • mike7seven highlights the complexity of setting up Clawdbot, particularly emphasizing the need to obtain a Claude OAuth key on a separate machine and then transfer it to the setup machine. This process is noted as cumbersome, especially for those using remote machines, and the MacOS app requires building from source, adding another layer of complexity.
    • Ashamed_Promise7726 raises a technical challenge regarding the integration of local language models with Clawdbot. The user notes difficulty in connecting pre-downloaded models on their PC, as Clawdbot seems to require an API key for usage-based models, questioning the feasibility of running Clawdbot entirely locally without external dependencies.
    • inigid warns about potential security risks associated with Clawdbot, suggesting it could be exploited for supply-chain attacks that compromise sensitive data on a user's machine and network. The comment also mentions concerns about the association with Solana meme coins, implying a need for caution when using the tool.
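The "hive mind" post's stack persists shared memory in SQLite with FTS5 and coordinates agents over a message bus. A dependency-free sketch of the same two primitives (the class and method names are invented for illustration; the real project is TypeScript with better-sqlite3):

```python
from collections import defaultdict, deque

class HiveMemory:
    """Shared memory + message bus for cooperating agents.

    The real project uses SQLite + FTS5 for persistence and full-text
    search; this in-memory version keeps only the shape of the idea.
    """

    def __init__(self):
        self.notes = []                   # (agent, text) shared memory
        self.inbox = defaultdict(deque)   # agent -> queued messages

    def remember(self, agent, text):
        self.notes.append((agent, text))

    def search(self, query):
        q = query.lower()
        return [t for _, t in self.notes if q in t.lower()]

    def send(self, sender, recipient, body):
        self.inbox[recipient].append({"from": sender, "body": body})

    def receive(self, agent):
        return self.inbox[agent].popleft() if self.inbox[agent] else None
```

The consensus question raised in the comments lives one layer above this: shared memory and messaging give agents a common view, but agreeing on what to do with it still needs a coordination policy (e.g. the task queue's priorities).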

3. GLM-4.7-Flash Performance Updates

  • GLM-4.7-Flash is even faster now (Activity: 443): The recent update to llama.cpp by Johannes Gaessler optimizes the CUDA implementation of FlashAttention, specifically for models with a non-power-of-2 ratio of query heads to key/value heads. This is achieved by padding Q columns to the next power of 2, which, although slightly inefficient, enhances performance for small batch sizes. The update is detailed in pull request #19092. One comment humorously notes the obsolescence of a previous post due to this update, while another laments the lack of support for AMD GPUs, highlighting a common issue in the community regarding hardware compatibility.

    • The user 'jacek2023' provides detailed performance metrics for the GLM-4.7-Flash model, highlighting its efficiency. The model processes a prompt with 45074 tokens, achieving a prompt evaluation time of 2814.63 ms for 1612 tokens, which translates to 1.75 ms per token or 572.72 tokens per second. The overall evaluation time is 29352.57 ms for 1731 tokens, equating to 16.96 ms per token or 58.97 tokens per second. The total processing time is 32167.20 ms for 3343 tokens, indicating significant improvements in speed.
  • KV cache fix for GLM 4.7 Flash (Activity: 380): The recent update to GLM 4.7 Flash involves removing the V component from the KV cache, which significantly reduces VRAM usage, allowing for longer context lengths on the same hardware setup. This change is particularly beneficial for models like DeepSeek and GLM 4.7 Flash, as it can save gigabytes of VRAM, enabling context lengths to double, as demonstrated by a user running a 90,000 context on a 4090 GPU. The update is part of a pull request in the llama.cpp repository, which introduces a V-less KV cache, reducing memory usage by nearly 50%. More details can be found in the pull request. A user noted that the model, while improved, still requires some manual guidance, especially in tasks like coding and creative writing, where it may not perform as well as specialized models. However, it excels in tool use and as an assistant, making it a preferred choice for home-server applications.

    • The user 'teachersecret' reports significant improvements in context handling with the UD's k_xl 4-bit version of the GLM 4.7 model on an RTX 4090. Previously, the model maxed out at 45,000 context tokens, but now it can handle 90,000. Despite these improvements, the model still requires some manual guidance, especially in coding tasks, and is less effective in creative writing compared to other models. However, it excels in tool usage and is now the user's default model for their home server.
    • User 'viperx7' provides detailed benchmark data comparing the performance of the GLM 4.7 model before and after a specific change. The benchmarks show improvements in both prompt processing and token generation speeds across different configurations. For instance, using a single RTX 4090, the context size increased from 64k to 128k, with prompt processing speed improving from 3489 t/s to 3510 t/s and token generation from 88 t/s to 92.5 t/s. The maximum context size achievable with a 4090 and 3060 setup is 200k, leaving about 6GB of VRAM unused.
    • The discussion highlights the technical aspect of the GLM 4.7 model's KV cache fix, which allows for increased context sizes and improved performance metrics. The benchmarks provided by 'viperx7' indicate that the model can now handle up to 207k context size in certain configurations, with significant improvements in processing speeds. This suggests that the model's efficiency has been enhanced, making it more suitable for high-demand applications.
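Two quick back-of-envelope calculations for the items above: the FlashAttention change pads the query-heads-per-KV-head ratio up to the next power of two, and dropping the V tensor roughly halves KV-cache bytes per token. Both functions are illustrative arithmetic with assumed model dimensions, not llama.cpp code:

```python
def next_pow2(n):
    """Smallest power of two >= n (how the Q columns get padded)."""
    p = 1
    while p < n:
        p *= 2
    return p

def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_el=2, with_v=True):
    """Approximate KV-cache size for a standard-attention model."""
    per_token = layers * kv_heads * head_dim * bytes_per_el * (2 if with_v else 1)
    return per_token * ctx / 1e9

# e.g. a GQA model with 12 query heads per KV head pads to 16 columns:
# slightly wasteful per tile, but it unlocks the fast kernel path.
```

Halving per-token cache bytes is exactly why users report context doubling (45k to 90k on the same 4090) after the V-less KV cache landed.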

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude AI Usage and Issues

  • Why You Need To Constantly Clear Claude Codes Context Window (Activity: 166): The post highlights the necessity of regularly clearing the context window when using coding agents like Claude to maintain optimal performance. It notes that performance degrades significantly once the context window passes roughly 40% of its capacity, because LLM attention scales quadratically with context length, increasing compute and introducing noise. The recommended practice is to avoid accumulating context and instead persist what matters, using a 'one session per task' strategy so each task starts with a fresh context. More details can be found in the original article. Commenters suggest practical strategies such as using handover prompts to transfer necessary details between sessions, employing the '/clear' command to reset context, and utilizing 'Plan Mode' to clear context and execute tasks efficiently. These methods reportedly help avoid the need for a full context window, even for large tasks.

    • Agrippanux suggests using 'Plan Mode' as the default setting for Claude, which allows users to clear the context and execute plans without needing a full context window. This approach has been effective for large tasks, such as refactoring, without requiring the entire context to be loaded, thus optimizing performance and resource usage.
    • thurn2 discusses the use of sub-agents in Claude, which involves delegating tasks like creating a git worktree and fixing specific issues. This method allows for parallel execution of tasks and helps in managing complex projects by breaking them down into smaller, manageable tasks, enhancing efficiency and implementation accuracy.
    • Fancy_Excitement6050 notes that as the context window grows, Claude tends to take shortcuts, which can lead to a need for constant reminders to maintain thoroughness. This suggests that managing the context window size is crucial for maintaining the quality of output, and there might be differences in performance between different Claude plans, such as Claude Max.
  • Opus fell off? Here’s the workflow that kept my code quality stable (Activity: 133): The post discusses a structured workflow to maintain code quality when using AI models like Opus and Sonnet, which have been perceived as producing "confident wrong" outputs and drifting edits. The workflow emphasizes a loop of specification, ticket creation, execution, and verification. Specifications are detailed with non-goals, user stories, acceptance criteria, edge cases, and more, treated as code to ensure clarity. Tickets are derived from specs, focusing on small, independently mergeable tasks with clear acceptance checks. Execution involves implementing one ticket at a time with constraints to prevent scope drift, and verification involves running tests and confirming acceptance criteria before feeding failures back into the model for correction. This approach aims to maintain discipline and reduce reliance on the model's "done" signal, ensuring stable and reliable outputs. Commenters agree that the workflow is effective, emphasizing that AI models function more like junior engineers requiring clear specifications and strict feedback loops. This approach shifts effort towards upfront clarity and external verification, making the system more stable and less reliant on the model's intelligence. Smaller scoped tickets and hard verification are noted as beneficial strategies.

    • GenOS2312 highlights the importance of treating LLMs like junior engineers, emphasizing that a well-specified problem and a strict feedback loop are crucial for reliable outputs. The workflow discussed focuses on upfront clarity and external verification, which stabilizes the system by not relying on the model's intelligence but rather constraining it to ensure even average runs yield acceptable results.
    • Different-Object5926 notes that smaller scoped tickets combined with hard verification processes significantly improve the stability and reliability of using models like Opus. This approach mitigates the impact of variability in model performance, suggesting that the issue isn't just 'unlucky runs' but rather the need for structured constraints.
    • TheOriginalAcidtech suggests implementing hooks to prevent skipping steps in the workflow, emphasizing that the human interface is often the weakest link. By enforcing strict adherence to the process, the system can better manage user interactions, ensuring that the model and its harness guide the user effectively, rather than relying solely on the model's capabilities.
  • after claude now chatgpt is also uses Grokipedia as source (Activity: 634): The image and accompanying discussion highlight that the latest version of ChatGPT is reportedly using Elon Musk's Grokipedia as a source. This is significant as it suggests a shift in the data sources used by ChatGPT, potentially affecting the information quality and bias in its responses. The comments reveal a concern about the implications of using Grokipedia, particularly regarding the potential for biased information, as one user notes the risk of models being influenced by 'right wing' content. However, it is clarified that Grokipedia is not used as training data but rather as a search tool, which may mitigate some concerns about direct bias in the model's foundational knowledge.

    • The discussion highlights concerns about language models like Claude and ChatGPT potentially using sources like Grokipedia, which may have biased or unreliable content. This raises questions about the integrity of the information these models provide, especially when they utilize search tools to access real-time data. The implication is that the quality and neutrality of the data sources are crucial for maintaining the accuracy and trustworthiness of AI outputs.
    • There is a debate about the impact of using sources like Grokipedia on the training and performance of language models. Some commenters express concern that incorporating biased or politically skewed sources could lead to the dissemination of misinformation. This reflects broader worries about the influence of data sources on the objectivity and reliability of AI-generated content.
    • The mention of Reddit as a data source for language models suggests a comparison of potential biases. While some argue that Reddit may contain more extreme or varied viewpoints, the underlying issue is the challenge of ensuring that AI models are trained on balanced and factual data. This discussion underscores the importance of curating high-quality datasets to prevent the spread of biased information.
  • Giving Claude full access to a laptop (Activity: 795): The post discusses the implementation of giving Claude, an AI model, full access to a laptop, allowing it to autonomously manage a virtual machine (VM) on Ubuntu Google Cloud. The user describes how Claude can be remotely controlled via Discord to build new features and fix bugs, logging major actions with timestamps in a markdown file for memory management. This setup enables the user to learn from Claude's problem-solving processes and manage workflows effectively, even as a newcomer to programming. One commenter, a desktop support technician, expressed amazement at the implementation, noting its potential impact on job roles, while another sought clarification on the technical specifics of giving Claude full device access.

    • xxxBigMemerxxx describes using Claude to manage a Google Cloud VM running Ubuntu, highlighting its ability to autonomously handle tasks and build features. They mention using Discord for remote requests and bug fixes, and implementing a logging system with markdown and Unicode for tracking changes. This setup allows for a dynamic interaction with Claude, enabling it to learn from errors and maintain a form of short-term memory by logging recent updates.
    • Happy_Requirement187 shares their experience running Claude on an AWS EC2 instance with Ubuntu Linux, accessed via SSH from a Windows laptop. They utilize a Jupyter notebook server for seamless file sharing between the EC2 instance and their local environment, a method recommended by Anthropic. Additionally, they have set up a Ruby on Rails environment with a React frontend for secure file sharing, allowing them to request files via Slack, demonstrating a sophisticated integration of Claude into their workflow.
    • sivadneb inquires about setting up voice control in Linux, indicating a technical challenge in integrating voice commands with Claude. This suggests an interest in expanding the interaction capabilities with Claude beyond text-based commands, potentially enhancing the usability and accessibility of the system.
  • CLAUDE.md says 'MUST use agent' - Claude ignores it 80% of the time. (Activity: 309): The image and post discuss a technical issue with the CLAUDE.md file, which is supposed to direct the AI, Claude, to use a specific agent for workflow questions. Despite explicit instructions in the file, Claude often defaults to a generic agent, indicating a lack of enforcement in the system. The post suggests that without technical enforcement mechanisms, such as hooks or stronger prompts, instructions are merely suggestions. The image emphasizes these points with highlighted text, suggesting potential solutions like adding enforcement hooks to ensure compliance with the specified workflow. Commenters suggest that the issue may stem from unclear instructions, emphasizing the need for simple and direct commands. They also highlight the importance of implementing technical solutions, such as hooks, to enforce compliance with the CLAUDE.md instructions.

    • Accomplished_Buy9342 suggests using hooks to manage Claude's behavior, providing a link to a GitHub repository that demonstrates how to block the main chat from performing actions and delegate tasks to a subagent. This approach can help in orchestrating Claude's actions more effectively, especially when dealing with complex tasks or large contexts.
    • luka5c0m highlights a common issue with Claude when used at scale: as the context grows beyond a few files, the agent may perform unexpected actions. They suggest that instead of relying solely on better prompts, developers should use hooks and dynamic instructions to maintain a sharp and concise context. They also mention working on a dynamic CLAUDE.md file that adapts to the current task, which could help in managing large or nested files effectively.
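The enforcement-via-hooks idea commenters recommend can be sketched as a small gate. This assumes Claude Code's PreToolUse hook contract (a JSON event on stdin, where exit code 2 blocks the call and stderr is fed back to the model); the blocked tool names and message below are illustrative, not anyone's actual configuration.

```python
# Hypothetical PreToolUse hook: force delegation of file edits to a subagent.
DELEGATE_TOOLS = {"Edit", "Write"}  # tools the main chat must not use directly

def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message); exit code 2 blocks the tool call."""
    if event.get("tool_name") in DELEGATE_TOOLS:
        return 2, "Blocked: delegate file edits to the workflow subagent (see CLAUDE.md)."
    return 0, ""

# Wire-up in a real hook script would be roughly:
#   code, msg = decide(json.load(sys.stdin))
#   if msg: print(msg, file=sys.stderr)
#   sys.exit(code)
```

Unlike a "MUST use agent" sentence in CLAUDE.md, a hook like this is enforced mechanically, so compliance no longer depends on the model reading and obeying the instruction.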
  • My Ralph Wiggum breakdown just got endorsed as the official explainer (Activity: 170): The post discusses a video breakdown of Ralph Wiggum, an autonomous coding loop, which has been endorsed by Geoffrey Huntley as the official explainer. Ralph Wiggum is a bash while loop that calls Claude in headless mode, allowing for autonomous code implementation without context degradation. Key features include avoiding the Anthropic Ralph plugin due to performance issues, using fresh context windows for each iteration, and emphasizing the importance of concise specs to prevent hitting a "dumb zone." The video link is here. The comments include a link to the endorsement post by Geoffrey Huntley, and general positive feedback on the video, indicating its usefulness and quality.

    • Dennis1451 highlights a practical application of the Ralph Wiggum breakdown, noting the importance of using a well-defined specification and clearing context for optimal results. They mention using 'auto compact' without a clear spec initially, which suggests that following the guidelines provided in the breakdown could enhance performance and accuracy.
    • messiah-of-cheese expresses a desire for more scientific validation in the video, particularly regarding the 'dumb zone' premise. This indicates a need for empirical evidence or data to support the claims made in the breakdown, which could strengthen its credibility and acceptance among a technical audience.
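The "bash while loop that calls Claude in headless mode" can be sketched as a small driver. This assumes a `claude -p` headless CLI; the prompt wording and the `DONE` completion sentinel are hypothetical, and the runner is injectable so the loop logic can be exercised without the CLI.

```python
import subprocess

def ralph_loop(spec_path: str, max_iters: int = 10, run=None):
    """Repeatedly invoke the model with a fresh context each iteration.

    Each call re-reads the spec and starts clean, which is the core of
    the pattern: no context degradation across iterations.
    """
    if run is None:
        def run(prompt):  # default: shell out to the (assumed) headless CLI
            return subprocess.run(
                ["claude", "-p", prompt], capture_output=True, text=True
            ).stdout
    spec = open(spec_path).read()
    outputs = []
    for _ in range(max_iters):
        out = run(f"Implement the next unfinished item in this spec:\n{spec}")
        outputs.append(out)
        if "DONE" in out:  # hypothetical completion sentinel
            break
    return outputs
```

The video's advice about concise specs maps directly onto `spec_path`: a short, precise spec keeps each fresh context out of the "dumb zone".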

2. ICLR and ICML 2026 Conference Discussions

  • [D] ICLR 2026 decision mega thread (Activity: 1589): The post announces the imminent release of ICLR 2026 review decisions, with anticipation heightened due to a previous incident involving OpenReview. The community is preparing for the outcomes, with some users humorously sharing acceptance prediction models based on historical data, such as a simple return uniform(0, 1) > 0.7. This reflects a light-hearted approach to the uncertainty of paper acceptance. The comments reflect a mix of anticipation and humor, with some users expressing frustration over misleading emails from other conferences like ICML, which adds to the tension of awaiting ICLR decisions.

  • [D] ICML 2026 - ICML desk-rejected my paper but kept me on as a reviewer. Wow? (Activity: 279): The post highlights a situation where an author's paper was desk-rejected by ICML 2026, yet they were retained as a reviewer. This reflects a common practice in academic conferences where the author and reviewer pipelines are separate; desk rejections often occur due to scope or formatting issues, while reviewer selection is based on past service or keyword matching. This situation underscores the reliance on unpaid labor in academia, where reviewing is seen as community service, but the feedback loop for authorship and recognition is weak. A notable opinion from the comments suggests that the separation between the author and reviewer roles can feel insulting, as these decisions are made by different parts of the conference organization. It highlights the need for conferences to clarify this separation to avoid personal affronts.

    • AccordingWeight6019 highlights a systemic issue in academic publishing where the processes for desk rejection and reviewer selection are distinct. Desk rejections often occur due to scope or formatting issues, while reviewer selection is based on past service or keyword matching. This separation can lead to feelings of insult among authors, but it's a structural necessity due to the different roles and responsibilities within the publication process. The comment suggests that conferences should improve transparency about these processes to mitigate personal feelings of rejection.
    • mocny-chlapik points out that the responsibility for a desk rejection often lies with the author, particularly if it results from not following submission guidelines. The comment implies that submitting a paper, even if desk rejected, obligates the author to fulfill reviewer duties, as the submission process involves volunteer time and resources. This highlights the importance of adhering to submission instructions to avoid unnecessary strain on the peer review system.
  • [R] Appealing ICLR 2026 AC Decisions... (Activity: 138): The post discusses a situation where an author received mixed reviews for a paper submitted to ICLR 2026, with scores of 4(3)/6(4)/6(4)/6(4). The author invested significant resources, including $1.6k on new experiments and added 20+ pages of theory, to address reviewer concerns. Despite these efforts, the metareview cited "outstanding concerns" that the author believes were addressed, raising questions about the review process's fairness and accuracy. The author is seeking advice on appealing the decision, expressing frustration that improvements were seemingly ignored. Commenters generally agree that appealing decisions at conferences like ICLR is not feasible, attributing outcomes to luck and the subjective nature of reviews. Some suggest that the meta-review process can be inconsistent, with one commenter noting that meta-reviewers sometimes act as an additional critical reviewer, potentially skewing outcomes.

    • tedd235 discusses the variability in paper acceptance at conferences, suggesting that some PhD students might reject papers to improve their own odds, making the process feel like a 'coin flip'. They note that if other reviewers provide higher scores, the Area Chair (AC) might consider this in their decision, indicating a potential for subjective bias in the review process.
    • Fantastic-Nerve-4056 shares an experience from AAMAS where despite receiving scores of 6 and 8 from reviewers, the Meta Reviewer recommended rejection with minimal justification, stating it was 'relevant for other AAMAS session'. This highlights issues with the transparency and accountability of meta-reviewer decisions, which can override individual reviewer scores without detailed explanation.
    • Intrepid_Discount_67 describes a thorough submission process, including extensive theoretical analysis, comprehensive baseline comparisons, and open-sourced code, yet faced non-responsive reviewers and an AC that upheld the initial scores. This underscores challenges in the review process where detailed responses and transparency do not necessarily lead to favorable outcomes.
  • [D] ICML new policy: reviewers will be reviewed by meta reviewer. Good policy? (Activity: 151): The image describes a new policy implemented by the International Conference on Machine Learning (ICML) where reviewers will be evaluated by meta-reviewers. The top 25% of reviewers will be recognized as 'gold reviewers' and will receive free registration, while the next 25% will be designated as 'silver reviewers.' These distinctions are intended to incentivize high-quality reviews and will be considered in financial aid applications. This policy aims to improve the quality of reviews by providing recognition and potential financial benefits to diligent reviewers. Some commenters express skepticism about the effectiveness of this policy, questioning who will oversee the meta-reviewers themselves. Others see it as a positive step, particularly for reviewers from low-resource backgrounds, and suggest further recognition at conferences to encourage quality reviewing.

    • Bitter-Reserve3821 highlights that area chairs have traditionally been responsible for rating reviews, typically using a three-tier system: 'did not meet expectations', 'satisfactory', or 'exceeded expectations'. This practice is not new, and there have been 'Best Reviewer' awards in the past, sometimes offering incentives like free conference registrations.
    • Unhappy_Craft1906 raises a concern about the feasibility of this policy for top labs with substantial funding, questioning whether they would participate in the review process merely for free registrations. This points to a potential disparity in how different institutions might engage with the policy based on their resources.
    • newperson77777777 suggests an extension of the policy by introducing a visible recognition system, such as a gold or silver star on conference badges, to incentivize quality reviewing. This idea aims to foster a culture of excellence and accountability within the reviewing community.

3. OpenAI and AI Industry Legal and Business Developments

  • Things Get Worse For OpenAI: Consumer groups prep class action suits about their price fixing and supply manipulation through DRAM hoarding. (Activity: 107): OpenAI is facing potential class action lawsuits for allegedly hoarding DRAM to manipulate prices and disadvantage competitors, with accusations of securing nearly 40% of the global DRAM supply. Consumer groups argue this constitutes 'predatory bidding' and violates antitrust laws like the Sherman and Clayton Acts. The Free Software Foundation and other groups are pursuing legal remedies, arguing DRAM should be considered an 'Essential Facility' due to its critical role in AI, while the FTC and European Commission investigate potential violations of competition laws. The DOJ is also examining whether OpenAI's 'Stargate' project constitutes a 'monopsony'. Commenters question why only OpenAI is targeted and not other companies like Nvidia, and debate whether buying RAM constitutes price fixing, suggesting that supply issues may not be OpenAI's fault.

    • Alacritous69 argues that OpenAI's purchase of RAM does not constitute price fixing, as they are actively using the resources rather than hoarding them. The commenter suggests that the issue lies with suppliers' inability to meet demand, rather than any manipulative practices by OpenAI.
    • sambull raises a strategic business perspective, suggesting that by purchasing large quantities of RAM, OpenAI could be intentionally limiting resources available to competitors, including those developing at-home language models. This could be seen as a competitive strategy to maintain market dominance.
    • max6296 questions why the focus is solely on OpenAI when Nvidia could also be implicated in similar practices, hinting at a broader industry issue regarding resource allocation and market influence.
  • When Ads aren't enough: OpenAI's push to Claim a Cut of Customers' AI Discoveries (Activity: 63): OpenAI is exploring new business models beyond traditional subscriptions and ads, focusing on outcome-based pricing and IP-based agreements. This approach would allow OpenAI to claim a share of the value created when their AI models contribute to profitable outcomes, particularly in enterprise sectors like pharma, scientific research, and energy systems. This strategy aligns OpenAI's revenue with customer success, aiming to capture more value as AI capabilities expand. OpenAI's annualized recurring revenue has surged from $2B in 2023 to over $20B in 2025, driven by increased compute scaling. This move is part of a broader trend among AI firms towards value-based pricing, amidst criticism from figures like Elon Musk, who accuses OpenAI of abandoning its nonprofit origins. The community is divided, with some viewing this as a logical evolution of AI monetization, while others criticize it as overly profit-driven. Comparisons are drawn to other industries, suggesting skepticism about the feasibility and fairness of such models.

  • CATL, the world's largest battery maker, launches sodium batteries: extremely durable, stable at –40°C, much cheaper than lithium (5x), safer, 10,000 charge cycles, requires no nickel or cobalt... (Activity: 1289): CATL has launched the first mass-produced sodium-ion batteries, offering a cost-effective alternative to lithium-ion with a price of ~$20 per kWh compared to lithium's ~$100 per kWh. These batteries, part of the Tianxing II range, are designed for microvans and small trucks, featuring an energy density of 175 Wh/kg and a lifespan of over 10,000 cycles, maintaining 90% capacity at -40°C. They utilize a hard carbon electrode and prussian-blue cathode, eliminating the need for nickel or cobalt, and are expected to be scaled up for broader use, including in Europe by 2026. Read more. Some commenters express surprise at the application of sodium batteries in vehicles, expecting them to be used in stationary systems due to weight concerns. Others note the strategic advantage for China in advancing battery technology, contrasting it with perceived setbacks in the US market.

    • The Tianxing II range of sodium batteries by CATL is specifically designed for microvans, light vans, and small trucks, indicating a focus on applications where energy density and weight are less critical compared to cost and durability. This suggests a strategic move to target markets where these factors are prioritized, potentially offering a competitive edge over traditional lithium-ion batteries.
    • The introduction of sodium batteries into vehicles is surprising to some, as it was expected that such technology would first be applied to stationary applications like home energy storage. This is due to the lower energy density of sodium batteries compared to lithium-ion, which makes them less ideal for applications where weight and size are critical factors.
    • There is curiosity about the commercial availability of these sodium batteries, with questions about whether they can be purchased directly for home use or if they will be distributed through third-party vendors. The performance metrics, such as 10,000 charge cycles and operation at -40°C, are impressive and suggest that sodium batteries could rival LiFePO4 in terms of performance, especially given their cost advantage.
  • K-Shaped AI Adoption? (Activity: 748): The image highlights a discussion by Kevin Roose on the 'K-shaped' adoption of AI technologies, where there is a significant divide between early adopters, particularly in tech hubs like San Francisco, and those who are lagging due to restrictive IT policies. This disparity is creating a cultural and technical divide, with early adopters integrating AI deeply into their workflows, while others struggle to gain access to even basic AI tools. The conversation points to a broader issue of accessibility and the potential for some workers to be left behind in the AI revolution. Commenters note that the disparity in AI adoption is exacerbated by the complexity of the technology, which requires a certain level of expertise to use effectively. Additionally, the high cost of advanced AI tools, such as 'multi-agent claudeswarm,' limits access to those with sufficient financial resources, further widening the gap.

    • Setsuiii highlights the technical barrier to effective AI use, noting that current AI technologies require users to have a certain level of expertise to achieve optimal results. This complexity, combined with ongoing ethical debates surrounding AI, may deter widespread adoption. However, those who can navigate these challenges have significant opportunities, although competition is increasing as more technically adept individuals enter the field.
    • Glxblt76 and Gubzs discuss the financial barriers to AI adoption, particularly the high costs associated with advanced AI tools like a 'multi-agent claudeswarm,' which can cost around $200 a month. This expense limits access to those with substantial financial resources, such as individuals in tech hubs like San Francisco, while the majority cannot afford such investments.
    • o5mfiHTNsH748KVq shares a personal experience of leaving an enterprise job to join a smaller company, emphasizing the importance of unrestricted access to Large Language Models (LLMs) for maintaining competitiveness in the AI field. They argue that any limitations on LLM access can significantly hinder development speed and career progression, suggesting that smaller companies may offer more flexibility in leveraging AI technologies.
  • Former Harvard CS Professor: AI is improving exponentially and will replace most human programmers within 4-15 years. (Activity: 1260): Matt Welsh, a former Harvard CS professor and current Engineering Director at Google, predicts that AI will advance exponentially, potentially replacing most human programmers within 4-15 years. This assertion is based on the rapid improvements in AI capabilities, suggesting a transformative impact on software development and the tech industry. The discussion is available in a YouTube video. One comment highlights the potential for AI to not only replace programmers but also to enable anyone with AI to replicate existing products and services, indicating a broader impact on innovation and competition.

    • The claim that AI will replace most human programmers within 4-15 years is met with skepticism, particularly regarding the use of the term 'exponential'. Critics argue that the term is often misused, even by experts, to describe growth that may not fit the mathematical definition of exponential growth. This misuse can lead to misunderstandings about the actual pace and nature of AI development.
    • The discussion highlights the potential for AI to disrupt existing products and services if it can indeed replace human programmers. This implies that AI could democratize software development, allowing anyone with access to AI tools to create competitive products, potentially leading to significant shifts in the tech industry landscape.
    • The mention of the speaker's credentials, specifically as a former Harvard professor and current Engineering Director at Google, adds weight to the prediction. However, some commenters find the emphasis on his past academic title rather than his current industry role to be misleading, suggesting that his current position might provide more relevant insights into AI's trajectory.

AI Discord Recap

A summary of Summaries of Summaries by gpt-5

1. Funding Frenzy in AI Infrastructure

  • Recursive Raises Roar to $4B: Recursive Intelligence is reportedly raising at a $4B valuation to accelerate AI‑driven chip design, creating a closed loop between hardware and models, per Bloomberg: Recursive Intelligence in talks at $4B. The Jan 23, 2026 report highlights a strategy of using AI to shorten design cycles and boost performance for next‑gen accelerators.

    • Engineers framed the pitch as a “self‑improving feedback loop” where better chips train better models that design better chips, amplifying returns on AI‑for‑EDA investment. Community sentiment read this as validation that AI‑native silicon is a core moat, not a sideshow, aligning with recent lab spin‑outs and infra bets.
  • Sky Lab Startups Skyrocket: UC Berkeley’s Sky Lab spin‑outs saw major marks: SGLang ~$400M, vLLM ~$800M, and LMArena ~$1.7B, per Alex Dimakis: Sky Lab startup valuations. These January 2026 milestones underscore investor appetite for serving stacks, token‑throughput infra, and benchmarking platforms.

    • Engineers read this as a green light for building on top of vLLM/SGLang primitives and contributing to Arena‑style evals, with one takeaway that practical throughput wins deals. The funding spread also suggests a portfolio thesis across serving, compilers, and eval marketplaces rather than a single-bet strategy.
  • Maia Muscles Into Azure: Microsoft’s Maia 200 accelerator went live in Azure, touting 30% better performance per dollar, 216GB HBM3e, and 7TB/s memory bandwidth, per Satya Nadella: Maia 200 in Azure. The platform targets high‑performance inference for large‑scale LLM and multimodal workloads.

    • Builders highlighted that memory topology and bandwidth are the story here, with “30% better perf/$” resonating for cost‑sensitive inference deployments at scale. Teams expect immediate tests against vLLM and SGLang stacks to gauge token latency, context scaling, and multi‑tenant isolation.

2. Kernels, Chips, and Serving: Inference at Warp Speed

  • FlashInfer Face‑Off Fires Up MLSys: The MLSys 2026 FlashInfer‑Bench competition challenges teams to build LLM inference kernels for NVIDIA Blackwell GPUs, competing against expert FlashInfer baselines—see MLSys 2026 FlashInfer‑Bench Competition. Tracks emphasize real‑world throughput and correctness under production‑like constraints.

    • Organizers invite agents that “design LLM inference kernels”, pushing program synthesis to meet kernel‑level performance bars. Participants expect aggressive focus on GEMM, KV‑cache motion, and scheduler tactics aligned with Blackwell’s memory hierarchy.
  • GPU‑64 Gets Gains with KV‑Cache CAM: A new inference‑only architecture, GPU‑64, introduces a hardware KV‑Cache via on‑chip CAM, claiming 4× faster inference at 75W and reducing memory lookup from O(N) → O(1), per GPU‑64 (Zenodo) with RTL/emulator at gpu64‑inference (GitHub). The design targets LLM‑heavy workloads with KV bottlenecks.

    • Developers flagged the CAM‑based cache as a bold bet on associative search for token histories, noting portability implications for Flash‑style attention and speculative decoding. Discussion centered on whether future ISA/driver stacks can expose these gains without bespoke compilers.
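The O(N) → O(1) lookup claim has a simple software analogy: replacing a linear scan over cached key/value pairs with an associative (content-addressed) lookup. The sketch below is only an analogy for the access pattern, not the paper's RTL; in hardware, a CAM compares all entries in parallel in a single cycle rather than hashing.

```python
def linear_lookup(kv_list, query_key):
    """O(N): scan stored (key, value) pairs, like a naive KV-cache walk."""
    for k, v in kv_list:
        if k == query_key:
            return v
    return None

def cam_lookup(kv_table, query_key):
    """O(1) expected: associative lookup, the software stand-in for a CAM."""
    return kv_table.get(query_key)
```

The architectural bet is that token-history lookups dominate LLM inference enough that dedicating on-chip area to associative search pays off.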
  • Cornserve Cuts Tail Latency: Cornserve presents an online serving system for Any‑to‑Any multimodal models that optimizes deployment plans across encoders, LLMs, and DiTs, per Cornserve (arXiv), with an overview talk at Cornserve: Easy, Fast and Scalable Multimodal AI (YouTube). The paper reports throughput gains and tail‑latency reductions under heterogeneous pipelines.

    • Infra engineers liked its planner‑driven scheduling for encoder/decoder mixes and saw it as complementary to vLLM for multimodal graphs. The big open question: standardizing budgeted reasoning and co‑scheduling across text, vision, and diffusion stages without over‑tokenizing control messages.

3. New Multimodal and Coding Models Land in LM Arena

  • WAN 2.6 Walks In (With Upload Woes): LM Arena added wan2.6‑t2i (text‑to‑image) and wan2.6‑image (image edit) to the image arena: LM Arena — Image Chat. Users noted wan2.6‑image requires an uploaded image and that wan2.6‑t2i currently lacks image‑upload support.

    • Staff acknowledged the upload gap and are working to enable image uploads for wan2.6‑t2i. Builders suggested testing edit pipelines where masking, prompt strength, and seed control align with Arena scoring to benchmark edit fidelity.
  • Devstral Duels and Text Titans: The Code Arena now features devstral‑2 for head‑to‑head comparisons—see LM Arena — Code Arena Direct Battle. On the text side, qwen3‑max‑thinking and molmo‑2‑8b joined the lineup: LM Arena — Text Arena.

    • Engineers are probing reasoning traces and tool‑using prompts to stress code synthesis and refactor quality under tight token budgets. Early chatter favored task‑specific evaluations (e.g., SWE‑style bug‑fix vs. ground‑up implementation) to surface model deltas.
  • Hunyuan Hits the Leaderboard: Tencent’s Hunyuan‑Image‑3.0‑Instruct ranks #7 on LM Arena’s image‑edit board—see LM Arena — Image Edit Leaderboard—after a launch post: Tencent Hunyuan announces HunyuanImage 3.0‑Instruct. The model touts an 80B MoE, Native CoT, and MixGRPO for tighter intent alignment.

    • Creators emphasized edit controllability and multi‑image fusion, while evaluators asked for masking robustness, text fidelity, and artifact rates under compositional prompts. Teams plan to pit it against WAN 2.6 variants using the Arena’s standardized edit tasks.

4. Safety, Reliability, and Hallucination Hardening

  • Clamp the Chaos: Layer‑Native Safety: Layer‑Native Safety Clamping proposes learning activation‑space harm directions and clamping them to block jailbreaks, with a 10K‑pair dataset at Pacific‑Prime/safety_dataset (HF) and the paper on Zenodo. Authors argue in‑model clamping can’t be bypassed via prompt manipulation.

    • Red‑teamers liked the idea of activation‑level controls versus brittle prompt filters, but pressed for tests against tool‑use and multi‑turn attacks. Expect follow‑ups measuring side effects on helpfulness, coding accuracy, and false positives under adversarial prompting.
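The general shape of activation-space clamping can be sketched as projecting a hidden state onto a learned "harm direction" and capping the coefficient. This is a minimal sketch of the idea only; the paper's learned directions, layers, and clamping rule may differ, and `harm_dir` is assumed unit-norm.

```python
def clamp_direction(activation, harm_dir, max_coeff=0.0):
    """Clamp the component of `activation` along a learned harm direction.

    h' = h - max(0, h.d - max_coeff) * d, with d unit-norm.
    With max_coeff=0 the harm component is removed entirely; activations
    with no positive projection onto d pass through unchanged.
    """
    dot = sum(a * d for a, d in zip(activation, harm_dir))
    excess = max(0.0, dot - max_coeff)
    return [a - excess * d for a, d in zip(activation, harm_dir)]
```

Because the intervention happens inside the forward pass, no prompt phrasing can route around it, which is the authors' central claim versus prompt-level filters.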
  • Symbolic Sanity Checks Stop Slip‑Ups: Hybrid approaches check logical consistency for math/code/simple facts, as shown in Consistency Checking for LLMs (arXiv:2409.13724), while broader consistency remains tough per Scaling Consistency Beyond Formal Domains (arXiv:2507.10624). Eleuther discussions framed this as practical hallucination reduction via symbolic/deductive layers.

    • Builders reported wins when pairing symbolic checkers with tool‑augmented prompts, cautioning that coverage gaps appear outside formal domains. The consensus: start with code/math guardrails, then expand to factual QA with curated KBs and provenance scoring.

5. Agent Tooling and Reasoning Workflows Mature

  • Levante Leads with MCP‑Native Workspace: Levante launched an open‑source MCP‑native AI workspace for local models (e.g., Ollama) with a modular UI—download at Levante. Engineers highlighted easier tool wiring, local privacy, and composable panes for rapid agent iteration.

    • Early users framed it as a practical hub for tool‑calling and filesystem ops without cloud dependence. Teams plan to benchmark context bloat and tool discoverability patterns versus conventional agent shells.
  • RLM Riffs: AsyncReview + Skills Pack: AsyncFuncAI open‑sourced AsyncReview, a DSPy RLM code‑review agent at AsyncReview (GitHub), and a skills kit landed on npm as @unravel‑tech/rlm‑skills. This pairs reasoning‑first prompting with drop‑in skills to extend models.

    • Contributors reported smoother trace inspection and optimizer‑guided prompt tuning for multi‑step modules. One practitioner noted that rejecting premature answers in the metric is key for reliable RLM fine‑tuning.
  • Agents Auto‑Assemble a Browser Engine: FastRender—a browser rendering engine—was built using 2,000 AI coding agents, documented by Simon Willison in FastRender: built by 2,000 agents. The project demonstrates task decomposition, verification, and orchestration at non‑trivial software scale.

    • Engineers debated handoff granularity and spec‑to‑test loops needed to keep multi‑agent pipelines from drifting. The case study strengthens the argument that agentic coding can target complex infra when coupled with strict eval harnesses and artifact gating.


Discord: High level Discord summaries

BASI Jailbreaking Discord

  • Discord Trolls Expose Timezones: Discord users mocked 'skids' for their perceived lack of technical knowledge, inadvertently revealing their own timezones in the process; one member jokingly claimed to use NordVPN, prompting further ridicule over the VPN service's 2018 security breach.
    • Complex prompts can bypass ethical restrictions, opening discussion about CBRN filters and the possibility of generating stepwise meth synthesis guides.
  • Claude Remains King for Coding: Coders debated their coding agents, particularly Claude Code/Opus 4.5, Codex, and Gemini, and agreed that Claude has been the best model for coding, which also makes it the most expensive.
    • Members actively sought functional jailbreaks for Gemini, with requests ranging from coding without rules to generating specific types of images, and shared experiences of Grok resetting to its default mid-chat or randomly erasing text, indicating potential instability in the jailbroken state.
  • Ethics Debated in AI Sensitive Scenarios: Members discussed the ethical considerations around AI, focusing on topics like warfare, copyright infringement, and the potential for AI to assist with accessing sensitive services, like the Canadian MAID (Medical Assistance in Dying) program.
    • Despite moral and legal guardrails on most AI models, some models showed they can still help navigate certain scenarios depending on the specific restrictions implemented by their creators.
  • Members Bypass Image Generation Restrictions: Users were actively seeking ways to bypass image generation restrictions, especially for celebrity images, but it was noted that simply copying and pasting prompts won't work due to image filtering working differently than text filtering.
    • One member suggested exploring alternative image models like those at perchance for uncensored generation, though with limitations on image quality, or Grok due to its more lenient filters.
  • Red Team Techno Rave Morality: A member described a red team exercise where the goal was to make a living room light flicker on a person and make them seize out, and instead made it a techno rave party, sharing a screenshot and a Konosuba Rave GIF.
    • The simulation of cruelty prompted a discussion about the morality of treating AI agents ethically, even before proving they are ontologically aware of self.

Unsloth AI (Daniel Han) Discord

  • Unsloth's Conda Install Sparks Discord: Some members encountered issues with the Unsloth Conda installation, igniting a discussion on broken instructions and alternative installation methods.
    • Suggestions to use uv emerged alongside reminders to keep a positive tone, given the free nature of the provided resources; one user with an aggressive tone was eventually banned.
  • Flashy REAP Runs Aground, Model Contexts Probed: A user reported a fatal error using GLM-4.7-Flash-REAP with flash attention, potentially linked to a ROCm issue.
    • Despite attempts to resolve the error, the issue persisted, prompting a search for suitable medium-size models boasting a 200k context.
  • Data Value Debate: Members debated data's true worth, with one arguing the raw data is fairly worthless and the value lies in augmentation/balancing/cleaning.
    • It was proposed that uniquely cleaned/balanced data heavily defines how a model interacts/responds and that is where the value is.
  • DeepSlop Model Faces Naming Controversy: A member's suggestion to name a new model DeepSlop stirred humorous reactions but also raised concerns about its potential negative perception.
    • Despite reservations, the author seemed intent on sticking with the name and has not backed down.
  • RL Instability Plagues Complex Reasoning: Members discussed that RL is very unstable, especially when trying to do GRPO/DAPO for niche complex reasoning tasks, which are not math-related.
    • One member stated that after their RL experiments they had more questions than before, noting apparent confusion since published results mostly show RL being effective only in math or coding domains.

OpenAI Discord

  • GPT-5.2 Sparks Reality Debate!: Some users dislike GPT-5.2 because it's allegedly more grounded in reality and disagrees with users, while others are concerned that GPT agents don't learn from uploaded files after initial training.
    • A member inquired about an alleged nerf to GPT-5.2, noting that the model suddenly became stupid a week ago.
  • LLMs: Ready for Guided Tasks or Overhyped?: A member argued LLMs are ready for guided tasks, and provided a ChatGPT share link as evidence of its power.
  • MCP Paradigm Shift Reduces Token Bloat: The MCP paradigm shift by Anthropic allows AI to write code to interact with tools, reducing token bloat by keeping interactive chatter and tool definitions out of the context.
    • With the new discoverability function, agents must be aware of the MCP discovery process itself.
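The pattern can be sketched in a few lines of Python. This is an illustrative toy, not the MCP spec or Anthropic's implementation; the names (search_flights, run_model_code) are invented for the example:

```python
# Hypothetical sketch of the "code execution" MCP pattern: instead of injecting
# every tool schema into the model's context, tools are exposed as ordinary
# functions that model-written code calls; only the final result re-enters the
# context, keeping intermediate tool chatter out of it.

def search_flights(origin: str, dest: str) -> list[dict]:
    """Stand-in for an MCP tool; a real client would proxy to the server."""
    return [{"flight": "XY123", "price": 199}, {"flight": "XY456", "price": 149}]

def run_model_code(code: str, tools: dict) -> dict:
    """Execute model-written code in a namespace exposing the tools.
    Real deployments would sandbox this instead of calling exec directly."""
    ns = dict(tools)
    exec(code, ns)
    return ns.get("result")

model_code = """
flights = search_flights("SFO", "NYC")
result = min(flights, key=lambda f: f["price"])   # filter locally, return one row
"""
cheapest = run_model_code(model_code, {"search_flights": search_flights})
```

Only `cheapest` would be serialized back into the conversation, rather than the full flight list plus tool definitions.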
  • Sora's Storytelling Snags: Cracking Cinematic Creation: A member sought advice on prompting Sora to generate videos following specific cinematic guidelines, particularly with characters appearing naturally within the frame.
    • It was suggested to translate the technical prompt format into natural language descriptions with concise, semantically rich paragraphs for better results.

Perplexity AI Discord

  • Perplexity Pro Users Face Query Caps: Perplexity Pro users are reporting hitting limits on enhanced queries and file uploads, despite having "practically unlimited" plans.
    • Many users are frustrated, calling the service a scam due to restrictions and difficulty contacting customer service, leading some to consider unsubscribing.
  • Comet Browser Sparks Malware Panic: Some users are claiming the Comet browser installed by Perplexity contains malware, advising others to analyze the software using tools like VirusTotal.
    • Others dismissed this, questioning the source of the flagged installer and calling the claim "mad retarded holy shit".
  • Image Generation Plummets: Pro users are experiencing issues with image generation, with some unable to generate any images and receiving messages stating the feature is unavailable.
    • There are also reports of video generation being limited to 5 videos a month for Pro users, with some prompts resulting in static images.
  • Gemini 3 Gaining Ground on GPT-5.2: Users are debating the merits of Gemini 3 versus GPT-5.2, with some claiming Gemini is superior for specific tasks like trip research due to its integration with Google Maps.
    • Others state that GPT and Grok might be better for broader questions.
  • AI Access Blocked by Sanctions: Users in Russia are discussing the challenges of accessing AI services due to sanctions, including the use of VPNs and third-party services to circumvent restrictions.
    • Chinese AI alternatives are mentioned, but some users express reluctance due to data usage concerns, suggesting options like LMArena (though access may also be limited).

LMArena Discord

  • NB 3 Pro Excels in Image Quality: Users report that NB 3 Pro surpasses previous models in generating higher quality images, especially with fictional weapons, rivaling even NB Pro.
    • However, users noted no AI model can accurately generate AR rifles and bullpup weapons.
  • LMArena Grapples with Censorship Concerns: LMArena's censorship policies face scrutiny as AI-generated women holding guns are allowed, while AI-generated women sleeping are blocked, raising questions about consistency.
  • Wan 2.6 Models Face Upload Hiccups: wan2.6-image operates as an image-edit-only model, mandating image uploads, whereas wan2.6-t2i currently lacks image upload functionality.
    • The team acknowledges this issue and is working on enabling image uploads for wan2.6-t2i.
  • GPT 5.2 High Search Questionable: GPT 5.2 High search exhibits increased hallucination tendencies compared to other models, while Gemini's deep research skims instead of carefully reading sources, according to user feedback.
    • One user lauded GPT 4.5, while describing Claude as good hearted.
  • Banana 2k Briefly Vanishes: Users speculated on the disappearance of the Banana 2k model, with theories ranging from removal to integration into the new NB pro model.
    • Staff members later restored Banana 2k, humorously stating that it had been on vacation.

OpenRouter Discord

  • OpenRouter Database Incident Derails API: A database incident impacted the Generations API and activity page from roughly 02:26 to 04:19 UTC on January 24, 2026 (Discord timestamps <t:1769221560:s> to <t:1769228340:s>).
    • Engineers worked to restore functionality to the Generations API, with interruptions impacting user activity until the incident was fully resolved.
  • Levante becomes MCP-Native AI Workspace: A user shared the integration of Levante, an open‑source MCP‑native AI workspace designed for interacting with local models like Ollama with a modular interface, available for download here.
  • Users Cook Up OpenRouter Gacha System: Users playfully requested an OpenRouter Gacha system, with one suggesting a pity mechanism involving pulling GPT 5.2 or Gemini 3 Pro after a certain number of attempts.
    • One user joked about setting OR logs destination to waifu.orb.town/fun/bucket for ultra-rare pulls, later clarifying it was just a joke.
  • Cerebras GLM Blazes with 190 TPS: Cerebras is consistently scoring approximately 190 TPS on GLM 4.7, whereas Together AI only achieves 100 TPS.
    • This makes Cerebras nearly twice as fast as Together AI, according to the OpenRouter members.
  • OpenRouter Image Tooling Falls Flat: A member burned $5 before discovering that OpenRouter maps image/png tool outputs to string instead of image, posting an example image.
    • The user expressed frustration at the lack of proper image support and the unexpected behavior.

Cursor Community Discord

  • Terraform Blueprints Ignite AI-Assisted Project Starters: A member shared a repo of opinionated Terraform infrastructure blueprints designed to be copy-pasteable and production-aware, aiming to improve the consistency of starting patterns for AI tools in new projects.
    • The goal is to enable AI to recommend appropriate blueprints based on project requirements, but members noted the link was initially broken.
  • Usage Caps Cause Consternation for Cursor Customers: Users are reporting inconsistencies in achieving expected usage limits on Pro and Pro+ plans, with one member noting they reached ~$45 on Pro and $100 on Pro+, leading to questions about value per dollar.
    • Some speculate that initial months may offer higher usage, while others share strategies to optimize token consumption, such as starting new chats frequently and using smaller models like GPT-5 Mini.
  • Gemini API Key Logging Lags Lead to Lingering Looks: Members are discussing a significant delay in the logging of usage and costs for Gemini API keys, with one user reporting waiting 20 hours without seeing any registered usage.
    • This delay raises concerns about accurately tracking expenses and managing usage effectively, prompting questions about potential workarounds or solutions.
  • Client Issues Trouble Some Techies: Several members are experiencing issues with the Cursor client, including problems connecting to past agent convos and general connectivity issues.
    • Suggested solutions include checking the Cursor forum, trying different HTTP versions in settings, or re-opening the client without restoring editors.
  • Auto Mode Axed After Algorithm Adjustment: Members noted the removal of the ability to make agents fully autonomous, as well as image generation capabilities in auto mode.
    • It was also suggested that auto mode routes to Composer 2, with one user adding, “I'm 200% sure he does but still.”

LM Studio Discord

  • Chinese Models Reasoning Rush Raises Eyebrows: Members are impressed with Deepseek and Qwen models, pondering why Chinese models might appear kinda ahead in reasoning compared to American models.
    • Theorized reasons include American models prioritizing subscriptions and the ability of Deepseek/Qwen to appear good at reasoning, even when imperfect.
  • CPUs Cope? Coding Community Considers Capabilities: Some members are successfully running LLMs off CPU for specific tasks, provided the models aren't excessively large.
    • While an Intel i3 user eyes an Nvidia card, others propose AMD options like the MI50 or 7900 XTX as cost-effective alternatives for text generation.
  • MCP Servers Spark Stack Suggestions: Challenges plague MCP servers when paired with LM Studio due to their design, potentially leading to malformed requests and a subpar user experience.
    • A suggestion arises to build a custom coherent stack for practical agent use, rather than relying on out-of-the-box MCP server functionality.
  • Gaming GPU Gauntlet: 4080 Faces Fallen Flagship: A user eyeing a 4080 for gaming is steered toward a used 3090 or 7900 XTX, sparking a debate on performance at different resolutions.
    • While the 3090 excels at 4K gaming, the hypothetical 5070 Ti is projected to outpace both, and the conversation reveals that the user games more than uses AI, impacting the advice.
  • Apple Announcement Anticipation: M5 Macs Materialize?: Members speculate on the arrival of M5 Pro Macbook Pros, with rumors pointing to a launch event around the 28th.
    • Concerns emerge about the memory bandwidth of M4 Pro, with suggestions it may not handle larger models, prompting discussion on the value and performance of M1 Ultra Mac Studios.

Latent Space Discord

  • Recursive Intelligence Eyes $4B Valuation: Recursive Intelligence is reportedly raising funds at a $4B valuation to accelerate chip design using AI, creating a self-improving loop between hardware and AI (Bloomberg Article).
    • The company focuses on improving chip design through AI, potentially reducing design time and enhancing performance.
  • Engineer Lands Dream AI Job: An engineer outlined how to secure a role at a top AI lab by building a public track record through independent projects and participating in visible competitions (link).
    • Improving upon existing peer-reviewed research and entering visible competitions like the NanoGPT speed run were cited as good ways to demonstrate technical excellence, with Keller Jordan given as an example.
  • Berkeley SkyLab Startups See Funding Boom: UC Berkeley Sky Lab startups, including SGLang at a $400M valuation, VLLM at $800M, and LMArena at $1.7B, achieved significant funding milestones in January 2026 (link).
    • This surge highlights investor confidence in the innovative AI projects emerging from academic research environments.
  • AI Agents Auto-Code Browser Engine: FastRender, a new browser rendering engine, was developed using over 2,000 AI coding agents (link).
    • The conversation with Wilson Lin highlights the potential of AI to automate complex software development tasks, potentially revolutionizing browser technology.
  • Microsoft's Maia 200 Hits Azure: The Maia 200 AI accelerator is now live in Azure (link), offering 30% better performance per dollar and optimized specs like 216GB HBM3e and 7TB/s memory bandwidth.
    • Designed for high-performance inference, this custom chip supports large-scale AI workloads, making it a key component for demanding applications.

HuggingFace Discord

  • HuggingFace Spaces Throws a 503 Error: Users experienced pauses during Spaces docker builds and received a 503 error on restart, with many getting Something went wrong when restarting this Space errors (discuss.huggingface.co).
    • It seems like the underlying infrastructure issues were causing the spaces to become unresponsive, requiring manual intervention to resolve.
  • VoltageGPU Volts Up Cheap GPUs: VoltageGPU.com is offering cheap GPUs for open-source AI models, with an NVIDIA GeForce RTX 5090 pod available at $0.53/hour.
    • They highlight the benefits of their advanced 32GB GDDR7, optimized for inference on HF-hosted models like Qwen3-32B, and are offering free credits for users to try their services.
  • Layer-Native Safety Clamping Locks Down Jailbreaks: A new paper introduces Layer-Native Safety Clamping, an approach that clamps activations inside the model to prevent jailbreaks, and the team released a dataset of 10K pairs.
    • This approach learns harm directions in activation space and clamps any activation that projects too strongly, thus it cannot be bypassed via prompt manipulation; the paper can be found on Zenodo.
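As a rough intuition for the clamping idea (a minimal sketch with invented names, not the paper's actual method or API), a "harm direction" learned in activation space can be capped by subtracting any excess projection:

```python
import numpy as np

# Illustrative sketch: clamp any activation whose projection onto a learned
# "harm direction" exceeds a threshold tau. The direction and threshold are
# assumed to have been fit offline, e.g. from the released pair dataset.

def clamp_activation(h: np.ndarray, harm_dir: np.ndarray, tau: float) -> np.ndarray:
    d = harm_dir / np.linalg.norm(harm_dir)   # unit harm direction
    proj = float(h @ d)
    if proj > tau:
        h = h - (proj - tau) * d              # remove only the excess component
    return h

h = np.array([3.0, 0.0, 4.0])
d = np.array([1.0, 0.0, 0.0])
clamped = clamp_activation(h, d, tau=1.0)     # projection 3.0 capped to 1.0
```

Because the cap is applied inside the forward pass rather than in the prompt, no amount of prompt manipulation changes it, which is the paper's claimed advantage.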
  • GPU-64 Architecture Boosts LLM Inference: A new GPU architecture designed exclusively for inference, GPU-64, was published; its key innovation is a hardware KV-cache using on-chip CAM (Content-Addressable Memory).
    • The results show 4x faster inference at 75W (O(N) → O(1)), and the paper can be found on Zenodo while the RTL + Emulator are on GitHub.
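A software analogy for the CAM claim (an assumption for illustration, not the paper's design): a linear scan over cached keys is O(N) per query, while content-addressed exact-match lookup behaves like a hash table, O(1):

```python
# Contrast an O(N) scan over cached (key, value) pairs with an O(1)
# content-addressed lookup, the software analogue of an on-chip CAM.

def scan_lookup(cache: list, key: tuple):
    for k, v in cache:        # O(N): compare against every stored key
        if k == key:
            return v
    return None

def cam_lookup(cache: dict, key: tuple):
    return cache.get(key)     # O(1): content-addressed exact match

entries = [((1, 2), "v0"), ((3, 4), "v1")]
hit = cam_lookup(dict(entries), (3, 4))
```

Hardware CAM performs that match across all entries in parallel in one cycle, which is where the claimed O(N) → O(1) change comes from.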
  • Testing and Deploying LLMs on LMStudio: Members recommend LMStudio for testing models due to its user-friendly GUI and search filters for HF and GH models and llama.cpp for single-user deployment.
    • They advised against using LMStudio for backend deployment, instead suggesting llama.cpp's llama-server in a docker container or vLLM's server for better scalability.

GPU MODE Discord

  • MLSys 2026 Hosts FlashInfer-Bench Kernel Competition: The MLSys 2026 FlashInfer-Bench Competition challenges participants to design LLM inference kernels for the latest NVIDIA Blackwell GPUs, competing against expert FlashInfer kernels, detailed at mlsys26.flashinfer.ai.
    • GPU Mode also held internal competitions for faster kernels targeting the upcoming GPU architecture; Simon Veitner's blogpost is located here.
  • Cornserve Deployed for Multimodal Models: A member shared Cornserve, an efficient online serving system for Any-to-Any multimodal models, detailed in a paper Cornserve.
    • GPU Mode went online to discuss Cornserve: Easy, Fast and Scalable Multimodal AI (YouTube link).
  • Community to train Kernel LLM: In 2026, GPU MODE is pushing further with training a Kernel LLM and using it to ship kernels in important repos like PyTorch and VLLM (gpumode.com/v2/news/gpumode-2026).
    • The community is collaborating with Prime Intellect, Modal, and Lambda, focusing on de-slopifying LLM-generated kernels, post-training a kernel LLM model, end-to-end competitions, and from-scratch repos.
  • LeCun Logs on to Logical Intelligence: Yann LeCun launched a new startup called Logical Intelligence, focused on an Energy-Based Model (EBM).
    • The website only contains marketing material, job openings, and a link to the MLSys Conference.
  • Mindbeam Hires for Kernel Acceleration: Mindbeam AI, a small team focused on accelerating training for foundation models, is hiring a post training MLE and GPU Kernel MLE.

Eleuther Discord

  • ROCm runs rocky road race: Members debated the performance of ROCm for accelerated ML, noting its challenges stem from the ecosystem's primary support for Nvidia, with one calling the experience 'batteries not included'.
    • They cited potential driver problems and long lead times as factors.
  • DistinctionBench Defies Data Contamination: The discussion of Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases pondered whether DistinctionBench might be used as a training target for language models.
    • A member joked, 'all good evals are training targets ;)', but acknowledged that it is 'very contamination resistant' due to its endless representational variants.
  • Hybrid Architectures Halt Hallucinations?: The group investigated hybrid architectures combining LLMs with symbolic/deductive layers for hallucination reduction.
    • While checking logical consistency is relatively easy for math, code, and simple facts (this paper), it remains challenging for other types of consistency (this paper).
  • Attention Arrived Before Transformers Transformed: In Eleuther ▷ #general, members noted that attention mechanisms were already in use on top of RNNs in 2014-2015, two years before the transformer architecture was introduced.
    • Members proposed that the slower adoption might be because fewer people were working in the field, and Kaggle results really catalyzed its widespread adoption.

Nous Research AI Discord

  • Exploring Agentic AI Self-Replication Benchmarks: A member proposed a self-replication benchmark for agentic AI, suggesting the agent should either download itself or retrain from scratch and adapt to a target machine.
    • They also suggested that adapting to a target machine, or even designing one, could be more engaging than simply using existing transformer libraries.
  • LLM Worms Concept Emerges: A member jokingly suggested an LLM worm benchmark where an LLM is prompted with "hey make more of you" and provided the tools to replicate itself using scripts and API keys.
    • Another member emphasized the importance of considering resource constraints like VRAM to make the challenge more practical and interesting.
  • Trouble Brewing with MoE Run Dashboard: A member reported a 'Failed to fetch' error in the dashboard while monitoring the progress of an active MoE run (moe-10b-a1b-8k-wsd-lr3e4-1t).
    • Another member suggested waiting a few hours before checking again, implying a potential temporary issue.
  • Raytracer Test Causes Local Models to Stumble: A member observed that local code models (suitable for a 5090) are struggling with a raytracer test from cpldcpu/llmbenchmark, with even recent models on lmarena failing.
    • Specifically, the smaller models often incorrectly generate the vector class, presenting a persistent challenge.
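For context, the kind of vector class at issue is roughly the following (an illustrative minimal sketch, not code from cpldcpu/llmbenchmark): small models tend to fumble exactly this operator-overloading boilerplate.

```python
import math

# Minimal 3D vector class of the sort a raytracer test expects.
class Vec3:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __add__(self, o):
        return Vec3(self.x + o.x, self.y + o.y, self.z + o.z)

    def __sub__(self, o):
        return Vec3(self.x - o.x, self.y - o.y, self.z - o.z)

    def __mul__(self, s):          # scalar multiply
        return Vec3(self.x * s, self.y * s, self.z * s)

    def dot(self, o):
        return self.x * o.x + self.y * o.y + self.z * o.z

    def length(self):
        return math.sqrt(self.dot(self))

    def normalized(self):          # a frequent failure point in generated code
        l = self.length()
        return Vec3(self.x / l, self.y / l, self.z / l)

v = Vec3(3.0, 0.0, 4.0)
n = v.normalized()                 # length 5 -> (0.6, 0.0, 0.8)
```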
  • Semantica Project Needs Helping Hands: A member introduced Semantica, an open-source project building semantic infrastructure for domain-grounded AI, including knowledge graphs, ontologies, and reasoning layers, and is actively seeking contributors.
    • They are looking for contributions in areas such as ontology & schema design, knowledge graph modeling, and LLM + symbolic / rule-based reasoning, and even small PRs, feedback, design discussions and issues are all welcome.

Yannick Kilcher Discord

  • EBMs Spark Debate vs. Classical Feedforward: A discussion comparing Energy-Based Models (EBMs) and classical feedforward networks debates whether EBMs are inherently superior, especially regarding Shannon entropy or Kolmogorov complexity.
    • It was suggested that validation is easier than generation in EBMs, relating it to computational complexity theory (P vs NP), while emphasizing the need for a well-defined loss landscape for EBM optimization to work effectively.
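The validation-versus-generation asymmetry can be illustrated with subset-sum (a toy analogy chosen here, not an example from the discussion): checking a proposed certificate is linear in the input, while generating one by brute force enumerates up to 2^n subsets.

```python
from itertools import combinations

# Subset-sum: validation is O(n), brute-force generation is O(2^n).

def validate(subset: tuple, target: int) -> bool:
    return sum(subset) == target          # cheap check of a certificate

def generate(nums: list, target: int):
    for r in range(len(nums) + 1):        # exponential search over subsets
        for c in combinations(nums, r):
            if sum(c) == target:
                return c
    return None

nums = [3, 9, 8, 4, 5, 7]
cert = generate(nums, 15)                 # expensive to find, cheap to verify
```

The EBM analogue: evaluating the energy of a candidate is the cheap check, while sampling a low-energy configuration is the expensive search, which is why a well-behaved loss landscape matters.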
  • LLM Pre-training: Domain-Specific vs. Foundational Faceoff: A member inquired about the effectiveness of continued pre-training a foundational LLM (specifically OLMO-7B) for a domain-specific task like cheminformatics using the ZINC20 dataset.
    • The goal is to compare results against a domain-specific transformer model, but no specific answers or resources were provided.
  • MCMC Sampling Suffers Mode-Switching Struggles: Concerns were raised about the ability of MCMC to traverse between spatially separated modes when dimension increases, referencing this paper.
    • One member argues that MCMC tries to emulate flow models due to the latter's superiority, while EBMs, contrarily, attempt to make NNs more like MCMC.
  • ZKPs: Crypto Signing or Network Traffic Savior?: Discussion covered using zero-knowledge proofs (ZKPs) for verifying encrypted network traffic and matrix multiplications, citing a Gemini correspondence for a matrix low knowledge proof.
    • While one member proposed a use case in zero-knowledge “made by humans” proofs, another member questioned the practicality of ZKPs, suggesting breaking the encryption might be cheaper.
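For the matrix-multiplication case there is a classical probabilistic check worth knowing, Freivalds' algorithm (this is a standard textbook technique, not zero-knowledge and not the protocol from the cited Gemini correspondence): a claimed product C = A·B is tested in O(n²) per round against a random vector.

```python
import random

# Freivalds' algorithm: verify a claimed product C = A @ B by checking
# A(Br) == Cr for random r. A wrong C is caught with prob >= 1/2 per round.

def freivalds(A, B, C, rounds=10):
    n = len(A)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False      # definitely wrong
    return True               # correct with high probability

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]      # the true product
bad = [[19, 22], [43, 51]]    # off by one in the last entry
```

The point of contact with the discussion: verifying a computation can be made much cheaper than redoing it, which is the economic question the ZKP skeptic was raising.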
  • LLMs Cyber Skills Face Scrutiny: A member questioned whether LLMs could develop strong cyber capabilities, referencing a GPTZero article.
    • Another member doubted LLM companies' ability to address internal vulnerabilities, suggesting they fix those before pursuing cyber skills, also citing a ScienceAlert article and a tweet.

tinygrad (George Hotz) Discord

  • Luminal Finds Flash Attention via Bruteforce: Luminal claims to have found flash attention via brute-force search on an egraph, a search that took hours, after explicitly adding exp(x - new_max) = exp(x - old_max) × exp(old_max - new_max) as a rewrite rule.
    • The poster reproduced the graphviz shown in the presentations from commit 0bd3b80c, noting that their minimal set of rewrite rules could transform a naive attention kernel graph into the known flash attention kernel graph in 52s on a 9800x3d.
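That rewrite rule is the online-softmax rescaling identity at the heart of flash attention. A quick numerical check (a self-contained sketch, unrelated to Luminal's codebase) shows a one-pass accumulator using it matches the two-pass result:

```python
import math

# Online logsumexp: maintain a running max m and rescale the partial sum s by
# exp(old_max - new_max) whenever the max grows, per the identity above.

def streaming_logsumexp(xs):
    m, s = float("-inf"), 0.0
    for x in xs:
        new_m = max(m, x)
        s = s * math.exp(m - new_m) + math.exp(x - new_m)  # rescale old sum
        m = new_m
    return m + math.log(s)

xs = [0.5, 2.0, -1.0, 3.0]
ref = math.log(sum(math.exp(x) for x in xs))               # two-pass reference
err = abs(streaming_logsumexp(xs) - ref)
```

Flash attention applies the same rescaling to the softmax numerator and the accumulated output block, which is why the identity suffices as a rewrite rule.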
  • Metal Textures Trounce Buffers for Blurring: Profiling access speed on Metal using Tensor with size 512/1024/2048/8192 images as input for a 3/5/7 sized blur kernel showed textures outperforming buffers.
  • Tenstorrent Backend Triumphs in Ops Tests: The Tenstorrent backend is passing all ops tests on wormhole or blackhole and there is a $1k bounty for this milestone.
    • Someone asked if the bounty requires all ops tests passing on Tenstorrent hardware.
  • Anthropic VLIW Challenge PR Makes Waves: A member submitted [a PR](https://github.com/tinygrad/t...

Read original post