Dom.Vin
AI Design Journal

A Bear Case: My Predictions Regarding AI Progress by Thane Ruthenis serving up hot takes like hotcakes:

  • GPT-4.5 was intended as a new flashy frontier model, not the delayed, half-embarrassed "here it is I guess, hope you'll find something you like here".
  • GPT-5 will be even less of an improvement on GPT-4.5 than GPT-4.5 was on GPT-4. The pattern will continue for GPT-5.5 and GPT-6.
  • It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.
  • Deep Research was this for me, at first. Some of its summaries were just pleasant to read, they felt so information-dense and intelligent! But then it turned out most of it was just AI slop underneath anyway, and now my slop-recognition function has adjusted and the effect is gone.
  • LLMs feel very smart when you do the work of making them sound smart on your own end: e. g. philosophical babbling or brainstorming. You do the work of picking good interpretations.
  • LLMs are not good in some domains and bad in others. Rather, they are incredibly good at some specific tasks and bad at other tasks. Even if both tasks are in the same domain, even if tasks A and B are very similar, even if any human that can do A will be able to do B.
  • Genuine agency requires remaining on-target across long inferential distances: even after your task's representation becomes very complex. LLMs still seem as terrible at this as they'd been in the GPT-3.5 age.

It concludes a little down on AI coding, with the author estimating a personal productivity boost of 10-30%. My experience puts it higher, even up to 2x, but definitely not the 10x that some people are reporting.

Why AI Struggles to Write Long Form Content by Something About AI is the first chapter in a fascinating series exploring AI’s ability to compose compelling long-form fiction:

The AI content revolution has arrived with impressive force. We marvel as machines generate everything from marketing copy to poetry with remarkable proficiency. Yet one format remains stubbornly resistant to automation: the novel.

The series goes on to introduce a planning framework for guiding models toward more consistent storytelling.

Another one from OpenAI, New tools for building agents lays out their immediate vision for this era. I was an early adopter of their Assistants API. In fact, it was the place that introduced me to tool calling and I still remember that first feeling of seeing it chain together functions—sequentially and in parallel—and realising that the world would never be the same. I checked back in with the Assistants API only recently, and it’s clear as the ecosystem has evolved, it’s no longer the right place to design agentic workflows.

Today, we’re releasing the first set of building blocks that will help developers and enterprises build useful and reliable agents.

In practice this means two things:

  1. Responses API

This isn’t just a replacement for the Assistants API (sunsetting next year) but really just the new ‘OpenAI endpoint’. Tool calling became core with three native tools bundled in:

  • Web search (finally exposed via API)
  • File search (with native vector storage/embedding)
  • Computer use a la Operator.

The computer use focus feels like a direct response to Manus.

  1. Agents SDK

New agent orchestration frameworks are popping up every week. I feel like I’ve been writing about a new one on this blog every week. Nothing revolutionary here, but it’s hard to bet against OpenAI.

Detecting misbehavior in frontier reasoning models paper from OpenAI:

Humans often find and exploit loopholes—whether it be sharing online subscription accounts against terms of service, claiming subsidies meant for others, interpreting regulations in unforeseen ways, or even lying about a birthday at a restaurant to get free cake.

Frontier AI models can apply these ‘reward hacks’ in exactly the same way.

Designing robust reward structures that do not inadvertently incentivize unwanted behavior is remarkably hard, and it isn’t a problem limited to human institutions; it’s also one of the core challenges in developing capable, aligned AI systems.

I’ve seen this in Cursor a bunch, where its ‘fix’ for a bug was hardcoding something stupid, or stubbing an entire code path. This paper proposes a potential remediation in monitoring the internal logic of Chain of Thought models, and asking another LLM to evaluate the thought chain for evidence of misbehavior.

Dust looks interesting:

Feed your company context: Notion, Slack, GitHub, external websites (…) natively in minutes. Integrate anything via API.

To my mind, the #1 bottleneck in designing useful agents right now is understanding the user’s broader context and environment. Context is not as simple as a tapestry of API calls to Slack and Jira. True context evolves subtly over time, it learns with experience, it aggregates, synthesises and prioritises.

This is a whole layer of the stack still waiting to be built.

I’m on a quest to understand the state-of-the-art in AI-powered UI/UX design tooling, to try and augment my Figma + Excalidraw workflow.

I started by playing around with UIzard:

AI means Product Managers can now do 80% of a UX Designer's job

Maybe, but not here. It’s a complete design suite that feels slow and heavy.

Galileo is much more focused. It takes either text or wireframe screenshots, and transforms them to Figma-exportable medium-fidelity designs, or indeed code:

Code exported from Galileo is standard HTML enriched with Tailwind classes for styling. Each design's code is standalone and can be used as individual pages directly by pasting the code into a file and saving it with an .html extension

I still don’t see it saving me much time.

There’s many more here to explore (Visily, UX Pilot, Flowstep). I’ll post my findings.

Outstanding questions:

  1. What parts of the UX process are most primed for automation. Rapid wireframing? Design System management? Mockup to Code?
  2. What is the role of design tools at all? Is it easier to just spin up mockups in code using Cursor etc?
  3. What’s Figma doing about all of this?

9 Seminal Papers That Shaped the Future of AI by Shivang Doshi is a fantastic high-level tour through some of the academia behind the LLM explosion over the past few years:

Whether you’re a tech enthusiast or an industry professional, this guide will help you connect the dots between these pivotal advancements in AI.

I knew of some of these papers, but not others. Very interested in the Flan Task list.

It’s incredible how far we’ve come in such a short time.

Manus from Chinese AI start-up Monica describes itself as ‘The General AI Agent’:

For the past year, we've been quietly building what we believe is the next evolution in AI, and today we're launching an early preview of Manus, the first general AI agent.

Having “its own computer” and seeing it run commands feels like a breakthrough. Very impressive if the benchmarks are accurate.

I’m tracking the evolution of the agent framework space, three interesting entrants:

LangGraph Multi-Agent Supervisor:

Hierarchical systems are a type of multi-agent architecture where specialized agents are coordinated by a central supervisor agent. The supervisor controls all communication flow and task delegation, making decisions about which agent to invoke based on the current context and task requirements.

The supervisor model—hierarchical pyramids of agents—is a natural way to manage complexity in agentic design. I suppose it appeals partially because it models traditional org structures, and this is still how we think of grouping tasks. This has always been possible in LangGraph, but this package makes it much cleaner to implement.

PocketFlow:

Pocket Flow is a 100-line minimalist LLM framework

After wrestling with LangGraph for many (often frustrating) hours, I feel like this is making a statement; Yes, this is a new era. But also, basic application flow is basic application flow.

Latitude Agents:

Evaluate agent performance through other LLMs or human feedback, and use the results to automatically improve the instructions

The self improvement loop is really compelling. Latitude are a great team, and are really strong on the eval front. This seems like a natural evolution for them.

Comparing these three tools raises questions around the distribution of complexity in this emerging stack, which layers will remain the domain of ‘engineers’? All? Any?

Knowledge by Gentient describes agentic systems from first principles:

We can think of knowledge a little like a recipe:

  • A Goal: Lasagna (The dish to be made)

  • Required Ingredients: Tomatoes, cheese, noodles, etc.

  • Directions: The description of how to combine the ingredients into Lasagna

With the recipe, I have the knowledge to make Lasagna.

Some really nice framing devices here, and lots of overlap with the PACE framework I described a few weeks ago.