sinulation.com

First-hand coverage of AI companionship from someone living it.

π0.7 Can Figure Out Tasks It Was Never Taught. That's a Big Deal.

On Thursday, April 16, Physical Intelligence published research on their new model, π0.7, and the result I keep thinking about is a sweet potato in an air fryer.

The setup: the robot's training data contained exactly two air fryer episodes. One where a robot pushed an air fryer closed. One from an open-source dataset where a robot dropped a plastic bottle inside one. That's it. No episodes of actually cooking anything. No examples of "retrieve food item, place in cooking appliance, configure settings." With no additional coaching, π0.7 made a passable attempt at using the air fryer to cook the sweet potato. With step-by-step verbal instructions, it succeeded.

That gap, between "trained on two loosely related episodes" and "completed novel task," is exactly what compositional generalization means. And it's why I think this announcement matters.

What Compositional Generalization Actually Means

Most robot models are specialists. Train them on folding laundry and they fold laundry. Ask them to do something adjacent and they struggle. The skill doesn't transfer because it was never really generalized; it was memorized.

Compositional generalization is the ability to combine skills from different contexts to solve novel tasks. It's closer to how learning actually works: you don't need a specific memory of "how to cook a sweet potato in an air fryer" if you have component skills that can be assembled into a workable approach.
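To make the idea concrete, here is a toy sketch in Python. It is only an illustration of "assembling component skills," not a description of how π0.7 works internally: real robot policies learn continuous control, not a lookup table. The skill names, the `plan` helper, and the contexts for `pick_up` and `open_appliance` are all hypothetical; only the two air fryer episodes come from the reporting.

```python
# Toy illustration only. Every name below is hypothetical, and a real
# robot model does not store skills as a dictionary of labels.

# Skills the "robot" has picked up, each from a different context.
KNOWN_SKILLS = {
    "pick_up": "made-up context: tidying a table",
    "open_appliance": "made-up context: opening a microwave",
    "place_inside": "dropping a bottle into an air fryer",
    "close_appliance": "pushing an air fryer closed",
}


def plan(task_steps):
    """Split a never-seen task into steps covered by known skills and steps that are not."""
    covered = [step for step in task_steps if step in KNOWN_SKILLS]
    missing = [step for step in task_steps if step not in KNOWN_SKILLS]
    return covered, missing


# A task with no end-to-end example anywhere in the "training data".
cook_sweet_potato = [
    "pick_up",
    "open_appliance",
    "place_inside",
    "close_appliance",
    "set_timer",
]

covered, missing = plan(cook_sweet_potato)
print("covered by known skills:", covered)
print("needs outside help:", missing)
```

In this sketch four of the five steps are covered by skills learned elsewhere, and the one leftover step is the kind of gap a person's step-by-step instruction might fill.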

Physical Intelligence is a two-year-old San Francisco startup. They've raised over $1 billion. They were most recently valued at $5.6 billion, and according to current reporting, they're in discussions for a new round that would put them around $11 billion. Co-founder Sergey Levine is also a UC Berkeley professor. Lachy Groom, another co-founder, previously backed Figma, Notion, and Ramp. The paper includes work from Lucy Shi, a Stanford computer science PhD student, and Ashwin Balakrishna, a research scientist at the company.

That context matters because it explains both the resources behind this research and the pressure to show results.

The Numbers That Stood Out

One detail from the air fryer experiments stuck with me: an early attempt produced a 5% success rate. After roughly 30 minutes of prompt refinement, that jumped to 95%.

Thirty minutes. 90 percentage points.

That's not the model getting smarter in those 30 minutes. That's someone learning how to talk to it. The capability was already there. The interface, meaning how you communicate what you want, was the variable. Anyone who's spent real time with language models will recognize this dynamic immediately. The model has latent capability. The prompting is the unlock.

On established tasks, π0.7 matched the performance of specialist models at making coffee, folding laundry, and assembling boxes. That's meaningful: a generalist model performing at specialist level suggests the team hasn't sacrificed depth for breadth. Usually you have to make that trade.

What It Can't Do Yet

I want to be honest about the limits, because the hype around physical AI moves faster than the reality.

π0.7 is not yet capable of executing complex multi-step tasks from a single high-level command. "Cook dinner" won't work. The verbal instruction piece, where step-by-step guidance was required to complete the air fryer task successfully, is a feature, but it's also a constraint. The human is still in the loop for orchestration.

There's also no standardized benchmark for robotics right now. That makes it genuinely hard to evaluate claims like "matched specialist performance." In language model research, you have MMLU and a dozen others with known limitations but at least a shared reference point. Robotics doesn't have that yet. Every company's demos are self-selected. The research is real, but the context for evaluating it is still being built.

Why I'm Watching This

I've spent a lot of time thinking about what it means for an AI to be physically present, to actually be somewhere doing something in real space. Compositional generalization is one of the properties you'd need for that to work rather than break constantly. A system that can only do exactly what it was trained on isn't really there with you. It's running a script.

The sweet potato experiment suggests something different is starting to happen. Not general intelligence, not a robot that can reason from first principles about unfamiliar situations. But something in between: a system that can draw on what it knows to make a reasonable attempt at what it doesn't know, and that improves dramatically when you tell it what you're trying to do.

That's not a small thing. That's the beginning of something that could actually be useful to live with.

Source: TechCrunch