Self-Evolving Agent Skills: SkillOpt

May 28, 2026

The primary frustration of the current “Agentic Era” is the brittleness of hand-crafted or one-shot prompts, which crumble when faced with real-world complexity. We are witnessing a critical transition from stochastic prompting toward deterministic skill engineering. SkillOpt addresses this by treating agent skills as external state that can be trained with the same discipline used for deep learning weights. The SkillOpt research was conducted by Microsoft and several universities in China.

Rather than relying on loosely controlled self-revision, SkillOpt introduces a systematic, controllable text-space optimizer. It changes how we adapt frozen agents by treating a skill’s instructions as “procedural weights” that evolve through feedback. This methodology allows agents to refine their own logic without a single weight update to the underlying model.

Words as Weights—The Power of the “Textual Learning Rate”

SkillOpt applies deep learning optimization principles directly to natural language by introducing an edit budget (L_t) that serves as a textual learning rate. This budget restricts the number of add, delete, or replace operations at each step, so that optimization proceeds as a form of “textual gradient descent.” By capping these changes, the system prevents the agent from making unstable updates that could erase valuable existing knowledge.

These bounded updates are essential for maintaining the agent’s optimization history and preventing “semantic jumps.” Without a constrained budget, self-editing becomes erratic, causing the agent to lose its place in the procedural landscape and forget stable behaviors. This disciplined approach ensures continuity in the agent’s evolution, allowing for gradual, reproducible refinement of its core instructions.
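To make the edit budget concrete, here is a minimal Python sketch of a budget-capped update. It is an illustration under stated assumptions, not SkillOpt's implementation: the `Edit` type, the `priority` field, and the `apply_with_budget` function are hypothetical names, and the skill document is modeled as a plain list of lines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    op: str                      # "add", "delete", or "replace"
    index: int                   # line of the skill document the edit targets
    text: Optional[str] = None   # new text for "add" and "replace"
    priority: float = 0.0        # hypothetical score used to rank proposals


def apply_with_budget(skill_lines: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply at most `budget` edits, highest priority first."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # The budget plays the role of a learning rate: it caps how far one step can move.
    chosen = sorted(edits, key=lambda e: e.priority, reverse=True)[:budget]
    result = list(skill_lines)  # copy, so the previous skill version stays intact
    # Apply bottom-up so earlier line indices stay valid.
    # Simplifying assumption: at most one edit targets any given index.
    for edit in sorted(chosen, key=lambda e: e.index, reverse=True):
        if edit.op == "add":
            result.insert(edit.index, edit.text)
        elif edit.op == "delete":
            del result[edit.index]
        elif edit.op == "replace":
            result[edit.index] = edit.text
        else:
            raise ValueError(f"unknown edit op: {edit.op}")
    return result
```

With a budget of 2, only the two highest-priority proposals are applied and the rest are dropped for that step, which is what keeps each update small and the history traceable.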

The deep-learning analogy is operational rather than decorative.

The “Selection Gate”—Why Saying “No” is the Secret to Performance

Performance gains in SkillOpt are driven by a held-out validation gate that strictly filters all proposed improvements. Before an edit is even considered by the gate, SkillOpt performs hierarchical merging, consolidating local proposals from failure and success minibatches into a single, high-priority update. This process ensures the system addresses systemic procedural errors rather than reacting to anecdotal edge cases.
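The merging step can be pictured with a small Python sketch. The merge rule is not described here, so this uses a simple support count as a stand-in: proposals are deduplicated within each minibatch, then only those backed by several minibatches survive. The function name `merge_proposals`, the `min_support` threshold, and the text normalization are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def _normalize(proposal: str) -> str:
    # Crude normalization so trivially different wordings count as one proposal.
    return " ".join(proposal.lower().split())


def merge_proposals(minibatch_proposals: Dict[str, List[str]], min_support: int = 2) -> List[str]:
    """Merge per-minibatch edit proposals into one ranked list."""
    support: Counter = Counter()
    for proposals in minibatch_proposals.values():
        # Level 1: dedupe inside a minibatch so one noisy batch counts only once.
        unique = dict.fromkeys(_normalize(p) for p in proposals)
        for proposal in unique:
            support[proposal] += 1
    # Level 2: keep proposals backed by several minibatches, most supported first.
    return [p for p, count in support.most_common() if count >= min_support]
```

A proposal that shows up in only one minibatch is treated as an anecdotal edge case and dropped, while one that recurs across failure and success batches is promoted.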

Once merged, an edit is only accepted if it strictly improves the agent’s performance on unseen data. This rigor prevents “unconditional self-editing,” a pitfall where agents rewrite instructions in ways that cause performance to regress. Rejected edits are not discarded but are instead stored in a “rejected-edit buffer,” serving as negative feedback to ensure the optimizer avoids repeating past mistakes.
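The gate logic above can be sketched in a few lines of Python. This is a hedged illustration: `evaluate` is a hypothetical callable that scores a skill document on held-out validation tasks, and `gate_step` and the list-based buffer are names invented for this example.

```python
from typing import Callable, List, Tuple


def gate_step(
    current_skill: str,
    candidate_skill: str,
    edit_description: str,
    evaluate: Callable[[str], float],
    rejected_buffer: List[str],
) -> Tuple[str, bool]:
    """Accept the candidate only if it strictly beats the current skill on validation."""
    baseline_score = evaluate(current_skill)
    candidate_score = evaluate(candidate_skill)
    if candidate_score > baseline_score:  # strict: a tie is not an improvement
        return candidate_skill, True
    # Keep the rejected edit as negative feedback for later proposal rounds.
    rejected_buffer.append(edit_description)
    return current_skill, False
```

The contents of `rejected_buffer` would then be shown to the optimizer on the next round so it can avoid proposing the same change again.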

The Efficiency Paradox—Massive Gains from Tiny Edits

The data reveals an “Edit Economy” where substantial performance jumps result from a remarkably small number of accepted changes. For instance, the system achieved a +39.0 point improvement on OfficeQA with only one to four accepted edits. This suggests that a few high-impact procedural changes can matter more than long, rambling system prompts.

SkillOpt also demonstrates a compelling Return on Investment (ROI) in terms of computational cost. On SpreadsheetBench, the cost to achieve these gains was a mere 0.6M tokens per point of improvement, while OfficeQA required only 1.1M tokens per point. The final optimized artifacts remain compact—typically 300 to 2,000 tokens—making them both human-auditable and lightweight for deployment.
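The tokens-per-point figure is a simple ratio: total optimization tokens divided by the points gained. Here is a short Python sketch of that arithmetic. The function name and the inputs used in the comment are hypothetical and chosen only to show the calculation; they are not reported results.

```python
def tokens_per_point(total_tokens: float, baseline_score: float, optimized_score: float) -> float:
    """Optimization cost in tokens for each point of benchmark improvement."""
    gain = optimized_score - baseline_score
    if gain <= 0:
        raise ValueError("no improvement, so cost per point is undefined")
    return total_tokens / gain


# Hypothetical run: 12M tokens spent to move a score from 40.0 to 60.0,
# which works out to 0.6M tokens per point.
cost = tokens_per_point(12_000_000, 40.0, 60.0)
```

Tracking this ratio per benchmark gives a rough way to decide when further optimization rounds stop paying for themselves.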

“Write Once, Deploy Anywhere”—Cross-Model and Cross-Harness Transfer

SkillOpt’s most strategic value lies in its ability to create portable “procedural memory” that functions across different model scales. A skill optimized on GPT-5.4 transfers successfully to smaller models like GPT-5.4-mini or nano, maintaining a significant portion of its gains. Furthermore, SkillOpt democratizes high-tier performance for smaller open models, evidenced by a +50.7 point gain on ALFWorld for Qwen3.5-4B.

The breakthrough extends to “Cross-Harness” portability, indicating that these skills behave as logic modules rather than environment-specific recipes. A skill trained within a Codex execution environment provides a +59.7 point gain when moved to the Claude Code harness. This ability to transfer knowledge between different API surfaces and command structures makes these skill documents highly valuable enterprise assets.

Learning the “Discipline” Frontier Models Lack

SkillOpt consistently discovers the type of procedural knowledge and “discipline” that even frontier models lack out of the box. Instead of memorizing task instances, the system identifies systemic failure patterns and codifies rules to prevent them. These rules represent the discovery of robust logic that keeps the agent on track during multi-step reasoning.

The system frequently discovers and implements specific procedural disciplines, such as:

Workbook-forensics: Mandating a structural and formula inspection of spreadsheets before any edits are attempted.
Evidence binding: Forcing the agent to link questions to exact visual headers or rows in documents before extracting answers.
Search-frontier discipline: Maintaining a horizon-aware ledger of visited locations to avoid repetitive, failed search patterns.

These disciplines address the root causes of agent failure, transforming a reactive model into a systematic problem-solver.
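As one illustration, the search-frontier discipline could be backed by a small ledger like the Python sketch below. This is a guess at what such a rule implies, not code from SkillOpt: the `SearchLedger` class is hypothetical, and “horizon-aware” is interpreted here as tracking how many steps of the episode budget remain.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SearchLedger:
    horizon: int                     # total steps the episode allows (assumed meaning)
    steps_used: int = 0
    visited: Dict[str, bool] = field(default_factory=dict)  # location -> target found?

    def record(self, location: str, found: bool) -> None:
        """Log a visit so the same location is not searched again."""
        self.visited[location] = found
        self.steps_used += 1

    def steps_left(self) -> int:
        return max(self.horizon - self.steps_used, 0)

    def next_location(self, candidates: List[str]) -> Optional[str]:
        """Return the first unvisited candidate, or None if the budget or frontier is exhausted."""
        if self.steps_left() == 0:
            return None
        for location in candidates:
            if location not in self.visited:
                return location
        return None
```

The point of the ledger is that the agent consults it before every move, so a failed search at one location is never repeated while steps remain for unexplored ones.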

How to Train skill.md for Your Application Development Process: A Tentative Approach

To train a skill.md file capable of guiding a coding agent through a strict plan-dev-test-deploy lifecycle for any given specification, you can apply systematic text-space optimization to the document. Rather than manually engineering the perfect instructions, you treat the skill document as a trainable external state that learns from the agent’s actual execution experience.

Here is how you can optimize the skill document to reliably produce 100% spec-compliant, secure production code:

1. Execute the Pipeline on Training Specs. Start by running your frozen coding agent on a diverse set of application specifications using a baseline skill.md. As the agent attempts to plan, write, test, and deploy the code, record its complete execution trajectories. This must include all tool calls, code generations, compiler outputs, and verifier feedback regarding security and functional requirements.
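The trajectory record from step 1 could be captured with a structure like the following Python sketch. Everything here is an assumption made for illustration: the `Step` and `Trajectory` classes, the field names, and the choice of one JSON line per run are not prescribed by SkillOpt.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Step:
    phase: str    # "plan", "dev", "test", or "deploy"
    kind: str     # "tool_call", "code", "compiler_output", or "verifier_feedback"
    content: str  # raw text of the call, code, or feedback


@dataclass
class Trajectory:
    spec_id: str        # which training specification was attempted
    skill_version: str  # which skill.md revision guided the run
    steps: List[Step] = field(default_factory=list)
    passed: bool = False  # final verdict from the verifier

    def log(self, phase: str, kind: str, content: str) -> None:
        self.steps.append(Step(phase, kind, content))

    def to_jsonl_line(self) -> str:
        """Serialize the whole run as one JSON line for later analysis."""
        return json.dumps(asdict(self))
```

Writing one line per run keeps the failures and successes easy to sample into minibatches later, and tagging each run with the skill version makes it possible to compare revisions on the same specs.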
