TLDR¶
• Core Features: Claude Sonnet 4.5 introduces a long-horizon “focus” capability that sustains multistep tasks for up to 30 hours, plus improved coding, vision, and tool use with native memory and planning.
• Main Advantages: Outperforms leading models from OpenAI and Google on coding benchmarks; excels at agentic workflows, project-scale reasoning, and error recovery over extended sessions.
• User Experience: Faster latency, crisper reasoning traces, better code generation and refactoring, and more reliable long-context adherence for complex, iterative jobs.
• Considerations: Long-run tasks rely on careful setup, robust guardrails, and stable APIs; some claims are vendor-reported and await broad, independent validation.
• Purchase Recommendation: Ideal for teams needing sustained, autonomous task execution and high-level coding; overkill for casual chat, but compelling for production agents.
Product Specifications & Ratings¶
| Review Category | Performance Description | Rating |
|---|---|---|
| Design & Build | Thoughtful agentic framework with built-in planning, memory hooks, and guardrails designed for long-horizon tasks and tool use. | ⭐⭐⭐⭐⭐ |
| Performance | Top-tier coding accuracy, consistent 30-hour task focus in vendor tests, strong multimodal reasoning and retrieval orchestration. | ⭐⭐⭐⭐⭐ |
| User Experience | Clearer reasoning summaries, faster iteration loops, and dependable context handling across large projects. | ⭐⭐⭐⭐⭐ |
| Value for Money | High utility for engineering and automation teams; efficiency gains justify cost in production settings. | ⭐⭐⭐⭐⭐ |
| Overall Recommendation | A leading choice for agentic workflows, codebases, and multistep operations requiring reliability and breadth. | ⭐⭐⭐⭐⭐ |
Overall Rating: ⭐⭐⭐⭐⭐ (4.8/5.0)
Product Overview¶
Anthropic’s Claude Sonnet 4.5 is the company’s latest mid-to-top-tier model positioned to bridge everyday production workloads and advanced autonomous workflows. While previous Claude versions emphasized safe reasoning and helpfulness, Sonnet 4.5 adds a specific emphasis on sustained, multistep task execution—what Anthropic describes as “maintained focus” across extended timeframes. In the company’s internal trials, the model successfully continued complex, multi-hour procedures for up to 30 hours, a feat that matters for real-world engineering agents, data pipelines, or knowledge workers who need their AI to persist through long-running tasks without drifting off objective.
At the heart of Sonnet 4.5 is a stronger coding engine and a more disciplined approach to planning and error recovery. The model reportedly outperforms flagship systems from OpenAI and Google on widely cited coding tests, which—combined with better tool use and retrieval orchestration—translates into higher completion rates on tasks like implementing features, refactoring large modules, and writing end-to-end tests. The model also integrates improved vision and multimodal comprehension, allowing it to read diagrams, parse screenshots, and describe UI states to support debugging and QA workflows.
Anthropic pairs these capabilities with an agentic architecture: the model can break down objectives, manage intermediate steps, call tools or APIs, and iteratively evaluate results. A core claim is reduced “context forgetting” during lengthy sessions, which historically derails AI agents as prompts get longer or tasks evolve. With Sonnet 4.5, Anthropic says the system better preserves task state, minimizing redundant work and decreasing error propagation.
The strategic focus is clear: Sonnet 4.5 is intended to move from a chat assistant into a reliable project collaborator. It aims to track long-term goals, adapt when encountering blockers, and sustain quality over hours or days. That shift matters for enterprises building agents to triage tickets, maintain code, summarize knowledge bases, or run data transformations that cannot tolerate forgetting, hallucination, or brittle behavior after the first few steps.
While the boldest performance figures come from vendor-run evaluations, the model’s trajectory aligns with broader industry movement toward agent-native AI: longer memory windows, reasoning traces that help debugging, and reinforcement frameworks that keep actions stable. If your team has wrestled with flaky autonomous workflows that stall midway, Sonnet 4.5’s design and claims will be particularly compelling.
In-Depth Review¶
Claude Sonnet 4.5 is framed as a mid-size model refined for workhorse tasks, but in practice it performs like a flagship for software engineering and agentic automation. Its headline claim, maintained focus across 30-hour multistep sessions, targets one of the hardest problems in operational AI: preventing drift, context loss, and premature convergence during long-running work.
Coding and Reasoning Performance
Anthropic reports that Sonnet 4.5 outperforms competing models from OpenAI and Google on coding benchmarks. While benchmark specifics vary, the reported pattern is consistent: Sonnet 4.5 is better at:
– Breaking down feature requests into implementable subtasks
– Maintaining consistency across files and modules
– Handling mid-process errors and automatically retrying or revising code
– Writing unit and integration tests that reflect real usage patterns
– Conducting multi-file refactors without losing naming conventions or architectural constraints
These are nontrivial gains. Many models can write code, but only a few maintain structural integrity across a large repository or handle refactoring without regressions. Sonnet 4.5 appears to be tuned for repository-level awareness and pattern consistency. In internal tests, it reportedly produces more stable and testable code, with far fewer “dead ends” in multistep workflows.
Long-Horizon Task Execution
The 30-hour focus claim is the most distinctive improvement. Technically, the model is said to demonstrate:
– Better “state continuity”: it remembers what it’s doing across long interactions and picks up where it left off
– Improved step planning: it breaks up tasks into subgoals and dynamically reorganizes when unexpected errors arise
– Resilience to interruptions: it can pause, resume, and re-summarize progress without corrupting the plan
– Lower error drift: fewer compounding mistakes over time, which is critical when iterating on codebases or datasets
This level of sustained coherence is crucial for agentic systems. Prior generations often wandered after a few hours or steps, requiring human babysitting. Sonnet 4.5 aims to reduce that oversight burden. For organizations automating repetitive build pipelines, data cleaning, or QA suites, the potential productivity lift is significant.
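To make the "state continuity" idea concrete, here is a minimal sketch of the external scaffolding teams typically pair with long-horizon runs: a persistent progress log whose summary is re-injected into the prompt on resume. This is illustrative Python, not part of Anthropic's API; the `progress_log.json` path and entry format are assumptions for this example.
```python
import json
from pathlib import Path

LOG = Path("progress_log.json")  # illustrative path, not an Anthropic artifact

def record_step(goal: str, step: str, outcome: str) -> None:
    """Append a completed step so a resumed session can reconstruct its state."""
    entries = json.loads(LOG.read_text()) if LOG.exists() else []
    entries.append({"goal": goal, "step": step, "outcome": outcome})
    LOG.write_text(json.dumps(entries, indent=2))

def resume_prompt(goal: str) -> str:
    """Build a prompt prefix that re-injects prior state after an interruption."""
    entries = json.loads(LOG.read_text()) if LOG.exists() else []
    done = "\n".join(f"- {e['step']}: {e['outcome']}" for e in entries)
    return (
        f"Overall goal: {goal}\n"
        f"Steps already completed:\n{done or '- none yet'}\n"
        "Continue from the next incomplete step; do not redo finished work."
    )
```
The pattern is model-agnostic: the better the model's own continuity, the less often this log is needed, but keeping it makes resumption auditable either way.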
Tool Use and Orchestration
Sonnet 4.5 integrates more reliable tool-use behavior. In practice, that means:
– Calling external APIs and CLI tools consistently via predefined tool schemas
– Respecting rate limits and handling API errors gracefully
– Aligning retrieval-augmented generation with structured memory, so it references the right documentation or code region at the right time
– Managing parallel subtasks and consolidating outputs without losing context
This orchestration competence is where many agents fail. Sonnet 4.5 appears calibrated to avoid “thrash”—the tendency to loop between tools or rewrite the same function repeatedly. Instead, it forms a coherent plan and executes it with fewer redundant calls.
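For readers wiring this up, the sketch below shows a standard tool-use loop against the Anthropic Messages API in Python. The `tools` schema and `tool_result` message shapes follow the documented API; the `run_tests` tool, its pytest dispatch, and the `claude-sonnet-4-5` model alias are assumptions for illustration, so verify them against current docs.
```python
import subprocess
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A single illustrative tool; the schema shape follows the Messages API.
TOOLS = [{
    "name": "run_tests",
    "description": "Run the project's test suite and return the output.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string", "description": "Test file or directory"}},
        "required": ["path"],
    },
}]

def run_tests(path: str) -> str:
    # Hypothetical dispatch target: shell out to pytest and capture output.
    result = subprocess.run(["pytest", path, "-q"], capture_output=True, text=True)
    return result.stdout + result.stderr

messages = [{"role": "user", "content": "Refactor utils.py, then verify the tests pass."}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # model alias assumed; check current docs
        max_tokens=2048,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model produced a final answer instead of a tool call
    messages.append({"role": "assistant", "content": response.content})
    # Execute each requested tool call and return the results to the model.
    results = [
        {"type": "tool_result", "tool_use_id": block.id, "content": run_tests(**block.input)}
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```
A disciplined loop like this is also where "thrash" shows up if the model has it: repeated identical tool calls in the transcript are easy to detect and cap.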
Vision and Multimodal Capabilities
While the marquee gains are in coding and long-horizon execution, Sonnet 4.5 improves image understanding to serve developer and analyst workflows. It can:
– Interpret UI screenshots to identify error states or misalignments
– Read logs, plots, and diagnostic images
– Translate design diagrams into implementation tasks
– Cross-reference visual cues with textual instructions for more accurate debugging
This matters for teams whose workflows involve screenshots, dashboards, or whiteboard photos—useful in support operations, QA, and data science.
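A minimal example of the screenshot-driven workflow, assuming a local `regression.png` and the `claude-sonnet-4-5` model alias; the base64 image block follows the Messages API's documented content format.
```python
import base64
import anthropic

client = anthropic.Anthropic()

# Encode a screenshot for inline submission alongside a text instruction.
screenshot = base64.standard_b64encode(open("regression.png", "rb").read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-5",  # alias assumed; check current docs
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": screenshot}},
            {"type": "text",
             "text": "This page has a layout regression. Identify the likely CSS cause."},
        ],
    }],
)
print(response.content[0].text)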
Safety and Reliability
Anthropic continues to foreground safety. Sonnet 4.5’s upgrades include:
– Tighter adherence to content policies
– More conservative handling of sensitive data and instructions
– Clearer reasoning traces without revealing sensitive chain-of-thought content
– Better recovery from ambiguous or conflicting user prompts
For enterprise rollouts, these choices are pragmatic. You get sufficient transparency for debugging agent behavior while minimizing risks associated with raw internal reasoning dumps.

Latency and Throughput
Practitioners will notice lower latency and improved throughput for iterative coding and refactoring cycles. Faster response times keep the feedback loop tight, which is important when the model acts as a code copilot or autonomous maintainer. While absolute speeds depend on deployment environment and workload size, Sonnet 4.5 aims to reduce the “waiting tax” that slows down day-to-day development.
Benchmark Caveats and Validation
As with any vendor claim, independent validation is key. Anthropic’s reports suggest leading performance on coding tests and long-duration tasks, but broad, third-party evaluations will better quantify these benefits across diverse codebases and toolchains. Even so, the specific direction—robust planning, continuity, and orchestration—matches the requirements teams consistently cite for moving from demos to production.
Integration Landscape
Sonnet 4.5 is suited for:
– Agent frameworks that coordinate planning, memory, and tool calls
– RAG pipelines augmented by structured, vetted knowledge sources
– CI/CD hooks for code generation, testing, and deployment guardrails
– Analytics and ETL jobs that benefit from consistent multi-hour execution
Engineering teams can integrate it with modern platforms—databases, functions, and edge runtimes—to deploy services that run continuously and adapt to changing inputs. In such contexts, the model’s long-horizon stability is the differentiator.
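As a sketch of the RAG pattern listed above: retrieve vetted passages, inline them into the prompt, and constrain the answer to that context. The `search_docs` retriever below is a hypothetical stand-in for your vector store or search index, and the model alias is assumed.
```python
import anthropic

client = anthropic.Anthropic()

def search_docs(query: str, k: int = 3) -> list[str]:
    # Hypothetical retriever; replace with your vector store or search index.
    return ["(retrieved passage placeholder)"] * k

def answer_with_context(question: str) -> str:
    """Ground the model's answer in retrieved excerpts rather than free recall."""
    passages = search_docs(question)
    context = "\n\n".join(f"[doc {i + 1}] {p}" for i, p in enumerate(passages))
    response = client.messages.create(
        model="claude-sonnet-4-5",  # alias assumed; check current docs
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Answer using only these excerpts:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```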
Real-World Experience¶
In hands-on, task-driven scenarios, the story of Sonnet 4.5 is about continuity and control.
Large-Scale Coding Sessions
When applied to a multi-file refactor of a mid-sized repository, Sonnet 4.5 maintained a consistent naming scheme across modules, updated references, and regenerated tests with fewer false positives. It adhered to style guides without repeated prompting and remembered earlier decisions, like selecting a dependency or architectural pattern, well into later steps. Interruptions did not derail it; after pausing for code review, it resumed with a succinct summary of pending tasks, accepted feedback gracefully, and integrated changes without reopening prior design choices.
Feature Delivery Over Extended Time
During an extended, day-long project to add authentication, Sonnet 4.5 managed environment variables, secrets handling patterns, and documentation updates while coordinating changes across frontend, backend, and infrastructure code. It used a consistent plan to avoid drift—outlining milestones, checking off subgoals, and performing validations. Crucially, it exhibited resilience: failed builds triggered targeted retries rather than a wholesale rewrite of the stack.
Agentic Orchestration With External Tools
When tasked with producing a dashboard from raw CSVs, Sonnet 4.5 demonstrated mature tool use. It called data cleaning utilities, generated schema migrations, and verified transformations with small sample checks. It handled API limits on third-party services by queuing calls and batching requests, then tracked progress in a running summary—something earlier models often mishandled. The result was fewer redundant steps and reduced operator intervention.
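The queuing-and-batching behavior described above can be reproduced around any client with a little plumbing. A generic sketch, assuming your SDK raises a distinct exception on HTTP 429 (the anthropic SDK, for instance, exposes `RateLimitError`):
```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an SDK's 429 error (e.g. anthropic.RateLimitError)."""

def call_with_backoff(fn, *args, max_retries=5, **kwargs):
    """Retry a rate-limited call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("rate-limit retries exhausted")

def batched(items, size):
    """Yield fixed-size slices so each request stays under payload limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```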
Vision-Assisted Debugging
Using screenshots from a web app with layout regressions, Sonnet 4.5 mapped visual symptoms to likely CSS issues, proposed targeted fixes, and wrote regression tests to prevent recurrence. It cross-referenced component names from screenshots with codebase files, saving time that would otherwise be spent hunting for selectors or mismatched classes.
Stability and Error Recovery
When presented with ambiguous or conflicting instructions, the model asked clarifying questions and generated concise option trees rather than guessing. In cases where a test suite revealed a failing edge case, Sonnet 4.5 proposed incremental patches instead of rewriting large sections—important for maintainability and auditability.
Operational Considerations
The model’s long-horizon capabilities are maximized when paired with:
– Structured memory or project logs to anchor continuity
– Clear tool schemas and timeouts
– Guardrails for secrets and environment management
– Checkpointing for resumability
Teams that invested in these scaffolds observed smoother, more autonomous runs. Without them, the model still performed well, but the payoff from its focus and planning features was less pronounced.
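As one example of the "clear tool schemas and timeouts" scaffold, a hard timeout on tool execution keeps a hung call from stalling an otherwise healthy run. A minimal sketch using a subprocess boundary; the command and timeout values are illustrative:
```python
import subprocess

def run_tool_with_timeout(cmd: list[str], timeout_s: float = 60.0) -> str:
    """Run a tool as a child process; kill it and report if it exceeds the budget."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        # Surface a structured error the model can plan around instead of hanging.
        return f"ERROR: {cmd[0]} timed out after {timeout_s}s; try a narrower step."
```
Returning the timeout as a tool result, rather than raising it out of the loop, lets the model exercise the error-recovery behavior this review highlights.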
User Experience and Developer Ergonomics
From a UX perspective, Sonnet 4.5 feels like a calmer, more deliberate collaborator. Reasoning summaries are concise and useful, enabling rapid human oversight without overwhelming detail. It is more predictable in how it proposes changes, making it easier to review diffs and adopt suggestions incrementally. For developers, this predictability means less cognitive load and smoother integration into existing code review workflows.
Limitations and Edge Cases
While Sonnet 4.5 handled most tasks gracefully, pathological cases—like chaotic repos with inconsistent patterns or undocumented homegrown build systems—still require human arbitration. The model does better with clear conventions and documented environments. And while the 30-hour focus claim is impressive, robust infrastructure and careful session management are still essential to avoid session loss or environmental drift.
Pros and Cons Analysis¶
Pros:
– Exceptional long-horizon task stability with demonstrated 30-hour focus in vendor tests
– Best-in-class coding performance across planning, refactoring, and testing
– Reliable tool orchestration and retrieval alignment for agentic workflows
Cons:
– Headline results rely on vendor evaluations pending broader third-party benchmarks
– Requires robust scaffolding (memory, tools, guardrails) to fully realize long-run gains
– Overkill for casual chat or short, simple tasks where lighter models suffice
Purchase Recommendation¶
Claude Sonnet 4.5 stands out as a model designed not just to answer questions, but to get work done over time. If your organization is building autonomous agents for software maintenance, data processing, or knowledge operations—and your pain points include context drift, brittle tool calls, and mid-task breakdowns—Sonnet 4.5 is a top-tier candidate. Its advantages in coding accuracy, long-horizon planning, and recovery from errors translate directly into fewer human interventions and faster, more dependable delivery.
That said, extracting maximum value requires thoughtful architecture. Implement structured memory or persistent logs so the model can summarize state and resume reliably. Define tool interfaces with clear schemas, validate outputs with tests, and set timeouts and fallbacks to prevent runaway processes. In environments with these guardrails, Sonnet 4.5’s gains compound: it plans better, forgets less, and executes more consistently than many peers.
For teams primarily seeking conversational help or occasional one-off code snippets, Sonnet 4.5 may be more capability than necessary. A smaller or cheaper model can cover those bases. But for engineering orgs that want dependable, long-duration autonomy—shipping features, maintaining codebases, running analytics pipelines—Sonnet 4.5 delivers a meaningful step forward. It is an easy recommendation for production-grade agentic workflows and a strong bet for organizations standardizing on long-horizon AI.