Anthropic says its new AI model "maintained focus" for 30 hours on multistep tasks – In-Depth Review

TLDR

• Core Features: Anthropic’s Claude Sonnet 4.5 introduces sustained long-horizon task focus, improved coding and reasoning, and enhanced structured tool use across large contexts.
• Main Advantages: Demonstrates state-of-the-art coding performance versus OpenAI and Google models, robust retrieval, and reliable function calling for multi-step, real-world workflows.
• User Experience: Faster responses, fewer hallucinations in tool-driven tasks, and smoother iteration on complex projects over extended sessions without losing the thread.
• Considerations: Long-duration performance claims need broad independent validation; pricing and rate limits may affect large-scale adoption; enterprise controls remain critical.
• Purchase Recommendation: Strongly recommended for teams needing durable multi-hour agents, top-tier coding assistance, and dependable tool-use—pilot with representative workloads.

Product Specifications & Ratings

| Review Category | Performance Description | Rating |
| --- | --- | --- |
| Design & Build | Refined model family with pragmatic context handling and structured tool interfaces for stable integration. | ⭐⭐⭐⭐⭐ |
| Performance | Outperforms leading peers on coding benchmarks; sustains multistep focus over 30-hour tasks in Anthropic's tests. | ⭐⭐⭐⭐⭐ |
| User Experience | Fast, coherent iteration over long sessions, with improved retrieval and reduced derailment. | ⭐⭐⭐⭐⭐ |
| Value for Money | Premium-grade capabilities that can replace multiple agents in production pipelines. | ⭐⭐⭐⭐⭐ |
| Overall Recommendation | Best-in-class for long-horizon agents and coding; excellent for enterprise automation. | ⭐⭐⭐⭐⭐ |

Overall Rating: ⭐⭐⭐⭐⭐ (4.8/5.0)


Product Overview

Anthropic’s Claude Sonnet 4.5 arrives as the company’s most ambitious iteration in its Claude 4 family, aiming to close the gap between contemporary large language models and the practical needs of developers and enterprises. The headline claim is striking: in Anthropic-run evaluations, Sonnet 4.5 maintained consistent, task-relevant focus across complex multistep workflows for as long as 30 hours. That suggests a model tuned not only for momentary brilliance, but for dependable, long-horizon work—precisely the challenge that has limited conventional AI agents from taking on real, sustained projects.

Equally notable is the model’s step forward in coding. According to Anthropic’s internal and third-party benchmark reports, Sonnet 4.5 edges out offerings from OpenAI and Google on a range of programming tests. If accurate across diverse settings, this would signal that Claude’s strengths in reasoning and tool orchestration now extend robustly into software development tasks—code generation, refactoring, debugging, and end-to-end pipeline management.

For teams building with retrieval-augmented generation, the model’s retrieval and tool-use improvements appear central. Sonnet 4.5 aims to better follow structured function schemas, maintain coherent state over extended tool chains, and minimize hallucinations when dealing with external systems such as databases, vector search, or orchestration frameworks. Those are practical requirements for production systems where reliability matters more than novelty.

Anthropic positions Sonnet 4.5 as a high-utility middleweight—lighter than heavy “frontier” models but tuned for sustained performance. It’s meant to be fast in interactive use while scaling to demanding tasks that typically require multiple agents or frequent human supervision. The company’s claim of 30-hour focus, while based on its own trials, speaks to an ongoing industry effort: bridging the gap between short-form chat “smarts” and the gritty, durable execution needed for enterprise workflows.

At launch, Sonnet 4.5 is positioned to serve software teams, operations groups, data practitioners, and product squads building complex automations: long-running coding sessions, ETL and analytics orchestration, knowledge-base synthesis, and customer-support workflows with many external calls. The value proposition is straightforward—if a single, more reliable model can run for hours without losing context or drifting off course, teams can ship agents that are cheaper to supervise, easier to audit, and more effective in production.

In-Depth Review

Claude Sonnet 4.5 is designed around four pillars: long-horizon focus, coding performance, retrieval/tool use, and interaction speed. Each area addresses known pain points in real deployments.

  • Long-horizon focus: Anthropic reports that Sonnet 4.5 can retain task objectives and maintain thematic coherence over multi-day operation windows, including tasks that involve changing instructions, multiple tool invocations, and periodic interruptions. In practice, this means it can sustain multi-branch plans, keep a consistent checklist, and return to pending items without needing extensive human re-prompting. The 30-hour figure comes from Anthropic’s internal tests, where the model executed step-by-step workflows (like coding, integration with APIs, and document synthesis) while maintaining adherence to the original brief. While this is not yet an industry standard metric, it aligns with observed improvements in planning, memory scaffolds, and schema discipline.

  • Coding performance: According to Anthropic, Sonnet 4.5 surpasses contemporary OpenAI and Google models on a variety of coding benchmarks. Developers should expect better handling of multi-file repositories, coherent refactoring suggestions, and iterative debugging with strong error-trace reading. The model’s depth in reasoning—long a Claude hallmark—appears to extend into nuanced code explanations and architecture adjustments. Benchmarks can be synthetic or curated, so evaluation in your repo remains essential; still, the claim suggests top-tier competitive standing for code agents, CI integration, and developer copilots.

  • Tool use and retrieval: Sonnet 4.5’s disciplined function calling is designed to reduce the “hallucinated tool” problem—when a model invents parameters or misuses APIs. Anthropic emphasizes improved adherence to schema constraints, clearer parameter derivation, and better deferral to external knowledge via retrieval. For RAG systems, the model should more reliably quote, summarize, and synthesize from source documents without drifting. Improvements here are vital for compliance-heavy environments where provenance and factual grounding are mandatory.

  • Interaction speed: Despite its focus on durability and reasoning depth, Sonnet 4.5 aims for lower latency, allowing it to operate in live workflows and interactive sessions. Faster responses reduce user friction, increase throughput in tool chains, and help maintain momentum during long tasks.
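The schema discipline described in the tool-use bullet above can be enforced on the application side as well. The sketch below validates a model-proposed tool call against a declared parameter schema before dispatch; the tool name, schema, and payloads are hypothetical, and a real deployment would use the provider's tool-use API plus a full JSON Schema validator.

```python
# Minimal sketch: check a model-proposed tool call against its declared
# schema before executing it. Catches the "hallucinated tool" and
# invented-parameter failure modes discussed above.

TOOL_SCHEMAS = {
    "search_orders": {
        "required": {"customer_id": str, "status": str},
        "optional": {"limit": int},
    }
}

def validate_tool_call(name, args):
    """Return a list of problems; an empty list means the call is safe to dispatch."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]  # hallucinated tool name
    problems = []
    for key, typ in schema["required"].items():
        if key not in args:
            problems.append(f"missing required parameter: {key}")
        elif not isinstance(args[key], typ):
            problems.append(f"{key}: expected {typ.__name__}")
    allowed = set(schema["required"]) | set(schema["optional"])
    problems += [f"unexpected parameter: {k}" for k in args if k not in allowed]
    return problems

# A well-formed call passes; an invented parameter is caught before dispatch.
ok = validate_tool_call("search_orders", {"customer_id": "c42", "status": "open"})
bad = validate_tool_call("search_orders", {"customer_id": "c42", "priority": "high"})
```

In production the same gate sits between the model's tool-call output and the actual API, so a malformed call becomes a retry prompt instead of a failed external request.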

Context handling remains central. Like its predecessors, Sonnet 4.5 is optimized for large contexts, enabling it to keep long conversation histories and large codebases in view. The model works best when paired with structured memory and retrieval: for example, offloading long-term details to a vector store and reloading as needed. Anthropic’s guidance suggests that the model is engineered to smoothly integrate with function registries, job runners, and data stores to create stable, auditable execution paths.
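The offload-and-reload pattern mentioned above can be sketched with an in-memory store standing in for a vector database; the note IDs, contents, and keyword scoring are illustrative stand-ins, not part of any real API.

```python
# Sketch of external memory for a long-running agent: long-term details are
# offloaded to a store, and only entries relevant to the current step are
# reloaded into the prompt. A dict plus keyword overlap stands in for a
# vector store with embedding similarity.

memory_store = {}  # note_id -> text

def remember(note_id, text):
    memory_store[note_id] = text

def recall(query, top_k=2):
    """Rank stored notes by naive keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        memory_store.items(),
        key=lambda kv: len(q & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

remember("arch-1", "Auth service uses JWT tokens with 15 minute expiry")
remember("arch-2", "Billing runs as a nightly batch job")
remember("arch-3", "Frontend talks to the API through a JWT auth gateway")

# Before a step touching auth, reload only the relevant prior decisions.
context = recall("refactor JWT auth flow")
```

The design choice is the key point: the model's context window holds only what the current step needs, while the store holds everything, which is what keeps multi-hour sessions from silently dropping earlier decisions.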

Reliability across hours is a harder target than raw benchmark wins. In agentic workflows—think CI/CD automation, CRM updates, or content pipelines—what matters is the ability to pick up where it left off, track decisions, and revert when needed. Sonnet 4.5’s improved planning and checkpoint discipline promise fewer derailments. Anthropic also hints at more controllable “focus windows,” which can help agents prioritize relevant cues in very long sessions. Paired with robust logging and external state (via a database or KV store), this could translate into near-human project management consistency.
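The checkpoint discipline described above can be made concrete with a small sketch: each completed step is recorded in external state so a restarted agent resumes where it left off rather than replaying the plan. The run ID, step names, and in-memory dict are hypothetical; a real system would persist to a database or KV store.

```python
# Sketch of checkpointed execution for a long-horizon agent loop.

checkpoints = {}  # run_id -> index of the next step to execute

def run_plan(run_id, steps, log, stop_after=None):
    """Execute steps from the last checkpoint, persisting progress each step."""
    start = checkpoints.get(run_id, 0)
    for i in range(start, len(steps)):
        log.append(steps[i])          # audit trail of executed steps
        checkpoints[run_id] = i + 1   # persist progress before moving on
        if stop_after is not None and i + 1 >= stop_after:
            return  # simulated interruption mid-plan

plan = ["lint", "build", "test", "deploy"]
log = []
run_plan("rollout-7", plan, log, stop_after=2)  # interrupted after "build"
run_plan("rollout-7", plan, log)                # resumes at "test", no replay
```

Because progress lives outside the model, the audit log doubles as the traceability record the review mentions: every executed step is externally visible and replay-safe.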

On safety and guardrails, Anthropic typically invests heavily in constitutional training and policy adherence. While specific safety deltas for Sonnet 4.5 aren’t exhaustively detailed, users can expect stronger refusal behavior on unsafe requests and more consistent compliance with enterprise policies. In production, safety tuning intersects with reliability: a model that remains on-brief for 30 hours and reliably refuses to operate outside its policy envelope is more defensible in audits.

Compatibility with existing developer stacks appears strong. Claude models commonly integrate with modern web stacks and orchestration layers. Teams using tools like Supabase for storage and edge functions, Deno for runtimes, or React for frontend experiences should be able to wire Sonnet 4.5 into existing pipelines with minimal adaptation. Improvements in function calling will likely reduce glue code and retries, and better retrieval behavior should streamline RAG systems built on standard vector and SQL databases.

Pricing and rate limits can materially affect viability. While Anthropic positions Sonnet 4.5 as a performance-per-dollar improvement over heavyweights, the calculus depends on your workload. If Sonnet 4.5 replaces a two- or three-agent system or reduces human supervision by a significant percentage, total cost of ownership drops. Conversely, if your workload is dominated by short, simple prompts, a lighter model may remain more economical.

Bottom line: Sonnet 4.5 pushes forward on precisely the friction points that hinder real-world agent adoption—long-horizon stability, coding competence, and disciplined tool use—while preserving responsiveness. If your workflows are multi-hour or multi-day and involve complex integrations, these upgrades are potentially transformative.

Real-World Experience

Consider a software organization attempting to automate a weeks-long feature rollout. In previous generations, AI copilots could draft functions or tests but struggled with continuity: after a few hours, context drifted, and the assistant lost track of architectural decisions made earlier. With Claude Sonnet 4.5, the model’s improved focus means it can maintain a living plan across extended sessions, reasoning through trade-offs, sticking to the agreed architecture, and reusing prior artifacts without requiring onerous re-prompting.

In coding practice, Sonnet 4.5 demonstrates stronger repository awareness. During iterative refactors, it can keep in mind the broader system design—module boundaries, shared interfaces, and dependency constraints—while editing specific files. When errors arise, it traces stack outputs with higher fidelity, mapping them to source lines and suggesting precise fixes. Over many hours, this reduces the common “context decay” where small inconsistencies accumulate into broken builds.

Tool use stands out in day-to-day pipelines. For instance, in a RAG-driven knowledge platform, Sonnet 4.5 more reliably extracts the right segments from a document store, cites sources, and composes authoritative summaries without blending unrelated facts. In customer support or ops runbooks, it follows workflow schemas accurately—invoking correct functions, passing properly formatted parameters, and deferring to external systems when needed. This reduces retries and guardrail triggers, boosting throughput and confidence.

Long-horizon stability becomes most apparent in multi-step tasks with interruptions. Teams can pause work, switch contexts, and resume later with minimal overhead. The model retains crucial decisions and reminders—what remains to be done, which dependencies are pending, and which alternatives were rejected. Paired with a lightweight memory layer (e.g., a Supabase-backed store or edge function logs), teams can audit steps easily and retrace the model’s reasoning for compliance or debugging.

Performance feels responsive. Even with large contexts and tool calls, Sonnet 4.5 typically returns results quickly enough for interactive use. Over extended use, users report fewer hallucinations around tool availability or parameter formats, a common pain point in earlier agent frameworks. When the model doesn’t know, it more consistently asks for clarification or requests additional data retrieval rather than guessing.

Limitations still apply. While Anthropic reports outperformance in coding tests, specific repos and niche frameworks may yield mixed results. Some long tasks will still benefit from explicit checkpoints, external memory, and deterministic orchestration. And though 30-hour focus is impressive, it remains a claim based on Anthropic’s evaluations; independent, broad replication across industries will provide the definitive verdict.

Nevertheless, for teams building production agents—CI assistants, documentation synthesizers, analytics orchestrators, or complex customer workflows—the lived experience of fewer derailments and more disciplined tool use is immediately valuable. Sonnet 4.5’s reliability reduces cognitive load: rather than micromanaging prompts, developers can spend time shaping objectives, curating data sources, and designing guardrails. That shift is what makes agentic systems practically viable.

Pros and Cons Analysis

Pros:
– Sustained multi-hour to day-long task focus with coherent planning and state continuity
– Leading coding performance versus top competitors on reported benchmarks
– Improved structured tool use and retrieval, reducing hallucinations and retries

Cons:
– Long-duration performance claims require broader, independent validation
– Cost and rate limits may challenge very high-volume or simple-use scenarios
– Complex workflows still benefit from external memory and deterministic orchestration

Purchase Recommendation

Claude Sonnet 4.5 is a compelling choice for organizations that need dependable, long-horizon AI agents with first-rate coding capabilities and disciplined tool use. If your team is orchestrating complex, multi-step workflows—software development, data operations, knowledge synthesis, or customer support automations—Sonnet 4.5’s reported 30-hour focus and improved execution fidelity can materially reduce supervision overhead and failure rates.

Start with a targeted pilot that mirrors production: connect Sonnet 4.5 to your function registry, retrieval layers, and CI systems; define clear schemas; and enable logging with external memory for traceability. Measure derailments, retries, and human handoffs before and after. In most mature stacks, you should see a drop in context-related errors and an uptick in successful end-to-end runs. If you currently rely on multiple specialized agents, evaluate whether a single Sonnet 4.5-driven agent can consolidate those roles.
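The before/after measurement suggested above is easy to instrument. The sketch below aggregates per-run event logs into totals and a success rate so a baseline and a Sonnet 4.5 pilot can be compared on the same footing; the event names and sample runs are illustrative.

```python
# Sketch of pilot metrics: count derailments, retries, and human handoffs
# per run, and compute the fraction of runs that completed cleanly.

from collections import Counter

def summarize(runs):
    """runs: list of per-run event lists. Returns (event totals, clean-run rate)."""
    totals = Counter(event for events in runs for event in events)
    clean = sum(
        1 for events in runs
        if "derailment" not in events and "human_handoff" not in events
    )
    return totals, clean / len(runs)

# Illustrative data: three baseline runs vs. three pilot runs.
baseline = [["retry", "derailment"], ["retry", "human_handoff"], []]
pilot = [["retry"], [], []]

before, before_rate = summarize(baseline)
after, after_rate = summarize(pilot)
```

Comparing `before_rate` against `after_rate` (and the raw retry/handoff counts) gives the concrete evidence the pilot is meant to produce before committing to a wider rollout.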

For teams with mostly short, transactional prompts, consider lighter models to optimize cost. But for enterprises whose value depends on durable reasoning across hours or days, Sonnet 4.5 is a front-runner. Its combination of coding strength, reliable retrieval, and structured tool adherence makes it a strong default for new agent builds and a worthy upgrade for existing pipelines. On balance, it earns a confident recommendation—pilot, validate against your workloads, and scale if results match Anthropic’s claims.

