
Claude Opus 4.5 from Anthropic: The AI Model That Outperformed Engineers in Internal Tests

Hello HaWkers, Anthropic just launched Claude Opus 4.5, and the results are generating intense discussion in the developer community. According to the company, the model outperformed all human candidates in internal engineering tests.

This claim raises important questions about the future of software development and AI's role as a tool or coworker.

The Claude Opus 4.5 Announcement

Anthropic positioned Opus 4.5 as its most capable model to date.

What Internal Benchmarks Say

The result that caught attention:

"Claude Opus 4.5 outperformed all human candidates in Anthropic's internal engineering tests."

Important context:

  • The tests are the same ones given to Anthropic's own engineering candidates
  • They involve problem-solving, code, and reasoning
  • Anthropic candidates are highly qualified (ex-Google, ex-OpenAI, etc.)

Target Audience

The model was designed for:

  1. Professional developers - Assistance with complex code
  2. Knowledge workers - Analysts, consultants, researchers
  3. Enterprises - Cognitive task automation

Claude Opus 4.5 Capabilities

Analysis of the main improvements over previous versions.

Code and Engineering

SWE-bench Benchmark:

Model               Score   Bug Types
GPT-4 Turbo         23.5%   Simple
Claude 3.5 Sonnet   38.2%   Intermediate
Claude 3.5 Opus     45.8%   Complex
Claude Opus 4.5     52.3%   Very complex

Demonstrated capabilities:

// Example of task that Opus 4.5 solves well

// Input: Vague bug report
const bugReport = `
  User reports that checkout fails "sometimes"
  when there are more than 5 items in cart.
  Logs show timeout in production.
`;

// Expected output from Claude Opus 4.5:
/*
Bug Analysis:

1. SYMPTOM: Timeout in checkout with 5+ items
2. HYPOTHESIS: Shipping calculation with multiple items
3. INVESTIGATION:
   - Check ShippingCalculator.calculateBatch()
   - N+1 query in getShippingRates()
   - Connection pool exhausted

4. PROBABLE ROOT CAUSE:
   File: src/services/shipping.ts:142
   The loop makes one query per item instead of batch

5. SOLUTION:
   Refactor to batch query:
   [code with fix]

6. PREVENTION:
   - Add load test
   - Implement timeout with retry
   - Query count monitoring
*/

Long Context

Context window:

Model             Context   Practical Use
GPT-4             128k      ~96k useful tokens
Claude 3.5        200k      ~150k useful tokens
Claude Opus 4.5   200k      ~180k useful tokens (better recall)

The improvement is not in size, but in recall - the model can use information from distant parts of context more effectively.
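
To actually benefit from that recall, prompt structure matters. A sketch of a long-context prompt builder, following Anthropic's published long-context guidance (long documents first, the question last); the helper name and document tags are illustrative, not part of the SDK:

```javascript
// Sketch: put long documents first and the question last, following
// Anthropic's long-context prompting guidance. The tag format and
// function name are illustrative assumptions.
function buildLongContextPrompt(documents, question) {
  const docBlocks = documents
    .map((doc, i) => `<document index="${i + 1}">\n${doc}\n</document>`)
    .join('\n');

  // Instructions near the end of a long prompt tend to be
  // recalled more reliably than instructions buried at the top.
  return `${docBlocks}\n\nBased on the documents above, answer:\n${question}`;
}
```

The same prompt reversed (question first, 150k tokens of documents after it) is where recall typically degrades.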

Reasoning and Analysis

GPQA Diamond (PhD-level questions):

Performance on advanced scientific questions:

GPT-4 Turbo:        ████████████░░░░░░░░  60.2%
Claude 3.5 Opus:    ████████████████░░░░  78.4%
Gemini 3 Pro:       ███████████████░░░░░  76.8%
Claude Opus 4.5:    █████████████████░░░  84.1%

Writing and Text Analysis

An area where Claude traditionally excels.

Capabilities:

  • Long document analysis with high precision
  • Summarization that maintains important nuances
  • Clear and well-structured technical writing
  • Code to documentation translation

Claude Opus 4.5 vs Competitors

Direct comparison with other leading models.

General Comparison Table

Aspect      GPT-4 Turbo   Gemini 3 Pro   Claude Opus 4.5
Code        Very good     Very good      Excellent
Reasoning   Good          Very good      Excellent
Writing     Very good     Good           Excellent
Context     128k          1M             200k
Speed       Fast          Fast           Medium
Cost        Medium        Medium         High
API         Mature        Mature         Mature

When to Use Each One

const modelSelection = {
  gpt4_turbo: {
    bestFor: [
      'Fast tasks',
      'High request volume',
      'OpenAI ecosystem integration',
      'Limited budget',
    ],
    avoidWhen: [
      'Very long context needed',
      'Maximum precision in complex code',
    ],
  },

  gemini3_pro: {
    bestFor: [
      'Extremely long context (1M tokens)',
      'Google Workspace integration',
      'Complex multimodal analysis',
      'Search and RAG',
    ],
    avoidWhen: [
      'Creative and refined writing',
      'Tasks requiring nuance',
    ],
  },

  claude_opus_4_5: {
    bestFor: [
      'Complex code and debugging',
      'Deep document analysis',
      'Tasks requiring careful reasoning',
      'High-quality technical writing',
    ],
    avoidWhen: [
      'Very limited budget',
      'Critical latency',
      'Context above 200k',
    ],
  },
};

API and Integration

Practical guide to using Claude Opus 4.5 in projects.

Basic Setup

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// Basic call
async function askClaude(prompt) {
  const response = await client.messages.create({
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    messages: [
      {
        role: 'user',
        content: prompt,
      },
    ],
  });

  return response.content[0].text;
}

// With system prompt
async function askClaudeWithContext(systemPrompt, userPrompt) {
  const response = await client.messages.create({
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    system: systemPrompt,
    messages: [
      {
        role: 'user',
        content: userPrompt,
      },
    ],
  });

  return response.content[0].text;
}

Streaming

// Streaming for long responses
async function streamClaude(prompt) {
  const stream = await client.messages.create({
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta') {
      process.stdout.write(event.delta.text);
    }
  }
}
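
In production, API calls occasionally fail with rate limits or transient server errors. A generic retry wrapper with exponential backoff; this helper is not part of the Anthropic SDK, and the status codes and limits are assumptions to adjust for your setup:

```javascript
// Sketch: retry transient API failures (429 / 5xx) with exponential
// backoff. Generic helper, not part of the SDK.
async function withRetry(fn, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = err.status === 429 || err.status >= 500;
      if (!retryable || attempt >= retries) throw err;
      // Backoff doubles each attempt: 1s, 2s, 4s...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Usage: const answer = await withRetry(() => askClaude('...'));
```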

Automated Code Review

// Example: Automated code review with Claude
async function reviewCode(code, context = '') {
  const systemPrompt = `
    You are a senior engineer doing code review.
    Be direct, constructive and focused on:
    1. Bugs and edge cases
    2. Performance
    3. Security
    4. Readability
    5. Best practices

    Response format:
    ## Summary
    [Overall assessment in 1-2 sentences]

    ## Issues
    [Prioritized list of problems]

    ## Suggestions
    [Optional improvements]

    ## Corrected Code (if applicable)
    [Improved version]
  `;

  const userPrompt = `
    ${context ? `Context: ${context}\n\n` : ''}
    Review this code:

    \`\`\`
    ${code}
    \`\`\`
  `;

  return askClaudeWithContext(systemPrompt, userPrompt);
}

// Usage
const review = await reviewCode(`
  async function getUsers() {
    const users = await db.query("SELECT * FROM users WHERE name = '" + req.query.name + "'");
    return users;
  }
`);

// Claude will identify:
// - SQL Injection
// - Missing input validation
// - Unnecessary SELECT *
// - Missing error handling
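
The fix such a review points to is a parameterized query instead of string concatenation. A minimal sketch; the `$1` placeholder syntax and the query shape are illustrative, since the exact form depends on your database driver:

```javascript
// Sketch of the corrected query: parameters travel separately from
// the SQL text, so user input can never change the query structure.
// Placeholder syntax ($1, ?, :name) varies by driver.
function buildGetUsersQuery(name) {
  return {
    sql: 'SELECT id, name, email FROM users WHERE name = $1',
    params: [name],
  };
}

// Hypothetical usage with a driver that accepts (sql, params):
// const { sql, params } = buildGetUsersQuery(req.query.name);
// const users = await db.query(sql, params);
```

This also addresses the `SELECT *` finding by naming only the columns the caller needs.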

Codebase Analysis

// Analyze project architecture
async function analyzeCodebase(files) {
  const systemPrompt = `
    You are a software architect analyzing a project.
    Analyze the structure, patterns used, and suggest improvements.
  `;

  const fileContents = files
    .map(f => `=== ${f.path} ===\n${f.content}`)
    .join('\n\n');

  const userPrompt = `
    Analyze this project:

    ${fileContents}

    Focus on:
    1. Overall architecture
    2. Design patterns
    3. Improvement points
    4. Technical risks
  `;

  return askClaudeWithContext(systemPrompt, userPrompt);
}
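
A real codebase can exceed even a 200k context window, so the file list usually needs batching first. A sketch using the rough heuristic of ~4 characters per token; for exact counts you would use the API's token-counting support instead, and the budget below is an illustrative assumption:

```javascript
// Sketch: split files into batches that fit a token budget, using
// the rough ~4 chars/token heuristic. Budget is an assumption;
// leave headroom for the system prompt and the response.
function chunkFilesByTokens(files, maxTokens = 150_000) {
  const estimateTokens = text => Math.ceil(text.length / 4);
  const batches = [];
  let current = [];
  let currentTokens = 0;

  for (const file of files) {
    const tokens = estimateTokens(file.content);
    // Start a new batch when adding this file would exceed the budget
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(file);
    currentTokens += tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Usage: for (const batch of chunkFilesByTokens(files)) await analyzeCodebase(batch);
```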

Pricing and Costs

Claude Opus 4.5 is a premium model.

Pricing Structure

Prices per 1 million tokens:

Model               Input    Output   Context Cache
Claude 3 Haiku      $0.25    $1.25    $0.03
Claude 3.5 Sonnet   $3       $15      $0.30
Claude 3.5 Opus     $15      $75      $1.50
Claude Opus 4.5     $15      $75      $1.50

Calculating Costs

function calculateCost(inputTokens, outputTokens) {
  const PRICES = {
    haiku: { input: 0.25, output: 1.25 },
    sonnet: { input: 3, output: 15 },
    opus: { input: 15, output: 75 },
  };

  return Object.entries(PRICES).reduce((acc, [model, prices]) => {
    const cost = (
      (inputTokens / 1_000_000) * prices.input +
      (outputTokens / 1_000_000) * prices.output
    );
    acc[model] = `$${cost.toFixed(4)}`;
    return acc;
  }, {});
}

// Example: Code review of medium file
// ~2000 tokens input, ~1500 tokens output
console.log(calculateCost(2000, 1500));
// {
//   haiku: '$0.0024',
//   sonnet: '$0.0285',
//   opus: '$0.1425'
// }

// Example: Large codebase analysis
// ~50000 tokens input, ~5000 tokens output
console.log(calculateCost(50000, 5000));
// {
//   haiku: '$0.0188',
//   sonnet: '$0.2250',
//   opus: '$1.1250'
// }

Optimizing Costs

Strategies:

const costOptimization = {
  caching: {
    description: 'Use prompt caching for repeated context',
    savings: 'Up to 90% on input tokens',
    when: 'Long system prompts, project context',
  },

  rightModel: {
    description: 'Use appropriate model for each task',
    strategy: {
      haiku: 'Classification, parsing, simple tasks',
      sonnet: 'Medium code, standard analysis',
      opus: 'Only for complex tasks requiring reasoning',
    },
  },

  batching: {
    description: 'Group related requests',
    savings: 'Reduces context overhead',
    example: 'Review 5 PRs in one call instead of 5',
  },
};
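
For the caching strategy, Anthropic's prompt-caching docs describe a `cache_control` field on system content blocks. A sketch of a request body that marks a long, stable system prompt as cacheable; verify the field names against the current API docs before relying on this:

```javascript
// Sketch based on Anthropic's prompt-caching docs: mark a long,
// stable system prompt as a cacheable prefix so repeated requests
// reuse it at a discounted input rate. Verify against current docs.
function buildCachedRequest(longSystemPrompt, userPrompt) {
  return {
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    system: [
      {
        type: 'text',
        text: longSystemPrompt,
        // Cacheable prefix, reused across requests until it expires
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: userPrompt }],
  };
}

// Usage: await client.messages.create(buildCachedRequest(projectContext, question));
```

Caching pays off when the same project context is sent on every call, e.g. the code-review system prompt above.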

Practical Use Cases

Where Opus 4.5 shines in real scenarios.

1. Code Assistant in IDE

// VS Code integration via extension
class ClaudeCodeAssistant {
  constructor(client, model = 'claude-opus-4-5-20251101') {
    this.client = client;
    this.model = model;
  }

  // Shared helper used by all the methods below
  async query(prompt) {
    const response = await this.client.messages.create({
      model: this.model,
      max_tokens: 4096,
      messages: [{ role: 'user', content: prompt }],
    });
    return response.content[0].text;
  }

  async explainCode(selection) {
    return this.query(`Explain this code in detail:\n${selection}`);
  }

  async suggestRefactor(code, goal) {
    return this.query(`
      Refactor this code to ${goal}:
      ${code}
    `);
  }

  async generateTests(code) {
    return this.query(`
      Generate complete unit tests for:
      ${code}

      Include:
      - Happy path
      - Edge cases
      - Error cases
    `);
  }

  async debug(code, error) {
    return this.query(`
      This code is generating an error:

      Code:
      ${code}

      Error:
      ${error}

      Identify the cause and propose a fix.
    `);
  }
}

2. Automatic Documentation

async function generateDocs(codeFile) {
  const response = await askClaudeWithContext(
    `You are a technical writer generating API documentation.`,
    `
      Generate complete documentation for this module:

      ${codeFile}

      Include:
      - General description
      - Usage examples
      - Parameters and returns
      - Edge cases and errors
    `
  );

  return response;
}

3. Pull Request Analysis

async function analyzePR(diff, description) {
  return askClaudeWithContext(
    `You are a senior engineer reviewing PRs.
     Be constructive but rigorous.`,
    `
      PR: ${description}

      Diff:
      ${diff}

      Analyze:
      1. Does the change do what it promises?
      2. Are there bugs introduced?
      3. Are there performance issues?
      4. Are there security risks?
      5. Are tests sufficient?
      6. Improvement suggestions
    `
  );
}

Limitations and Considerations

What Claude Opus 4.5 doesn't do well.

Known Limitations

Limitation       Description                     Workaround
Speed            Slower than Sonnet              Use Sonnet for simple tasks
Cost             5x more expensive than Sonnet   Caching and right model
Hallucinations   Still occur                     Always verify output
Knowledge        Data cutoff                     RAG for recent data
Execution        Doesn't execute code            Integrate with sandbox

When NOT to Use

Avoid Opus 4.5 for:

✗ Simple classification tasks
  └─ Use Haiku: 60x cheaper

✗ Very high request volume
  └─ Use Sonnet: 5x cheaper

✗ Critical latency (<1s)
  └─ Use Haiku or GPT-4 Turbo

✗ Pure creative tasks
  └─ Models are comparable, use cheaper

✗ Real-time data needed
  └─ Combine with RAG/search
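
The rules above can be condensed into a tiny routing helper. The task categories, thresholds, and model names here are illustrative assumptions, not an official guideline:

```javascript
// Sketch: route each task to the cheapest model that can handle it.
// Categories, thresholds, and model names are illustrative.
function pickModel(task) {
  if (task.type === 'classification' || task.type === 'parsing') {
    return 'claude-3-haiku'; // simple tasks: ~60x cheaper than Opus
  }
  if (task.latencyCriticalMs && task.latencyCriticalMs < 1000) {
    return 'claude-3-haiku'; // latency-critical paths
  }
  if (task.complexity === 'high') {
    return 'claude-opus-4-5'; // reserve for hard reasoning
  }
  return 'claude-3-5-sonnet'; // sensible default for day-to-day code
}
```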

Conclusion

Claude Opus 4.5 represents the state of the art in language models for software development. The claim of outperforming human engineers in tests should be contextualized - these are specific tests in a controlled environment - but public benchmarks confirm impressive capabilities.

Main takeaways:

  1. Anthropic's best model for tasks requiring deep reasoning
  2. Excellent for code - debugging, review, architecture
  3. Effective long context - improved recall in 200k tokens
  4. High cost - use selectively for tasks that justify it
  5. Complementary, not replacement - always validate output

For developers, the recommendation is:

  • Use Haiku for simple tasks (classification, parsing)
  • Use Sonnet for day-to-day code
  • Reserve Opus 4.5 for really complex problems

If you want to understand more about how AI models are transforming development, check out our article about OpenAI o3 and Code Benchmarks.

Let's go! 🦅

📚 Want to Leverage AI to the Fullest in Development?

To use AI tools effectively, you need to understand the code they generate well.

Complete Study Material

If you want to strengthen your foundation to evaluate and improve AI-generated code:

Investment options:

  • 1 installment of $4.90 by card
  • or $4.90 upfront

👉 Learn About JavaScript Guide

💡 Solid fundamentals = AI as a powerful tool
