
Claude Opus 4.5 from Anthropic: The AI Model That Outperformed Engineers in Internal Tests

Hello HaWkers, Anthropic just launched Claude Opus 4.5, and the results are generating intense discussion in the developer community. According to the company, the model outperformed all human candidates in internal engineering tests.

This claim raises important questions about the future of software development and AI's role as a tool or coworker.

The Claude Opus 4.5 Announcement

Anthropic positioned Opus 4.5 as its most capable model to date.

What Internal Benchmarks Say

The result that caught attention:

"Claude Opus 4.5 outperformed all human candidates in Anthropic's internal engineering tests."

Important context:

  • The tests are the same ones given to Anthropic's own engineering candidates
  • They involve problem-solving, code, and reasoning
  • Anthropic candidates are highly qualified (ex-Google, ex-OpenAI, etc.)

Target Audience

The model was designed for:

  1. Professional developers - Assistance with complex code
  2. Knowledge workers - Analysts, consultants, researchers
  3. Enterprises - Cognitive task automation

Claude Opus 4.5 Capabilities

Analysis of the main improvements over previous versions.

Code and Engineering

SWE-bench Benchmark:

Model               Score   Bug Types
GPT-4 Turbo         23.5%   Simple
Claude 3.5 Sonnet   38.2%   Intermediate
Claude 3.5 Opus     45.8%   Complex
Claude Opus 4.5     52.3%   Very complex

Demonstrated capabilities:

// Example of task that Opus 4.5 solves well

// Input: Vague bug report
const bugReport = `
  User reports that checkout fails "sometimes"
  when there are more than 5 items in cart.
  Logs show timeout in production.
`;

// Expected output from Claude Opus 4.5:
/*
Bug Analysis:

1. SYMPTOM: Timeout in checkout with 5+ items
2. HYPOTHESIS: Shipping calculation with multiple items
3. INVESTIGATION:
   - Check ShippingCalculator.calculateBatch()
   - N+1 query in getShippingRates()
   - Connection pool exhausted

4. PROBABLE ROOT CAUSE:
   File: src/services/shipping.ts:142
   The loop makes one query per item instead of batch

5. SOLUTION:
   Refactor to batch query:
   [code with fix]

6. PREVENTION:
   - Add load test
   - Implement timeout with retry
   - Query count monitoring
*/

Long Context

Context window:

Model             Context   Practical Use
GPT-4             128k      ~96k useful tokens
Claude 3.5        200k      ~150k useful tokens
Claude Opus 4.5   200k      ~180k useful tokens (better recall)

The improvement is not in size, but in recall - the model can use information from distant parts of context more effectively.
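
To actually benefit from that recall, prompt structure matters. A sketch of a long-context prompt builder, following Anthropic's published long-context guidance (long documents first, the question last); the helper name and document tags are illustrative, not part of the SDK:

```javascript
// Sketch: put long documents first and the question last, following
// Anthropic's long-context prompting guidance. The tag format and
// function name are illustrative assumptions.
function buildLongContextPrompt(documents, question) {
  const docBlocks = documents
    .map((doc, i) => `<document index="${i + 1}">\n${doc}\n</document>`)
    .join('\n');

  // Instructions near the end of a long prompt tend to be
  // recalled more reliably than instructions buried at the top.
  return `${docBlocks}\n\nBased on the documents above, answer:\n${question}`;
}
```

The same prompt reversed (question first, 150k tokens of documents after it) is where recall typically degrades.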

Reasoning and Analysis

GPQA Diamond (PhD-level questions):

Performance on advanced scientific questions:

GPT-4 Turbo:        ████████████░░░░░░░░  60.2%
Claude 3.5 Opus:    ████████████████░░░░  78.4%
Gemini 3 Pro:       ███████████████░░░░░  76.8%
Claude Opus 4.5:    █████████████████░░░  84.1%

Writing and Text Analysis

An area where Claude traditionally excels.

Capabilities:

  • Long document analysis with high precision
  • Summarization that maintains important nuances
  • Clear and well-structured technical writing
  • Code to documentation translation

Claude Opus 4.5 vs Competitors

Direct comparison with other leading models.

General Comparison Table

Aspect      GPT-4 Turbo   Gemini 3 Pro   Claude Opus 4.5
Code        Very good     Very good      Excellent
Reasoning   Good          Very good      Excellent
Writing     Very good     Good           Excellent
Context     128k          1M             200k
Speed       Fast          Fast           Medium
Cost        Medium        Medium         High
API         Mature        Mature         Mature

When to Use Each One

const modelSelection = {
  gpt4_turbo: {
    bestFor: [
      'Fast tasks',
      'High request volume',
      'OpenAI ecosystem integration',
      'Limited budget',
    ],
    avoidWhen: [
      'Very long context needed',
      'Maximum precision in complex code',
    ],
  },

  gemini3_pro: {
    bestFor: [
      'Extremely long context (1M tokens)',
      'Google Workspace integration',
      'Complex multimodal analysis',
      'Search and RAG',
    ],
    avoidWhen: [
      'Creative and refined writing',
      'Tasks requiring nuance',
    ],
  },

  claude_opus_4_5: {
    bestFor: [
      'Complex code and debugging',
      'Deep document analysis',
      'Tasks requiring careful reasoning',
      'High-quality technical writing',
    ],
    avoidWhen: [
      'Very limited budget',
      'Critical latency',
      'Context above 200k',
    ],
  },
};

API and Integration

Practical guide to using Claude Opus 4.5 in projects.

Basic Setup

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// Basic call
async function askClaude(prompt) {
  const response = await client.messages.create({
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    messages: [
      {
        role: 'user',
        content: prompt,
      },
    ],
  });

  return response.content[0].text;
}

// With system prompt
async function askClaudeWithContext(systemPrompt, userPrompt) {
  const response = await client.messages.create({
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    system: systemPrompt,
    messages: [
      {
        role: 'user',
        content: userPrompt,
      },
    ],
  });

  return response.content[0].text;
}

Streaming

// Streaming for long responses
async function streamClaude(prompt) {
  const stream = await client.messages.create({
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta') {
      process.stdout.write(event.delta.text);
    }
  }
}
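
In production, API calls occasionally fail with rate limits or transient server errors. A generic retry wrapper with exponential backoff; this helper is not part of the Anthropic SDK, and the status codes and limits are assumptions to adjust for your setup:

```javascript
// Sketch: retry transient API failures (429 / 5xx) with exponential
// backoff. Generic helper, not part of the SDK.
async function withRetry(fn, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = err.status === 429 || err.status >= 500;
      if (!retryable || attempt >= retries) throw err;
      // Backoff doubles each attempt: 1s, 2s, 4s...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Usage: const answer = await withRetry(() => askClaude('...'));
```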

Automated Code Review

// Example: Automated code review with Claude
async function reviewCode(code, context = '') {
  const systemPrompt = `
    You are a senior engineer doing code review.
    Be direct, constructive and focused on:
    1. Bugs and edge cases
    2. Performance
    3. Security
    4. Readability
    5. Best practices

    Response format:
    ## Summary
    [Overall assessment in 1-2 sentences]

    ## Issues
    [Prioritized list of problems]

    ## Suggestions
    [Optional improvements]

    ## Corrected Code (if applicable)
    [Improved version]
  `;

  const userPrompt = `
    ${context ? `Context: ${context}\n\n` : ''}
    Review this code:

    \`\`\`
    ${code}
    \`\`\`
  `;

  return askClaudeWithContext(systemPrompt, userPrompt);
}

// Usage
const review = await reviewCode(`
  async function getUsers() {
    const users = await db.query("SELECT * FROM users WHERE name = '" + req.query.name + "'");
    return users;
  }
`);

// Claude will identify:
// - SQL Injection
// - Missing input validation
// - Unnecessary SELECT *
// - Missing error handling
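
The fix such a review points to is a parameterized query instead of string concatenation. A minimal sketch; the `$1` placeholder syntax and the query shape are illustrative, since the exact form depends on your database driver:

```javascript
// Sketch of the corrected query: parameters travel separately from
// the SQL text, so user input can never change the query structure.
// Placeholder syntax ($1, ?, :name) varies by driver.
function buildGetUsersQuery(name) {
  return {
    sql: 'SELECT id, name, email FROM users WHERE name = $1',
    params: [name],
  };
}

// Hypothetical usage with a driver that accepts (sql, params):
// const { sql, params } = buildGetUsersQuery(req.query.name);
// const users = await db.query(sql, params);
```

This also addresses the `SELECT *` finding by naming only the columns the caller needs.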

Codebase Analysis

// Analyze project architecture
async function analyzeCodebase(files) {
  const systemPrompt = `
    You are a software architect analyzing a project.
    Analyze the structure, patterns used, and suggest improvements.
  `;

  const fileContents = files
    .map(f => `=== ${f.path} ===\n${f.content}`)
    .join('\n\n');

  const userPrompt = `
    Analyze this project:

    ${fileContents}

    Focus on:
    1. Overall architecture
    2. Design patterns
    3. Improvement points
    4. Technical risks
  `;

  return askClaudeWithContext(systemPrompt, userPrompt);
}
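
A real codebase can exceed even a 200k context window, so the file list usually needs batching first. A sketch using the rough heuristic of ~4 characters per token; for exact counts you would use the API's token-counting support instead, and the budget below is an illustrative assumption:

```javascript
// Sketch: split files into batches that fit a token budget, using
// the rough ~4 chars/token heuristic. Budget is an assumption;
// leave headroom for the system prompt and the response.
function chunkFilesByTokens(files, maxTokens = 150_000) {
  const estimateTokens = text => Math.ceil(text.length / 4);
  const batches = [];
  let current = [];
  let currentTokens = 0;

  for (const file of files) {
    const tokens = estimateTokens(file.content);
    // Start a new batch when adding this file would exceed the budget
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(file);
    currentTokens += tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Usage: for (const batch of chunkFilesByTokens(files)) await analyzeCodebase(batch);
```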

Pricing and Costs

Claude Opus 4.5 is a premium model.

Pricing Structure

Prices per 1 million tokens:

Model               Input    Output   Context Cache
Claude 3 Haiku      $0.25    $1.25    $0.03
Claude 3.5 Sonnet   $3       $15      $0.30
Claude 3.5 Opus     $15      $75      $1.50
Claude Opus 4.5     $15      $75      $1.50

Calculating Costs

function calculateCost(inputTokens, outputTokens) {
  const PRICES = {
    haiku: { input: 0.25, output: 1.25 },
    sonnet: { input: 3, output: 15 },
    opus: { input: 15, output: 75 },
  };

  return Object.entries(PRICES).reduce((acc, [model, prices]) => {
    const cost = (
      (inputTokens / 1_000_000) * prices.input +
      (outputTokens / 1_000_000) * prices.output
    );
    acc[model] = `$${cost.toFixed(4)}`;
    return acc;
  }, {});
}

// Example: Code review of medium file
// ~2000 tokens input, ~1500 tokens output
console.log(calculateCost(2000, 1500));
// {
//   haiku: '$0.0024',
//   sonnet: '$0.0285',
//   opus: '$0.1425'
// }

// Example: Large codebase analysis
// ~50000 tokens input, ~5000 tokens output
console.log(calculateCost(50000, 5000));
// {
//   haiku: '$0.0188',
//   sonnet: '$0.2250',
//   opus: '$1.1250'
// }

Optimizing Costs

Strategies:

const costOptimization = {
  caching: {
    description: 'Use prompt caching for repeated context',
    savings: 'Up to 90% on input tokens',
    when: 'Long system prompts, project context',
  },

  rightModel: {
    description: 'Use appropriate model for each task',
    strategy: {
      haiku: 'Classification, parsing, simple tasks',
      sonnet: 'Medium code, standard analysis',
      opus: 'Only for complex tasks requiring reasoning',
    },
  },

  batching: {
    description: 'Group related requests',
    savings: 'Reduces context overhead',
    example: 'Review 5 PRs in one call instead of 5',
  },
};
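
For the caching strategy, Anthropic's prompt-caching docs describe a `cache_control` field on system content blocks. A sketch of a request body that marks a long, stable system prompt as cacheable; verify the field names against the current API docs before relying on this:

```javascript
// Sketch based on Anthropic's prompt-caching docs: mark a long,
// stable system prompt as a cacheable prefix so repeated requests
// reuse it at a discounted input rate. Verify against current docs.
function buildCachedRequest(longSystemPrompt, userPrompt) {
  return {
    model: 'claude-opus-4-5-20251101',
    max_tokens: 4096,
    system: [
      {
        type: 'text',
        text: longSystemPrompt,
        // Cacheable prefix, reused across requests until it expires
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: userPrompt }],
  };
}

// Usage: await client.messages.create(buildCachedRequest(projectContext, question));
```

Caching pays off when the same project context is sent on every call, e.g. the code-review system prompt above.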

Practical Use Cases

Where Opus 4.5 shines in real scenarios.

1. Code Assistant in IDE

// VS Code integration via extension
class ClaudeCodeAssistant {
  constructor(client, model = 'claude-opus-4-5-20251101') {
    this.client = client;
    this.model = model;
  }

  // Shared helper used by all the methods below
  async query(prompt) {
    const response = await this.client.messages.create({
      model: this.model,
      max_tokens: 4096,
      messages: [{ role: 'user', content: prompt }],
    });
    return response.content[0].text;
  }

  async explainCode(selection) {
    return this.query(`Explain this code in detail:\n${selection}`);
  }

  async suggestRefactor(code, goal) {
    return this.query(`
      Refactor this code to ${goal}:
      ${code}
    `);
  }

  async generateTests(code) {
    return this.query(`
      Generate complete unit tests for:
      ${code}

      Include:
      - Happy path
      - Edge cases
      - Error cases
    `);
  }

  async debug(code, error) {
    return this.query(`
      This code is generating an error:

      Code:
      ${code}

      Error:
      ${error}

      Identify the cause and propose a fix.
    `);
  }
}

2. Automatic Documentation

async function generateDocs(codeFile) {
  const response = await askClaudeWithContext(
    `You are a technical writer generating API documentation.`,
    `
      Generate complete documentation for this module:

      ${codeFile}

      Include:
      - General description
      - Usage examples
      - Parameters and returns
      - Edge cases and errors
    `
  );

  return response;
}

3. Pull Request Analysis

async function analyzePR(diff, description) {
  return askClaudeWithContext(
    `You are a senior engineer reviewing PRs.
     Be constructive but rigorous.`,
    `
      PR: ${description}

      Diff:
      ${diff}

      Analyze:
      1. Does the change do what it promises?
      2. Are there bugs introduced?
      3. Are there performance issues?
      4. Are there security risks?
      5. Are tests sufficient?
      6. Improvement suggestions
    `
  );
}

Limitations and Considerations

What Claude Opus 4.5 doesn't do well.

Known Limitations

Limitation       Description                     Workaround
Speed            Slower than Sonnet              Use Sonnet for simple tasks
Cost             5x more expensive than Sonnet   Caching and right model
Hallucinations   Still occur                     Always verify output
Knowledge        Data cutoff                     RAG for recent data
Execution        Doesn't execute code            Integrate with sandbox

When NOT to Use

Avoid Opus 4.5 for:

✗ Simple classification tasks
  └─ Use Haiku: 60x cheaper

✗ Very high request volume
  └─ Use Sonnet: 5x cheaper

✗ Critical latency (<1s)
  └─ Use Haiku or GPT-4 Turbo

✗ Pure creative tasks
  └─ Models are comparable, use cheaper

✗ Real-time data needed
  └─ Combine with RAG/search
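
The rules above can be condensed into a tiny routing helper. The task categories, thresholds, and model names here are illustrative assumptions, not an official guideline:

```javascript
// Sketch: route each task to the cheapest model that can handle it.
// Categories, thresholds, and model names are illustrative.
function pickModel(task) {
  if (task.type === 'classification' || task.type === 'parsing') {
    return 'claude-3-haiku'; // simple tasks: ~60x cheaper than Opus
  }
  if (task.latencyCriticalMs && task.latencyCriticalMs < 1000) {
    return 'claude-3-haiku'; // latency-critical paths
  }
  if (task.complexity === 'high') {
    return 'claude-opus-4-5'; // reserve for hard reasoning
  }
  return 'claude-3-5-sonnet'; // sensible default for day-to-day code
}
```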

Conclusion

Claude Opus 4.5 represents the state of the art in language models for software development. The claim of outperforming human engineers in tests should be contextualized - these are specific tests in a controlled environment - but public benchmarks confirm impressive capabilities.

Main takeaways:

  1. Anthropic's best model for tasks requiring deep reasoning
  2. Excellent for code - debugging, review, architecture
  3. Effective long context - improved recall in 200k tokens
  4. High cost - use selectively for tasks that justify it
  5. Complementary, not replacement - always validate output

For developers, the recommendation is:

  • Use Haiku for simple tasks (classification, parsing)
  • Use Sonnet for day-to-day code
  • Reserve Opus 4.5 for really complex problems

If you want to understand more about how AI models are transforming development, check out our article about OpenAI o3 and Code Benchmarks.

Let's go! 🦅

📚 Want to Leverage AI to the Fullest in Development?

To use AI tools effectively, you need to understand the code they generate well.

Complete Study Material

If you want to strengthen your foundation to evaluate and improve AI-generated code:

Investment options:

  • 1 installment of $4.90 by card
  • or $4.90 upfront

👉 Learn About JavaScript Guide

💡 Solid fundamentals = AI as a powerful tool
