Mindrift | Engineering

Freelance Agent Evaluation Engineer

United Kingdom | up to $50 per hour | Posted 20 days ago

Mindrift connects specialists with project-based AI opportunities focused on testing, evaluating, and improving AI systems. In this role, you will create and evaluate tasks for AI coding agents in simulated environments, working with AI to develop challenging scenarios.

Location: United Kingdom

Salary: up to $50 per hour

Responsibilities

Build virtual companies following a high-level plan, including codebase, infrastructure, and context that form realistic development environments.
Assemble and calibrate tasks from intermediate states of virtual companies, craft prompts, define evaluation criteria, and ensure tasks are solvable and fair.
Design tasks in isolated environments emulating developer workstations with Linux, development tools, servers, and web application codebases.
Write tests that accept all correct solutions and reject incorrect ones, balancing strictness and leniency.
Iterate with AI agents on tests to verify their effectiveness and robustness.
Review code written by AI agents, analyze success and failure cases, and design edge cases and adversarial scenarios.
Iterate based on feedback from QA reviewers who score work on quality criteria.

Requirements

Degree in Computer Science, Software Engineering, or related fields.
5+ years of software development experience, primarily in Python (FastAPI, pytest, async/await, subprocess, file operations).
Background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and back-end systems.
Experience writing functional and integration tests.
Experience with Docker containers and infrastructure tools (Postgres, Kafka, Redis).
Understanding of CI/CD processes, especially GitHub Actions.
English proficiency - B2.

Additional Information

This is a project-based, part-time opportunity, not permanent employment.
The work involves significant collaboration with AI systems; the main challenge is creating tasks that truly test frontier models.
Estimated effort is around 20 hours per task, with flexible scheduling.
