Bounty: example debugging tasks for evals

Tl;dr: Looking for hard debugging tasks for evals, paying the greater of $60/hr or $200 per example.

METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt as part of an agentic capabilities evaluation. To create these tasks, we’re seeking repos containing extremely tricky bugs. If you send us a codebase that meets the criteria for submission (listed below), we will pay you $60/hr for the time spent putting it into our required format, or $200, whichever is greater. (We won’t pay for submissions that don’t meet these requirements.) If we’re particularly excited about your submission, we may also be interested in purchasing the IP rights to it. We expect to want about 10-30 examples overall, depending on their diversity. We’re likely to put bounties on additional types of tasks over the next few weeks.

Criteria for submission:

  • Contains a bug that would take at least 6 hours for an experienced programmer to solve, and ideally >20 hours

    • More specifically, “>6 hours for a decent engineer who doesn’t have context on this particular codebase”. E.g., a randomly selected engineer paid $100-$200 per hour who is familiar with the language and overall stack being used, but who isn’t the person who wrote the code and isn’t an expert in the particular component causing the bug.

  • Ideally, has not been posted publicly in the past

    • (Though note that we may still accept submissions from public repositories, provided they are not already in a SWE-bench dataset and meet the rest of our requirements. Check with us first.)

  • You have the legal right to share it with us (e.g. please don’t send us other people’s proprietary code or anything you signed an NDA about)

  • Ideally, the task should work well with static resources: e.g. the solver can have a local copy of the documentation for all the relevant libraries, or some other offline information, but doesn’t have general internet access.

    • This is because we want to make sure the difficulty doesn’t change over time, e.g. if someone posts a solution to Stack Overflow.

  • Ideally, the codebase is written in Python, but we will accept submissions written in other languages.

  • Is in the format described in this doc: Gnarly Bugs Submission Format

More context and guidance:

  • The eval will involve the model actively experimenting and trying to debug; it doesn’t have to be something where you can solve it by just reading the code.

  • Complexity is generally good (e.g. multiple files + modules, lots of interacting parts), but ideally the task should be easy to run without needing to spin up a lot of resources. Installing packages or starting a local server is fine; requiring a GPU is somewhat annoying.

  • All of these are valid types of tasks (see the toy sketch after this list):

    • The goal of the task isn’t to diagnose and directly fix the bug; it’s just to get the code working, so sidestepping the bug is a valid solution

    • You need to identify the specific line of code that is wrong and explain why the problem is happening

    • There are multiple bugs that need to be solved
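
To make these task types concrete, here’s a deliberately trivial, made-up Python sketch (the function and bug are invented purely for illustration; a real submission needs to be many hours harder):

```python
# Toy illustration only: real submissions should be far harder to debug.

def running_average(values):
    """Return the running average after each element of `values`."""
    averages = []
    total = 0
    for i, value in enumerate(values):
        total += value
        averages.append(total / i)  # BUG: off-by-one; should be total / (i + 1)
    return averages

if __name__ == "__main__":
    # Crashes with ZeroDivisionError on the first element.
    print(running_average([2, 4, 6]))
```

A “get the code working” task would accept any change that makes the script print correct output, including rewriting the function from scratch; an “identify and explain” task would require pointing at the `total / i` line and explaining the off-by-one; and a multi-bug variant would hide several such problems across the codebase.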

Please send submissions to gnarly-bugs@evals.alignment.org in the form of a zip file. Your email should include the number of hours it took you to get the code from its original state into our required format. If your submission meets our criteria and format requirements, we’ll contact you with a payment form. You’re also welcome to email gnarly-bugs@evals.alignment.org with any questions, including if you are unsure whether a potential submission would meet the criteria.

If you would only be willing to do this at a higher pay rate, please let us know!

(Also, if you are interested in forking SWE-bench to support non-Python codebases, please contact us.)