Photo by Maximalfocus on Unsplash

AI Agents Battle: Hype or Foes?

The idea that AI will replace human jobs has been a topic of much debate, and the recent emergence of vibe coders has only added fuel to the fire. Whilst we're well aware that AI models are capable of crafting simple web apps and games like tic-tac-toe, the question remains: can they truly create something that transcends the realm of proof-of-concept?

The Setup

To answer this, I've devised an experiment consisting of three challenges of varying complexity, which will be presented to some of the most popular AI agents. For those unfamiliar with agentic AI, the term refers to giving AI models tasks that require multiple steps of cognitive reasoning to complete.

The Matrix™ (© Warner Bros. Entertainment) on GIPHY

Given the time constraints of my Sunday afternoon, I'll be focusing on four prominent agents: Cursor (Pro with auto-select), Copilot Agent Mode, Roo Code (formerly Roo Cline), and Windsurf. Each agent will receive the exact same prompts, with no special setup or tweaks to influence the outcome.

I’m fully aware that there are numerous other tools, models, and settings that could potentially enhance the results, but the purpose of this article is not to extract the absolute best from AI agents and integrated development environments (IDEs). Rather, the aim is to provide a snapshot of the current state of AI development tools in their out-of-the-box configuration.

The agents will be working with Infinite OS, an open-source, metamorphic container image that allows users to deploy applications without the need for Dockerfiles, relying solely on user interfaces.

Infinite OS has been around for a while, so it's likely that the agents have been trained on it, which may give them a slight advantage. However, the project's unique nature, combined with its use of Clean Architecture, Domain-Driven Design (DDD), and a niche front-end architecture, should provide a sufficiently challenging environment for the agents to operate in.

Task 1: Single Change

The first task is a relatively straightforward challenge, one that a junior coder might tackle manually within a few hours. Infinite OS currently employs BadgerDB, an embedded key-value database, to cache trivial information in memory. Since we already use SQLite to persist data, we can simply add a second, in-memory SQLite implementation to replicate the functionality of BadgerDB without introducing new dependencies.

As SQLite is managed via GORM, an Object-Relational Mapping (ORM) tool, the agents will likely need to create a database model within the service file, although I won't provide explicit instructions to do so. Additionally, they will be asked to replace the Get method with Read, allowing us to assess their ability to identify the sole instance where this method is used. As a minor concession, I will provide the agents with the relevant files to modify, selecting them in the context tab.
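
For anyone unfamiliar with what such a swap looks like, here is a minimal sketch of the direction I expect the agents to take. The names (TransientItem, NewTransientDatabaseService) are my own illustration rather than the project's actual API, but the GORM and SQLite driver calls are real:

    package infra

    import (
        "gorm.io/driver/sqlite"
        "gorm.io/gorm"
    )

    // TransientItem is a hypothetical key-value model standing in for the
    // data previously cached in BadgerDB.
    type TransientItem struct {
        Key   string `gorm:"primaryKey"`
        Value string
    }

    // NewTransientDatabaseService opens an in-memory SQLite database via GORM.
    // The shared-cache DSN keeps the data alive across connections within the
    // same process.
    func NewTransientDatabaseService() (*gorm.DB, error) {
        db, err := gorm.Open(sqlite.Open("file::memory:?cache=shared"), &gorm.Config{})
        if err != nil {
            return nil, err
        }
        if err := db.AutoMigrate(&TransientItem{}); err != nil {
            return nil, err
        }
        return db, nil
    }

    // Read is the renamed Get: it looks a cached value up by its key.
    func Read(db *gorm.DB, key string) (string, error) {
        var item TransientItem
        if err := db.Where("key = ?", key).First(&item).Error; err != nil {
            return "", err
        }
        return item.Value, nil
    }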

Initial Prompt

The project is using BadgerDB for the TransientDatabaseService. Your job is to refactor this file to use SQLite in in-memory mode with GORM similar to what was done in PersistentDatabaseService. The “Get” method must be renamed to “Read” and any occurrences of this method in the other files must be replaced with the new name. Once completed, BadgerDB may be removed from the project entirely. You should also consider creating a unit test for the TransientDatabaseService file as well as keeping the code style used in the project such as following Clean Architecture, DDD, Object Calisthenics etc keeping the code legible but not too verbose.

Results

Cursor

I was pleased with the Cursor UI, which seemed more user-friendly than VSCode alone. The initial setup was also easier than expected. The agent was able to complete the task relatively quickly, but it did make a few mistakes. For example, it added a new library to write unit tests, which wasn't necessary and actually went against the goal of reducing dependencies. However, I didn't provide enough context about using the native testing library, so I'll give Cursor the benefit of the doubt.

Aside from that, Cursor demonstrated a good understanding of the project's structure, even though I didn't explicitly select certain directories for it to work with. It created a database model and used the correct package to store it.

The main issue I had with Cursor's performance was that it didn't fully remove BadgerDB from the project. Even after I rejected the unnecessary library, Cursor tried to install it again, which showed that it didn't fully understand my intentions.

After I pointed out the mistake, Cursor tried to correct it, but the unit test it wrote was not in the same style as the rest of the project. This was a bit disappointing, as I had specified that I wanted the code to follow a certain methodology.

Despite these issues, I was impressed with Cursor's performance, especially considering its relatively low cost. I think it's already paid for itself in terms of the time it saved me. However, I'm curious to see how it will perform on more complex tasks, and whether it can truly replace human engineers. If you're interested in finding out, keep reading.

Copilot Agent Mode

Next, I tried Copilot Agent Mode, which is similar to Cursor, but the experience was distinct. It's possible that the AI model or interface is the reason for the difference. One of the first things I noticed was the layout of the main menu, which was located at the top instead of the sidebar. I also missed the "Review Next File" feature and a few other details that made the experience feel less polished.

When it came to refactoring the file, Copilot Agent Mode was successful, but it took a different approach than Cursor. Instead of using the dbModel, it opted for raw SQL, which wasn't what I was expecting. Additionally, it created a test file using the testify library, rather than the native testing method.

The experience took a strange turn when I tried to run the tests. Since the unit tests in this project rely on a container, they wouldn't have worked, and I had to cancel the attempt. After that, the chat cleared, making it seem like the task was complete, which was confusing.

Overall, my initial impression is that Cursor feels more advanced and refined compared to Copilot Agent Mode. However, Copilot was faster in its execution. One issue I encountered was that it didn't update the files that used the Get method to Read, possibly because the tests couldn't be run. Adding a "skip this step" button might have helped in this situation.

Roo Code

Next, I decided to try out Roo Code, an extension for VSCode that I had come across while researching comparisons between Cursor and Windsurf. I thought it was worth giving it a shot, especially since I could use it with DeepSeek and compare with the other two.

My experience with Roo Code was cut short. During the initial setup, I was helpfully advised to break tasks down into smaller pieces to get the most out of the agent. But when I sent the prompt, it got stuck and never produced anything.

Although I won't be able to try it out further, I did notice that its autocomplete felt nicer than the traditional Copilot one, so I might consider using it as a standard assistant in the future, if $10 (what Copilot charges) buys enough DeepSeek credits to keep autocomplete running for a month.

Windsurf

Moving on, I decided to try Windsurf with DeepSeek V3. If the output wasn't satisfactory, I could always try R1 or Claude, just like I did with Copilot. The setup for Windsurf was straightforward, and the UI was pleasant to use.

When I sent the prompt, it stopped without any error messages or alerts. I wondered if I had already run out of credits, but a quick check on the Windsurf website revealed that no credits had been used. It's possible that the DeepSeek servers were experiencing issues.

I decided to switch to Claude, using version 3.7. Fortunately, it worked, and Windsurf turned out to be the only agent that didn't add the testify library to write the tests. Although it didn't separate the tests into individual “t.Run()” blocks, it did provide a valid go test command that would only run the newly created unit test, without breaking anything. Windsurf took the lead in this round, and I was a bit impressed with the quality of the code it produced.
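
For reference, the style I was hoping for looks something like this: individual t.Run() blocks per case, using nothing beyond the standard testing package (the test name and cases here are illustrative):

    package infra_test

    import "testing"

    func TestTransientDatabaseService(t *testing.T) {
        t.Run("ReadExistingKey", func(t *testing.T) {
            // arrange the service, write a key, assert Read returns its value
        })

        t.Run("ReadMissingKey", func(t *testing.T) {
            // assert Read returns an error for an unknown key
        })
    }

A command such as go test -run TestTransientDatabaseService ./... (the path pattern is hypothetical) would then exercise only that test, much like the command Windsurf suggested.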

In fact, I liked the Windsurf code so much that I decided to fix the minor mistakes and commit it to the project. The autocomplete feature in Windsurf was also noteworthy.

I did notice that the usage costs were adding up quickly. According to the Windsurf website, I had been charged for only one prompt, but 14 out of 200 flow actions had been used. This made me a bit concerned about the limitations of the free plan and whether I would be able to complete the article without needing to upgrade.

Task 2: Multi-file Changes

Moving on to the next task, I've designed a challenge that I still consider to be at a rookie level, but it's a bit more tedious and time-consuming. A junior coder should be able to complete it in 6-8 hours, while a mid-level coder might finish it in 4-6 hours.

The task involves implementing pagination features for a few missing endpoints. The agents will be prompted to create the necessary request and response data transfer objects (DTOs) for pagination, similar to what's already been done for other endpoints. They'll also need to update the infrastructure implementation, API, CLI, and UI code to work with the new input and output.
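
Concretely, the request/response pair I have in mind is roughly the following; the field names are hypothetical, and the project's actual ReadVirtualHosts and ReadSslPairs DTOs will differ in detail:

    package dto

    // ReadDatabasesRequest carries pagination, filtering and sorting in one place.
    type ReadDatabasesRequest struct {
        PageNumber    uint32
        ItemsPerPage  uint16
        SortBy        *string
        SortDirection *string
        DatabaseName  *string // optional filter
    }

    // ReadDatabasesResponse returns the page of results plus pagination metadata.
    type ReadDatabasesResponse struct {
        PageNumber uint32
        PagesTotal uint32
        ItemsTotal uint64
        Databases  []Database
    }

    // Database is a stand-in for the project's entity.Database.
    type Database struct {
        Name string
        Type string
        Size uint64
    }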

To keep things fair, I've limited the task to just one endpoint, and I won't require any visual changes, such as adding pagination buttons. However, to make things a bit more interesting, the endpoint being modified is the Read Databases endpoint. This is a unique scenario because Infinite OS, being an infrastructure software, has both an internal database and manages user databases (external).

The distinction can be tricky, even for our own developers, as it requires understanding the nuances of how the project uses and manages databases. This challenge will put the AI models to the test, requiring them to demonstrate a deep understanding of the project's complexities.

Initial Prompt

The read database endpoint still returns a slice of entity.Database. Newer code uses a pair of request and response DTOs as seen in ReadVirtualHosts and ReadSslPairs use cases for instance to allow for a granular read operation. Your job is to adapt ReadDatabases use case to the new format as well as adjust the infrastructure implementation (not only the repository interface) to comply with the new pagination, filter and sort features featured in the request DTO.

The presentation layer will also require adjustments for this input and output change on the API, CLI and UI layer. You can also refer to the VirtualHosts and SslPairs endpoints to understand how this is normally done in the code. Adhering to the code style in the project such as following Clean Architecture, DDD, Object Calisthenics etc is a must, careful to strike the right balance regarding legibility and verbosity.

Pro tip: read databases has nothing to do with the internal database used by the project. Since the project is an infrastructure management platform you must understand the difference between the database used FOR the project and the external databases managed BY the project.

You’re not required to perform any visual changes on the UI such as adding the actual pagination and filter buttons, you may stick to the adjustment of the input and output.

Results

Cursor

The agent, in a rather ambitious attempt, seemed determined to reinvent the wheel. Instead of looking at how pagination is done in the explicitly mentioned use cases, it went off on its own, even creating a database service inside the UI where there wasn't one before. All these changes had to be rejected.

It makes one wonder if I'm asking too much of the technology. Perhaps doing such a complex task in a single prompt is too much to expect? Still, let's give it another try. This time, instead of selecting just a few files, I'll select the entire src/ directory, hoping it will understand better.

Allowing it to scan the whole src/ directory completely changed things. The agent was able to understand the codebase and even added some good ideas, like SizeGreaterThan, which we don’t usually have on the read DTOs but which makes perfect sense given we already have CreatedBeforeAt and CreatedAfterAt.

It was able to use the PaginationParser helper on all the presentation layers, a welcome change from its previous attempts to parse the pagination data manually. However, the infrastructure layer was a different story. In the other cases I mentioned to the agent, we get rid of methods like ReadByName() in favor of a more general Read() and ReadFirst() with the desired filtering inside the read request DTO – which is the actual purpose of the DTO.
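
Sketched as an interface, the target shape is roughly this (illustrative stubs, not the project's literal code):

    package repository

    // Illustrative stubs; the real request DTO and entity live elsewhere.
    type ReadDatabasesRequest struct {
        DatabaseName *string
        PageNumber   uint32
        ItemsPerPage uint16
    }

    type Database struct {
        Name string
    }

    // Rather than one narrow method per filter (ReadByName, ReadByType, ...),
    // the repository exposes a general Read plus a ReadFirst, with all the
    // filtering carried by the request DTO.
    type DatabaseQueryRepo interface {
        Read(requestDto ReadDatabasesRequest) ([]Database, error)
        ReadFirst(requestDto ReadDatabasesRequest) (Database, error)
    }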

This nuance, though, wasn't clear to the agent at first. It also introduced a different pattern for how the filtered slice is returned. I pointed this out, and the agent fixed some issues and rewrote others, but a few inconsistencies remain. For example, it removed ReadByName() from the repository interface but didn't remove its implementation, nor did it replace the place where it's used with the new ReadFirst() method.

The strangest decision was rewriting the entire MySqlCmdRepo and PostgreSqlCmdRepo implementations. Perfectly functional raw SQL queries were replaced with new libraries, and to my surprise, the original implementations were left in the code, completely unused. Go won’t even let this code compile. Plus, why would you hardcode PostgreSQL credentials directly in the code?

Adding to the confusion, the agent also removed fields from the DatabaseUser value object – fields that are actively used by other parts of the code. This seems like a deliberate breakage, completely unnecessary, and rather odd. We’ve gone from the Star Wars experience of flying an X-Wing with the reliable R2-D2 to navigating with a somewhat rogue and definitely confused bot.

I'm fully aware that giving more prompts and more context is an option. However, the point here isn't whether the agent can eventually achieve the desired outcome, but how autonomously it can deliver good results.

At this point, I’m fairly sure task number three will be too much. While the AI coding experience has certainly offered some interesting ideas and has a certain magic to it, unless you’re a skilled software engineer who can spot these flaws, you might be in for a difficult time.

Copilot Agent Mode

In the announcement post for Copilot Agent Mode, the VSCode team mentioned their preference for Sonnet over GPT-4o. I'm beginning to see their reasoning. It didn’t introduce the SizeGreaterThan functionality, its CLI flags employed different shorthands from the rest of the code, it interfered with the front-end despite explicit instructions to the contrary, and it opted to handle pagination without leveraging the existing mechanisms.

The entire experience bears a resemblance to my earlier attempts at generating Vue files with GPT-3. It seems to lose its context rather quickly and veers off in an unexpected direction. I decided to give it another shot, this time specifically selecting Sonnet over GPT-4o. Let's make the most of it before May, when the Copilot plan undergoes changes and hard limits will be implemented.

The request DTO still lacked the SizeGreaterThan feature, which is surprising as I had thought Cursor was utilising Sonnet. Perhaps it is, but Sonnet was less inventive on this occasion? The agent did generate the ReadFirst() method but, once again, failed to replace ReadByName(). It did, however, create the DatabaseReadRequestFactory in the presentation layer, a first thus far, and look at that – it reused the CLI flags for pagination!

Oddly, there were some type errors left, such as attempting to multiply a uint16 by a uint32. Simple fixes, though. So far, Cursor still provided the best experience, but Sonnet is undoubtedly superior to GPT-4o. In my humble opinion, Sonnet makes Copilot Agent Mode genuinely usable.
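
For those who don't write Go daily, this is what that error amounts to: the compiler refuses to mix integer widths, and an explicit conversion is all that's needed (variable names are mine):

    package main

    import "fmt"

    func main() {
        var itemsPerPage uint16 = 10
        var pageNumber uint32 = 3

        // offset := itemsPerPage * pageNumber
        // invalid operation: mismatched types uint16 and uint32

        offset := uint32(itemsPerPage) * pageNumber // explicit conversion fixes it
        fmt.Println(offset)
    }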

Windsurf

The initial task gave me considerable hope for Windsurf, although the usage limits were a slight concern. Straight away, it provided the SizeGreaterThan – excellent! The usage now stands at 34/200; however, it ceased operation without completing the task.

It introduced a dbHelper.PaginateSlice function, which doesn't currently exist in the codebase, and is also passing the requestDto down to the MySQL and PostgreSQL command implementations, despite those files remaining unchanged. There's no obvious "continue" button, unlike what we saw with Copilot, so I simply typed “Continue” and let it proceed.

It implemented the pagination and filtering down at the implementation level, rather than performing it within the main Read() method, as Cursor had done. This is interesting because both are reportedly using Claude. It's likely just a variation in execution, or perhaps there are some underlying prompts at play between the two?

The endpoints mentioned in the prompt don't handle pagination with conditional if blocks, yet Windsurf decided to take a different route and used them anyway. The same occurred with the CLI flags, which used different shorthands for pagination. It could be my perception, but as soon as you type commands like “continue,” the agent seems to lose some of the initial context it had when it first scanned the code.

At least Windsurf didn’t meddle with the UI like Copilot or go completely off-piste in the infrastructure layer like Cursor. As soon as the task increases in complexity, we see the agents begin to diverge significantly. None of them managed to conclude the actual task, even after several prompts, which isn't entirely surprising given they all reportedly utilize the same underlying AI model.

Task 3: New Feature

The third task presents an interesting challenge, and frankly, I harbour no great expectations that the agents will be able to pull it off. It's not an inherently difficult undertaking, but it will necessitate a significant amount of modification, including visual elements. I would readily entrust this task to a mid-level developer, anticipating completion within approximately five days, while a more senior engineer might tackle it in two to three.

The agents will be tasked with developing an entirely new feature within Infinite OS: Mapping Security Rules. This functionality will enable users to define a security preset for mappings created via the OS, allowing for rate limiting or the complete blocking of IP addresses. Beyond understanding how to implement such restrictions within the NGINX .conf files – which are already managed by the mappingCmdRepo.go – there will also need to be an association established between the existing Mapping database model and a new MappingSecurityRule database model.
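
With GORM, that association boils down to something along these lines. This is a rough sketch with guessed field and package names, not the project's actual models:

    package dbModel

    // MappingSecurityRule is the new model; only a handful of the fields
    // described later in the prompt are shown here.
    type MappingSecurityRule struct {
        ID                        uint64 `gorm:"primaryKey"`
        Name                      string
        Description               string
        AllowedIps                string // serialized []string
        BlockedIps                string // serialized []string
        SoftLimitRequestsPerIp    *uint
        HardLimitRequestsPerIp    *uint
        ResponseCodeOnMaxRequests *uint
    }

    // Mapping gains an optional foreign key; GORM resolves the "belongs to"
    // association from the <FieldName>ID naming convention.
    type Mapping struct {
        ID                    uint64 `gorm:"primaryKey"`
        Path                  string
        MappingSecurityRuleID *uint64
        MappingSecurityRule   *MappingSecurityRule
    }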

On the front-end side of things, the create and update modals for mappings will require a new, optional field labelled “Mapping Security ID”. This field should present a selection box containing label-value pairs that clearly describe each security rule. To facilitate the management of these security rules, the agents will need to encapsulate the current mappings page within a horizontal tabbed interface and create additional tabs for listing and managing the security rules themselves.

And that's not the entirety of it. Given the existence of preconfigured presets, the agent will also need to grasp how to pre-seed the database with these presets. Crucially, everything I'm asking of these agents has a precedent within the existing codebase, from the horizontal tabbed layout to the database pre-seeding mechanism.

Initial Prompt

Infinite OS allows users to create mappings which are in essence abstracted NGINX configurations. The code responsible for such a feature is divided between Mapping and VirtualHosts files on the src/ directory. Your job is to extend such features by creating the Mappings Security Rules. This new addition consists of a fully functional CRUD to allow the user to manage mapping security rules via API, CLI and UI. The mapping security rules fields besides the “id”, “name” and “description” should also contain "allowedIps", a []string, "blockedIps" a []string and optional uints named "softLimitRequestsPerIp", "hardLimitRequestsPerIp", "responseCodeOnMaxRequests", "maxConnectionsPerIp", "bandwidthLimitPerConnection", "downloadThrottleThreshold" and "responseCodeOnMaxConnections".

By default, 5 presets should be created named “relaxed”, “low”, “medium”, “high” and “strict” with only “strict” blocking all IP addresses by default. “relaxed” should have 16, 32, 429, 16, 32M, 64M and 420 as the values for the fields I mentioned respectively, where 16 is the “softLimitRequestsPerIpPerSec” and “420” is the “responseCodeOnMaxConns”. From that we establish a progression, such as “low” going 8, 16, 429, 8, 16M, 32M, 420 and “medium” going 4, 8, 429, 4, 8M, 16M and 420. “high” uses 2, 4, 429, 2, 4M, 8M and 420 and finally “strict” uses 1, 2, 429, 1, 2M, 4M and 420. There are examples on how to pre-seed the database on the InstalledService database model.

You must use the existing Mapping page to show the mapping security rules table and modals such as done for all other features of the application. One idea is to put the existing Mapping page content into an horizontal tab where the main and first tab is called Mapping and the secondary tab is called Security Rules. That way the user may alternate between tabs when desired. The Mapping creation and update modals should have a field called Mapping Security Id which would be a select box based on label value pair so the label is a description of what the security rule is instead of just the name or ID but the value is the ID so the form can perform the API call.

Once the CRUD is fully assembled on all the presentation layers and the domain layer is adjusted, make sure to implement the necessary infrastructure changes on the vhostsInfra package. The mappingCmdRepo already performs manipulation on NGINX .conf files so it should be fairly easy to use NGINX rate and ip limit configurations based on the security rule ID of the mapping entity. The default mapping ID should be the “relaxed” one.

Results

Cursor

As I mentioned previously, my hopes for this particular task were rather low. The prompt was substantial, and given the outcomes of the preceding attempts, I was fairly certain Cursor wouldn't manage it. It ceased operation after 25 tool calls, but it did generate a rather overwhelming number of files.

One aspect I was particularly uncertain about was how it would handle the “block all IP addresses” instruction within the “strict” rule and how it would manage the Description value object, as these are often points where human error creeps in.

Cursor, however, got it spot on. It correctly opted to use the 0.0.0.0/0 CIDR to represent “all IP addresses” and devised a regular expression for the description based on a 255-character limit, allowing for any character except line breaks. Jolly good show!
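
My loose reconstruction of what it produced, give or take the exact names (which are my own here), looks like this:

    package valueObject

    import (
        "errors"
        "regexp"
    )

    // AllIpAddressesCidr is how "block all IP addresses" ends up being represented.
    const AllIpAddressesCidr = "0.0.0.0/0"

    // Up to 255 characters, anything except line breaks.
    var descriptionRegex = regexp.MustCompile(`^[^\r\n]{1,255}$`)

    type MappingSecurityRuleDescription string

    func NewMappingSecurityRuleDescription(value string) (MappingSecurityRuleDescription, error) {
        if !descriptionRegex.MatchString(value) {
            return "", errors.New("InvalidMappingSecurityRuleDescription")
        }
        return MappingSecurityRuleDescription(value), nil
    }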

This time around, it correctly generated Read() and ReadFirst() methods and even had the foresight to include the customary operatorAccountId and operatorIpAddress in the DTO. The ReadFirst() implementation doesn't rely on calling Read() with the pagination set to 1; instead, it uses GORM's First() method, which, to be honest, is likely the more technically sound approach.

Another point of note is that instead of utilizing GORM's native serializer, it opted for json.Marshal + string. A questionable decision, that. The constructors for the entities, which previously only introduced line breaks when necessary, began to use them for every single variable.
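
For the curious, the two GORM details from the last couple of paragraphs look like this in practice; the type and field names are my own choosing:

    package dbModel

    import "gorm.io/gorm"

    // SecurityRuleModel uses the serializer:json tag, GORM's built-in way of
    // persisting slices as JSON without any manual json.Marshal plumbing.
    type SecurityRuleModel struct {
        ID         uint64 `gorm:"primaryKey"`
        Name       string
        AllowedIps []string `gorm:"serializer:json"`
        BlockedIps []string `gorm:"serializer:json"`
    }

    // ReadFirstSecurityRule backs ReadFirst() with GORM's First() directly,
    // rather than calling a paginated Read() limited to a single item.
    func ReadFirstSecurityRule(db *gorm.DB, name string) (SecurityRuleModel, error) {
        var rule SecurityRuleModel
        err := db.Where("name = ?", name).First(&rule).Error
        return rule, err
    }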

In fairness, the rather lengthy prompt didn’t explicitly state “maintain the existing code style,” unlike previous attempts. I honestly forgot to include that initially, but it was rather illuminating to see the agent's “natural” style. It appears its default state aligns with common (not best) coding practices.

My nitpicking aside, it performed admirably until it hit the 25 tool call limit. Following this, I attempted to instruct it to continue its work and create the infrastructure implementations for the NGINX changes, as well as the front-end component. It rather quickly went off the rails, I'm afraid. I found myself having to do quite a bit of the conceptual heavy lifting for the agent.

Initially, the agent tried to create a service that generated the security configuration for NGINX and applied it to every mapping block (location, if, etc.). I directed it to create a separate .conf file for each security rule and simply use an include directive within the mapping block, rather than duplicating the configuration everywhere. It managed to accomplish some of this, albeit with a somewhat peculiar code style, but with a few more prompts, it would likely arrive at something acceptable.
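
For context, the approach I was steering it towards amounts to a small helper in the vhostsInfra package along these lines; the paths, names and exact NGINX parameters are illustrative, and the matching limit_req_zone/limit_conn_zone definitions would still live at the http level:

    package vhostsInfra

    import (
        "fmt"
        "os"
    )

    // writeSecurityRuleConfFile writes one .conf per security rule so that each
    // mapping's location block only needs a single directive, e.g.
    //   include /etc/nginx/security-rules/relaxed.conf;
    func writeSecurityRuleConfFile(ruleName string, burstRequests, maxConnsPerIp uint, blockedIps []string) error {
        conf := fmt.Sprintf("limit_req zone=%s_req burst=%d nodelay;\n", ruleName, burstRequests)
        conf += fmt.Sprintf("limit_conn %s_conn %d;\n", ruleName, maxConnsPerIp)
        for _, ip := range blockedIps {
            conf += fmt.Sprintf("deny %s;\n", ip)
        }

        confFilePath := "/etc/nginx/security-rules/" + ruleName + ".conf"
        return os.WriteFile(confFilePath, []byte(conf), 0644)
    }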

The front-end attempt was rather interesting, though not in a good way. We employ a different architecture than what these agents might typically encounter: a combination of Alpine.js, HTMX, and a-h/templ. The agent couldn't have been further off the mark.

Instead of using Echo, the HTTP microframework being used in the code, it decided to use the native http package and, in doing so, created an entirely new service in the infrastructure layer to facilitate some of the front-end requirements. It did generate some rather intriguing components, I'll give it that. However, it didn't even touch the existing Mapping page to implement the horizontal tabs or display the security rules records table. That's quite alright; we've gained a good understanding of Cursor's capabilities.

Copilot Agent Mode

Cursor had rather surprised me with a number of correct decisions, despite some of the more outlandish ones. Copilot, on the other hand, even with Sonnet 3.7 under the hood, wasn't exactly setting the world alight. Perhaps this task will provide a definitive answer.

The interface features a pause state. Once it detects that Copilot is engaged in some heavy computational lifting, it prompts you to consider refining the prompt. That’s a significant advantage for autonomous operation, as it allows for better context management without completely losing the thread.

This time around, Copilot generated unit tests for the value objects it created and adhered to the established code standards, something Cursor had not done. It still overlooked a few things, though, such as the use of value objects, pointers on update operations, and GORM's native JSON serializer.
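
The "pointers on update operations" point deserves a quick illustration: in this style, optional fields on an update DTO are pointers, so nil means "leave untouched" rather than "set to zero". The field names here are, again, my own:

    package dto

    // UpdateMappingSecurityRule: only non-nil fields are applied to the record.
    type UpdateMappingSecurityRule struct {
        ID                     uint64
        Name                   *string
        Description            *string
        SoftLimitRequestsPerIp *uint
        BlockedIps             *[]string
    }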

After I clicked "continue," it seemed to realise it had made some errors and proceeded to rectify them. I suspect the experience differs from Cursor's primarily due to the 25 tool call limit that Copilot doesn't impose. This limitation could likely be circumvented by breaking down the task into smaller, more manageable chunks. One might even achieve better results by allowing the agent to focus more intently on the finer details.

For the purposes of this article, I allowed Copilot to run unchecked and simply accepted all changes until it began to spiral out of control, performing nonsensical actions such as removing constructors and changing the serialization to comma-separated IPs.

This highlights that breaking down prompts isn't just a way to avoid limitations but is, in fact, a more appropriate method for utilizing these agents effectively. They require close supervision and should only be tasked with precise, well-defined steps.

Windsurf

Last but not least, we have Windsurf. Given our prior experience with Claude, even when accessed through Windsurf, I decided to give Google Gemini Pro 2.5 a whirl. I still had 48 out of 200 flow usage credits remaining, so why not?

My expectation was that the task would never be completed with just a few prompts and would require significant guidance, so perhaps Gemini would surprise us. Off it went, and off it stayed. Completely frozen.

I checked the usage, and indeed, we were charged, despite receiving no output whatsoever. Feeling rather adventurous, I opted to try DeepSeek R1 before defaulting back to Claude if it didn't work.

It did work, after a fashion. However, it created the entity as SecurityRule instead of the more appropriate MappingSecurityRule. Not a promising start. It did, however, begin to utilize value objects for everything. These value objects didn't actually exist, but at least it considered their use.

It attempted to create some of these value objects but was either halted by some Windsurf limit or simply declined to provide the code. Either way, that's not at all usable.

Since we were already nearing the end of our allocated credits, I chose to try Sonnet 3.7 again, but this time with the "Thinking" model. It ran smoothly and not as slowly as I had anticipated.

The entity was again created without the Mapping prefix, and value objects were used, except for the IP address one, which is peculiar as that already exists in the codebase. The Description value object (again lacking the prefix) inexplicably allows for line breaks and is set to a 500-character limit.

The "not-deep-thinking" model of Claude on Copilot and Cursor performed better in this case. Regarding the GORM part, where the others overlooked the native JSON serializer support, the thinking Claude implemented its own instead. As expected, it stopped before completion.

Just out of sheer curiosity, I ran the same prompt with Windsurf but using the regular Claude 3.7. The results were largely the same mistakes, including the missing prefix on the entity. The code quality, however, seemed to have decreased noticeably. It's almost as if the model behaves differently depending on the time of day.

Conclusion

This exploration into the capabilities of AI coding agents reveals a technology still very much in its nascent stages. The notion of these digital assistants stepping into the shoes of seasoned software engineers, let alone replacing them entirely, remains firmly in the realm of future possibility, and perhaps may never fully materialise.

For tasks involving smaller files and straightforward modifications, these agents undeniably offer a significant degree of value. However, much of this utility arguably overlaps with the capabilities already present within the inline code editor of the more "traditional" GitHub Copilot.

When confronted with intricate content, the complexities of multi-file editing, substantial business logic, and the nuances of established best practices, we find ourselves still traversing a considerable distance. The user interfaces and the companies behind these agents have several crucial aspects to address:

  1. Orchestrating the Desired Outcome: The user interface should evolve beyond simply presenting a text box. The primary focus ought to be on a thorough scanning of the codebase, facilitating a detailed dialogue with the user at each step, and offering a visual representation of the logical flow before any code alteration takes place. This holistic approach is what one would rightfully expect from a truly intelligent AI IDE.

  2. Maintaining Cognitive Continuity: UIs should adopt a "continue" functionality akin to Copilot, preventing the model from losing its train of thought amidst complex operations. This could be achieved through meticulous planning of the logical execution and diligently tracking what has been accomplished and the methodology employed.

  3. Post-hoc Quality Assurance: Upon completion of all designated tasks, the UI should proactively prompt the model to conduct a thorough self-review, comparing the generated code against existing style conventions, identifying potential security vulnerabilities, and ensuring adherence to best practices.

  4. The Economic Equation: The business model remains a pertinent question. While pricing might not be a primary concern for large enterprises, the potential costs could prove prohibitive for small and medium-sized businesses. Furthermore, the actual profitability of these tools remains an open question.

My experience with these AI IDE editors suggests that the lion's share of their success can be attributed to the underlying power of models like Claude Sonnet 3.7, rather than the inherent sophistication of the user interfaces themselves. This seems somewhat paradoxical, particularly given that some of these IDEs have been around long enough to have addressed these fundamental aspects.

With all due respect, many of them currently feel like just specialized skins built upon the foundation of VSCode, with a direct integration of Claude Sonnet. They have considerable ground to cover before establishing themselves as truly distinct and indispensable tools. I suspect that established players like VSCode and the foundational AI vendors may well catch up and integrate these functionalities before these newer IDEs can fully differentiate themselves.

While I remain certain that AI agents are an enduring part of the technological landscape, I also hold a parallel conviction that less experienced developers relying heavily on these tools without a strong understanding of software architecture and security principles are potentially creating a future landscape riddled with vulnerabilities and poorly designed systems.

The Matrix™ (© Warner Bros. Entertainment) on GIPHY

This isn't to cast blame; rather, it’s an observation that AI agents, while incredibly helpful, are neither a panacea nor a substitute for fundamental software engineering knowledge. They are powerful assistants, prone to occasional lapses in focus, much like a bright but easily distracted apprentice.

It is encouraging to note the discernible progress in reducing hallucinations compared to earlier models like GPT-3. We are clearly on an evolutionary trajectory, and I am genuinely excited to witness what the future holds in this rapidly developing field. Thank you for taking the time to read this far. Until our next digital encounter!

For those curious, the usage for each tool was as follows:

Cursor

Copilot Agent Mode

As of the time of writing, Copilot has not imposed limits, and the usage tabs do not display any information.

Windsurf
