
Thoughts on Devin after a month of using it to execute 20+ tasks

In March 2024, a new AI company entered the spotlight with impressive backing: a $21 million Series A led by Founders Fund and supported by industry leaders including the Collison brothers, Elad Gil, and other tech luminaries. The team behind it? Gold medalists from the International Olympiad in Informatics, the kind of people who can solve programming problems most of us can't even fathom. Their product, Devin, promises to be a fully autonomous software engineer that can chat with you like a human colleague and do everything from learning new technologies and debugging mature code bases to deploying full applications and even training AI models.

The early demos were compelling. They showed Devin independently completing a bounty task on Upwork, installing and running a PyTorch project without human intervention. The company claims that Devin resolved 13.86% of real-world GitHub issues end-to-end on the SWE-bench benchmark, roughly three times better than the previous best system. Access was initially limited to a small group of users, which fueled excited tweets about how it would revolutionize software development.


As part of the Answer.AI team, which frequently tries out AI development tools, we felt there was something different about Devin. If it delivered half of what it promised, it could change the way we work. But for all the enthusiasm on Twitter, we couldn't find many detailed accounts of people actually using it. So we decided to test it thoroughly against a variety of real-world tasks. Here's our story: a thorough, real-world attempt to work with one of the most hyped AI products of 2024.

 

What is Devin?

What makes Devin unique is its infrastructure. Unlike typical AI assistants, Devin is operated through Slack and launches its own compute environment. When you talk to Devin, you're talking to an AI with access to a full compute environment, complete with a web browser, code editor, and shell, that can install dependencies, read documentation, and even preview the web applications it creates. Here's a screenshot of one way to start Devin on a task:


One way to launch Devin's tasks via Slack

 

The experience is designed to make you feel like you're chatting with a coworker. You describe what you want, and Devin gets to work. Through Slack, you can watch it think through problems, request credentials when needed, and share links to completed work. Behind the scenes, it runs in a Docker container, which gives it the isolation it needs to experiment safely while protecting your system. Devin also provides a web interface that lets you access its environment and watch it work in real time, with an IDE, web browser, and more. Below is a screenshot of the web interface:

Devin's web interface

 

Early success

Our first task was simple but real: extract data from a Notion database into Google Sheets. Devin solved this problem with impressive competence. It navigated to the Notion API documentation, understood what it needed, and guided me through setting up the necessary credentials in the Google Cloud Console. Instead of just throwing API commands at me, it walked me through each menu and button click, saving time that would normally require a tedious documentation search. The whole process took about an hour (but only a few minutes of manual interaction). At the end, Devin shared a link to a perfectly formatted Google Sheet containing our data.
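To make the task concrete, here is a minimal sketch (not Devin's actual output) of the kind of glue script this job calls for, assuming the official Notion REST API via requests and the gspread library for Google Sheets. The token, database ID, sheet key, service-account file, and the "Name"/"Status" column names are all placeholder assumptions:

```python
# Minimal sketch of a Notion -> Google Sheets glue script (illustrative only).
# Uses the Notion REST API via `requests` and `gspread` for Sheets; every
# identifier below (token, database ID, sheet key, column names) is a placeholder.
import os
import requests
import gspread

NOTION_TOKEN = os.environ["NOTION_TOKEN"]
DATABASE_ID = os.environ["NOTION_DATABASE_ID"]
SHEET_KEY = os.environ["GOOGLE_SHEET_KEY"]

headers = {
    "Authorization": f"Bearer {NOTION_TOKEN}",
    "Notion-Version": "2022-06-28",
}

# Query all pages in the Notion database, following pagination cursors.
rows, payload = [], {}
while True:
    resp = requests.post(
        f"https://api.notion.com/v1/databases/{DATABASE_ID}/query",
        headers=headers, json=payload,
    )
    resp.raise_for_status()
    data = resp.json()
    for page in data["results"]:
        props = page["properties"]
        # Assumes the database has a "Name" title column and a "Status" select column.
        title = "".join(t["plain_text"] for t in props["Name"]["title"])
        status = (props["Status"]["select"] or {}).get("name", "")
        rows.append([title, status])
    if not data.get("has_more"):
        break
    payload["start_cursor"] = data["next_cursor"]

# Append the extracted rows to a Google Sheet using a service-account credential.
gc = gspread.service_account(filename="service_account.json")
ws = gc.open_by_key(SHEET_KEY).sheet1
ws.append_rows([["Name", "Status"]] + rows)
```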

The code Devin generated was a bit lengthy, but it worked. It felt like a glimpse of the future: an AI that could handle the "glue code" tasks that take up so much of a developer's time. Johno had similar success using Devin to create a planet tracker to debunk claims about the historical positions of Jupiter and Saturn. What was particularly impressive is that he did it entirely from his phone, with Devin handling all the heavy lifting of setting up the environment and writing the code.
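Johno's tracker itself isn't reproduced here, but as a rough illustration of the task, a minimal sketch using the skyfield library to compute where Jupiter and Saturn appeared on a historical date might look like the following. The date is arbitrary, and de422.bsp is a large, long-span JPL ephemeris chosen because the default modern ephemerides don't cover dates that far back:

```python
# Minimal sketch of a "where were Jupiter and Saturn?" check using skyfield.
# Illustrative only, not Johno's actual tracker; the date is arbitrary, and
# de422.bsp (a long-span JPL ephemeris, several hundred MB) covers ancient dates.
from skyfield.api import load

ts = load.timescale()
t = ts.utc(1000, 3, 21)          # an arbitrary historical date
eph = load("de422.bsp")          # long-span ephemeris, roughly 3000 BC to AD 3000

earth = eph["earth"]
for name in ["jupiter barycenter", "saturn barycenter"]:
    ra, dec, _ = earth.at(t).observe(eph[name]).apparent().radec()
    print(f"{name} on {t.utc_strftime('%Y-%m-%d')}: RA {ra}, Dec {dec}")
```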

 

Expanding Our Tests

Building on our early successes, we planned to lean into Devin's asynchronous capabilities. We imagined having Devin write documentation during meetings, or debug issues while we focused on design work. But as we scaled up our testing, cracks appeared. Seemingly simple tasks often took days instead of hours, and Devin would hit technical dead ends or produce solutions that were too complex to use.

Even more concerning was Devin's tendency to push ahead on tasks that were practically impossible. When asked to deploy multiple applications to a single Railway deployment (something Railway does not support), instead of recognizing this limitation, Devin spent more than a day trying various approaches and hallucinating features that didn't exist.

What was most frustrating was not the failures themselves (all tools have limitations) but the amount of time we spent trying to salvage these attempts.

 

Understanding what went wrong

At this point in our journey, we were confused. We had seen Devin competently integrate APIs and build functional applications, yet it struggled with seemingly simpler tasks. Was this just bad luck? Were we using it wrong?

Over the course of a month, we systematically documented our attempts in the following categories:

  1. Creating a new project from scratch
  2. Performing research tasks
  3. Analyzing and modifying existing projects

The results were discouraging. Out of 20 tasks, we had 14 failures, 3 successes (including our initial 2), and 3 inconclusive results. More tellingly, we were unable to discern any pattern that predicted which tasks would succeed; tasks similar to our early successes failed in unexpected ways. We provide more detail about these tasks in the appendix below. Here is a summary of our experience in each category:

1. Creating a new project from scratch

This category should be Devin's forte. After all, the company's demo video shows it autonomously completing a bounty task on Upwork, and our own early successes suggested it could handle brand-new development. But the reality proved more complicated.

Take, for example, our attempt to integrate with a large language model (LLM) observability platform called Braintrust. The task was clear: generate synthetic data and upload it. Instead of providing a focused solution, Devin generated what can only be described as a soup of code, with multiple layers of abstraction that unnecessarily complicated simple operations. We eventually abandoned Devin's attempt and built the integration step by step with Cursor, which proved far more efficient. Similarly, when asked to create an integration between our AI note-taker and Spiral.computer, Devin produced what one team member described as "spaghetti code that was harder to read than code I'd written from scratch". Despite having access to documentation for both systems, Devin seemed to over-complicate every aspect of the integration.

Perhaps most telling was our attempt at web crawling. We asked Devin to follow Google Scholar links and crawl an author's 25 most recent papers, a straightforward task for a tool like Playwright. Given Devin's ability to browse the web and write code, this should have been particularly easy. Instead, it got stuck in an endless loop trying to parse HTML and couldn't get out of its own way.
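For comparison, here is a minimal sketch of that crawl using Playwright's synchronous Python API. The profile URL and the CSS selector for paper links are assumptions about Google Scholar's markup and may need adjusting, and Scholar rate-limits scrapers in practice:

```python
# Minimal sketch of the crawling task using Playwright's sync API.
# The profile URL and the "a.gsc_a_at" selector are assumptions about
# Google Scholar's current markup; Scholar also rate-limits scrapers.
from playwright.sync_api import sync_playwright

AUTHOR_URL = "https://scholar.google.com/citations?user=EXAMPLE_ID&sortby=pubdate"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(AUTHOR_URL)

    # Collect links to the author's most recent papers (sorted by date via the URL above).
    links = page.locator("a.gsc_a_at")
    papers = []
    for i in range(min(25, links.count())):
        papers.append({
            "title": links.nth(i).inner_text(),
            "href": links.nth(i).get_attribute("href"),
        })

    browser.close()

for paper in papers:
    print(paper["title"], "->", paper["href"])
```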

2. Research tasks

If Devin struggled with concrete coding tasks, might it do better on research-oriented work? The results were mixed at best. While it could handle basic documentation lookups (as we saw in the early Notion/Google Sheets integration), more complex research tasks proved challenging.

When we asked Devin to research how to produce transcript summaries with precise timestamps (a specific technical challenge we faced), it merely repeated superficially relevant technical information rather than addressing the core problem. Instead of exploring potential solutions or identifying key technical challenges, it provided generic code examples that did not address the underlying problem. Even when Devin seemed to make progress, the results were often not what they appeared to be. For example, when asked to create a minimal DaisyUI theme as an example, it generated a solution that seemed to work. On closer inspection, however, we found that the theme didn't actually do anything: the colors we saw came from the default theme, not our customizations.

3. Analysis and modification of existing code

Perhaps Devin's most worrisome failures occurred when working with existing code bases. These tasks require understanding context and maintaining consistency with established patterns, skills that should be at the core of an AI software engineer's capabilities.

Our attempts to get Devin to handle nbdev projects were particularly enlightening. When asked to migrate a Python project to nbdev, Devin was unable to grasp the basic nbdev setup, even when we provided it with comprehensive documentation. Even more perplexing was the way it handled notebooks: instead of editing them directly, it created Python scripts to modify them, adding unnecessary complexity to a simple task. While it occasionally offered useful comments or ideas, the actual code it generated was consistently problematic.
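For readers unfamiliar with nbdev, the "basic setup" Devin kept missing is mostly a matter of notebook directives: library code lives in notebook cells marked for export, and the nbdev_export command generates the Python package from them. Here is a minimal sketch of what such cells look like; the module name and function are illustrative placeholders:

```python
# In nbdev, library code lives in notebook cells marked with directives.
# A first cell declares which module the notebook exports to:
#| default_exp core

# Cells marked with `#| export` are copied into the generated module
# (e.g. mylib/core.py after running `nbdev_export`); unmarked cells stay
# in the notebook as documentation, tests, or scratch work.
#| export
def track_time(task: str, minutes: int) -> dict:
    "Record a time-tracking entry (placeholder example function)."
    return {"task": task, "minutes": minutes}
```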

The security review revealed similar problems. When we asked Devin to assess a GitHub repository (less than 700 lines of code) for security vulnerabilities, it went overboard, flagging many false positives and hallucinating problems that didn't exist. This kind of analysis is probably better handled by a single targeted large language model (LLM) call than by Devin's more elaborate approach.
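As a point of comparison, here is a minimal sketch of what such a single targeted LLM call could look like, using the Anthropic Python SDK. The model name, file selection, and prompt are placeholders rather than the exact call we used:

```python
# Minimal sketch of a single targeted LLM call for a small-repo security review.
# Uses the Anthropic Python SDK; the model name and prompt are placeholders,
# and this is an illustration rather than the exact call we used.
from pathlib import Path
import anthropic

# Concatenate the repo's source files (fine for a codebase under ~700 lines).
repo = Path("path/to/repo")
code = "\n\n".join(
    f"# FILE: {p}\n{p.read_text()}" for p in sorted(repo.rglob("*.py"))
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # placeholder model name
    max_tokens=2000,
    messages=[{
        "role": "user",
        "content": (
            "Review the following code for security vulnerabilities. "
            "List only concrete, verifiable issues with file and line references; "
            "do not speculate.\n\n" + code
        ),
    }],
)
print(message.content[0].text)
```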

This pattern continued in debugging tasks. When investigating why SSH key forwarding didn't work in a setup script, Devin focused on the script itself and never considered that the problem might lie elsewhere. This narrow focus meant it didn't help us discover the actual root cause. Similarly, when asked to add a conflict check between user input and database values, one team member spent hours studying Devin's attempts before giving up and writing the feature himself in about 90 minutes.

 

Reflections as a team

After a month of intensive testing, our team got together to make sense of our experience. These quotes best express our feelings:

The tasks it can accomplish are so small and well-defined that I'd be faster doing them myself, while the larger, potentially time-saving tasks are the ones where I expect it to fail. So there's no real use case where I actually want to use it. - Johno Whitaker

I initially felt very excited because it got so close, and I felt like I could just tweak things. Then I slowly got frustrated as I had to change more and more, and eventually reached the point where I would have been better off starting from scratch and working incrementally. - Isaac Flath

Devin had difficulty using AnswerAI's internal tooling, which among other issues made it hard to work with. This remained a problem despite the extensive documentation and examples we provided. I've found this isn't an issue with tools like Cursor, where there are more opportunities to steer things in the right direction incrementally. - Hamel Husain

Overall, we found that more developer-driven processes (like Cursor) avoided most of the problems we encountered with Devin.

Reaching a verdict

Working with Devin showed what autonomous AI development aspires to be. The user experience is outstanding: chatting through Slack, watching it work asynchronously, seeing it set up environments and handle dependencies. When it works, it's impressive.

But therein lies the problem: it rarely works. Of the 20 tasks we attempted, we saw 14 failures, 3 inconclusive results, and just 3 successes. Even more worrisome was our inability to predict which tasks would succeed. Even tasks similar to our early successes failed in complex and time-consuming ways. What seemed like promising autonomy became a burden: Devin would spend days pursuing impossible solutions rather than recognizing the underlying obstacles.

This reflects a pattern we've observed repeatedly in AI tools. Social media excitement and company valuations have little to do with real-world utility. The most reliable signals we've found come from detailed stories of users delivering products and services. For now, we're sticking with tools that let us drive the development process while providing AI assistance.

 

Appendix: Tasks we attempted with Devin

The following tables list the projects we gave Devin, grouped into (1) creating new projects, (2) research, (3) analyzing existing code, and (4) modifying existing code.

1. Creation of new projects

| Project | Outcome | Notes |
| --- | --- | --- |
| Planet tracker | Success | I wanted to debunk some claims about the historical positions of Jupiter and Saturn. Devin did a great job. I drove the whole thing through Slack on my phone, and it got done. |
| Migrating data from Notion to Google Sheets | Success | I told Devin to programmatically extract the information from a Notion document into a Google Sheet. This was the first project I ran with Devin, and it went well. Devin read the Notion and Google API documentation itself, and it also guided me through the Google Cloud console, with instructions for every menu I needed to click through, which would otherwise have taken me a fair amount of time. In the end, I got a reasonable Python script that performed the task. It did exactly what I wanted, which was a new experience for me; at this point I was very excited about Devin. |
| Multi-application deployment on Railway | Inconclusive | I asked Devin to deploy multiple applications into a single Railway deployment so that different applications could share the same local database for testing. The task proved to be poorly defined: if I understood Railway correctly, it is practically impossible. Nevertheless, Devin went ahead and tried, hallucinating details about how to interact with Railway. |
| Generate synthetic data and upload it to Braintrust | Failure | I asked Devin to create synthetic data for Braintrust, an LLM observability platform I wanted to test. Devin created code that was overly complex and hard to understand, and it got bogged down trying to fix bugs. We ended up iterating through this with Cursor instead. |
| Create an integration between two applications | Failure | I asked Devin to create an integration between my AI note-taker, Circleback, and Spiral.computer, with pointers to the documentation for each. I got really bad spaghetti-style code that was harder to read than code I would have written from scratch, so I decided not to spend any more time on Devin for that task. |
| Web crawling papers by following Google Scholar links | Failure | I asked Devin to use Playwright to programmatically crawl an author's 25 most recent papers on Google Scholar, skipping any paper behind a paywall. Devin went down a rabbit hole of trying to parse HTML that it couldn't get out of. It got stuck and went dormant. |
| Create a minimal HTMX bulk-editing sample application | Failure | I asked Devin to read the bulk-editing example on the HTMX documentation page and, using it together with the pseudo-server code, create a minimal FastHTML version of the example for the FastHTML gallery. The example didn't work and wasn't minimal. Devin used attributes that don't exist on the request object and added a lot of unnecessary extras, such as toasts (which also didn't work) and inline CSS styles. |
| Create a DaisyUI theme that matches the FrankenUI theme | Failure | I asked Devin to create DaisyUI and highlight.js themes that match the FrankenUI themes so they could be used seamlessly in the same application. Devin mapped DaisyUI's pre-existing themes to FrankenUI themes, but in many cases they didn't match well. It also made tons of code changes I couldn't follow, and I ended up not using any of the code because I was too confused to know what to do with it. |

2. Research tasks

| Project | Outcome | Notes |
| --- | --- | --- |
| Researching how to build a Discord bot | Success | I asked Devin to research building a Discord bot in Python that summarizes the day's messages and sends them out by email, and told it to use Claudette wherever possible. Finally, I told it to write down its findings in a notebook with small code snippets I could use for testing. Devin generated research notes as markdown files as an intermediate step before creating the notebook, which I hadn't asked for, but it was useful to see a step-by-step plan. The code it provided in the notebook wasn't 100% correct, but as pseudo-code it helped me understand how to glue everything together. Since this was more of a research project and I just wanted the general idea, I consider it a success. (A sketch of this kind of bot appears after this table.) |
| Researching transcript summaries with precise timestamps | Failure | One of the problems I face when summarizing transcripts is that I want exact timestamps attached to the notes, so I can use them for YouTube chapter summaries and the like. Getting accurate timestamps from the transcripts wasn't the problem; the hard part was correlating the timestamps with the summaries, because the timestamps were often garbled. So this is essentially an AI engineering research task. Devin repeated things that were superficially relevant to my problem, but it didn't do a good job of researching or trying to solve the actual problem, and it gave me useless code and examples. |
| Creating a minimal DaisyUI theme as an example | Failure | I asked Devin to create a minimal DaisyUI theme as an example. My goal was to get a starting point, since asking it for something more complete had been unsuccessful. Devin ignored the request to make it a FastHTML application, and it took some back-and-forth to get it on that path. Eventually it created an application that appeared to work with different button types. While the link it shared looked good, once I tried to modify the theme it became clear that the theme didn't do anything: the other colors in the app came from the default theme. This was not a useful starting point. |
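Since the Discord-bot research above was one of the few successes, here is a minimal sketch of the kind of bot that notebook was aiming at, using the discord.py library. The channel ID is a placeholder, and summarize() and send_email() stand in for the LLM step (e.g. via Claudette) and the email step, which aren't shown:

```python
# Minimal sketch of the daily-summary Discord bot the research notebook aimed at.
# Uses discord.py; CHANNEL_ID is a placeholder, and summarize()/send_email()
# stand in for the LLM (e.g. Claudette) and email steps, which are not shown here.
import datetime as dt
import os
import discord

CHANNEL_ID = 123456789012345678  # placeholder channel ID

intents = discord.Intents.default()
intents.message_content = True   # required to read message text
client = discord.Client(intents=intents)

def summarize(text: str) -> str:
    "Placeholder for an LLM summarization call (e.g. via Claudette)."
    return text[:500]

def send_email(body: str) -> None:
    "Placeholder for email delivery (e.g. smtplib or an email API)."
    print(body)

@client.event
async def on_ready():
    # Collect the last 24 hours of messages, summarize them, and send the email.
    channel = client.get_channel(CHANNEL_ID)
    since = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=1)
    lines = []
    async for msg in channel.history(after=since, limit=None):
        lines.append(f"{msg.author.display_name}: {msg.content}")
    send_email(summarize("\n".join(lines)))
    await client.close()

client.run(os.environ["DISCORD_BOT_TOKEN"])
```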

3. Analysis of existing code

| Project | Outcome | Notes |
| --- | --- | --- |
| Perform a security review of a code base | Inconclusive | I pointed Devin at a GitHub repository (less than 700 lines of code) and told it to evaluate it for security vulnerabilities, document its comments in a markdown file, and provide sample code where necessary. Devin did identify some security holes, but it was overzealous and hallucinated problems that didn't exist. This is perhaps not the ideal task for Devin, since a single LLM call works just as well. |
| Review a blog post and suggest improvements via pull request | Failure | I asked Devin to review a blog post and open a pull request with suggested changes. Ultimately, Devin failed because it couldn't figure out how Quarto, the static site generator I use, works. I think this task would succeed in a tool like Cursor. Devin didn't seem to learn from the project structure and existing documentation very well, so it mangled things like the preamble and other conventions needed to edit a blog post properly. |
| Review an application and identify potential improvements | Failure | I asked Devin to look at the time-tracking app mentioned elsewhere in this appendix and gave it the open-ended task of suggesting any improvements. The advice it offered didn't make sense. |
| Debugging why SSH key forwarding doesn't work in a setup script | Inconclusive | I asked Devin to find out why SSH key forwarding didn't work when I set it up with a script on the server. The problem ultimately had nothing to do with the script, which I had originally assumed was at fault, but Devin never hinted that the problem might lie elsewhere. That wasn't helpful, because it didn't lead me to the root cause. |

4. Modification of existing projects

| Project | Outcome | Notes |
| --- | --- | --- |
| Making changes to an nbdev project | Failure | I have a simple time-tracking application built with FastHTML and nbdev that I wanted to integrate with Apple Shortcuts via an API route. Despite impressive-looking progress, Devin was unable to work successfully in this environment. One oddity I noticed was that Devin created Python scripts to edit the notebooks rather than editing the notebooks themselves. Devin did give me some useful comments and ideas I hadn't considered, but the code it attempted to write didn't make sense. In the end, I used someone else's template instead of any of Devin's suggestions. |
| Migrating a Python project to nbdev | Failure | I asked Devin to migrate a project to nbdev [prompt details omitted for brevity]. It got bogged down and couldn't figure out the basic nbdev setup. It seems it didn't read the nbdev documentation very well. |
| Integrating a style pack into FastHTML | Failure | I asked Devin to integrate MonsterUI into one of my applications. Devin couldn't figure out how to handle the nbdev repository. |
| Add a check for conflicts between user input and the database | Failure | I asked Devin to add a feature that compares user-input values with values from a previously run database query and surfaces it in the UI when they don't match. I spent hours slowly trying to get its attempts to work properly before giving up. I then wrote the feature myself in about 90 minutes. |
| Generate LLM context files from the content of each FastHTML gallery example | Failure | I asked Devin to create LLM context files for the FastHTML gallery. I was pleased to see that it created a separate markdown file for each example and then tried to combine them into the context file; I hadn't thought of doing that, and at first everything seemed to be there. But as I downloaded the output and started digging in, I found problems: the context files were not formatted correctly, and even when I told it to use XML tags to separate the examples, it didn't use them. It added and pinned a specific version of the markdown package as a dependency and used it instead of the markdown2 package that was already in use and already a dependency. It also did a bunch of pytest work and added pytest as a dependency, even though the project doesn't use pytest. |