
CS 194/294-196 (LLM Agents) - Lecture 6, Graham Neubig

Berkeley RDI Center on Decentralization & AI
Hosts: Graham Neubig
📅October 15, 2024
⏱️01:00:41
🌐English


Watch the original video here: https://www.youtube.com/watch?v=f9L9Fkq-8K4

00:00:05Speaker

Hi everyone. I'm happy to be here; thank you for the invitation. I'd like to talk about agents for software development. To give my profile very quickly and ground the discussion: I'm a professor at CMU, where I've been for about eight years working on various research topics, including the ones that Xinyun mentioned. I'm also chief scientist at All Hands AI, a company that started very recently building open-source coding agents, and the maintainer of a software library called OpenHands, which used to be called OpenDevin, so a lot of my discussion and examples will be based on this. I'm also a software developer; I like my GitHub to be green, basically. So today I'm very happy to talk about my two passions, research for artificial intelligence and software development, and I get to talk about both of them in one talk.

00:01:13Speaker

So to give some grounding for this discussion, I'd first like to talk about the importance of agents for software development in particular. This is from a very famous essay by Marc Andreessen called "Why Software Is Eating the World." It was written in 2011, and it basically says: "More and more major businesses and industries are being run on software and delivered as online services, from movies to agriculture to national defense. Over the next ten years, I expect many more industries to be disrupted by software." So this was written in 2011, and I think it mostly came true. I think it would be very weird for us to think about movies not being delivered through software.

00:01:54Speaker

Like, how many people have gone and rented a physical movie in the past, like, five years? Nobody, right? So, uh, you can see like more and more of our life is, uh, you know, happening through software. Um, so one of my motivating things is if we gave everybody the ability to quickly write software to achieve their goals, uh, what would people be able to do? And I think the answer, especially for people here in, you know, computer science, the answer is a lot.

00:02:31Speaker

This is also timely, because if you look at the Nobel Prize in Physics and the Nobel Prize in Chemistry this year, both of these prizes were awarded to people who wrote software. If we think about it, that has pretty big implications: it used to be that these were given to people who were using test tubes and running cyclotrons and things like that, and those are still important, of course. But now a lot more of our science and human progress is going to happen in software.

00:03:06Speaker

I'm not just going to be talking about models for coding today, although that's mostly what I'll be talking about. But do people know how much time the average software engineer spends coding, in terms of producing new code, every day? It's about 15%, and this is from a study from Microsoft in 2019. Maybe it's more or less nowadays, but it's not just coding that people are doing. They also do bug fixing, testing, documentation, and reviews. 36% of their time is spent on communication of some variety, and 17% of the time is spent on other things, like going to the bathroom.

00:03:56Speaker

So another thing I'd like to point out is: how can we support developers? I came up with a kind of imprecise but maybe helpful categorization of automation, with respect to things like self-driving, where we have this well-known classification from no automation to full automation. In self-driving, we have manual driving at no automation, then things like adaptive cruise control and braking; Tesla's Autopilot, which at the time of writing was partial automation; conditional automation, meaning automation in certain situations; and then high automation, which would be something like Cruise or Waymo self-driving vehicles.

00:04:39Speaker

In terms of software development, I think there are also different levels of automation we could think about, where zero would be just writing all your code by hand. One is smart autocomplete, things like Copilot or Cursor. Number two would be something like Copilot Chat or Cursor Chat, where you refactor larger amounts of code.

00:05:13Speaker

Conditional automation would be automating particular tasks that people want to do, and then high automation would be more autonomous things that solve full software development tasks, and this time I want to talk mostly about the last one. But if people are familiar with GitHub Copilot or Cursor, this is something that a lot of people use. It's something I use every day, and basically what it's doing is completing your next thought. As a programmer, you write a few lines or something like that, and it goes in and fills in the next line for you.

00:05:38Speaker

Very, very useful. Um, I tried coding on the plane on the way over here, and I had no internet, and it was miserable, so I stopped because I, you know, can't code without autocomplete anymore. But at the same time, it works synchronously with developers.

00:05:56Speaker

The thing I'm going to talk about more this time is something like a more autonomous agent. This is an example of our software, OpenHands, and the way it works is you fill in a larger command describing what you want to do. This one is resolving a GitHub issue: basically, I told it that it needs to go into the LLM code and essentially fix a bug. So it goes and downloads the GitHub repo and makes additional tests.

00:06:31Speaker

Oh, sorry, this is not fixing a bug; this is adding tests to a file. So it sets up the repo and creates some tests; it created them for the init function of the LLM class. Now it's waiting for a little bit. So, it created the tests.

00:06:50Speaker

Now what it's doing is installing the dependencies for the project. It is running the tests according to the directions that I gave it. It ran the tests, and the tests did not... oh, sorry, it got an error in Poetry: it couldn't find the pyproject.toml file. It ran the tests again, and the tests work.

00:07:14Speaker

So now it's checking out a branch. It's committing the change, "add unit tests." It's pushing using the credentials that I provided it, and now, starting here, this is under human control. I click on the link that it gave me, and I open up the page. I can review the code, make sure it's okay, and incorporate it if it makes sense. So I think you can see the difference between completion, just completing individual pieces of code, and actually solving tasks end to end.

00:07:59Speaker

Another thing that is really popular nowadays is autonomous issue resolution. This is another example of autonomous issue resolution within GitHub, where you basically label an issue with the words "fix me." The agent goes in and does its work, and if the agent decided that it did a good job, it will send in a pull request, so you can go in and accept it or not accept it. So these are some examples.

00:08:35Speaker

So there's a question of how promising this is. This is actually old data, but it's still relatively compelling: GitHub did a study of Copilot with a sample size of 95 developers, where some developers used GitHub Copilot and some did not. 78% of the developers using Copilot finished the task, 8% more than those who didn't, and they finished it 56% faster. So basically, this is doubling the speed of coding for people using GitHub Copilot versus not using Copilot. And now we have even better code completion.

00:09:08Speaker

In my opinion, Cursor's code completion is better than GitHub's, so it might be even faster. And that means you cut into a very big chunk of the 15% of time developers spend on coding, which is great. Let's say we could make developers 7% more productive and make them enjoy coding more; I think that's wonderful, but that doesn't cover all of software development, as I mentioned before.

00:09:44Speaker

So I'd like to spend the rest of the time talking about challenges in coding agents: how we build them and how we evaluate them. Some challenges that I see in this area are, number one, **defining an environment** that agents can work in and be tested in. Next, **designing a space of observations and actions**: this is an agents class, so I think you're very familiar with the fact that one of the things that defines agents is how they can interact with the environment, so how do we define that for coding and software development tasks? More specifically, how do we **generate code**?

00:10:17Speaker

What are the language models we use for code? How do we build them? Then **file localization**: how do we identify which parts of a code base we want to edit? This has parallels to reinforcement learning, because in reinforcement learning we have the idea of exploring our environment, and I think file localization is a lot like that.

00:10:44Speaker

Another thing is **planning and error recovery**, so how do we make plans about what we want to do and how do we recover from errors? And finally, **safety**, so if we're actually interacting with environments through software, how do we ensure that this, uh, happens in a safe way? Okay, so first I'd like to talk about software development environments, and there's a number of types of environments. If we think about the actual environments that, uh, software developers act in, they can be things like manipulating source repositories like, uh, GitHub, GitLab, other things like this.

00:11:22Speaker

They can also be interacting with task management software, so this could be things like Jira or Linear or GitHub issues or other things like this. So we would want agents to be able to do that as well. Software developers also use office software. They use like Google Docs or Microsoft Office to, you know, exchange information about requirements and other things like this.

00:11:35Speaker

So how could they, uh, how could they interact with this software? Um, communication tools like Gmail and Slack, so I think if I enumerate these four tools that covers, you know, most of the environments that I interact with in some way when I'm developing software. There might be other ones like interacting with like a server, uh, logging into servers, manipulating cloud infrastructure, other stuff like that, but, um, these are kind of main ones. Um, in contrast, if we look at the environments that are available now for testing, a lot of these are focused mostly on code, on generating code.

00:12:11Speaker

So I'm going to mostly be talking about these. I'd like to note that developers do more, like browsing the web, but I'm not going to talk about that because I know next session you're having somebody talk about web browsing agents; that's super important for development, but I won't focus on it here. So the first thing is **simple coding**, and what this is, basically, is just testing the ability of language models to go from a specification to code. The most common examples are things like HumanEval and MBPP, and you can look at the problem they have here.

00:13:01Speaker

"Given a non-list of integers, return the sum of all of the odd elements that are in even positions." So I think these are kind of good, uh, in some way because, you know, you should be able to solve these if you're a good language model. Um, but this is kind of like asking LeetCode questions on a coding interview, right? You know, um, I'm sure everybody, if you've done a coding interview, you've done LeetCode questions, and you've been thinking, "Why am I doing this?"

00:13:27Speaker

Like, I never do this in my everyday coding experience. It's testing algorithmic knowledge, but not necessarily knowledge of software development or software engineering. These benchmarks do include examples of usage of the Python standard library, docstrings, and other things like this. So this is probably a necessary condition for being a good coding model, but not a sufficient one.
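For concreteness, this is roughly what a solution to the quoted problem looks like; a minimal sketch, not the benchmark's actual reference solution or test harness.

```python
def solution(lst):
    """Given a non-empty list of integers, return the sum of all of the
    odd elements that are in even positions."""
    # Walk over even indices (0, 2, 4, ...) and keep only the odd values.
    return sum(x for i, x in enumerate(lst) if i % 2 == 0 and x % 2 == 1)

# A benchmark like HumanEval checks candidate code against hidden unit tests.
assert solution([5, 8, 7, 1]) == 12
assert solution([3, 3, 3, 3, 3]) == 9
```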

00:13:54Speaker

There's also some work on building coding benchmarks for broader domains, and both of these are papers that I did at some point. The first one is **CoNaLa**, and this is broader data that was scraped from Stack Overflow: basically, we got questions from the Stack Overflow titles and the code snippet in context that answered, that basically solved, the question. Originally we just measured this using the overlap between the generated code and the reference, but we also recently added execution-based evaluation, and this covers a relatively wide variety of libraries that people use in actual data science and other work. So you can see it covers Pandas, NumPy, regular expressions, OS, Collections, and other things like this, in contrast to things like APPS or HumanEval that basically use only very standard libraries.

00:15:03Speaker

So this moves a step beyond the existing ones and says, "Okay, can you interact with the whole Python ecosystem?" Another thing is **data science notebooks**. This is a great paper by people at Google, and basically they use data science notebooks to allow for incremental implementation and evaluation of code in context. You have a notebook, you have people going through the notebook, and then you need to generate the next cells based on the context you have already. So this moves another step, adding context and adding practical, long implementations.

00:15:41Speaker

Then another data set that's relatively recent compared to these other ones is called **SWE-bench**. It's a very popular data set nowadays, so I think a lot of people have heard of it. The way it works is they scraped issues from GitHub along with the code bases, and the output of the model is a pull request. So it's very similar to the situation I talked about before, where we have agents actually go in and send a pull request to solve your GitHub issues. Here's an example: "data leak in GBDT due to warm start (this is about the non-histogram-based version of GBDT)."

00:16:36Speaker

You get a code base, and then you want the language model to go in and modify a few files, and then your evaluation is measured based on whether you can pass tests that were introduced together with the pull request. So this is a really, really nice benchmark in some ways. Um, it requires long context understanding. It requires you to understand a whole code base.

00:16:51Speaker

It requires you to implement things in a very precise way so that they can pass tests. There are some problems with it, though. One problem is that it's limited only to high-quality repos, and to PRs where they introduced new tests. And so these are very heavily biased towards, for example, bug-fixing PRs.

00:17:19Speaker

They're very heavily biased away from documentation or refactoring changes and other things like this. So it's a great data set, and people should definitely use it, but it does have its limitations in covering software development tasks. Another problem is that a lot of the data has leaked into language models, because the language models are trained on data from GitHub, so there are some concerns about leakage there too. Given these environments, if we look at how we measure success, one of the most common metrics is something called **Pass@K**.

00:18:03Speaker

Basically, if we generate K examples, will at least one of them pass unit tests? This is a small detail, but if you generate only K, you get high variance: if you generate only one, you might get it right 70% of the time and wrong 30% of the time, or something like that. So what they actually do is generate N samples, with N larger than K, and then use an unbiased estimator to figure out what the pass rate would be at K. So this is, I think, by far the most common metric.
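Concretely, here is a minimal sketch of the unbiased pass@k estimator popularized with HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly chosen samples would pass.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n generated samples, c of which pass.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 37 of them pass the unit tests.
print(pass_at_k(n=200, c=37, k=1))   # ~0.185
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```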

00:18:32Speaker

The only disadvantage is that it requires unit tests, so you're limited to doing things where you have unit tests. There are also other methods based on **lexical or semantic overlap**, which means you look at how well the generated code overlaps with gold-standard code created by programmers, and there are a bunch of different ways to do this. There's a method called **BLEU**, which has been used for a long time in machine translation, which basically looks at the n-gram overlap between the generated code and the reference code. And then you also have embedding-based methods like CodeBERTScore.
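To make the lexical-overlap idea concrete, here is a minimal sketch of clipped n-gram precision between generated and reference code; full BLEU additionally combines several n-gram orders with a brevity penalty, and CodeBERTScore replaces n-grams with contextual embeddings.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference
    (clipped counts, as in BLEU's modified precision)."""
    def ngrams(code: str):
        toks = code.split()  # a real implementation would use a code tokenizer
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

print(ngram_precision("return sum ( x for x in xs )",
                      "return sum ( v for v in xs )"))  # 0.5
```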

00:19:10Speaker

But these are details. As I mentioned with SWE-bench, **leakage of data sets** is a big problem for things that were created from public code. Nobody had really convincingly evaluated this until very recently; actually, there's a very recent paper called SWE-bench+, which evaluated this on SWE-bench, but I only learned about it after I created my slides, so it's not on my slides.

00:20:00Speaker

But another compelling example of this is **ARCADE**, the data science notebook data set I talked about before. They demonstrated that when they evaluated on existing data sets from the internet, they got scores in the 30s or 40s, but when they created new data sets, basically by getting people at Google to create data science notebooks, the accuracy dropped by something like 20 to 40 percent. So leakage is definitely a big issue there. Another example is **LiveCodeBench**, which shows that some code LLMs overperform on particular data sets that are very widely benchmarked on.

00:20:56Speaker

HumanEval is a very widely used data set, and if you look at the scores on HumanEval on the Y axis and compare them to LiveCodeBench on the X axis, you can see a cluster of models where the scores on HumanEval and the scores on LiveCodeBench correlate. Then you can see another cluster where the scores on HumanEval are very high but the scores on LiveCodeBench are not, so these are models that have in some way overfit to the data set and managed to get a very high score on it. A final exciting direction that people are working on recently is **multimodal coding models**. There's a data set by people at Stanford called **Design2Code**, where they do code generation from websites: they have a bunch of websites with various designs, like the ones down here, and based on that you need to generate the output code.

00:21:50Speaker

And also very recently, so recently that I haven't been able to add it to my slides, there's a multimodal version of SWE-bench that is similar in that way. One interesting thing they do in Design2Code is measure the **visual similarity** of the websites. They generate the websites, take screenshots, and then do a visual similarity comparison both at a high level, according to embeddings over the whole image, and at a low level, comparing each element's visual similarity. So this is kind of a whirlwind tour of how people are evaluating models and what environments people are testing these sorts of coding agents in.

00:22:34Speaker

So now we'll move on to the actual modeling part, and I'd like to start by talking about **observation and action spaces**. If we think about what coding agents must do, they must do things like understand repository structure, read existing code, modify or produce code, and run code and debug. And this is separate from all of the other stuff that software engineers do, which might require web browsing or other things like this. There are a bunch of coding agents.

00:23:12Speaker

I'd like to give just a few examples of the observations and action spaces that are used in some of the popular ones. One that's used pretty widely is one I mentioned here called **CodeAct**. It's a method that essentially interacts with environments by writing code. This is also the one we use in OpenHands, our coding platform, by the way.

00:23:43Speaker

The way it works is: traditional agents, which I think you have probably talked about multiple times in this class already, do **tool use**. Typically, when you use a tool, the idea is that at every time step you call a tool, you get the result back, and then based on that you make the next action. That's the traditional paradigm. So if we look at an example, sorry, this is very small on the screen, so I'll read it aloud: "Determine the most cost-effective country to purchase the smartphone model CodeAct 1. The countries to consider are the US, Japan, Germany, and India."

00:24:23Speaker

And so, given these available APIs, what a traditional model would do is something like: look up rates for Germany, look up the phone price in Germany, convert and add tax, look up rates for Japan, and so on, stepping over and over again to do all of these API calls; you might need 15 API calls before you can get the final answer. The idea that CodeAct proposed is basically moving beyond individual step-by-step tool use and actually allowing the models to act by writing programs. If you do this, you can essentially write a for loop over all of the countries that you're supposed to be examining, run the for loop, and get the answer.
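To illustrate the contrast, here is a rough sketch of what a CodeAct-style action could look like for that example; the tool names and numbers are made-up stand-ins for whatever APIs the environment actually exposes, stubbed out so the snippet runs.

```python
# Hypothetical tool stubs standing in for the environment's real APIs;
# the numbers are invented purely so the snippet runs.
RATES_TO_USD = {"USA": 1.0, "Japan": 0.0067, "Germany": 1.08, "India": 0.012}
LOCAL_PRICE = {"USA": 999, "Japan": 149800, "Germany": 929, "India": 82000}
SALES_TAX = {"USA": 0.08, "Japan": 0.10, "Germany": 0.19, "India": 0.18}

def lookup_rates(country):       return RATES_TO_USD[country]
def lookup_phone_price(country): return LOCAL_PRICE[country]
def lookup_tax(country):         return SALES_TAX[country]

# Instead of one tool call per reasoning step, a CodeAct-style agent emits a
# single program that loops over the countries and computes the answer.
prices_in_usd = {}
for country in ["USA", "Japan", "Germany", "India"]:
    with_tax = lookup_phone_price(country) * (1 + lookup_tax(country))
    prices_in_usd[country] = with_tax * lookup_rates(country)

best = min(prices_in_usd, key=prices_in_usd.get)
print(f"Cheapest country: {best} (~${prices_in_usd[best]:.2f})")
```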

00:25:13Speaker

And this doesn't mean you can't get intermediate feedback, of course. For example, let's say this for loop didn't work because instead of USA you needed to write U.S.A.; you would get an error when you tried to run the code, and you could go back, modify the code, and run it again. So you can get multi-step feedback, but at a much less granular level, so you can do more things in fewer steps.

00:25:55Speaker

In the original formulation of this, it executed Bash commands and Jupyter commands, and they found that this gave faster resolution and a higher success rate than direct tool use. So this kind of sets the stage for coding agents: it allows you to run Bash, and it allows you to generate code. But one other important thing we need to do is not just generate code but actually go in and modify code, because in most coding tasks, in any serious code repository, you spend much more time modifying code than generating code de novo.

00:26:33Speaker

**SWE-agent**, which was the follow-up modeling work after SWE-bench, basically did this by defining specialized tools that make it possible to efficiently explore repositories and edit code. So what it does is: you have the LLM agent, it is given a set of LLM-friendly commands, those commands interact with the computer, terminal, and file system, and you get LLM-friendly environmental feedback.

00:26:57Speaker

If you want to look at all the details, you can go through the paper, but here's an example. You get an observation from one of the commands, like show file. Let's say you called show file with 405, for line 405. If you do this, basically you have a program that will go in and parse the file.

00:27:38Speaker

It will get lines from 401 to 409, because you get a window around the line that you selected within the file, and this will be returned back to the LLM. In reality, this isn't four lines above and four lines below; it's more like 50 lines above and 50 lines below, or 100 above and 100 below. But what this allows you to do is, even if you have files that are thousands of lines long in your code repository, you can still parse them at a reasonable size in the LLM context window without expending your entire context length, without spending too much money, and other things like this.

00:28:14Speaker

Then the basic response after doing this is you get a thought, so it's like chain-of-thought reasoning: you get a thought about what you need to do next, and then you need to go in and edit the files. So you now call an edit command where you edit lines 404 to 407, and then you have an end-of-edit marker here. This will edit the lines and insert this information here, and there are actually a bunch of different ways to do this.
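To make the shape of these LLM-friendly commands concrete, here is a minimal sketch of a windowed file viewer and a line-range editor in the spirit of SWE-agent's tools; the real command names, arguments, and window sizes differ.

```python
def show_file(path: str, line: int, window: int = 50) -> str:
    """Return a numbered window of lines centred on `line`, so even a
    several-thousand-line file fits comfortably in the LLM's context."""
    lines = open(path).read().splitlines()
    lo = max(0, line - 1 - window)
    hi = min(len(lines), line + window)
    return "\n".join(f"{i + 1}: {text}"
                     for i, text in enumerate(lines[lo:hi], start=lo))

def edit_lines(path: str, start: int, end: int, new_text: str) -> None:
    """Replace lines start..end (1-indexed, inclusive) with `new_text`,
    which is roughly what an 'edit 404:407 ... end_of_edit' command does."""
    lines = open(path).read().splitlines()
    lines[start - 1:end] = new_text.splitlines()
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```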

00:28:56Speaker

The SWE-agent way is just one way, and there are other ways as well, which I'll talk about in a second. Within OpenHands, we do something very similar. It's kind of a combination of CodeAct and the SWE-agent-style edit and observation actions. So we execute all of the actions by calling code.

00:29:35Speaker

So you can do something like: let's say you run an observation action, or you run a find command, and you find that there are 10 occurrences of something that you want to replace in various places in your code base. It can then execute 10 of those edit actions and go in and replace all of the lines in the files. This essentially gives you a programmatic interface: you call it by calling Python programs, which all LLMs are very familiar with, and it allows you to do things like write a for loop over file editing commands and other stuff like this.
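A rough sketch of what that programmatic interface buys you: because actions are just Python, the model can write a loop that applies the same edit across every file a search turns up. The helper names here are illustrative, not the actual OpenHands API.

```python
import pathlib

def find_occurrences(repo: str, needle: str):
    """Yield every Python file under `repo` that contains `needle`."""
    for path in pathlib.Path(repo).rglob("*.py"):
        if needle in path.read_text(errors="ignore"):
            yield path

def replace_everywhere(repo: str, old: str, new: str) -> int:
    """Apply the same small edit across all matching files in one action."""
    changed = 0
    for path in find_occurrences(repo, old):
        path.write_text(path.read_text(errors="ignore").replace(old, new))
        changed += 1
    return changed

# e.g. rename a config key across the whole code base in a single agent step:
# print(replace_everywhere(".", "max_iterations", "max_steps"))
```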

00:30:16Speaker

So that's the general way we do this: you run this, and then you get an observation that is the result of executing the commands and other things like this. Okay, so I've talked a little bit about the observation and action space. One really, really important thing in order to make these models work well is to have good code LLMs, because all of the success of these models depends on the ability to write code. So I'd like to talk a little bit about the methods that people use for creating code LLMs.

00:30:46Speaker

So, um, yeah, the basic method is you feed in the instructions or input code into the LLM. Um, recently, virtually all serious LLMs are trained on code nowadays, so it doesn't matter if you're talking about like GPT or Claude or Gemini or LLaMA or, um, whatever else. Um, but some are more specialized, and unfortunately, for closed models and even for some open models, there's not a whole lot of information about, you know, what goes into training these models. Um, but, uh, I can tell you about some things that they definitely are doing.

00:31:23Speaker

So the first thing that everybody is doing is **training on lots and lots of code data**. Lots of code is added to the data mixtures, both to improve coding ability and because some prior research has demonstrated that adding code improves the reasoning ability of models in general. Just to give one example of a coding data set that we know is open and has actually been used in training open models, this is The Stack v2, a code pre-training data set.

00:32:03Speaker

Um, they also looked carefully into **licensing considerations**, um, and the reason why is because like a lot of code is licensed, and if people don't want you to be using the code, uh, in training language models, it could theoretically cause legal problems later. So, um, they basically went through and, and did the license cleansing and other stuff like this. Um, so they checked if the licenses are available and other things. Um, and again, this is really tiny over here.

00:32:24Speaker

I'm sure you probably can't see this on the right side of the page well, but one interesting thing to look at is the **distribution of the programming languages** that are included in data sets like this. If you look at the languages that are most well represented, they are Python, PHP, Markdown, JavaScript, Java, C++, C, and C#, which is great; these are popular programming languages that a lot of people want to work on.

00:33:12Speaker

There are a lot of less well-represented languages that are really important, like Dockerfiles; a lot of deployment work runs on Dockerfiles. I don't even see Terraform represented here, and Terraform is a really important thing for deploying cloud infrastructure and stuff like this. So there's a very large difference between the availability of data for different programming languages. Another one is COBOL.

00:33:38Speaker

Nobody programs in COBOL anymore, but a lot of banks have legacy systems written in COBOL that they would love to have translated to new programming languages and such. So that's a very low-resource language. Another method that's kind of specific to code language models, and really, really important for code but maybe a little less important elsewhere, is **infilling**. In code generation, because we're not generating things de novo, we're doing lots of editing, we often want to fill in code.

00:34:22Speaker

The way this is done is you take original documents, like the documents here, take a coherent section of a document and mask out that section, then move that masked span to the end of the document after a sentinel (mask zero) and generate the masked data there. So this is a pre-training method that people use to train models to be good at infilling data. This is super important for code; every model that seriously wants to handle code does this nowadays.
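A minimal sketch of how such an infilling (fill-in-the-middle) training example can be constructed; the sentinel strings are placeholders, since each model family defines its own special tokens.

```python
import random

PREFIX, SUFFIX, MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(document: str, rng: random.Random) -> str:
    """Mask a contiguous span and move it to the end, so the model learns to
    generate the missing middle given the code before and after it."""
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}{middle}"

rng = random.Random(0)
print(make_fim_example("def add(a, b):\n    return a + b\n", rng))
```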

00:34:40Speaker

Another thing that's done in every code model, and now in almost every model in general, is some variety of **long context extension**. Language models typically train on relatively short contexts, for efficiency reasons.

00:35:05Speaker

They'll train on something like 4,000-token chunks when they actually do the pre-training of the model. But when you try to take the model and generalize beyond the context it was trained on, it does a really poor job at generalizing. Part of the reason why is that in Transformer models we have **positional encodings**. I think many people have heard of these, but just in case you haven't, they are things added to each token based on the position of the token in the sequence.

00:35:46Speaker

The most common variety of this is something called **RoPE**, and the way RoPE works is essentially you have theta values, where the rotation applied to each token depends on its position in the sequence. So if you were training on a 4096-token context length, you would have half of that, so 2048 different theta values. But if you expanded beyond that, you would get a whole bunch of theta values that you've never seen before. So there are methods that fix this, and one way they do it is by multiplying by a constant scaling factor.

00:36:26Speaker

What this does is make sure that all of the theta values get scaled into a range that the model saw at pre-training time. Another method is the **neural tangent kernel (NTK) based method**. These are cosine and sine values, so you have some high-frequency components and some low-frequency components; NTK-based methods scale the low-frequency components but maintain the high-frequency components, which makes attention over long distances work well while keeping the local context. There are a lot of examples like this, and there's a paper on long context extension that goes into much more detail; you can take a look.
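A rough numerical sketch of the idea, simplified from what actually happens inside the attention layers: RoPE assigns each position a set of rotation angles, position interpolation squeezes new positions back into the trained range, and NTK-style scaling instead stretches the frequency base so low-frequency components are scaled while high-frequency ones are roughly preserved.

```python
import numpy as np

head_dim, base = 64, 10000.0

def rope_angles(position, base=base, d=head_dim):
    """RoPE: theta_i = base**(-2i/d); the rotation angle at position p is p * theta_i."""
    i = np.arange(d // 2)
    return position * base ** (-2 * i / d)

train_len, target_len = 4096, 16384
scale = target_len / train_len                       # 4x context extension

original     = rope_angles(8192)                     # a position never seen in training
interpolated = rope_angles(8192 / scale)             # position interpolation: shrink positions
ntk_base     = base * scale ** (head_dim / (head_dim - 2))
ntk          = rope_angles(8192, base=ntk_base)      # NTK-aware: stretch the frequency base

# With NTK scaling the high-frequency components (small i) barely change,
# while the low-frequency ones are compressed back toward the trained range.
print(original[:2], interpolated[:2], ntk[:2])
```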

00:37:06Speaker

Another thing I'd like to point out is that there's lots of context available for coding, and coding models specifically, like in the Copilot setting where you're interacting with a model within your IDE, pull in lots of different information. Some examples of information you could be pulling in are the **current coding context**, the **description of an issue to fix**, **context about the repo** in some way, and also **open tabs in an IDE** or something like this. So what information should you be using to generate the next code?

00:37:51Speaker

It depends a lot on whether you're working on agents or on Copilot-style code completion models. Agents at the moment largely use tools in the SWE-agent style. But separately from this, there's a really good blog post on **deconstructing Copilot** and what information Copilot is pulling in. Basically, the way it works is it extracts a prompt from the current document and cursor position, identifies the relative path and language of the file, and finds the most recently accessed 20 files of the same language.

00:38:33Speaker

So if you've been programming in Python, it will look at the most recently accessed 20 Python files. It includes text before and text after the cursor, similar files, imported files, and metadata about the language and path. Long story short, they did a bunch of prompt engineering to pull all of the relevant context into the prompt, and that seems to work well for recommending code to the human coders who are working on it. I think the agent space has not quite gone this far in prompt engineering based on the current contexts and things like this, so I think that's an area for potential improvement in our models going forward.
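A toy sketch of that style of prompt assembly, loosely following the blog post's description; the real Copilot logic (snippet scoring, token budgeting) is considerably more involved, and the function and paths here are purely illustrative.

```python
from pathlib import Path

def build_completion_prompt(prefix: str, suffix: str, current_path: str,
                            recent_files: list[str], budget_chars: int = 6000) -> str:
    """Assemble context for a code-completion request: path and language hint,
    snippets from recently opened files of the same language, then the text
    around the cursor."""
    parts = [f"# Path: {current_path}", "# Language: python"]
    for path in recent_files[:20]:                     # most recently accessed files
        if path.endswith(".py") and Path(path).exists():
            snippet = Path(path).read_text(errors="ignore")[:500]
            parts.append(f"# From {path}:\n{snippet}")
    parts.append(prefix)                               # text before the cursor
    prompt = "\n".join(parts)[-budget_chars:]          # crude character budget
    return prompt + "\n# <cursor>\n" + suffix          # suffix used for fill-in-the-middle

# usage sketch (hypothetical paths):
# prompt = build_completion_prompt(prefix, suffix, "app/models.py", ["app/db.py", "app/api.py"])
```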

00:39:14Speaker

So another thing I'd like to talk about is, um, **file localization**. So file localization is a big problem. It might even be the biggest problem in coding agents nowadays, and what I mean by this is that we want to find the correct files given a user intent. So like, let's say we have an intent that looks like, um, uh, this, so this is an actual example of an issue from our repo, and I think it's a reasonably good example of kind of the level of detail that people actually write issues in.

00:39:56Speaker

So it says, "Um, when in confirmation mode, it's not possible to give instructions in between steps. You have to reject an action, and it seems like it doesn't know that the action was rejected." Um, "and then describe the UX of the solution that you'd like. We'd like to have a third option, confirm action and wait. This action is confirmed, but before it tries to take the next step, you are able to give some feedback." So this is a feature request. Basically, it's a request to add like another option that doesn't exist already, but it doesn't tell you anything about where confirmation mode is implemented.

00:40:26Speaker

It doesn't tell you whether this needs to be implemented in the frontend code, the backend code, or both, or other things like this. And models and agents like really struggle to do, uh, very well at this. Um, and I think like one interesting thing is that this is analogous to like understanding the environment in embodied agents. So like embodied agents, um, let's say you have a robot that needs to work in your kitchen or something like this.

00:40:54Speaker

The first thing it should do before it starts solving any problems in your kitchen is go and open all the kitchen drawers, right? Or cabinets or whatever else. And I feel like this localization problem is kind of like that: the agent either needs to go in and open all the drawers, all the files, and understand what's going on in them, or it needs to have a really good system to retrieve at retrieval time. So anyway, solution one is to **offload to the user**, and basically what this requires is a user that's very familiar with the capabilities of agents, what they can do, and how you can help them do their job well, which I think is reasonable if it's somebody who interacts with agents a lot.

00:41:48Speaker

So this issue was written by me, and I'm very familiar with how agents work, so I told it specifically which files it should be manipulating. But this obviously isn't a final solution, because not all users are like that. They might not know the code base; they might know that there's a bug but not know what is causing the bug, and other things like this.

00:42:10Speaker

So this is limited in its value, even though it is helpful in cases where it's applicable. The next option is to **prompt the agent with search tools**. SWE-agent is one example of something that does this: it provides a tool for searching repositories, specifically a search command.

00:42:36Speaker

So it will search for something like "PVSystem," and as a result it will list out all the files that contain "PVSystem." Then, if it doesn't find the thing it wants, it will go to the next page of results and keep searching. Another thing you can do is search and then summarize the search results for the agent, providing the information as necessary. So I think this is good.
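Here is a minimal sketch of what such a repository search tool can look like, returning paginated, LLM-friendly results; the command name and output format are illustrative rather than SWE-agent's actual interface.

```python
import pathlib

def search_dir(term: str, repo: str = ".", page: int = 0, page_size: int = 20) -> str:
    """List files containing `term`, with match counts, one page at a time so
    the observation stays small enough for the model's context."""
    hits = []
    for path in pathlib.Path(repo).rglob("*.py"):
        count = path.read_text(errors="ignore").count(term)
        if count:
            hits.append((str(path), count))
    hits.sort(key=lambda h: -h[1])
    page_hits = hits[page * page_size:(page + 1) * page_size]
    lines = [f"{p} ({c} matches)" for p, c in page_hits]
    header = f'Found {len(hits)} files containing "{term}" (page {page}):'
    return "\n".join([header] + lines)

# e.g. print(search_dir("PVSystem", repo="pvlib-python"))
```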

00:43:09Speaker

It's actually very nice for agent-based systems that have the ability to use tools well, but it's not the equivalent of going in and opening all the cabinets before you start. So there are other methods that do something like that. The first is **creating a map of the repository**, of the code base that you're working in, and prompting the agent with it. There's an open-source tool that a lot of people use called Aider, and the way Aider does this is it creates a tree-structured map of the repository a priori.

00:43:51Speaker

The human user has the ability to add files to the map to make them more prominent, and then it summarizes the map so it's a decent size. Taking this a step further, there's a method called **Agentless**, and this does a hierarchical search for every issue. The way this looks is: you essentially start out with a project code base, you feed the agent a very simple tree of all of the code that's in the code base, and this is fed into the language model.

00:44:26Speaker

Then, based on this, the language model will respond, "I want to open this file," and it will open those files, and in each of the files it will get a summary, like the methods and classes that are implemented in that file. After it does that, the agent then needs to localize to specific functions: it will say which functions it needs to edit or modify, and then those functions will be displayed in full to the agent in order for it to finally generate the code. So this is a pretty effective approach.
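Here is a rough sketch of the two cheap stages of that hierarchy, a repository tree and per-file signature skeletons, using Python's ast module; Agentless itself uses its own formats and adds further stages for patch generation and filtering.

```python
import ast
import pathlib

def repo_tree(repo: str) -> str:
    """Stage 1: a bare listing of files, cheap enough to show in one prompt."""
    return "\n".join(str(p.relative_to(repo))
                     for p in sorted(pathlib.Path(repo).rglob("*.py")))

def file_skeleton(path: str) -> str:
    """Stage 2: classes and function signatures only, not the bodies."""
    tree = ast.parse(pathlib.Path(path).read_text(errors="ignore"))
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"    def {node.name}({args}): ...")
    return "\n".join(lines)

# The LLM first picks files from repo_tree(...), then functions from
# file_skeleton(...), and only those functions are shown in full for editing.
```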

00:45:11Speaker

The final thing is **retrieval-augmented code generation**, and the way this works is you retrieve similar code, so you have an embedding-based model, and you can feed the retrieved code to the model and then generate the outputs. Particularly for code, there's also documentation which can be retrieved. One difficulty in retrieval-based code generation is that the code itself is in Python while the natural language description is in English, and there can be a disconnect between the two. So sometimes it's more effective to directly retrieve the documentation and generate the output based on that.
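A small sketch of the retrieval step, using TF-IDF similarity as a stand-in for the neural embedding models real systems use; the retrieved documentation is then simply prepended to the generation prompt.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "pandas.DataFrame.groupby: group rows by one or more columns",
    "numpy.linalg.solve: solve a linear matrix equation ax = b",
    "re.sub(pattern, repl, string): replace regex matches in a string",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documentation snippets most similar to the query."""
    vec = TfidfVectorizer().fit(docs + [query])
    doc_vecs, query_vec = vec.transform(docs), vec.transform([query])
    scores = cosine_similarity(query_vec, doc_vecs)[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

query = "replace all digits in a string with '#'"
prompt = "Relevant documentation:\n" + "\n".join(retrieve(query)) + f"\n\nTask: {query}\n"
print(prompt)  # this prompt (retrieved docs + task) is what gets fed to the code LLM
```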

00:45:37Speaker

So there are some references for this. My impression is that there aren't a lot of approaches that have really effectively used this in the agentic setting yet. I think people will probably be moving in this direction, and it's possible there are some new arXiv papers that I've missed, but I think leveraging this appropriately is kind of an unsolved problem, and we'll want to work towards improving that. Okay, so the next thing I'd like to talk about is **planning and error recovery**.

00:46:21Speaker

Solving coding tasks can be difficult, and in order to solve them you want to have some concept of a plan, and of keeping track of whether you're on track with your plan, and other things like this. A lot of the top code agent models that have high scores on leaderboards like SWE-bench have a **hardcoded task completion process**. For example, Agentless is an example of this. It's not actually so much of an agent, because it mostly just goes through this process, but it contains three steps.

00:47:04Speaker

So, **file localization**: like I talked about before, it has this hierarchical localization process, which goes down to **function localization**. It then generates a whole bunch of patches, runs them against a linter and runs tests against them to make sure they are appropriately formatted, don't have syntax errors, and pass existing tests, and then takes the best one, the one that passes the tests, and applies it. If there are multiple patches that pass the tests, they use a similarity metric to try to pick the one that looks the best compared to the others. So this is very effective.

00:47:42Speaker

Actually, Agentless gets good results for how simple of a method it is. It also costs less than some agent style planning methods, but the problem is it's very inflexible. So like, let's say now you wanted to go online and read the documentation before you solve the task. In some cases, you wouldn't be able to do this because this is hardcoded.

00:48:08Speaker

Something that's a little bit less inflexible is **multi-agent-style methods**, or methods that use an LLM-generated plan beforehand. Actually, it doesn't need to be multi-agent; it could be a single agent that generates a plan beforehand and then steps through the plan. This is an example from a multi-agent system called CodeR, and basically the way it works is it generates a plan.

00:48:39Speaker

This plan is based on the fact that this manager has a whole bunch of, um, kind of sub-agents like a reproducer, a fault localizer, an editor, and a verifier. And it has a semi-hardcoded structure between all of these, uh, agents where you first generate a plan that includes the reproducer step, the fault localization step, editor step, verifier step, and then you have kind of a hardcoded control flow where the reproducer runs it. If they're able to reproduce it, then, um, uh, they send it to the fault localizer. If not, they send it directly to the editor, other things like this.

00:49:14Speaker

So there are a lot of works that have this kind of structure that's highly tailored towards solving GitHub issues. One problem with this sort of thing, though, is that I've noticed language model agents very often will generate a plan, and then after they've generated the plan, they realize it was not the greatest plan in the first place. Their underlying assumptions were not correct; for example, the test directory they were supposed to use doesn't exist in the first place, so they need to go back and modify the plan, or other things like this. So there's an example of a work on this.

00:49:56Speaker

This was tested on web agents, not coding agents, but I think it's applicable to coding agents as well, and we've tried it out a little bit. Basically, it has the ability to **kick the plan back to the planner**: if the agent that's executing the plan fails at any of the steps, it can kick it back to the planner, and the planner can say, "Oh, I need to generate a new plan." I think if you're going to have this sort of structured architecture, this is really, really important in doing a good job.
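A schematic sketch of that control flow with the LLM calls stubbed out: the executor works through the plan step by step and, when a step fails, kicks control back to the planner along with the reason, so a revised plan can be generated. The step names and stub behavior are purely illustrative.

```python
def plan(task, feedback=None):
    """Stub for an LLM planner; `feedback` explains why replanning was requested."""
    steps = ["reproduce the bug", "localize the fault", "edit the code", "run the tests"]
    return steps if feedback is None else ["re-read the issue"] + steps

def execute(step):
    """Stub for the executing agent; returns (success, error_message)."""
    return (step != "reproduce the bug"), "tests/ directory does not exist"

def run(task, max_replans=3):
    feedback = None
    for _ in range(max_replans):
        for step in plan(task, feedback):
            ok, error = execute(step)
            if not ok:
                # Kick the plan back to the planner with the reason for failure.
                feedback = f"step '{step}' failed: {error}"
                break
        else:
            print("plan completed")
            return
    print("gave up after several replans")

run("add a third confirmation option")
```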

00:50:42Speaker

Another really important thing is having the ability to **fix outputs based on error messages**. One of the advantages of using agents versus an Agentless-style framework is that the agents can actually run the code, see the results, and go back and fix the code appropriately. There's not a whole lot of work on doing this, especially with the strong modern language models, as far as I can tell, and I hope there's more work in this general direction. InterCode is one example of this, but even our strongest models are not super great at this.

00:51:25Speaker

Just to give an example, one of the issues we found with GPT-4o is that it will attempt to fix an issue, fail at fixing it, then attempt to fix the issue in the same way, get stuck in a loop, and just do this over and over again until you run out of turns. For some reason, Claude is better at not doing this and trying different approaches. I don't know why, because the Claude people have not told me what they did, so if any of the Claude people are listening, please tell me. But I think methods that train models to do really well at error recovery are going to be a big research topic going forward, and I think people are working on this now as well.
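A minimal sketch of such an execute-and-repair loop, with the model call stubbed out and a check that refuses to retry a patch identical to one that already failed, which is exactly the looping failure mode described above; `apply_patch` is a hypothetical helper.

```python
import subprocess

def propose_patch(error_log, history):
    """Stub for an LLM call that proposes a new patch given the test output
    and the patches already tried."""
    return f"# patch attempt {len(history) + 1} informed by:\n# {error_log[:120]}"

def repair_loop(max_turns: int = 5) -> bool:
    tried = []
    for _ in range(max_turns):
        result = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True                         # tests pass, we're done
        patch = propose_patch(result.stdout + result.stderr, tried)
        if patch in tried:
            # Identical to a failed attempt: ask for a different strategy
            # rather than looping on the same fix over and over.
            patch = propose_patch("previous patch repeated; try another approach", tried)
        tried.append(patch)
        # apply_patch(patch)  # hypothetical helper that writes the edit to disk
    return False
```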

00:51:56Speaker

A final thing I'd like to talk about is **safety**, and this is really important because coding models can cause harm. Right now, because I use coding models in my everyday coding, I'm most worried about them causing harm by accident: I told it to do something, it either made a mistake or misinterpreted my commands, and it did some damage. To give some examples that have actually happened to me: you have a coding model and you ask it to commit to your GitHub repository; you never said you wanted it to push to GitHub, but it just assumes that you do, so it commits things and pushes them directly to your main branch, and then you need to go clean up your code repository. Or, in the worst case, if there are bugs, it causes major issues for people who are using the repository downstream.

00:53:12Speaker

This was a really crazy one that I was very surprised by: I told the model that the tests needed to pass, and it was having trouble making the tests pass, so it just deleted the tests. And this is a little bit scary, right? It's doing something that's pretty obviously harmful in order to obtain its objectives, and we don't want that. Another thing is **intentional misuse**: papers have demonstrated that coding agents can be used for hacking and other things like this.

00:53:50Speaker

And I think this is, you know, we need to be careful, uh, in, uh, making sure that we have tools that are able to mitigate these risks. I don't think that means, you know, don't build tools because if we don't build tools, then only hackers will have these tools. But, um, we, we do need to be very conscientious of that. Um, so the first safety mitigation one that I think is absolutely essential if you're using coding agents in your everyday life is **sandboxing**.

00:54:16Speaker

We can improve safety by basically limiting the environment that the agent has access to. Just to give one example, OpenHands executes all actions within a Docker environment that's isolated from the main system you're using at the time. So it only has access to the files that you put into that Docker environment, and other things like this. Also, with Docker you can limit access to the internet, and you can limit access to any other variety of things.
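As an illustration, here is a sketch of launching an action inside a locked-down container from Python; the flags shown (no network, a single mounted workspace, CPU and memory caps) are standard Docker options, though the actual OpenHands runtime setup is more elaborate.

```python
import subprocess

def run_sandboxed(command: str, workspace: str) -> subprocess.CompletedProcess:
    """Execute an agent action inside an isolated container: no network,
    only the mounted workspace visible, and capped CPU/memory."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network=none",                   # no internet access
            "--memory=2g", "--cpus=1",          # resource limits
            "-v", f"{workspace}:/workspace",    # only this directory is visible
            "-w", "/workspace",
            "python:3.11-slim",                 # minimal base image
            "bash", "-lc", command,
        ],
        capture_output=True, text=True, timeout=300,
    )

# e.g. result = run_sandboxed("python -c 'print(2 + 2)'", "/path/to/repo")
```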

00:55:16Speaker

One general theme with respect to safety, according to my philosophy anyway, is that we've already thought a lot about the security of software in non-agent scenarios, so we can reuse a lot of the tools that we already know are important and useful for code security there for agent safety, and Docker is one example of that. The next thing is **credentialing**. There's a very important concept in security that I think many or most people have heard of, called the **principle of least privilege**: people should only have the privileges that are necessary to do their job, and services should only have the privileges that are necessary to do their jobs. Just to give an example, there are GitHub access tokens that now have very fine-grained privileges and allow you to do only certain things on GitHub repositories.

00:56:02Speaker

The example on the right is very small, so let me read it out loud. The first thing it says is repository access: you can give it access to all public repositories with only read-only access, you can give it access to all repositories that you have access to, or only select repositories, so you can decide which GitHub repositories it's allowed to access. You can also configure whether it can run GitHub Actions, do administration, and, for example, modify content, open issues, and other things like this.

00:56:36Speaker

So what you can do here is you can basically prevent agents from, you know, having access to more privileges than they need, but still give them the privileges that they need. And like, actually, one thing I forgot to say is it's very easy to create a completely useless agent. You just put it in a sandbox and don't give it access to anything, and you'll have an agent that, you know, can't really do anything other than work as a calculator, um, or something like this. Um, but at the same time, we want to build agents that are helpful, uh, while also not causing, you know, undue, uh, harm, so this is another thing that we need to think about.

00:57:42Speaker

Another thing that we have looked at is creating **post-hoc auditing** for model actions. The way you can do this is: you generate an action, and then, based on the action, you either decide to execute it if it's deemed to be harmless, or, if it's deemed to be harmful, you don't execute it and you return a message like "This is dangerous; you shouldn't be executing that action" back to the agent. This can use LLMs, static analysis tools, vulnerability detection tools, or other things. And I should mention that this is actually joint work with people at Invariant Labs and ETH.
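A toy sketch of such an auditor: each proposed action is checked before execution against a few rules, here simple regexes standing in for the LLM-based or static-analysis-based checks, and risky actions are bounced back to the agent with an explanation instead of being run.

```python
import re

RISKY_PATTERNS = {
    r"\brm\s+-rf\b":            "recursive delete",
    r"git\s+push\s+.*--force":  "force push",
    r"git\s+push\s+.*\bmain\b": "direct push to main",
    r"\bdelete\b.*\btest":      "removing tests to make them pass",
}

def audit(action: str):
    """Return (allowed, message). Harmless actions are executed; harmful ones
    are rejected and the message is sent back to the agent as an observation."""
    for pattern, reason in RISKY_PATTERNS.items():
        if re.search(pattern, action, flags=re.IGNORECASE):
            return False, f"Action blocked ({reason}): this looks dangerous, please propose an alternative."
    return True, "ok"

print(audit("pytest -q"))
print(audit("git push origin main --force"))
```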

00:58:06Speaker

So, in conclusion and in summary, I think we've already demonstrated that copilots are very useful, and code agents are getting there. I have repositories where I have done half of the commits with a coding agent, or at least it drafted half of the commits and then I've gone in and fixed them. So I think in the next one or two years we're going to have the ability to use coding agents for many of our mundane tasks, ones that don't require very much thought, and they'll gradually get better at doing things that require more thought and more care.

00:58:46Speaker

I think current challenges include coming up with good code LLMs, editing methods, localization, planning, and safety. In terms of future directions, I think there are a lot of interesting directions in this area. The first one, in addition to all of the current challenges, is **agentic training methods**: training on agent-style data that will give us the ability to adhere to agent formats, do better planning, and recover from errors better. Another is **human-in-the-loop methods for evaluating agents**.

00:59:29Speaker

So like one of the nice things about agents is they can communicate with humans, and I'm much more effective at solving tasks if I'm willing to spend the time interacting with the agent than just asking it to do it, um, on the fly. I can do things like watch the agent do its work and pause it when it seems to be making a mistake, and just go in and push it in the right direction, and it will correct course. But we really don't know like the best modality for this, and we don't have any good evaluation benchmarks to evaluate, uh, how well models are able to do at this. So I think that's another big thing.

00:59:53Speaker

Also, **broader software tasks than coding**: I think there are basically no, or at least very few, benchmarks that tell us how good we're doing at this in the first place. Software toolkits like Devin and OpenHands can do this; they can do web browsing and other things like that, but it's not clear how well they can do at any of these tasks and how much work there is still to be done.

01:00:25Speaker

So yeah, that's my summary. Sorry, this was very broad; I know I went through a lot of stuff in an hour. But if you want to try it out yourself, you can download our software toolkit, and it does all the things I talked about here, and we'd love to have feedback or collaborate with people and things like that as well. So, yeah, thank you.
