Context Engineering & Coding Agents with Cursor
Disclaimer: The transcript on this page is for the YouTube video titled "Context Engineering & Coding Agents with Cursor" from "OpenAI". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=3KAI__5dUn0
[Applause] I'm Lee, and I'm on the Cursor team, and I'm going to talk about how building software has evolved. So, thanks for being here.
We started with punch cards and terminals back in the 60s, when programming was this new superpower but was inaccessible to most people. And then in the late 70s and early 80s, programmers grew up writing BASIC on their Apple IIs and Commodore 64s. In the 80s, GUIs started to become mainstream, but most programming was still done in text-based terminals. It wasn't until the 90s and 2000s that we started to see programming shift to graphical interfaces. FrontPage and Dreamweaver, which you might remember, allowed beginners to drag and drop to build websites, and new editors and IDEs like Visual Studio made it easier for professionals to work in very large codebases. And I of course had to add my favorite text editor, Sublime Text, here. I'm sure some of you have used it before. It's a good one.
Now with AI, building software is becoming more accessible and powerful than ever. Unlike the slower shift from terminals to GUIs, the shift to writing code with AI is really being speedrun. The progress of decades is happening in just a few years, and with each iteration, the interface and the UX change to allow the models to achieve more ambitious tasks. So I'd like to talk about context engineering and how coding agents have evolved over the past few years from the perspective of Cursor. I'll show how we've gone from autocompleting your next action to fully autonomous coding agents. And finally, we'll have Michael, Cursor's CEO, talk about the future of where we believe software engineering is headed.
So, let's start with Tab. One of the products that inspired Cursor was GitHub Copilot. It showed that with improvements to the UX of autocomplete and with better models, we can make writing code much easier. We released the first version of Tab back in 2023, and the experience has evolved from predicting the next word to the next line and then ultimately to where your cursor is going to go next. Tab now handles over 400 million requests per day. And this means we have a lot of data about which suggestions users accept and reject. This led to us moving from an off-the-shelf model to training a model specialized for next action prediction. So to improve this model, we use data to positively reinforce behaviors that lead to accepted suggestions and then negatively reinforce rejected suggestions. And we're able to do this in near real-time. So you can accept a suggestion and then 30 minutes later the Tab model has been updated using online RL based on your feedback.
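As a toy illustration of how that accept/reject feedback could be turned into training signal, here is a minimal sketch. The types and the +1/-1 reward scheme are assumptions made for explanation, not Cursor's actual pipeline.

```typescript
// Illustrative only: a toy shape for turning Tab accept/reject feedback into
// reward-weighted training examples. Cursor's real online RL pipeline is not
// public; the types and reward scheme here are assumptions.

interface TabEvent {
  context: string;      // code around the cursor when the suggestion was shown
  suggestion: string;   // the edit the model proposed
  accepted: boolean;    // did the user accept it?
}

interface TrainingExample {
  context: string;
  suggestion: string;
  reward: number;       // positive for accepted, negative for rejected
}

// Convert a batch of recent interactions into reward-labeled examples that a
// periodic policy update (e.g. every ~30 minutes) could consume.
function toTrainingBatch(events: TabEvent[]): TrainingExample[] {
  return events.map((e) => ({
    context: e.context,
    suggestion: e.suggestion,
    reward: e.accepted ? 1 : -1,
  }));
}
```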
Getting this experience right has taken a lot of iterations. There's a delicate balance between the speed of the suggestion, the quality of the suggestion, and also just the general UX for how it's displayed. If it's slower than 200 milliseconds, it kind of takes you out of your flow. But you also don't want to see fast unhelpful suggestions. So with our latest release now, we show fewer suggestions, but we have higher confidence that they're going to be accepted.
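A tiny sketch of that trade-off, using the 200 millisecond budget from above and a made-up confidence threshold, just to show the shape of the gating decision:

```typescript
// Illustrative gating for when to surface a Tab suggestion. The 200 ms budget
// comes from the talk; the confidence threshold is a hypothetical value.

interface Suggestion {
  text: string;
  confidence: number;   // model's estimated probability the user will accept
  latencyMs: number;    // time from keystroke to suggestion being ready
}

const LATENCY_BUDGET_MS = 200;
const MIN_CONFIDENCE = 0.6; // "show fewer, higher-confidence suggestions"

function shouldShow(s: Suggestion): boolean {
  return s.latencyMs <= LATENCY_BUDGET_MS && s.confidence >= MIN_CONFIDENCE;
}
```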
We find Tab really helpful for domains where AI models just aren't as helpful yet. And the bottleneck here really is your own typing speed. Now, most people type at about 40 words per minute, even though I'm sure all of you type at 90 plus, right? We've got some amazing typists in here. So what would it look like if we allow the AI models to write more code for us? This is where coding agents come in and this is that next evolution of coding with AI. You can talk to models directly in products like Cursor or like we saw in Codex and have them create or update entire blocks of code.
Something we've tried really hard to make a focus in Cursor is giving you control over the level of autonomy when working with the models. So, one of the first features we added back in 2023 was prompting models to make inline suggestions. This would take your current line as well as the broader file context and pass it to the model to suggest a diff. Shortly after, we released our first step towards a coding agent, a feature called Composer, which some longtime Cursor users may remember. We even have a pixelated Twitter demo of one of the first versions, which I've included here. This made it much easier to do multi-file edits with more of a conversational UI.
And then in 2024, we added a fully autonomous coding agent. This saw models use more tokens as they got better at tool calling, and it allowed the Cursor agent to gather its own context. In previous versions, you had to provide all of that context upfront, which was a bit more difficult. So let's talk about some of the ways we've optimized the Cursor agent harness.
There's been a lot of talk recently about context engineering as an evolution of prompt engineering, which I personally find really helpful. As models are getting better, getting high-quality output is less about specific prompting tricks, although those can still help, but it's more about giving the models the right context. And not just any context, but intentional context. Models get worse at recalling information as the size of the context increases. And in reality, you don't want to push the limits of the context window. You want to use a minimal amount of high-quality tokens. And this is why the retrieval of code is actually really important and fundamental to context engineering.
So let's look at an example of searching code in a larger codebase. We found that giving models very powerful tools can significantly improve the rate at which code is accepted. Many coding agents now use commands like grep or ripgrep to look for direct string matches across files and directories, and as new models are trained on tool calling and agents get better at using tools, search quality does improve. However, we found that you can make searching even better by automatically indexing your codebase and creating embeddings, which gives you semantic search. So I can ask the agent to "update the top navigation," and even if the file is actually called header.tsx, semantic search allows the agent to quickly and accurately find the correct code during the retrieval process.
For generating embeddings, we also moved from an off-the-shelf embedding model to training a custom model, which produces more accurate results, and we constantly A/B test the performance of semantic search. We found that with grep alone, users would send more follow-up questions and spend more tokens, so semantic search is really helpful. One of the biggest wins, though, is that it shifts where the compute happens. You spend the compute and the latency upfront during indexing rather than at inference time when the agent is actually being invoked. In other words, you're doing the heavy lifting offline, which means you can get faster and cheaper responses at runtime without sacrificing performance or putting that cost on the user. So the takeaway here is that you likely want both grep and semantic search for the best results, and we'll have a full blog post soon that goes into some of these results.
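As a rough sketch of how those two retrieval paths can work together, here is a hybrid lookup that merges exact ripgrep matches with cosine-similarity search over a pre-built embedding index. The chunking, index shape, and scoring below are illustrative assumptions, not Cursor's implementation.

```typescript
// Illustrative hybrid retrieval: exact string matches (grep-style) plus semantic
// matches from an embedding index computed offline at indexing time.
import { execFileSync } from "node:child_process";

interface IndexedChunk {
  path: string;        // e.g. "src/components/header.tsx"
  text: string;        // chunk of code, embedded when the codebase was indexed
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Exact matches via ripgrep: cheap and precise, but "top navigation" won't hit header.tsx.
function grepSearch(pattern: string, repoRoot: string): string[] {
  try {
    const out = execFileSync("rg", ["--files-with-matches", pattern, repoRoot], {
      encoding: "utf8",
    });
    return out.split("\n").filter(Boolean);
  } catch {
    return []; // ripgrep exits non-zero when nothing matches
  }
}

// Semantic matches: embed the query and rank indexed chunks by cosine similarity.
// `embedQuery` stands in for whatever embedding model produced the index.
async function semanticSearch(
  query: string,
  index: IndexedChunk[],
  embedQuery: (q: string) => Promise<number[]>,
  topK = 10,
): Promise<IndexedChunk[]> {
  const q = await embedQuery(query);
  return index
    .map((chunk) => ({ chunk, score: cosine(q, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((r) => r.chunk);
}
```

The expensive part, embedding every chunk, happens offline during indexing; at inference time the agent only pays for one query embedding and a similarity scan, which is where the latency and cost savings come from.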
So giving the models better tools helps improve their quality. But what about the UX of actually using these coding agents? There's been a lot of exploration with coding CLIs, from OpenAI's Codex to Claude to Cursor's own CLI. The idea here is to find the most minimal abstraction over the model, iterate on the harness, and then make the agent extensible. But we don't believe CLIs are the final state or the end goal of working with coding agents. What I like about the terminal is that it opens up a new surface for coding agents to run on. It can be in the CLI, it can be on the web or from your phone, it can be from a bug report in Slack, which I use all the time, or from a backlog item in Linear that's automatically triaged for you.
Because CLI-based agents are scriptable, you can use them in any type of environment which is really helpful. We use this internally to automatically write docs or update parts of our codebase. And it can be as simple as just doing cursor -p and then a prompt and having text or even structured formats like JSON come back.
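As a minimal sketch of that kind of scripting, assuming only the cursor -p "prompt" invocation mentioned above (any other flags, including a structured JSON output mode, would be assumptions about the real CLI), you could drive the agent from a Node script:

```typescript
// Illustrative script driving a headless coding agent. The talk mentions
// `cursor -p "<prompt>"`; additional flags or output formats may differ from
// the real CLI, so treat this as a sketch.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function askAgent(prompt: string): Promise<string> {
  const { stdout } = await run("cursor", ["-p", prompt], {
    maxBuffer: 10 * 1024 * 1024, // agent output can be large
  });
  return stdout;
}

// Example: automatically draft docs for a module as part of a CI or cron job.
askAgent("Write API docs for the search module as markdown")
  .then((out) => console.log(out))
  .catch((err) => console.error("agent run failed:", err));
```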
We also believe that you'll need more specialized agents, which makes sense when you see the keynote today. Last year, we started experimenting with using AI models to read and review code instead of just writing and editing it. We made an internal tool called Bugbot that tried to help you find meaningful logic bugs in your code, and after using it internally for about six months, we found that it caught a lot of bugs we had missed in code reviews. So we decided to make it public, and funnily enough, it actually caught a bug that took down Bugbot itself, which of course we had accidentally ignored. So we learned to really pay attention to those Bugbot comments.
Newer models are also getting very good at longer-horizon tasks. So one way we've pushed agents to run longer inside of Cursor is having them plan and do more research upfront. This not only gives you a chance to verify the requirements of what you're trying to build and course-correct along the way, but we've also seen it significantly improve the quality of the code generated, which makes sense, right? You're giving the models much higher-quality input context. And to do this well, it takes more than a simple prompt change like "plan better": you actually need deeper product integration in how you store the plans, how you edit those files, and in giving the model new tools.
It also makes sense to allow the agent to create and manage a to-do list. This gives the model critical context so it doesn't forget the task it's working on or waste tokens; it's like having notes it can constantly reference. One area we're still exploring is taking your to-dos and giving them the same source of truth as everything else, your codebase, which is something I would personally use for smaller projects where I don't need a fully featured task management tool.
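A minimal sketch of what such a to-do tool could look like if the list lived directly in the repo; the file name, schema, and functions here are hypothetical, not a Cursor feature:

```typescript
// Illustrative to-do "tool" an agent harness might expose so the model can keep
// notes on long tasks. Storing the list in the repo (here TODO.agent.json) keeps
// the codebase as the source of truth; the filename and schema are hypothetical.
import { readFileSync, writeFileSync, existsSync } from "node:fs";

interface TodoItem {
  id: number;
  description: string;
  status: "pending" | "in_progress" | "done";
}

const TODO_PATH = "TODO.agent.json";

function loadTodos(): TodoItem[] {
  return existsSync(TODO_PATH)
    ? (JSON.parse(readFileSync(TODO_PATH, "utf8")) as TodoItem[])
    : [];
}

function saveTodos(todos: TodoItem[]): void {
  writeFileSync(TODO_PATH, JSON.stringify(todos, null, 2));
}

// Tool calls the agent could make between steps: add a task, mark one done.
export function addTodo(description: string): TodoItem {
  const todos = loadTodos();
  const item: TodoItem = { id: todos.length + 1, description, status: "pending" };
  saveTodos([...todos, item]);
  return item;
}

export function completeTodo(id: number): void {
  const todos = loadTodos().map((t) =>
    t.id === id ? { ...t, status: "done" as const } : t,
  );
  saveTodos(todos);
}
```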
Another important part of agent extensibility is allowing you to package up your workflows and share them with your team. Custom commands are a way to share prompts, and rules let you include important context in every single agent conversation. One way our engineers have found this really helpful internally is packaging up our commit standards and guidelines, putting them in /commit, and then being able to pass in tickets, like the Linear ticket you're working on. Another thing I've noticed is that a lot of context engineering breakthroughs actually happen in user space first. All of you, the power users, figure out the workflows and patterns that work really well, and as they get adopted, they make their way back into the core product as features. We've seen this with plans, memories, and rules.
Speaking of teams, you want to trust these agents to write code for you, but that requires keeping a human in the loop. That's why, when the agent tries to run shell commands, Cursor will ask whether you'd like to run it just once or, if you're comfortable, add it to the allow list to auto-run in the future. All of these settings can be stored in code and explicitly shared with your team, including blocking certain shell commands or actions. Our latest release also has custom hooks, so you can tap into every part of the agent's run. Maybe you want a shell script that runs when the agent finishes, for example.
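To make the allow-list idea concrete, here is an illustrative sketch of the decision logic; the policy shape and example commands are made up and do not reflect Cursor's actual settings format.

```typescript
// Illustrative allow/deny-list check for agent-proposed shell commands.
// This mirrors the idea of "run once vs. add to the allow list" but is not
// Cursor's actual settings schema; the rules below are examples.

type Decision = "auto_run" | "ask_user" | "block";

interface CommandPolicy {
  allow: string[]; // prefixes that may run without confirmation
  block: string[]; // prefixes that must never run
}

// Example policy a team might check into the repo and share.
const policy: CommandPolicy = {
  allow: ["npm test", "npm run lint", "git status"],
  block: ["rm -rf", "git push --force"],
};

function decide(command: string, p: CommandPolicy): Decision {
  if (p.block.some((prefix) => command.startsWith(prefix))) return "block";
  if (p.allow.some((prefix) => command.startsWith(prefix))) return "auto_run";
  return "ask_user"; // default: keep the human in the loop
}

console.log(decide("npm test -- --watch=false", policy)); // "auto_run"
console.log(decide("rm -rf node_modules", policy));       // "block"
```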
So, we've covered a lot of ground here. Coding agents have evolved quite a bit in the past year, and they're getting better and better when you give them very powerful tools. And as the models have gotten more capable, we've actually been able to remove overly precise instructions from our system prompts that just weren't necessary anymore. So, what would it look like if we allowed agents to run for significantly longer? What is the right interface for managing multiple coding agents?
If you're just getting started coding with agents, I don't recommend immediately trying to juggle multiple agents. I mean, let's be honest, are we really being productive running nine CLIs in parallel? Probably not. Probably not yet, though. I mean, not only do you need to set up your local machine for running parallel agents, but it's also kind of hard to review the output of all of these agents. So, we don't think that this form factor is the end goal or the end state, but there is promise here. One thing we've been dogfooding over the past few months is a new type of interface for managing multiple coding agents. And we found this really helpful internally when maybe you have an agent in the foreground, but you need to ask questions about the codebase or maybe do some research about tools you want to integrate or small refactors. When you have this fast coding model in the foreground, you can really stay in the flow and then you have your parallel agents kind of run other tasks in the background which could run for much longer.
Those could be in the foreground on your machine, or in the background on the cloud. Each of these choices has unique constraints that right now you have to think about and spend a lot of time on. In the cloud, you get sandboxed virtual machines, which are really nice for very long-horizon tasks, but the trade-off is that they usually take longer to boot up and you have to set up some initial configuration for the environment you're working in. Running agents locally in parallel needs a different kind of isolation. If you have multiple agents trying to modify the same set of files on your local machine, you need tools like git worktrees, which give you separate copies of your codebase that agents can work in independently. And then you also have to think about all the other parts of local dev, like managing access to your database and viewing each worktree on a different port. From talking to developers, a lot of this is happening in userland today; people are writing scripts and hacks to make it work well. What we're working on and exploring is building this natively into the Cursor product.
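For the local-isolation piece, git worktrees themselves are standard git. Here is a rough sketch of how a harness might give each agent its own working copy; the naming convention and directory layout are assumptions, not how Cursor manages worktrees internally.

```typescript
// Illustrative helper for giving each parallel agent its own copy of the codebase
// via `git worktree` (a real git feature). The branch names and layout are
// assumptions for the example.
import { execFileSync } from "node:child_process";
import { mkdirSync } from "node:fs";
import { join } from "node:path";

function createAgentWorktree(repoRoot: string, taskSlug: string): string {
  const worktreesDir = join(repoRoot, "..", "worktrees");
  mkdirSync(worktreesDir, { recursive: true });

  const path = join(worktreesDir, taskSlug);
  const branch = `agent/${taskSlug}`;

  // Creates a new working copy at `path` on a fresh branch, isolated from
  // other agents editing the same repository.
  execFileSync("git", ["worktree", "add", "-b", branch, path], { cwd: repoRoot });
  return path;
}

// Example: three agents working on separate tasks, each in its own directory.
for (const task of ["fix-login-bug", "update-docs", "refactor-search"]) {
  console.log("worktree at:", createAgentWorktree(process.cwd(), task));
}
```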
Another idea we've started to explore for multiple agents is having the models compete against each other. What if you had GPT-5 with high reasoning versus medium or low reasoning, and then picked the best result, or compared results across different model providers with Cursor's agent? This will soon be an option to go from one to n agents for any given prompt and any set of models.
Part of context engineering for agents is making it so they can check their own work. The agent needs to be able to run the code, test it, and verify it's actually working correctly, which is why we're exploring giving the agent computer use. It can then control a browser to view network requests, take snapshots of the DOM, and even give feedback about the design of the page.
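As one hypothetical way an agent could run that kind of check, here is a sketch using Playwright for browser automation; the talk doesn't say what Cursor actually uses, so treat the library choice and the whole flow as an illustration.

```typescript
// Illustrative "check your own work" step using Playwright (one possible browser
// automation library). An agent could run something like this after an edit to
// confirm the page still loads cleanly and capture evidence for review.
import { chromium } from "playwright";

async function verifyPage(url: string): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  const failedRequests: string[] = [];
  page.on("requestfailed", (req) => failedRequests.push(req.url()));

  const response = await page.goto(url, { waitUntil: "networkidle" });

  // Snapshot the DOM and take a screenshot so the agent (or a reviewer) can
  // inspect what actually rendered, not just what the code claims.
  const html = await page.content();
  await page.screenshot({ path: "verification.png", fullPage: true });

  console.log("status:", response?.status());
  console.log("failed requests:", failedRequests);
  console.log("dom size (chars):", html.length);

  await browser.close();
}

verifyPage("http://localhost:3000").catch((err) => {
  console.error("verification failed:", err);
  process.exit(1);
});
```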
As you can tell, there's still a lot to figure out on the right interface, the right product experience for managing multiple coding agents. Some of the things I just showed are available in Cursor today in beta. So go try them out if you're curious. And we'll have a stable release later this month. But I would love to hear your feedback on how you want to work with coding agents in the future. So come find me later and we can talk about it. And speaking of the future, I'd like to welcome Michael to the stage to talk about where software engineering is headed next.
Thanks, Lee. Our goal with Cursor is to automate coding. We think half of that is a model and autonomy problem, and half of it is a human-computer interaction problem: what does the act of building software look like? We want engineers to be more ambitious, more inventive, and more fulfilled. And today I want to hint a little at the picture of the future I think we can create together, one where AI frees up more time to work on the parts of building software that you love.
Imagine waking up in the morning, opening Cursor, and seeing that all of your tedious work has already been handled. On-call issues were fixed and triaged overnight. Boilerplate you never wanted to write was generated, tested, and ready to merge. A world where code review is actually fun, too. Instead of being buried in your busy work, your energy goes toward the things that drew you to engineering in the first place: solving hard problems, designing beautiful systems, and building things that matter.
Imagine agents that deeply understand your codebase, your team style, and your product sense. Agents that come back to you after working for long, long, long periods of time and show their work in higher-level programming languages. Agents that propose ideas, help you explore new directions, break down complex projects into pieces you can accept, reject, or refine. Ones that extend your ambition, but never take away your thinking and judgment. When you have a problem too complex for agents, they show you what they tried. Pulling in runtime logs or debugging tools. You'll never start from scratch.
This is the future we're working towards: a world where building software feels less like toil and much more like play, and where creativity is the focus. And I think it's possible sooner than even some of the most ambitious people in this room think. If this vision excites you, we'd love to chat. And if you haven't tried Cursor, we've been shipping lots of improvements to our agent and our editor. We'd love to hear what you think. Thank you.