Code retrieval logic disclosed from the official Cursor security documentation

AI Knowledge Base | 9 months ago | Published by AI Sharing Circle

Infrastructure security

We rely on the following sub-processors, listed in descending order of criticality. Please note that code data is uploaded to our servers to support all of Cursor's AI features (see the AI requests section), while in privacy mode the user's code data is not retained (for details, see the Privacy Mode Guarantee section).

AWS (sees code data): our infrastructure is primarily hosted on AWS. Most servers are located in the U.S., with some latency-sensitive servers located in AWS regions in Asia (Tokyo) and Europe (London).
Fireworks (sees code data): we host custom models on Fireworks' servers in the U.S., Asia (Tokyo), and Europe (London). If privacy mode is not enabled, Fireworks may store some code data in order to speed up model inference.
OpenAI (sees code data): we rely on many of OpenAI's models to provide AI responses. Requests may be sent to OpenAI even if an Anthropic (or other) model is selected in the chat (e.g., for summarization). We have a zero data retention agreement with OpenAI.
Anthropic (sees code data): we rely on many of Anthropic's models to provide AI responses. Requests may be sent to Anthropic even if an OpenAI (or other) model is selected in the chat (e.g., for summarization). We have a zero data retention agreement with Anthropic.
Google Cloud Vertex API (sees code data): we rely on some Gemini models provided by Google Cloud to provide AI responses. Requests may be sent to the Google Cloud Vertex API even if an OpenAI (or other) model is selected in the chat (e.g., for summarization).
Turbopuffer (stores obfuscated code data): embeddings of indexed codebases, and the metadata associated with the embeddings (obfuscated filenames), are stored on Turbopuffer in Google Cloud, located in the United States. You can find more information on the Turbopuffer security page. Users can disable codebase indexing; for details, see the Codebase Index section of this document.
Exa and SerpApi (see search requests, which may originate from code data): used for web search functionality. Search requests may originate from code data (e.g., when using "@web" in a chat, a separate language model determines what to search for based on your messages, session history, and current documents, and Exa/SerpApi will see the generated search query).
MongoDB (no access to code data): we use MongoDB to store some of our analytics data for users who do not have privacy mode enabled.
Datadog (no access to code data): we use Datadog for logging and monitoring. As described in the Privacy Mode Guarantee section, logs involving privacy mode users do not contain any code data.
Databricks (no access to code data): we use Databricks MosaicML to train some custom models. Privacy mode users' data is not passed to Databricks.
Foundry (no access to code data): we use Foundry to train some custom models. Privacy mode users' data is not passed to Foundry.
Slack (no access to code data): we use Slack as an internal communication tool. In internal chats, snippets of prompts from users without privacy mode enabled may be sent for debugging purposes.
Google Workspace (no access to code data): we use Google Workspace for collaboration. In internal emails, snippets of prompts from users without privacy mode enabled may be sent for debugging purposes.
Pinecone (no access to code data): embeddings and metadata for indexed documents are stored in Pinecone. These documents are fetched from the public web. We are currently migrating them to Turbopuffer.
Amplitude (no access to code data): we use Amplitude for some of our data analysis. No code data is stored in Amplitude; only event data is logged, such as the number of "Cursor Tab requests".
HashiCorp (no access to code data): we use HashiCorp Terraform to manage our infrastructure.
Stripe (no access to code data): we use Stripe to process billing. Stripe stores your personal data (name, credit card, address).
Vercel (no access to code data): we use Vercel to deploy our website. The website does not have access to code data.
WorkOS (no access to code data): we use WorkOS to handle authentication. WorkOS may store some personal data (name, email address).

None of our infrastructure is located in China. We do not directly use any Chinese companies as sub-processors, nor, to the best of our knowledge, do any of our sub-processors.

We follow the principle of least privilege in assigning infrastructure access and implement multi-factor authentication for AWS. We limit resource access through network-level controls and secret management.

Client Security

Cursor is a fork of the open source Visual Studio Code (VS Code), which is maintained by Microsoft. Microsoft publishes security advisories on its GitHub security page. After every other mainline VS Code release, we merge the upstream ‘microsoft/vscode’ codebase into Cursor. You can check which version of VS Code your version of Cursor is based on by clicking "Cursor > About Cursor" in the application. If there is a high-severity security patch in upstream VS Code, we cherry-pick that patch before the next merge and release it immediately.

We use ToDesktop to distribute our apps and perform automatic updates. Several widely used applications (such as Linear and ClickUp) also trust the platform.

Our application sends requests to the following domains to communicate with our backend. If you are behind an enterprise proxy, please allowlist these domains to ensure that Cursor works correctly.

‘api2.cursor.sh’: Used for most API requests.
‘api3.cursor.sh’: Used for Cursor Tab requests (HTTP/2 only).
‘repo42.cursor.sh’: Used for codebase indexing (HTTP/2 only).
‘api4.cursor.sh’, ‘us-asia.gcpp.cursor.sh’, ‘us-eu.gcpp.cursor.sh’, ‘us-only.gcpp.cursor.sh’: Used for Cursor Tab requests (HTTP/2 only), depending on your location.
‘marketplace.cursorapi.com’, ‘cursor-cdn.com’: Used to download extensions from the Extension Marketplace.
‘download.todesktop.com’: Used for checking for and downloading updates.
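
For proxy administrators, the domain list above can be turned into a simple exact-match check. The following is a minimal sketch, not an official tool: the constant `CURSOR_BACKEND_HOSTS` and the helper `is_allowed` are hypothetical names, and real proxies usually have their own allowlist syntax.

```python
from urllib.parse import urlsplit

# Hostnames copied from the list above; nothing else is assumed to be required.
CURSOR_BACKEND_HOSTS = frozenset({
    "api2.cursor.sh",
    "api3.cursor.sh",
    "repo42.cursor.sh",
    "api4.cursor.sh",
    "us-asia.gcpp.cursor.sh",
    "us-eu.gcpp.cursor.sh",
    "us-only.gcpp.cursor.sh",
    "marketplace.cursorapi.com",
    "cursor-cdn.com",
    "download.todesktop.com",
})


def is_allowed(url_or_host: str) -> bool:
    """Return True if a URL or bare hostname exactly matches a listed domain.

    Hypothetical helper for illustration; it performs exact matching only,
    so unlisted subdomains are rejected.
    """
    # urlsplit only fills in .hostname when a "//" authority marker is present.
    candidate = url_or_host if "//" in url_or_host else "//" + url_or_host
    host = urlsplit(candidate).hostname or ""
    return host.lower() in CURSOR_BACKEND_HOSTS
```

Exact matching is a deliberate choice here: a suffix match on ‘cursor.sh’ would be shorter but would also admit hosts that the list does not name.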

One security feature to note that differs from VS Code:

Workspace Trust: this is disabled by default in Cursor. You can enable it by setting ‘security.workspace.trust.enabled’ to ‘true’ in your Cursor settings. It is disabled by default to avoid confusion between Workspace Trust's "Restricted Mode" and Cursor's "Privacy Mode", and because its trust properties are complex to understand (e.g., even if Workspace Trust is enabled, you are not protected from malicious extensions, only from malicious folders). We welcome community feedback on whether Workspace Trust should be enabled by default.
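
As a rough illustration of the setting named above, the sketch below flips ‘security.workspace.trust.enabled’ in the text of a settings file. It assumes the file is strict JSON; editor settings files often contain comments, which `json.loads` rejects. The function name `enable_workspace_trust` is hypothetical, and where the settings file lives is not stated in the source.

```python
import json


def enable_workspace_trust(settings_text: str) -> str:
    """Return settings JSON with Workspace Trust switched on.

    Hypothetical helper: takes the current contents of a settings file as a
    string (strict JSON assumed) and leaves all other keys untouched.
    """
    settings = json.loads(settings_text) if settings_text.strip() else {}
    settings["security.workspace.trust.enabled"] = True
    return json.dumps(settings, indent=2)
```

In most cases, changing the value through the settings UI is simpler; the function only shows which key is affected.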

AI requests

To provide its functionality, Cursor makes AI requests to our servers. This happens in a variety of scenarios. For example, we send an AI request when you ask a question in chat, we send an AI request on every keystroke so that Cursor Tab can make suggestions for you, and we may also send AI requests in the background to build context or to find errors to alert you to.

AI requests typically contain contextual information such as your recently viewed files, your conversation history, and relevant code snippets based on language server information. This code data is sent to our AWS-based infrastructure and then forwarded to the appropriate language model inference provider (Fireworks/OpenAI/Anthropic/Google). Note that even if you configure your own OpenAI API key in your settings, requests are always routed through our infrastructure on AWS first.

At this time, we do not support connecting directly from the Cursor app to your enterprise deployment of OpenAI/Azure/Anthropic, because our prompt building happens on our servers and our custom models on Fireworks are critical to providing a good user experience. We do not yet offer a self-hosted server deployment option.

You own all the code generated by Cursor.

Codebase Index

Cursor allows you to semantically index your codebase, which lets it answer questions with the context of the entire codebase and write better code by referencing existing implementations. Codebase indexing is enabled by default, but can be turned off during onboarding or in the settings.

Our codebase indexing feature works as follows: when enabled, it scans the folders you have open in Cursor and computes a Merkle tree of hashes of all files. Files and subdirectories specified in ‘.gitignore’ or ‘.cursorignore’ are ignored. The Merkle tree is then synchronized to the server. Every 10 minutes, we check for hash mismatches and use the Merkle tree to identify which files have changed, uploading only those.
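
The change-detection idea can be sketched as follows. This is an illustration under stated assumptions, not Cursor's implementation: SHA-256 is an assumed hash function, files are passed as an in-memory mapping with ignored paths already filtered out, and `build_tree` and `changed_files` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def build_tree(files: Dict[str, bytes]) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """Hash every file, then hash each directory from its children's hashes.

    Returns (nodes, children): nodes maps a path to its hash ('' is the root),
    children maps a directory path to the paths directly inside it.
    """
    nodes = {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}
    children: Dict[str, Set[str]] = {}
    for path in files:
        parts = path.split("/")
        for i in range(len(parts)):
            parent, child = "/".join(parts[:i]), "/".join(parts[: i + 1])
            children.setdefault(parent, set()).add(child)

    def depth(directory: str) -> int:
        return 0 if directory == "" else directory.count("/") + 1

    # Deepest directories first, so child hashes exist before their parent needs them.
    for directory in sorted(children, key=depth, reverse=True):
        digest = hashlib.sha256()
        for child in sorted(children[directory]):
            digest.update(f"{child}:{nodes[child]}\n".encode("utf-8"))
        nodes[directory] = digest.hexdigest()
    return nodes, children


def changed_files(old_nodes: Dict[str, str], new_nodes: Dict[str, str],
                  new_children: Dict[str, Set[str]], node: str = "") -> List[str]:
    """List new or modified files, skipping any subtree whose hash is unchanged.

    Deleted files are not reported in this sketch.
    """
    if node not in new_nodes or old_nodes.get(node) == new_nodes[node]:
        return []
    if node not in new_children:  # a file, not a directory
        return [node]
    changed: List[str] = []
    for child in sorted(new_children[node]):
        changed.extend(changed_files(old_nodes, new_nodes, new_children, child))
    return changed
```

Comparing the root hashes first means an unchanged codebase costs a single comparison, and a change deep in one folder only requires walking that folder's branch.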

On our server, we chunk and embed the files and store the embeddings in Turbopuffer. To allow filtering of vector search results by file path, we store an obfuscated relative file path for each vector, as well as the line range corresponding to the chunk. We also store embeddings in a cache on AWS, indexed by the hash of the chunk, so that indexing the same codebase a second time is much faster (this is especially useful for teams).
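
A minimal sketch of chunking with a hash-keyed embedding cache might look like this. The fixed-size line chunking, the SHA-256 cache key, and the names `chunk_lines` and `index_file` are assumptions for illustration; the `embed` callable stands in for whatever embedding model is actually used, and the plain dictionary stands in for the cache on AWS.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def chunk_lines(text: str, lines_per_chunk: int = 40) -> List[Tuple[int, int, str]]:
    """Split text into chunks, returning (first_line, last_line, body) with 1-based lines."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), lines_per_chunk):
        block = lines[start:start + lines_per_chunk]
        chunks.append((start + 1, start + len(block), "\n".join(block)))
    return chunks


def index_file(obfuscated_path: str, text: str,
               embed: Callable[[str], Vector],
               cache: Dict[str, Vector],
               lines_per_chunk: int = 40) -> List[dict]:
    """Embed each chunk, reusing cached vectors for chunks that were seen before.

    The returned records hold only the obfuscated path, the line range, and
    the vector; the chunk text itself is not kept.
    """
    records = []
    for first, last, body in chunk_lines(text, lines_per_chunk):
        key = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if key not in cache:  # only new content costs an embedding call
            cache[key] = embed(body)
        records.append({"path": obfuscated_path, "lines": (first, last), "vector": cache[key]})
    return records
```

Because the cache key depends only on chunk content, a teammate indexing the same code hits the cache even though their obfuscated paths may differ.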

At inference time, we compute an embedding, have Turbopuffer perform a nearest-neighbor search, send the obfuscated file paths and line ranges back to the client, and read those file chunks locally on the client. We then send these chunks back to the server to answer the user's question. This means that no plaintext code is stored on our servers or in Turbopuffer.
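
The retrieval round trip can be sketched as below. This is illustrative only: a brute-force cosine-similarity scan stands in for Turbopuffer's nearest-neighbor search, the record layout is assumed, and `nearest_chunks` and `read_lines` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_chunks(query: Sequence[float], index: List[Dict],
                   k: int = 2) -> List[Tuple[str, Tuple[int, int]]]:
    """Server side: return only (obfuscated path, line range) for the k best matches."""
    ranked = sorted(index, key=lambda rec: cosine(query, rec["vector"]), reverse=True)
    return [(rec["path"], rec["lines"]) for rec in ranked[:k]]


def read_lines(text: str, lines: Tuple[int, int]) -> str:
    """Client side: read the 1-based inclusive line range from a local file's text."""
    first, last = lines
    return "\n".join(text.splitlines()[first - 1:last])
```

The search result carries no source text, so the client has to de-obfuscate the path and read the lines from its own disk before anything readable is sent for the answer.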

Some notes:

Although ‘.cursorignore’ can prevent files from being indexed, those files may still be included in AI requests, for example when you have recently viewed a file and then ask a question in chat. We are considering adding a ‘.cursorban’ file to handle the case where you want to prevent a file from being sent in any request. If you are interested in this feature, please post in the forum or contact us at hi@cursor.com.
File path obfuscation details: the path is split on ‘/’ and ‘.’, and each segment is encrypted using a key stored on the client and a deterministic, short 6-byte nonce. This leaks some information about the directory hierarchy and will have some nonce collisions, but hides most of the information.
Embedding reversal: academic research has shown that it is possible to reverse embeddings in some cases. Current attacks rely on having access to the model and on embedding short strings into large vectors, which leads us to believe that the attack would be more difficult to perform in this context. However, if an adversary breaks into our vector database, they may learn something about the indexed codebases.
When codebase indexing is enabled in a Git repository, we also index the Git history. Specifically, we store commit SHAs, parent information, and obfuscated filenames (same as above). To allow sharing of the data structure among users in the same Git repository and on the same team, the key for filename obfuscation is derived from a hash of the most recent commits. Commit messages and file contents or diffs are not indexed.
Our indexing feature is often under heavy load, which can cause many requests to fail. This means that files sometimes need to be uploaded several times before they are fully indexed. One manifestation of this is that if you inspect the network traffic to ‘repo42.cursor.sh’, you may see more bandwidth used than expected.
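
To make the file path obfuscation note concrete, here is a toy sketch of per-segment deterministic encryption. The source does not name the cipher, so everything cryptographic here is an assumption: the nonce is taken as the first 6 bytes of an HMAC-SHA256 of the segment, and a SHA-256 counter keystream stands in for a real cipher. The function names are hypothetical, and this should not be used to protect real data.

```python
import hashlib
import hmac
import re

NONCE_LEN = 6  # the "short 6-byte" nonce size mentioned in the notes


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter (assumption, not a real cipher)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt_segment(key: bytes, segment: str) -> str:
    """Encrypt one path segment; the nonce is derived from the segment, so output is deterministic."""
    data = segment.encode("utf-8")
    nonce = hmac.new(key, data, hashlib.sha256).digest()[:NONCE_LEN]
    cipher = bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))
    return (nonce + cipher).hex()


def decrypt_segment(key: bytes, token: str) -> str:
    """Reverse encrypt_segment using the nonce stored at the front of the token."""
    raw = bytes.fromhex(token)
    nonce, cipher = raw[:NONCE_LEN], raw[NONCE_LEN:]
    return bytes(a ^ b for a, b in zip(cipher, _keystream(key, nonce, len(cipher)))).decode("utf-8")


def obfuscate_path(key: bytes, path: str) -> str:
    """Split on '/' and '.', encrypt each segment, and keep the separators in place."""
    parts = re.split(r"([/.])", path)
    return "".join(p if p in ("/", ".", "") else encrypt_segment(key, p) for p in parts)


def deobfuscate_path(key: bytes, obfuscated: str) -> str:
    """Client-side inverse of obfuscate_path (hex tokens never contain '/' or '.')."""
    parts = re.split(r"([/.])", obfuscated)
    return "".join(p if p in ("/", ".", "") else decrypt_segment(key, p) for p in parts)
```

The sketch shows both trade-offs named in the notes: separators survive, so the depth and shape of the directory tree are visible, and files in the same folder share an identical encrypted prefix. Determinism is what makes server-side filtering by path possible without the server holding the key.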