Cloudflare Crawl API '/crawl' Explained with Examples
#api
#crawler
#cloudflare
Web crawling is an essential part of modern software development. Applications that rely on AI, search systems, research tools and analytics platforms often need structured information gathered from websites. Traditionally, developers built their own crawling systems using headless browsers and distributed infrastructure.
To simplify this process, Cloudflare Browser Rendering introduced the /crawl endpoint. This API allows developers to crawl an entire website starting from a single URL while automatically following links and extracting content. The results can be returned in multiple formats including HTML, Markdown and structured JSON.
Cloudflare Crawl API Endpoint
The Crawl API endpoint is the main URL developers use to start a website crawl using Cloudflare Browser Rendering. It allows you to send a request that begins crawling a website from a starting URL and automatically follows links across the site.
Endpoint URL
https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl
This endpoint is part of the Cloudflare Browser Rendering REST API and is used to initiate a crawl job that runs asynchronously in Cloudflare's infrastructure.
Required Fields in Cloudflare Crawl API
When using the Cloudflare Crawl API, the request body must include one required field. Without this field, the API cannot start a crawl job.
url (string)
The url parameter is the only required field needed to start a crawl job. It specifies the starting page that the crawler will visit first. From this URL, the crawler automatically discovers and follows links across the website based on your configuration.
How the Cloudflare Crawl API Works (Short Summary)
Using the Cloudflare Crawl API involves two simple steps:
- Initiate the crawl job: You send a POST request with the starting URL to the /crawl endpoint. Cloudflare immediately returns a job ID that identifies your crawl task.
- Request the crawl results: With the job ID, you send a GET request to check the status or retrieve the results of the crawl. The API response includes the pages crawled and their extracted content.
Crawl jobs run asynchronously, meaning they run in the background. You don't wait for the crawl to finish in a single request. Instead, you check back with the job ID until the crawl completes.
Cloudflare applies certain limits:
- A crawl can run for up to 7 days. If it doesn't finish in that time, the job is cancelled due to timeout.
- Once the crawl completes, its data stays available for 14 days before being deleted.
This two‑step process lets you start crawls quickly and retrieve results whenever they're ready, without managing crawling infrastructure yourself.
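The two‑step flow above can be sketched in JavaScript with fetch. This is a minimal sketch, not an official client: `accountId` and `apiToken` are placeholders, and the helper names are illustrative rather than part of the API.

```javascript
// Build the Browser Rendering crawl endpoint for an account,
// optionally pointing at a specific job ID.
function crawlEndpoint(accountId, jobId) {
  const base = `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl`;
  return jobId ? `${base}/${jobId}` : base;
}

// Step 1: start a crawl job and return the job ID from the response.
async function startCrawl(accountId, apiToken, url) {
  const response = await fetch(crawlEndpoint(accountId), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url }),
  });
  const data = await response.json();
  return data.result; // the crawl job ID
}

// Step 2: check the status or fetch results using the job ID.
async function getCrawl(accountId, apiToken, jobId) {
  const response = await fetch(crawlEndpoint(accountId, jobId), {
    headers: { Authorization: `Bearer ${apiToken}` },
  });
  return (await response.json()).result;
}
```

You would call `startCrawl` once, store the returned job ID, and then call `getCrawl` later — the two requests can be minutes or days apart.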
How to Initiate a Crawl Job (Cloudflare /crawl API)
To start crawling a website using Cloudflare's Crawl API, you send a POST request to the /crawl endpoint with the URL you want to begin crawling. The API immediately responds with a job ID that you will use later to check the status and retrieve the results of the crawl.
Curl Example : Start a Crawl Job
curl -X POST "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
-H "Authorization: Bearer <apiToken>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://developers.cloudflare.com/workers/"
}'
In this request:
- You call the /crawl endpoint under your Cloudflare account.
- You include your API token in the Authorization header.
- You send a JSON body with the required url field pointing to the page you want to crawl.
Example Response
After successful submission, Cloudflare returns a response like this:
{
"success": true,
"result": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
}
Here:
- success: true means the crawl job was created successfully.
- The result value is the crawl job ID : a unique identifier for your crawl.
You use this job ID in a later GET request to check the crawl status or retrieve the final crawl results.
Requesting Crawl Job Results (Cloudflare /crawl API)
After you start a crawl job with a POST request, Cloudflare processes the job in the background and returns a job ID. You use that job ID to check the
crawl status or get the actual results by sending a GET request to the same /crawl endpoint with the job ID appended.
Example GET Request to Fetch Crawl Status
curl -X GET "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e" \
-H "Authorization: Bearer YOUR_API_TOKEN"
In this request:
- You replace {account_id} with your Cloudflare account ID.
- You use the job ID you received from the initial POST request.
- You include your API token in the Authorization header.
When you send this request, the API returns a JSON response that includes a status field. This field tells you the current state of the crawl job.
Possible Crawl Job Statuses
The status field can contain any of the following values:
- running : The crawl is still in progress.
- cancelled_due_to_timeout : The job ran longer than the seven‑day limit and was automatically cancelled.
- cancelled_due_to_limits : The job was cancelled because it reached your Cloudflare plan's resource limits.
- cancelled_by_user : The job was manually cancelled by you.
- errored : An error occurred during the crawl.
- completed : The crawl finished successfully and you can retrieve the final data.
Once the job reaches the completed status, you can request the full crawl results (including crawled URLs and extracted content) from the same endpoint. You can also use optional query parameters like cursor, limit, or status to filter or paginate results.
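Since every status other than running means the job has stopped, a small helper makes polling logic explicit. The status strings come from the list above; the helper functions themselves are just a convenience sketch, not part of the API.

```javascript
// Statuses after which a crawl job will make no further progress.
const TERMINAL_STATUSES = new Set([
  "completed",
  "errored",
  "cancelled_by_user",
  "cancelled_due_to_timeout",
  "cancelled_due_to_limits",
]);

// Returns true once a job has stopped, successfully or not.
function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

// Returns true only when the full results are ready to fetch.
function isSuccessful(status) {
  return status === "completed";
}
```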
Polling for Completion (Cloudflare /crawl API)
Cloudflare crawl jobs run asynchronously, so they don't finish immediately after you send the POST request. To check when your crawl job is done, you
can poll the endpoint periodically using the job ID you received. Adding ?limit=1 helps keep the response small because you only need the status,
not all crawled content.
JavaScript Example: Poll Until Crawl Finishes
async function waitForCrawl(accountId, jobId, apiToken) {
const maxAttempts = 60;
const delayMs = 5000;
for (let i = 0; i < maxAttempts; i++) {
const response = await fetch(
`https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl/${jobId}?limit=1`,
{
headers: {
Authorization: `Bearer ${apiToken}`,
},
},
);
const data = await response.json();
const status = data.result.status;
if (status !== "running") {
return data.result;
}
// Wait before next check
await new Promise((resolve) => setTimeout(resolve, delayMs));
}
throw new Error("Crawl job did not complete within timeout");
}
In this function:
- The loop checks the crawl status every 5 seconds.
- It stops when the crawl job is no longer running.
- You can adjust maxAttempts and delayMs based on your preferences.
Fetch Full Crawl Results
Once the job reaches a terminal status (like completed), you can fetch the full results without the limit parameter. The API supports additional query options so you can control how you retrieve the results:
- cursor : Used for pagination if results exceed 10 MB.
- limit : Number of records to return per request.
- status : Filter results by URL status such as completed, queued, skipped etc.
Example: Fetch Paginated Crawl Results
curl -X GET "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}?cursor=10&limit=10&status=completed" \
-H "Authorization: Bearer YOUR_API_TOKEN"
This request retrieves results starting at the cursor position 10, with up to 10 records per page and only includes URLs whose status is completed.
Sample Paginated Response
The API returns a JSON object listing the crawl results and metadata:
{
"result": {
"id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e",
"status": "completed",
"browserSecondsUsed": 134.7,
"total": 50,
"finished": 50,
"records": [
{
"url": "https://developers.cloudflare.com/workers/",
"status": "completed",
"markdown": "# Cloudflare Workers\nBuild and deploy serverless applications...",
"metadata": {
"status": 200,
"title": "Cloudflare Workers · Cloudflare Workers docs",
"url": "https://developers.cloudflare.com/workers/"
}
},
{
"url": "https://developers.cloudflare.com/workers/get-started/quickstarts/",
"status": "completed",
"markdown": "## Quickstarts\nGet up and running with a simple Hello World...",
"metadata": {
"status": 200,
"title": "Quickstarts · Cloudflare Workers docs",
"url": "https://developers.cloudflare.com/workers/get-started/quickstarts/"
}
}
// …more entries
],
"cursor": 10
},
"success": true
}
This contains:
- metadata about the crawl job
- list of records with extracted content
- a cursor value for pagination when there are more results to fetch.
This step ensures your application can detect when a crawl job finishes and then fetch all the crawled data efficiently.
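The pagination flow described above can be sketched as a loop that follows the cursor until no more pages remain. This assumes the response shape shown in the sample (a result.records array and a result.cursor that is absent on the last page); treat it as an illustration under those assumptions rather than a definitive client.

```javascript
// Build the results URL with optional pagination/filter parameters.
function resultsUrl(accountId, jobId, { cursor, limit, status } = {}) {
  const url = new URL(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl/${jobId}`,
  );
  if (cursor !== undefined) url.searchParams.set("cursor", String(cursor));
  if (limit !== undefined) url.searchParams.set("limit", String(limit));
  if (status !== undefined) url.searchParams.set("status", status);
  return url.toString();
}

// Collect every record by following the cursor until it disappears.
async function fetchAllRecords(accountId, jobId, apiToken) {
  const records = [];
  let cursor;
  do {
    const response = await fetch(
      resultsUrl(accountId, jobId, { cursor, limit: 10, status: "completed" }),
      { headers: { Authorization: `Bearer ${apiToken}` } },
    );
    const { result } = await response.json();
    records.push(...result.records);
    cursor = result.cursor; // undefined once the last page is reached
  } while (cursor !== undefined);
  return records;
}
```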
Cancel a Crawl Job (Cloudflare /crawl API)
If you've started a crawl job with the Cloudflare Crawl API and decide you no longer want it to continue, you can cancel the job while it's in progress. Cancelling stops any future pages that were queued for crawling and updates the job status to indicate it was cancelled by the user.
How to Cancel a Crawl Job
To cancel a crawl job, send a DELETE request to the /crawl endpoint using the job ID you received when you initiated the crawl.
Example Terminal Command
curl -X DELETE "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}" \
-H "Authorization: Bearer YOUR_API_TOKEN"
In this example:
- Replace {account_id} with your Cloudflare account ID.
- Replace {job_id} with the crawl job ID you received from your earlier POST request.
- Include your API token in the Authorization header.
What Happens After Cancellation
- A successful cancellation returns a 200 OK response.
- The crawl job's status is updated to cancelled_by_user.
- All URLs that were queued but not yet crawled are dropped.
- You can still check the job status later if needed.
Cancelling a crawl is useful when you've queued a large crawl by mistake, want to save quota or no longer need the data before the crawl completes.
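The DELETE request above can be wrapped in a small JavaScript helper. This is a sketch assuming only what the section describes: a DELETE to the job URL returning 200 OK on success.

```javascript
// Cancel a running crawl job; resolves true on a 200 OK response.
async function cancelCrawl(accountId, jobId, apiToken) {
  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl/${jobId}`,
    {
      method: "DELETE",
      headers: { Authorization: `Bearer ${apiToken}` },
    },
  );
  return response.ok;
}
```

After cancelling, a later GET with the same job ID should report the status as cancelled_by_user.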
Optional Parameters for Cloudflare Crawl API
In addition to the required url field, Cloudflare's /crawl endpoint supports several optional parameters that let you customize the crawl behavior, limit scope, and control output formats. You include these in the same JSON body when you start a crawl job with a POST request.
| Optional Parameter | Type | Description |
|---|---|---|
| limit | Number | Maximum number of pages to crawl (default 10, max 100,000). |
| depth | Number | Maximum link depth to crawl from the starting URL (default 100,000, max 100,000). |
| source | String | Source for discovering URLs: all, sitemaps, or links. Default: all. |
| formats | Array of strings | Response formats. Default: ["html"]. Other options: markdown, json. JSON uses Workers AI extraction by default. |
| render | Boolean | Controls browser rendering: true executes JavaScript (default), false does a fast HTML fetch without rendering. |
| jsonOptions | Object | Required if formats includes "json". Contains properties like prompt, response_format, and custom_ai (same as /json endpoint). |
| maxAge | Number | Max seconds the crawler can reuse a cached resource before re‑fetching it (default 86,400s, max 604,800s). |
| modifiedSince | Number | Unix timestamp (seconds). Crawl only pages modified since this time. |
| options.includeExternalLinks | Boolean | If true, follows links to external domains (default false). |
| options.includeSubdomains | Boolean | If true, follows links on subdomains of the starting URL (default false). |
| options.includePatterns | Array of strings | Only visits URLs matching any of these wildcard patterns; supports * and **. |
| options.excludePatterns | Array of strings | Does not visit URLs matching these wildcard patterns; supports * and **. |
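As a rough illustration of how such wildcard patterns typically behave, the sketch below converts a pattern to a regular expression. The semantics here are an assumption based on common glob conventions — "*" matching within a single path segment and "**" matching across segments — and only approximate the API's own matcher.

```javascript
// Convert a wildcard pattern to a RegExp, assuming the common glob
// convention: "*" matches within one path segment, "**" matches
// across segments. An approximation, not Cloudflare's matcher.
function wildcardToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder for "**"
    .replace(/\*/g, "[^/]*")               // "*": within one segment
    .replace(/\u0000/g, ".*");             // "**": across segments
  return new RegExp(`^${escaped}$`);
}
```

Under these assumptions, a pattern like `https://example.com/docs/**` would match any URL under the docs tree, while `https://example.com/docs/*` would match only pages one level deep.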
Example with All Optional Parameters
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
-H 'Authorization: Bearer <apiToken>' \
-H 'Content-Type: application/json' \
-d '{
"url": "https://www.exampledocs.com/docs/",
"limit": 50,
"depth": 2,
"formats": ["markdown"],
"render": false,
"maxAge": 7200,
"modifiedSince": 1704067200,
"source": "all",
"options": {
"includeExternalLinks": true,
"includeSubdomains": true,
"includePatterns": [
"**/api/v1/*"
],
"excludePatterns": [
"*/learning-paths/*"
]
}
}'
Real‑World Use Cases
1. Documentation Site Crawl With Filters
When you need to crawl only the documentation section of a website and deliberately skip unnecessary sections, like a changelog or archive, you can use includePatterns and excludePatterns in your crawl request. These pattern filters let you precisely control which parts of a site are indexed, making the crawl more efficient and targeted.
This is particularly useful when building:
- Technical knowledge bases
- AI training datasets (e.g., for RAG systems)
- Documentation monitoring tools
The example below shows how to start a crawl that scans only the site's docs and omits irrelevant sections.
Example: Crawl Only Documentation Pages
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
-H 'Authorization: Bearer <apiToken>' \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/docs",
"limit": 200,
"depth": 5,
"formats": ["markdown"],
"options": {
"includePatterns": [
"https://example.com/docs/**"
],
"excludePatterns": [
"https://example.com/docs/changelog/**",
"https://example.com/docs/archive/**"
]
}
}'
What This Script Does
- Starting URL: Begins crawling at the documentation home (https://example.com/docs).
- limit: Sets a maximum of 200 pages to crawl so you don't exceed quota or get unnecessary content.
- depth: Limits how many levels deep the crawler will follow links (here up to 5).
- formats: Requests results in Markdown, which is ideal for docs and AI training.
- includePatterns: Ensures only URLs under the /docs/ path get crawled.
- excludePatterns: Prevents crawling of specific sub‑sections that you don't need (like “changelog” and “archive”).
Why Use Pattern Rules
Pattern matching helps you avoid crawling irrelevant or sensitive parts of a site. Even if a page is linked from the docs, exclude patterns ensure it's skipped if it matches one of the defined rules. This keeps your crawl results focused and reduces unnecessary processing.
2. Product Catalog Extraction with AI
When building e‑commerce tools, price trackers, catalogs or AI‑driven product analytics systems, developers often need structured product data (such as product names, prices, descriptions, availability and currency). Cloudflare's Crawl API makes this easier by combining crawling, rendering and AI‑powered JSON extraction into one call.
Instead of crawling the site and then running a separate extraction step, you can tell the Crawl API exactly what fields you want to extract and it will return structured JSON using Workers AI.
Crawl Request: Extract Product Data
Below is an example curl command that starts a crawl on a hypothetical shop's products page and extracts key product information using a JSON schema:
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
-H 'Authorization: Bearer <apiToken>' \
-H 'Content-Type: application/json' \
-d '{
"url": "https://shop.example.com/products",
"limit": 50,
"formats": ["json"],
"jsonOptions": {
"prompt": "Extract product name, price, description, and availability",
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "product",
"properties": {
"name": "string",
"price": "number",
"currency": "string",
"description": "string",
"inStock": "boolean"
}
}
}
},
"options": {
"includePatterns": [
"https://shop.example.com/products/*"
]
}
}'
What This Does
- Starting URL: Begins crawling at the products listing page.
- limit: Limits the crawl to 50 pages, enough to cover most basic product catalogs.
- formats: Specifies ["json"], so results are returned as structured JSON rather than plain HTML or Markdown.
- jsonOptions: Provides a prompt and a JSON schema to guide Workers AI in extracting exactly the fields you care about: product name, price, description, currency and availability.
- includePatterns: Ensures only URLs that match the “products” path pattern are included in the crawl.
Why This Matters
In traditional web scraping setups, you would need to:
- Crawl pages separately.
- Parse HTML with custom extraction logic.
- Normalize the data into a structured format.
With Cloudflare's Crawl API, you combine crawling, JavaScript rendering, link discovery and AI‑powered extraction in one API call, drastically simplifying your workflow. It's ideal for:
- E‑commerce analytics dashboards
- Price comparison tools
- AI‑powered product search & recommendations
- Automated catalog generation pipelines
By providing a prompt and schema, the Crawl API returns clean, typed JSON without custom scraping logic, which saves time and reduces engineering complexity.
3. Fast Static Content Fetch : Crawl Static Sites without Rendering
When you know that the content you need is already present in the initial HTML, such as on blogs, brochure sites or static documentation, you can tell the Crawl API to skip JavaScript rendering. This makes the crawl much faster and more efficient because Cloudflare doesn't launch a headless browser for each page.
Here's how you perform a fast static content fetch using the /crawl endpoint:
Example : Crawl Static HTML Pages
curl -X POST "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
-H "Authorization: Bearer <apiToken>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 100,
"render": false,
"formats": ["html", "markdown"]
}'
What This Request Does
- url: The starting page for the crawl.
- limit: Tells the API to crawl up to 100 pages.
- render: false: Disables JavaScript execution and performs a simple HTML fetch instead of spinning up a browser.
- formats: Requests results in both raw HTML and Markdown for easier content consumption or analysis.
By disabling rendering (render: false), the crawl becomes faster and more cost‑efficient, especially for static sites where the HTML doesn't depend on
client‑side JavaScript to load content.
When to Use render: false
- Your pages are fully rendered on the server
- You don't need the crawler to execute JavaScript
- You want quicker results and lower compute usage
- You're crawling simple static content like blogs or documentation
Cloudflare's /crawl endpoint will still follow links and extract content; it simply skips the headless browser step.
4. Crawl with Authentication (Cloudflare /crawl API)
When you need to crawl content that is behind HTTP authentication or requires custom headers (for example, API‑key access), Cloudflare's Crawl endpoint lets you include authentication credentials directly in your crawl request. This makes it possible to crawl protected pages (like internal docs or API endpoints) that would otherwise be inaccessible with a simple anonymous crawl.
I. Crawl with Basic HTTP Authentication
If the site you're crawling requires a username and password (standard HTTP basic auth), you can include an authenticate object in your request body. This tells the crawler to send the credentials as part of the request to the target site.
Example : Basic Auth Crawl
curl -X POST "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
-H "Authorization: Bearer <apiToken>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://secure.example.com",
"limit": 50,
"authenticate": {
"username": "user",
"password": "pass"
}
}'
In this example:
- url is the protected page you want to crawl.
- limit determines how many pages are crawled.
- authenticate contains the basic authentication credentials that Cloudflare will send when requesting the page.
This allows the crawler to access content that is normally blocked by a login prompt or HTTP basic authentication.
II. Crawl with Token‑Based or Custom Header Authentication
Some APIs or services require a token or custom HTTP header instead of basic auth. You can include these headers in the crawl request with the setExtraHTTPHeaders option.
Example : Token‑Based Auth Crawl
curl -X POST "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
-H "Authorization: Bearer <apiToken>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://api.example.com/docs",
"limit": 100,
"setExtraHTTPHeaders": {
"X-API-Key": "your-api-key"
}
}'
How this works:
- The crawler includes the X-API-Key header in every request it makes during the crawl.
- This allows it to authenticate with APIs or endpoints that expect token‑based authentication.
This technique is especially useful for crawling internal documentation portals, private APIs or content that returns data only when an API token or session token is present.
Why Authentication Options Matter
Many real‑world websites and APIs are protected behind authentication mechanisms. Without supporting credentials or headers, crawlers only return the public content. By providing credentials or custom headers:
- You can crawl login‑protected documentation
- You can index internal knowledge bases for AI systems
- You can extract data from API‑protected endpoints
These features make the Crawl API much more flexible and suitable for enterprise workflows where most content isn't publicly accessible.
5. Wait for Dynamic Content
Some modern websites, especially single‑page applications (SPAs), load their content after the initial HTML response through JavaScript. To ensure the Cloudflare crawler captures the fully rendered page instead of just an empty shell, you can use the gotoOptions and waitForSelector options in your crawl request.
Example : Crawl with Dynamic Content Wait
curl -X POST "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
-H "Authorization: Bearer <apiToken>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://app.example.com",
"limit": 50,
"gotoOptions": {
"waitUntil": "networkidle2",
"timeout": 60000
},
"waitForSelector": {
"selector": "[data-content-loaded]",
"timeout": 30000,
"visible": true
}
}'
What This Does
- gotoOptions.waitUntil: "networkidle2" - Tells the crawler to wait until network activity is mostly finished (no more than two pending requests). This helps ensure content loaded via JavaScript is fully available before extraction.
- gotoOptions.timeout: 60000 - Allows up to 60 seconds for the page to finish loading dynamic content.
- waitForSelector - Instructs the crawler to wait until a specific DOM element appears (in this case, one marked by data-content-loaded) before proceeding. This helps avoid capturing incomplete pages.
This approach is especially useful when crawling JavaScript heavy applications where important content is loaded asynchronously and not available in the initial HTML.
6. Block Unnecessary Resources (Speed Up Crawl)
When crawling a site where you only need text content (not images, videos, fonts or CSS), you can instruct Cloudflare's Crawl API to block specific resource types. This reduces network requests, speeds up crawling and lowers resource usage, which is especially helpful for large crawls focused on text extraction or structured data.
Example : Crawl While Blocking Images & Media
curl -X POST "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
-H "Authorization: Bearer <apiToken>" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 100,
"rejectResourceTypes": [
"image",
"media",
"font",
"stylesheet"
]
}'
What This Does
rejectResourceTypes tells the crawler to block requests for specific kinds of resources like:
- image : prevents loading JPG, PNG, GIF, SVG, etc.
- media : blocks audio and video files.
- font : stops web fonts from downloading.
- stylesheet : skips CSS files, which can speed up the crawl.
By blocking these, the crawler only requests text and essential document content, speeding up crawling and reducing processing costs. Blocking unnecessary resources is particularly useful when you only care about HTML structure, text or JSON output and not how the page looks visually. That way, the crawl focuses on what matters: the content, without wasting time fetching images or style files.
