Bring Back and Improve Scheduled Scraping for Knowledge Files (with Webhook Support)

Summary:
Reintroduce and upgrade the ability to schedule website scraping for Knowledge Files. Enhance it with better control, visibility, and webhook support—making it a reliable, flexible system for keeping AI agents synced with live external data.


The Problem:
Today, scraping a URL into a Knowledge File is a one-time event. If the source changes, the data goes stale and must be manually refreshed. That’s unsustainable for creators using Pickaxe with:

  • Dynamic web content (blogs, SOPs, live docs)

  • AI agents that depend on accuracy

  • Workflows that rely on data staying current

The old scheduling feature helped—but it disappeared. Now we’re asking not just for its return, but for it to be rebuilt right.


Requested Features:

1. Scheduling Options (Per URL):

When uploading or managing a website Knowledge File (a rough schedule sketch follows the list below):

  • Add scrape frequency: Manual, Daily, Weekly, Monthly

  • Allow creators to set time of day for scraping

  • Optionally retain or overwrite previous content (with version tagging)
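
A rough sketch of what such a per-URL schedule could look like as a data structure (field names are illustrative only, not an actual Pickaxe schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ScrapeFrequency(str, Enum):
    MANUAL = "manual"
    DAILY = "daily"
    WEEKLY = "weekly"
    MONTHLY = "monthly"


@dataclass
class ScrapeSchedule:
    """Hypothetical per-URL scrape schedule for a website Knowledge File."""
    source_url: str
    frequency: ScrapeFrequency = ScrapeFrequency.MANUAL
    run_at: str = "06:00"              # time of day, studio-local timezone
    overwrite_previous: bool = True    # False = keep prior versions
    version_tag: Optional[str] = None  # e.g. "2025-06-01" when retaining history


# Example: re-scrape a docs page daily at 6am, keeping tagged versions
schedule = ScrapeSchedule(
    source_url="https://example.com/docs/sop",
    frequency=ScrapeFrequency.DAILY,
    run_at="06:00",
    overwrite_previous=False,
    version_tag="auto",
)
```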

2. Webhook Trigger Support (New):

After a successful or failed scrape, Pickaxe should offer an outbound webhook option:

  • Trigger a custom URL (e.g., n8n, Zapier, Make, or internal system)

  • Include a payload: file ID, timestamp, scrape status, and a diff summary if applicable (see the sketch below)

  • Support automation such as:

    • Logging to a Notion dashboard

    • Sending a Slack message

    • Re-running an agent or regenerating a report

Example Use Case:
Scrape site at 6am → webhook hits n8n → n8n notifies Slack + updates a Google Sheet + pings an AI workflow.
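
A minimal sketch of what the receiving end of that webhook could look like (the payload fields mirror the list above but are assumptions, not a published Pickaxe schema; Flask stands in for whatever endpoint your n8n/Make/Zapier or internal system exposes):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route("/pickaxe/scrape-webhook", methods=["POST"])
def handle_scrape_webhook():
    """Receive a (hypothetical) post-scrape notification."""
    event = request.get_json(force=True)

    file_id = event.get("file_id")            # which Knowledge File was scraped
    timestamp = event.get("timestamp")        # when the scrape finished
    status = event.get("status")              # "success", "fail", or "skipped"
    diff_summary = event.get("diff_summary")  # present only when content changed

    # From here an automation could log to Notion, post to Slack,
    # update a Google Sheet, or re-run an agent.
    if status == "success" and diff_summary:
        print(f"{timestamp}: file {file_id} changed -> {diff_summary}")
    elif status == "fail":
        print(f"{timestamp}: scrape of file {file_id} failed, alerting owner")

    return jsonify({"received": True}), 200


if __name__ == "__main__":
    app.run(port=8000)
```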

3. After-Scrape Diff Reporting (Optional but Strongly Recommended):

Generate a report and optionally send it to the Studio Owner/Controller:

  • How many chunks were added/changed/removed

  • Diff summary in plain text (see the sketch after this list)

  • Timestamp + link to updated Knowledge File
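
The chunk-level counts and plain-text summary could be produced along these lines (a sketch using Python's difflib, assuming both scrapes are already split into chunks; the chunking itself is Pickaxe-internal):

```python
import difflib


def diff_summary(old_chunks: list[str], new_chunks: list[str]) -> dict:
    """Count added/changed/removed chunks between two scrapes and
    produce a short plain-text summary."""
    matcher = difflib.SequenceMatcher(a=old_chunks, b=new_chunks)
    added = changed = removed = 0

    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "insert":
            added += j2 - j1
        elif op == "delete":
            removed += i2 - i1
        elif op == "replace":
            changed += max(i2 - i1, j2 - j1)

    return {
        "added": added,
        "changed": changed,
        "removed": removed,
        "text": f"{added} chunks added, {changed} changed, {removed} removed",
    }


# Example: one chunk edited, one chunk appended
print(diff_summary(["intro", "pricing v1", "faq"],
                   ["intro", "pricing v2", "faq", "changelog"])["text"])
```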

4. Scrape Health Monitoring:

  • Log scrape attempts with status (success, fail, skipped)

  • Retry logic (e.g., up to 3 attempts with exponential backoff; see the sketch after this list)

  • Email or in-app alert if a scrape fails repeatedly
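
The retry behaviour could look roughly like this (a sketch only: three attempts with exponential backoff, where run_scrape stands in for whatever Pickaxe's scraper actually does):

```python
import logging
import time

logger = logging.getLogger("scrape-health")


def scrape_with_retries(url: str, run_scrape, max_attempts: int = 3) -> str:
    """Attempt a scrape up to max_attempts times, backing off 1s, 2s, 4s...
    Logs every attempt so failures show up in the scrape history."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = run_scrape(url)
            logger.info("scrape success url=%s attempt=%d", url, attempt)
            return result
        except Exception as exc:
            logger.warning("scrape fail url=%s attempt=%d error=%s", url, attempt, exc)
            if attempt == max_attempts:
                # After the final failure, alert the studio owner (email / in-app)
                logger.error("scrape giving up url=%s, alerting owner", url)
                raise
            time.sleep(2 ** (attempt - 1))  # 1s, 2s, 4s backoff
```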


Why This Matters:
This feature turns static data into living knowledge, essential for agents supporting real-time or frequently updated domains. Adding webhook support also unlocks serious integration power for automation-minded users—without forcing them into brittle scraping workarounds outside Pickaxe.


Final Thought:
Bring scraping back—but make it programmable, transparent, and reliable. Knowledge is only useful when it’s current. Let us keep it that way, on our own terms.

Hey @taedog2020, here’s more insight into why the scheduled scraping was rolled back in V2:

That’s why I said add a webhook: have n8n or something similar do the scrape and update the KB.

@taedog2020 You can do that now. Just add a webhook or connect an MCP server and configure your n8n or Make scenario. It’s already possible within Pickaxe.

So it will dynamically update the KB in the studio?

That depends on your scenario. You can set up a scenario with a proxied scraper and set refresh intervals that trigger a website scrape.

For example:
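
A minimal sketch of such a scenario in plain Python, standing in for an n8n/Make flow (the Pickaxe update endpoint and API key below are placeholders, not a documented API; the idea is simply: on a schedule, scrape the page and push the fresh content into the KB):

```python
import time

import requests

SOURCE_URL = "https://example.com/docs"                              # page to keep in sync
KB_UPDATE_URL = "https://example-pickaxe-endpoint.invalid/kb/update"  # placeholder, not a real endpoint
API_KEY = "YOUR_API_KEY"                                             # placeholder credential
REFRESH_SECONDS = 24 * 60 * 60                                       # e.g. re-scrape once a day


def scrape(url: str) -> str:
    """Fetch the page; a real scenario would route this through a proxied scraper."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def push_to_knowledge_base(content: str) -> None:
    """Send the scraped content to the studio's KB update hook (placeholder)."""
    requests.post(
        KB_UPDATE_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"content": content},
        timeout=30,
    )


if __name__ == "__main__":
    while True:
        push_to_knowledge_base(scrape(SOURCE_URL))
        time.sleep(REFRESH_SECONDS)
```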