Regarding web pages that cannot be imported as knowledge

I can’t scrape the client’s website. Is there a good solution?

Hi @avakero ,
Thank you for reaching out. I took a closer look at the issue you are seeing with importing https://kuyou-qa.com/ into your knowledge base.

When adding the homepage URL directly, the importer returns a “failed to fetch” error. This typically happens when the root page is slow to respond or times out during the initial single-page fetch request.

However, when I tested one of the subpages instead, the import completed successfully. For example, using:

https://kuyou-qa.com/compare-funeral-types-and-services/

allowed the importer to scrape that page and also retrieve the homepage as part of the crawl.

Even when the homepage scrape itself fails, starting from a subpage often still lets the crawler gather the main site’s content for the knowledge base.
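If you want to check whether the homepage itself is the slow link, here is a minimal sketch for timing page responses. This is my own script, not a Pickaxe tool, and the 10-second threshold is an assumption about the importer's timeout, not a documented limit:

```python
# A minimal sketch (not a Pickaxe tool; the 10-second timeout is an assumed
# threshold) for timing how quickly each page responds, to spot URLs that
# are likely to fail the importer's initial fetch.
import time
import urllib.request

FETCH_TIMEOUT = 10.0  # assumed threshold in seconds

def classify(elapsed, timeout=FETCH_TIMEOUT):
    """Label a response time as 'ok' or 'slow' relative to the timeout."""
    return "ok" if elapsed < timeout else "slow"

def time_fetch(url, timeout=FETCH_TIMEOUT):
    """Fetch a URL, returning (HTTP status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1024)  # read a little to confirm the server is sending data
            return resp.status, time.monotonic() - start
    except Exception:
        return None, time.monotonic() - start

# Example: compare the homepage against a subpage that imported successfully.
# for url in ("https://kuyou-qa.com/",
#             "https://kuyou-qa.com/compare-funeral-types-and-services/"):
#     status, elapsed = time_fetch(url)
#     print(url, status, f"{elapsed:.1f}s", classify(elapsed))
```

If the homepage comes back "slow" while subpages come back "ok", that matches the behavior I saw when testing your site.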

Please try importing one of the subpages rather than the homepage URL. If you continue to run into issues, please let us know.


Hi Danny,

Thank you for the detailed investigation.

The page I’d like Pickaxe to learn is this one: (https://kuyou-qa.com/all-qa/)

I created a simple page based on your advice and added a link to it, but unfortunately it seems to time out and fail. (Knowledge base for AI - 葬Qナビ, a comprehensive Q&A site on funerals, memorial services, and end-of-life planning)

New articles are added to this site every day. Do you have any advice on the best way to manage the knowledge base, including future maintenance?

According to the client, they’re anticipating around 20,000 articles over the next two years. While it’s quite surprising, is it realistically possible to handle this volume? (Is there actually a way to manage this…?)

Here’s a status update.

When I enter the URL in column 1, the title in column 2, and the content in column 3 in the spreadsheet, it now responds correctly. (It didn’t work when columns 1 and 2 were reversed.)
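For reference, here’s a small sketch of the column layout that finally worked for me. The header names and file name are my own choices; as far as I can tell, only the column order matters:

```python
# A minimal sketch of the column order that worked: URL in column 1, title in
# column 2, page content in column 3. The header names and file name are
# illustrative only.
import csv

rows = [
    ("https://kuyou-qa.com/all-qa/",
     "All Q&A index",
     "Paste the full article text here..."),
]

with open("knowledge.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "content"])  # columns 1, 2, 3 in this order
    writer.writerows(rows)
```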

I’ve decided to give up on scraping URLs from large-scale websites.

Thank you for your help.

Hi @avakero,
Thank you for the follow-up message. I want to clarify a few things, especially in case my earlier explanation caused confusion.

When I suggested starting from a subpage rather than the homepage, I meant only that a slow or timing-out homepage can sometimes block the initial fetch, while a fast-loading subpage may allow the scraper to begin its crawl. In my case, adding:

https://kuyou-qa.com/compare-funeral-types-and-services/

successfully allowed the importer to collect many pages across the site, including /all-qa/ and the homepage.


From your latest message, it looks like you are now experimenting with managing content using a spreadsheet. This approach can work well if you paste the actual content of the subpages into the spreadsheet, because Pickaxe will read the text directly. However, if the spreadsheet relies on the URLs themselves to fetch page content, it will not bypass the same loading and timeout limitations that affect web scraping.
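If you do go the spreadsheet route, one workaround is to fetch and strip each article yourself before pasting the text into the content column. This is a rough sketch of an assumed workflow, not a Pickaxe feature, and it uses only a basic HTML-to-text pass:

```python
# A rough sketch of an assumed workflow (not a Pickaxe feature): fetch each
# article yourself, strip the HTML down to plain text, and paste that text
# into the spreadsheet's content column so the import never depends on
# Pickaxe fetching the URL.
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def page_text(html):
    """Reduce an HTML document to its visible text, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

def fetch_text(url, timeout=30):
    """Download a page and return its visible text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return page_text(resp.read().decode("utf-8", errors="replace"))
```

Because you control the fetch here, a slow page just means the script waits longer; it does not fail the import the way the built-in scraper does.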