Is there any hidden setting, or can we get one, for recursive search of documentation sites and/or YouTube channels, like wget has had for decades?
I added a docs site to my KB, and it came out as only 3 chunks, which I can only imagine means that just that one page got read; the low quality of the answers seems to confirm it.
@kenlyle can you explain what you mean when you say you added a docs site to your KB? Would love to help troubleshoot/provide clarification!
Hey Lindsay, Thanks for asking.
I added a docs site, generically, a documentation site, specifically https://docs.nocode-x.com/
And I really thought…it only makes sense to me that it MUST be recursive, because that’s the only thing that makes sense in a lot of cases where there is truly a knowledge base. It’s rare for a KB to be one page.
So, since there is no option, I figured it would obviously ask me how many levels deep to go, like the wget command has done since forever, going back almost 30 years to site-grabber software.
But it seems like I only got one page, 3 chunks…
I expected a lot more from this much vaunted feature, but maybe I missed a configuration switch or something?
Hi @kenlyle, thanks for the additional information! I took a look at this specific situation. Our Knowledge Base is generally able to scrape multiple site pages using the sitemap, but it looks like it had some issues doing so with your docs site, so I’m going to pass this along to our engineering team to see if they can get a fix.
I will keep you posted!
Thanks very much! That’s great followup.
The issue to me with this feature is recursiveness, as I mentioned. Here is the reference:
Wget recursive download is controlled primarily with the -r or --recursive option, which enables downloading of a specified URL and all its linked pages and resources recursively. There are several important options commonly used with recursive downloads:
-r or --recursive: Enables recursive fetching.
-l or --level=: Sets the maximum recursion depth (default is 5).
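To make the reference concrete, here is what a recursive wget grab of the docs site in question might look like. This is a hedged sketch, not a recommendation of specific values: the depth of 2 and the domain restriction are illustrative choices, not anything Pickaxe actually does.

```shell
# Mirror a docs site recursively, two levels deep, without climbing to
# parent directories and without wandering off to other hosts.
# (Illustrative flags; docs.nocode-x.com is the site discussed in this thread.)
wget --recursive --level=2 --no-parent \
     --domains docs.nocode-x.com \
     https://docs.nocode-x.com/
```

With `--level` omitted, wget defaults to a depth of 5, which is exactly the kind of knob being asked for here.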
It’s only sensible and reasonable, since resources are finite, that Pickaxe should give us, the creators and clients, the opportunity to specify what the -r and/or -l values should be.
To be clear, the converse (NOT giving the user control by omitting this setting, when you guys should be using wget to fetch the files) is nonsensical.
Please ask your devs to implement this during their review.
@kenlyle thanks for the additional context! Generally, the KB is able to use an individual link to scrape the pages visible on a website’s sitemap; it seems as though that didn’t happen in this case, so I’ve asked our engineers to figure out why and I’ll let you know when I have more information!
I know @nathaniel could solve this very quickly. Consumerish versions of these tools have existed for about 30 years. It seems that wget would solve it, too…and each Pickaxe seems to run in a container where wget is installed…
It’s been over a week on this.
Looking at my Pickaxe, the Knowledge item for https://docs.nocode-x.com/ dropped from 3 chunks to 1, so it was moving backward.
By deleting the item and re-adding it, I was able to get back to 3 chunks, but that can’t include the child pages. A couple of minutes later, after refreshing the Editor, it’s still at 3 chunks, and there is no indication that any kind of crawler is running.
I could scrape the sitemap, and either manually or via RPA, get the pages imported into the KB, but I am sure nobody wants that.
What’s the exact issue with this particular site? Or is it actually with the Pickaxe platform?
@abhi or @lindsay_support or @danny_support
I did find that the target site did not have a robots.txt file, and had the vendor add that.
But on re-adding the site, https://docs.nocode-x.com/, it still comes in at only 3 chunks.
Hi @kenlyle,
You should update https://docs.nocode-x.com/sitemap.xml and fill it with the doc URLs, or remove it completely.
The pages listed there return a 404 status code.
Also, our crawler doesn’t work asynchronously; it scrapes URLs from the sitemap on the spot.
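This is easy to check from the command line: pull the sitemap, extract each `<loc>` URL, and look at the status codes. A rough sketch, assuming curl, grep, and sed are available; a 404 on a listed page is exactly what would make a sitemap-based crawler drop it.

```shell
# Extract <loc> entries from the sitemap and report each page's HTTP status.
curl -s https://docs.nocode-x.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed 's/<[^>]*>//g' \
  | while read -r url; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
      echo "$code $url"   # anything other than 200 deserves a look
    done
```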
Thanks, Stephen. I never imagined such an issue. I have contacted the guilty party.
Can you please take a feature request that the crawler be a bit more informative? Could it report back on issues like a missing robots.txt and/or sitemap.xml, even though those should be extremely rare? From the user side, it would be really nice to know why a scrape failed. I would have loved to have had that information to get this solved a week ago.
@stephenasuncion Thanks. Now, with the proper sitemap.xml in place, I see 50 docs, which is a kind of magic number, so even though it doesn’t say so in the UI, I guess that’s the maximum number of docs scraped by the crawler?
There is a little oddity in how the files are processed…it goes from 50 to 41 on the first click of the Add Docs button, then seems to add the rest on the next click.
Then, when the batch finishes processing, none of the new files is selected. I clicked them all manually, and there were only 31.
There’s no UI that reports on any of the actions mentioned.
It would be a kindness from the team if the files added in a batch scraper were selected by default, I think, but I don’t know the processing/indexing or storage costs, so maybe it’s not practical.
Overall, I have to give my experience to date with the KB/crawler a C-, and that’s just to avoid saying “sucks”. It’s another half-implemented, undocumented, non-communicative feature.
I am not an expert UI designer, but there are some obvious opportunities here for improvement, starting with -
-”We only scrape 50 documents at a time, would you like us to attempt the next 50?”
-”You are requesting to add 14.7 MB to your KB and your maximum per Pickaxe is XYZ, of which you have #####.# MB remaining. Continue?”
-”31 documents of the 50 that you scraped (yeah, looks like we dropped 19 for some reason) are pending addition to the KB- Do you want to add all of them?”
-”Would you be interested in our AI assessment of the relevance of each of these documents that you might omit some of the less relevant?”
not an exhaustive list…
Hi @kenlyle! Yes, to clarify, 50 is the max number of pages you can scrape with one link through the sitemap. If you’re not seeing all of these pages in your chunks, could you provide documentation of that so we can take a look?
As for a potential pop-up, it’s an interesting idea - feel free to add to the Feature Requests area and see if you can get some upvotes!
Thanks, @lindsay_support
I created “Fix the Knowledge Base UI when adding a website”
And this, which is a separate issue.
