Some Background (For Context :): I’ve been stress-testing the new RSS import feature. My RSS feed has 500 blog posts, and I’m working to import them into the Knowledge section of a pickaxe.
My web host limits concurrent connections to 100, so about 400 of the imported pages came back as a single chunk that says `{'message': 'Maximum concurrency allowed 100'}`.
So I imported the feed again. This time it scraped more pages successfully, but it still hit errors, and it also re-scraped some pages it had already imported. After a few rounds of this, I have most of my blog posts imported, but I also have a bunch of duplicates.
Question: Is there an automated way to remove the duplicates, or to skip scraping URLs that are already in Knowledge?
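In the meantime, the workaround I’m considering is pre-filtering the feed myself before each retry. This is just a sketch of the idea, and it assumes I can get a plain list of the URLs already in Knowledge (e.g. copied out of the dashboard); `filter_new_urls` and `normalize` are names I made up, not anything from Pickaxe:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    # Normalize a URL so trivial variants compare equal:
    # lowercase the scheme and host, drop a trailing slash and fragment.
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

def filter_new_urls(feed_urls, known_urls):
    # Return only the feed URLs not already imported,
    # preserving feed order and dropping in-feed duplicates too.
    seen = {normalize(u) for u in known_urls}
    new = []
    for url in feed_urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            new.append(url)
    return new
```

So each retry would only submit the leftover URLs (keeping the batch under the 100-connection limit), instead of re-scraping the whole feed and piling up duplicates.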