Crawlee seems to be getting a cached version of an XML file

I'm starting my crawler with the first request being https://site.com/sitemap.xml. Then I read all the URLs in the sitemap and check their modified dates (the website does update the modified date in the sitemap), and only crawl the pages that were modified. The problem is that the crawler in production does this once every hour, and it always gets the same version of the sitemap.xml. If I run it after a while on my PC, it finds modified URLs, crawls the pages, and gets the updates. I'm enqueuing the XML with await crawler.run([{url: "sitemap.xml", "label": "SITEMAP"}]); Is there a way to add headers and prevent caching here?
8 Replies
sensitive-blue
sensitive-blue3y ago
try using a residential proxy maybe
unwilling-turquoise
unwilling-turquoiseOP3y ago
Isn't there a way to set headers for the crawlee request?
fair-rose
fair-rose3y ago
You can do that in pre-navigation hooks, or add skipNavigation: true to the request object when enqueuing and manually send the request in the route handler via sendRequest from the context object provided in the handler arguments. Not sure what the production env is in your case — is that the Apify platform? The easiest way to verify whether it is a cache issue is by adding a query string to the request; this way the cache will be invalidated in most cases.
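A minimal sketch of the first idea: attach no-cache headers to the sitemap request when enqueuing it. This assumes Crawlee's request objects accept a headers field (they do per its RequestOptions); note that HTTP-based crawlers such as CheerioCrawler send these headers directly, while browser-based crawlers may need a pre-navigation hook instead. The sitemapRequest helper name is made up for illustration.

```javascript
// Sketch: build the sitemap request with standard no-cache HTTP headers
// before handing it to crawler.run(). Whether the headers are honored
// depends on the crawler type (HTTP vs. browser) and any proxy in between.
function sitemapRequest(url) {
  return {
    url,
    label: "SITEMAP",
    headers: {
      // Ask intermediate caches and the origin for a fresh copy.
      "Cache-Control": "no-cache",
      "Pragma": "no-cache",
    },
  };
}

const req = sitemapRequest("https://site.com/sitemap.xml");
console.log(req.headers["Cache-Control"]); // "no-cache"
```

Usage would then be await crawler.run([sitemapRequest("https://site.com/sitemap.xml")]); in place of the inline object from the original question.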
unwilling-turquoise
unwilling-turquoiseOP3y ago
Sorry, I didn't mention that. But that's what I did to validate it's a cache issue: I added sitemap.xml?random=RANDOM and it worked. However, not all websites I crawl support adding random query strings. Some give me an error if I add a query string the site is not expecting.
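For the sites that do tolerate unknown query parameters, the cache-busting trick can be done with the standard URL API rather than string concatenation, which also keeps any existing query string intact. The parameter name _ts here is arbitrary, not anything Crawlee-specific:

```javascript
// Append a timestamp query parameter so caches treat each fetch as a new URL.
// Only safe on sites that ignore unknown query parameters.
function withCacheBuster(rawUrl, param = "_ts") {
  const u = new URL(rawUrl);
  u.searchParams.set(param, Date.now().toString());
  return u.toString();
}

console.log(withCacheBuster("https://site.com/sitemap.xml"));
```

The enqueued request would then be { url: withCacheBuster("https://site.com/sitemap.xml"), label: "SITEMAP" }.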
fair-rose
fair-rose3y ago
Have you already tried setting Cache-Control headers?
fair-rose
fair-rose3y ago
Found interesting info in docs (https://playwright.dev/docs/api/class-page#page-route): "Enabling routing disables http cache."
Not sure if that works. You may try this as well: page.route('**', route => route.continue());
fair-rose
fair-rose3y ago
But if that approach works, use your own glob pattern instead of the catch-all wildcard, of course.
correct-apricot
correct-apricot3y ago
In the Apify cloud, every run gets non-cached results from the start, since an actor instance is created on each run and destroyed on finish — there is no "cache". If you are getting cached output on your own server, make sure the actor is executed by the Apify CLI as "apify run -p".
