Crawlee seems to be getting a cached version of an XML file

I'm starting my crawler with the first request being https://site.com/sitemap.xml. Then I read all the URLs in the sitemap and check their modified dates (the website does update the modified date in the sitemap), and only crawl the pages that were modified. The problem is that the crawler in production does this once every hour, and it always gets the same version of the sitemap.xml. If I run it after a while on my PC, it finds modified URLs, crawls the pages, and gets the updates. I'm enqueuing the XML with await crawler.run([{url: "sitemap.xml", "label": "SITEMAP"}]); Is there a way to add headers and prevent caching here?
8 Replies
sensitive-blue
sensitive-blue3y ago
try using a residential proxy maybe
unwilling-turquoise
unwilling-turquoiseOP3y ago
Isn't there a way to set headers for the crawlee request?
fair-rose
fair-rose3y ago
You can do that in pre-navigation hooks, or add skipNavigation: true to the request object when enqueuing and manually send the request in the route handler via sendRequest from the context object provided in the handler arguments. Not sure what the production env is in your case — is that the Apify platform? The easiest way to verify whether it is a cache issue is by adding a query string to the request; this way the cache will be invalidated in most cases.
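A minimal sketch of the first idea: attach no-cache headers to the sitemap request when enqueuing it. This assumes Crawlee's request objects accept a headers field (they do per its RequestOptions); note that HTTP-based crawlers such as CheerioCrawler send these headers directly, while browser-based crawlers may need a pre-navigation hook instead. The sitemapRequest helper name is made up for illustration.

```javascript
// Sketch: build the sitemap request with standard no-cache HTTP headers
// before handing it to crawler.run(). Whether the headers are honored
// depends on the crawler type (HTTP vs. browser) and any proxy in between.
function sitemapRequest(url) {
  return {
    url,
    label: "SITEMAP",
    headers: {
      // Ask intermediate caches and the origin for a fresh copy.
      "Cache-Control": "no-cache",
      "Pragma": "no-cache",
    },
  };
}

const req = sitemapRequest("https://site.com/sitemap.xml");
console.log(req.headers["Cache-Control"]); // "no-cache"
```

Usage would then be await crawler.run([sitemapRequest("https://site.com/sitemap.xml")]); in place of the inline object from the original question.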
unwilling-turquoise
unwilling-turquoiseOP3y ago
Sorry, I didn't mention that. But that's what I did to validate it's a cache issue: I added sitemap.xml?random=RANDOM and it worked. However, not all websites I crawl support adding random query strings. Some give me an error if I add a query string the site is not expecting.
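For the sites that do tolerate unknown query parameters, the cache-busting trick can be done with the standard URL API rather than string concatenation, which also keeps any existing query string intact. The parameter name _ts here is arbitrary, not anything Crawlee-specific:

```javascript
// Append a timestamp query parameter so caches treat each fetch as a new URL.
// Only safe on sites that ignore unknown query parameters.
function withCacheBuster(rawUrl, param = "_ts") {
  const u = new URL(rawUrl);
  u.searchParams.set(param, Date.now().toString());
  return u.toString();
}

console.log(withCacheBuster("https://site.com/sitemap.xml"));
```

The enqueued request would then be { url: withCacheBuster("https://site.com/sitemap.xml"), label: "SITEMAP" }.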
fair-rose
fair-rose3y ago
Have you already tried setting Cache-Control headers?
fair-rose
fair-rose3y ago
Found interesting info in docs (https://playwright.dev/docs/api/class-page#page-route): "Enabling routing disables http cache."
Not sure if that works. You may try this as well: page.route('**', route => route.continue());
fair-rose
fair-rose3y ago
But if that approach works, use your own glob pattern instead of the catch-all wildcard, of course.
correct-apricot
correct-apricot3y ago
In the Apify cloud, every run gets non-cached results from the start, since an actor instance is created on each run and destroyed on finish — there is no "cache". If you are getting cached output on your own server, make sure the actor is executed by the Apify CLI as "apify run -p".
