Concurrent crawlers or maxRequests per Queue?
accessing RequestQueue/RequestList for scraper
taking list of scraped urls and conducting multiple new scrapes
PlaywrightCrawler New Instance unexpected result
When the getAvailableURLs function is called, a new instance of the PlaywrightCrawler class is created and used to crawl the provided URL.
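A minimal sketch of that pattern, assuming Crawlee's current JS API (the link selector and the returned array are illustrative, not from the post):

```ts
import { PlaywrightCrawler } from 'crawlee';

// Sketch of the pattern described above: each call builds a fresh crawler.
// Note that crawlers created without an explicit RequestQueue share the
// process-wide default queue, so a second instance may see a URL as
// already handled — a common source of "unexpected results".
async function getAvailableURLs(url: string): Promise<string[]> {
    const urls: string[] = [];
    const crawler = new PlaywrightCrawler({
        async requestHandler({ page }) {
            // Collect the href of every anchor on the page.
            const hrefs = await page.$$eval('a[href]', (els) =>
                els.map((el) => (el as HTMLAnchorElement).href),
            );
            urls.push(...hrefs);
        },
    });
    await crawler.run([url]);
    return urls;
}
```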
push Dataset but got nothing
browserType.launchPersistentContext: Browser closed
executablePath: '/tmp/chromium/chrome-linux/chrome'. This Chrome executable was downloaded from Playwright's hosted builds, so I didn't think there would be a compatibility issue: https://playwright.azureedge.net/builds/chromium/1060/chromium-linux.zip
Extra context: I'm running this in an AWS Lambda (x86_64).
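For reference, a sketch of wiring that executable into PlaywrightCrawler; the extra Chromium flags are common Lambda workarounds I'm adding, not something from the original post:

```ts
import { PlaywrightCrawler } from 'crawlee';

// Point Crawlee's Playwright launcher at the bundled Chromium binary.
// In Lambda, "Browser closed" right after launch is often a missing
// shared library or /dev/shm issue rather than an incompatible binary.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            executablePath: '/tmp/chromium/chrome-linux/chrome',
            args: ['--no-sandbox', '--disable-dev-shm-usage', '--single-process'],
        },
    },
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});
```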
...help on doing a CheerioCrawler scrape and then taking that list of URLs and conducting a scrape
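One possible shape for that two-stage flow; the start URL and the link selector are placeholders:

```ts
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// Stage 1: a cheap CheerioCrawler pass that only collects URLs.
const collected: string[] = [];
const listCrawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        $('a.result-link').each((_, el) => {
            const href = $(el).attr('href');
            if (href) collected.push(new URL(href, request.loadedUrl ?? request.url).href);
        });
    },
});
await listCrawler.run(['https://example.com/listing']);

// Stage 2: feed the collected URLs into a second crawler.
const detailCrawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        await pushData({ url: page.url(), title: await page.title() });
    },
});
await detailCrawler.run(collected);
```

If both stages can live in one crawl, enqueueLinks with labeled requests and a router is usually the more idiomatic route.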
change proxies while running
PlaywrightCrawler in AWS Lambda
Is the Playwright Firefox Docker image usable with PlaywrightCrawler?
What optimizations work for you?
Cheerio's innerText sometimes returns corrupted content
$('body').prop('innerText'). Namely, the returned content is not always the same.
I've opened a GitHub issue for this and created a separate repository for easy reproduction. I wanted to mention this issue here on Discord as well; maybe we can discuss possible solutions more easily in an informal manner.
Link to the issue: https://github.com/apify/crawlee/issues/1898
Link to the repo for reproduction steps: https://github.com/tsopeh/crawlee-innertext-repro...
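A rough repro sketch of the call in question; the target URL is a placeholder, and the uniqueKey overrides only serve to fetch the same page repeatedly:

```ts
import { CheerioCrawler } from 'crawlee';

// Fetch the same page several times and compare what innerText returns.
const variants = new Set<string>();
const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const text = $('body').prop('innerText') as string;
        variants.add(text);
        console.log(`${request.url}: innerText length ${text.length}`);
    },
});

// Distinct uniqueKeys keep the request queue from deduplicating the runs.
const requests = Array.from({ length: 5 }, (_, i) => ({
    url: 'https://example.com',
    uniqueKey: `repro-${i}`,
}));
await crawler.run(requests);
console.log(`Distinct innerText variants observed: ${variants.size}`);
```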
Failed to parse URL from [object Object]
getting ERR_CERT_AUTHORITY_INVALID with Playwright
ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_CERT_AUTHORITY_INVALID at 'MY_URL'
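If the failure comes from a proxy or an internal CA, one common workaround is letting the browser skip certificate validation; a sketch, assuming that risk is acceptable for the target site:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Accepted by Playwright's persistent-context launcher,
            // which Crawlee uses under the hood.
            ignoreHTTPSErrors: true,
        },
    },
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});
```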
Map maximum size exceeded
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Map maximum size exceeded
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Map maximum size exceeded
Crawlee doesn't process newly enqueued links via enqueueLinks
?page=1 gets processed but the enqueued page (via enqueueLinks) doesn't; Crawlee states that it has processed all links (1 of 1). I have confirmed that has_next is indeed true and that enqueueLinks gets called.
Am I missing something obvious?
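Possibly relevant: enqueueLinks deduplicates on the request uniqueKey, so a "next" link that normalizes back to an already-seen URL is silently dropped. A sketch (the pagination selector is a guess) that logs whether that is what's happening:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        const { processedRequests } = await enqueueLinks({
            selector: 'a.next-page',
        });
        for (const r of processedRequests) {
            // wasAlreadyPresent means the queue dropped the link as a duplicate.
            log.info(`${r.uniqueKey}: alreadyPresent=${r.wasAlreadyPresent}`);
        }
    },
});
await crawler.run(['https://example.com/items?page=1']);
```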
Getting the parent URL while executing inside the requestHandler for Crawlee
networkidle2 option
I am looking for a Python & data processing expert (long term)
Got captcha and HTTP 403 using PlaywrightCrawler
