Geonode Proxies
Hey, I'm having some trouble trying to use the proxies provided by Geonode.
import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        "http://{username}:{password}@rotating-residential.geonode.com:9010",
    ],
});

logger.info("Setting up crawler.");
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // ...request handlers and other options omitted
});
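(A quick way to rule out the proxy string itself is to ask Crawlee for the URL it resolved; a minimal sanity check, assuming the proxyConfiguration from the snippet above:)

    // newUrl() is part of Crawlee's ProxyConfiguration API; it returns
    // the proxy URL that will be handed to the browser.
    const proxyUrl = await proxyConfiguration.newUrl();
    logger.info(`Using proxy: ${proxyUrl}`);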
This is the error I'm getting when Crawlee tries to enqueue the first URL fed to it:
Error in default url (array urls) Expected property values to be of type string but received type null in object options
When I try to run without proxies, everything works fine. Also, the username and password variables are replaced with proper data.
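One common gotcha when substituting credentials into a proxy URL: special characters in the password (for example "@" or ":") must be URL-encoded, or the URL will not parse. A hypothetical sketch, with GEONODE_USERNAME and GEONODE_PASSWORD as made-up environment variable names:

    // Hypothetical env vars; encodeURIComponent keeps special characters
    // in the credentials from breaking the proxy URL.
    const username = encodeURIComponent(process.env.GEONODE_USERNAME ?? "");
    const password = encodeURIComponent(process.env.GEONODE_PASSWORD ?? "");
    const proxyConfiguration = new ProxyConfiguration({
        proxyUrls: [
            `http://${username}:${password}@rotating-residential.geonode.com:9010`,
        ],
    });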
fascinating-indigo•3y ago
Are you running it on the platform?
Can you share the link, please?
Otherwise, please provide a screenshot of the log with the full error description.
crude-lavenderOP•3y ago
Could you explain what running it on the platform would mean?
This is the complete log:
error: Error in default url (array urls) Expected property values to be of type string but received type null in object options {"crawlUUID":"b7fdb85d-26a6-46f2-9c0d-d3874426b459","name":"ArgumentError","stack":"ArgumentError: (array urls) Expected property values to be of type string but received type null in object options\n at ow (/Users//Desktop/REBI-scraper/node_modules/ow/dist/index.js:33:28)\n at enqueueLinks (/Users//Desktop/scraper/node_modules/@crawlee/core/enqueue_links/enqueue_links.js:93:22)\n at browserCrawlerEnqueueLinks (/Users/Desktop/REBI-scraper/node_modules/@crawlee/browser/internals/browser-crawler.js:409:37)\n at runNextTicks (node:internal/process/task_queues:60:5)\n at process.processImmediate (node:internal/timers:442:9)\n at process.callbackTrampoline (node:internal/async_hooks:130:17)\n at async Object.enqueue (file:///Users//Desktop/REBI-scraper/dist/crawler/sites/alo/scrape.js:15:9)\n at async file:///Users//Desktop/scraper/dist/crawler/routes.js:51:13\n at async wrap (/Users//Desktop/REBI-scraper/node_modules/@apify/timeout/index.js:52:21)","timestamp":"2023-04-14T14:35:59.812Z","validationErrors":{}}
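(For context, the stack shows the error originating in ow, the library enqueueLinks uses to validate its options: any null entry in the urls array fails the string check before crawling even starts. A minimal sketch that reproduces it, assuming a Crawlee request handler context:)

    // A null entry in `urls` makes enqueueLinks' option validation (ow)
    // throw the ArgumentError above before any request is enqueued.
    await enqueueLinks({ urls: [null] }); // throws ArgumentError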
Also, it is important to mention this part of the code:
// Wait for the pagination "next" link to appear.
await page.waitForSelector(
    ".PaginationContainer--bottom .Pagination-item--next > .Pagination-link"
);
const nextButton = page.locator(
    ".PaginationContainer--bottom .Pagination-item--next > .Pagination-link"
);
// Note: getAttribute() resolves to null when the attribute is absent.
const nextHref = await nextButton.getAttribute("data-href");
console.log("href", nextButton, nextHref);
await enqueueLinks({
    // CONSIDER REPLACING URLS WITH SELECTOR
    urls: [nextHref],
    label: "LIST",
    // limit: 6,
});
This is the element I'm extracting the href from, and the value of nextHref is null whenever I go through the proxy. In contrast, nextHref is valid when not using the proxy.
What weirds me out is that when I run the crawler with a headful browser, I can see the first page actually opening, with all the elements on it clearly visible, but the href still doesn't get found.
fascinating-indigo•3y ago
The error is not about the proxy.
It's about adding links to the queue:
Expected property values to be of type string but received type null in object options\n at ow (/Users//Desktop/REBI-scraper/node_modules/ow/dist/index.js:33:28)\n at enqueueLinks
Possibly, the website blocks your proxy (and you need to test another proxy/proxy group).
Try taking a screenshot of the loaded page and check what you actually have there.
Also, try Playwright + Firefox. It can help with blocks.
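(A sketch of both suggestions together, assuming the proxyConfiguration from earlier; the debug screenshot path is illustrative, and launchContext.launcher is Crawlee's standard way to pick a browser:)

    import { PlaywrightCrawler } from "crawlee";
    import { firefox } from "playwright";

    const crawler = new PlaywrightCrawler({
        proxyConfiguration,
        // Swap the default Chromium for Firefox; some blocks are engine-specific.
        launchContext: { launcher: firefox },
        async requestHandler({ page, request }) {
            // Capture what actually loaded through the proxy.
            await page.screenshot({ path: "debug.png", fullPage: true });
            console.log("Loaded", request.loadedUrl);
            // ...rest of the handler
        },
    });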
crude-lavenderOP•3y ago
But as previously mentioned, the headful browser opens and loads the page.
This specific page I'm targeting asks for a captcha when it suspects harmful behaviour, and that does not happen in this case.
fascinating-indigo•3y ago
Headful mode can also have an impact on the likelihood of being blocked.
If it works with headful mode, just keep using it.
If the href still doesn't get found, re-check your selectors. Maybe even check them directly in the opened browser window; the element may have a different selector there. Just add some sleep() to give yourself time to inspect the page.
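(A minimal sketch of that advice combined with a guard against the null that caused the original ArgumentError, assuming the locator and handler context from the earlier snippet; sleep and log are exported by crawlee:)

    import { log, sleep } from "crawlee";

    // Pause so there is time to inspect the page in the headful window.
    await sleep(10_000);

    const nextHref = await nextButton.getAttribute("data-href");
    if (nextHref) {
        await enqueueLinks({ urls: [nextHref], label: "LIST" });
    } else {
        // Never pass null to enqueueLinks; record evidence instead.
        log.warning(`data-href not found on ${request.loadedUrl}`);
        await page.screenshot({ path: "missing-next-link.png" });
    }

As a side note, Playwright types getAttribute() as Promise<string | null>, so strict TypeScript would flag urls: [nextHref] at compile time.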