Trying out Crawlee, Etsy not working

Hi Apify,
Thank you for Crawlee, it's a fine auto-scraping tool! I wanted to follow along with the tutorial but with a different URL, e.g. https://www.etsy.com/search?q=wooden%20box, and it failed with PlaywrightCrawler:
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
    },
    maxRequestRetries: 1,
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
        await page.waitForTimeout(5000);
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        // await enqueueLinks();
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 1,
    // Uncomment this option to see the browser window.
    headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://www.etsy.com/search?q=wooden%20box']);
// await crawler.run(['https://www.etsy.com']); // works
// await crawler.run(['https://www.amazon.com']); // works
It seems to fail at the "Checking device" step. I thought Crawlee injected a TLS fingerprint and a browser fingerprint, but it seems Etsy still blocks it with a 403! Thank you!
4 Replies
azzouzana (4mo ago)
Try to use proxies and bump retries a little bit
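For anyone trying this suggestion, here is a minimal sketch of what "proxies plus more retries" could look like. The `buildCrawlerOptions` helper is a hypothetical name used only to keep the settings in one place, and the `proxy-*.example.com` URLs are placeholders, not real endpoints. It assumes you have proxy URLs from some provider; whether this gets past Etsy's check is not guaranteed.

```javascript
// Hypothetical helper (not part of Crawlee): builds the crawler options so the
// proxy and retry settings sit in one place.
function buildCrawlerOptions({ proxyConfiguration, launcher, maxRequestRetries = 5 }) {
    return {
        proxyConfiguration,
        launchContext: { launcher },
        // "Bump retries a little bit": the original script used 1.
        maxRequestRetries,
        headless: false,
        async requestHandler({ request, page, log, pushData }) {
            const title = await page.title();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);
            await pushData({ title, url: request.loadedUrl });
        },
    };
}

async function main() {
    // Loaded lazily so the helper above can be inspected without launching a browser.
    const { PlaywrightCrawler, ProxyConfiguration } = await import('crawlee');
    const { firefox } = await import('playwright');

    // Placeholder URLs (assumption): replace with your provider's proxy URLs.
    const proxyConfiguration = new ProxyConfiguration({
        proxyUrls: [
            'http://user:password@proxy-1.example.com:8000',
            'http://user:password@proxy-2.example.com:8000',
        ],
    });

    const crawler = new PlaywrightCrawler(
        buildCrawlerOptions({ proxyConfiguration, launcher: firefox }),
    );
    await crawler.run(['https://www.etsy.com/search?q=wooden%20box']);
}

// Uncomment to start the crawl:
// main().catch(console.error);
```

With several proxy URLs configured, a retried request has a chance of going out through a different proxy, which is why the two suggestions are usually tried together.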
rare-sapphire (OP, 4mo ago)
Thanks azzouz, but I don't think it helped, since I can hit the URL with my real IP. It's a 403 every time: ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
After reading further, I think it's the interstitial device-check page that blocks Crawlee.
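If the interstitial theory is right, one thing that may be worth trying is to stop Crawlee from treating the 403 as an immediate block, so the request handler still runs and the page gets time to finish its check. The sketch below sets `sessionPoolOptions.blockedStatusCodes` to an empty list and then polls the page title. The helpers `looksLikeInterstitial` and `waitOutInterstitial` are hypothetical names, and the "checking device" marker is an assumption based on the description above, so it may need adjusting to whatever the page actually shows.

```javascript
// Assumption: the interstitial can be recognised by its title. The marker below
// comes from the "Checking device" text in this thread and may need adjusting.
const INTERSTITIAL_MARKERS = ['checking device'];

function looksLikeInterstitial(title) {
    const lower = String(title ?? '').toLowerCase();
    return INTERSTITIAL_MARKERS.some((marker) => lower.includes(marker));
}

// Polls getTitle() until the title no longer looks like the interstitial or the
// time budget runs out. A null title means "could not read it yet" and is retried.
async function waitOutInterstitial(getTitle, { timeoutMs = 30000, intervalMs = 1000 } = {}) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        const title = await getTitle();
        if (title !== null && !looksLikeInterstitial(title)) return true;
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    return false;
}

async function main() {
    const { PlaywrightCrawler } = await import('crawlee');
    const { firefox } = await import('playwright');

    const crawler = new PlaywrightCrawler({
        launchContext: { launcher: firefox },
        // Do not fail the request on 403 before requestHandler gets to run.
        sessionPoolOptions: { blockedStatusCodes: [] },
        maxRequestRetries: 3,
        headless: false,
        async requestHandler({ request, page, log, pushData }) {
            // page.title() can throw while the page is navigating; treat that as "not ready".
            const passed = await waitOutInterstitial(() => page.title().catch(() => null));
            if (!passed) throw new Error('Still on the device-check page, retrying');
            const title = await page.title();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);
            await pushData({ title, url: request.loadedUrl });
        },
    });
    await crawler.run(['https://www.etsy.com/search?q=wooden%20box']);
}

// Uncomment to start the crawl:
// main().catch(console.error);
```

Throwing from the handler when the check never clears lets Crawlee retry the request, so `maxRequestRetries` still applies. If the check never clears even in a visible browser window, the block is probably based on something other than timing.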
sensitive-blue (4mo ago)
1. Confirm that you have correctly set and replaced the proxy in your scraper script. 2. Try changing the User-Agent. 3. Check whether it is related to headless browser characteristics.
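For points 2 and 3, a rough sketch of one way to approach them in Crawlee: rather than hard-coding a User-Agent string, narrow the fingerprint generator so the generated headers match the browser that is actually launched, and run with a visible window. `buildStealthOptions` is a hypothetical helper name, and the `'desktop'` and `'windows'` values are assumptions to adapt to your own machine. None of this is confirmed to get past Etsy's check.

```javascript
// Hypothetical helper (not part of Crawlee): keeps the generated fingerprint
// consistent with the browser that is launched, and controls headless mode.
function buildStealthOptions({ browserName = 'firefox', operatingSystem = 'windows', headless = false } = {}) {
    return {
        // Point 3: a visible (headful) browser hides some headless-only traits.
        headless,
        browserPoolOptions: {
            useFingerprints: true,
            fingerprintOptions: {
                fingerprintGeneratorOptions: {
                    // Point 2: the User-Agent is generated from these constraints.
                    browsers: [browserName],
                    devices: ['desktop'],
                    operatingSystems: [operatingSystem],
                },
            },
        },
    };
}

async function main() {
    const { PlaywrightCrawler } = await import('crawlee');
    const { firefox } = await import('playwright');

    const crawler = new PlaywrightCrawler({
        launchContext: { launcher: firefox },
        maxRequestRetries: 3,
        ...buildStealthOptions({ browserName: 'firefox', headless: false }),
        async requestHandler({ request, page, log, pushData }) {
            const title = await page.title();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);
            await pushData({ title, url: request.loadedUrl });
        },
    });
    await crawler.run(['https://www.etsy.com/search?q=wooden%20box']);
}

// Uncomment to start the crawl:
// main().catch(console.error);
```

The idea behind matching `browsers` to the launcher is that a Firefox build announcing itself with a Chrome User-Agent is an easy inconsistency for a bot check to spot.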
