foreign-sapphire
robots.txt Compatibility
Hi guys 👋, my Apify actor can pull data from the website even though the robots.txt setting is “TRUE”. When I test it on my own server, it complies with robots.txt rules. Doesn't Apify automatically follow robots.txt rules? Can't we set it manually? I haven't found any documentation on this.
Apify Community
rare-sapphire•9mo ago
Apify does not enforce robots.txt rules by default. The platform prioritizes flexibility for web scraping and automation, and some use cases may require bypassing these rules (within the bounds of legality and ethics). So even if the robots.txt setting is "TRUE", the rules may not be applied unless your code handles them explicitly.
You can manually enforce robots.txt rules by adding logic to your actor. For example, you can use libraries like robots-txt-guard in Node.js to parse and respect robots.txt restrictions before pulling data from a website.
Here's a basic approach:
- Parse the robots.txt file from the target site.
- Check whether your actor is allowed to scrape specific endpoints.
- Proceed based on the result of the check.
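The three steps above can be sketched in plain Node.js (18+ for the built-in fetch) without any dependency. This is a simplified stand-in for what a library like robots-txt-guard does, not its API: it only handles User-agent, Allow and Disallow with prefix matching (no wildcards), and the names parseRobotsTxt, isAllowed, canFetch and the MyActorBot user agent are made up for illustration. Treating a missing robots.txt as "allowed" is also an assumption.

```javascript
// Step 1: parse robots.txt text into groups of { agents, rules }.
// Simplified: only User-agent / Allow / Disallow, no wildcard support.
function parseRobotsTxt(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (!line) continue;
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty Disallow means "nothing is disallowed", so skip it.
      if (value) current.rules.push({ allow: field === 'allow', path: value });
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

// Step 2: check a path against the rules that apply to a user agent.
// The longest matching rule wins; on a tie, Allow wins.
function isAllowed(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();
  // Assumption: substring match on the agent name, falling back to "*".
  const group =
    groups.find((g) => g.agents.some((a) => a && a !== '*' && ua.includes(a))) ||
    groups.find((g) => g.agents.includes('*'));
  if (!group) return true;
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}

// Step 3: fetch robots.txt for the target site and decide whether to proceed.
async function canFetch(targetUrl, userAgent) {
  const url = new URL(targetUrl);
  const res = await fetch(new URL('/robots.txt', url.origin));
  if (!res.ok) return true; // assumption: no robots.txt means no restrictions
  const groups = parseRobotsTxt(await res.text());
  return isAllowed(groups, userAgent, url.pathname + url.search);
}
```

In an actor you would call something like `await canFetch(request.url, 'MyActorBot')` before handling each request and skip the URL when it returns false. For production use, a maintained parser is probably the safer choice, since real robots.txt files use wildcards and other directives this sketch ignores.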