scraping at scale
How should I structure my crawler in Crawlee when scraping possibly hundreds of different sites with different structures, while handling multiple requests at once?
3 Replies
flat-fuchsia•3y ago
Well, I am implementing something similar: 30-40 sites, but with SIMILAR structure (if the structure of your sites is different, you are implementing something like Google/Bing, a kind of generic web crawler).
1. You might use something like an external message queue; we discussed it here and in a few other places:
https://discord.com/channels/801163717915574323/1056348705407651941
beanstalkd is just fine for these purposes
2. You can create one big config file (YAML, JSON...) describing "where to find what" on each site
Example:
abc123.com:
  listOfTopics: h1 > div.list > div
  ...
xyz987.com:
  listOfTopics: div.bigListClass > div > p
  ....
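A minimal sketch of how such a config could be looked up per request, in TypeScript since that is what Crawlee uses. The `SiteConfig` type, the `siteConfigs` object, and the `configForUrl` helper are hypothetical names for illustration, and stripping a leading `www.` is an assumption about how the hostnames should be matched. In a real setup the object would probably be loaded from the YAML/JSON file instead of being inlined.

```typescript
// Hypothetical per-site selector config, mirroring the example above.
// Only listOfTopics comes from the example; real configs would have more keys.
type SiteConfig = { listOfTopics: string };

const siteConfigs: Record<string, SiteConfig> = {
  "abc123.com": { listOfTopics: "h1 > div.list > div" },
  "xyz987.com": { listOfTopics: "div.bigListClass > div > p" },
};

// Resolve the config for a URL by its hostname, ignoring a leading "www."
// (an assumption). Returns undefined for sites that are not configured.
function configForUrl(url: string): SiteConfig | undefined {
  const host = new URL(url).hostname.replace(/^www\./, "");
  return siteConfigs[host];
}

// Inside a Crawlee CheerioCrawler requestHandler this could be used roughly as:
//   const cfg = configForUrl(request.url);
//   if (cfg) {
//     const topics = $(cfg.listOfTopics).map((_, el) => $(el).text().trim()).get();
//   }
```

This way a single request handler can serve every site, and adding a new site only means adding a config entry rather than writing a new handler.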
wise-whiteOP•3y ago
thank you