xenial-black
Apify not executing Scrapy's close_spider function and running endlessly after it finishes scraping.
Hello, I've got a little problem. As I said in the title, my script does not execute the close_spider function, and when the scraping finishes, it goes into an infinite loop. I guess that's why close_spider doesn't get executed. Can anyone help?

17 Replies
sensitive-blue•2y ago
Hi,
Please provide some reproduction (code snippet).
It's hard to help without seeing it.
Possibly you have a bug somewhere.
xenial-blackOP•2y ago
the spider code
the main function code
the problem is that it keeps running indefinitely, for some reason, even after finishing the crawl
xenial-black•2y ago
Hi, I was not able to reproduce it since the provided code snippets appear to be incomplete. Despite filling in the missing imports and attempting to execute the code, I encountered the following error:
Could you please provide the complete and functional code of your Actor?
Additionally, providing a link to the run of your Actor would be helpful as well.
xenial-blackOP•2y ago
it does an infinite run, for a reason unknown to me, even after finishing scraping everything
xenial-black•2y ago
Hey, I did some investigation... If you add an Item Pipeline to your Scrapy-Apify project (based on the Scrapy Actor template), it works, and the `close_spider` method is correctly called after the spider finishes its work. I even tried to use your `DataCleaningPipeline`, and it works; there is no bug in it.
The problem has to be somewhere in your Spider and/or in `main.py`. I suggest you keep `main.py` as simple as possible, e.g. like this:
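(The `main.py` snippet referenced here was shared as an attachment and is not preserved. A minimal sketch in the spirit of the Scrapy Actor template might look like the following; the spider name is a placeholder, and the `apply_apify_settings` import path may differ between SDK versions.)

```python
from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings  # import path varies by SDK version

from .spiders.title import TitleSpider  # placeholder spider


async def main() -> None:
    async with Actor:
        # Merge the Apify-specific settings (scheduler, storages, pipelines)
        # into the project's Scrapy settings.
        settings = apply_apify_settings()

        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(TitleSpider)
        process.start()
```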
Then move the `start_urls`, `domain`, and other related logic into the Spider (as class attributes), and try to debug your Spider code.
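(As an illustration of that suggestion, a sketch with made-up names and domain: the spider carries its configuration as class attributes.)

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'my-spider'
    # Hard-coded class attributes; in the real Actor these values
    # come from the user input (see further below).
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        # Minimal parse callback: yield the page title.
        yield {'url': response.url, 'title': response.css('title::text').get()}
```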
xenial-blackOP•2y ago
The problem is that I don't know what the allowed domains will be, nor the start URLs; the allowed domains come from user input, so I don't know what the user will enter.
I removed the `apify.scrapy.pipelines.ActorDataSet` pipeline because it was pushing the non-cleaned items, and even with that it was still running infinitely.
Perhaps I should put everything in one file instead of a project? @Vlada Dusek
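(One possible way around this constraint, not stated in the thread: a sketch assuming a `CrawlerProcess`-based `main.py` and a hypothetical input schema with `startUrls` and `allowedDomains` fields. The Actor input is read in `main.py` and forwarded to the spider, since `process.crawl()` passes extra keyword arguments to the spider's constructor, where they become instance attributes.)

```python
from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings  # import path varies by SDK version

from .spiders.my_spider import MySpider  # hypothetical spider module


async def main() -> None:
    async with Actor:
        # Hypothetical input schema fields: startUrls and allowedDomains.
        actor_input = await Actor.get_input() or {}
        start_urls = [entry['url'] for entry in actor_input.get('startUrls', [])]
        allowed_domains = actor_input.get('allowedDomains', [])

        process = CrawlerProcess(apply_apify_settings(), install_root_handler=False)
        # Extra kwargs end up as instance attributes on the spider.
        process.crawl(MySpider, start_urls=start_urls, allowed_domains=allowed_domains)
        process.start()
```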
xenial-black•2y ago
> I removed the apify.scrapy.pipelines.ActorDataSet because it was pushing the non-cleaned items
I believe for this use case you don't have to remove `ActorPushDatasetPipeline`; you should rather implement your own cleaning pipeline, which will be executed before the push pipeline (the push pipeline should be the last one). Example of such a CleaningPipeline:
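(The original example was shared as a snippet and is not preserved. A minimal sketch of such a pipeline; the cleaning logic and module paths are assumptions.)

```python
from itemadapter import ItemAdapter


class CleaningPipeline:
    """Cleans items before they reach the dataset push pipeline."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Placeholder cleaning logic: strip whitespace from all string fields.
        for key in list(adapter.keys()):
            value = adapter.get(key)
            if isinstance(value, str):
                adapter[key] = value.strip()
        return item
```

Ordering is controlled by the priorities in `ITEM_PIPELINES`: lower numbers run first, so the cleaning pipeline needs a lower number than the push pipeline. (The push pipeline's exact class name depends on the SDK version, and `apply_apify_settings()` may already register it for you.)

```python
# settings.py (assumed module paths; lower priority value = runs earlier)
ITEM_PIPELINES = {
    'myproject.pipelines.CleaningPipeline': 100,
    # The Apify push pipeline; exact class name may differ between SDK versions.
    'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,
}
```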
xenial-blackOP•2y ago
There's another problem
I have to do it the way I'm doing it
xenial-black•2y ago
> I don't know what the allowed domains will be, nor the start URLs; the allowed domains come from user input, so I don't know what the user input will be
I got it. But for the purpose of debugging you can select one possible input and hard-code it into the Spider.
xenial-blackOP•2y ago
I'll do that. Ty
xenial-black•2y ago
Yeah, I believe the problem has to be in the Spider, because if I try it with another one, it works.
xenial-blackOP•2y ago
I believe so too
On a local run without Apify, it works very well