CA
Crawlee & Apify•2y ago
xenial-black

Apify not executing Scrapy's close_spider function and looping forever after it finishes scraping.

Hello, I have a little problem. As the title says, my script does not execute the close_spider function, and when the scraping finishes it goes into an infinite loop. I guess that's why close_spider doesn't get executed. Can anyone help?
17 Replies
sensitive-blue
sensitive-blue•2y ago
Hi, please provide a reproduction (code snippet). It's hard to help without seeing it. Possibly you have a bug somewhere.
xenial-black
xenial-blackOP•2y ago
the spider code
async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Process Actor input
        actor_input = await Actor.get_input() or {}
        max_depth = actor_input.get('max_depth', 1)
        start_urls = ['https://gelsf.com/', 'vowconstruction.com/', 'https://prosperdevelopment.com/', 'https://missionhomeremodeling.com/', 'https://www.leefamilycorp.com/', 'https://www.a2zremodelingcal.com/', 'https://lemusco.com/', 'https://www.agcsf.com/', 'https://www.goldenheightsremodeling.com/']
        settings = _get_scrapy_settings(max_depth)
        domain = []

        def get_domain(url):
            try:
                if not urlparse(url).scheme:
                    url = 'http://' + url
                parsed_url = urlparse(url)
                domain = parsed_url.netloc
                if domain.startswith('www.'):
                    domain = domain[4:]
                return domain
            except:
                print(f'invalid url : {url}')

        for i in start_urls:
            a = get_domain(i)
            domain.append(a)

        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider, domain=domain, urls=start_urls)
        process.start()
        print('Finished scraping. Cleaning data...')
The main function code. The problem is that it keeps running infinitely, for some reason, even after the crawling has finished.
xenial-black
xenial-black•2y ago
Hi, I was not able to reproduce it since the provided code snippets appear to be incomplete. Despite filling in the missing imports and attempting to execute the code, I encountered the following error:
AttributeError: 'Testing' object has no attribute 'is_valid_url'
Could you please provide the complete and functional code of your Actor? Additionally, providing a link to the run of your Actor would be helpful as well.
xenial-black
xenial-blackOP•2y ago
def is_valid_url(self, url):
    try:
        parsed = urlparse(url)
        return True
    except Exception as e:
        print(f"Error validating URL: {e}")
        return False
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from urllib.parse import urlparse
from apify import Actor

from ..items import Items
import scrapy
import re
import json
xenial-black
xenial-blackOP•2y ago
Apify Console: Manage the Apify platform and your account.
xenial-black
xenial-blackOP•2y ago
It keeps running infinitely after it finishes scraping everything, for a reason I don't know.
xenial-black
xenial-black•2y ago
Hey, I did some investigation... If you add an Item Pipeline to your Scrapy-Apify project (based on the Scrapy Actor template), it works, and the close_spider method is correctly called after the spider finishes its work. I even tried your DataCleaningPipeline, and it works; there is no bug in it. The problem has to be somewhere in your Spider and/or in main.py. I suggest you keep main.py as simple as possible, e.g. like this:
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from apify import Actor
from .spiders.title import TitleSpider as Spider


def _get_scrapy_settings() -> Settings:
    settings = get_project_settings()
    settings['ITEM_PIPELINES']['apify.scrapy.pipelines.ActorDatasetPushPipeline'] = 1000
    settings['DOWNLOADER_MIDDLEWARES']['scrapy.downloadermiddlewares.retry.RetryMiddleware'] = None
    settings['DOWNLOADER_MIDDLEWARES']['apify.scrapy.middlewares.ApifyRetryMiddleware'] = 1000
    settings['SCHEDULER'] = 'apify.scrapy.scheduler.ApifyScheduler'
    return settings


async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        settings = _get_scrapy_settings()
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()
And move the start_urls, domains, and other related logic into the Spider (as class attributes). Then try to debug your Spider code.
...

class TestSpider(Spider):
    name = 'test'
    second_pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+')
    email_pattern = re.compile(r'(?:mailto:)?[A-Za-z0-9._%+-]+@[A-Za-z.-]+\.[A-Za-z]{2,}')
    links_pattern = re.compile(r'(twitter\.com|facebook\.com|instagram\.com|linkedin\.com)/')
    phone_pattern = re.compile(r'tel:\+\d+')

    start_urls = [
        'https://gelsf.com/',
        'https://vowconstruction.com/',
        'https://prosperdevelopment.com/',
        'https://missionhomeremodeling.com/',
        'https://www.leefamilycorp.com/',
        'https://www.a2zremodelingcal.com/',
        'https://lemusco.com/',
        'https://www.agcsf.com/',
        'https://www.goldenheightsremodeling.com/',
    ]

    allowed_domains = [
        'gelsf.com',
        'vowconstruction.com',
        'prosperdevelopment.com',
        'missionhomeremodeling.com',
        'www.leefamilycorp.com',
        'www.a2zremodelingcal.com',
        'lemusco.com',
        'www.agcsf.com',
        'www.goldenheightsremodeling.com',
    ]

    headers = {
        ...
    }

    ...
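For context, the close_spider hook this thread revolves around is a method on an item pipeline. Below is a minimal sketch of such a pipeline; it reuses the OP's DataCleaningPipeline name, but the body is illustrative, not the OP's actual code:

class DataCleaningPipeline:
    def process_item(self, item, spider):
        # Clean each scraped item here before the push pipeline stores it.
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes; this is the hook
        # that reportedly never ran while the Actor kept hanging.
        spider.logger.info('DataCleaningPipeline: spider closed, data cleaned.')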
xenial-black
xenial-blackOP•2y ago
The problem is that I don't know what the allowed domains or the start URLs will be. The allowed domains come from user input, so I don't know what the user will enter. I removed apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items, and even with that it was still running infinitely. Perhaps I should put everything in one file instead of a project? @Vlada Dusek
xenial-black
xenial-black•2y ago
I removed the apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items
I believe for this use case you don't have to remove ActorDatasetPushPipeline; you should rather implement your own cleaning pipeline, which will be executed before the push pipeline (which should be the last one). Example of such a CleaningPipeline:
class CleaningPipeline:
    def process_item(self, item: BookItem, spider: Spider) -> BookItem:
        number_map = {
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
        }
        return BookItem(
            title=item['title'],
            price=float(item['price'].replace('£', '')),
            rating=number_map[item['rating'].split(' ')[1].lower()],
            in_stock=bool(item['in_stock'].lower() == 'in stock'),
        )
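To keep ActorDatasetPushPipeline and still push only cleaned items, the cleaning pipeline can be registered with a lower priority number so it runs before the push pipeline. A minimal sketch of the relevant setting, assuming the cleaning pipeline lives in the project's pipelines module (the module path here is illustrative):

settings['ITEM_PIPELINES'] = {
    'your_project.pipelines.CleaningPipeline': 500,  # lower number runs first
    'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,  # runs last, pushes already-cleaned items
}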
xenial-black
xenial-blackOP•2y ago
There's another problem: I have to do it the way I'm doing it.
xenial-black
xenial-black•2y ago
I don't know what the allowed domains or the start URLs will be. The allowed domains come from user input, so I don't know what the user will enter.
I get it. But for the purpose of debugging, you can select one possible input and hard-code it into the Spider.
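If a single hard-coded input is not enough later on, keyword arguments passed to process.crawl are forwarded to the spider's constructor, so user-supplied start URLs and domains can still be injected while keeping main.py simple. A minimal sketch (the spider and argument names are illustrative, not from the thread):

import scrapy

class InputSpider(scrapy.Spider):
    name = 'input_spider'

    def __init__(self, start_urls=None, allowed_domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Fall back to empty lists so the spider also works without arguments.
        self.start_urls = start_urls or []
        self.allowed_domains = allowed_domains or []

# In main.py, after reading the Actor input:
#     process.crawl(InputSpider, start_urls=start_urls, allowed_domains=allowed_domains)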
xenial-black
xenial-blackOP•2y ago
I'll do that. Ty
xenial-black
xenial-black•2y ago
Yeah, I believe the problem has to be in the Spider, because if I try it with another one, it works.
xenial-black
xenial-blackOP•2y ago
I believe so too. On a local run without Apify, it works very well.
