CA
Crawlee & Apify•2y ago
xenial-black

Apify not executing Scrapy's close_spider function and looping forever after it finishes scraping.

Hello, I have a little problem. As the title says, my script does not execute the close_spider function, and when the scraping finishes it goes into an infinite loop. I guess that's why close_spider doesn't get executed. Can anyone help?
17 Replies
sensitive-blue
sensitive-blue•2y ago
Hi, please provide a reproduction (code snippet). It's hard to help without seeing it. Possibly you have a bug somewhere.
xenial-black
xenial-blackOP•2y ago
the spider code
async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Process Actor input
        actor_input = await Actor.get_input() or {}
        max_depth = actor_input.get('max_depth', 1)
        start_urls = ['https://gelsf.com/', 'vowconstruction.com/', 'https://prosperdevelopment.com/', 'https://missionhomeremodeling.com/', 'https://www.leefamilycorp.com/', 'https://www.a2zremodelingcal.com/', 'https://lemusco.com/', 'https://www.agcsf.com/', 'https://www.goldenheightsremodeling.com/']
        settings = _get_scrapy_settings(max_depth)
        domain = []

        def get_domain(url):
            try:
                if not urlparse(url).scheme:
                    url = 'http://' + url
                parsed_url = urlparse(url)
                domain = parsed_url.netloc
                if domain.startswith('www.'):
                    domain = domain[4:]
                return domain
            except:
                print(f'invalid url : {url}')

        for i in start_urls:
            a = get_domain(i)
            domain.append(a)

        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider, domain=domain, urls=start_urls)
        process.start()
        print('Finished scraping. Cleaning data...')
The main function code. The problem is that it keeps running infinitely, for some reason, even after the crawling has finished.
xenial-black
xenial-black•2y ago
Hi, I was not able to reproduce it since the provided code snippets appear to be incomplete. Despite filling in the missing imports and attempting to execute the code, I encountered the following error:
AttributeError: 'Testing' object has no attribute 'is_valid_url'
Could you please provide the complete and functional code of your Actor? Additionally, providing a link to the run of your Actor would be helpful as well.
xenial-black
xenial-blackOP•2y ago
def is_valid_url(self, url):
    try:
        parsed = urlparse(url)
        return True
    except Exception as e:
        print(f"Error validating URL: {e}")
        return False
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from urllib.parse import urlparse
from apify import Actor

from ..items import Items
import scrapy
import re
import json
xenial-black
xenial-blackOP•2y ago
Apify Console: Manage the Apify platform and your account.
xenial-black
xenial-blackOP•2y ago
It keeps running infinitely after it finishes scraping everything, for a reason I don't know.
xenial-black
xenial-black•2y ago
Hey, I did some investigation... If you add an Item Pipeline to your Scrapy-Apify project (based on the Scrapy Actor template), it works, and the close_spider method is correctly called after the spider finishes its work. I even tried your DataCleaningPipeline, and it works; there is no bug in it. The problem has to be somewhere in your Spider and/or in main.py. I suggest you keep main.py as simple as possible, e.g. like this:
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from apify import Actor
from .spiders.title import TitleSpider as Spider


def _get_scrapy_settings() -> Settings:
    settings = get_project_settings()
    settings['ITEM_PIPELINES']['apify.scrapy.pipelines.ActorDatasetPushPipeline'] = 1000
    settings['DOWNLOADER_MIDDLEWARES']['scrapy.downloadermiddlewares.retry.RetryMiddleware'] = None
    settings['DOWNLOADER_MIDDLEWARES']['apify.scrapy.middlewares.ApifyRetryMiddleware'] = 1000
    settings['SCHEDULER'] = 'apify.scrapy.scheduler.ApifyScheduler'
    return settings


async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        settings = _get_scrapy_settings()
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()
And move the start_urls, domains, and other related logic into the Spider (as class attributes). Then try to debug your Spider code.
...

class TestSpider(Spider):
    name = 'test'
    second_pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+')
    email_pattern = re.compile(r'(?:mailto:)?[A-Za-z0-9._%+-]+@[A-Za-z.-]+\.[A-Za-z]{2,}')
    links_pattern = re.compile(r'(twitter\.com|facebook\.com|instagram\.com|linkedin\.com)/')
    phone_pattern = re.compile(r'tel:\+\d+')

    start_urls = [
        'https://gelsf.com/',
        'https://vowconstruction.com/',
        'https://prosperdevelopment.com/',
        'https://missionhomeremodeling.com/',
        'https://www.leefamilycorp.com/',
        'https://www.a2zremodelingcal.com/',
        'https://lemusco.com/',
        'https://www.agcsf.com/',
        'https://www.goldenheightsremodeling.com/',
    ]

    allowed_domains = [
        'gelsf.com',
        'vowconstruction.com',
        'prosperdevelopment.com',
        'missionhomeremodeling.com',
        'www.leefamilycorp.com',
        'www.a2zremodelingcal.com',
        'lemusco.com',
        'www.agcsf.com',
        'www.goldenheightsremodeling.com',
    ]

    headers = {
        ...
    }

    ...
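For context, the close_spider hook this thread revolves around is a method on an item pipeline. Below is a minimal sketch of such a pipeline; it reuses the OP's DataCleaningPipeline name, but the body is illustrative, not the OP's actual code:

class DataCleaningPipeline:
    def process_item(self, item, spider):
        # Clean each scraped item here before the push pipeline stores it.
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes; this is the hook
        # that reportedly never ran while the Actor kept hanging.
        spider.logger.info('DataCleaningPipeline: spider closed, data cleaned.')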
xenial-black
xenial-blackOP•2y ago
The problem is that I don't know what the allowed domains or the start URLs will be. The allowed domains come from user input, so I don't know what the user will enter. I removed apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items, and even with that it was still running infinitely. Perhaps I should put everything in one file instead of a project? @Vlada Dusek
xenial-black
xenial-black•2y ago
I removed the apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items
I believe for this use case you don't have to remove ActorDatasetPushPipeline; you should rather implement your own cleaning pipeline, which will be executed before the push pipeline (which should be the last one). Example of such a CleaningPipeline:
class CleaningPipeline:
    def process_item(self, item: BookItem, spider: Spider) -> BookItem:
        number_map = {
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
        }
        return BookItem(
            title=item['title'],
            price=float(item['price'].replace('£', '')),
            rating=number_map[item['rating'].split(' ')[1].lower()],
            in_stock=bool(item['in_stock'].lower() == 'in stock'),
        )
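To keep ActorDatasetPushPipeline and still push only cleaned items, the cleaning pipeline can be registered with a lower priority number so it runs before the push pipeline. A minimal sketch of the relevant setting, assuming the cleaning pipeline lives in the project's pipelines module (the module path here is illustrative):

settings['ITEM_PIPELINES'] = {
    'your_project.pipelines.CleaningPipeline': 500,  # lower number runs first
    'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,  # runs last, pushes already-cleaned items
}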
xenial-black
xenial-blackOP•2y ago
There's another problem: I have to do it the way I'm doing it.
xenial-black
xenial-black•2y ago
I don't know what the allowed domains or the start URLs will be. The allowed domains come from user input, so I don't know what the user will enter.
I get it. But for the purpose of debugging, you can select one possible input and hard-code it into the Spider.
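If a single hard-coded input is not enough later on, keyword arguments passed to process.crawl are forwarded to the spider's constructor, so user-supplied start URLs and domains can still be injected while keeping main.py simple. A minimal sketch (the spider and argument names are illustrative, not from the thread):

import scrapy

class InputSpider(scrapy.Spider):
    name = 'input_spider'

    def __init__(self, start_urls=None, allowed_domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Fall back to empty lists so the spider also works without arguments.
        self.start_urls = start_urls or []
        self.allowed_domains = allowed_domains or []

# In main.py, after reading the Actor input:
#     process.crawl(InputSpider, start_urls=start_urls, allowed_domains=allowed_domains)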
xenial-black
xenial-blackOP•2y ago
I'll do that. Ty
xenial-black
xenial-black•2y ago
Yeah, I believe the problem has to be in the Spider, because if I try it with another one, it works.
xenial-black
xenial-blackOP•2y ago
I believe so too. On a local run without Apify, it works very well.
