Not able to crawl the website using include paths (filter).

Hi, I am trying to crawl the web pages on this domain, https://www.peigenesis.com/, using path filters, but the crawl does not return the pages under the filtered path. I have been trying for a long time. Can someone please help me? These are the params I am using when sending the API request:

params2 = {
    'limit': 4,
    'maxDepth': 10,
    'includePaths': ["/part-information/*"],
    #"excludePaths": [],
    #"ignoreSitemap": True,
    "allowBackwardLinks": True,
    "allowExternalLinks": False,
    #"webhook": "<string>",
    "scrapeOptions": {
        "formats": ["markdown", "html"],
        #"headers": {},
        #"includeTags": ['#parts a','h1 span','h2 span','#tools a','#tools .fieldvalue','#mates a','#mates .fieldvalue'],
        "excludeTags": ['img'],
        "onlyMainContent": True,
        "waitFor": 2000
    }
}

Thank you.
5 Replies
Caleb, 12mo ago
Hey Krishna! Sorry about this. Looping in @rafaelmiller and creating a ticket
Krishna (OP), 12mo ago
@Caleb is there any update?
rafaelmiller, 12mo ago
Hi @Krishna, I tested the page you sent, and it looks like the base URL doesn't contain any child links matching the includePaths pattern (/part-information/*). Is there a specific URL you're expecting to see in the crawl response?
Krishna (OP), 12mo ago
Thank you for your prompt reply. I understand there may have been some confusion due to the domain address I initially mentioned, so let me clarify with a simpler example:

1. Starting URL: I am starting the crawl at the following URL: https://www.peigenesis.com/en/shop/f/TVP00RW2519PA.html
   When you visit this page, you will notice it lists about eight parts. Each part has its own detail URL with the following pattern: https://www.peigenesis.com/en/shop/part-information/{part-id}/APH/EACH/{id}.html
2. Example: The first part on this page points to the URL: https://www.peigenesis.com/en/shop/part-information/TVP00RW2519PA/APH/EACH/1365462.html
3. My workflow: What I am trying to achieve is quite straightforward:
   - Starting URL: https://www.peigenesis.com/en/shop/f/TVP00RW2519PA.html
   - Target URL pattern: https://www.peigenesis.com/en/shop/part-information/{part-id}/APH/EACH/{id}.html
   I would like to crawl only the URLs matching the above pattern using the includePaths filter (mostly by part-information/).
4. Next steps: We are very interested in moving forward with the subscription plan, but we are encountering these issues. I would greatly appreciate it if this could be addressed at your earliest convenience.

Paths tested:
1. /en/shop/part-information/: For this path, only one page (the base URL) is scraped, and no further links are followed.
2. /en/shop/*: For this path, only two pages are returned, and both come from the navigation bar.

In both cases, the scraper seems unable to follow the links within the main content of the page. Despite setting maxDepth and testing both values of allowBackwardLinks, the crawl does not progress beyond the initial or navigation pages.

Thank you, I look forward to hearing from you. @rafaelmiller @Caleb
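For anyone debugging a similar filter, a quick local check of which URLs a pattern would accept can help narrow things down. The sketch below is only an illustration: it assumes the include pattern is applied as a regular expression to the URL path, which this thread does not confirm, and it tries both an unanchored search and a match anchored at the start of the path. The helper name `path_matches` is hypothetical and not part of any crawler API.

```python
import re
from urllib.parse import urlparse

# URLs taken from the message above.
START_URL = "https://www.peigenesis.com/en/shop/f/TVP00RW2519PA.html"
PART_URL = (
    "https://www.peigenesis.com/en/shop/part-information/"
    "TVP00RW2519PA/APH/EACH/1365462.html"
)

def path_matches(url: str, pattern: str, anchored: bool) -> bool:
    """Hypothetical helper: test a regex pattern against the URL path only.

    anchored=True  -> the pattern must match from the start of the path.
    anchored=False -> the pattern may match anywhere in the path.
    How a real crawler applies its filter is an assumption here.
    """
    path = urlparse(url).path
    matcher = re.match if anchored else re.search
    return matcher(pattern, path) is not None

if __name__ == "__main__":
    patterns = [
        "/part-information/*",           # pattern from the original request
        "/en/shop/part-information/.*",  # full-path variant
    ]
    for pattern in patterns:
        for label, url in (("start", START_URL), ("part", PART_URL)):
            for anchored in (False, True):
                result = path_matches(url, pattern, anchored)
                print(f"{pattern!r:35} {label:5} anchored={anchored!s:5} -> {result}")
```

If the real filter turns out to be anchored at the start of the path, a pattern that begins with /part-information/ would not accept the part URLs, because their paths begin with /en/shop/. Note also that the starting URL matches neither pattern, which may matter if the crawler applies the filter to the page it starts from.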
rafaelmiller, 12mo ago
Hi Krishna. I responded to you via email. Do you have any other questions? I'm closing this conversation for now.