Unable to crawl a website using include paths (filter)
Hi,
I am trying to crawl web pages from the domain https://www.peigenesis.com/ using path filters.
But I am not getting the web pages associated with that path. I have been trying for a long time. Can someone please help me?
params2 = {
    "limit": 4,
    "maxDepth": 10,
    "includePaths": ["/part-information/*"],
    # "excludePaths": [],
    # "ignoreSitemap": True,
    "allowBackwardLinks": True,
    "allowExternalLinks": False,
    # "webhook": "<string>",
    "scrapeOptions": {
        "formats": ["markdown", "html"],
        # "headers": {},
        # "includeTags": ['#parts a', 'h1 span', 'h2 span', '#tools a', '#tools .fieldvalue', '#mates a', '#mates .fieldvalue'],
        "excludeTags": ["img"],
        "onlyMainContent": True,
        "waitFor": 2000,
    },
}
These are the params I am using when sending the API request.
Thank you.
5 Replies
Hey Krishna! Sorry about this. Looping in @rafaelmiller and creating a ticket
@Caleb, is there any update?
Hi @Krishna, I tested the page you sent, and it looks like the base URL doesn’t contain any child links matching the includePaths pattern (/part-information/*). Is there a specific URL you’re expecting to see in the crawl response?
Thank you for your prompt reply. I understand there may have been some confusion due to the domain address I initially mentioned, so let me clarify with a simpler example:
1. Starting URL:
I am starting the crawl at the following URL:
https://www.peigenesis.com/en/shop/f/TVP00RW2519PA.html
When you visit this page, you will notice it lists about eight parts. Each part has its own detailed URL with the following pattern:
https://www.peigenesis.com/en/shop/part-information/{part-id}/APH/EACH/{id}.html
2. Example:
For instance, the first part on this page points to the URL:
https://www.peigenesis.com/en/shop/part-information/TVP00RW2519PA/APH/EACH/1365462.html
3. My Workflow:
The workflow I’m trying to achieve is quite straightforward:
- Starting URL: https://www.peigenesis.com/en/shop/f/TVP00RW2519PA.html
- Target URL Pattern: https://www.peigenesis.com/en/shop/part-information/{part-id}/APH/EACH/{id}.html
I would like to crawl only the URLs matching the above pattern using the includePaths filters (mostly by part-information/).
4. Next Steps:
We are very interested in moving forward with the subscription plan, but we’re encountering these issues. I would greatly appreciate it if this could be addressed at your earliest convenience.
Thank you for your assistance!
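For anyone debugging a similar setup, one quick local check is to test the candidate patterns against the URLs above before sending a crawl request. The sketch below assumes each includePaths entry is treated as a regular expression applied to the URL path; that is an assumption about the crawler, not something confirmed in this thread. `matches_any` and its `anchored` flag are hypothetical helpers written for this illustration.

```python
import re
from urllib.parse import urlparse

def matches_any(url, patterns, anchored=False):
    """Return True if the URL's path matches any pattern.

    Assumption: patterns are regular expressions tested against the path only.
    anchored=True requires the match to begin at the start of the path.
    """
    path = urlparse(url).path
    matcher = re.match if anchored else re.search
    return any(matcher(p, path) is not None for p in patterns)

start_url = "https://www.peigenesis.com/en/shop/f/TVP00RW2519PA.html"
target_url = (
    "https://www.peigenesis.com/en/shop/part-information/"
    "TVP00RW2519PA/APH/EACH/1365462.html"
)

for pattern in ["/part-information/*", "/en/shop/part-information/", "/en/shop/*"]:
    for label, url in [("start", start_url), ("target", target_url)]:
        print(
            pattern, label,
            "search:", matches_any(url, [pattern]),
            "anchored:", matches_any(url, [pattern], anchored=True),
        )
```

If the crawler anchors patterns at the start of the path, /part-information/* would not match the target URL, while /en/shop/part-information/ would. Note that under either interpretation the starting URL itself does not match the part-information patterns.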
Paths Tested:
1. /en/shop/part-information/: For this path, only one page (the base URL) is scraped, and no further links are followed.
2. /en/shop/*: For this path, only two pages are returned, but they are sourced from the navigation bar.
In both cases, the scraper seems unable to follow the links within the main content of the page. Despite setting maxDepth and testing both values of allowBackwardLinks, the tool is not progressing beyond the initial or navigational pages.
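To confirm whether the part links are actually present in the HTML returned for the starting page, a minimal sketch like the following can help. It uses only the Python standard library and does not reflect how the crawler itself discovers links. `LinkCollector`, `part_links`, and the sample HTML (including the /en/about.html path) are made up for illustration; in practice the `html` string would come from the "html" format of the scrape result.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

def part_links(html, base_url, marker="/part-information/"):
    """Return links whose path contains the marker segment."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if marker in urlparse(u).path]

# Invented sample HTML standing in for the real page content.
sample_html = (
    '<nav><a href="/en/about.html">About</a></nav>'
    '<div id="parts">'
    '<a href="/en/shop/part-information/TVP00RW2519PA/APH/EACH/1365462.html">Part</a>'
    "</div>"
)
base = "https://www.peigenesis.com/en/shop/f/TVP00RW2519PA.html"
print(part_links(sample_html, base))
```

If the real page HTML yields no part-information links here, the links are probably rendered by JavaScript after load or removed by options such as onlyMainContent, which would explain why the crawl has nothing to follow.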
Thank you, I look forward to hearing from you. @rafaelmiller @Caleb
Hi Krishna. I responded to you via email. Do you have any other questions? I'm closing this conversation for now.