Question about Includes param on crawl
I am trying to crawl specific sections of https://www.lsu.edu by using the includes param. When I do this, some of the pages I expect are returned but others are not.
For example, with this code:
crawl_url = 'https://www.lsu.edu/'
params = {
    'crawlerOptions': {
        'limit': 100,
        'maxDepth': 6,
        'includes': [
            '/cas',
            '/testing',
            '/cmda/theatre/resources/student/advising/index.php',
            '/science/student-services/advising/',
            '/registrar/academics/academic-calendars/index.php',
            '/financialaid/types_of_scholarships/academic_common_market/index.php',
            '/majors/fast-tracks.php',
            '/eng/current/advising/index.php',
            '/cce/academics/undergraduate/advising.php',
            '/agriculture/students/student-services/advisors.php',
            '/financialaid/apply_for_scholarships/'
        ]
    },
    'pageOptions': {
        'onlyMainContent': True,
        'parsePDF': True,
    }
}
crawl_result = app.crawl_url(crawl_url, params=params)
I get everything I expect in /cas, /testing and /majors/fast-tracks.php, but I don't get anything back for '/cmda/theatre/resources/student/advising/index.php'.
But if I crawl for https://www.lsu.edu/cmda/theatre/resources/student/advising/index.php directly - I get back the page I am expecting.
What can I do to get the results for /cmda/theatre/resources/student/advising/index.php in my first example?
FWIW - The first scenario doesn't seem to work in the playground either
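To see at a glance which include patterns came back empty, something like the sketch below might help. It is only a rough check: `missing_includes` is a hypothetical helper, the returned URLs are made-up placeholders, and plain substring matching on the URL path is an assumption that may not mirror how the crawler itself applies `includes`. The URLs would also have to be pulled out of `crawl_result` first, and the shape of that result is not shown in this thread.

```python
from urllib.parse import urlparse


def missing_includes(includes, returned_urls):
    """Return the include patterns that match none of the returned URLs.

    Matching is a plain substring test against each URL's path, which is
    an assumption and only approximates the crawler's own matching.
    """
    paths = [urlparse(url).path for url in returned_urls]
    return [
        pattern
        for pattern in includes
        if not any(pattern in path for path in paths)
    ]


# A few of the patterns from the params above.
includes = [
    '/cas',
    '/testing',
    '/cmda/theatre/resources/student/advising/index.php',
]

# Placeholder URLs for illustration only, not real crawl output.
returned = [
    'https://www.lsu.edu/cas/index.php',
    'https://www.lsu.edu/testing/index.php',
]

print(missing_includes(includes, returned))
```

With the placeholder data above, only the /cmda/theatre/... pattern is reported as missing, which is the same symptom described in the question.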
Out of curiosity, how many pages in total are you gathering? Are you hitting the 100 page limit you set?
If that's not it, I wouldn't be surprised if there was some funny business going on with the sitemap 🤔
I'm not hitting the 100 pages - I get back 27. Interesting about the sitemap. I will try my run again without using the sitemap. Thank you!
You were correct about the sitemap. Thank you very much for your help.
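For anyone landing here later, a minimal sketch of what the retry might look like. The thread does not say how the sitemap was turned off, so the `ignoreSitemap` key under `crawlerOptions` is an assumption to check against the Firecrawl docs for your SDK version, and `without_sitemap` is a hypothetical helper, not part of the library.

```python
import copy


def without_sitemap(params):
    """Return a copy of params asking the crawler to skip the sitemap.

    'ignoreSitemap' is an assumed option name; verify it against the
    documentation for the API version you are using.
    """
    new_params = copy.deepcopy(params)
    new_params.setdefault('crawlerOptions', {})['ignoreSitemap'] = True
    return new_params


params = {
    'crawlerOptions': {
        'limit': 100,
        'maxDepth': 6,
        'includes': [
            '/cmda/theatre/resources/student/advising/index.php',
        ],
    },
    'pageOptions': {
        'onlyMainContent': True,
        'parsePDF': True,
    },
}

retry_params = without_sitemap(params)

# The retry would then reuse the same call as in the question:
# crawl_result = app.crawl_url(crawl_url, params=retry_params)
```

Copying the params rather than editing them in place keeps the original run reproducible, so the two results can be compared side by side.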