Question about the includes param on crawl
I am trying to crawl specific sections of https://www.lsu.edu using the includes param. Some of the pages I expect are returned, but others are not.
For example, with this code:
crawl_url = 'https://www.lsu.edu/'
params = {
    'crawlerOptions': {
        'limit': 100,
        'maxDepth': 6,
        'includes': [
            '/cas',
            '/testing',
            '/cmda/theatre/resources/student/advising/index.php',
            '/science/student-services/advising/',
            '/registrar/academics/academic-calendars/index.php',
            '/financialaid/types_of_scholarships/academic_common_market/index.php',
            '/majors/fast-tracks.php',
            '/eng/current/advising/index.php',
            '/cce/academics/undergraduate/advising.php',
            '/agriculture/students/student-services/advisors.php',
            '/financialaid/apply_for_scholarships/',
        ],
    },
    'pageOptions': {
        'onlyMainContent': True,
        'parsePDF': True,
    },
}
crawl_result = app.crawl_url(crawl_url, params=params)
I get everything I expect under /cas, /testing, and /majors/fast-tracks.php, but nothing comes back for '/cmda/theatre/resources/student/advising/index.php'.
However, if I crawl https://www.lsu.edu/cmda/theatre/resources/student/advising/index.php directly, I get back the page I am expecting.
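To make it easier to see exactly which include entries came back empty, here is a small diagnostic sketch. It is not part of the crawler library: `missing_includes` is a hypothetical helper, the two returned URLs are made-up sample values, and it assumes each include entry behaves like a plain path prefix (the crawler itself may treat entries as regex or glob patterns, so results could differ).

```python
from urllib.parse import urlparse


def missing_includes(includes, returned_urls):
    """Return the include entries that no returned URL path starts with.

    Assumption: each include entry is treated as a plain path prefix.
    """
    paths = [urlparse(url).path for url in returned_urls]
    return [
        pattern
        for pattern in includes
        if not any(path.startswith(pattern) for path in paths)
    ]


# Hypothetical sample data: in practice, collect the source URL of every
# page in the crawl result and pass that list in instead.
includes = [
    '/cas',
    '/testing',
    '/cmda/theatre/resources/student/advising/index.php',
]
returned_urls = [
    'https://www.lsu.edu/cas/index.php',
    'https://www.lsu.edu/testing/index.php',
]

print(missing_includes(includes, returned_urls))
```

Running this against the real list of returned URLs would show whether the other deep paths in the includes list are missing too, or only the /cmda one.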