Firecrawl · 15mo ago
Janice

Question about Includes param on crawl

I am trying to crawl specific sections of https://www.lsu.edu by using the includes param. When I do this, some of the pages I expect data for are returned but others are not. For example, with this code:

    crawl_url = 'https://www.lsu.edu/'
    params = {
        'crawlerOptions': {
            'limit': 100,
            'maxDepth': 6,
            'includes': [
                '/cas',
                '/testing',
                '/cmda/theatre/resources/student/advising/index.php',
                '/science/student-services/advising/',
                '/registrar/academics/academic-calendars/index.php',
                '/financialaid/types_of_scholarships/academic_common_market/index.php',
                '/majors/fast-tracks.php',
                '/eng/current/advising/index.php',
                '/cce/academics/undergraduate/advising.php',
                '/agriculture/students/student-services/advisors.php',
                '/financialaid/apply_for_scholarships/'
            ]
        },
        'pageOptions': {
            'onlyMainContent': True,
            'parsePDF': True,
        }
    }
    crawl_result = app.crawl_url(crawl_url, params=params)

I get everything I expect in /cas, /testing and /majors/fast-tracks.php, but I don't get anything back for '/cmda/theatre/resources/student/advising/index.php'. If I crawl https://www.lsu.edu/cmda/theatre/resources/student/advising/index.php directly, though, I get back the page I am expecting.
3 Replies
Janice (OP) · 15mo ago
What can I do to get the results for /cmda/theatre/resources/student/advising/index.php in my first example? FWIW, the first scenario doesn't seem to work in the playground either.
Caleb · 15mo ago
Out of curiosity, how many pages in total are you gathering? Are you hitting the 100 page limit you set? If that's not it, I wouldn't be surprised if there was some funny business going on with the sitemap 🤔
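For anyone checking the same thing, here is a minimal sketch of how the page count and the missing sections could be inspected. It assumes you have already pulled the returned page URLs out of the crawl result into a plain list (how they are exposed depends on the client version), and it uses a simple path-prefix check, which is an illustrative assumption and not necessarily how the crawler matches `includes`. The helper name `unmatched_includes` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def unmatched_includes(result_urls, includes):
    """Return the include patterns that no returned URL path starts with.

    Prefix matching is an assumption made for this diagnostic only.
    """
    paths = [urlparse(url).path for url in result_urls]
    return [
        pattern
        for pattern in includes
        if not any(path.startswith(pattern) for path in paths)
    ]


# Hypothetical sample of returned URLs, for illustration only
sample_urls = [
    "https://www.lsu.edu/cas/index.php",
    "https://www.lsu.edu/testing/index.php",
    "https://www.lsu.edu/majors/fast-tracks.php",
]
includes = [
    "/cas",
    "/testing",
    "/majors/fast-tracks.php",
    "/cmda/theatre/resources/student/advising/index.php",
]

missing = unmatched_includes(sample_urls, includes)
print(f"{len(sample_urls)} pages returned; includes with no match: {missing}")
```

If the page count is well under the limit and some patterns still have no match, the limit is not the cause, which points toward how the crawl discovered its URLs.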
Janice (OP) · 15mo ago
I'm not hitting the 100 pages; I get back 27. Interesting about the sitemap. I will try my run again without using the sitemap. Thank you!

You were correct about the sitemap. Thank you very much for your help.
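The thread does not say exactly how the rerun without the sitemap was configured. As a rough sketch only, assuming the API version in use accepts an `ignoreSitemap` flag inside `crawlerOptions` (check the docs for your version before relying on it), the params from the original post could be built like this. `build_crawl_params` is a hypothetical helper, not part of the client library.

```python
def build_crawl_params(includes, limit=100, max_depth=6, ignore_sitemap=True):
    """Build crawl params mirroring the original post, plus a sitemap toggle."""
    return {
        'crawlerOptions': {
            'limit': limit,
            'maxDepth': max_depth,
            'includes': list(includes),
            # Assumed option name; verify against the docs for your API version
            'ignoreSitemap': ignore_sitemap,
        },
        'pageOptions': {
            'onlyMainContent': True,
            'parsePDF': True,
        },
    }


params = build_crawl_params([
    '/cas',
    '/testing',
    '/cmda/theatre/resources/student/advising/index.php',
])

# `app` is the client object from the original post:
# crawl_result = app.crawl_url('https://www.lsu.edu/', params=params)
```

With the sitemap skipped, the crawler has to discover pages by following links from the start URL, so `maxDepth` and `limit` matter more for deeply nested pages.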
