Crawlee & Apify•2y ago
passive-yellow

start urls input

I can get the input from Apify in my Crawlee Playwright code and console.log() the start urls, but I am not sure how to access them because it says the start urls are of type any instead of an array of strings. Can you provide some example code for this to be extracted so I can use them as start urls in my code?
19 Replies
Pepa J
Pepa J•2y ago
Hi @Casper , I am not sure if I understand, You should be able to set the type of the input as:
interface InputSchema {
    startUrls: string[];
}

// ...

const input = await Actor.getInput<InputSchema>();
// input.startUrls is of type string[]
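The generic parameter above only tells the compiler what to expect; it does not check the data at runtime. A minimal sketch of a runtime guard follows, assuming the input field is named startUrls as in the snippet above. The helper name parseInput is hypothetical and not part of the Apify SDK.

```typescript
interface InputSchema {
    startUrls: string[];
}

// Hypothetical helper: narrows an unknown value to InputSchema or throws.
function parseInput(raw: unknown): InputSchema {
    if (typeof raw !== "object" || raw === null) {
        throw new Error("Actor input is missing or not an object");
    }
    const candidate = (raw as Record<string, unknown>).startUrls;
    if (!Array.isArray(candidate)) {
        throw new Error("startUrls must be an array");
    }
    const startUrls: string[] = [];
    for (const item of candidate) {
        if (typeof item !== "string") {
            throw new Error("every entry in startUrls must be a string");
        }
        startUrls.push(item);
    }
    return { startUrls };
}
```

Inside an Actor this could be called as parseInput(await Actor.getInput()), which also covers the case where no input was provided at all.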
passive-yellow
passive-yellowOP•2y ago
thanks, how can I add the start urls to this code?
await crawler.run([
    {
        url: startUrls,
        label: "companyInfo",
    },
]);
I read input like this at the moment
const input = (await Actor.getInput()) as Record<string, any>;
Then convert to Array of strings:
var companyWebsites = input.companyWebsites as Array<string>;
Pepa J
Pepa J•2y ago
You mean something like this? I would not recommend using the as operator when it is not necessary.
interface InputSchema {
    startUrls: string[];
}

// ...

const input = await Actor.getInput<InputSchema>();
// input.startUrls is of type string[]

await crawler.run(input.startUrls.map((startUrl) => ({
    url: startUrl,
    label: "companyInfo",
})));
passive-yellow
passive-yellowOP•2y ago
yes, something like this:
await Actor.main(async () => {
    const crawler = new PlaywrightCrawler({
        requestHandler: router,
    });
    await crawler.run(
        companyWebsites.map((startUrl) => ({
            url: startUrl,
            label: "companyInfo",
        }))
    );
});
I will convert to the schema interface as well later.
I have converted to the input schema now, but it does not work:
ERROR Received one or more errors
Error: Received one or more errors
    at ArrayValidator.handle (C:\Development\ApifyWebscrapers\crawlee-trustpilot-review-actor\node_modules\@sapphire\shapeshift\src\validators\ArrayValidator.ts:21:14)
Pepa J
Pepa J•2y ago
what do you have in INPUT.json and input_schema.json?
passive-yellow
passive-yellowOP•2y ago
INPUT.json
{
    "runMode": "PRODUCTION",
    "companyWebsites": [
        "shopwagandtail.com",
        "trustpilot.com"
    ],
    "sortBy": "recency",
    "filterByStarRating": "5",
    "filterBylanguage": "en",
    "filterByVerified": "yes",
    "startFromPageNumber": "2",
    "endAtPageNumber": "3"
}
Error: Input schema is not a valid JSON (SyntaxError: Unexpected token } in JSON at position 458)
passive-yellow
passive-yellowOP•2y ago
Pepa J
Pepa J•2y ago
So maybe you need:
await crawler.run(
    companyWebsites.map((startUrl) => ({
        url: startUrl.url,
        label: "companyInfo",
    }))
);
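Whether startUrl or startUrl.url is correct depends on the shape of the input: plain strings or objects with a url property. As a rough sketch, a small normalizer can accept either shape so the crawler code does not need to care. The names StartUrlEntry and toUrlStrings are hypothetical, chosen only for this illustration.

```typescript
// An entry is either a plain URL string or an object carrying a url property.
type StartUrlEntry = string | { url: string };

// Hypothetical helper: flattens both input shapes into a plain string array.
function toUrlStrings(entries: StartUrlEntry[]): string[] {
    return entries.map((entry) => (typeof entry === "string" ? entry : entry.url));
}

const mixed: StartUrlEntry[] = ["shopwagandtail.com", { url: "trustpilot.com" }];
console.log(toUrlStrings(mixed)); // prints [ 'shopwagandtail.com', 'trustpilot.com' ]
```

The result can then be mapped to requests with a label in the same way as in the snippets in this thread.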
passive-yellow
passive-yellowOP•2y ago
apify vis
NOT SUPPORTED: option cache. Map is used as cache, schema object as key. Error: Input schema is not a valid JSON (SyntaxError: Unexpected token } in JSON at position 458)
Pepa J
Pepa J•2y ago
And companyWebsites should be an array of objects in INPUT.json:
"companyWebsites": [{ "url": "shopwagandtail.com" }, { "url": "trustpilot.com" }],
passive-yellow
passive-yellowOP•2y ago
arh, can I change it to just array of strings? I fixed schema now but still not array
Pepa J
Pepa J•2y ago
The attribute in input_schema.json requires the format with objects:
"editor": "requestListSources",
If you want to use an array of strings, you have to use a different editor, like:
"editor": "json"
passive-yellow
passive-yellowOP•2y ago
ok what would be the easiest for me and my customer?
Pepa J
Pepa J•2y ago
I am not sure I follow. Using
"editor": "requestListSources",
is totally fine, but it requires that specific input format. You mentioned you need a different input format, so I suggested using the plain JSON editor for it. I don't know your customer, and there are plenty of options; you may check https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1
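For reference, a possible input_schema.json fragment for a plain array of strings is sketched below. It assumes the field is named companyWebsites as in the INPUT.json above; the title and description texts are placeholders. Besides "json", the specification also lists a "stringList" editor for array fields, which may be friendlier for non-technical users than editing raw JSON.

```json
{
    "title": "Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "companyWebsites": {
            "title": "Company websites",
            "type": "array",
            "description": "Domains of the companies to process, one per line.",
            "editor": "stringList"
        }
    },
    "required": ["companyWebsites"]
}
```

With this shape, INPUT.json keeps the plain string array and the crawler code maps over strings rather than over objects with a url property.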
passive-yellow
passive-yellowOP•2y ago
thanks it works now, but how can I use label with startUrl()?
await crawler.run(startUrls);
Pepa J
Pepa J•2y ago
Hmm, you can use .map to hardcode a single label, as I already mentioned.
passive-yellow
passive-yellowOP•2y ago
map just skips the next url in the array
Pepa J
Pepa J•2y ago
What do you mean by that?
await crawler.run(
    companyWebsites.map((startUrl) => ({
        url: startUrl.url,
        label: "companyInfo", // sets the label
    }))
);
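On the concern that map skips entries: Array.prototype.map returns exactly one output element per input element, in the same order, so a skipped URL most likely comes from somewhere else in the code. The sketch below isolates the mapping step so it can be checked on its own. The names UrlSource, LabeledRequest and toLabeledRequests are hypothetical, and the example.com URLs are placeholders.

```typescript
interface UrlSource {
    url: string;
}

interface LabeledRequest {
    url: string;
    label: string;
}

// Hypothetical helper: attaches the same label to every source, one request per source.
function toLabeledRequests(sources: UrlSource[], label: string): LabeledRequest[] {
    return sources.map((source) => ({ url: source.url, label }));
}

const requests = toLabeledRequests(
    [{ url: "https://example.com/a" }, { url: "https://example.com/b" }],
    "companyInfo",
);
console.log(requests.length); // prints 2
```

The returned array has the url and label fields used in the snippets above, so it could be passed straight to crawler.run(...).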
passive-yellow
passive-yellowOP•2y ago
hmm it seems to work, I will check tomorrow, thanks 🙂
