Pokey5324
Hello all - I am looking for guidance with this task. Any tips on how to use Firecrawl to crawl a website and turn it into RDF that is stored in Amazon Neptune?
cc @Harsh
Hi @Ckilborn
Here is a high-level overview (code sketches for the key steps follow the list):
1. Use Firecrawl to crawl the site and get structured page content (JSON / markdown).
2. Convert each page’s structured content to RDF triples (map fields to triples - pick a vocabulary like schema.org or dcterms).
3. Write the triples in a Neptune-supported RDF file (Turtle, N-Triples, or N-Quads).
4. Validate the files before loading - Neptune accepts Turtle, N-Triples, N-Quads, and RDF/XML, and a quick parse check prevents bulk-load failures.
5. Upload those files to an S3 bucket.
6. Give Neptune permission to read the S3 objects (attach an IAM role to the cluster and make sure it can reach S3, e.g. through a VPC gateway endpoint).
7. Use Neptune’s bulk loader to import the files from S3 into your Neptune cluster - use AWS CLI for simplicity.
8. Verify the import by running SPARQL queries against Neptune’s SPARQL endpoint.
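To make steps 1-3 concrete, here is a minimal Python sketch: it starts a crawl through Firecrawl's v1 REST API, polls for completion, and maps each page's metadata to schema.org triples with rdflib. The endpoint paths and field names (`sourceURL`, `markdown`) follow the current Firecrawl docs but may drift between API versions; `example.com` and the crawl limit are placeholders.
```python
# Minimal sketch: crawl with Firecrawl's v1 REST API, then map the pages to
# RDF triples with rdflib. FIRECRAWL_API_KEY, example.com, and the crawl
# limit are placeholders; check the Firecrawl docs for current field names.
import os
import time

import requests
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

API = "https://api.firecrawl.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}
SCHEMA = Namespace("https://schema.org/")

# Start a crawl job and poll until it finishes (large crawls paginate
# results via a "next" URL, which this sketch ignores).
job = requests.post(
    f"{API}/crawl",
    headers=HEADERS,
    json={"url": "https://example.com", "limit": 50,
          "scrapeOptions": {"formats": ["markdown"]}},
).json()
while True:
    status = requests.get(f"{API}/crawl/{job['id']}", headers=HEADERS).json()
    if status.get("status") == "completed":
        break
    time.sleep(5)

# Map each page to triples, using schema.org as the vocabulary (step 2).
g = Graph()
g.bind("schema", SCHEMA)
for page in status.get("data", []):
    meta = page.get("metadata", {})
    url = meta.get("sourceURL") or meta.get("url")
    if not url:
        continue
    subject = URIRef(url)
    g.add((subject, RDF.type, SCHEMA.WebPage))
    if meta.get("title"):
        g.add((subject, SCHEMA.name, Literal(meta["title"])))
    if meta.get("description"):
        g.add((subject, SCHEMA.description, Literal(meta["description"])))
    if page.get("markdown"):
        g.add((subject, SCHEMA.text, Literal(page["markdown"])))

# Serialize in a Neptune-supported format (step 3).
g.serialize(destination="pages.ttl", format="turtle")
```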
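For steps 4-7, a sketch that parse-checks the Turtle file, uploads it with boto3, and starts a bulk-load job via Neptune's loader HTTP API. The endpoint, bucket name, role ARN, and region are placeholders you would swap for your own; the request body fields come from the Neptune Loader API reference, and the loader endpoint is only reachable from inside the cluster's VPC.
```python
# Sketch for steps 4-7: parse-check the Turtle file, upload it to S3, then
# start a Neptune bulk-load job. The endpoint, bucket, role ARN, and region
# below are placeholders.
import boto3
import requests
from rdflib import Graph

# Step 4: a round-trip parse catches malformed RDF before it hits the loader.
Graph().parse("pages.ttl", format="turtle")

# Step 5: upload the file to S3.
boto3.client("s3").upload_file("pages.ttl", "my-rdf-bucket", "crawl/pages.ttl")

# Step 7: POST to the loader endpoint (iamRoleArn is the role from step 6,
# attached to the cluster with read access to the bucket).
resp = requests.post(
    "https://your-neptune-endpoint:8182/loader",
    json={
        "source": "s3://my-rdf-bucket/crawl/pages.ttl",
        "format": "turtle",
        "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
        "region": "us-east-1",
        "failOnError": "TRUE",
    },
)
print(resp.json())  # returns a loadId; poll GET /loader/<loadId> for status
```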
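And for step 8, a quick sanity query. This assumes IAM database auth is disabled on the cluster; with it enabled, the request must be SigV4-signed (e.g. with awscurl or botocore's SigV4Auth).
```python
# Sketch for step 8: run a sanity SPARQL query against Neptune's endpoint.
import requests

query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"
resp = requests.post(
    "https://your-neptune-endpoint:8182/sparql",
    data={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json())  # a non-zero ?triples count confirms the load worked
```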
hey @Ckilborn Did that work, or can I help further? I'm around if you need me 🙂
Appreciate the follow up. I am slowly coding it. Do you have any examples to share? Even if the example is a small part of the process it would help
sure, here are some resources. Hope these help
Firecrawl crawl docs: https://docs.firecrawl.dev/features/crawl
Neptune RDF load data formats (Turtle, N-Triples, N-Quads, RDF/XML): https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-rdf.html
Neptune “Using the bulk loader” overview: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html
Neptune Loader API details: https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html