I need some info on the security due diligence around Vectorize: is the data encrypted, is there encryption at rest, and what about GDPR, CCPA, and HIPAA compliance?
Is there a doc or something we can give to our infosec guys?
Assuming that Vectorize is built on top of existing primitives like Workers, Durable Objects, and Workers KV, there should be documentation that already covers those.
Generally, all of Cloudflare's offerings are covered under Standard Contractual Clauses with regard to GDPR.
Not sure if you guys have any guidance on optimal chunk size. I think LangChain's text splitter defaults to 1000 characters with 200 overlap, and some Supabase pgvector articles say roughly the same: 500-1000 chars.
Just remember that your chunk size is going to be limited by the input token max of the embedding model if you're using CF-hosted models for that. Generally it appears to be around 170 words, based on 1-3 tokens per word: https://developers.cloudflare.com/workers-ai/models/embedding/
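To make the numbers concrete, here's a minimal TypeScript sketch of that kind of splitter (a hypothetical helper, not part of any Cloudflare or LangChain API), using the 1000-char / 200-overlap defaults mentioned above:

```typescript
// Naive character-based chunker with overlap, mirroring the
// LangChain-style defaults discussed above. Real splitters also
// try to break on sentence/paragraph boundaries; this one doesn't.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // each chunk starts 800 chars after the last
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded separately; just keep each one under the model's input token max.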
Hi, first of all, thank you very much for your new AI direction! I have a question about vector database best practices. I added some information about a company and tried to search for it using different questions. For example, one of the embeddings is "Store opening hours: Monday to Saturday: 9:00 AM to 9:00 PM, Sunday: 10:00 AM to 6:00 PM". If I ask a question like "Are you open on Sundays", the similarity score is quite low, about 0.62, and it doesn't pass SIMILARITY_CUTOFF, so the final answer is wrong. Do you have any advice on how I should prepare my texts before making embeddings? Should I create separate embeddings for every day of the week? But in that case the question "What are the opening hours?" might not return the correct answer. Or do I need to add every day separately, plus my initial embedding with all days together? If I try to add every combination separately, I'll end up with a mess in my database and lose control of it. Please share any suggestions based on your experience. Thanks a lot.
Not sure what you are thinking here - the vectors are a numerical representation of the data, so you need to store the actual data somewhere like a D1 instance (could be other data sources as well, such as R2 or KV). The point of metadata is to let you connect those embeddings back to a data source on retrieval: get back the vector response, look at the metadata, then look up the data in D1 using SQL.
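A rough sketch of the metadata half of that flow (the `Match` shape is simplified, and the `docId` metadata key and cutoff value are made-up conventions, not anything Vectorize mandates):

```typescript
// Simplified shape of a Vectorize query match; metadata carries
// the pointer back to the source row ("docId" is our own key choice).
interface Match {
  id: string;
  score: number;
  metadata?: Record<string, string>;
}

// Pull the source-record IDs out of the matches that clear the
// cutoff, ready to be looked up in D1 (or R2/KV) afterwards.
function docIdsFromMatches(matches: Match[], cutoff = 0.75): string[] {
  const ids: string[] = [];
  for (const m of matches) {
    const docId = m.metadata?.docId;
    if (m.score >= cutoff && docId) ids.push(docId);
  }
  return ids;
}
```

In a Worker you'd feed this the matches from a Vectorize index query (asking it to return metadata), then run something like `SELECT * FROM docs WHERE id IN (...)` against D1 with the resulting IDs.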
Short answer is that you need to consider strategy and how you shape the data when you embed it. Similarity is exactly what the name suggests: mathematical similarity in terms of position, generally computed with cosine (or dot product) equations, giving you an approximation of similarity based on the positions, directions, and magnitudes of the vectors. The values that make the calculation possible come from the embedding model you use, which is where the real magic takes place.
The only way for similarity to come out accurately is for the initial embedding pass to capture the tokens' relationships spatially. So when you ask for a similarity score for "What are the opening hours", the similarity of that ENTIRE sentence to the ENTIRE embedded sentence is what is being computed: you're comparing roughly 5 tokens in your query against 33 tokens in your embedding, so you must expect a much lower score.
The 'similarity' is not some magical logic that is reasoning over what you have asked, it is a pure math calculation.
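To make "pure math calculation" concrete, this is the whole cosine formula on toy vectors (the numbers are made up; real embeddings have hundreds of dimensions):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|) - the same score a
// cosine-metric vector index returns. No reasoning, just geometry.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Two vectors pointing the same way score 1.0, orthogonal ones score 0.0, and everything in between is just an angle; whether "Are you open on Sundays" lands near "Store opening hours: ..." is entirely up to the embedding model's placement of those sentences.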
If it's a one-off thing for a client, I'd put the opening hours in the system message. E.g. if the question relates to the opening hours, respond based on the following opening hours...