I'm using LangChain and Vectorize with the ada embedding model from OpenAI. When I do a similarity search, I'm getting repeated entries returned. What am I missing?
15 Replies
Jerome (5mo ago)
To clarify, the repeated entries returned are different vectors present multiple times in your index, correct?
Basil 巴兹尔 (5mo ago)
I don't think so. I was using a hashing function to create the ids. I hashed what came back and each repeat was identical. I deleted the first database thinking it was something on my end but it happened again.
Basil 巴兹尔 (5mo ago)
Here's the script if you want to check it's not a mistake on my end. I'm no Python programmer. https://github.com/anaxios/text_prepare_for_vectorizing In main.py, lines 21 and 22 are the only changes I made. I was getting the multiples when it's set to "split by pages".
Basil 巴兹尔 (5mo ago)
Actually, I just tested the endpoint I set up and it's still returning repeats. https://langchain-workers.derelict.workers.dev/vector?get=mankind
Jerome (5mo ago)
Could you please verify whether the Vectorize results for the repeated entries have different IDs or the same ID?
Basil 巴兹尔 (5mo ago)
how do I go about that?
Jerome (5mo ago)
By querying the Vectorize index directly, I suppose, or by inspecting the raw Vectorize results that LangChain pulls, if you can observe them.
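One way to run that check, as a minimal sketch: it assumes you have already pulled the raw matches out as a list of dicts with `id` and `text` keys. Those keys and the `find_repeats` helper are hypothetical, not part of Vectorize or LangChain, so adapt them to whatever your raw results look like. It groups the matches by a hash of their text and reports, for each text that came back more than once, the IDs it was returned under.

```python
import hashlib
from collections import defaultdict


def find_repeats(matches):
    """Return {text_hash: sorted unique IDs} for every text seen more than once.

    `matches` is assumed to be a list of dicts shaped like
    {"id": "...", "text": "..."} (an assumption, not a Vectorize type).
    """
    ids_by_text = defaultdict(list)
    for match in matches:
        digest = hashlib.sha256(match["text"].encode("utf-8")).hexdigest()
        ids_by_text[digest].append(match["id"])
    # Keep only the texts that were returned more than once.
    return {h: sorted(set(ids)) for h, ids in ids_by_text.items() if len(ids) > 1}


if __name__ == "__main__":
    # Made-up sample data, only to show the two possible outcomes.
    sample = [
        {"id": "a", "text": "mankind was born on Earth"},
        {"id": "b", "text": "mankind was born on Earth"},
        {"id": "c", "text": "it was never meant to die here"},
    ]
    for digest, ids in find_repeats(sample).items():
        verdict = "same ID" if len(ids) == 1 else "different IDs"
        print(digest[:12], verdict, ids)
```

If a repeated text shows up under different IDs, it was most likely stored more than once. If it shows up under a single ID, the index holds one copy and the duplication is happening somewhere after the query.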
Basil 巴兹尔 (5mo ago)
I noticed my script is outputting the entries twice, but they all have the same IDs. I need to look up how to query Vectorize directly. LangChain doesn't support returning IDs from search queries, AFAIK. I was just looking over the API: it wasn't returning anything for the IDs, and I noticed the keys are too long. Looks like I have a few kinks to work out. Does the API truncate the IDs if they are too long? I didn't get an error response.
Jerome (5mo ago)
No truncation; inserts/upserts containing IDs that are too long are rejected.
Basil 巴兹尔 (5mo ago)
Hmm, were the entries I made stored with a random ID or something?
Jerome (5mo ago)
Yes, IDs are optional; if one isn't provided, a random UUID is plugged in in its place.
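That would explain repeats if the same chunks were loaded more than once without IDs. Here is a toy in-memory model of the effect. It is not the Vectorize client: `ToyIndex` and `load` are made up for illustration, and the model assumes that an upsert with an existing ID overwrites that entry rather than adding a new one.

```python
import uuid


class ToyIndex:
    """Hypothetical stand-in for a vector index, keyed by ID."""

    def __init__(self):
        self.entries = {}

    def upsert(self, text, vector_id=None):
        # Mirrors the behaviour described above: a missing ID gets a random UUID.
        key = vector_id if vector_id is not None else str(uuid.uuid4())
        self.entries[key] = text  # an existing key is overwritten, not duplicated
        return key


def load(index, chunks, with_ids):
    """Upsert every chunk, either with a stable ID or with no ID at all."""
    for n, chunk in enumerate(chunks):
        index.upsert(chunk, vector_id=f"chunk-{n}" if with_ids else None)


chunks = ["page one", "page two"]

no_ids = ToyIndex()
load(no_ids, chunks, with_ids=False)
load(no_ids, chunks, with_ids=False)  # second run of the same script

stable_ids = ToyIndex()
load(stable_ids, chunks, with_ids=True)
load(stable_ids, chunks, with_ids=True)  # second run overwrites the first

print(len(no_ids.entries), len(stable_ids.entries))  # prints "4 2"
```

Without IDs, every run adds a fresh copy of each chunk under a new UUID, so a similarity search can return the same text several times. With stable IDs, reruns are harmless.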
Basil 巴兹尔 (5mo ago)
This makes sense now! Thank you for your help. This whole domain is really new to me. Might I ask, what is a normal way to make an id for an entry? Is hashing the text a decent way to go?
Jerome (5mo ago)
Usually you'd want to use the ID to tether your vector to the original data it derives from, likely an ID from an external system: a product ID, the title of a book, the path of an image, and so on. If your data is new content that isn't tethered to an external system's ID, like embeddings provided by LLMs, it's a matter of finding a value that uniquely identifies the data the vector derives from. In that case a hash of the text provided to the LLM is fine, for instance.
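A minimal sketch of the hash-of-the-text approach, using only Python's standard library. The `make_vector_id` helper and the `MAX_ID_BYTES` value are assumptions for illustration; check the Vectorize limits documentation for the actual maximum ID length, since IDs that are too long are rejected rather than truncated.

```python
import hashlib

# Assumption: placeholder value, confirm the real limit in the Vectorize docs.
MAX_ID_BYTES = 64


def make_vector_id(text, max_bytes=MAX_ID_BYTES):
    """Return a deterministic ID for `text`: the same text always gives the same ID."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # A hex digest is pure ASCII, so one character is one byte; trim it to fit.
    return digest[:max_bytes]


if __name__ == "__main__":
    first = make_vector_id("mankind was born on Earth")
    again = make_vector_id("mankind was born on Earth")
    print(first == again)  # the same text maps to the same ID
    print(len(make_vector_id("some chunk", max_bytes=32)))  # trimmed to 32 characters
```

Because the ID depends only on the text, re-running the loading script upserts the same chunk under the same ID instead of creating another copy. Trimming the digest to 32 hex characters still leaves 128 bits, which keeps accidental collisions very unlikely.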
Basil 巴兹尔 (5mo ago)
Thank you!
Jerome (5mo ago)
anytime!