Short answer is that you need to consider strategy and how to shape the data in your training phase. Similarity is exactly what the name suggests - mathematical similarity in terms of position, generally returned from a cosine (or dot product) calculation - giving you an approximation of similarity based on the positions, directions and magnitudes of the vectors involved. The values that make the calculation possible come from the embedding model you use, which is where the real magic takes place.
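
To make the "pure math" part concrete, here is a minimal sketch of the cosine similarity calculation itself. The two vectors below are made-up placeholders standing in for real embedding outputs, not anything a model actually produced:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of magnitudes: the cosine of the angle between the vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for real embedding outputs
query_vec = np.array([0.12, -0.48, 0.33, 0.07])
doc_vec = np.array([0.10, -0.51, 0.29, 0.15])

print(cosine_similarity(query_vec, doc_vec))  # ~0.99, because the two vectors point in nearly the same direction
```

There is no reasoning step anywhere in that function - just arithmetic over the numbers the embedding model gave you.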

The only way for similarity to be accurate is for the initial embedding pass to place the text in the vector space in a way that reflects the relationships between its tokens. So when you ask for a similarity score for 'What are the opening hours', the similarity of that ENTIRE sentence to the ENTIRE embedded sentence is what is being computed - you're comparing a 5-token query against a 33-token embedded passage, so you should expect a much lower score.
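
If it helps, here is a rough sketch of what that whole-sentence comparison looks like in practice. I'm assuming the OpenAI Python client and the text-embedding-3-small model purely for illustration (the `embed` helper and the store-hours chunk are made up); any other embedding model behaves the same way:

```python
import numpy as np
from openai import OpenAI  # assumption: OpenAI embeddings, but any embedding model works the same way

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Hypothetical helper: return one embedding vector for one piece of text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

query = "What are the opening hours"
# A longer stored chunk - the WHOLE sentence becomes one vector, not its individual tokens
chunk = ("Our store is open Monday to Friday from 9am until 6pm, "
         "Saturdays from 10am until 4pm, and closed on Sundays and public holidays.")

q_vec, c_vec = embed(query), embed(chunk)
score = float(np.dot(q_vec, c_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(c_vec)))
print(score)  # typically well below 1.0, even though the chunk clearly answers the query
```

That length and wording mismatch between query and chunk is exactly why well-matched results often come back with scores that look "low" at first glance.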

The 'similarity' is not some magical logic that is reasoning over what you have asked - it is a pure math calculation.