Yes, it's all wrong, because:
a) recall is designed to measure binary relevance, but vector similarity scores aren't good relevance judgments, and they aren't binary.
b) most models optimise purely for distance, which makes nDCG look great but causes content to clump together. That loses local ranking precision: the noise in the ordering the embeddings produce is significantly greater than the approximation error introduced by the ANN index.
c) bi-encoders have significantly higher error than cross-encoders (see the sketch below). Basically every vector DB is spending at least an order of magnitude more resources than it needs to optimising bi-encoder efficiency, and the bi-encoder score is the wrong target anyway.
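To make the bi-encoder vs cross-encoder distinction concrete, here's a minimal sketch assuming the sentence-transformers library; the model names are just common examples, not anything specific to the article:

```python
# Bi-encoder vs cross-encoder scoring, using sentence-transformers.
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "how do I reset my password"
docs = ["Steps to reset a forgotten password",
        "Password strength best practices"]

# Bi-encoder: query and docs are embedded independently, then compared.
# Fast (docs can be pre-embedded), but each side is scored blind.
bi = SentenceTransformer("all-MiniLM-L6-v2")
q = bi.encode(query, normalize_embeddings=True)
d = bi.encode(docs, normalize_embeddings=True)
bi_scores = d @ q  # cosine similarity on normalized embeddings

# Cross-encoder: the model reads each (query, doc) pair jointly.
# Much slower, but considerably more accurate relevance judgments.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross.predict([(query, doc) for doc in docs])
```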
Yeah, basically all the vector "database" solutions on the market have chosen data-dependent indexes, so you need the data upfront. Imagine if regular databases needed all the data upfront before they could build an index. It's kind of crazy...
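For anyone unfamiliar with the distinction, here's a sketch using faiss; the index types and sizes are illustrative, not what any particular vendor uses:

```python
# Data-dependent vs data-independent indexing in faiss.
import numpy as np
import faiss

d = 128
xb = np.random.default_rng(0).standard_normal((10_000, d)).astype(np.float32)

# Data-dependent: an IVF index must be trained on (a sample of) the
# corpus before it can accept vectors at all.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)
ivf.train(xb)  # without this pass over the data, add() would fail
ivf.add(xb)

# Data-independent: a random-projection LSH index needs no training
# pass, so vectors can be indexed as they arrive.
lsh = faiss.IndexLSH(d, 64)  # 64-bit hashes
lsh.add(xb)
```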
You pay a decent cost to compute the hash; it's a compression algorithm of sorts. But the hashed data is a fraction of the size and comparison is far faster. If you do many comparisons, or compare the same items more than once, you amortise the hashing cost very quickly.
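A minimal sketch of why that amortises, assuming only NumPy; the random-hyperplane hash and the sizes are illustrative:

```python
# One-off hashing cost up front, then cheap Hamming-distance comparisons.
import numpy as np

rng = np.random.default_rng(0)
dim, bits = 768, 256
vectors = rng.standard_normal((10_000, dim)).astype(np.float32)

# One-off cost: project onto random hyperplanes and pack the sign bits.
# Each vector shrinks from 3 KB of float32 to a 32-byte code.
planes = rng.standard_normal((dim, bits)).astype(np.float32)
codes = np.packbits(vectors @ planes > 0, axis=1)

query = rng.standard_normal(dim).astype(np.float32)
qcode = np.packbits(query @ planes > 0)

# Hamming distance: XOR the packed bytes and count the set bits.
hamming = np.unpackbits(np.bitwise_xor(codes, qcode), axis=1).sum(axis=1)

# Versus the full-precision comparison over ~100x more data.
l2 = np.linalg.norm(vectors - query, axis=1)
```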
I was part of the team behind the above article. Happy to answer questions.
In terms of accuracy, it totally depends on the resolution you need. We can get >99% agreement with exact L2 search waaaaay faster, with a tenth of the memory overhead. For what we are doing, that is the perfect trade-off.
As for LSH, we tried random-projection hashing and quantization and were always disappointed.
So it seems like the neural network producing the neural hash is still a standard CNN operating on the usual vector representations? And then the learned hash gets used in a downstream problem...
Or is there actually some interesting hash-based neural algorithm lurking around somewhere?
Network-based hashing is great for maximising the information quality of the hash (compared to other LSH methods). It works well for compressing existing vectors very efficiently.
Very soon, things like language embeddings will skip the float vectors and the network will output hashes directly. These are much faster, because the network can learn to spend more bits where it needs resolution, as opposed to using floatXX for every dimension. It's amazing to see it work, but it's not fully there yet.
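To give a feel for the "compress existing vectors with a network" idea, here's a toy sketch assuming PyTorch; the architecture and loss are illustrative guesses, not our actual model. A small head maps float embeddings to k-bit codes, using a tanh relaxation during training and hard sign() bits at inference:

```python
# Learned hashing head: tanh-relaxed codes in training, sign bits at inference.
import torch
import torch.nn as nn

class HashHead(nn.Module):
    def __init__(self, dim: int = 384, bits: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh keeps the code differentiable during training
        return torch.tanh(self.proj(x))

    def hash(self, x: torch.Tensor) -> torch.Tensor:
        # hard binarisation at inference: {-1, +1} per bit
        return torch.sign(self.proj(x))

def similarity_preserving_loss(codes, sims):
    # push code dot products toward the original cosine similarities
    bits = codes.shape[1]
    return ((codes @ codes.T / bits - sims) ** 2).mean()

# tiny training loop on random data, just to show the moving parts
head = HashHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
x = torch.randn(256, 384)
x = x / x.norm(dim=1, keepdim=True)
sims = x @ x.T  # cosine similarities of the original embeddings
for _ in range(100):
    opt.zero_grad()
    loss = similarity_preserving_loss(head(x), sims)
    loss.backward()
    opt.step()
```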
Hello! First I would like to say this is a very cool writeup. I'm not a computer scientist but do dabble a bit in neural networks. Is it possible this could be used to build a convolutional neural network?
Disclaimer: I work at Algolia.