Nearest neighbour search on Spark

Hnswlib Spark integrates HNSW with Spark MLlib, delivering scalable nearest neighbor search for Python and Scala.


K-nearest neighbors (KNN) search finds the k data points most similar to a given query point. It is widely applied in recommendation systems, anomaly detection, classification, and clustering, often over high-dimensional vectors. Similarity is measured with a metric, typically Euclidean distance or cosine similarity, and the points closest to the query are returned. A brute-force search compares the query against every point in the dataset, which becomes expensive as the dataset grows. Approximate methods such as Hierarchical Navigable Small World (HNSW) graphs trade a small amount of accuracy for much faster searches, making KNN feasible for large datasets. In distributed environments like Apache Spark, the search can be parallelized across a cluster, so nearest neighbor retrieval scales to very large collections.
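To make the brute-force baseline concrete, here is a minimal sketch in plain Python. It is not part of Hnswlib Spark and does not use its API; the function names (`euclidean`, `brute_force_knn`) and the toy data are hypothetical and chosen only for illustration. It shows the exact search that approximate indexes such as HNSW are designed to avoid.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(items, query, k, distance=euclidean):
    """Return the k (id, distance) pairs closest to query, nearest first.

    Every item is compared against the query, so a single search does
    distance work proportional to n * d for n items of dimension d.
    """
    scored = ((item_id, distance(vector, query)) for item_id, vector in items.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy data: item id -> feature vector.
items = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
    "d": [5.0, 5.0],
}

neighbors = brute_force_knn(items, query=[0.9, 0.1], k=2)
print(neighbors)  # ids "b" then "a", each paired with its distance
```

Because each query scans the full dataset, the cost grows linearly with the number of items. That is acceptable for a few thousand vectors but not for millions, which is where an approximate index and a distributed engine become useful.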

About the project

License

Hnswlib Spark is distributed under the Apache License, Version 2.0.

Contributing

When contributing to this repository, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository before making a change.

Thank you to the contributors of Hnswlib Spark

  • jelmerk
  • scala-steward
  • ykaitao
  • nur1popcorn
  • ashfaq92

Code of Conduct

Hnswlib Spark adheres to No Code of Conduct. We are all adults. We accept anyone’s contributions. Nothing else matters.

View our Code of Conduct on our GitHub repository.