Snapshot of https://www.pinecone.io/learn/vector-database/?__readwiseLocation= saved on 2025-1-26 19:57.

What is a Vector Database?

A vector database indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities such as CRUD operations, metadata filtering, horizontal scaling, and serverless operation.


We’re in the midst of the AI revolution. It’s upending any industry it touches, promising great innovations - but it also introduces new challenges. Efficient data processing has become more crucial than ever for applications that involve large language models, generative AI, and semantic search.

All of these new applications rely on vector embeddings, a type of vector data representation that carries semantic information critical for an AI system to gain understanding and maintain a long-term memory it can draw upon when executing complex tasks.

Embeddings are generated by AI models (such as Large Language Models) and have many attributes or features, making their representation challenging to manage. In the context of AI and machine learning, these features represent different dimensions of the data that are essential for understanding patterns, relationships, and underlying structures.

That is why we need a specialized database designed specifically for handling this data type. Vector databases like Pinecone fulfill this requirement by offering optimized storage and querying capabilities for embeddings. Vector databases combine the capabilities of a traditional database, which standalone vector indexes lack, with specialized handling of vector embeddings, which traditional scalar-based databases lack.

The challenge of working with vector data is that traditional scalar-based databases can’t keep up with the complexity and scale of such data, making it difficult to extract insights and perform real-time analysis. That’s where vector databases come into play – they are intentionally designed to handle this type of data and offer the performance, scalability, and flexibility you need to make the most out of your data.

The next generation of vector databases is introducing more sophisticated architectures to scale efficiently and keep costs down. Serverless vector databases do this by separating the cost of storage from the cost of compute, enabling low-cost knowledge support for AI.

With a vector database, we can add knowledge to our AIs, like semantic information retrieval, long-term memory, and more. The diagram below gives us a better understanding of the role of vector databases in this type of application:

Vector Database

Let’s break this down:

  1. First, we use the embedding model to create vector embeddings for the content we want to index.
  2. The vector embedding is inserted into the vector database, with some reference to the original content the embedding was created from.
  3. When the application issues a query, we use the same embedding model to create embeddings for the query and use those embeddings to query the database for similar vector embeddings. As mentioned before, those similar embeddings are associated with the original content that was used to create them.
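
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Pinecone or any real embedding model works: `toy_embed` is a hypothetical stand-in that hashes words into buckets, and `ToyVectorStore` is an assumed in-memory structure that scans every record.

```python
import math


def toy_embed(text, dims=16):
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        # Deterministic bucket per word (a real model learns semantics instead).
        vec[sum(ord(ch) for ch in word) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class ToyVectorStore:
    def __init__(self):
        self.records = {}  # id -> (embedding, original content)

    def upsert(self, doc_id, text):
        # Steps 1 and 2: embed the content and store it next to the original.
        self.records[doc_id] = (toy_embed(text), text)

    def query(self, text, top_k=1):
        # Step 3: embed the query with the same model, then rank by cosine
        # similarity (a plain dot product, because vectors are unit length).
        q = toy_embed(text)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc_id, content)
            for doc_id, (vec, content) in self.records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.upsert("doc1", "vector databases store embeddings")
store.upsert("doc2", "bananas are yellow fruit")
best = store.query("how do databases store embeddings", top_k=1)
print(best[0][1], best[0][2])
```

The query returns the stored reference to the original content, which is what the application ultimately uses.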

What’s the difference between a vector index and a vector database?

Standalone vector indices like FAISS (Facebook AI Similarity Search) can significantly improve the search and retrieval of vector embeddings, but they lack capabilities that exist in any database. Vector databases, on the other hand, are purpose-built to manage vector embeddings, providing several advantages over using standalone vector indices:

  1. Data management: Vector databases offer well-known and easy-to-use features for data storage, like inserting, deleting, and updating data. This makes managing and maintaining vector data easier than using a standalone vector index like FAISS, which requires additional work to integrate with a storage solution.
  2. Metadata storage and filtering: Vector databases can store metadata associated with each vector entry. Users can then query the database using additional metadata filters for finer-grained queries.
  3. Scalability: Vector databases are designed to scale with growing data volumes and user demands, providing better support for distributed and parallel processing. Standalone vector indices may require custom solutions to achieve similar levels of scalability (such as deploying and managing them on Kubernetes clusters or other similar systems). Modern vector databases also use serverless architectures to optimize cost at scale.
  4. Real-time updates: Vector databases often support real-time data updates, allowing for dynamic changes to the data to keep results fresh, whereas standalone vector indexes may require a full re-indexing process to incorporate new data, which can be time-consuming and computationally expensive. Advanced vector databases can use performance upgrades available via index rebuilds while maintaining freshness.
  5. Backups and collections: Vector databases handle the routine operation of backing up all the data stored in the database. Pinecone also allows users to selectively choose specific indexes that can be backed up in the form of “collections,” which store the data in that index for later use.
  6. Ecosystem integration: Vector databases can more easily integrate with other components of a data processing ecosystem, such as ETL pipelines (like Spark), analytics tools (like Tableau and Segment), and visualization platforms (like Grafana), streamlining the data management workflow. They also integrate easily with other AI-related tooling like LangChain, LlamaIndex, Cohere, and many others.
  7. Data security and access control: Vector databases typically offer built-in data security features and access control mechanisms to protect sensitive information, which may not be available in standalone vector index solutions. Multitenancy through namespaces allows users to partition their indexes fully and even create fully isolated partitions within their own index.

In short, a vector database provides a superior solution for handling vector embeddings by addressing the limitations of standalone vector indices, such as scalability challenges, cumbersome integration processes, and the absence of real-time updates and built-in security measures, ensuring a more effective and streamlined data management experience.


How does a vector database work?

We all know how traditional databases work (more or less)—they store strings, numbers, and other types of scalar data in rows and columns. On the other hand, a vector database operates on vectors, so the way it’s optimized and queried is quite different.

In traditional databases, we usually query for rows whose values exactly match our query. In vector databases, we apply a similarity metric to find the vectors most similar to our query.

A vector database uses a combination of different algorithms that all participate in Approximate Nearest Neighbor (ANN) search. These algorithms optimize the search through hashing, quantization, or graph-based search.

These algorithms are assembled into a pipeline that provides fast and accurate retrieval of the neighbors of a queried vector. Since the vector database provides approximate results, the main trade-offs we consider are between accuracy and speed. The more accurate the result, the slower the query will be. However, a good system can provide ultra-fast search with near-perfect accuracy.

Here’s a common pipeline for a vector database:

Vector Database pipeline
  1. Indexing: The vector database indexes vectors using an algorithm such as PQ, LSH, or HNSW (more on these below). This step maps the vectors to a data structure that will enable faster searching.
  2. Querying: The vector database compares the query vector to the indexed vectors in the dataset to find the nearest neighbors, applying the similarity metric used by that index.
  3. Post-processing: In some cases, the vector database retrieves the final nearest neighbors from the dataset and post-processes them to return the final results. This step can include re-ranking the nearest neighbors using a different similarity measure.
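
As a rough sketch of the querying and post-processing stages, the snippet below retrieves candidates with one metric (dot product) and re-ranks them with another (Euclidean distance). The function name `search_with_rerank` and the toy data are assumptions for illustration. The candidate stage here is a brute-force scan standing in for an ANN index.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_with_rerank(query, vectors, n_candidates, top_k):
    # Querying stage: keep the n_candidates vectors with the largest dot product.
    candidates = sorted(
        vectors, key=lambda vid: dot(query, vectors[vid]), reverse=True
    )[:n_candidates]
    # Post-processing stage: re-rank the candidates by Euclidean distance.
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:top_k]


vectors = {"a": [1.0, 0.0], "b": [3.0, 0.5], "c": [0.9, 0.1], "d": [-1.0, 0.0]}
result = search_with_rerank([1.0, 0.0], vectors, n_candidates=3, top_k=2)
print(result)
```

In this toy data, "b" has the largest dot product with the query because of its magnitude, yet the re-ranking step places "a" and "c" ahead of it because they are closer in Euclidean terms.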

Serverless Vector Databases

Serverless represents the next evolution of vector databases. The above architectures get us to a vector database architecture that is accurate, fast, scalable, but expensive. This architecture is what we see in first-generation vector DBs. With the rise of AI use cases where cost and elasticity are increasingly important, a second generation of serverless vector databases is needed.

First-generation vector DBs have three critical pain points that a serverless vector database solves:

  • Separation of storage from compute: To optimize costs, compute should only be used when needed. That means decoupling the index storage from queries and searching only what is needed, which becomes increasingly difficult when low latency is required.
  • Multitenancy: Handling namespaces in indexes to ensure infrequently queried namespaces do not increase costs.
  • Freshness: A vector DB needs to provide fresh data, meaning new data is queryable within a few seconds of being inserted. Note that for Pinecone Serverless, freshness can be delayed when inserting large amounts of data.

To separate storage from compute, sophisticated geometric partitioning algorithms can break an index into sub-indices, allowing us to focus the search on specific partitions:

Chart demonstrating Voronoi tessellations
Partitioning of the search space

With these partitions, the search space of a query can focus on just a few parts of a vector index rather than the full search space. Typical search behaviors will show that certain partitions are accessed more frequently than others, allowing us to dial between compute costs and cold startup times to find an optimal balance between cost and latency.
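
A minimal sketch of partitioned search follows. It assumes fixed, hand-picked centroids and a hypothetical `n_probe` parameter that controls how many partitions a query visits. A real system would learn the partitions from the data, and nothing here reflects Pinecone's actual implementation.

```python
import math


def build_partitions(vectors, centroids):
    # Assign every vector to the partition (Voronoi cell) of its nearest centroid.
    partitions = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        partitions[nearest].append(vid)
    return partitions


def partitioned_search(query, vectors, centroids, partitions, n_probe, top_k):
    # Visit only the n_probe partitions whose centroids are closest to the query.
    probe = sorted(
        range(len(centroids)), key=lambda i: math.dist(query, centroids[i])
    )[:n_probe]
    candidates = [vid for i in probe for vid in partitions[i]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:top_k]


centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
vectors = {"a": [0.5, 0.2], "b": [9.5, 9.9], "c": [0.1, 9.0], "d": [1.0, 1.0]}
partitions = build_partitions(vectors, centroids)
hits = partitioned_search([0.4, 0.4], vectors, centroids, partitions, n_probe=1, top_k=2)
print(partitions, hits)
```

With `n_probe=1` the query touches only one of the three partitions. Raising it trades extra compute for a lower chance of missing a neighbor that sits just across a partition boundary.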

When we do this partitioning, we solve the problem of separating compute and storage. However, geometric partitioning is slow at index build time, so we can run into freshness problems while we wait for new data to be correctly stored in the index.

To solve this problem, a vector database needs a separate layer called a freshness layer. The freshness layer acts as a temporary “cache” of vectors that can be queried while we wait for an index builder to place new vectors into the geometrically partitioned index.

Freshness layer diagram demonstrating the query being sent to both the freshness layer and the existing, partitioned index
The freshness layer keeps our data up to date so we can begin querying quickly.

During this process, a query router can send queries to both the index and the freshness layer — solving the freshness problem. However, it’s worth noting that the freshness layer exists in compute instances, so we cannot store the full index there. Instead, we wait for the new vectors to be inserted into the index — once complete, they are removed from the freshness layer.
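
The interplay between the freshness layer, the index builder, and the query router can be sketched as below. The class `FreshIndex` and its method names are hypothetical, and the "index" is a plain dictionary rather than a partitioned structure. The sketch only shows how newly written vectors stay queryable before the builder has processed them.

```python
import math


class FreshIndex:
    def __init__(self):
        self.index = {}      # vectors already placed in the main index
        self.freshness = {}  # recent writes still waiting for the index builder

    def upsert(self, vid, vec):
        # New writes land in the freshness layer first.
        self.freshness[vid] = vec

    def run_index_builder(self):
        # Move pending vectors into the main index, then drop them from the cache.
        self.index.update(self.freshness)
        self.freshness.clear()

    def query(self, q, top_k):
        # Query router: consult both layers; a fresh write overrides an older copy.
        merged = {**self.index, **self.freshness}
        return sorted(merged, key=lambda vid: math.dist(q, merged[vid]))[:top_k]


db = FreshIndex()
db.upsert("old", [0.0, 0.0])
db.run_index_builder()
db.upsert("new", [1.0, 1.0])
before = db.query([1.0, 1.0], top_k=1)  # "new" is visible before indexing
db.run_index_builder()
after = db.query([1.0, 1.0], top_k=1)
print(before, after, db.freshness)
```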

Finally, there is the problem of multitenancy. Many first-generation vector DBs handle multitenancy and have done so for a long time. However, multitenancy in a serverless architecture is more complex.

We must avoid colocating different types of users on the same hardware to keep costs and latencies low. If we have user A, who makes 20 queries a second almost every day on the same hardware as user B who makes 20 queries a month, user B will be stuck on compute hardware 24/7 as that is required for the constantly low latencies that user A needs.

To solve this problem, a vector database must be able to identify users with similar usage and colocate them while keeping full separation between them. Again, this can be done based on user usage metrics and automatic allocation of hot/cold infrastructure based on usage.

Taking a first-generation vector database and adding separation of storage from computing, multitenancy, and freshness gives us a new generation of modern vector databases. This architecture (paired with vector DB fundamentals) is preferred for the modern AI stack.

In the following sections, we will discuss a few of the algorithms behind the fundamentals of vector DBs and explain how they contribute to the overall performance of our database.

Algorithms

Several algorithms can facilitate the creation of a vector index. Their common goal is to enable fast querying by creating a data structure that can be traversed quickly. They will commonly transform the representation of the original vector into a compressed form to optimize the query process.

However, as a user of Pinecone, you don’t need to worry about the intricacies and selection of these various algorithms. Pinecone is designed to handle all the complexities and algorithmic decisions behind the scenes, ensuring you get the best performance and results without any hassle. By leveraging Pinecone’s expertise, you can focus on what truly matters – extracting valuable insights and delivering powerful AI solutions.

The following sections will explore several algorithms and their unique approaches to handling vector embeddings. This knowledge will empower you to make informed decisions and appreciate the seamless performance Pinecone delivers as you unlock the full potential of your application.

Random Projection

The basic idea behind random projection is to project high-dimensional vectors into a lower-dimensional space using a random projection matrix. We create a matrix of random numbers whose shape is the original dimensionality by the target (lower) dimensionality. We then calculate the dot product of the input vectors and this matrix, which yields projected vectors that have fewer dimensions than the originals but still approximately preserve their similarity.

Random Projection

When we query, we use the same projection matrix to project the query vector onto the lower-dimensional space. Then, we compare the projected query vector to the projected vectors in the database to find the nearest neighbors. Since the dimensionality of the data is reduced, the search process is significantly faster than searching the entire high-dimensional space.

Just keep in mind that random projection is an approximate method, and the projection quality depends on the properties of the projection matrix. In general, the more random the projection matrix is, the better the quality of the projection will be. But generating a truly random projection matrix can be computationally expensive, especially for large datasets. Learn more about random projection.
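
Here is a small sketch of random projection in plain Python. It assumes Gaussian matrix entries scaled by 1/sqrt(target dimensions), which is one common choice. The function names and the 64-to-8 reduction are illustrative only.

```python
import math
import random


def random_projection_matrix(in_dims, out_dims, seed=0):
    # Gaussian entries scaled so that distances are preserved in expectation.
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dims)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dims)]
            for _ in range(in_dims)]


def project(vec, matrix):
    # Dot product of a 1 x in_dims vector with the in_dims x out_dims matrix.
    out_dims = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(out_dims)]


matrix = random_projection_matrix(in_dims=64, out_dims=8)
rng = random.Random(1)
a = [rng.random() for _ in range(64)]
b = [rng.random() for _ in range(64)]
pa, pb = project(a, matrix), project(b, matrix)
print(len(pa), math.dist(a, b), math.dist(pa, pb))
```

The same `matrix` must be used for both the stored vectors and the query. The printed distances will generally differ somewhat, which is the approximation error the text describes.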

Product Quantization

Another way to build an index is product quantization (PQ), which is a lossy compression technique for high-dimensional vectors (like vector embeddings). It takes the original vector, breaks it up into smaller chunks, simplifies the representation of each chunk by creating a representative “code” for each chunk, and then puts all the chunks back together - without losing information that is vital for similarity operations. The process of PQ can be broken down into four steps: splitting, training, encoding, and querying.

Product Quantization
  1. Splitting - The vectors are broken into segments.
  2. Training - We build a “codebook” for each segment. Simply put, the algorithm generates a pool of potential “codes” that could be assigned to a vector. In practice, this “codebook” is made up of the center points of clusters created by performing k-means clustering on each of the vector’s segments. Each segment codebook has as many values as the k we use for the k-means clustering.
  3. Encoding - The algorithm assigns a specific code to each segment. In practice, we find the nearest value in the codebook to each vector segment after the training is complete. Our PQ code for the segment will be the identifier for the corresponding value in the codebook. We could use as many PQ codes as we’d like, meaning we can pick multiple values from the codebook to represent each segment.
  4. Querying - When we query, the algorithm breaks down the query vector into sub-vectors and quantizes them using the same codebook. Then, it uses the indexed codes to find the nearest vectors to the query vector.

The number of representative vectors in the codebook is a trade-off between the accuracy of the representation and the computational cost of searching the codebook. The more representative vectors in the codebook, the more accurate the representation of the vectors in the subspace, but the higher the computational cost to search the codebook. By contrast, the fewer representative vectors in the codebook, the less accurate the representation, but the lower the computational cost. Learn more about PQ.
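
The four PQ steps can be sketched with a tiny hand-rolled k-means. Everything here is an assumption made for illustration: the function names, the four toy vectors, and the choice of one code per segment with the query quantized as well. Production implementations are considerably more optimized.

```python
import math
import random


def split(vec, n_segments):
    size = len(vec) // n_segments
    return [vec[i * size:(i + 1) * size] for i in range(n_segments)]


def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train(vectors, n_segments, k):
    # One codebook (list of k centroids) per segment position.
    segmented = [split(v, n_segments) for v in vectors]
    return [kmeans([s[i] for s in segmented], k) for i in range(n_segments)]


def encode(vec, codebooks):
    # The code for a segment is the index of its nearest centroid.
    return [min(range(len(cb)), key=lambda c: math.dist(seg, cb[c]))
            for seg, cb in zip(split(vec, len(codebooks)), codebooks)]


def code_distance(code_a, code_b, codebooks):
    # Approximate distance computed from centroids only.
    return math.sqrt(sum(math.dist(cb[a], cb[b]) ** 2
                         for a, b, cb in zip(code_a, code_b, codebooks)))


vectors = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.1],
           [5.0, 5.0, 5.0, 5.0], [5.1, 5.0, 5.0, 4.9]]
codebooks = train(vectors, n_segments=2, k=2)
codes = [encode(v, codebooks) for v in vectors]
query_code = encode([0.05, 0.0, 0.0, 0.05], codebooks)
nearest = min(range(len(codes)),
              key=lambda i: code_distance(query_code, codes[i], codebooks))
print(codes, nearest)
```

Each four-dimensional vector is stored as just two small integers. The two vectors near the origin share one code and the two near (5, 5, 5, 5) share another, which shows both the compression and the loss of fine detail.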

Locality-sensitive hashing

Locality-Sensitive Hashing (LSH) is a technique for indexing in the context of an approximate nearest-neighbor search. It is optimized for speed while still delivering an approximate, non-exhaustive result. LSH maps similar vectors into “buckets” using a set of hashing functions, as seen below:

Locality-sensitive hashing

To find the nearest neighbors for a given query vector, we use the same hashing functions used to “bucket” similar vectors into hash tables. The query vector is hashed to a particular bucket and then compared with the other vectors in that same bucket to find the closest matches. This method is much faster than searching through the entire dataset because each bucket holds far fewer vectors than the whole space.

It’s important to remember that LSH is an approximate method, and the quality of the approximation depends on the properties of the hash functions. In general, the more hash functions used, the better the approximation quality will be. However, using a large number of hash functions can be computationally expensive and may not be feasible for large datasets. Learn more about LSH.
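
One common family of LSH functions uses random hyperplanes: each hash bit records which side of a plane a vector falls on. The sketch below assumes that scheme. The helper names and the three toy vectors are made up for illustration, and other LSH families work differently.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dims, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n_planes)]


def lsh_hash(vec, planes):
    # One bit per hyperplane: 1 if the vector lies on its positive side.
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
                 for plane in planes)


def build_buckets(vectors, planes):
    buckets = defaultdict(list)
    for vid, vec in vectors.items():
        buckets[lsh_hash(vec, planes)].append(vid)
    return buckets


planes = make_hyperplanes(n_planes=8, dims=3)
vectors = {
    "v": [1.0, 2.0, 3.0],
    "v_scaled": [2.0, 4.0, 6.0],       # same direction as "v"
    "v_opposite": [-1.0, -2.0, -3.0],  # opposite direction
}
buckets = build_buckets(vectors, planes)
# Only the vectors in the query's bucket are compared against the query.
candidates = buckets[lsh_hash([1.5, 3.0, 4.5], planes)]
print(candidates)
```

Vectors pointing the same way always share a bucket under this scheme, while the opposite vector lands elsewhere. Adding hyperplanes makes buckets smaller and more selective, at the cost of more hashing work.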


Hierarchical Navigable Small World (HNSW)

HNSW creates a hierarchical, multi-layer graph structure. Each node represents a vector, and the edges between nodes connect vectors that are similar. The upper layers hold only a small subset of the nodes and act as a coarse map of the data, while the bottom layer contains every vector. The algorithm builds this structure incrementally, assigning each inserted vector a randomly chosen top layer.

Hierarchical Navigable Small World (HNSW)

As each vector is inserted, the algorithm draws edges between its node and the nodes holding the most similar vectors.

Hierarchical Navigable Small World (HNSW)

When we query an HNSW index, it navigates this graph from the top layer down, at each step moving to the nodes closest to the query vector. Learn more about HNSW.
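
The core move in graph-based search is a greedy walk: from the current node, step to whichever neighbor is closer to the query, and stop when no neighbor improves. The sketch below shows that walk on a single-layer nearest-neighbor graph. It is a deliberate simplification and not full HNSW, which repeats this kind of walk across several layers and keeps a list of candidates rather than a single node. The names `build_knn_graph` and `greedy_search` are hypothetical.

```python
import math


def build_knn_graph(vectors, m):
    # Link every node to its m nearest neighbors (brute force, for illustration).
    graph = {}
    for vid, vec in vectors.items():
        others = sorted((o for o in vectors if o != vid),
                        key=lambda o: math.dist(vec, vectors[o]))
        graph[vid] = others[:m]
    return graph


def greedy_search(query, vectors, graph, entry):
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(query, vectors[n]))
        if math.dist(query, vectors[best]) >= math.dist(query, vectors[current]):
            return current  # no neighbor is closer: local optimum reached
        current = best


vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [2.0, 0.0],
           "d": [3.0, 0.0], "e": [4.0, 0.0]}
graph = build_knn_graph(vectors, m=2)
found = greedy_search([3.9, 0.0], vectors, graph, entry="a")
print(found)
```

Starting from "a", the walk hops along the graph toward the query and compares it with only a handful of nodes. The upper layers in HNSW exist to make those first hops long ones.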

Similarity Measures

Building on the previously discussed algorithms, we need to understand the role of similarity measures in vector databases. These measures are the foundation of how a vector database compares and identifies the most relevant results for a given query.

Similarity measures are mathematical methods for determining how similar two vectors are in a vector space. Similarity measures are used in vector databases to compare the vectors stored in the database and find the ones that are most similar to a given query vector.

Several similarity measures can be used, including:

  • Cosine similarity: measures the cosine of the angle between two vectors in a vector space. It ranges from -1 to 1, where 1 represents vectors pointing in the same direction, 0 represents orthogonal vectors, and -1 represents vectors that are diametrically opposed.
  • Euclidean distance: measures the straight-line distance between two vectors in a vector space. It ranges from 0 to infinity, where 0 represents identical vectors, and larger values represent increasingly dissimilar vectors.
  • Dot product: measures the product of the magnitudes of two vectors and the cosine of the angle between them. It ranges from -∞ to ∞, where a positive value represents vectors that point in the same direction, 0 represents orthogonal vectors, and a negative value represents vectors that point in opposite directions.

The choice of similarity measure affects the results obtained from a vector database. Each similarity measure has its own advantages and disadvantages, so choose the one that fits the use case and requirements. Learn more about similarity measures.
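
The three measures are short enough to write out directly. The following is a plain-Python sketch with made-up two-dimensional vectors. Libraries such as NumPy provide faster equivalents.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Dot product divided by the product of the two magnitudes.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def euclidean_distance(a, b):
    return math.dist(a, b)


a = [1.0, 0.0]
b = [0.0, 1.0]   # orthogonal to a
c = [-2.0, 0.0]  # opposite direction to a, twice the magnitude

print(cosine_similarity(a, b), cosine_similarity(a, c))
print(euclidean_distance(a, c), dot_product(a, c))
```

Note how `c` scores -1 on cosine similarity regardless of its length, while the dot product and Euclidean distance both change if `c` is scaled. That sensitivity to magnitude is one of the main practical differences between the measures.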

Filtering

Every vector stored in the database also includes metadata. In addition to the ability to query for similar vectors, vector databases can also filter the results based on a metadata query. To do this, the vector database usually maintains two indexes: a vector index and a metadata index. It then performs the metadata filtering either before or after the vector search itself, but in either case, there are difficulties that cause the query process to slow down.

Post-filtering and Pre-filtering

The filtering process can be performed either before or after the vector search itself, but each approach has its own challenges that may impact the query performance:

  • Pre-filtering: In this approach, metadata filtering is done before the vector search. While this can help reduce the search space, it may also cause the system to overlook relevant results that don’t match the metadata filter criteria. Additionally, extensive metadata filtering may slow down the query process due to the added computational overhead.
  • Post-filtering: In this approach, the metadata filtering is done after the vector search. This can help ensure that all relevant results are considered, but it may also introduce additional overhead and slow down the query process as irrelevant results need to be filtered out after the search is complete.

To optimize the filtering process, vector databases use various techniques, such as leveraging advanced indexing methods for metadata or using parallel processing to speed up the filtering tasks. Balancing the trade-offs between search performance and filtering accuracy is essential for providing efficient and relevant query results in vector databases. Learn more about vector search filtering.
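
To make the two orderings concrete, here is a toy comparison. The records, the `lang` metadata field, and both function names are invented for illustration, and the vector search is a brute-force scan rather than an ANN index.

```python
import math

records = {
    "a": ([0.0, 0.0], {"lang": "en"}),
    "b": ([0.1, 0.0], {"lang": "fr"}),
    "c": ([0.2, 0.0], {"lang": "fr"}),
    "d": ([5.0, 5.0], {"lang": "en"}),
}


def pre_filter_search(query, records, predicate, top_k):
    # Apply the metadata filter first, then search only the surviving vectors.
    allowed = [rid for rid, (_, meta) in records.items() if predicate(meta)]
    return sorted(allowed, key=lambda rid: math.dist(query, records[rid][0]))[:top_k]


def post_filter_search(query, records, predicate, top_k):
    # Search all vectors first, then drop results that fail the metadata filter.
    nearest = sorted(records, key=lambda rid: math.dist(query, records[rid][0]))[:top_k]
    return [rid for rid in nearest if predicate(records[rid][1])]


def is_english(meta):
    return meta["lang"] == "en"


pre = pre_filter_search([0.0, 0.0], records, is_english, top_k=2)
post = post_filter_search([0.0, 0.0], records, is_english, top_k=2)
print(pre, post)
```

In this sketch, pre-filtering returns two matching records, while post-filtering returns only one because a non-matching record took one of the two slots before the filter ran. Real systems often over-fetch in the post-filtering case to compensate.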

Database Operations

Unlike vector indexes, vector databases are equipped with a set of capabilities that makes them better suited to high-scale production settings. Let’s take a look at an overview of the components involved in operating the database.

Database Operations

Performance and Fault tolerance

Performance and fault tolerance are tightly related. The more data we have, the more nodes that are required - and the bigger chance for errors and failures. As is the case with other types of databases, we want to ensure that queries are executed as quickly as possible even if some of the underlying nodes fail. This could be due to hardware failures, network failures, or other types of technical bugs. This kind of failure could result in downtime or even incorrect query results.

To ensure both high performance and fault tolerance, vector databases use sharding and replication:

  1. Sharding - partitioning the data across multiple nodes. There are different methods for partitioning the data - for example, it can be partitioned by the similarity of different clusters of data so that similar vectors are stored in the same partition. When a query is made, it is sent to all the shards and the results are retrieved and combined. This is called the “scatter-gather” pattern.
  2. Replication - creating multiple copies of the data across different nodes. This ensures that even if a particular node fails, other nodes will be able to replace it. There are two main consistency models: eventual consistency and strong consistency. Eventual consistency allows for temporary inconsistencies between different copies of the data which will improve availability and reduce latency but may result in conflicts and even data loss. On the other hand, strong consistency requires that all copies of the data are updated before a write operation is considered complete. This approach provides stronger consistency but may result in higher latency.
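
The scatter-gather pattern can be sketched as follows. Shards are modeled as plain dictionaries and queried in a loop. In a real deployment each shard would live on its own node and be queried in parallel. The function names are assumptions.

```python
import heapq
import math


def shard_search(query, shard, top_k):
    # Each shard returns its own local top_k as (distance, id) pairs.
    return heapq.nsmallest(
        top_k, ((math.dist(query, vec), vid) for vid, vec in shard.items())
    )


def scatter_gather(query, shards, top_k):
    # Scatter: send the query to every shard. Gather: merge the partial results.
    partial = [hit for shard in shards for hit in shard_search(query, shard, top_k)]
    return [vid for _, vid in heapq.nsmallest(top_k, partial)]


shards = [
    {"a": [0.0, 0.0], "b": [4.0, 4.0]},
    {"c": [0.5, 0.5], "d": [9.0, 9.0]},
]
top = scatter_gather([0.1, 0.1], shards, top_k=2)
print(top)
```

Because each shard contributes its own best `top_k`, the merged result matches what a single-node search over all the data would return, here one hit from each shard.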

Monitoring

To effectively manage and maintain a vector database, we need a robust monitoring system that tracks the important aspects of the database’s performance, health, and overall status. Monitoring is critical for detecting potential problems, optimizing performance, and ensuring smooth production operations. Some aspects of monitoring a vector database include the following:

  1. Resource usage - monitoring resource usage, such as CPU, memory, disk space, and network activity, enables the identification of potential issues or resource constraints that could affect the performance of the database.
  2. Query performance - query latency, throughput, and error rates may indicate potential systemic issues that need to be addressed.
  3. System health - overall system health monitoring includes the status of individual nodes, the replication process, and other critical components.
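
To make the query-performance item concrete, here is a minimal sketch of how latency, throughput, and error rate might be collected around query calls. The `QueryMetrics` class and `timed_query` helper are hypothetical names introduced for illustration, and the nearest-rank percentile is one of several reasonable definitions. Resource usage and node health would normally come from the host or orchestration layer and are left out.

```python
import math
import time
from dataclasses import dataclass, field


@dataclass
class QueryMetrics:
    latencies_ms: list = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms, ok=True):
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def percentile(self, pct):
        # Nearest-rank percentile over all recorded latencies.
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    def summary(self, window_seconds):
        total = len(self.latencies_ms)
        return {
            "p50_ms": self.percentile(50),
            "p99_ms": self.percentile(99),
            "error_rate": self.errors / total if total else 0.0,
            "qps": total / window_seconds,
        }


def timed_query(metrics, query_fn, *args, **kwargs):
    """Run query_fn, recording its latency and whether it raised."""
    start = time.perf_counter()
    try:
        result = query_fn(*args, **kwargs)
    except Exception:
        metrics.record((time.perf_counter() - start) * 1000, ok=False)
        raise
    metrics.record((time.perf_counter() - start) * 1000)
    return result


metrics = QueryMetrics()
for latency_ms in [10, 20, 30, 40]:
    metrics.record(latency_ms)
metrics.record(100, ok=False)  # one slow, failed query
report = metrics.summary(window_seconds=5)
```

Tracking a high percentile such as p99 alongside the median matters here: a single slow shard in a scatter-gather query shows up in the tail long before it moves the average.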

Access control

Access control is the process of managing and regulating user access to data and resources. It is a vital component of data security, ensuring that only authorized users have the ability to view, modify, or interact with sensitive data stored within the vector database.

Access control is important for several reasons:

  1. Data protection: As AI applications often deal with sensitive and confidential information, implementing strict access control mechanisms helps safeguard data from unauthorized access and potential breaches.
  2. Compliance: Many industries, such as healthcare and finance, are subject to strict data privacy regulations. Implementing proper access control helps organizations comply with these regulations, protecting them from legal and financial repercussions.
  3. Accountability and auditing: Access control mechanisms enable organizations to maintain a record of user activities within the vector database. This information is crucial for auditing purposes, and when security breaches happen, it helps trace back any unauthorized access or modifications.
  4. Scalability and flexibility: As organizations grow and evolve, their access control needs may change. A robust access control system allows for seamless modification and expansion of user permissions, ensuring that data security remains intact throughout the organization’s growth.
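
One common way to get the protection and auditing properties described above is role-based access control with an audit trail. The sketch below is an illustration under assumptions, not a description of any specific product: the role names, the permission sets, and the `AccessController` class are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "reader": {"query"},
    "writer": {"query", "upsert"},
    "admin": {"query", "upsert", "delete", "manage_users"},
}


@dataclass
class AccessController:
    user_roles: dict = field(default_factory=dict)  # user -> role name
    audit_log: list = field(default_factory=list)   # one entry per check

    def grant(self, user, role):
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.user_roles[user] = role

    def check(self, user, action):
        """Return True if the user may perform the action; log every attempt."""
        role = self.user_roles.get(user)
        allowed = role is not None and action in ROLE_PERMISSIONS[role]
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "allowed": allowed,
        })
        return allowed


acl = AccessController()
acl.grant("alice", "writer")

can_upsert = acl.check("alice", "upsert")    # writer may upsert
can_delete = acl.check("alice", "delete")    # writer may not delete
stranger = acl.check("mallory", "query")     # unknown users are denied
```

Because denied attempts are logged as well as allowed ones, the audit log can be used to trace unauthorized access after the fact, and changing a user's permissions as the organization grows is a single `grant` call.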

Backups and collections

When all else fails, vector databases can fall back on regularly created backups. These backups can be stored on external storage systems or cloud-based storage services, ensuring the safety and recoverability of the data. In case of data loss or corruption, a backup can be used to restore the database to a previous state, minimizing downtime and the impact on the overall system. With Pinecone, users can also back up specific indexes and save them as “collections,” which can later be used to populate new indexes.
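
The backup-and-restore cycle can be illustrated with a small sketch. It assumes a toy index held as a Python dictionary and serialized to a JSON file; the `create_backup` and `restore_backup` functions are hypothetical and are not Pinecone's collections API.

```python
import json
import os
import tempfile


def create_backup(index, path):
    """Write a point-in-time copy of the index to `path`."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    # Atomic rename, so a crash never leaves a half-written backup.
    os.replace(tmp_path, path)


def restore_backup(path):
    """Load a backup into a new, independent index."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Toy index: vector id -> values plus metadata.
index = {
    "doc-1": {"values": [0.1, 0.2], "metadata": {"lang": "en"}},
    "doc-2": {"values": [0.3, 0.4], "metadata": {"lang": "fr"}},
}

with tempfile.TemporaryDirectory() as backup_dir:
    backup_path = os.path.join(backup_dir, "index-backup.json")
    create_backup(index, backup_path)

    # Simulate corruption of the live index after the backup was taken.
    index["doc-1"]["values"] = [9.9, 9.9]

    restored = restore_backup(backup_path)
```

The restored copy reflects the state at backup time, not the corrupted live index, which is the property that makes a backup usable as the seed for a new index.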

API and SDKs

This is where the rubber meets the road: Developers who interact with the database want to do so with an easy-to-use API, using a toolset that is familiar and comfortable. By providing a user-friendly interface, the vector database API layer simplifies the development of high-performance vector search applications.

In addition to the API, vector databases often provide language-specific SDKs that wrap it. The SDKs make it even easier for developers to interact with the database from their applications. This allows developers to concentrate on their specific use cases, such as semantic text search, generative question-answering, hybrid search, image similarity search, or product recommendations, without having to worry about the underlying infrastructure complexities.
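
What it means for an SDK to wrap an API can be shown with a deliberately small sketch. Everything here is hypothetical: the `VectorClient` class, the `/v0/upsert` and `/v0/search` paths, the `X-Api-Key` header, and the payload field names are invented for illustration and do not correspond to any real service. The transport is injected as a function so the example runs without a network.

```python
from typing import Callable

# A transport takes (method, path, request) and returns the decoded response.
Transport = Callable[[str, str, dict], dict]


class VectorClient:
    """Hypothetical SDK: turns method calls into API requests."""

    def __init__(self, transport: Transport, api_key: str):
        self._transport = transport
        self._api_key = api_key

    def _call(self, method: str, path: str, payload: dict) -> dict:
        # Authentication and request shaping live in one place,
        # so application code never has to build requests by hand.
        request = {"headers": {"X-Api-Key": self._api_key}, "json": payload}
        return self._transport(method, path, request)

    def upsert(self, vectors: list) -> dict:
        return self._call("POST", "/v0/upsert", {"vectors": vectors})

    def query(self, vector: list, top_k: int = 10) -> dict:
        return self._call("POST", "/v0/search", {"vector": vector, "top_k": top_k})


# Stand-in transport that records requests instead of sending them.
calls = []


def fake_transport(method, path, request):
    calls.append((method, path, request))
    return {"matches": []}


client = VectorClient(fake_transport, api_key="test-key")
response = client.query([0.1, 0.2], top_k=3)
```

The application only sees `client.query(...)`; headers, paths, and payload layout stay inside the SDK, which is what lets developers focus on the use case rather than the wire format.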

Summary

The exponential growth of vector embeddings in fields such as NLP, computer vision, and other AI applications has resulted in the emergence of vector databases as the computation engine that allows us to interact effectively with vector embeddings in our applications.

Vector databases are purpose-built databases that are specialized to tackle the problems that arise when managing vector embeddings in production scenarios. For that reason, they offer significant advantages over traditional scalar-based databases and standalone vector indexes.

In this post, we reviewed the key aspects of a vector database, including how it works, what algorithms it uses, and the additional features that make it operationally ready for production scenarios. We hope this helps you understand the inner workings of vector databases. Luckily, this isn’t something you must know to use Pinecone. Pinecone takes care of all of these considerations (and then some) and frees you to focus on the rest of your application.
