ESTips's Homepage - PintereStory

Vector Database: Storing and Searching Vector Data with Elasticsearch

Ma Chao's Blog 2023-06-01 21:19:47

1. Introduction

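Elasticsearch supports vector search in its 7.x releases. Vector functions are evaluated with a linear scan over all matching documents, so the expected query time grows linearly with the number of matching documents. For this reason, it is recommended to limit the number of matching documents with a query (a two-stage lookup: first retrieve the relevant documents with a match query, then use a vector function to score their relevance).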

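The recommended way to access a dense_vector is through the cosineSimilarity, dotProduct, l1norm, or l2norm functions. Note, however, that each script should call these functions only once. For example, do not call them in a loop to compute the similarity between a document vector and several other vectors. If you need that functionality, reimplement these functions by accessing the vector values directly.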

2. Preparation

2.1 Create an index with a vector field

Create a mapping that supports vector search by giving the field the dense_vector type.

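// The maximum dims is 1024 in early 7.x releases; check the dense_vector documentation for your version, since later releases allow more.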
PUT index3
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

2.2 Index data

PUT index3/_doc/1
{
  "my_text" : "text1",
  "my_vector" : [0.5, 10, 6]
}

PUT index3/_doc/2
{
  "my_text" : "text2",
  "my_vector" : [-0.5, 10, 10]
}

3. Vector functions

3.1 Cosine similarity: cosineSimilarity

The cosineSimilarity function computes the cosine similarity between a given query vector and document vectors.
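As a sanity check on what this function returns, the following minimal sketch recomputes cosine similarity in plain Python for the two documents indexed in section 2.2. The helper name cosine_similarity is an assumption made for illustration; it is not part of Elasticsearch, and because Elasticsearch stores dense_vector values as 32-bit floats the server-side numbers may differ slightly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Query vector and document vectors taken from the examples in this post.
query_vector = [-0.5, 10, 6]
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

scores = {doc_id: cosine_similarity(query_vector, vec) for doc_id, vec in docs.items()}
for doc_id, score in scores.items():
    print(doc_id, round(score, 4))
```

With these inputs document 1 comes out at about 0.9963 and document 2 at about 0.9702, so document 1 should rank first once the query in this section adds its constant 1.0 to both.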

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_vector'])+1.0",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

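To limit the number of documents that script_score evaluates, provide a filter (the query clause).
The script adds 1.0 to the cosineSimilarity result so that the score cannot be negative.
To benefit from script optimizations, pass the query vector as a script parameter.
Checking for missing values: if a document has no value for the vector field that the vector function runs on, an error is thrown. Use doc['my_vector'].size() == 0 to check whether a document has a value for my_vector. Sample script: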

"source": """
  doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')
"""

If a document's dense_vector field has a different number of dimensions than the query vector, an exception is thrown.
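The notes above (filter first, pass the vector as a parameter, guard against missing values) can be combined into one request body. The sketch below only builds that body as a Python dict; the function name build_cosine_query and the term filter on my_text are assumptions chosen for illustration, and sending the body to a cluster is left to whichever client you use.

```python
import json


def build_cosine_query(query_vector, filter_query, vector_field="my_vector"):
    """Build a script_score search body (hypothetical helper, not an Elasticsearch API)."""
    # Guard against documents without a vector, then score with cosineSimilarity + 1.0.
    source = (
        f"doc['{vector_field}'].size() == 0 ? 0 : "
        f"cosineSimilarity(params.queryVector, '{vector_field}') + 1.0"
    )
    return {
        "query": {
            "script_score": {
                # Only documents matching this query are scored by the script.
                "query": filter_query,
                "script": {
                    "source": source,
                    # The vector travels as a parameter, not inline in the script.
                    "params": {"queryVector": list(query_vector)},
                },
            }
        }
    }


body = build_cosine_query([-0.5, 10, 6], {"term": {"my_text": "text1"}})
print(json.dumps(body, indent=2))
```

Replacing match_all with a selective query like this keeps the linear scan limited to the documents that actually need a vector score.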

3.2 Dot product: dotProduct

The dotProduct function computes the dot product between a given query vector and document vectors.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": """
        double value = dotProduct(params.queryVector,doc['my_vector']);
        return sigmoid(1, Math.E, -value);
        """,
        "params": {
          "queryVector": [
            -0.5,
            10,
            6
          ]
        }
      }
    }
  }
}

Applying the standard sigmoid function prevents negative scores.
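To see why sigmoid(1, Math.E, -value) is the standard logistic function, here is a small Python sketch. It assumes the form value^a / (k^a + value^a) that the script_score documentation gives for sigmoid(value, k, a); the function names are illustrative only.

```python
import math


def painless_sigmoid(value, k, a):
    """value^a / (k^a + value^a), the assumed definition of the script sigmoid."""
    return value ** a / (k ** a + value ** a)


def logistic(x):
    """Standard logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


# Small dot products, plus the two dot products produced by the sample documents.
for dot in (-2.0, 0.0, 2.0, 135.75, 160.25):
    print(dot, painless_sigmoid(1, math.e, -dot), logistic(dot))
```

With value fixed at 1 the expression reduces to 1 / (1 + e^-x). For the sample data the dot products are 135.75 and 160.25, and both map to 1.0 in floating point, so the two documents would tie. If ranking matters, consider scaling the dot product before applying the sigmoid.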

3.3 Manhattan distance: l1norm

The l1norm function computes the L1 distance (Manhattan distance) between a given query vector and document vectors.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source":"1 / (1 + l1norm(params.queryVector, doc['my_vector']))",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

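Unlike cosine similarity, which measures similarity, l1norm and l2norm measure distance or difference. This means that the more similar two vectors are, the lower the value that l1norm and l2norm return. Because we want more similar vectors to score higher, we invert the output of l1norm and l2norm. In addition, 1 is added to the denominator to avoid division by zero when a document vector matches the query vector exactly.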
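The inversion can be checked by hand. The following sketch, with illustrative helper names that are not Elasticsearch functions, computes both distances and the resulting 1 / (1 + distance) scores for the sample documents.

```python
import math


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_score(distance):
    """Invert a distance so that closer vectors score higher; never divides by zero."""
    return 1 / (1 + distance)


query_vector = [-0.5, 10, 6]
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

for doc_id, vec in docs.items():
    d1 = l1_distance(query_vector, vec)
    d2 = l2_distance(query_vector, vec)
    print(doc_id, d1, to_score(d1), d2, to_score(d2))
```

Each sample document differs from the query in a single coordinate, so the L1 and L2 distances coincide here: 1 for document 1 (score 0.5) and 4 for document 2 (score 0.2). An exact match has distance 0 and gets the maximum score of 1.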

3.4 Euclidean distance: l2norm

The l2norm function computes the L2 distance (Euclidean distance) between a given query vector and document vectors.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, doc['my_vector']))",
        "params": {
          "queryVector": [
            -0.5,
            10,
            6
          ]
        }
      }
    }
  }
}

3.5 Custom scoring function

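You can access the vector's values directly and implement the cosine similarity calculation yourself. The doc[].vectorValue function used for this is supported starting with Elasticsearch 7.8.0; the script fails on ES 7.5.1 and on any other version below 7.8.0.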

The following functions give direct access to vector values:

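doc[<field>].vectorValue: returns the vector's value as an array of floats.
doc[<field>].magnitude: returns the vector's magnitude as a float. For vectors created before version 7.5 the magnitude is not stored, so it is recomputed every time this function is called.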

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": """
          float[] v = doc['my_vector'].vectorValue;
          float vm = doc['my_vector'].magnitude;
          float dotProduct = 0;
          for (int i = 0; i < v.length; i++) {
            dotProduct += v[i] * params.queryVector[i];
          }
          return dotProduct / (vm * (float) params.queryVectorMag);
        """,
        "params": {
          "queryVector": [
            -0.5,
            10,
            6
          ],
          "queryVectorMag": 11.67262
        }
      }
    }
  }
}
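The queryVectorMag parameter is meant to be the Euclidean norm of queryVector, so it has to be recomputed on the client whenever the query vector changes. A minimal sketch of that computation follows, together with a plain-Python mirror of the script's loop; the helper names are assumptions for illustration.

```python
import math


def magnitude(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def manual_cosine(doc_vector, query_vector, query_mag):
    """Mirror of the script above: dot product divided by both magnitudes."""
    dot = 0.0
    for i in range(len(doc_vector)):
        dot += doc_vector[i] * query_vector[i]
    return dot / (magnitude(doc_vector) * query_mag)


query_vector = [-0.5, 10, 6]
params = {
    "queryVector": query_vector,
    "queryVectorMag": round(magnitude(query_vector), 5),
}
print(params)
print(manual_cosine([0.5, 10, 6], query_vector, params["queryVectorMag"]))
```

For the query vector [-0.5, 10, 6] the magnitude is sqrt(136.25), roughly 11.67262. Passing a magnitude that belongs to a different vector silently scales every score by a constant factor, which keeps the ranking but makes the values meaningless as cosine similarities.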

ESTips

2024-08-29

Source:

Vector Database: Storing and Searching Vector Data with Elasticsearch_elasticsearch vector - CSDN Blog

ElasticSearch IK Analyzer: Hot-Reloading Dictionaries

ESTips

2024-04-19

Source:

ElasticSearch 7.3 Study Notes (15): Chinese Analyzer (IK Analyzer) and Custom Dictionaries - |旧市拾荒| - cnblogs (博客园)

ElasticSearch IK Analysis Plugin: Custom Dictionaries

ESTips

2024-04-19

Source:

ElasticSearch 7.3 Study Notes (15): Chinese Analyzer (IK Analyzer) and Custom Dictionaries - |旧市拾荒| - cnblogs (博客园)

ElasticSearch Custom Dictionary Configuration

IKAnalyzer.cfg.xml can be located at {conf}/analysis-ik/config/IKAnalyzer.cfg.xml or {plugins}/elasticsearch-analysis-ik-*/config/IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
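	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- Users can configure their own extension stopword dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Users can configure a remote extension dictionary here -->
	<entry key="remote_ext_dict">location</entry>
	<!-- Users can configure a remote extension stopword dictionary here -->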
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
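When auditing a node, it can help to list which dictionaries a given IKAnalyzer.cfg.xml refers to. The sketch below parses a trimmed copy of the file with Python's standard library. It assumes, based on the ext_dict entry above, that multiple dictionary paths are separated by semicolons; the function name dictionary_entries is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration shown above, kept inline so the sketch is self-contained.
CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>
"""


def dictionary_entries(xml_bytes):
    """Map each <entry key="..."> to its list of semicolon-separated values."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for entry in root.iter("entry"):
        value = (entry.text or "").strip()
        result[entry.get("key")] = [part for part in value.split(";") if part]
    return result


entries = dictionary_entries(CONFIG)
for key, paths in entries.items():
    print(key, paths)
```

To inspect a real installation, read the file's bytes from one of the two locations mentioned above and pass them to the same function.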

ESTips

2024-02-20

Source:

medcl/elasticsearch-analysis-ik: The IK Analysis plugin integrates Lucene IK analyzer into elasticsearch, support customized dictionary.

Where Do ElasticSearch JVM Options Files Go?

If needed, you can override the default JVM options by adding custom options files (preferred) or setting the ES_JAVA_OPTS environment variable.

JVM options files must have the suffix .options and contain a line-delimited list of JVM arguments. JVM processes options files in lexicographic order.

Where you put the JVM options files depends on how Elasticsearch was installed:

tar.gz or .zip: Add custom JVM options files to config/jvm.options.d/.
Debian or RPM: Add custom JVM options files to /etc/elasticsearch/jvm.options.d/.
Docker: Bind mount custom JVM options files into /usr/share/elasticsearch/config/jvm.options.d/.
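Because options files are processed in lexicographic order, a numeric prefix in the file name is one way to make that order explicit. The sketch below writes two options files into a temporary directory that stands in for jvm.options.d and lists them in processing order. The file names, the helper functions, and the choice of -XX:+UseG1GC are illustrative assumptions, not recommendations from the Elasticsearch guide.

```python
import tempfile
from pathlib import Path


def write_options_file(options_dir, name, jvm_args):
    """Write one JVM argument per line into <options_dir>/<name>."""
    if not name.endswith(".options"):
        raise ValueError("JVM options files must have the .options suffix")
    path = Path(options_dir) / name
    path.write_text("\n".join(jvm_args) + "\n", encoding="utf-8")
    return path


def load_order(options_dir):
    """File names in lexicographic order, the order described in the guide."""
    return sorted(p.name for p in Path(options_dir).glob("*.options"))


# A temporary directory stands in for config/jvm.options.d/ in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    write_options_file(tmp, "20-gc.options", ["-XX:+UseG1GC"])
    write_options_file(tmp, "10-heap.options", ["-Xms2g", "-Xmx2g"])
    order = load_order(tmp)
    print(order)
```

Here 10-heap.options sorts before 20-gc.options even though it was written second.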

ESTips

2023-10-16

Source:

Advanced configuration | Elasticsearch Guide [7.17] | Elastic

Setting the ElasticSearch Heap Size from the Command Line

For testing, you can also set the heap sizes using the ES_JAVA_OPTS environment variable:

ES_JAVA_OPTS="-Xms2g -Xmx2g" ./bin/elasticsearch

ESTips

2023-10-16

Source:

Advanced configuration | Elasticsearch Guide [7.17] | Elastic

Two Ways to Start ElasticSearch from the Command Line

Elasticsearch Guide

Running Elasticsearch from the command line

Elasticsearch can be started from the command line as follows:

./bin/elasticsearch

If you have password-protected the Elasticsearch keystore, you will be prompted to enter the keystore’s password. See Secure settings for more details.

By default Elasticsearch prints its logs to the console (stdout) and to the <cluster name>.log file within the logs directory. Elasticsearch logs some information while it is starting up, but once it has finished initializing it will continue to run in the foreground and won’t log anything further until something happens that is worth recording. While Elasticsearch is running you can interact with it through its HTTP interface which is on port 9200 by default. To stop Elasticsearch, press Ctrl-C.

All scripts packaged with Elasticsearch require a version of Bash that supports arrays and assume that Bash is available at /bin/bash. As such, Bash should be available at this path either directly or via a symbolic link.

Running as a daemon

To run Elasticsearch as a daemon, specify -d on the command line, and record the process ID in a file using the -p option:

./bin/elasticsearch -d -p pid

If you have password-protected the Elasticsearch keystore, you will be prompted to enter the keystore’s password. See Secure settings for more details.

Log messages can be found in the $ES_HOME/logs/ directory.

To shut down Elasticsearch, kill the process ID recorded in the pid file:

pkill -F pid
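For scripted shutdowns, the same step can be done without pkill. This is a minimal sketch that reads the process ID from the pid file and sends SIGTERM, which is what pkill sends by default. The function names are assumptions, and the sketch assumes a POSIX system and a pid file that holds a single integer.

```python
import os
import signal
from pathlib import Path


def read_pid(pid_file):
    """Return the process ID stored in the pid file."""
    return int(Path(pid_file).read_text(encoding="ascii").strip())


def stop_process(pid_file, sig=signal.SIGTERM):
    """Send a termination signal to the process recorded in the pid file."""
    pid = read_pid(pid_file)
    os.kill(pid, sig)
    return pid


# Example usage (not executed here): stop the daemon started with "-d -p pid".
# stop_process("pid")
```

os.kill raises ProcessLookupError if no process with that ID exists, which is a useful signal that the pid file is stale.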

ESTips

2023-10-13

Source:

Starting Elasticsearch | Elasticsearch Guide [7.17] | Elastic