ES Cluster Troubleshooting Guide

1. Elasticsearch query commands

# Common request format; the requests below can also be run directly in Kibana
# curl -u user:pass -XGET "http://host:port/xxxxxx?"

# View the data in an index
GET monitor_demotest_20210723/_search

# Check the health status of the ES cluster
GET /_cluster/health
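
The health response carries more than the colour. The sketch below assumes the JSON body of `GET /_cluster/health` has already been parsed into a dict; `summarize_health` is a hypothetical helper name, and the sample values are invented for illustration.

```python
def summarize_health(health):
    """Turn a parsed GET /_cluster/health response into a one-line summary."""
    status = health.get("status", "unknown")
    hints = {
        "green": "all primary and replica shards are allocated",
        "yellow": "all primaries are allocated, some replicas are not",
        "red": "at least one primary shard is not allocated",
    }
    hint = hints.get(status, "status not recognised")
    nodes = health.get("number_of_nodes", 0)
    unassigned = health.get("unassigned_shards", 0)
    return f"{status}: {hint} (nodes={nodes}, unassigned={unassigned})"


# Invented sample values, not taken from a real cluster.
sample = {"status": "yellow", "number_of_nodes": 3, "unassigned_shards": 5}
print(summarize_health(sample))
```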

# List the status of every index in the cluster
GET _cat/indices

# Show why a shard is unassigned
GET /_cluster/allocation/explain

# View cluster settings, including defaults such as the per-node shard limit
GET /_cluster/settings?include_defaults=true

# Set the maximum number of shards per node
PUT /_cluster/settings
{
  "persistent" : {
    "cluster" : {
         "max_shards_per_node" : "100000"
    }
  }
}
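
`cluster.max_shards_per_node` is a per-node figure; the cluster-wide ceiling is that value multiplied by the number of data nodes, and it counts primaries and replicas of open indices. The sketch below does that arithmetic; `shard_headroom` is a hypothetical helper and the shard and node counts are invented.

```python
def shard_headroom(open_shards, data_nodes, max_shards_per_node=1000):
    """Return how many more shards fit before the cluster-wide limit is reached."""
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    cluster_limit = data_nodes * max_shards_per_node
    return cluster_limit - open_shards


# With the default limit of 1000, three data nodes allow 3000 open shards.
print(shard_headroom(open_shards=2950, data_nodes=3))          # 50
# The same cluster after raising the limit to 100000 as in the request above.
print(shard_headroom(2950, 3, max_shards_per_node=100000))     # 297050
```

A very high limit removes the error but not the per-shard memory cost, so the headroom figure is best read as a warning threshold rather than a target.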

# Disable shard allocation
PUT /_cluster/settings
{
  "transient" : {
      "cluster.routing.allocation.enable" : "none"
  }
}

# Enable shard allocation
PUT /_cluster/settings
{
  "transient" : {
      "cluster.routing.allocation.enable" : "all"
  }
}

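# View the segments of each index
GET /_cat/segments

# List shards to find indices whose primary shards failed to allocate
GET /_cat/shards

# View process stats for each node, including file descriptor counts
GET /_nodes/stats/process

# Return only the maximum number of file descriptors per node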
GET /_nodes/stats/process?filter_path=**.max_file_descriptors
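
The maximum alone does not say how close a node is to the limit. The sketch below compares `open_file_descriptors` with `max_file_descriptors` from a parsed `GET /_nodes/stats/process` response. `fd_usage`, the 0.8 threshold, and the sample node names and numbers are assumptions made for illustration.

```python
def fd_usage(stats, threshold=0.8):
    """Return (node_name, open, max, ratio) for nodes at or above the usage threshold."""
    flagged = []
    for node_id, node in stats.get("nodes", {}).items():
        proc = node.get("process", {})
        open_fds = proc.get("open_file_descriptors")
        max_fds = proc.get("max_file_descriptors")
        if not open_fds or not max_fds or max_fds <= 0:
            continue  # value missing or not reported on this platform
        ratio = open_fds / max_fds
        if ratio >= threshold:
            flagged.append((node.get("name", node_id), open_fds, max_fds, round(ratio, 2)))
    return flagged


# Invented sample shaped like the nodes stats response.
sample_stats = {
    "nodes": {
        "id-1": {"name": "es-node-1",
                 "process": {"open_file_descriptors": 60000, "max_file_descriptors": 65535}},
        "id-2": {"name": "es-node-2",
                 "process": {"open_file_descriptors": 1200, "max_file_descriptors": 65535}},
    }
}
print(fd_usage(sample_stats))
```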

# List the nodes in the cluster
GET /_cat/nodes

2. Elasticsearch error handling

2.1 Insufficient memory
# The log shows that memory could not be allocated because too little is available
failed; error='Cannot allocate memory'

# Fix: adjust the JVM heap size so that it fits the memory available on the host
vim config/jvm.options
----------
-Xms8g
-Xmx8g
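
Choosing the heap value is mostly arithmetic. The sketch below follows the commonly cited Elasticsearch guidance of setting `-Xms` and `-Xmx` to the same value, at most half of physical RAM, and below the compressed-pointer threshold. `suggest_heap_gb` and `jvm_options` are hypothetical helpers, and the 31 GB cap is an assumption that should be checked against the JVM in use.

```python
def suggest_heap_gb(total_ram_gb, cap_gb=31):
    """Suggest a heap size in GB: half of RAM, capped below the compressed-oops threshold."""
    if total_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM for this rule of thumb")
    return min(total_ram_gb // 2, cap_gb)


def jvm_options(heap_gb):
    """Return the two jvm.options lines, with minimum and maximum heap kept equal."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


print(jvm_options(suggest_heap_gb(16)))    # ['-Xms8g', '-Xmx8g']
print(jvm_options(suggest_heap_gb(128)))   # ['-Xms31g', '-Xmx31g']
```

On a 16 GB host this reproduces the 8g setting shown above; the remaining memory is left to the operating system's file cache.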

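# Show the processes that use the most memory
ps aux|head -1;ps aux|grep -v PID|sort -rn -k 4|head -n 2

2.2 File descriptor limit
# The log shows that too many files are open because the file descriptor limit is too low
too many open files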

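# Fix: raise the system-wide file handle limit in /etc/sysctl.conf, then reload it
echo "fs.file-max=6553500" >> /etc/sysctl.conf
sysctl -p
# If the error persists, also check the per-process limit of the ES user (ulimit -n)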

2.3 JDK dependency problem inside the container
# The log reports a JDK dependency problem, with output such as:
OpenJDK 64-Bit Server VM warning: option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.

# Fix
docker exec -it elasticsearch bash  # open a shell inside the ES container
vi /usr/share/elasticsearch/bin/elasticsearch
------------
export JAVA_HOME=/usr/share/elasticsearch/jdk
export PATH=$JAVA_HOME/bin:$PATH
if [ -x "$JAVA_HOME/bin/java" ]; then
   JAVA="$JAVA_HOME/bin/java"
else
   JAVA=$(which java)
fi
------------
docker-compose stop elasticsearch
docker-compose up -d elasticsearch

2.4 Certificate key error
# The log prints
"Caused by: java.security.UnrecoverableKeyException: failed to decrypt safe contents entry: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise if a bad key is used during decryption."

# Fix: generate a new certificate on the master node, then copy it again to the newly added node
./bin/elasticsearch-certutil ca --out data/elastic-certificates.p12 --pass ""
scp data/elastic-certificates.p12 ${NODE_IP}:/xxx/xxx

2.5 Cluster in red status
# Check the node status; if a node is missing, restart it
curl "http://${ESIP}:9200/_cat/nodes?v"

# Find which index is red; repair it if it is important, delete it if it is not
curl -XGET "http://localhost:9200/_cat/shards"

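# Find unassigned shards. In the output, column 1 is the index name, column 2 the shard number, and column 3 the shard type (p = primary, r = replica). Locate the indices with shards in the UNASSIGNED state and reallocate them manually.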
curl -s "http://localhost:9200/_cat/shards" | grep UNASSIGNED
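
The column layout described above can be parsed directly, and the result fed into a manual reallocation. The sketch below is a minimal example that assumes the default `_cat/shards` columns (index, shard, prirep, state, ...). `unassigned_shards` and `allocate_replica_body` are hypothetical helpers, and the node name `es-node-1` is a placeholder. The body it builds targets `POST /_cluster/reroute` with the `allocate_replica` command, which only applies to replicas.

```python
def unassigned_shards(cat_shards_text):
    """Parse plain-text GET /_cat/shards output and return the UNASSIGNED rows."""
    found = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Columns: index, shard, prirep, state; the rest is empty when unassigned.
        if len(fields) < 4 or fields[3] != "UNASSIGNED":
            continue
        found.append({"index": fields[0], "shard": int(fields[1]), "primary": fields[2] == "p"})
    return found


def allocate_replica_body(index, shard, node):
    """Build a POST /_cluster/reroute body that places one unassigned replica on a node."""
    return {"commands": [{"allocate_replica": {"index": index, "shard": shard, "node": node}}]}


# Invented sample output; the node name and sizes are placeholders.
sample_output = """\
monitor_demotest_20210723 0 p STARTED 1203 1.2mb 10.0.0.1 es-node-1
monitor_demotest_20210723 0 r UNASSIGNED
.monitor_20200412 1 p UNASSIGNED
"""
for row in unassigned_shards(sample_output):
    if not row["primary"]:
        print(allocate_replica_body(row["index"], row["shard"], "es-node-1"))
    else:
        print("primary unassigned, inspect with allocation explain first:", row["index"])
```

Unassigned primaries are deliberately not rerouted here, because the reroute commands for primaries require accepting possible data loss.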

# After reallocation completes, send the request to a different ES node to verify
curl http://localhost:9200/_cluster/health?pretty
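
Recovery takes time, so a single health check right after reallocation can mislead. The sketch below polls until the status is at least the wanted colour. `wait_for_status` is a hypothetical helper, and `fetch_health` is assumed to be any callable returning the parsed health response; the demo uses canned responses instead of a live cluster.

```python
import time

_RANK = {"red": 0, "yellow": 1, "green": 2}


def wait_for_status(fetch_health, wanted="green", attempts=10, delay_s=5.0, sleep=time.sleep):
    """Poll fetch_health() until the status is at least `wanted`; return the attempt number."""
    if wanted not in _RANK:
        raise ValueError(f"unknown status: {wanted}")
    for attempt in range(1, attempts + 1):
        status = fetch_health().get("status")
        if _RANK.get(status, -1) >= _RANK[wanted]:
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    raise TimeoutError(f"cluster did not reach {wanted} after {attempts} checks")


# Demo with canned responses; sleeping is stubbed out so the demo returns at once.
responses = iter([{"status": "red"}, {"status": "yellow"}, {"status": "green"}])
print(wait_for_status(lambda: next(responses), sleep=lambda s: None))  # 3
```

The health API can also block on the server side via its `wait_for_status` and `timeout` query parameters, which avoids client-side polling when a single request is enough.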

3. Elasticsearch shard problems

Log in to Kibana and check the cluster status. In this case the Elasticsearch cluster health is yellow or red.

Run the following requests in Kibana Dev Tools:

# Check the current cluster health
GET /_cluster/health

# Find the problem indices
GET _cat/indices

# View the detailed allocation failure information
GET /_cluster/allocation/explain
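
The explain response is verbose, and the useful part is usually the list of deciders that answered NO. The sketch below pulls those out of a parsed response. `explain_summary` is a hypothetical helper, and the sample, including its explanation text, is invented and only shaped like a real response.

```python
def explain_summary(resp):
    """Condense a parsed GET /_cluster/allocation/explain response into short lines."""
    lines = [
        f"{resp.get('index')}[{resp.get('shard')}] "
        f"primary={resp.get('primary')} state={resp.get('current_state')}"
    ]
    reason = resp.get("unassigned_info", {}).get("reason")
    if reason:
        lines.append(f"reason: {reason}")
    for node in resp.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                lines.append(
                    f"{node.get('node_name')}: {decider.get('decider')}: {decider.get('explanation')}"
                )
    return lines


# Invented sample; the explanation strings are placeholders, not real server output.
sample_explain = {
    "index": ".monitor_20200412", "shard": 0, "primary": False,
    "current_state": "unassigned",
    "unassigned_info": {"reason": "NODE_LEFT"},
    "node_allocation_decisions": [
        {"node_name": "es-node-1", "deciders": [
            {"decider": "shards_limit", "decision": "NO", "explanation": "placeholder: limit reached"},
            {"decider": "same_shard", "decision": "YES", "explanation": "placeholder: ok"},
        ]},
    ],
}
print("\n".join(explain_summary(sample_explain)))
```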

Investigation showed two causes: the per-node shard limit was too low and the nodes were short of memory. Adjust the memory first:

# Change the heap size
vim config/jvm.options
----------
-Xms8g
-Xmx8g

# Restart the ES cluster
docker-compose stop elasticsearch
docker-compose up -d elasticsearch

After the cluster has finished starting, raise the shard limit:

PUT /_cluster/settings
{
  "persistent" : {
    "cluster" : {
         "max_shards_per_node" : "100000"
    }
  }
}

Next, toggle shard allocation off and on to trigger allocation of the index shards:

# Disable shard allocation
PUT /_cluster/settings
{
  "transient" : {
      "cluster.routing.allocation.enable" : "none"
  }
}

# Enable shard allocation
PUT /_cluster/settings
{
  "transient" : {
      "cluster.routing.allocation.enable" : "all"
  }
}
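
The risk in this toggle is leaving allocation set to `none` if something fails between the two requests. The sketch below wraps the pair in a context manager so the second request is always sent. `allocation_disabled` is a hypothetical helper, and `send` is assumed to be any callable taking (method, path, body); the demo records calls instead of contacting a cluster.

```python
from contextlib import contextmanager

SETTINGS_PATH = "/_cluster/settings"
ALLOCATION_KEY = "cluster.routing.allocation.enable"


@contextmanager
def allocation_disabled(send):
    """Disable shard allocation on entry and always re-enable it on exit."""
    send("PUT", SETTINGS_PATH, {"transient": {ALLOCATION_KEY: "none"}})
    try:
        yield
    finally:
        send("PUT", SETTINGS_PATH, {"transient": {ALLOCATION_KEY: "all"}})


# Demo: record the requests that would be sent.
calls = []


def record(method, path, body=None):
    calls.append((method, path, body))


with allocation_disabled(record):
    pass  # any work that must run while allocation is off goes here

for call in calls:
    print(call)
```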

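If shards still fail to allocate at this point even though memory and disk space are sufficient, manually trigger allocation for the problem index:

# Close the problem index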
POST /.monitor_20200412/_close

# Reopen the index
POST /.monitor_20200412/_open
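
Closing an index makes it unavailable for reads and writes, so the close and open requests are best limited to one named index. The sketch below builds the two requests in order and rejects wildcard or comma-separated names. `reopen_index` is a hypothetical helper and `send` is an assumed callable taking (method, path).

```python
def reopen_index(index, send):
    """Send POST /<index>/_close followed by POST /<index>/_open for one concrete index."""
    if not index or any(ch in index for ch in "*, "):
        raise ValueError("expected exactly one concrete index name")
    return [send("POST", f"/{index}/_close"), send("POST", f"/{index}/_open")]


# Demo: echo the requests instead of sending them.
print(reopen_index(".monitor_20200412", lambda method, path: f"{method} {path}"))
```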

Once this is done, check the cluster status in Kibana; it should now be green.