ES Summary, Part 5: Aggregations

Aggregation Analysis

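Elasticsearch is best known as a distributed full-text search engine whose core functions are indexing and searching. Its aggregations feature is just as capable and supports complex statistical analysis over the data. Elasticsearch offers four main families of aggregations: metric, bucket, pipeline, and matrix. At the time of writing, the official documentation marked pipeline and matrix aggregations as experimental and subject to change or removal, so they are not covered here.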

Metric Aggregations

Max

The Max Aggregation returns the maximum value of a field. For example, to find the highest price among the books in the books index:
GET books/_search
{
    "aggs": {
        "max_price": {
            "max": {
                "field": "price"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "max_price": {
            "value": 93.0
        }
    }
}

Min

The Min Aggregation returns the minimum value of a field. For example, to find the earliest publication date in the books index:
GET books/_search
{
    "aggs": {
        "min_year": {
            "min": {
                "field": "publish_time"
            }
        }
    }
}

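The response is as follows. For a date field, value holds epoch milliseconds and value_as_string holds the formatted date (the exact format depends on the field mapping):

{
    "aggregations": {
        "min_year": {
            "value": 1.4859936E12,
            "value_as_string": "2017-02-02T00:00:00.000Z"
        }
    }
}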

Avg

The Avg Aggregation computes the average of a field. For example, to compute the average price of all books in the books index:
GET books/_search
{
    "aggs": {
        "avg_price": {
            "avg": {
                "field": "price"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "avg_price": {
            "value": 42.2
        }
    }
}

Sum

The Sum Aggregation computes the total of a field. For example, to compute the combined price of all books in the books index:
GET books/_search
{
    "aggs": {
        "sum_price": {
            "sum": {
                "field": "price"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "sum_price": {
            "value": 455.0
        }
    }
}

Cardinality

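The Cardinality Aggregation counts distinct values. It works like DISTINCT in SQL: duplicates are removed from the set of values, and the size of the deduplicated set is returned. Elasticsearch computes this count approximately, so results on fields with very many distinct values may be slightly off. For example, running cardinality on the language field of the books index gives the number of programming languages covered:
GET books/_search
{
    "aggs": {
        "all_lang": {
            "cardinality": {
                "field": "language"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "all_lang": {
            "value": 3
        }
    }
}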
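
Conceptually this is a distinct count. Here is a minimal Python sketch of the idea, assuming five hypothetical language values chosen to agree with the result above. It is exact and set-based, so it illustrates the semantics only, not the approximate algorithm Elasticsearch actually runs:

```python
# Hypothetical values of the language field, one per book.
languages = ["java", "java", "python", "python", "javascript"]


def cardinality(values):
    """Exact distinct count: deduplicate, then measure the size of the set."""
    return len(set(values))


print(cardinality(languages))  # 3
```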

Stats

The Stats Aggregation computes basic statistics and returns five metrics in one call: count, max, min, avg, and sum. For example, to compute basic statistics on the price field of the books index:
GET books/_search
{
    "aggs": {
        "all_stats": {
            "stats": {
                "field": "price"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "all_stats": {
            "count": 5,
            "min": 46.5,
            "max": 81.4,
            "avg": 63.8,
            "sum": 319.0
        }
    }
}
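
To make the five figures concrete, the sketch below recomputes them in Python. The price list is an assumption: it is a set of five values picked to be consistent with the response above, not data taken from the actual index.

```python
# Hypothetical prices, chosen so that the totals agree with the response above.
prices = [46.5, 54.5, 66.4, 70.2, 81.4]


def stats(values):
    """Return the five figures that a stats aggregation reports."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


result = stats(prices)
# Round for display so floating-point noise does not clutter the output.
print({name: round(value, 2) for name, value in result.items()})
```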

Extended Stats

The Extended Stats Aggregation computes advanced statistics. It returns everything the Stats Aggregation does plus four more results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus or minus two standard deviations. To run it on the price field of the books index:
GET books/_search
{
    "aggs": {
        "all_stats": {
            "extended_stats": {
                "field": "price"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "all_stats": {
            "count": 5,
            "min": 46.5,
            "max": 81.4,
            "avg": 63.8,
            "sum": 319.0,
            "sum_of_squares": 21095.46,
            "variance": 148.65199999999967,
            "std_deviation": 12.19229264740638,
            "std_deviation_bounds": {
                "upper": 88.18458529481276,
                "lower": 39.41541470518724
            }
        }
    }
}
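
The extra figures follow from count, sum, and sum_of_squares alone. The sketch below rederives them in Python from the numbers in the response above, assuming the population variance formula and a width of two standard deviations for the bounds:

```python
import math

# Figures copied from the response above.
count = 5
total = 319.0
sum_of_squares = 21095.46

avg = total / count
# Population variance: mean of the squares minus the square of the mean.
variance = sum_of_squares / count - avg ** 2
std_deviation = math.sqrt(variance)

# Bounds are the mean plus or minus two standard deviations.
sigma = 2
bounds = {
    "upper": avg + sigma * std_deviation,
    "lower": avg - sigma * std_deviation,
}

print(round(variance, 3), round(std_deviation, 4))
print({name: round(value, 4) for name, value in bounds.items()})
```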

Percentiles

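The Percentiles Aggregation computes percentiles. Percentile is a statistical term: if a set of values is sorted in ascending order and the cumulative percentage is computed for each position, the value found at a given percentage is called that percentile.
For example, to compute percentiles on the price field of the books index:
GET books/_search
{
    "aggs": {
        "percentiles_price": {
            "percentiles": {
                "field": "price"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "percentiles_price": {
            "values": {
                "1.0": 46.82,
                "5.0": 48.1,
                "25.0": 54.5,
                "50.0": 66.4,
                "75.0": 70.2,
                "95.0": 79.16,
                "99.0": 80.95200000000001
            }
        }
    }
}

Reading the result:
50% of the matching documents have a price less than or equal to 66.4. Put the other way round, documents with a price of at most 66.4 make up 50% of all matching documents.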
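
For a feel of where such figures come from, here is a small Python sketch that computes percentiles by linear interpolation between neighbouring ranks. Both the price list and the interpolation method are assumptions: Elasticsearch uses an approximate algorithm internally, but on this tiny hypothetical dataset plain interpolation happens to give the same values as the response above.

```python
def percentile(sorted_values, p):
    """Percentile p (0 to 100) by linear interpolation between neighbouring ranks."""
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    low_value = sorted_values[lower]
    high_value = sorted_values[upper]
    return low_value + fraction * (high_value - low_value)


# Hypothetical prices, sorted in ascending order.
prices = sorted([46.5, 54.5, 66.4, 70.2, 81.4])

for p in (1, 5, 25, 50, 75, 95, 99):
    print(p, round(percentile(prices, p), 3))
```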

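Value Count

The Value Count Aggregation counts the values of a given field across the matching documents. For example, to count how many author values the documents in the books index contain:
POST books/_search
{
    "size": 0,
    "aggs": {
        "doc_count": {
            "value_count": {
                "field": "author"
            }
        }
    }
}

The response has the following form; the value shown assumes that each of the five books has exactly one author:

{
    "aggregations": {
        "doc_count": {
            "value": 5
        }
    }
}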
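
The count is per value rather than per document, which matters for multi-valued or missing fields. A minimal Python sketch of that behaviour, using three invented documents (one with two authors, one with no author field):

```python
# Invented documents for illustration only.
books = [
    {"title": "A", "author": "Ann"},
    {"title": "B", "author": ["Bob", "Cy"]},
    {"title": "C"},
]


def value_count(docs, field):
    """Count every value of the field; a multi-valued field counts once per value."""
    total = 0
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field contribute nothing
        total += len(value) if isinstance(value, list) else 1
    return total


print(value_count(books, "author"))  # 3
```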

Bucket Aggregations

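A bucket is exactly what it sounds like: the aggregation walks through the documents and drops every document that meets a given criterion into the same bucket. Bucketing is the equivalent of GROUP BY in SQL. Take the books in the books index as an example: each book is assigned to a category such as technology, economics, or something else, so the technology books form one bucket and the economics books form another. A bucket is simply the set of documents that satisfy one grouping criterion.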

Terms

The Terms Aggregation groups documents by the values of a field. For example, to group the documents in the books index by the language field and count the books for each programming language:
POST books/_search?size=0
{
    "aggs": {
        "per_count": {
            "terms": {
                "field": "language"
            }
        }
    }
}

The response is as follows:

{
    "aggregations": {
        "per_count": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{
                "key": "java",
                "doc_count": 2
            }, {
                "key": "python",
                "doc_count": 2
            }, {
                "key": "javascript",
                "doc_count": 1
            }]
        }
    }
}

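On top of the terms buckets, a metric aggregation can be run inside each bucket. For example, to get the average price of the books for each language, first apply a Terms Aggregation on the language field and then nest an Avg Aggregation inside it:
POST books/_search?size=0
{
    "aggs": {
        "per_count": {
            "terms": {
                "field": "language"
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            }
        }
    }
}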

The result is as follows:

{
    "aggregations": {
        "per_count": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{
                "key": "java",
                "doc_count": 2,
                "avg_price": {
                    "value": 58.35
                }
            }, {
                "key": "python",
                "doc_count": 2,
                "avg_price": {
                    "value": 58.35
                }
            }, {
                "key": "javascript",
                "doc_count": 1,
                "avg_price": {
                    "value": 58.35
                }
            }]
        }
    }
}
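
The bucket-then-metric pattern can be sketched in a few lines of Python: group first, then compute a metric inside each group. The (language, price) pairs below are invented for illustration and are not the contents of the article's index. Ordering buckets by document count, largest first, mirrors the default ordering of the terms aggregation.

```python
from collections import defaultdict

# Invented (language, price) pairs for illustration only.
books = [
    ("java", 46.5), ("java", 70.2),
    ("python", 54.5), ("python", 81.4),
    ("javascript", 66.4),
]


def terms_with_avg(rows):
    """Group rows by key, then compute a count and an average per bucket."""
    groups = defaultdict(list)
    for key, price in rows:
        groups[key].append(price)
    buckets = [
        {"key": key, "doc_count": len(values), "avg_price": sum(values) / len(values)}
        for key, values in groups.items()
    ]
    # Largest buckets first; the sort is stable, so ties keep insertion order.
    buckets.sort(key=lambda bucket: bucket["doc_count"], reverse=True)
    return buckets


for bucket in terms_with_avg(books):
    print(bucket["key"], bucket["doc_count"], round(bucket["avg_price"], 2))
```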

Filter

The Filter Aggregation puts all documents that match a filter condition into a single bucket. For example, to compute the average price of the documents whose title field contains the keyword java:
POST books/_search?size=0
{
    "aggs": {
        "java_avg_price": {
            "filter": {
                "term": {
                    "title": "java"
                }
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            }
        }
    }
}

The result is as follows:

{
    "aggregations": {
        "java_avg_price": {
            "doc_count": 2,
            "avg_price": {
                "value": 58.35
            }
        }
    }
}
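
In plain terms, the filter narrows the document set to one bucket and the nested metric runs only on that bucket. A rough Python sketch with invented titles and prices follows; splitting the title on whitespace is a stand-in for the real text analysis and is an assumption of this example:

```python
# Invented sample documents for illustration only.
books = [
    {"title": "java basics", "price": 46.5},
    {"title": "java in depth", "price": 70.2},
    {"title": "python basics", "price": 54.5},
]


def filter_avg(docs, term, field="price"):
    """Keep documents whose title contains the term, then average the field."""
    matched = [doc[field] for doc in docs if term in doc["title"].split()]
    avg = sum(matched) / len(matched) if matched else None
    return {"doc_count": len(matched), "avg_price": {"value": avg}}


bucket = filter_avg(books, "java")
print(bucket["doc_count"], round(bucket["avg_price"]["value"], 2))
```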

Filters

The Filters Aggregation is the multi-filter variant: documents matching each of several filter conditions go into separate buckets. In the request below, filters contains two match queries, and statistics are computed separately for the results of each query:
POST books/_search?size=0
{
    "aggs": {
        "per_avg_price": {
            "filters": {
                "filters": [{
                    "match": {
                        "title": "java"
                    }
                }, {
                    "match": {
                        "title": "python"
                    }
                }]
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            }
        }
    }
}

The result is as follows:

{
    "aggregations": {
        "per_avg_price": {
            "buckets": [{
                "doc_count": 2,
                "avg_price": {
                    "value": 58.35
                }
            }, {
                "doc_count": 2,
                "avg_price": {
                    "value": 67.95
                }
            }]
        }
    }
}
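
The buckets come back in the same order as the filters in the request. A small Python sketch of that behaviour, again with invented documents and a whitespace split standing in for real text analysis:

```python
# Invented sample documents; titles and prices are for illustration only.
books = [
    {"title": "java basics", "price": 46.5},
    {"title": "java in depth", "price": 70.2},
    {"title": "python basics", "price": 54.5},
    {"title": "python in depth", "price": 81.4},
    {"title": "javascript guide", "price": 66.4},
]


def filters_avg(docs, keywords, field="price"):
    """One bucket per keyword, in the order given, each with its own average."""
    buckets = []
    for keyword in keywords:
        matched = [doc[field] for doc in docs if keyword in doc["title"].split()]
        avg = sum(matched) / len(matched) if matched else None
        buckets.append({"doc_count": len(matched), "avg_price": {"value": avg}})
    return buckets


buckets = filters_avg(books, ["java", "python"])
for bucket in buckets:
    print(bucket["doc_count"], round(bucket["avg_price"]["value"], 2))
```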

Range

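The Range Aggregation buckets documents by value ranges and is useful for showing how data is distributed. Each range includes its from value and excludes its to value. For example, to bucket the books in the books index into the price ranges below 50, 50 to 80, and 80 or above:
POST books/_search?size=0
{
    "aggs": {
        "price_ranges": {
            "range": {
                "field": "price",
                "ranges": [{
                    "to": 50
                }, {
                    "from": 50,
                    "to": 80
                }, {
                    "from": 80
                }]
            }
        }
    }
}

The result is as follows:

{
    "aggregations": {
        "price_ranges": {
            "buckets": [{
                "key": "*-50.0",
                "to": 50,
                "doc_count": 1
            }, {
                "key": "50.0-80.0",
                "from": 50,
                "to": 80,
                "doc_count": 3
            }, {
                "key": "80.0-*",
                "from": 80,
                "doc_count": 1
            }]
        }
    }
}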
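
The boundary handling is easy to get wrong, so here is a short Python sketch of the bucketing rule, with from inclusive and to exclusive. The price list is hypothetical and was picked to agree with the counts in the result above:

```python
# Hypothetical prices, chosen to agree with the bucket counts above.
prices = [46.5, 54.5, 66.4, 70.2, 81.4]
ranges = [{"to": 50}, {"from": 50, "to": 80}, {"from": 80}]


def range_buckets(values, range_specs):
    """Count values per range; a missing bound means the range is open-ended."""
    buckets = []
    for spec in range_specs:
        low = spec.get("from", float("-inf"))
        high = spec.get("to", float("inf"))
        # Lower bound is inclusive, upper bound is exclusive.
        count = sum(1 for value in values if value >= low and high > value)
        buckets.append({**spec, "doc_count": count})
    return buckets


for bucket in range_buckets(prices, ranges):
    print(bucket)
```

A value of exactly 50 would land in the second bucket, and a value of exactly 80 in the third.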

Other Aggregations

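Date Range Aggregation
The Date Range Aggregation is a range aggregation dedicated to date fields. Unlike the Range Aggregation, its from and to values can be written as date math expressions such as now-10M/M.
Date Histogram Aggregation
The Date Histogram Aggregation buckets documents by time interval. It is commonly used to count documents per date and plot the counts as a bar chart.
Missing Aggregation
The Missing Aggregation handles empty values: it puts every document in the set that lacks a value for the given field into one bucket.
Children Aggregation
The Children Aggregation is a special single-bucket aggregation that buckets documents according to their parent-child relationship.
Geo Distance Aggregation
The Geo Distance Aggregation computes range statistics over geographic points (geo_point) by distance from an origin.
IP Range Aggregation
The IP Range Aggregation is a range aggregation for IP-typed fields.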
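
As a rough illustration of two of these, the Python sketch below builds request bodies for a date_range and a missing aggregation as plain dictionaries. The helper functions and the aggregation names by_publish_time and without_author are made up for this example; the fields publish_time and author are the ones used earlier in this article. The printed JSON is the kind of body that would be sent to books/_search.

```python
import json


def date_range_agg(field, boundary):
    """Body of a date_range aggregation split at one date-math boundary."""
    return {
        "date_range": {
            "field": field,
            "ranges": [{"to": boundary}, {"from": boundary}],
        }
    }


def missing_agg(field):
    """Body of a missing aggregation: one bucket of documents lacking the field."""
    return {"missing": {"field": field}}


body = {
    "size": 0,
    "aggs": {
        # Hypothetical aggregation names chosen for this sketch.
        "by_publish_time": date_range_agg("publish_time", "now-10M/M"),
        "without_author": missing_agg("author"),
    },
}

print(json.dumps(body, indent=4))
```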