ES Summary 5: Aggregations
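<h1>Aggregation Analysis</h1>
<p>Elasticsearch is a distributed full-text search engine, and indexing and searching are its basic functions. Its aggregations feature is just as capable and supports complex statistical analysis over the data. Elasticsearch groups aggregations into four broad categories: metric, bucket, pipeline, and matrix aggregations. The official documentation described pipeline and matrix aggregations as experimental, meaning they could later be changed completely or removed, so they are not covered here.</p>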
<h1>Metric Aggregations</h1>
<h2>Max</h2>
<p>Max Aggregation returns the maximum value of a field. For example, to find the highest book price in the books index, use the following query:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "max_price": {
      "max": {
        "field": "price"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "max_price": {
      "value": 93
    }
  }
}</code></pre>
<h2>Min</h2>
<p>Min Aggregation returns the minimum value of a field. For example, to find the earliest publication date in the books index, use the following query:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "min_year": {
      "min": {
        "field": "publish_time"
      }
    }
  }
}</code></pre>
<p>The response, with the date shown in readable form, is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "min_year": {
      "value": "2017-02-02"
    }
  }
}</code></pre>
<h2>Avg</h2>
<p>Avg Aggregation computes the average of a field. For example, to compute the average price of all books in the books index, use the following query:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "avg_price": {
      "value": 42.2
    }
  }
}</code></pre>
<h2>Sum</h2>
<p>Sum Aggregation computes the sum of a field. For example, to compute the total price of all books in the books index, use the following query:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "sum_price": {
      "sum": {
        "field": "price"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "sum_price": {
      "value": 455
    }
  }
}</code></pre>
<h2>Cardinality</h2>
<p>Cardinality Aggregation counts distinct values. <strong>It first performs an operation similar to distinct in SQL</strong>, removing duplicates from the set, and then returns the size of the deduplicated set. For example, running a cardinality aggregation on the language field of the books index gives the number of distinct programming languages:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "all_lang": {
      "cardinality": {
        "field": "language"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "all_lang": {
      "value": 3
    }
  }
}</code></pre>
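<p>As a rough illustration of the distinct-then-count idea, the sketch below computes an exact distinct count in plain Python. The sample values are hypothetical, chosen to mirror a five-book index, and the helper name <code>exact_cardinality</code> is made up for this example. Elasticsearch itself computes cardinality approximately (with a HyperLogLog++ based algorithm), so on large data its result can differ slightly from an exact count like this one.</p>

```python
# Hypothetical sample: one language value per book in a five-book index.
languages = ["java", "python", "java", "javascript", "python"]


def exact_cardinality(values):
    """Drop duplicates (like SQL DISTINCT), then count what is left."""
    return len(set(values))


print(exact_cardinality(languages))
```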
<h2>Stats</h2>
<p>Stats Aggregation provides basic statistics and returns five metrics at once: count, max, min, avg, and sum. For example, to compute basic statistics on the price field of the books index, use the following query:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "all_stas": {
      "stats": {
        "field": "price"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "all_stas": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319
    }
  }
}</code></pre>
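<p>To make the five metrics concrete, here is a minimal Python sketch that computes the same figures locally. The five prices are an assumption: they are inferred from the statistics shown in this article's responses, not listed anywhere in the source. The function name <code>stats</code> is also just illustrative.</p>

```python
# Assumed prices, inferred from the count/min/max/avg/sum figures above.
prices = [46.5, 54.5, 66.4, 70.2, 81.4]


def stats(values):
    """Return count, min, max, avg and sum for a non-empty list of numbers."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


result = stats(prices)
# Round for display only; floating-point sums carry tiny rounding errors.
print({name: round(value, 2) for name, value in result.items()})
```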
<h2>Extended Stats</h2>
<p>Extended Stats Aggregation provides extended statistics. It works like Stats Aggregation but returns four additional results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus or minus two standard deviations. To compute extended statistics on the price field of the books index, use the following query:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "all_stas": {
      "extended_stats": {
        "field": "price"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "all_stas": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319,
      "sum_of_squares": 21095.46,
      "variance": 148.65199999999967,
      "std_deviation": 12.19229264740638,
      "std_deviation_bounds": {
        "upper": 88.18458529481276,
        "lower": 39.41541470518724
      }
    }
  }
}</code></pre>
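<p>The extra figures can be reproduced by hand. The sketch below is a plain-Python approximation under two assumptions: the five prices are inferred from the response above (the source does not list them), and the variance is the population variance, which is what matches the numbers shown. The <code>sigma</code> argument controls how many standard deviations the bounds span.</p>

```python
import math

# Assumed prices, inferred from the statistics in the response above.
prices = [46.5, 54.5, 66.4, 70.2, 81.4]


def extended_stats(values, sigma=2.0):
    """Compute sum of squares, population variance, std deviation and bounds."""
    n = len(values)
    avg = sum(values) / n
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of squares minus square of mean.
    # Clamp at zero to absorb floating-point error on constant data.
    variance = max(0.0, sum_of_squares / n - avg * avg)
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


result = extended_stats(prices)
print(round(result["variance"], 3), round(result["std_deviation"], 4))
```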
<h2>Percentiles</h2>
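<p>Percentiles Aggregation computes percentiles. A percentile is a statistical term: if a data set is sorted from smallest to largest and the cumulative percentage is computed for each position, the value at a given percentage is called that percentile.
For example, to compute percentiles on the price field of the books index, use the following query:
<code>GET books/_search</code></p>
<pre><code class="language-json">{
  "aggs": {
    "percentiles_price": {
      "percentiles": {
        "field": "price"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "percentiles_price": {
      "1.0": 46.82,
      "5.0": 48.1,
      "25.0": 54.5,
      "50.0": 66.4,
      "75.0": 70.2,
      "95.0": 79.16,
      "99.0": 80.95200000000001
    }
  }
}</code></pre>
<p>How to read the result:
50% of the documents have a price of at most 66.4. Put the other way round, documents with a price of at most 66.4 make up 50% of all matched documents.</p>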
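<p>The figures above are consistent with linear interpolation between neighbouring sorted values, which the following Python sketch illustrates. Both the five prices and the interpolation rule are assumptions inferred from the response; Elasticsearch computes percentiles with an approximate algorithm (TDigest by default), so on large data sets its results need not match an exact calculation like this one.</p>

```python
# Assumed prices, inferred from the percentile values in the response above.
prices = sorted([46.5, 54.5, 66.4, 70.2, 81.4])


def percentile(sorted_values, p):
    """Percentile p (0..100) by linear interpolation between sorted values."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    rank = (p / 100) * (len(sorted_values) - 1)  # fractional index
    lower = int(rank)                            # floor, since rank >= 0
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    low_value = sorted_values[lower]
    return low_value + fraction * (sorted_values[upper] - low_value)


for p in (1, 5, 25, 50, 75, 95, 99):
    print(p, round(percentile(prices, p), 3))
```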
<h2>Value Count</h2>
<p>Value Count Aggregation counts documents by field. For example, to count the documents in the books index that contain the author field, use the following query:
<code>POST books/_search</code></p>
<pre><code class="language-json">{
  "size": 0,
  "aggs": {
    "doc_count": {
      "value_count": {
        "field": "author"
      }
    }
  }
}</code></pre>
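<p>The response has the shape below. In an actual response, <code>value</code> holds the resulting count as a number rather than the field name shown here:</p>
<pre><code class="language-json">{
  "aggregations": {
    "doc_count": {
      "value": "author"
    }
  }
}</code></pre>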
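<h1>Bucket Aggregations</h1>
<p>A bucket can be thought of as a container: the aggregation walks through the documents and places each one that meets a given criterion into a bucket. Bucketing is comparable to GROUP BY in SQL. Taking the books index as an example, a book is classified as technology, economics, or some other category. The technology books form one bucket and the economics books form another. A bucket is thus the set of documents that satisfy a given grouping criterion.</p>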
<h2>Terms</h2>
<p>Terms Aggregation groups documents by field value. For example, to group the documents in the books index by the language field and count the books for each programming language, use the following query:
<code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{
  "aggs": {
    "per_count": {
      "terms": {
        "field": "language"
      }
    }
  }
}</code></pre>
<p>The response is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "per_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [{
        "key": "java",
        "doc_count": 2
      }, {
        "key": "python",
        "doc_count": 2
      }, {
        "key": "javascript",
        "doc_count": 1
      }]
    }
  }
}</code></pre>
<p>On top of terms bucketing, a metric aggregation can be applied to each bucket. For example, to compute the average price of each category of books, first run a Terms Aggregation on the language field and then an Avg Aggregation inside it:</p>
<pre><code class="language-json">{
  "aggs": {
    "per_count": {
      "terms": {
        "field": "language"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}</code></pre>
<p>The result is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "per_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [{
        "key": "java",
        "doc_count": 2,
        "avg_price": {
          "value": 58.35
        }
      }, {
        "key": "python",
        "doc_count": 2,
        "avg_price": {
          "value": 58.35
        }
      }, {
        "key": "javascript",
        "doc_count": 1,
        "avg_price": {
          "value": 58.35
        }
      }]
    }
  }
}</code></pre>
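<p>Conceptually, a terms aggregation with a nested avg is a group-by followed by a per-group average. The Python sketch below shows that idea on hypothetical sample documents; the prices are made up for illustration, so the averages it produces need not match the response shown in this article. Sorting by document count, with ties broken by key, is an assumption made here to get a deterministic order.</p>

```python
from collections import defaultdict

# Hypothetical documents; only the two fields used by the aggregation.
books = [
    {"language": "java", "price": 46.5},
    {"language": "java", "price": 70.2},
    {"language": "python", "price": 54.5},
    {"language": "python", "price": 81.4},
    {"language": "javascript", "price": 66.4},
]


def terms_with_avg(docs, field, metric_field):
    """Group docs by `field`, then average `metric_field` inside each group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "avg_price": {"value": sum(values) / len(values)},
        }
        for key, values in groups.items()
    ]
    # Largest buckets first; ties broken by key (an assumption for determinism).
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets


for bucket in terms_with_avg(books, "language", "price"):
    print(bucket["key"], bucket["doc_count"], round(bucket["avg_price"]["value"], 2))
```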
<h2>Filter</h2>
<p>Filter Aggregation puts all documents that match the filter condition into a single bucket. For example, to compute the average price of the documents whose title field contains the keyword java, use the following query:
<code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{
  "aggs": {
    "java_avg_price": {
      "filter": {
        "term": {
          "title": "java"
        }
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}</code></pre>
<p>The result is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "java_avg_price": {
      "doc_count": 2,
      "avg_price": {
        "value": 58.35
      }
    }
  }
}</code></pre>
<h2>Filters</h2>
<p>Filters Aggregation is the multi-filter version: it puts the documents matching each of several filter conditions into separate buckets. In the request below, filters contains two match queries, and the results of each query are grouped and aggregated separately:
<code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{
  "aggs": {
    "per_avg_price": {
      "filters": {
        "filters": [{
          "match": {
            "title": "java"
          }
        }, {
          "match": {
            "title": "python"
          }
        }]
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}</code></pre>
<p>The result is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "per_avg_price": {
      "buckets": [{
        "doc_count": 2,
        "avg_price": {
          "value": 58.35
        }
      }, {
        "doc_count": 2,
        "avg_price": {
          "value": 67.95
        }
      }]
    }
  }
}</code></pre>
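<p>The filter and filters aggregations both amount to "select the matching documents, then aggregate each selection". A minimal Python sketch of that idea follows. The titles and prices are hypothetical, and matching on whitespace-separated lowercase words only approximates how a match query behaves on an analyzed text field.</p>

```python
# Hypothetical documents; titles and prices are made up for illustration.
books = [
    {"title": "Java Programming Basics", "price": 46.5},
    {"title": "Thinking in Java", "price": 70.2},
    {"title": "Python Cookbook", "price": 54.5},
    {"title": "Learning Python", "price": 81.4},
    {"title": "JavaScript Guide", "price": 66.4},
]


def filters_avg(docs, keywords):
    """Return one bucket per keyword, each with a doc count and average price."""
    buckets = []
    for keyword in keywords:
        # Whole-word match, so "javascript" does not count as "java".
        matched = [d["price"] for d in docs if keyword in d["title"].lower().split()]
        average = sum(matched) / len(matched) if matched else None
        buckets.append({"doc_count": len(matched), "avg_price": {"value": average}})
    return buckets


for bucket in filters_avg(books, ["java", "python"]):
    print(bucket["doc_count"], round(bucket["avg_price"]["value"], 2))
```

<p>Passing a single keyword gives the one-bucket behaviour of Filter Aggregation; a document that matches several keywords would be counted in each of the corresponding buckets.</p>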
<h2>Range</h2>
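<p>Range Aggregation buckets documents by value ranges and shows how the data is distributed. For example, to group the books in the books index into the price ranges below 50, 50 to 80, and 80 and above (each range includes its from value and excludes its to value), use the following query:
<code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [{
          "to": 50
        }, {
          "from": 50,
          "to": 80
        }, {
          "from": 80
        }]
      }
    }
  }
}</code></pre>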
<p>The result is:</p>
<pre><code class="language-json">{
  "aggregations": {
    "price_ranges": {
      "buckets": [{
        "key": "*-50.0",
        "to": 50,
        "doc_count": 1
      }, {
        "key": "50.0-80.0",
        "from": 50,
        "to": 80,
        "doc_count": 3
      }, {
        "key": "80.0-*",
        "from": 80,
        "doc_count": 1
      }]
    }
  }
}</code></pre>
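<p>The bucketing rule can be sketched in a few lines of Python. The prices are an assumption, inferred from the statistics earlier in this article, and the helper name <code>range_buckets</code> is illustrative. Each range includes its <code>from</code> value and excludes its <code>to</code> value, so a price of exactly 50 would land in the second bucket.</p>

```python
# Assumed prices, inferred from the statistics shown earlier in the article.
prices = [46.5, 54.5, 66.4, 70.2, 81.4]

# Same ranges as the request above; a missing bound means "unbounded".
ranges = [{"to": 50}, {"from": 50, "to": 80}, {"from": 80}]


def range_buckets(values, range_specs):
    """Count values per range, with `from` inclusive and `to` exclusive."""
    buckets = []
    for spec in range_specs:
        low = spec.get("from")
        high = spec.get("to")
        count = sum(
            1
            for value in values
            if (low is None or value >= low) and (high is None or value < high)
        )
        buckets.append({"from": low, "to": high, "doc_count": count})
    return buckets


for bucket in range_buckets(prices, ranges):
    print(bucket)
```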
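<h2>Other Aggregations</h2>
<p>Date Range Aggregation: a range aggregation dedicated to date fields. Unlike Range Aggregation, its start and end values can be written as date math expressions.</p>
<p>Date Histogram Aggregation: a time-based histogram aggregation, commonly used to count documents by date and plot the result as a bar chart.</p>
<p>Missing Aggregation: a null-value aggregation that puts all documents in the document set that lack a given field into one bucket.</p>
<p>Children Aggregation: a special single-bucket aggregation that buckets documents according to parent-child relationships.</p>
<p>Geo Distance Aggregation: computes range statistics over geographic points (geo_point).</p>
<p>IP Range Aggregation: a range aggregation over IP-typed data.</p>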
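<p>As a small illustration of the date histogram idea, the sketch below counts documents per calendar month in plain Python. The publication dates are hypothetical (only 2017-02-02 appears in this article), and the sketch lists only months that contain at least one document.</p>

```python
from collections import Counter
from datetime import date

# Hypothetical publication dates, made up for illustration.
publish_dates = [
    date(2017, 2, 2),
    date(2017, 2, 20),
    date(2017, 5, 1),
    date(2018, 1, 15),
    date(2018, 1, 30),
]


def monthly_histogram(dates):
    """Return (month, doc_count) pairs in chronological order."""
    counts = Counter(d.strftime("%Y-%m") for d in dates)
    return sorted(counts.items())


for month, doc_count in monthly_histogram(publish_dates):
    print(month, doc_count)
```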