MyBlog


ES Summary, Part 5: Aggregations

<h1>Aggregation Analysis</h1>
<p>Elasticsearch is well known as a distributed full-text search engine whose core functions are indexing and search. Its aggregations feature is just as powerful and supports complex statistical analysis over the data. Elasticsearch groups aggregations into four main families: metric, bucket, pipeline, and matrix aggregations. At the time of writing, the official documentation marked pipeline and matrix aggregations as experimental, meaning they might be changed completely or removed in a later release, so they are not covered here.</p>
<h1>Metric Aggregations</h1>
<h2>Max</h2>
<p>The Max aggregation returns the maximum value of a field. For example, to find the highest book price in the books index: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "max_price": { "max": { "field": "price" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "max_price": { "value": 93.0 } } }</code></pre>
<h2>Min</h2>
<p>The Min aggregation returns the minimum value of a field. For example, to find the earliest publication date in the books index: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "min_year": { "min": { "field": "publish_time" } } } }</code></pre>
<p>The response (abridged; for a date field, Elasticsearch also reports the numeric <code>value</code> in epoch milliseconds):</p>
<pre><code class="language-json">{ "aggregations": { "min_year": { "value_as_string": "2017-02-02" } } }</code></pre>
<h2>Avg</h2>
<p>The Avg aggregation computes an average. For example, to compute the average price of all books in the books index: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "avg_price": { "avg": { "field": "price" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "avg_price": { "value": 42.2 } } }</code></pre>
<h2>Sum</h2>
<p>The Sum aggregation computes a total. For example, to compute the total price of all books in the books index: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "sum_price": { "sum": { "field": "price" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "sum_price": { "value": 455.0 } } }</code></pre>
<h2>Cardinality</h2>
<p>The Cardinality aggregation counts distinct values. <strong>It first performs the equivalent of SQL's DISTINCT</strong>, removing duplicates from the set, and then counts the values that remain. The count is approximate for fields with many distinct values. For example, running cardinality on the language field of the books index gives the number of programming languages covered: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "all_lang": { "cardinality": { "field": "language" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "all_lang": { "value": 3 } } }</code></pre>
<h2>Stats</h2>
<p>The Stats aggregation returns five basic statistics in one call: count, max, min, avg, and sum. For example, to compute basic statistics on the price field of the books index: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "all_stats": { "stats": { "field": "price" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "all_stats": { "count": 5, "min": 46.5, "max": 81.4, "avg": 63.8, "sum": 319.0 } } }</code></pre>
<h2>Extended Stats</h2>
<p>The Extended Stats aggregation works like Stats but returns four additional results: the sum of squares, the variance, the standard deviation, and the interval spanning the mean plus and minus two standard deviations. To run it on the price field of the books index: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "all_stats": { "extended_stats": { "field": "price" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "all_stats": {
  "count": 5, "min": 46.5, "max": 81.4, "avg": 63.8, "sum": 319.0,
  "sum_of_squares": 21095.46,
  "variance": 148.65199999999967,
  "std_deviation": 12.19229264740638,
  "std_deviation_bounds": { "upper": 88.18458529481276, "lower": 39.41541470518724 }
} } }</code></pre>
<h2>Percentiles</h2>
<p>The Percentiles aggregation computes percentiles. Percentile is a statistical term: if a set of values is sorted in ascending order and the cumulative percentage is computed for each position, the value found at a given percentage is the percentile for that percentage. For example, to compute percentiles on the price field of the books index: <code>GET books/_search</code></p>
<pre><code class="language-json">{ "aggs": { "percentiles_price": { "percentiles": { "field": "price" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "percentiles_price": { "values": {
  "1.0": 46.82, "5.0": 48.1, "25.0": 54.5, "50.0": 66.4,
  "75.0": 70.2, "95.0": 79.16, "99.0": 80.95200000000001
} } } }</code></pre>
<p>Reading the result: 50% of the documents have a price &lt;= 66.4. Put the other way, documents with price &lt;= 66.4 make up 50% of all matching documents.</p>
<h2>Value Count</h2>
<p>The Value Count aggregation counts the values of a field across documents. For example, to count how many documents in the books index contain an author field: <code>POST books/_search</code></p>
<pre><code class="language-json">{ "size": 0, "aggs": { "doc_count": { "value_count": { "field": "author" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "doc_count": { "value": 5 } } }</code></pre>
<p>Here <code>value</code> is the number of <code>author</code> values counted; the figure 5 assumes each of the five books in the index carries exactly one.</p>
<h1>Bucket Aggregations</h1>
<p>A bucket aggregation walks the documents and places every document that meets a given criterion into a bucket. Bucketing is analogous to GROUP BY in SQL. Taking the books index as an example, each book is classified as technology, economics, or some other category, so the technology books form one bucket and the economics books form another. A bucket is simply the set of documents that satisfy one classification criterion.</p>
<h2>Terms</h2>
<p>The Terms aggregation groups documents by field value. For example, to group the documents in the books index by the language field and count the books for each programming language: <code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{ "aggs": { "per_count": { "terms": { "field": "language" } } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "per_count": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 0,
  "buckets": [
    { "key": "java", "doc_count": 2 },
    { "key": "python", "doc_count": 2 },
    { "key": "javascript", "doc_count": 1 }
  ]
} } }</code></pre>
<p>On top of the terms buckets, a metric aggregation can be run inside each bucket. For example, to compute the average price for each category of book, nest an Avg aggregation inside the Terms aggregation on the language field:</p>
<pre><code class="language-json">{ "aggs": { "per_count": {
  "terms": { "field": "language" },
  "aggs": { "avg_price": { "avg": { "field": "price" } } }
} } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "per_count": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 0,
  "buckets": [
    { "key": "java", "doc_count": 2, "avg_price": { "value": 58.35 } },
    { "key": "python", "doc_count": 2, "avg_price": { "value": 58.35 } },
    { "key": "javascript", "doc_count": 1, "avg_price": { "value": 58.35 } }
  ]
} } }</code></pre>
<h2>Filter</h2>
<p>The Filter aggregation places all documents that match a filter into a single bucket. For example, to compute the average price of the documents whose title field contains the keyword java: <code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{ "aggs": { "java_avg_price": {
  "filter": { "term": { "title": "java" } },
  "aggs": { "avg_price": { "avg": { "field": "price" } } }
} } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "java_avg_price": { "doc_count": 2, "avg_price": { "value": 58.35 } } } }</code></pre>
<h2>Filters</h2>
<p>The Filters aggregation takes several filters and places the documents matching each one into a separate bucket. In the request below, filters contains two match queries, and statistics are computed separately for the results of each query: <code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{ "aggs": { "per_avg_price": {
  "filters": { "filters": [
    { "match": { "title": "java" } },
    { "match": { "title": "python" } }
  ] },
  "aggs": { "avg_price": { "avg": { "field": "price" } } }
} } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "per_avg_price": { "buckets": [
  { "doc_count": 2, "avg_price": { "value": 58.35 } },
  { "doc_count": 2, "avg_price": { "value": 67.95 } }
] } } }</code></pre>
<h2>Range</h2>
<p>The Range aggregation buckets documents by value range and shows how the data is distributed. Each range includes its <code>from</code> value and excludes its <code>to</code> value. For example, to bucket the books in the books index into the price ranges 0 to 50, 50 to 80, and 80 and above: <code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{ "aggs": { "price_ranges": { "range": {
  "field": "price",
  "ranges": [ { "to": 50 }, { "from": 50, "to": 80 }, { "from": 80 } ]
} } } }</code></pre>
<p>The response:</p>
<pre><code class="language-json">{ "aggregations": { "price_ranges": { "buckets": [
  { "key": "*-50.0", "to": 50.0, "doc_count": 1 },
  { "key": "50.0-80.0", "from": 50.0, "to": 80.0, "doc_count": 3 },
  { "key": "80.0-*", "from": 80.0, "doc_count": 1 }
] } } }</code></pre>
<h2>Other Aggregations</h2>
<p><strong>Date Range Aggregation</strong>: a range aggregation dedicated to date fields. It differs from the Range aggregation in that the start and end values can be written as date math expressions.</p>
<p><strong>Date Histogram Aggregation</strong>: a time-based histogram, commonly used to count documents per date interval and plot the result as a bar chart.</p>
<p><strong>Missing Aggregation</strong>: a null-value aggregation that places every document lacking a given field into one bucket.</p>
<p><strong>Children Aggregation</strong>: a special single-bucket aggregation that buckets documents according to their parent-child relationship.</p>
<p><strong>Geo Distance Aggregation</strong>: computes range statistics over geographic points (geo_point).</p>
<p><strong>IP Range Aggregation</strong>: a range aggregation for fields of the IP type.</p>

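<p>As a minimal sketch of the Date Range, Date Histogram, and Missing aggregations described above, the request below reuses the <code>publish_time</code> and <code>price</code> fields from the earlier examples. The aggregation names (<code>publish_ranges</code>, <code>books_per_year</code>, <code>no_price</code>), the date math bound <code>now-2y/y</code>, and the <code>yyyy-MM-dd</code> format are illustrative assumptions rather than part of the original data set, and <code>calendar_interval</code> assumes Elasticsearch 7.2 or later (earlier versions use <code>interval</code> instead): <code>POST books/_search?size=0</code></p>
<pre><code class="language-json">{
  "aggs": {
    "publish_ranges": {
      "date_range": {
        "field": "publish_time",
        "format": "yyyy-MM-dd",
        "ranges": [
          { "to": "now-2y/y" },
          { "from": "now-2y/y" }
        ]
      }
    },
    "books_per_year": {
      "date_histogram": {
        "field": "publish_time",
        "calendar_interval": "year"
      }
    },
    "no_price": {
      "missing": { "field": "price" }
    }
  }
}</code></pre>
<p>The first aggregation splits the books into those published before the start of the year two years ago and those published since, the second counts books per calendar year, and the third collects the documents that have no <code>price</code> value.</p>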