Oracle 12cR2 gathering statistics :approximate_ndv_algorithm之hyperloglog(一)

众所周知,拥有适当的数据库统计信息对于查询优化器至关重要。统计数据应该正确地描述数据库中的数据。有效地收集统计数据并获得正确的统计数据,并不是一个容易的过程。我们先来说说统一信息中的采样。数据库从一个小样本开始,然后在一些分析的基础上确定样本大小是否足够大。因此,数据库首先需要确定最优的样本大小,然后扫描数据库对象以计算对象统计信息。但是,如果一个表很大,那么被认为是最优的样本数量实际上太小了,当然计算出来的统计数据是不准确的。此外,这种方法需要重新扫描所有分区,以便计算分区表的全局对象统计信息。对于非分区表,数据库必须执行排序操作,以便为列生成许多不同的值(NDV)-这是一种资源密集型操作。在Oracle 12c R2中Oracle推出了approximate_ndv_algorithm参数,可以设置hyperloglog、自适应采样等算法。而为了支持这种机制,在数据库中引入了一种称为“概要”的特殊数据结构。它包含有关对象(即表的列、分区)的信息,这些信息有助于有效地近似的获取相关的统计信息。
让我们来初步了解一些,概要和approximate_ndv_algorithm中的hyperloglog算法

CREATE TABLE u1.sales1
 (prod_id NUMBER NOT NULL,
 cust_id NUMBER NOT NULL,
 time_id DATE NOT NULL,
 channel_id NUMBER NOT NULL,
 promo_id NUMBER NOT NULL,
 quantity_sold NUMBER(10,2) NOT NULL,
 amount_sold NUMBER(10,2) NOT NULL)
 PARTITION BY RANGE (TIME_ID)
 (
 PARTITION sales_q1_1998 VALUES LESS THAN (TO_DATE(' 1998-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')),
 PARTITION sales_q2_1998 VALUES LESS THAN (TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
 )
/
多执行几次,创建一张较大的表。
insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from sh.sales where time_id<=to_date('6/30/1998','mm/dd/yyyy');
insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from u1.sales1 where time_id<=to_date('6/30/1998','mm/dd/yyyy'));
insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from u1.sales1 where time_id<=to_date('6/30/1998','mm/dd/yyyy'));
insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from u1.sales1 where time_id<=to_date('6/30/1998','mm/dd/yyyy'));
SQL> select segment_name,bytes/1024/1024/1024 from dba_segments where segment_name='SALES1';
SEGMENT_NAME                   BYTES/1024/1024/1024
------------------------------ --------------------
SALES1                                     3.296875
SALES1                                   2.74902344

收集表的统计信息,收集方式使用approximate_ndv_algorithm中的hyperloglog算法。同时设置增量收集。

BEGIN
 DBMS_STATS.set_table_prefs (ownname => 'u1',tabname => 'sales1',pname => 'approximate_ndv_algorithm',pvalue => 'hyperloglog');
 DBMS_STATS.set_table_prefs ('u1','sales1','INCREMENTAL','TRUE');
END;

统计信息进行收集,同时打开trace进行跟踪

BEGIN
 DBMS_STATS.set_global_prefs ('trace', TO_CHAR (2048 + 32768 + 4 + 16));
 DBMS_STATS.gather_table_stats (ownname => 'u1', tabname => 'SALES1');
END;

从trace中可以看到。Oracle开始为表的一个分区(SALES_Q2_1998)收集统计信息。而收集的方式是HLL(incremental)的方式。

DBMS_STATS: gather stats on partition SALES_Q2_1998: synopsis not gathered yet; not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1 tabname: SALES1 pname: SALES_Q2_1998 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q2_1998 objn: 74398 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0

DBMS_STATS: no AS synopses to delete for #74396
DBMS_STATS: Using approximate NDV pct=0
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:        Y    Y    Y                                            PROD_ID
DBMS_STATS:        Y    Y    Y                                            CUST_ID
DBMS_STATS:        Y         Y    Y         Y    Y                        TIME_ID
DBMS_STATS:        Y    Y    Y                                            CHANNEL_ID
DBMS_STATS:        Y    Y    Y                                            PROMO_ID
DBMS_STATS:        Y    Y    Y                                            QUANTITY_SOLD
DBMS_STATS:        Y    Y    Y                                            AMOUNT_SOLD
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Approximate NDV Options
DBMS_STATS: SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U
DBMS_STATS: Starting query at 26-JUN-18 01.29.25.406113000 AM +08:00
DBMS_STATS: select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */to_ch
ar(count("PROD_ID")),substrb(dump(min("PROD_ID"),16,0,64),1,240),substrb(dump(max("PROD_ID"),16,0,64),1,240),to_char(count("CUST_ID")),substrb(dump(min("CUST_ID"),16,0,64),1,240),substrb(dump(max("CUST_ID"),16,0,64),1,240),to_char(count("TIME_ID"
)),substrb(dump(min("TIME_ID"),16,0,64),1,240),substrb(dump(max("TIME_ID"),16,0,64),1,240),to_char(count("CHANNEL_ID")),substrb(dump(min("CHANNEL_ID"),16,0,64),1,240),substrb(dump(max("CHANNEL_ID"),16,0,64),1,240),to_char(count("PROMO_ID")),subst
rb(dump(min("PROMO_ID"),16,0,64),1,240),substrb(dump(max("PROMO_ID"),16,0,64),1,240),to_char(count("QUANTITY_SOLD")),substrb(dump(min("QUANTITY_SOLD"),16,0,64),1,240),substrb(dump(max("QUANTITY_SOLD"),16,0,64),1,240),to_char(count("AMOUNT_SOLD"))
,substrb(dump(min("AMOUNT_SOLD"),16,0,64),1,240),substrb(dump(max("AMOUNT_SOLD"),16,0,64),1,240),count(rowidtochar(rowid)) from "U1"."SALES1" t  where TBL$OR$IDX$PART$NUM("U1"."SALES1",0,4,0,"ROWID") = :objn /* SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,N
IL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U*/

使用同样的方法,数据库计算第二个分区SALES_Q1_1998的列统计信息。

DBMS_STATS: gather stats on partition SALES_Q1_1998: synopsis not gathered yet; not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1 tabname: SALES1 pname: SALES_Q1_1998 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q1_1998 objn: 74397 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
DBMS_STATS: delete synopses of a single partition
DBMS_STATS: Starting query at 26-JUN-18 01.29.48.878218000 AM +08:00
DBMS_STATS:  select /*+ OPT_PARAM('_parallel_syspls_obey_force' 'false') */ 1
            from dual where exists (select null from wri$_optstat_synopsis_head$ h, table(:clist_syn) c where h.bo# = :bo#  and h.intcol# = c.column_value and h.analyzetime is not null and h.spare1 is null  and group# in  (:singlegroup) )
DBMS_STATS: no AS synopses to delete for #74396
DBMS_STATS: Using approximate NDV pct=0
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:        Y    Y    Y                                            PROD_ID
DBMS_STATS:        Y    Y    Y                                            CUST_ID
DBMS_STATS:        Y         Y    Y         Y    Y                        TIME_ID
DBMS_STATS:        Y    Y    Y                                            CHANNEL_ID
DBMS_STATS:        Y    Y    Y                                            PROMO_ID
DBMS_STATS:        Y    Y    Y                                            QUANTITY_SOLD
DBMS_STATS:        Y    Y    Y                                            AMOUNT_SOLD
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Approximate NDV Options
DBMS_STATS: SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U
DBMS_STATS: Starting query at 26-JUN-18 01.29.48.884823000 AM +08:00
DBMS_STATS: select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */to_ch
ar(count("PROD_ID")),substrb(dump(min("PROD_ID"),16,0,64),1,240),substrb(dump(max("PROD_ID"),16,0,64),1,240),to_char(count("CUST_ID")),substrb(dump(min("CUST_ID"),16,0,64),1,240),substrb(dump(max("CUST_ID"),16,0,64),1,240),to_char(count("TIME_ID"
)),substrb(dump(min("TIME_ID"),16,0,64),1,240),substrb(dump(max("TIME_ID"),16,0,64),1,240),to_char(count("CHANNEL_ID")),substrb(dump(min("CHANNEL_ID"),16,0,64),1,240),substrb(dump(max("CHANNEL_ID"),16,0,64),1,240),to_char(count("PROMO_ID")),subst
rb(dump(min("PROMO_ID"),16,0,64),1,240),substrb(dump(max("PROMO_ID"),16,0,64),1,240),to_char(count("QUANTITY_SOLD")),substrb(dump(min("QUANTITY_SOLD"),16,0,64),1,240),substrb(dump(max("QUANTITY_SOLD"),16,0,64),1,240),to_char(count("AMOUNT_SOLD"))
,substrb(dump(min("AMOUNT_SOLD"),16,0,64),1,240),substrb(dump(max("AMOUNT_SOLD"),16,0,64),1,240),count(rowidtochar(rowid)) from "U1"."SALES1" t  where TBL$OR$IDX$PART$NUM("U1"."SALES1",0,4,0,"ROWID") = :objn /* SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,N
IL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U*/

最后,Oracle通过agg计算了已分区表的全局统计信息。 重新计算分区级别的统计信息。oracle通过合并概要实现了这一点。我们可以从跟踪文件中清楚地看到它。

DBMS_STATS: prepare reporting structures...
DBMS_STATS: reporting_man_update_task: objn: 74397 auto_stats: FALSE status: COMPLETED ctx.batching_coeff: 0
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1 tabname: SALES1 pname:  spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only                                      
DBMS_STATS: Deriving Global Histogram for TIME_ID
DBMS_STATS:  derive global freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "TIME_ID"
DBMS_STATS:  number of values = 181, max # of buckects = 254, pct = 100, ssize = 162703360
DBMS_STATS:   range_skew: YES
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 1037
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES1                    ===========>从表SALES1的分区概要和统计信息中派生出全局的统计信息。
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE

因此,如果表分区有统计信息,那么通过合并适当分区的概要就可以计算出全局统计信息。从Oracle 12c R2开始,synopsis(它是基于HLL算法创建)存储在WRI$_OPTSTAT_SYNOPSIS_HEAD$表中。

SQL> SELECT bo#,group#,intcol#,synopsis#,split,spare1,DBMS_LOB.SUBSTR (spare2, 10) spare2 FROM wri$_optstat_synopsis_head$ 
  2  WHERE bo# IN (SELECT object_id FROM dba_objects WHERE object_name = 'SALES1'); 

       BO#     GROUP#    INTCOL#  SYNOPSIS#      SPLIT     SPARE1 SPARE2
---------- ---------- ---------- ---------- ---------- ---------- ----------------------------------------
     74396     148794          1                     0          1 0D0C009B003C00000000
     74396     148794          2                     0          1 0D0C000108A000000000
     74396     148794          3                     0          1 0D0C0005005A00000000
     74396     148794          4                     0          1 0D0C0303000400000000
     74396     148794          5                     0          1 0D0C01F6000200000000
     74396     148794          6                     0          1 0D0C00E9000100000000
     74396     148794          7                     0          1 0D0C0004017C00000000
     74396     148796          1                     0          1 0D0C00B7003100000000
     74396     148796          2                     0          1 0D0C000107F600000000
     74396     148796          3                     0          1 0D0C001C005B00000000
     74396     148796          4                     0          1 0D0C0303000400000000
     74396     148796          5                     0          1 0D0C01F6000200000000
     74396     148796          6                     0          1 0D0C00E9000100000000
     74396     148796          7                     0          1 0D0C000E00D600000000

我们有一个两个分区的分区表,表中有7个列,因此这里数据字典里面有7*2=14个概要,BO#代表Object_id。GROUP#等同于分区的编号。INTCOL#代表了Column number,和sys.col$.col#里面的值相同,SPLIT,为概要执行的拆分数。SPARE1,如果根据HLL(hyperloglog)算法创建的概要,这里的值应当总是为1。SPARE2,此列包含通过将HyperLogLog或Adaptive Sampling应用于相应列值而生成的散列值(概要值)。

因此,在我们的例子中,我们有概要及其属性:hash值和拆分数。这意味着这些值足以在本地或全局级别计算列的NDV。

在Oracle 12c R2中引入了一些新的函数允许我们通过聚合概要的散列值来计算ndv。
Apx_COUNT_DISTICATE_Detail-返回有关的近似行数的信息。这是一个特殊的格式的blob。
Apx_COUNT_DISTICATH_AGG -此函数基于Apx_COUNT_DISTICATE_Detail的结果创建更高级别的摘要,它允许我们避免基础表的重新扫描以获得新的聚合。
to_approx_count_distinct –此函数以Number的形式返回上述函数的结果。

现在,我们可以检查字典中的列统计信息,并将用上面的函数对概要产生的结果进行对比。

SELECT partition_name, column_name, num_distinct FROM dba_part_col_statistics WHERE owner = 'U1' AND table_name = 'SALES1' ORDER BY 1, 2;

PARTITION_NAME       COLUMN_NAME                    NUM_DISTINCT
-------------------- ------------------------------ ------------
SALES_Q1_1998        AMOUNT_SOLD                             398
SALES_Q1_1998        CHANNEL_ID                                4
SALES_Q1_1998        CUST_ID                                3172
SALES_Q1_1998        PROD_ID                                  60
SALES_Q1_1998        PROMO_ID                                  2
SALES_Q1_1998        QUANTITY_SOLD                             1
SALES_Q1_1998        TIME_ID                                  91
SALES_Q2_1998        AMOUNT_SOLD                             219
SALES_Q2_1998        CHANNEL_ID                                4
SALES_Q2_1998        CUST_ID                                2819
SALES_Q2_1998        PROD_ID                                  49
SALES_Q2_1998        PROMO_ID                                  2
SALES_Q2_1998        QUANTITY_SOLD                             1
SALES_Q2_1998        TIME_ID                                  92

SELECT SUBOBJECT_NAME part_name,name colname, ndv FROM ( SELECT group#,intcol#,to_approx_count_distinct (approx_count_distinct_agg (spare2)) ndv
FROM wri$_optstat_synopsis_head$
WHERE bo# = 74396 GROUP BY group#, intcol#) s,sys.col$ c,dba_objects o WHERE c.obj# = 74396 AND c.col# = s.intcol# AND o.object_id = s.group#/2 ORDER BY 1, 2;

PART_NAME                      COLNAME             NDV
------------------------------ ---------------   -----
SALES_Q1_1998                  AMOUNT_SOLD         398
SALES_Q1_1998                  CHANNEL_ID            4
SALES_Q1_1998                  CUST_ID            3172
SALES_Q1_1998                  PROD_ID              60
SALES_Q1_1998                  PROMO_ID              2
SALES_Q1_1998                  QUANTITY_SOLD         1
SALES_Q1_1998                  TIME_ID              91
SALES_Q2_1998                  AMOUNT_SOLD         219
SALES_Q2_1998                  CHANNEL_ID            4
SALES_Q2_1998                  CUST_ID            2819
SALES_Q2_1998                  PROD_ID              49
SALES_Q2_1998                  PROMO_ID              2
SALES_Q2_1998                  QUANTITY_SOLD         1
SALES_Q2_1998                  TIME_ID              92

正如您所看到的,上述两个查询都返回完全相同的结果。Oracle实际上在上面提到的approx_*函数上实现了HLL算法,并通过将该函数应用于概要数据来派生分区和全局级别的统计信息。
让我们获取表级统计信息,可以看到也是一致的。

SELECT ds.column_name, ds.num_distinct, s.ndv 
FROM ( SELECT intcol#,to_approx_count_distinct (approx_count_distinct_agg (spare2)) ndv
FROM wri$_optstat_synopsis_head$
WHERE bo# = 74396 GROUP BY intcol#) s,sys.col$ c, dba_tab_col_statistics ds
WHERE c.obj# = 74396 AND c.col# = s.intcol# AND ds.table_name = 'SALES1' AND c.name = ds.column_name

COLUMN_NAME         NUM_DISTINCT        NDV
------------------- ------------ ----------
PROD_ID                       60         60
CUST_ID                     4276       4276
TIME_ID                      183        183
CHANNEL_ID                     4          4
PROMO_ID                       2          2
QUANTITY_SOLD                  1          1
AMOUNT_SOLD                  428        428

关注dbDao.com的新浪微博

扫码加入微信Oracle小密圈,了解Oracle最新技术下载分享资源

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569