Oracle 12cR2 Gathering Statistics: approximate_ndv_algorithm and ADAPTIVE SAMPLING (Part 2)

In the previous article we covered the HYPERLOGLOG flavour of the approximate NDV algorithm; in this one we focus on the other option, the ADAPTIVE SAMPLING algorithm.

execute DBMS_STATS.delete_table_stats('U1','SALES1');

BEGIN
  DBMS_STATS.set_table_prefs (ownname => 'U1', tabname => 'sales1', pname => 'approximate_ndv_algorithm', pvalue => 'ADAPTIVE SAMPLING');
  DBMS_STATS.set_global_prefs ('trace', TO_CHAR (2048 + 32768 + 4 + 16));
  DBMS_STATS.gather_table_stats (ownname => 'U1', tabname => 'SALES1');
END;
/
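
Before gathering, it can be helpful to verify that the table-level preference actually took effect. A minimal check with DBMS_STATS.GET_PREFS could look like the query below (the alias ndv_alg is only for illustration); it should return ADAPTIVE SAMPLING if the preference was applied.

SQL> SELECT DBMS_STATS.get_prefs('APPROXIMATE_NDV_ALGORITHM', 'U1', 'SALES1') ndv_alg FROM dual;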

As with the HyperLogLog algorithm, when Oracle computes column statistics with the AS (adaptive sampling) algorithm it first gathers statistics at the partition level, one partition at a time, and then derives the global-level statistics from the synopses.

DBMS_STATS: gather stats on partition SALES_Q2_1998: synopsis not gathered yet; not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1 tabname: SALES1 pname: SALES_Q2_1998 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q2_1998 objn: 74398 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0

This shows that Oracle starts gathering statistics for the SALES_Q2_1998 partition and uses the following SQL statement to compute them.

DBMS_STATS: select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */
to_char(count("PROD_ID")),substrb(dump(min("PROD_ID"),16,0,64),1,240),substrb(dump(max("PROD_ID"),16,0,64),1,240),
to_char(count("CUST_ID")),substrb(dump(min("CUST_ID"),16,0,64),1,240),substrb(dump(max("CUST_ID"),16,0,64),1,240),
to_char(count("TIME_ID")),substrb(dump(min("TIME_ID"),16,0,64),1,240),substrb(dump(max("TIME_ID"),16,0,64),1,240),
to_char(count("CHANNEL_ID")),substrb(dump(min("CHANNEL_ID"),16,0,64),1,240),substrb(dump(max("CHANNEL_ID"),16,0,64),1,240),
to_char(count("PROMO_ID")),substrb(dump(min("PROMO_ID"),16,0,64),1,240),substrb(dump(max("PROMO_ID"),16,0,64),1,240),
to_char(count("QUANTITY_SOLD")),substrb(dump(min("QUANTITY_SOLD"),16,0,64),1,240),substrb(dump(max("QUANTITY_SOLD"),16,0,64),1,240),
to_char(count("AMOUNT_SOLD")),substrb(dump(min("AMOUNT_SOLD"),16,0,64),1,240),substrb(dump(max("AMOUNT_SOLD"),16,0,64),1,240),
count(rowidtochar(rowid))
from "U1"."SALES1" t  where TBL$OR$IDX$PART$NUM("U1"."SALES1",0,4,0,"ROWID") = :objn
/* SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, B74396,U254,U254,U254,U254,U254,U254,U254U*/

Next, the database gathers statistics for the SALES_Q1_1998 partition.

DBMS_STATS: gather stats on partition SALES_Q1_1998: synopsis not gathered yet; not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1 tabname: SALES1 pname: SALES_Q1_1998 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q1_1998 objn: 74397 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0

Finally, Oracle computes the global statistics of the SALES1 table from the information provided by the partition synopses.

DBMS_STATS: Start gather_stats.. pfix:  ownname: U1 tabname: SALES1 pname:  spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: get_agg_colstats: AS only
DBMS_STATS: Deriving Global Histogram for TIME_ID
DBMS_STATS:  derive global freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "TIME_ID"
DBMS_STATS:  number of values = 181, max # of buckects = 254, pct = 100, ssize = 162703360
DBMS_STATS:   range_skew: YES
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 1037
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES1

How does Oracle use the synopsis data to compute the statistics? First, when statistics are gathered with the AS algorithm, the synopsis data is stored in both wri$_optstat_synopsis$ and wri$_optstat_synopsis_head$. Oracle inserts some bookkeeping information into wri$_optstat_synopsis_head$; in particular, the SPLIT column matters. Note that the SPARE1 and SPARE2 columns are empty here, whereas with the HLL algorithm those two columns are populated.

SQL> select * from sys.wri$_optstat_synopsis_head$  where BO#=74396; 

       BO#     GROUP#    INTCOL#  SYNOPSIS#      SPLIT ANALYZETI     SPARE1 SPARE2
---------- ---------- ---------- ---------- ---------- --------- ---------- --------------------
     74396     148794          1                     0 11-JUL-18
     74396     148794          2                     0 11-JUL-18
     74396     148794          3                     0 11-JUL-18
     74396     148794          4                     0 11-JUL-18
     74396     148794          5                     0 11-JUL-18
     74396     148794          6                     0 11-JUL-18
     74396     148794          7                     0 11-JUL-18
     74396     148796          1                     0 11-JUL-18
     74396     148796          2                     0 11-JUL-18
     74396     148796          3                     0 11-JUL-18
     74396     148796          4                     0 11-JUL-18
     74396     148796          5                     0 11-JUL-18
     74396     148796          6                     0 11-JUL-18
     74396     148796          7                     0 11-JUL-18

In addition, the synopsis hash values are stored in the HASHVALUE column of wri$_optstat_synopsis$. So, following the formula NDV = N * power(2, i), computing the NDV requires two things: the number of distinct hash values of the column (N) and the number of splits (i).
First, let us determine the number of splits for each column.

SQL> SELECT (SELECT name FROM sys.col$ WHERE obj# = 74396 AND col# = t.intcol#) column_name, MAX (split) maxsplit FROM sys.wri$_optstat_synopsis_head$ t WHERE t.bo# = 74396 GROUP BY t.intcol#; 

COLUMN_NAME            MAXSPLIT
-------------------- ----------
PROD_ID                       0
QUANTITY_SOLD                 0
CUST_ID                       0
CHANNEL_ID                    0
PROMO_ID                      0
TIME_ID                       0
AMOUNT_SOLD                   0

Next, count the number of distinct hash values per column.

SQL> SELECT (SELECT name FROM sys.col$ WHERE obj# = 74396 AND col# = t.intcol#) column_name, COUNT (DISTINCT (hashvalue)) dhv FROM sys.wri$_optstat_synopsis$ t WHERE bo# = 74396  GROUP BY intcol#;

COLUMN_NAME            DHV
--------------------------
PROD_ID                 60
CUST_ID               4305
TIME_ID                181
CHANNEL_ID               4
PROMO_ID                 2
QUANTITY_SOLD            1
AMOUNT_SOLD            425

Once these two values are known, the NDV follows from the formula NDV = N * power(2, i). Taking PROD_ID as an example: NDV = N * power(2, i) = 60 * power(2, 0) = 60. Checking the data dictionary shows that the result matches. In fact, since none of these columns has any split, the number of distinct hash values of each column is exactly its NDV.

SQL> SELECT column_name, num_distinct FROM dba_tab_col_statistics WHERE table_name='SALES1';

COLUMN_NAME                              NUM_DISTINCT
---------------------------------------- ------------
PROD_ID                                            60
CUST_ID                                          4305
TIME_ID                                           181
CHANNEL_ID                                          4
PROMO_ID                                            2
QUANTITY_SOLD                                       1
AMOUNT_SOLD                                       425
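
Putting the two building blocks together, the per-column NDV can also be derived in a single query. The following is only a sketch based on the dictionary columns used above; it multiplies the distinct hash-value count by power(2, max(split)), which works here because none of the columns was split.

SELECT (SELECT name FROM sys.col$ WHERE obj# = 74396 AND col# = s.intcol#) column_name,
       COUNT (DISTINCT s.hashvalue)
       * POWER (2, (SELECT NVL (MAX (h.split), 0)
                      FROM sys.wri$_optstat_synopsis_head$ h
                     WHERE h.bo# = s.bo# AND h.intcol# = s.intcol#)) derived_ndv
  FROM sys.wri$_optstat_synopsis$ s
 WHERE s.bo# = 74396
 GROUP BY s.bo#, s.intcol#;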

In Oracle 12cR1, a new SQL function for approximating NDV, APPROX_COUNT_DISTINCT, was introduced; it uses the HLL algorithm. In Oracle 12cR2, the HLL algorithm was also implemented inside the DBMS_STATS package for approximate NDV (number of distinct values) statistics. The Oracle database can therefore use either of the two algorithms (ADAPTIVE SAMPLING / HYPERLOGLOG).
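
As a quick illustration of the 12cR1 SQL function (the column choice is arbitrary), APPROX_COUNT_DISTINCT can be called directly in a query:

SQL> SELECT APPROX_COUNT_DISTINCT(prod_id) approx_ndv FROM u1.sales1;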

With ADAPTIVE SAMPLING the synopses are stored in two places (wri$_optstat_synopsis_head$ and wri$_optstat_synopsis$), while with HYPERLOGLOG they are stored only in wri$_optstat_synopsis$. When gathering partition statistics, the setting approximate_ndv_algorithm = 'REPEAT OR HYPERLOGLOG' lets us keep creating synopses with the adaptive sampling method; this is the default. If we change approximate_ndv_algorithm to 'HYPERLOGLOG' while INCREMENTAL_STALENESS is NULL (Oracle 12c introduced a new preference, INCREMENTAL_STALENESS, which controls when partition statistics are considered too stale to be used for deriving global statistics; by default it is NULL), all old-style synopses are deleted and new-style synopses are created for both the existing and the newly added partitions. If approximate_ndv_algorithm = 'HYPERLOGLOG' and INCREMENTAL_STALENESS = ALLOW_MIXED_FORMAT, the old-style synopses are not deleted immediately but are phased out gradually.
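
For reference, switching a table to the HLL format while tolerating mixed-format synopses could be set up roughly as follows (a sketch using the preference names discussed above):

BEGIN
  DBMS_STATS.set_table_prefs ('U1', 'SALES1', 'APPROXIMATE_NDV_ALGORITHM', 'HYPERLOGLOG');
  DBMS_STATS.set_table_prefs ('U1', 'SALES1', 'INCREMENTAL_STALENESS', 'ALLOW_MIXED_FORMAT');
END;
/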

As for performance, there seems to be no noticeable difference between the two statistics-gathering runs, and both algorithms deliver roughly the same accuracy (an error of about 2%). The main difference is the memory required: HLL lets us compute NDV with high accuracy using very little memory (and disk storage).
