众所周知,拥有适当的数据库统计信息对于查询优化器至关重要。统计数据应该正确地描述数据库中的数据。有效地收集统计数据并获得正确的统计数据,并不是一个容易的过程。我们先来说说统一信息中的采样。数据库从一个小样本开始,然后在一些分析的基础上确定样本大小是否足够大。因此,数据库首先需要确定最优的样本大小,然后扫描数据库对象以计算对象统计信息。但是,如果一个表很大,那么被认为是最优的样本数量实际上太小了,当然计算出来的统计数据是不准确的。此外,这种方法需要重新扫描所有分区,以便计算分区表的全局对象统计信息。对于非分区表,数据库必须执行排序操作,以便为列生成许多不同的值(NDV)-这是一种资源密集型操作。在Oracle 12c R2中Oracle推出了approximate_ndv_algorithm参数,可以设置hyperloglog、自适应采样等算法。而为了支持这种机制,在数据库中引入了一种称为“概要”的特殊数据结构。它包含有关对象(即表的列、分区)的信息,这些信息有助于有效地近似的获取相关的统计信息。
让我们来初步了解一些,概要和approximate_ndv_algorithm中的hyperloglog算法
CREATE TABLE u1.sales1 (prod_id NUMBER NOT NULL, cust_id NUMBER NOT NULL, time_id DATE NOT NULL, channel_id NUMBER NOT NULL, promo_id NUMBER NOT NULL, quantity_sold NUMBER(10,2) NOT NULL, amount_sold NUMBER(10,2) NOT NULL) PARTITION BY RANGE (TIME_ID) ( PARTITION sales_q1_1998 VALUES LESS THAN (TO_DATE(' 1998-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')), PARTITION sales_q2_1998 VALUES LESS THAN (TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) ) / 多执行几次,创建一张较大的表。 insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from sh.sales where time_id<=to_date('6/30/1998','mm/dd/yyyy'); insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from u1.sales1 where time_id<=to_date('6/30/1998','mm/dd/yyyy')); insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from u1.sales1 where time_id<=to_date('6/30/1998','mm/dd/yyyy')); insert /*+ append parallel(4)*/into u1.sales1 select * from (select * from u1.sales1 where time_id<=to_date('6/30/1998','mm/dd/yyyy')); SQL> select segment_name,bytes/1024/1024/1024 from dba_segments where segment_name='SALES1'; SEGMENT_NAME BYTES/1024/1024/1024 ------------------------------ -------------------- SALES1 3.296875 SALES1 2.74902344
收集表的统计信息,收集方式使用approximate_ndv_algorithm中的hyperloglog算法。同时设置增量收集。
BEGIN DBMS_STATS.set_table_prefs (ownname => 'u1',tabname => 'sales1',pname => 'approximate_ndv_algorithm',pvalue => 'hyperloglog'); DBMS_STATS.set_table_prefs ('u1','sales1','INCREMENTAL','TRUE'); END;
统计信息进行收集,同时打开trace进行跟踪
BEGIN DBMS_STATS.set_global_prefs ('trace', TO_CHAR (2048 + 32768 + 4 + 16)); DBMS_STATS.gather_table_stats (ownname => 'u1', tabname => 'SALES1'); END;
从trace中可以看到。Oracle开始为表的一个分区(SALES_Q2_1998)收集统计信息。而收集的方式是HLL(incremental)的方式。
DBMS_STATS: gather stats on partition SALES_Q2_1998: synopsis not gathered yet; not analyzed yet; DBMS_STATS: Start gather_stats.. pfix: ownname: U1 tabname: SALES1 pname: SALES_Q2_1998 spname: execution phase: 1 DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental) DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q2_1998 objn: 74398 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0 DBMS_STATS: no AS synopses to delete for #74396 DBMS_STATS: Using approximate NDV pct=0 DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME DBMS_STATS: Y Y Y PROD_ID DBMS_STATS: Y Y Y CUST_ID DBMS_STATS: Y Y Y Y Y TIME_ID DBMS_STATS: Y Y Y CHANNEL_ID DBMS_STATS: Y Y Y PROMO_ID DBMS_STATS: Y Y Y QUANTITY_SOLD DBMS_STATS: Y Y Y AMOUNT_SOLD DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental) DBMS_STATS: Approximate NDV Options DBMS_STATS: SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U DBMS_STATS: Starting query at 26-JUN-18 01.29.25.406113000 AM +08:00 DBMS_STATS: select /*+ full(t) no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */to_ch ar(count("PROD_ID")),substrb(dump(min("PROD_ID"),16,0,64),1,240),substrb(dump(max("PROD_ID"),16,0,64),1,240),to_char(count("CUST_ID")),substrb(dump(min("CUST_ID"),16,0,64),1,240),substrb(dump(max("CUST_ID"),16,0,64),1,240),to_char(count("TIME_ID" )),substrb(dump(min("TIME_ID"),16,0,64),1,240),substrb(dump(max("TIME_ID"),16,0,64),1,240),to_char(count("CHANNEL_ID")),substrb(dump(min("CHANNEL_ID"),16,0,64),1,240),substrb(dump(max("CHANNEL_ID"),16,0,64),1,240),to_char(count("PROMO_ID")),subst rb(dump(min("PROMO_ID"),16,0,64),1,240),substrb(dump(max("PROMO_ID"),16,0,64),1,240),to_char(count("QUANTITY_SOLD")),substrb(dump(min("QUANTITY_SOLD"),16,0,64),1,240),substrb(dump(max("QUANTITY_SOLD"),16,0,64),1,240),to_char(count("AMOUNT_SOLD")) ,substrb(dump(min("AMOUNT_SOLD"),16,0,64),1,240),substrb(dump(max("AMOUNT_SOLD"),16,0,64),1,240),count(rowidtochar(rowid)) from "U1"."SALES1" t where TBL$OR$IDX$PART$NUM("U1"."SALES1",0,4,0,"ROWID") = :objn /* SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,N IL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U*/
使用同样的方法,数据库计算第二个分区SALES_Q1_1998的列统计信息。
DBMS_STATS: gather stats on partition SALES_Q1_1998: synopsis not gathered yet; not analyzed yet; DBMS_STATS: Start gather_stats.. pfix: ownname: U1 tabname: SALES1 pname: SALES_Q1_1998 spname: execution phase: 1 DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental) DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q1_1998 objn: 74397 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0 DBMS_STATS: delete synopses of a single partition DBMS_STATS: Starting query at 26-JUN-18 01.29.48.878218000 AM +08:00 DBMS_STATS: select /*+ OPT_PARAM('_parallel_syspls_obey_force' 'false') */ 1 from dual where exists (select null from wri$_optstat_synopsis_head$ h, table(:clist_syn) c where h.bo# = :bo# and h.intcol# = c.column_value and h.analyzetime is not null and h.spare1 is null and group# in (:singlegroup) ) DBMS_STATS: no AS synopses to delete for #74396 DBMS_STATS: Using approximate NDV pct=0 DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME DBMS_STATS: Y Y Y PROD_ID DBMS_STATS: Y Y Y CUST_ID DBMS_STATS: Y Y Y Y Y TIME_ID DBMS_STATS: Y Y Y CHANNEL_ID DBMS_STATS: Y Y Y PROMO_ID DBMS_STATS: Y Y Y QUANTITY_SOLD DBMS_STATS: Y Y Y AMOUNT_SOLD DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental) DBMS_STATS: Approximate NDV Options DBMS_STATS: SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U DBMS_STATS: Starting query at 26-JUN-18 01.29.48.884823000 AM +08:00 DBMS_STATS: select /*+ full(t) no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */to_ch ar(count("PROD_ID")),substrb(dump(min("PROD_ID"),16,0,64),1,240),substrb(dump(max("PROD_ID"),16,0,64),1,240),to_char(count("CUST_ID")),substrb(dump(min("CUST_ID"),16,0,64),1,240),substrb(dump(max("CUST_ID"),16,0,64),1,240),to_char(count("TIME_ID" )),substrb(dump(min("TIME_ID"),16,0,64),1,240),substrb(dump(max("TIME_ID"),16,0,64),1,240),to_char(count("CHANNEL_ID")),substrb(dump(min("CHANNEL_ID"),16,0,64),1,240),substrb(dump(max("CHANNEL_ID"),16,0,64),1,240),to_char(count("PROMO_ID")),subst rb(dump(min("PROMO_ID"),16,0,64),1,240),substrb(dump(max("PROMO_ID"),16,0,64),1,240),to_char(count("QUANTITY_SOLD")),substrb(dump(min("QUANTITY_SOLD"),16,0,64),1,240),substrb(dump(max("QUANTITY_SOLD"),16,0,64),1,240),to_char(count("AMOUNT_SOLD")) ,substrb(dump(min("AMOUNT_SOLD"),16,0,64),1,240),substrb(dump(max("AMOUNT_SOLD"),16,0,64),1,240),count(rowidtochar(rowid)) from "U1"."SALES1" t where TBL$OR$IDX$PART$NUM("U1"."SALES1",0,4,0,"ROWID") = :objn /* SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,N IL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID, NDV_HLL, B74396,U254,U254,U254,U254,U254,U254,U254U*/
最后,Oracle通过agg计算了已分区表的全局统计信息。 重新计算分区级别的统计信息。oracle通过合并概要实现了这一点。我们可以从跟踪文件中清楚地看到它。
DBMS_STATS: prepare reporting structures...
DBMS_STATS: reporting_man_update_task: objn: 74397 auto_stats: FALSE status: COMPLETED ctx.batching_coeff: 0
DBMS_STATS: Start gather_stats.. pfix: ownname: U1 tabname: SALES1 pname: spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only
DBMS_STATS: Deriving Global Histogram for TIME_ID
DBMS_STATS: derive global freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "TIME_ID"
DBMS_STATS: number of values = 181, max # of buckects = 254, pct = 100, ssize = 162703360
DBMS_STATS: range_skew: YES
DBMS_STATS: csr.hreq: 0 Histogram gathering flags: 1037
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES1 ===========>从表SALES1的分区概要和统计信息中派生出全局的统计信息。
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
因此,如果表分区有统计信息,那么通过合并适当分区的概要就可以计算出全局统计信息。从Oracle 12c R2开始,synopsis(它是基于HLL算法创建)存储在WRI$_OPTSTAT_SYNOPSIS_HEAD$表中。
SQL> SELECT bo#,group#,intcol#,synopsis#,split,spare1,DBMS_LOB.SUBSTR (spare2, 10) spare2 FROM wri$_optstat_synopsis_head$ 2 WHERE bo# IN (SELECT object_id FROM dba_objects WHERE object_name = 'SALES1'); BO# GROUP# INTCOL# SYNOPSIS# SPLIT SPARE1 SPARE2 ---------- ---------- ---------- ---------- ---------- ---------- ---------------------------------------- 74396 148794 1 0 1 0D0C009B003C00000000 74396 148794 2 0 1 0D0C000108A000000000 74396 148794 3 0 1 0D0C0005005A00000000 74396 148794 4 0 1 0D0C0303000400000000 74396 148794 5 0 1 0D0C01F6000200000000 74396 148794 6 0 1 0D0C00E9000100000000 74396 148794 7 0 1 0D0C0004017C00000000 74396 148796 1 0 1 0D0C00B7003100000000 74396 148796 2 0 1 0D0C000107F600000000 74396 148796 3 0 1 0D0C001C005B00000000 74396 148796 4 0 1 0D0C0303000400000000 74396 148796 5 0 1 0D0C01F6000200000000 74396 148796 6 0 1 0D0C00E9000100000000 74396 148796 7 0 1 0D0C000E00D600000000
我们有一个两个分区的分区表,表中有7个列,因此这里数据字典里面有7*2=14个概要,BO#代表Object_id。GROUP#等同于分区的编号。INTCOL#代表了Column number,和sys.col$.col#里面的值相同,SPLIT,为概要执行的拆分数。SPARE1,如果根据HLL(hyperloglog)算法创建的概要,这里的值应当总是为1。SPARE2,此列包含通过将HyperLogLog或Adaptive Sampling应用于相应列值而生成的散列值(概要值)。
因此,在我们的例子中,我们有概要及其属性:hash值和拆分数。这意味着这些值足以在本地或全局级别计算列的NDV。
在Oracle 12c R2中引入了一些新的函数允许我们通过聚合概要的散列值来计算ndv。
Apx_COUNT_DISTICATE_Detail-返回有关的近似行数的信息。这是一个特殊的格式的blob。
Apx_COUNT_DISTICATH_AGG -此函数基于Apx_COUNT_DISTICATE_Detail的结果创建更高级别的摘要,它允许我们避免基础表的重新扫描以获得新的聚合。
to_approx_count_distinct –此函数以Number的形式返回上述函数的结果。
现在,我们可以检查字典中的列统计信息,并将用上面的函数对概要产生的结果进行对比。
SELECT partition_name, column_name, num_distinct FROM dba_part_col_statistics WHERE owner = 'U1' AND table_name = 'SALES1' ORDER BY 1, 2; PARTITION_NAME COLUMN_NAME NUM_DISTINCT -------------------- ------------------------------ ------------ SALES_Q1_1998 AMOUNT_SOLD 398 SALES_Q1_1998 CHANNEL_ID 4 SALES_Q1_1998 CUST_ID 3172 SALES_Q1_1998 PROD_ID 60 SALES_Q1_1998 PROMO_ID 2 SALES_Q1_1998 QUANTITY_SOLD 1 SALES_Q1_1998 TIME_ID 91 SALES_Q2_1998 AMOUNT_SOLD 219 SALES_Q2_1998 CHANNEL_ID 4 SALES_Q2_1998 CUST_ID 2819 SALES_Q2_1998 PROD_ID 49 SALES_Q2_1998 PROMO_ID 2 SALES_Q2_1998 QUANTITY_SOLD 1 SALES_Q2_1998 TIME_ID 92 SELECT SUBOBJECT_NAME part_name,name colname, ndv FROM ( SELECT group#,intcol#,to_approx_count_distinct (approx_count_distinct_agg (spare2)) ndv FROM wri$_optstat_synopsis_head$ WHERE bo# = 74396 GROUP BY group#, intcol#) s,sys.col$ c,dba_objects o WHERE c.obj# = 74396 AND c.col# = s.intcol# AND o.object_id = s.group#/2 ORDER BY 1, 2; PART_NAME COLNAME NDV ------------------------------ --------------- ----- SALES_Q1_1998 AMOUNT_SOLD 398 SALES_Q1_1998 CHANNEL_ID 4 SALES_Q1_1998 CUST_ID 3172 SALES_Q1_1998 PROD_ID 60 SALES_Q1_1998 PROMO_ID 2 SALES_Q1_1998 QUANTITY_SOLD 1 SALES_Q1_1998 TIME_ID 91 SALES_Q2_1998 AMOUNT_SOLD 219 SALES_Q2_1998 CHANNEL_ID 4 SALES_Q2_1998 CUST_ID 2819 SALES_Q2_1998 PROD_ID 49 SALES_Q2_1998 PROMO_ID 2 SALES_Q2_1998 QUANTITY_SOLD 1 SALES_Q2_1998 TIME_ID 92
正如您所看到的,上述两个查询都返回完全相同的结果。Oracle实际上在上面提到的approx_*函数上实现了HLL算法,并通过将该函数应用于概要数据来派生分区和全局级别的统计信息。
让我们获取表级统计信息,可以看到也是一致的。
SELECT ds.column_name, ds.num_distinct, s.ndv FROM ( SELECT intcol#,to_approx_count_distinct (approx_count_distinct_agg (spare2)) ndv FROM wri$_optstat_synopsis_head$ WHERE bo# = 74396 GROUP BY intcol#) s,sys.col$ c, dba_tab_col_statistics ds WHERE c.obj# = 74396 AND c.col# = s.intcol# AND ds.table_name = 'SALES1' AND c.name = ds.column_name COLUMN_NAME NUM_DISTINCT NDV ------------------- ------------ ---------- PROD_ID 60 60 CUST_ID 4276 4276 TIME_ID 183 183 CHANNEL_ID 4 4 PROMO_ID 2 2 QUANTITY_SOLD 1 1 AMOUNT_SOLD 428 428