Oracle 12cR1-Quick Distinct Count:APPROX_COUNT_DISTINCT

oracle在12cR1版本推出了新的APPROX_COUNT_DISTINCT函数,介绍这个函数之前,我们先要科普一个知识叫NDV。NDV的英文解释就是(Number of distinct values in a column (excluding nulls)。在oracle 11g版本对于统计信息,就是DBA_TAB_COL_STATISTICS中的NUM_DISTINCT值,就是通过NDV的数值就是通过select count(distinct)的方式进行获取的,而这个获取有一个问题,当数据量足够大的时候,它需要先通过排序算法把唯一值先保存在PGA当中,然后消除后续的重复值。当数据量增大,对应的PGA的消耗也对应消耗更大。当达到一定的程度之后,PGA超过限制就需要唯一值保存在TEMP当中。导致性能下降。而在12c推出的新的算法(Approximate NDV),直接通过全表扫描获取精确统计数据。该算法显著的提升了速度,但是精确值确下降了,但是整个下降的一个值对于统计来说足够用了。下面我们来看看这两个的区别。

首先执行标准的count (distinct)的操作,多执行了几次。发现每次需要大概11.21秒完成。产生了634738逻辑读和362026的物理读。

SQL> select count(distinct OBJECT_NAME) from t1;
Elapsed: 00:00:20.04

Execution Plan
----------------------------------------------------------
Plan hash value: 405047221

-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 66 | | 1000 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 66 | | | |
| 2 | VIEW | VW_DAG_0 | 60648 | 3908K| | 1000 (1)| 00:00:01 |
| 3 | HASH GROUP BY | | 60648 | 2072K| 3160K| 1000 (1)| 00:00:01 |
| 4 | TABLE ACCESS FULL| T1 | 72850 | 2489K| | 396 (1)| 00:00:01 |
-----------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
499 recursive calls
0 db block gets
634738 consistent gets
423556 physical reads
0 redo size
562 bytes sent via SQL*Net to client
607 bytes received via SQL*Net from client
2SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processe

通过新的12c的新函数APPROX_COUNT_DISTINCT,多执行几次,发现每次只需要大概9.13秒就完成。产生了634738逻辑读和362026的物理读。

SQL> select APPROX_COUNT_DISTINCT(object_name) from t1;

Elapsed: 00:00:09.13

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 35 | 396 (1)| 00:00:01 |
| 1 | SORT AGGREGATE APPROX| | 1 | 35 | | |
| 2 | TABLE ACCESS FULL | T1 | 72850 | 2489K| 396 (1)| 00:00:01 |
------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
634738 consistent gets
362026 physical reads
0 redo size
570 bytes sent via SQL*Net to client
607 bytes received via SQL*Net from client
2SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

在上述两个结果中,我们发现逻辑读是一样的,物理读是不一致的。在前面我们说过,直接distinct需要把唯一值先保存在PGA当中,然后消除后续的重复值。这里它使用的是hash group by的方式。而我们的新的算法Approximate,直接就是全表扫描。接下来我们继续做10046来对比多出的物理读是怎么回事。

SQL ID: fj2n4mgk7nra8 Plan Hash: 405047221

select count(distinct OBJECT_NAME)
from
t1


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 8.54 19.56 423556 634738 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 8.54 19.56 423556 634738 0 1

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 110
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
111 SORT AGGREGATE (cr=634738 pr=423556 pw=61530 time=19561932 us starts=1)
604806048060480VIEWVW_DAG_0 (cr=634738 pr=423556 pw=61530 time=18078480 us starts=1 cost=1000 size=4002768 card=60648)
604806048060480 HASH GROUP BY (cr=634738 pr=423556 pw=61530 time=18078434 us starts=1 cost=1000 size=2122680 card=60648)
186496001864960018649600TABLE ACCESS FULL T1 (cr=634738 pr=362026 pw=0 time=10396708 us starts=1 cost=396 size=2549750 card=72850)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 20.000.00
reliable message 1 0.00 0.00
PGA memory operation 56 0.00 0.00
asynch descriptor resize 2 0.00 0.00
Disk file operations I/O 2 0.00 0.00
direct path read 5678 0.00 7.36
direct path write temp 8790 0.01 5.40
direct path read temp 8790 0.00 0.10
SQL*Net message from client 20.500.50

SQL ID: gywh3p1rzfccb Plan Hash: 3724264953

select APPROX_COUNT_DISTINCT(object_name)
from
t1


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 4.59 11.23 362026 634738 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 4.59 11.23 362026 634738 0 1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 110
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
111 SORT AGGREGATE APPROX (cr=634738 pr=362026 pw=0 time=11236917 us starts=1)
186496001864960018649600TABLE ACCESS FULL T1 (cr=634738 pr=362026 pw=0 time=12277469 us starts=1 cost=396 size=2549750 card=72850)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 20.000.00
PGA memory operation 2 0.00 0.00
asynch descriptor resize 1 0.00 0.00
Disk file operations I/O 1 0.00 0.00
direct path read 5678 0.00 7.56
SQL*Net message from client 21.261.26

综合上面两个SQL的10046对比,发现”TABLE ACCESS FULL”这个操作产生了pr,也就是物理读是362026。而产生的pw为0,也就是没产生物理写。而第一种为什么产生了423556的逻辑读要高于第二种,那是因为做了HASH GROUP BY操作,内存PGA空间不够,需要放到temp里面临时存储结果,这里就多了pr=423556 pw=61530 。先看这个pw=61530,就是对temp写入的IO,同时写完了还要继续读temp中的结果,就会产生362026+61530,刚刚好是423556大小的物理读。
在等待事件上也可以看到,写入temp的速度较慢
direct path write temp 8790 0.01 5.40
direct path read temp 8790 0.00 0.10
而Approximate算法就是全表扫描的操作,不涉及tmep的读写。所以效率要高很多。

当然最重要的一件事情就是,count(distinct)要比APPROX_COUNT_DISTINCT()准确。如果你是需要准确的值那么distinct是肯定要使用的,但是你只是对一个表进行统计。统计一个大概的数据,不需要太精确,类似于统计信息中的distinct值。使用APPROX_COUNT_DISTINCT就非常好。因为它的执行速度快很多。

SQL> select count(distinct OBJECT_NAME) from t1;

COUNT(DISTINCTOBJECT_NAME)
--------------------------
60480

SQL> select APPROX_COUNT_DISTINCT(object_name) from t1;

APPROX_COUNT_DISTINCT(OBJECT_NAME)
----------------------------------
61136

关注dbDao.com的新浪微博

扫码加入微信Oracle小密圈,了解Oracle最新技术下载分享资源

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569