Several Ways to Check Whether the Data in Two Oracle Tables Is Consistent

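About two weeks ago, during a night shift, a developer needed to check whether two tables replicated by SharePlex (a data replication product) held the same data on the source and on the target. Thinking about it afterwards, SharePlex itself should provide this kind of data verification, but I still wanted to work out, from the database side, several usable ways to compare data between tables that share the same structure.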

 

Note the preconditions under which the following comparison methods apply:

 

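1. The tables being compared have the same structure.
2. The data in the source and target tables is static during the comparison, with no DML modifying it.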
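
Prerequisite 1 can be checked from the data dictionary before any data is compared. The following is a minimal sketch, not part of the original test: it assumes the owner MACLEAN, the tables TEST1 and TEST2, and the database link MACLEAN used in the examples of this article, and it assumes the connecting user can read all_tab_columns on both sides. It lists column definitions that exist on one side only; an empty result suggests the structures match.

```sql
-- Sketch: compare column definitions of the local and the remote table.
-- Owner, table names and the MACLEAN dblink are assumptions taken from this article's example.
select 'local only' as side, column_name, data_type, data_length, nullable
  from (select column_name, data_type, data_length, nullable
          from all_tab_columns
         where owner = 'MACLEAN' and table_name = 'TEST1'
        minus
        select column_name, data_type, data_length, nullable
          from all_tab_columns@maclean
         where owner = 'MACLEAN' and table_name = 'TEST2')
union all
select 'remote only', column_name, data_type, data_length, nullable
  from (select column_name, data_type, data_length, nullable
          from all_tab_columns@maclean
         where owner = 'MACLEAN' and table_name = 'TEST2'
        minus
        select column_name, data_type, data_length, nullable
          from all_tab_columns
         where owner = 'MACLEAN' and table_name = 'TEST1');
```

Column order (column_id) is deliberately left out, since the methods below that name columns explicitly do not depend on it.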

 

Method 1:

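If one of the databases involved is 11g and the table has a primary key index or a unique index on NOT NULL columns (unique key & not null), then congratulations! You can use DBMS_COMPARISON, the PL/SQL package introduced in 11g specifically for data comparison, to verify the data, as demonstrated below: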

 

 

The source is 11gR2:

conn maclean/maclean
SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn  & www.askmac.cn

 drop table test1;
 create table test1 tablespace users as select object_id t1,object_name t2 from dba_objects where object_id is not null;
 alter table test1 add primary key(t1);
 exec dbms_stats.gather_table_stats('MACLEAN','TEST1',cascade=>TRUE);

create database link maclean connect to maclean identified by maclean using 'G10R21';
Database link created.

 

The source database above is version 11.2.0.3, the source table is test1(t1 number primary key, t2 varchar2(128)), and it connects through a dblink to the target, which is version 10.2.0.1.

 

conn maclean/maclean

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

create table test2 tablespace users as select object_id t1,object_name t2
from dba_objects where object_id is not null;
alter table test2 add primary key(t1);
exec dbms_stats.gather_table_stats('MACLEAN','TEST2',cascade=>TRUE);

 

The target is version 10.2.0.1, and its table is test2(t1 number primary key, t2 varchar2(128)).

Note that both tables must have the same primary key index or pseudo primary key index (a pseudo primary key is a unique key whose member columns are all NOT NULL).
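
Whether a table has a usable key can be checked in the data dictionary. The query below is a minimal sketch, assuming the TEST1 table from this example and a key that was created as a constraint; a unique index created without a constraint would have to be looked up in user_indexes instead. Run it on each side with the respective table name.

```sql
-- Sketch: list primary key (P) and unique (U) constraints with their columns.
-- The NULLABLE column shows whether a unique key qualifies as a pseudo primary key.
select c.constraint_name,
       c.constraint_type,
       cc.column_name,
       cc.position,
       tc.nullable
  from user_constraints c
  join user_cons_columns cc
    on cc.constraint_name = c.constraint_name
  join user_tab_columns tc
    on tc.table_name = cc.table_name
   and tc.column_name = cc.column_name
 where c.table_name = 'TEST1'
   and c.constraint_type in ('P', 'U')
 order by c.constraint_name, cc.position;
```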

Create the comparison object and run the verification:

 

begin
  dbms_comparison.create_comparison(comparison_name    => 'MACLEAN_TEST_COM',
                                    schema_name        => 'MACLEAN',
                                    object_name        => 'TEST1',
                                    dblink_name        => 'MACLEAN',
                                    remote_schema_name => 'MACLEAN',
                                    remote_object_name => 'TEST2',
                                    scan_mode          => dbms_comparison.CMP_SCAN_MODE_FULL);
end;
/

PL/SQL procedure successfully completed.

SQL> set linesize 80 pagesize 1400

SQL> select * from user_comparison where comparison_name='MACLEAN_TEST_COM';

COMPARISON_NAME                COMPA SCHEMA_NAME
------------------------------ ----- ------------------------------
OBJECT_NAME                    OBJECT_TYPE       REMOTE_SCHEMA_NAME
------------------------------ ----------------- ------------------------------
REMOTE_OBJECT_NAME             REMOTE_OBJECT_TYP
------------------------------ -----------------
DBLINK_NAME
--------------------------------------------------------------------------------
SCAN_MODE SCAN_PERCENT
--------- ------------
CYCLIC_INDEX_VALUE
--------------------------------------------------------------------------------
NULL_VALUE
--------------------------------------------------------------------------------
LOCAL_CONVERGE_TAG
--------------------------------------------------------------------------------
REMOTE_CONVERGE_TAG
--------------------------------------------------------------------------------
MAX_NUM_BUCKETS MIN_ROWS_IN_BUCKET
--------------- ------------------
LAST_UPDATE_TIME
---------------------------------------------------------------------------
MACLEAN_TEST_COM               TABLE MACLEAN
TEST1                          TABLE             MACLEAN
TEST2                          TABLE
MACLEAN
FULL

ORA$STREAMS$NV

           1000              10000
20-DEC-11 01.08.44.562092 PM

 

After dbms_comparison.create_comparison runs, the new comparison appears in the user_comparison view.

The comparison has now been created, but no actual verification has happened yet. We use event 10046 to trace the comparison process:

 

conn maclean/maclean
set timing on;
alter system flush shared_pool;

alter session set events '10046 trace name context forever,level 8';

set serveroutput on

DECLARE
  retval dbms_comparison.comparison_type;
BEGIN
  IF dbms_comparison.compare('MACLEAN_TEST_COM', retval, perform_row_dif => TRUE) THEN
    dbms_output.put_line('No Differences');
  ELSE
    dbms_output.put_line('Differences Found');
  END IF;
END;
/

Differences Found           =====> The result is Differences Found, which means the data in the two tables is not consistent

PL/SQL procedure successfully completed.

Elapsed: 00:00:10.87

===========================10046 tkprof result =========================

SELECT MIN("T1"), MAX("T1")
FROM
 "MACLEAN"."TEST1"

SELECT MIN("T1"), MAX("T1")
FROM
 "MACLEAN"."TEST2"@MACLEAN

SELECT COUNT(1)
FROM
 "MACLEAN"."TEST1" s WHERE ("T1" >= :scan_min AND "T1" <= :scan_max )

SELECT COUNT(1)
FROM
 "MACLEAN"."TEST2"@MACLEAN s WHERE ("T1" >= :scan_min AND "T1" <= :scan_max )

SELECT q.wb1, min(q."T1") min_range1, max(q."T1") max_range1, count(*)
  num_rows, sum(q.s_hash) sum_range_hash
FROM
 (SELECT /*+ FULL(s) */  width_bucket(s."T1", :scan_min1, :scan_max_inc1,
  :num_buckets) wb1, s."T1", ora_hash(NVL(to_char(s."T1"), 'ORA$STREAMS$NV'),
  4294967295, ora_hash(NVL((s."T2"), 'ORA$STREAMS$NV'), 4294967295, 0))
  s_hash FROM "MACLEAN"."TEST1" s WHERE (s."T1">=:scan_min1 AND s."T1"<=
  :scan_max1) ) q GROUP BY q.wb1 ORDER BY q.wb1

SELECT /*+ REMOTE_MAPPED */ q.wb1, min(q."T1") min_range1, max(q."T1")
  max_range1, count(*) num_rows, sum(q.s_hash) sum_range_hash
FROM
 (SELECT /*+ FULL(s) REMOTE_MAPPED */  width_bucket(s."T1", :scan_min1,
  :scan_max_inc1, :num_buckets) wb1, s."T1", ora_hash(NVL(to_char(s."T1"),
  'ORA$STREAMS$NV'), 4294967295, ora_hash(NVL((s."T2"), 'ORA$STREAMS$NV'),
  4294967295, 0)) s_hash FROM "MACLEAN"."TEST2"@MACLEAN s WHERE (s."T1">=
  :scan_min1 AND s."T1"<=:scan_max1) ) q GROUP BY q.wb1 ORDER BY q.wb1

SELECT /*+ FULL(P) +*/ * FROM "MACLEAN"."TEST2" P

SELECT /*+ FULL ("A1") */
 WIDTH_BUCKET("A1"."T1", :SCAN_MIN1, :SCAN_MAX_INC1, :NUM_BUCKETS),
 MIN("A1"."T1"),
 MAX("A1"."T1"),
 COUNT(*),
 SUM(ORA_HASH(NVL(TO_CHAR("A1"."T1"), 'ORA$STREAMS$NV'),
              4294967295,
              ORA_HASH(NVL("A1"."T2", 'ORA$STREAMS$NV'), 4294967295, 0)))
  FROM "MACLEAN"."TEST2" "A1"
 WHERE "A1"."T1" >= :SCAN_MIN1
   AND "A1"."T1" <= :SCAN_MAX1
 GROUP BY WIDTH_BUCKET("A1"."T1", :SCAN_MIN1, :SCAN_MAX_INC1, :NUM_BUCKETS)
 ORDER BY WIDTH_BUCKET("A1"."T1", :SCAN_MIN1, :SCAN_MAX_INC1, :NUM_BUCKETS)

SELECT ROWID, "T1", "T2"
  FROM "MACLEAN"."TEST2" "R"
 WHERE "T1" >= :1
   AND "T1" <= :2

--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |   126 |  3528 |     4   (0)| 00:00:01 |
|*  1 |  FILTER                      |             |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| TEST2       |   126 |  3528 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | SYS_C006255 |   227 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(TO_NUMBER(:1)<=TO_NUMBER(:2))
   3 - access("T1">=TO_NUMBER(:1) AND "T1"<=TO_NUMBER(:2))

SELECT ll.l_rowid, rr.r_rowid, NVL(ll."T1", rr."T1") idx_val
FROM
 (SELECT l.rowid l_rowid, l."T1", ora_hash(NVL(to_char(l."T1"),
  'ORA$STREAMS$NV'), 4294967295, ora_hash(NVL((l."T2"), 'ORA$STREAMS$NV'),
  4294967295, 0)) l_hash  FROM "MACLEAN"."TEST1" l WHERE l."T1">=:scan_min1
  AND l."T1"<=:scan_max1 ) ll FULL OUTER JOIN (SELECT /*+ NO_MERGE
  REMOTE_MAPPED */ r.rowid r_rowid, r."T1", ora_hash(NVL(to_char(r."T1"),
  'ORA$STREAMS$NV'), 4294967295, ora_hash(NVL((r."T2"), 'ORA$STREAMS$NV'),
  4294967295, 0)) r_hash FROM "MACLEAN"."TEST2"@MACLEAN r WHERE r."T1">=
  :scan_min1  AND r."T1"<=:scan_max1 ) rr ON  ll."T1"=rr."T1" WHERE ll.l_hash
  IS NULL OR rr.r_hash IS NULL OR ll.l_hash <> rr.r_hash

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |              |   190 |   754K|     9  (12)| 00:00:01 |        |      |
|*  1 |  VIEW                           | VW_FOJ_0     |   190 |   754K|     9  (12)| 00:00:01 |        |      |
|*  2 |   HASH JOIN FULL OUTER          |              |   190 |   754K|     9  (12)| 00:00:01 |        |      |
|   3 |    VIEW                         |              |   190 |  7220 |     4   (0)| 00:00:01 |        |      |
|*  4 |     FILTER                      |              |       |       |            |          |        |      |
|   5 |      TABLE ACCESS BY INDEX ROWID| TEST1        |   190 |  5510 |     4   (0)| 00:00:01 |        |      |
|*  6 |       INDEX RANGE SCAN          | SYS_C0013098 |   341 |       |     2   (0)| 00:00:01 |        |      |
|   7 |    VIEW                         |              |   126 |   495K|     4   (0)| 00:00:01 |        |      |
|   8 |     REMOTE                      | TEST2        |   126 |  3528 |     4   (0)| 00:00:01 | MACLE~ | R->S |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("LL"."L_HASH" IS NULL OR "RR"."R_HASH" IS NULL OR "LL"."L_HASH"<>"RR"."R_HASH")
   2 - access("LL"."T1"="RR"."T1")
   4 - filter(TO_NUMBER(:SCAN_MIN1)<=TO_NUMBER(:SCAN_MAX1))
   6 - access("L"."T1">=TO_NUMBER(:SCAN_MIN1) AND "L"."T1"<=TO_NUMBER(:SCAN_MAX1))

Remote SQL Information (identified by operation id):
----------------------------------------------------

   8 - SELECT ROWID,"T1","T2" FROM "MACLEAN"."TEST2" "R" WHERE "T1">=:1 AND "T1"<=:2 (accessing
       'MACLEAN' )

 

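As shown above, the process does not avoid full table scans (TABLE ACCESS FULL) of TEST1 and TEST2, but fortunately only the small amount of data fetched through the index actually takes part in the HASH JOIN FULL OUTER, so the efficiency is still quite good.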

 

In addition, user_comparison_row_dif shows which rows actually differ, for example:

 

SQL> set linesize 80 pagesize 1400
SQL> select *
  2    from user_comparison_row_dif
  3   where comparison_name = 'MACLEAN_TEST_COM'
  4     and rownum < 2;

COMPARISON_NAME                   SCAN_ID LOCAL_ROWID        REMOTE_ROWID
------------------------------ ---------- ------------------ ------------------
INDEX_VALUE
--------------------------------------------------------------------------------
STA LAST_UPDATE_TIME
--- ---------------------------------------------------------------------------
MACLEAN_TEST_COM                       42 AAATWGAAEAAANBrAAB AAANJrAAEAAB8AMAAd
46
DIF 20-DEC-11 01.18.08.917257 PM
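
To go from the recorded rowids back to the data, the difference view can be joined to the local table. The following is a sketch under assumptions: it uses the MACLEAN_TEST_COM comparison and the TEST1 table from this example, and the column list of user_comparison_scan (status, current_dif_count, count_rows) is written from memory of the 11.2 dictionary, so verify it with DESC before relying on it.

```sql
-- Sketch: number of differences recorded for each scan of the comparison.
-- Column names of USER_COMPARISON_SCAN are an assumption; check them with DESC.
select scan_id, status, current_dif_count, count_rows
  from user_comparison_scan
 where comparison_name = 'MACLEAN_TEST_COM'
 order by scan_id;

-- Sketch: show the local row behind each recorded difference.
-- LOCAL_ROWID is null when the row exists only on the remote side,
-- so the outer join keeps those differences with empty T1/T2.
select d.index_value, d.status, t.t1, t.t2
  from user_comparison_row_dif d
  left join test1 t
    on t.rowid = d.local_rowid
 where d.comparison_name = 'MACLEAN_TEST_COM'
   and d.status = 'DIF'
 order by d.index_value;
```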

 

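The above used the dbms_comparison package to complete a simple data comparison. This method applies to 11g and later, requires a primary key index or a unique index on NOT NULL columns, and does not support comparing columns of the following data types: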

  •     LONG
  •     LONG RAW
  •     ROWID
  •     UROWID
  •     CLOB
  •     NCLOB
  •     BLOB
  •     BFILE
  •     User-defined types (including object types, REFs, varrays, and nested tables)
  •     Oracle-supplied types (including any types, XML types, spatial types, and media types)

 

 

 

To compare tables that contain columns of these types, specify the column_list parameter in create_comparison to exclude those columns.
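
A minimal sketch of such a call follows. The DOCS table and its columns are hypothetical: ID is assumed to be the primary key and BODY a CLOB that is left out of the comparison. Only the column_list argument differs from the call shown earlier, and the list has to include the index columns used for the comparison.

```sql
-- Sketch: DOCS and its columns are hypothetical; BODY (a CLOB) is excluded
-- simply by not naming it in column_list. ID is assumed to be the primary key.
begin
  dbms_comparison.create_comparison(comparison_name    => 'MACLEAN_DOCS_COM',
                                    schema_name        => 'MACLEAN',
                                    object_name        => 'DOCS',
                                    dblink_name        => 'MACLEAN',
                                    remote_schema_name => 'MACLEAN',
                                    remote_object_name => 'DOCS',
                                    column_list        => 'ID,TITLE,CREATED',
                                    scan_mode          => dbms_comparison.CMP_SCAN_MODE_FULL);
end;
/
```

The excluded columns are then simply not verified, so a match reported by this comparison says nothing about the BODY values.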

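The advantage of Method 1, dbms_comparison, is that it provides detailed comparison information and is efficient when suitable indexes exist.
The drawbacks are the database version requirement (at least 11gR1) and the lack of support for LONG, CLOB and similar columns.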

 

Method 2:

Compare data with a MINUS query

This is arguably the simplest method to run, for example:

 

select * from test1 minus select * from test2@maclean;

-----------------------------------------------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       | 75816 |  3527K|       |  1163  (40)| 00:00:14 |        |      |
|   1 |  MINUS              |       |       |       |       |            |          |        |      |
|   2 |   SORT UNIQUE       |       | 75816 |  2147K|  2984K|   710   (1)| 00:00:09 |        |      |
|   3 |    TABLE ACCESS FULL| TEST1 | 75816 |  2147K|       |   104   (1)| 00:00:02 |        |      |
|   4 |   SORT UNIQUE       |       | 50467 |  1379K|  1800K|   453   (1)| 00:00:06 |        |      |
|   5 |    REMOTE           | TEST2 | 50467 |  1379K|       |    56   (0)| 00:00:01 | MACLE~ | R->S |
-----------------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   5 - SELECT "T1","T2" FROM "TEST2" "TEST2" (accessing 'MACLEAN' )

Select *
  from (select 'MACLEAN.TEST1' "Row Source", a.*
          from (select /*+ FULL(Tbl1)  */
                 T1, T2
                  from MACLEAN.TEST1 Tbl1
                minus
                select /*+ FULL(Tbl2)  */
                 T1, T2
                  from MACLEAN.TEST2@"MACLEAN" Tbl2) A
        union all
        select 'MACLEAN.TEST2@"MACLEAN"', b.*
          from (select /*+ FULL(Tbl2)  */
                 T1, T2
                  from MACLEAN.TEST2@"MACLEAN" Tbl2
                minus
                select /*+ FULL(Tbl1)  */
                 T1, T2
                  from MACLEAN.TEST1 Tbl1) B) Order by 1;

 

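The MINUS clause causes both tables to be fully scanned (TABLE ACCESS FULL), with the remote rows shipped to the local side, and it requires a SORT. If the tables hold a large amount of data, the sort becomes very expensive, so this method is not efficient.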

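The advantage of Method 2, MINUS, is that it is easy to run and particularly suitable for verifying data between small tables.
The drawback is that the SORT can make it very slow on large data volumes, and it likewise does not support large objects such as LOB and LONG.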
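
One further caveat: MINUS is a set operation, so duplicate rows collapse and a row that appears twice on one side and once on the other is not reported. That cannot happen in this example because T1 is a primary key, but for tables without a unique key a counting variant can be used. The following is a sketch under that assumption, reusing the TEST1 and TEST2 tables and the MACLEAN dblink from above; its execution plan was not measured here.

```sql
-- Sketch: tag each row with its origin, then compare per-value row counts.
-- Rows whose counts differ between the two sides are reported with both counts.
select t1,
       t2,
       sum(in_local)  as local_count,
       sum(in_remote) as remote_count
  from (select t1, t2, 1 as in_local, 0 as in_remote from test1
        union all
        select t1, t2, 0, 1 from test2@maclean)
 group by t1, t2
having sum(in_local) <> sum(in_remote);
```

Each table is read once, and a single result set covers both directions of the difference.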

 

Method 3:

Use a NOT EXISTS clause, for example:

 

select *
  from test1 a
 where not exists (select 1
          from test2 b
         where a.t1 = b.t1
           and a.t2 = b.t2);

no rows selected

Elapsed: 00:00:00.06

------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       | 75816 |  7996K|       |   691   (1)| 00:00:09 |
|*  1 |  HASH JOIN ANTI    |       | 75816 |  7996K|  3040K|   691   (1)| 00:00:09 |
|   2 |   TABLE ACCESS FULL| TEST1 | 75816 |  2147K|       |   104   (1)| 00:00:02 |
|   3 |   TABLE ACCESS FULL| TEST2 | 77512 |  5979K|       |   104   (1)| 00:00:02 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."T1"="B"."T1" AND "A"."T2"="B"."T2")

 

 

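In theory, with large data volumes the HASH JOIN ANTI used by NOT EXISTS performs better than the MINUS operation. However, when the tables being compared live in two different databases (a distributed query), the plan by default does not use HASH JOIN ANTI and falls back to the very inefficient FILTER operation: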

 

 

select *
  from test1 a
 where not exists (select 1
          from test2@maclean b
         where a.t1 = b.t1
           and a.t2 = b.t2);
no rows selected

Elapsed: 00:01:05.76

--------------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       | 75816 |  2147K|   147K  (1)| 00:29:31 |        |      |
|*  1 |  FILTER            |       |       |       |            |          |        |      |
|   2 |   TABLE ACCESS FULL| TEST1 | 75816 |  2147K|   104   (1)| 00:00:02 |        |      |
|   3 |   REMOTE           | TEST2 |     1 |    29 |     2   (0)| 00:00:01 | MACLE~ | R->S |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( NOT EXISTS (SELECT 0 FROM  "B" WHERE "B"."T1"=:B1 AND "B"."T2"=:B2))

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "T1","T2" FROM "TEST2" "B" WHERE "T1"=:1 AND "T2"=:2 (accessing
       'MACLEAN' )

 

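The execution plan above shows that the FILTER operation is very expensive.

Addendum:

A reader pointed out that adding the unnest hint lets the CBO unnest the remote subquery and consider the query block as a whole, so that the plan can use HASH JOIN RIGHT ANTI. I had not thought of this at first.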

 

 

select *
  from test1 a
 where not exists (select /*+ unnest */
         1
          from test2@maclean b
         where a.t1 = b.t1
           and a.t2 = b.t2);

           
           

PLAN_TABLE_OUTPUT
------------------------------------------
Plan hash value: 1776635653

------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Inst   |IN-OUT|
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       | 79815 |    12M|       |   594   (1)| 00:00:08 |        |      |
|*  1 |  HASH JOIN RIGHT ANTI|       | 79815 |    12M|  1816K|   594   (1)| 00:00:08 |        |      |
|   2 |   REMOTE             | TEST2 | 20420 |  1575K|       |    56   (0)| 00:00:01 | MACLE~ | R->S |
|   3 |   TABLE ACCESS FULL  | TEST1 | 79815 |  6157K|       |   104   (1)| 00:00:02 |        |      |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."T1"="B"."T1" AND "A"."T2"="B"."T2")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   2 - SELECT "T1","T2" FROM "TEST2" "B" (accessing 'MACLEAN' )

 

 

On top of that, adding the ordered hint makes the plan use HASH JOIN ANTI:

 

   
 select /*+ ordered */ *
  from test1 a
 where not exists (select /*+ unnest */
         1
          from test2@maclean b
         where a.t1 = b.t1
           and a.t2 = b.t2);  

PLAN_TABLE_OUTPUT
--------------------------------------------------
Plan hash value: 3089912131

----------------------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Inst   |IN-OUT|
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       | 79815 |    12M|       |   594   (1)| 00:00:08 |        |      |
|*  1 |  HASH JOIN ANTI    |       | 79815 |    12M|  7096K|   594   (1)| 00:00:08 |        |      |
|   2 |   TABLE ACCESS FULL| TEST1 | 79815 |  6157K|       |   104   (1)| 00:00:02 |        |      |
|   3 |   REMOTE           | TEST2 | 20420 |  1575K|       |    56   (0)| 00:00:01 | MACLE~ | R->S |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."T1"="B"."T1" AND "A"."T2"="B"."T2")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "T1","T2" FROM "TEST2" "B" (accessing 'MACLEAN' )

 

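The advantage of Method 3 is that it is easy to run, and when the tables being compared are in the same database it is more efficient than MINUS. For a distributed query, however, efficiency can drop sharply because of the FILTER operation. In that case, add a SQL hint such as unnest manually so that the plan uses HASH JOIN ANTI, which preserves the performance of the NOT EXISTS approach. NOT EXISTS likewise does not support large objects such as CLOB.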
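
Two details are easy to miss with NOT EXISTS: a single query only finds rows of TEST1 that have no match in TEST2, and a plain equality such as a.t2 = b.t2 never matches when both values are NULL. The sketch below addresses both, assuming the same tables and dblink as above. DECODE is used because it treats two NULLs as equal. The plans for this variant were not captured, so the hints may or may not produce the anti-join shown earlier.

```sql
-- Sketch: check both directions and compare T2 in a NULL-safe way.
-- decode(x, y, 1, 0) returns 1 when x = y or when both are NULL.
select 'TEST1 only' as side, a.t1, a.t2
  from test1 a
 where not exists (select /*+ unnest */ 1
                     from test2@maclean b
                    where a.t1 = b.t1
                      and decode(a.t2, b.t2, 1, 0) = 1)
union all
select 'TEST2 only', b.t1, b.t2
  from test2@maclean b
 where not exists (select /*+ unnest */ 1
                     from test1 a
                    where a.t1 = b.t1
                      and decode(a.t2, b.t2, 1, 0) = 1);
```

A row whose T2 value differs between the two sides shows up twice, once under each label, which makes the old and new values easy to read side by side.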

 

Method 4:

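Graphical tools such as Toad and PL/SQL Developer provide a compare table data feature. Here we use Toad as the example to show how to verify data with it:

Open Toad and connect to the database -> click Database -> Compare -> Data

Enter the source table and target table information in the Source 1 and Source 2 dialogs respectively.

Because Toad uses the MINUS operation under the hood, raising SORT_AREA_SIZE helps the performance of the compare; with automatic PGA management it does not need to be set.

Select the columns to compare.

First compare the row counts of the two tables: click Execute to calculate the counts.

Use MINUS to find the rows that exist in one table but not in the other.

Use MINUS to find all differences.

Toad's compare data feature is built on MINUS, so it has no efficiency advantage, but the graphical interface saves the trouble of writing SQL. This method likewise does not support LOB, LONG and similar objects.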

 

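Method 5:

This is an inventive approach. Concatenate all the columns of a row, apply dbms_utility.get_hash_value to the concatenated value to obtain a hash value, then sum the hash values obtained from all rows. If the hash sums of the two tables are equal, their data is judged to be consistent.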

 

Put simply, it looks like this:

 

create table hash_one as select object_id t1,object_name t2 from dba_objects;

select dbms_utility.get_hash_value(t1||t2,0,power(2,30)) from hash_one where rownum <3;

DBMS_UTILITY.GET_HASH_VALUE(T1||T2,0,POWER(2,30))
-------------------------------------------------
                                         89209477
                                        757190129

select sum(dbms_utility.get_hash_value(t1||t2,0,power(2,30))) from hash_one;

SUM(DBMS_UTILITY.GET_HASH_VALU
------------------------------
                40683165992756

select sum(dbms_utility.get_hash_value(object_id||object_name,0,power(2,30))) from dba_objects;

SUM(DBMS_UTILITY.GET_HASH_VALU
------------------------------
                40683165992756
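
The same idea can be applied to the two tables from the earlier methods in one statement. This is a sketch, assuming TEST1, TEST2 and the MACLEAN dblink from above: it adds a row count next to the hash sum, puts a '|' delimiter between the columns, and wraps the sum in NVL so that two empty tables compare as equal. The remote rows are hashed by the local function, so they are shipped over the link.

```sql
-- Sketch: compare row count and hash sum of both tables in one query.
-- nvl(..., 0) keeps two empty tables from comparing as NULL vs NULL.
select l.row_count as local_rows,
       r.row_count as remote_rows,
       l.hash_sum  as local_hash_sum,
       r.hash_sum  as remote_hash_sum,
       case
         when l.row_count = r.row_count and l.hash_sum = r.hash_sum then 'MATCH'
         else 'DIFFERENT'
       end as verdict
  from (select count(*) as row_count,
               nvl(sum(dbms_utility.get_hash_value(t1 || '|' || t2, 0, power(2, 30))), 0) as hash_sum
          from test1) l
 cross join
       (select count(*) as row_count,
               nvl(sum(dbms_utility.get_hash_value(t1 || '|' || t2, 0, power(2, 30))), 0) as hash_sum
          from test2@maclean) r;
```

A MATCH verdict is strong evidence rather than proof, since different rows can in principle produce the same sum.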

 

 

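For tables with many columns, building the concatenation of all columns by hand can be tedious. The following SQL quickly generates the statement we need: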

 

Run it in a tool such as PL/SQL Developer; in sqlplus it may fail with ORA-00923: FROM keyword not found where expected.

select 'select sum(dbms_utility.get_hash_value(' || column_name_path ||
       ',0,power(2,30)) ) from ' || owner || '.' || table_name || ';'
  from (select owner,
               table_name,
               column_name_path,
               row_number() over(partition by table_name order by table_name, curr_level desc) column_name_path_rank
          from (select owner,
                       table_name,
                       column_name,
                       rank,
                       level as curr_level,
                       ltrim(sys_connect_by_path(column_name, '||''|''||'), '||''|''||') column_name_path
                  from (select owner,
                               table_name,
                               column_name,
                               row_number() over(partition by table_name order by table_name, column_name) rank
                          from dba_tab_columns
                         where owner = UPPER('&OWNER')
                           and table_name = UPPER('&TABNAME')
                         order by table_name, column_name)
                connect by table_name = prior table_name
                       and rank - 1 = prior rank))
 where column_name_path_rank = 1;

 

Usage example:

 

SQL> @get_hash_col
Enter value for owner: SYS
Enter value for tabname: TAB$

'SELECTSUM(DBMS_UTILITY.GET_HASH_VALUE('||COLUMN_NAME_PATH||',0,POWER(2,30)))FROM
--------------------------------------------------------------------------------
select sum(dbms_utility.get_hash_value(ANALYZETIME||'|'||AUDIT$||'|'||AVGRLN||'|
'||AVGSPC||'|'||AVGSPC_FLB||'|'||BLKCNT||'|'||BLOCK#||'|'||BOBJ#||'|'||CHNCNT||'
|'||CLUCOLS||'|'||COLS||'|'||DATAOBJ#||'|'||DEGREE||'|'||EMPCNT||'|'||FILE#||'|'
||FLAGS||'|'||FLBCNT||'|'||INITRANS||'|'||INSTANCES||'|'||INTCOLS||'|'||KERNELCO
LS||'|'||MAXTRANS||'|'||OBJ#||'|'||PCTFREE$||'|'||PCTUSED$||'|'||PROPERTY||'|'||
ROWCNT||'|'||SAMPLESIZE||'|'||SPARE1||'|'||SPARE2||'|'||SPARE3||'|'||SPARE4||'|'
||SPARE5||'|'||SPARE6||'|'||TAB#||'|'||TRIGFLAG||'|'||TS#,0,1073741824) ) from S
YS.TAB$;

Use the generated SQL to compute the sum(hash) value of the table:

select sum(dbms_utility.get_hash_value(ANALYZETIME || '|' || AUDIT$ || '|' ||
                                       AVGRLN || '|' || AVGSPC || '|' ||
                                       AVGSPC_FLB || '|' || BLKCNT || '|' ||
                                       BLOCK# || '|' || BOBJ# || '|' ||
                                       CHNCNT || '|' || CLUCOLS || '|' || COLS || '|' ||
                                       DATAOBJ# || '|' || DEGREE || '|' ||
                                       EMPCNT || '|' || FILE# || '|' ||
                                       FLAGS || '|' || FLBCNT || '|' ||
                                       INITRANS || '|' || INSTANCES || '|' ||
                                       INTCOLS || '|' || KERNELCOLS || '|' ||
                                       MAXTRANS || '|' || OBJ# || '|' ||
                                       PCTFREE$ || '|' || PCTUSED$ || '|' ||
                                       PROPERTY || '|' || ROWCNT || '|' ||
                                       SAMPLESIZE || '|' || SPARE1 || '|' ||
                                       SPARE2 || '|' || SPARE3 || '|' ||
                                       SPARE4 || '|' || SPARE5 || '|' ||
                                       SPARE6 || '|' || TAB# || '|' ||
                                       TRIGFLAG || '|' || TS#,
                                       0,
                                       1073741824))
  from SYS.TAB$;

SUM(DBMS_UTILITY.GET_HASH_VALU
------------------------------
                 1646389632463

 

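Method 5 judges whether table data is consistent by summing the hash of each whole row. It only needs a full scan of each of the two tables, which makes it the most efficient of these methods, and it offers fairly high accuracy.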

 

However, this hash approach has the following shortcomings:

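1. The concatenation of all columns of a row may exceed 4000 bytes, in which case ORA-01489 (result of string concatenation is too long) is raised. In other words, this approach presupposes that no row's concatenated value exceeds 4000 bytes. In practice a row rarely exceeds 4000 bytes, and the average row length statistic dba_tables.avg_row_len gives an indication: if avg_row_len << 4000, overflow is generally not a problem.

2. The hash approach can only tell whether the data is consistent; it cannot provide further useful detail such as which rows differ.

3. Likewise, the hash approach cannot handle LOB and LONG columns.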
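
Shortcoming 2 can be worked around when the tables have a key: hash each row instead of summing, then join the two sides on the key. The sketch below is modelled on the row-level statement that dbms_comparison issued in the 10046 trace above, simplified to a single-argument ora_hash. It assumes TEST1, TEST2, the MACLEAN dblink, and T1 as the NOT NULL key on both sides.

```sql
-- Sketch: per-row hash comparison joined on the key column T1.
-- T1 is assumed to be a NOT NULL key, so a NULL T1 after the outer join
-- means the row is missing on that side.
select nvl(l.t1, r.t1) as t1,
       case
         when l.t1 is null then 'missing locally'
         when r.t1 is null then 'missing remotely'
         else 'different'
       end as status
  from (select t1, ora_hash(t1 || '|' || t2) as row_hash from test1) l
  full outer join
       (select t1, ora_hash(t1 || '|' || t2) as row_hash from test2@maclean) r
    on l.t1 = r.t1
 where l.t1 is null
    or r.t1 is null
    or l.row_hash <> r.row_hash;
```

This trades the single-number result of the sum for a join, so it costs more than Method 5 but reports the keys of the rows that differ.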

Comments

  1. Charlie 木匠 (@mujiang) says

    Interesting, I learned something.
    In my view, computing a hash for each row and then comparing the per-row hash values of the two tables would reveal details such as which rows differ. A table join would be more efficient.

  2. bin_huhj says

    Using SQL to quickly build the statement that produces the sum(hash) value is essentially a rows-to-columns operation, right? The wmsys.wm_concat function can simplify the SQL considerably: select 'select sum(ora_hash('||enames||',power(2,30),0)) from '||owner||'.'||table_name||';' from (select owner, table_name, replace(wmsys.wm_concat(column_name),',','||') as enames from (select owner, table_name, column_name from dba_tab_columns where owner = UPPER('&tabowner') AND TABLE_NAME = UPPER('&tabname')) group by owner, table_name); Also, in my tests, computing the hash with ora_hash is much faster than with dbms_utility.get_hash_value.
