RAC动态资源(DRM)管理介绍

以下文本摘自: metalink doc 390483.1

Subject:  DRM – Dynamic Resource management
Doc ID:  390483.1  Type:  BULLETIN
Modified Date :  13-JAN-2009  Status:  PUBLISHED

In this Document
Purpose
Scope and Application
DRM – Dynamic Resource management
DRM – Dynamic Resource Mastering
References

Applies to:
Oracle Server – Enterprise Edition – Version: 10.1.0.2 to 11.1.0
Oracle Server – Standard Edition – Version: 10.1.0.2 to 11.1.0
Information in this document applies to any platform.
Oracle Real application Clusters
Purpose

To describe the concept of DRM (Dynamic Resource Mastering)
Scope and Application

This note in intended for experienced Real application cluster DBA’s
DRM – Dynamic Resource management
DRM – Dynamic Resource Mastering

When using Real application Clusters (RAC), Each instance has its own SGA and buffer cache. RAC will ensure that these block changes are co -ordinated to maximize performance and to ensure data intergrity. Each copy of the buffer also called as a cache resource has a master which is one of the nodes of the cluster.

In database releases before 10g (10.1.0.2) once a cache resource is mastered on an instance, a re-mastering or a change in the master would take place only during a reconfiguration that would happen automatically during both normal operations like instance startup or instance shutdown or abnormal events like Node eviction by Cluster Manager. So if Node B is the master of a cache resource, this resource will remain mastered on Node B until reconfiguration.

10g  introduces a concept of resource remastering via DRM. With DRM a resource can be re-mastered on another node say from Node B to Node A if it is found that the cache resource is accessed more frequently from Node A. A reconfiguration is no longer the only reason for a resource to be re-mastered.

In 10gR1 DRM is driven by affinity of files and in 10gR2 it is based on objects.

Sample LMD trace file during a DRM operation

Begin DRM(202) - transfer pkey 4294951314 to 0 oscan 1.1
*** 2006-08-01 17:34:54.645
Begin DRM(202) - transfer pkey 4294951315 to 0 oscan 1.1
*** 2006-08-01 17:34:54.646
Begin DRM(202) - transfer pkey 4294951316 to 0 oscan 1.1
*** 2006-08-01 17:34:54.646
Begin DRM(202) - transfer pkey 4294951317 to 0 oscan 1.1

DRM attributes are intentionally undocumented since they may change depending on the version. These attributes should not be changed without discussing with Support.
@DRM is driven by the following
@ 1.) _gc_affinity_time = Time in minutes at which statistics will be evaluated (default = 10 mins)
@ 2.) _gc_affinity_limit = # of times a node accesses a file/object (default = 50)
@ 3.) _gc_affinity_minimum = minimum # of times per minute a file/object is accessed before affinity kicks in
@ (default = 600 per minute per cpu )

It is important to note that

  1. Two instance will not start a DRM operation at the same time however lmd,lms,lmon processes from all instances collectively take part in the DRM operation.
  2. Normal activity on the database is not affected due to DRM. This means users continue insert/update/delete operations without any interruptions. Also DRM operations complete very quickly.

@

Disable DRM
Generally DRM should not be disabled unless Oracle Support/Development has suggested turning it off due to some known issues.
@To disable DRM, set
@To disable DRM, set
@_gc_affinity_time=0                                 # Only if DB version is 10.1 or 10.2
@_gc_undo_affinity=FALSE                       # Only if Db version is 10.2
@_gc_policy_time=FALSE                         # Only if DB version is 11.1 or higher
@_gc_affinity_time has been renamed to _gc_policy_time in 11g

Comments

  1. admin says

    quote:
    “10g Real Application Clusters introduced a concept of resource remastering via Dynamic Resource Mastering (DRM). With DRM a resource can be re-mastered on another node in the cluster if it is found that the cache resource is accessed more frequently from that node.

    In 10G R1 this was file based whereas the 10G R2 it is object based.
    In 10G R1 due to a few bugs many related to high CPU usage during the DRM freeze window most customers disabled DRM by setting the following parameters
    _gc_affinity_time=0
    _gc_undo_affinity=FALSE

    _gc_affinity_time defines the frequency in minutes to check if remastering
    is needed.
    _gc_affinity_limit defines the number of times a node must access an object
    for it to be a DRM candidate
    _gc_affinity_minimum defines the minimum number of times an object is accessed per minute before affinity kicks in

    The performance problems may manifest themselves in terms of a DRM related wait event like ‘gcs drm freeze in enter server mode’

    In 10G R2 this feature appears to be more stable.

    You can also manually remaster an object on a different node which
    is different from the node on which the object is currrently mastered as shown below

    SQL> select object_id,current_master, previous_master ,remaster_cnt from V$GCSPFMASTER_INFO where object_id = 144615

    OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
    ———- ————– ————— ————
    144615 0 2 0

    The object 144615 is currently mastered on node 0.
    To remaster the object onto node 2 connect to node 2 as sysdba

    Go to instance 2
    NODE2> oradebug setmypid
    Statement processed.
    NODE2> oradebug lkdebug -m pkey 144615
    Statement processed.
    NODE2> select object_id,current_master, previous_master ,remaster_cnt from V$GCSPFMASTER_INFO where object_id = 144615

    OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
    ———- ————– ————— ————
    144615 2 0 0
    Note: In V$GCSPFMASTER_INFO you will also see resources with object ids in the 4Gb range (e.g. 4294950913)
    These are for undo segments.

    To dissolve remastering of this object on this instance
    SQL> oradebug lkdebug -m dpkey 144615
    Statement processed.

    SQL> select object_id,current_master, previous_master ,remaster_cnt from V$GCSPFMASTER_INFO where object_id = 144615;
    no rows selected

    The remaster_cnt appears to be 0 for all objects. I have got Oracle
    to log bug 5649377 on this issue.

    SQL> select distinct remaster_cnt from V$GCSPFMASTER_INFO ;

    REMASTER_CNT
    ————
    0

    DRM statistics are available in X$KJDRMAFNSTATS

    SQL> select * from X$KJDRMAFNSTATS
    2 /

    ADDR INDX INST_ID DRMS AVG_DRM_TIME OBJECTS_PER_DRM QUISCE_T FRZ_T CLEANUP_T REPLAY_T FIXWRITE_T SYNC_T
    ——– ———- ———- ———- ———— ————— ———- ———- ———- ———- ———- ———-
    RES_CLEANED REPLAY_S REPLAY_R MY_OBJECTS
    ———– ———- ———- ———-
    200089CC 0 1 32 214 1 3 14 0 0 0 99
    0 2441 6952 30

    The column MY_OBJECTS denotes the number of objects mastered on that node.
    This should match with the following

    SQL> select count(*) from V$GCSPFMASTER_INFO where current_master=0
    2 /

    COUNT(*)
    ———-
    30 “

  2. admin says

    Hdr: 5649377 10.2.0.1.0 RDBMS 10.2.0.1.0 DLM PRODID-5 PORTID-46
    Abstract: REMASTER_CNT IN V$GCSPFMASTER_INFO IS NOT GETTING UPDATED
    PROBLEM:
    ——–
    When a object is getting remastered the column REMASTER_CNT in
    V$GCSPFMASTER_INFO is not getting updated

    DIAGNOSTIC ANALYSIS:
    ——————–
    SQL> create table t1 ( a number);

    Table created.

    SQL> begin
    2 for i in 1..100
    3 loop
    4 insert into t1 values(i);
    5 end loop;
    6 commit;
    7 end;
    8 /

    PL/SQL procedure successfully completed.

    SQL> select count(*) from t1;

    COUNT(*)
    ———-
    100

    SQL> select * from t1;

    A
    Run select * from t1 multiple times and get the object mastered on any node.

    In our case the object is currently mastered on the node elephant

    elephant> select * from V$GCSPFMASTER_INFO where object_id = 290525;

    FILE_ID OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
    ———- ———- ————– ————— ————
    0 290525 0 32767 0

    Force a remastering onto 2nd node hippo
    hippo>>oradebug lkdebug -m pkey 290525
    Statement processed.
    hippo>> select * from V$GCSPFMASTER_INFO where object_id = 290525;

    FILE_ID OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
    ———- ———- ————– ————— ————
    0 290525 2 0 0

    The remaster_cnt has not increased.

    Force a remastering onto 3rd node rhino
    rhino> oradebug lkdebug -m pkey 290525
    Statement processed.
    rhino> select * from V$GCSPFMASTER_INFO where object_id = 290525;

    FILE_ID OBJECT_ID CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT
    ———- ———- ————– ————— ————
    0 290525 1 0 0

    Again the remaster count has not increased.

    WORKAROUND:
    ———–
    NA

    RELATED BUGS:
    ————-
    NA

    REPRODUCIBILITY:
    —————-
    Reproducible

    TEST CASE:
    ———-

    STACK TRACE:
    ————

    SUPPORTING INFORMATION:
    ———————–

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————

    REDISCOVERY INFORMATION:
    Remaster count (REMASTER_CNT) in V$GCSPFMASTER_INFO is always 0, even after
    remastering has occurred.
    WORKAROUND:
    None
    RELEASE NOTES:
    ]]When dynamic remastering occurred, the remaster_cnt column in the dynamic
    ]]view v$gcspfmaster_info was not being updated. This has now been fixed.

  3. lmon Process Terminated With Error Ora-481
    Applies to:
    Oracle Server – Enterprise Edition – Version: 10.1.0.4 to 10.2.0.2
    This problem can occur on any platform.
    RAC 10.1.0.3 (or 10.1.0.4)
    Symptoms
    Instance finished due to ORA-481 in lmon.

    Cause
    This can be produced for the internal bug: 3659289 Lmon Took Off Line During Remastering Sync.

    The bug matches:

    1. lmon produces the ORA-481
    2. instance crashed
    3. lmon failed to get an anwer from the lmd during a dynamic reconfiguration in the same step:

    * kjfcdrmrfg: SYNC TIMEOUT while waiting for lmd in sync step 31
    ..
    sync() timed out – lmon exiting

    4. a quiesce step was executed before the error:

    Solution
    Before 10.2.0.3, we should not use DRM. The main reason is that we have some issues fixed in 10.2.0.3.0 which cannot be backported to earlier versions.

    To disable DRM the following parameters must be set on the instances:

    – 10gR1
    _gc_affinity_time=0
    – 10gR2
    _gc_undo_affinity=FALSE
    _gc_affinity_time=0

  4. Hdr: 5181297 10.2.0.1.0 RDBMS 10.2.0.1.0 RAC PRODID-5 PORTID-197 ORA-481
    Abstract: LMON TERMINATES INSTANCE DUE TO SYNC TIMEOUT (STEP 31) – ORA-481
    PROBLEM:
    ——–
    o 4 node HP-UX Itanium RAC cluster running 10.2.0.1.0
    o each node has 48 processors
    o customer running migration testing, replacing PA RISC with Itanium
    processors
    o on Apr 17 at around 23:15 instance on node 1 terminated due to LMON dying
    o LMON apparently terminated due to a timeout recorded in the LMON trace file
    kjfcdrmrfg: SYNC TIMEOUT (48358, 47457, 900), step 31
    o prior to the instance termination all instances were experiencing a hang
    o after instance 1 died other instance “unfroze”

    DIAGNOSTIC ANALYSIS:
    ——————–
    There is no obvious reason for the instance termination. Resource starvation
    appears unlikely as the CPU load during the time frame in qusestion is rather
    low. The highest load on some processors is 20%.

    WORKAROUND:
    ———–
    none

    RELATED BUGS:
    ————-

    REPRODUCIBILITY:
    —————-
    intermittent

    TEST CASE:
    ———-
    n/a

    STACK TRACE:
    ————
    LMON trace shows the following call stack:

    ksedst ksedmp $cold_ksddoa ksdpcg ksdpec ksfpec
    kgesev ksesec0 kjfcdrmrfg kjfcln ksbrdp opirip
    opidrv sou2o opimai_real main main_opd_entry

    SUPPORTING INFORMATION:
    ———————–

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————
    on ngmp1.
    =========
    From ngmp1_lmon_15085.trc
    ————————-
    *** 23:15:44.052
    kjfcdrmrfg: SYNC TIMEOUT (48358, 47457, 900), step 31
    Submitting asynchronized dump request [28]

    … <>
    error 481 detected in background process
    ORA-481: LMON process terminated with error
    *** 23:16:37.822
    ksuitm: waiting for [5] seconds before killing DIAG

    From alert.log
    ————–
    Mon Apr 17 23:16:06 2006
    Errors in file /oracle/app/oracle/admin/NGMP/bdump/ngmp1_lmon_15085.trc:
    ORA-481: LMON process terminated with error

    … <>
    Mon Apr 17 23:16:42 2006
    Instance terminated by LMON, pid = 15085
    *** 04/23/06 10:55 pm ***
    on ngmp2
    ========
    From alert.log
    ————–
    Mon Apr 17 23:15:23 2006
    GES: Potential blocker (pid=23648) on resource SS-0x3-0x2;
    enqueue info in file /oracle/app/oracle/admin/NGMP/udump/ngmp2_ora_3359.trc
    and DIAG trace file
    Mon Apr 17 23:16:08 2006
    Trace dumping is performing id=[cdmp_20060417231607]
    Mon Apr 17 23:16:47 2006
    Reconfiguration started (old inc 16, new inc 18)
    List of nodes:
    1 2 3

    on ngmp3
    ========
    From alert.log
    ————–
    Mon Apr 17 23:16:08 2006
    Trace dumping is performing id=[cdmp_20060417231607]
    Mon Apr 17 23:16:47 2006
    Reconfiguration started (old inc 16, new inc 18)
    List of nodes:
    1 2 3

    on ngmp4
    ========
    From alert.log
    ————–
    Mon Apr 17 23:15:23 2006
    GES: Potential blocker (pid=23803) on resource CI-0xe-0x1;
    enqueue info in file
    /oracle/app/oracle/admin/NGMP/bdump/ngmp4_lmd0_23655.trc and DIAG trace file
    Mon Apr 17 23:16:08 2006
    Trace dumping is performing id=[cdmp_20060417231607]
    Mon Apr 17 23:16:47 2006
    Reconfiguration started (old inc 16, new inc 18)
    List of nodes:
    1 2 3

  5. admin says

    Hdr: 5960580 10.2.0.2 RDBMS 10.2.0.2 RAC PRODID-5 PORTID-23 5181297
    Abstract: LMON CRASHING WITH ORA-481 ERROR
    PROBLEM:
    ——–
    LMON crashed with ora-481 on Mon Mar 26 18:53:47 2007. Ct said that a
    particular job which was completing in 3.5 hours on Friday March 23, was
    taking 5 hours on Saturday March 24 and after the crash (March 26) is taking
    12 to 15 hours.

    I am not sure whether the crash and the performance are related. The crash
    did NOT happen when the job was running.

    Let us consider this bug for only the LMON crash.

    DIAGNOSTIC ANALYSIS:
    ——————–
    alert_PFNR1011.log:
    ==========================
    Mon Mar 26 18:53:36 2007
    WARNING: inbound connection timed out (ORA-3136)
    Mon Mar 26 18:53:36 2007
    WARNING: inbound connection timed out (ORA-3136)
    Mon Mar 26 18:53:36 2007
    WARNING: inbound connection timed out (ORA-3136)
    Mon Mar 26 18:53:36 2007
    WARNING: inbound connection timed out (ORA-3136)
    Mon Mar 26 18:53:36 2007
    ???????

    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lmon_1349.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    LMON: terminating instance due to error 481
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lck0_1767.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_dbw2_1485.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms0_1354.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lmd0_1352.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_pmon_1341.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms2_1363.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms4_1410.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms6_1425.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    System state dump is made for local instance
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms1_1359.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms3_1397.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms5_1419.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_lms7_1432.trc:
    ORA-481: LMON process terminated with error
    System State dumped to trace file
    /oracle/g01/admin/PFNR1011/bdump/pfnr1011_diag_1344.trc
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_ckpt_1513.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_j002_5622.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_dbw3_1491.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:47 2007
    Errors in file /oracle/g01/admin/PFNR1011/bdump/pfnr1011_j003_5637.trc:
    ORA-481: LMON process terminated with error
    Mon Mar 26 18:53:52 2007
    Instance terminated by LMON, pid = 1349
    Mon Mar 26 18:54:50 2007
    Starting ORACLE instance (normal)
    Mon Mar 26 18:55:29 2007

    Reviewing pfnr1011_lmon_1349.trc:
    ===================================
    *** 18:53:27.122
    kjfcdrmrfg: SYNC TIMEOUT (275372, 274471, 900), step 31
    Submitting asynchronized dump request [28]

    kjctseventdump-end tail 205 heads 0 @ 0 205 @ 1047039906
    sync() timed out – lmon exiting
    kjfsprn: sync status inst 0 tmout 900 (sec)
    kjfsprn: sync propose inc 8 level 396
    kjfsprn: sync inc 8 level 396

    waiting for ‘ges remote message’ blocking sess=0x0 seq=4510 wait_time=0
    seconds since wait started=442
    waittime=40, loop=0, p3=0
    Dumping Session Wait History
    for ‘ges remote message’ count=1 wait_time=746343
    waittime=40, loop=0, p3=0

    WORKAROUND:
    ———–
    no

    RELATED BUGS:
    ————-
    similar to bug 5399702 – base bug 5181297
    could be bug 4947571 – base bug 4940890

    REPRODUCIBILITY:
    —————-
    no

    TEST CASE:
    ———-
    no

    STACK TRACE:
    ————
    *** 18:53:27.159
    Dumping diagnostic information for ospid 1352:
    OS pid = 1352
    loadavg : 1.44 1.39 1.39
    swap info: free_mem = 34099.71M rsv = 27064.04M
    alloc = 26321.46M avail = 95169.25 swap_free = 95911.83M
    F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY
    TIME CMD
    0 S oracle 1352 1 0 40 20 ? 2646619 ? Mar 23
    console 22:22 ora_lmd0_PFNR1011
    1352: ora_lmd0_PFNR1011
    —————– lwp# 1 / thread# 1 ——————–
    ffffffff7a8cde8c ioctl (9, 373f, 105e6ba10)
    ffffffff7fffc910) + 15f4
    710
    0000000100e662e0 ksliwat (105c00, 2, 8, 91a000160, 0, 0) + b60
    0000000100e66820 kslwaitns (8, 1, 32, 0, 40, 0) + 20
    00000001010d61f4 kskthbwt (8, 1, 32, 0, 40, 0) + d4
    0000000100e667dc kslwait (8, 32, 0, 15e, 190, 0) + 5c
    00000001010e8344 ksxprcv (104fb4, 105d68558, 104fb4128, 1468, 105d68,
    104fb4000) + 364
    0000000101591254 kjctr_rksxp (40, 403fe88f8, 0, ffffffff7fffd978, 14,
    ffffffff7fffd974) + 1f4
    0000000101592e24 kjctrcv (ffffffff79626208, 403fe88f8, 105e992c0,
    ffffffff7fffe1bc, 40, 32) + 164
    000000010157f6a0 kjcsrmg (ffffffff796261f0, 0, 40, 32, 0, 105d71) + 60
    00000001015dc098 kjmdm (a, 44, 4097ab030, 8, 4097ab030, 0) + 2ff8
    0000000101002a60 ksbrdp (105d6b, 380007774, 380000, 38000e, 105c00,
    1015d90a0) + 380
    00000001024219f8 opirip (105d75000, 105c00, 105d7d, 380007000, 105d75,
    105df1ae0) + 338
    00000001002fe790 opidrv (105d77d18, 1, 32, 0, 32, 105c00) + 4b0
    00000001002f8e30 sou2o (ffffffff7ffff468, 32, 4, ffffffff7ffff490,
    1056ac000, 1056ac) + 50
    00000001002bc2ec opimai_real (3, ffffffff7ffff568, 0, 0, 1e42ee4, 14400) +
    10c
    00000001002bc118 main (1, ffffffff7ffff678, 0, ffffffff7ffff570,
    ffffffff7ffff680, ffffffff7aa00140) + 98
    00000001002bc03c _start (0, 0, 0, 0, 0, 0) + 17c

    SUPPORTING INFORMATION:
    ———————–
    will be uploaded

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-
    Viral Shah – customer – 267-467-6950. This is a sev2 p1 SR.

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————
    Ct is on pre-production. Planning to go production next week. Ct saidd he
    cannot upgrade to 10.2.0.3.

  6. admin says

    Hdr: 9405510 10.2.0.4 RDBMS 10.2.0.4 RAC PRODID-5 PORTID-197 ORA-481 6960699
    Abstract: LMON TERMINATES INSTANCE DUE TO SYNC TIMEOUT, STEP 31
    PROBLEM:
    ——–
    Two nodes cluster.
    Two node always shutdown abort because of ORA-481

    LMON apparently terminated due to a timeout recorded in the LMON trace file
    kjfcdrmrfg: SYNC TIMEOUT (289446, 288545, 900), step 31

    DIAGNOSTIC ANALYSIS:
    ——————–
    alert_CSP1.log.20100203
    ————
    Thu Jan 28 17:57:09 2010
    Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lmon_18295.trc:
    ORA-481: LMON process terminated with error
    Thu Jan 28 17:57:09 2010
    LMON: terminating instance due to error 481
    Thu Jan 28 17:57:09 2010
    Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lms0_18299.trc:
    ORA-481: LMON process terminated with error
    Thu Jan 28 17:57:09 2010
    Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lms1_18307.trc:
    ORA-481: LMON process terminated with error
    Thu Jan 28 17:57:09 2010
    Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_lmd0_18297.trc:
    ORA-481: LMON process terminated with error
    Thu Jan 28 17:57:09 2010
    System state dump is made for local instance
    System State dumped to trace file
    /cfs/oradata/CSP/admin/CSP/bdump/csp1_diag_18291.trc
    Thu Jan 28 17:57:09 2010
    Errors in file /cfs/oradata/CSP/admin/CSP/bdump/csp1_pmon_18289.trc:
    ORA-481: LMON process terminated with error
    Thu Jan 28 17:57:11 2010
    Shutting down instance (abort)

    alert_CSP2.log.20100203
    ——————
    Thu Jan 28 17:57:10 2010
    Trace dumping is performing id=[cdmp_20100128175709]
    Thu Jan 28 17:57:18 2010
    Reconfiguration started (old inc 64, new inc 66)
    List of nodes:
    1

    From csp1_lmon_18295.trc
    =========
    *** 17:41:47.285
    Begin DRM(1328)
    sent syncr inc 64 lvl 8809 to 0 (64,0/31/0)
    sent synca inc 64 lvl 8809 (64,0/31/0)

    sent syncr inc 64 lvl 8840 to 0 (64,0/38/0)
    sent synca inc 64 lvl 8840 (64,0/38/0)
    End DRM(1328)
    Begin DRM(1329)
    *** 17:56:49.084
    kjfcdrmrfg: SYNC TIMEOUT (289446, 288545, 900), step 31
    Submitting asynchronized dump request [28]
    ==

    WORKAROUND:
    ———–
    Disable DRM as below
    _gc_undo_affinity=FALSE
    _gc_affinity_time=0

    RELATED BUGS:
    ————-
    5181297

    REPRODUCIBILITY:
    —————-

    TEST CASE:
    ———-

    STACK TRACE:
    ————

    SUPPORTING INFORMATION:
    ———————–
    This seems to be very much similar to bug 5181297, which is fixed in 10203.
    But ct is on 10204.

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————

  7. admin says

    Hdr: 5407742 10.2.0.2 RDBMS 10.2.0.2 RAC PRODID-5 PORTID-46 ORA-481
    Abstract: LMON CRASHE WITH ORA-481 KJFCDRMRFG: SYNC TIMEOUT STEP 31
    PROBLEM:
    ——–
    12 nodes with 2 instance on each nodes

    They ran into bug 4631662 and upgrade to 10.2.0.2 on 7/9/06.

    Intance 1 of NMC database crashed with ora-481.

    Thu Jul 20 09:25:19 2006
    Errors in file
    /opt/app/oracle/product/10.2.0/RAC/admin/NMC/bdump/nmc1_lmon_9754.trc:
    ORA-481: LMON process terminated with error
    Thu Jul 20 09:25:19 2006
    LMON: terminating instance due to error 481

    This is their production and can’t afford to have instance crashed on the
    time

    DIAGNOSTIC ANALYSIS:
    ——————–
    From the lmon trace:

    nmc1_lmon_9754.trc
    —————————
    *** 09:10:01.138
    sent syncr inc 718 lvl 17953 to 0 (718,0/31/0)
    sent synca inc 718 lvl 17953 (718,0/31/0)
    sent syncr inc 718 lvl 17954 to 0 (718,0/34/0)
    sent synca inc 718 lvl 17954 (718,0/34/0)
    sent syncr inc 718 lvl 17955 to 0 (718,0/36/0)
    sent synca inc 718 lvl 17955 (718,0/36/0)
    sent syncr inc 718 lvl 17956 to 0 (718,0/38/0)
    sent synca inc 718 lvl 17956 (718,0/38/0)
    *** 09:25:04.835
    kjfcdrmrfg: SYNC TIMEOUT (924581, 923680, 900), step 31
    Submitting asynchronized dump request [28]
    KJC Communication Dump:
    state 0x5 flags 0x0 mode 0x0 inst 0 inc 718
    nrcv 3 nsp 3 nrcvbuf 1000
    reg_msg: sz 420 cur 216 (s:0 i:216) max 925 ini 2350
    big_msg: sz 4128 cur 38 (s:0 i:38) max 205 ini 2350
    rsv_msg: sz 4128 cur 0 (s:0 i:0) max 0 tot 1000
    rcvr: id 2 orapid 8 ospid 9788
    rcvr: id 1 orapid 7 ospid 9784
    rcvr: id 0 orapid 6 ospid 9756
    …………….

    kjctseventdump-end tail 247 heads 0 @ 0 247 @ 931725070 247 @ 931725070 247 @

    931725070 247 @ 931725070 247 @ 931725070 247 @ 931725070 247 @ 931725070 247
    931725070 247 @ 931725070 247 @ 931725070 247 @ 931725070
    sync() timed out – lmon exiting
    kjfsprn: sync status inst 0 tmout 900 (sec)
    kjfsprn: sync propose inc 718 level 17956
    kjfsprn: sync inc 718 level 17956
    kjfsprn: sync bitmap 0 1 2 3 4 5 6 7 8 9 10 11
    kjfsprn: dmap ver 718 (step 0)
    kjfsprn: ftdone bitmap 0 1 2 3 5 6 7 8 9 10 11

    This seem to match Bug 5237240 and that bug is at status 30.

    WORKAROUND:
    ———–
    none

    RELATED BUGS:
    ————-
    Bug 5181297 – LMON TERMINATES INSTANCE DUE TO SYNC TIMEOUT (STEP 31) –
    ORA-481
    – fixed in 10.2.0.3

    Bug 5131042 – ORA-481: LMON PROCESS TERMINATED WITH ERROR SYNC TIMEOUT
    – dup of 4903532
    – fix in 10.2.0.3

    REPRODUCIBILITY:
    —————-

    TEST CASE:
    ———-

    STACK TRACE:
    ————
    there is no lmon stack trace

    SUPPORTING INFORMATION:
    ———————–

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————

Comment

*

沪ICP备14014813号-2

沪公网安备 31010802001379号