DNS设置引起的登录延迟

一套Linux上的11.1.0.7系统,操作系统管理人员最近对该服务器上的网络配置文件/etc/nsswitch.conf进行了调整,调整前其主机名解析选项为”hosts:files dns” ,调整后被修改成了”hosts:files [NOTFOUND=continue] dns”;此后应用人员尝试在该主机上使用
“sqlplus username/password@connect_string”远程登录数据库都会出现多达五分钟的延迟,使用lsnrctl status命令查看监听器状态,发现LISTENER一切正常;初步可以判断是dns解析导致了长时间的延迟。

针对以上问题,首先想到的是设置client端Oracle network trace以了解造成延迟的具体原因,在$ORACLE_HOME/network/admin/sqlnet.ora配置文件中加入以下记录:

TRACE_LEVEL_CLIENT = 16
TRACE_FILE_CLIENT = client
TRACE_DIRECTORY_CLIENT = [any valid directory path]
TRACE_TIMESTAMP_CLIENT = ON
DIAG_ADR_ENABLED=off

之后再次尝试登录就会触发Oracle Network Trace文件在$TRACE_DIRECTORY_CLIENT指定的目录下产生(如果DIAG_ADR_ENABLED未设置为false,那么11g下TRACE_DIRECTORY_CLIENT并不生效,而会产生在11g特有的diag目录下)。
登录测试产生的trace文件记录:

[02-SEP-2010 07:36:57:719] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=m218279apss2012-vip)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=MOTOIDP.MOT.COM)(INSTANCE_NAME=MOTOIDP1)(CID=(PROGRAM=sqlplus)(HOST=m218279apss2012.mot.com)(USER=oraoid))))
[02-SEP-2010 07:36:57:719] nttbnd2addr: entry
[02-SEP-2010 07:36:57:719] snlinGetAddrInfo: entry
[02-SEP-2010 07:36:57:719] snlinGetAddrInfo: getaddrinfo() failed with error -2
[02-SEP-2010 07:36:57:719] snlinGetAddrInfo: exit
[02-SEP-2010 07:36:57:719] nttbnd2addr: looking up IP addr for host: m218279apss2012-vip
[02-SEP-2010 07:36:57:719] snlinGetAddrInfo: entry
[02-SEP-2010 07:36:57:719] snlinGetAddrInfo: exit
[02-SEP-2010 07:36:57:719] snlinFreeAddrInfo: entry
[02-SEP-2010 07:36:57:719] snlinFreeAddrInfo: exit
[02-SEP-2010 07:36:57:719] nttbnd2addr: exit
..................
[02-SEP-2010 07:36:57:722] nspsend: packet dump
[02-SEP-2010 07:36:57:722] nspsend: 01 1E 00 00 01 00 00 00  |........|
[02-SEP-2010 07:36:57:722] nspsend: 01 3A 01 2C 0C 41 20 00  |.:.,.A..|
[02-SEP-2010 07:36:57:722] nspsend: 7F FF 7F 08 00 00 01 00  |........|
[02-SEP-2010 07:36:57:722] nspsend: 00 E4 00 3A 00 00 02 00  |...:....|
[02-SEP-2010 07:36:57:722] nspsend: 41 41 00 00 00 00 00 00  |AA......|
[02-SEP-2010 07:36:57:722] nspsend: 00 00 00 00 00 00 00 00  |........|
[02-SEP-2010 07:36:57:722] nspsend: 00 00 00 00 00 00 00 00  |........|
[02-SEP-2010 07:36:57:722] nspsend: 00 00 28 44 45 53 43 52  |..(DESCR|
[02-SEP-2010 07:36:57:722] nspsend: 49 50 54 49 4F 4E 3D 28  |IPTION=(|
[02-SEP-2010 07:36:57:722] nspsend: 41 44 44 52 45 53 53 3D  |ADDRESS=|
[02-SEP-2010 07:36:57:722] nspsend: 28 50 52 4F 54 4F 43 4F  |(PROTOCO|
[02-SEP-2010 07:36:57:722] nspsend: 4C 3D 54 43 50 29 28 48  |L=TCP)(H|
[02-SEP-2010 07:36:57:722] nspsend: 4F 53 54 3D 6D 32 31 38  |OST=m218|
[02-SEP-2010 07:36:57:722] nspsend: 32 37 39 61 70 73 73 32  |279apss2|
[02-SEP-2010 07:36:57:722] nspsend: 30 31 32 2D 76 69 70 29  |012-vip)|
[02-SEP-2010 07:36:57:722] nspsend: 28 50 4F 52 54 3D 31 35  |(PORT=15|
[02-SEP-2010 07:36:57:722] nspsend: 32 31 29 29 28 43 4F 4E  |21))(CON|
[02-SEP-2010 07:36:57:722] nspsend: 4E 45 43 54 5F 44 41 54  |NECT_DAT|
[02-SEP-2010 07:36:57:722] nspsend: 41 3D 28 53 45 52 56 45  |A=(SERVE|
[02-SEP-2010 07:36:57:722] nspsend: 52 3D 44 45 44 49 43 41  |R=DEDICA|
[02-SEP-2010 07:36:57:722] nspsend: 54 45 44 29 28 53 45 52  |TED)(SER|
[02-SEP-2010 07:36:57:722] nspsend: 56 49 43 45 5F 4E 41 4D  |VICE_NAM|
[02-SEP-2010 07:36:57:722] nspsend: 45 3D 4D 4F 54 4F 49 44  |E=MOTOID|
[02-SEP-2010 07:36:57:722] nspsend: 50 2E 4D 4F 54 2E 43 4F  |P.MOT.CO|
[02-SEP-2010 07:36:57:722] nspsend: 4D 29 28 49 4E 53 54 41  |M)(INSTA|
[02-SEP-2010 07:36:57:722] nspsend: 4E 43 45 5F 4E 41 4D 45  |NCE_NAME|
[02-SEP-2010 07:36:57:722] nspsend: 3D 4D 4F 54 4F 49 44 50  |=MOTOIDP|
[02-SEP-2010 07:36:57:722] nspsend: 31 29 28 43 49 44 3D 28  |1)(CID=(|
[02-SEP-2010 07:36:57:722] nspsend: 50 52 4F 47 52 41 4D 3D  |PROGRAM=|
[02-SEP-2010 07:36:57:722] nspsend: 73 71 6C 70 6C 75 73 29  |sqlplus)|
[02-SEP-2010 07:36:57:722] nspsend: 28 48 4F 53 54 3D 6D 32  |(HOST=m2|
[02-SEP-2010 07:36:57:722] nspsend: 31 38 32 37 39 61 70 73  |18279aps|
[02-SEP-2010 07:36:57:722] nspsend: 73 32 30 31 32 2E 6D 6F  |s2012.mo|
[02-SEP-2010 07:36:57:722] nspsend: 74 2E 63 6F 6D 29 28 55  |t.com)(U|
[02-SEP-2010 07:36:57:722] nspsend: 53 45 52 3D 6F 72 61 6F  |SER=orao|
[02-SEP-2010 07:36:57:722] nspsend: 69 64 29 29 29 29        |id))))  |
[02-SEP-2010 07:36:57:722] nspsend: 286 bytes to transport
[02-SEP-2010 07:36:57:722] nspsend: normal exit
[02-SEP-2010 07:36:57:722] nscon: exit (0)
[02-SEP-2010 07:36:57:722] nsdo: nsctxrnk=0
[02-SEP-2010 07:36:57:722] nsdo: normal exit
[02-SEP-2010 07:36:57:722] nsdo: entry
[02-SEP-2010 07:36:57:722] nsdo: cid=0, opcode=68, *bl=512, *what=9, uflgs=0x0, cflgs=0x3
[02-SEP-2010 07:36:57:722] nsdo: rank=64, nsctxrnk=0
[02-SEP-2010 07:36:57:722] nsdo: nsctx: state=2, flg=0x4005, mvd=0
[02-SEP-2010 07:36:57:722] nsdo: gtn=10, gtc=10, ptn=10, ptc=8155
[02-SEP-2010 07:36:57:722] nscon: entry
[02-SEP-2010 07:36:57:722] nscon: recving a packet
[02-SEP-2010 07:36:57:722] nsprecv: entry
[02-SEP-2010 07:36:57:722] nsprecv: reading from transport...
[02-SEP-2010 07:36:57:722] nttrd: entry
[02-SEP-2010 07:41:57:741] nttrd: socket 9 had bytes read=8
[02-SEP-2010 07:41:57:741] nttrd: exit
[02-SEP-2010 07:41:57:741] nsprecv: 8 bytes from transport
[02-SEP-2010 07:41:57:741] nsprecv: tlen=8, plen=8, type=11
[02-SEP-2010 07:41:57:741] nsprecv: packet dump
[02-SEP-2010 07:41:57:741] nsprecv: 00 08 00 00 0B 00 00 00  |........|
[02-SEP-2010 07:41:57:741] nsprecv: normal exit

可以看到以上出现了”snlinGetAddrInfo: getaddrinfo() failed with error -2″的记录,通过关键词”snlinGetAddrInfo”在MOS上搜索可以找到以下Note:

Remote Connections Take Very Long to Establish

Applies to:

Oracle Net Services – Version: 11.1.0.6 to 11.1.0.7 – Release: 11.1 to 11.1
Information in this document applies to any platform.

Symptoms

When you connect remotely, it takes very long until the connection is established. Once connected, everything works fine.

Local bequeath connections work in a timely manner.

Resolution of hostnames is done via a centralized DNS server.

Cause

Listener traces show the listener is wasting time while calling the “snlinGetAddrInfo” function :

2009-07-08 13:58:35.135311 : nttcnp:exit
2009-07-08 13:58:35.135327 : snlinGetAddrInfo:entry
2009-07-08 13:58:55.135643 : snlinGetAddrInfo:getaddrinfo() failed with error -5
2009-07-08 13:58:55.135703 : snlinGetAddrInfo:exit

Solution

We need to make sure that the DNS server(s) configured are reachable.

The “snlinGetAddrInfo” function is a TCP layer function which Oracle uses (since version 11g) for hostname-to-IP mappings.

When the DNS server is unreachable, the listener will wait for some time until it will time out and fail over to the next method configured for resolution of hostnames (usually the local “hosts” file).

Note : On UNIX systems, the order of methods used for resolution of hostnames can be specified via the /etc/nsswitch.conf and /etc/host.conf files :
  • on Linux systems, you can specify the order by setting “hosts” to “files” and / or “dns”
Example for /etc/nsswitch.conf :
hosts: files [NOTFOUND=continue] dns
  • on HP-UX and Solaris systems, you need to specify the order using the “ipnodes” keyword
    Example for /etc/nsswitch.conf :
    ipnodes: files [NOTFOUND=continue] dns
Note : Since 11.2, this behaviour has changed and naming lookup is no longer performed at this stage (the changes was addressed through unpublished Bug 9593134).

这个文档指出snlinGetAddrInfo函数用以主机名到ip地址的映射,并建议使用在/etc/nsswitch.conf网络配置文件中设置过程为”hosts: files [NOTFOUND=continue] dns”,而我们恰恰正是这样做的!另一个文档[ID 803838.1]指出在其他UNIX平台上,可以使用”ipnodes: files [NOTFOUND=continue] dns”方式避免延迟问题,但Linux平台上并没有ipnodes方式。

11g中对hosts的解析实在变得有些西斯底里,这个case通过在移除nsswitch.conf中hosts的dns选项最后解决了,对于Oracle使用最简单的文件解析方式似乎仍是最稳妥的办法,不管版本有多新。

Comments

  1. admin says

    11g Network Layer Does Not Use /etc/hosts on UNIX

    Applies to:

    Oracle Net Services – Version: 11.1.0.6 to 11.1.0.7
    Generic UNIX

    Symptoms

    After upgrading to 11g Oracle functionality bypasses the /etc/hosts file when resolving hostnames to IPs and instead query the DNS server. This can introduce some delay in establishing a connection to a remote host compared with 10g.

    SQL*Plus and tnsping show this behavior but this delay can appear also when opening a dblink or anytime tcpip connections are established by the Oracle Network layer.

    This happens even if  Name Server Switch configuration (nsswitch) specifies the hosts file prior to dns lookups:

    /etc/nsswitch.conf 
    hosts: files [NOTFOUND=continue] dns
       
    or only:
    hosts: files

    If we obtain truss/tusc trace for a SQL*Plus connection we find the following sequence of OS system calls:

    On 10g after reading the nsswitch.conf file, library “libnss_files.so” is loaded then /etc/hosts is read and the socket is opened:

    open(“/etc/nsswitch.conf”, O_RDONLY|0x800, 0666) = 5
    ioctl(5, TCGETA, 0x9fffffffffffaca0) ERR#25 ENOTTY
    read(5, “# \n# / e t c / n s s w i t c “.., 8192) = 92
    read(5, 0x60000000001e6078, 8192) = 0
    close(5) = 0
    open(“/usr/lib/hpux64/libnss_files.so.1”, O_RDONLY|0x800, 0) = 5
    fstat(5, 0x9fffffffffffa720) = 0
    pread(5, “7fE L F 0202010101\0\0\0\0\0\0\0”.., 1024, 0) = 1024
    stat(“/usr/lib/hpux64/dpd”, 0x9fffffffffff9cd0) = 0
    open(“/usr/lib/hpux64/dpd/libnss_files.so.1.bpd”, O_RDONLY|0x800, 0) ERR#2 ENOENT
    getuid() = 305 (305)
    getgid() = 303 (303)
    mmap(NULL, 85872, PROT_READ|PROT_EXEC, MAP_SHARED|MAP_SHLIB, 5, 0) = 0xc0000000008d8000
    mmap(NULL, 3159, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_SHLIB, 5, 131072) = 0x9fffffffbf63d000
    close(5) = 0
    getuid() = 305 (305)
    getgid() = 303 (303)
    open(“/etc/hosts”, O_RDONLY|0x800, 0666) = 5
    ioctl(5, TCGETA, 0x9fffffffffffac30) ERR#25 ENOTTY
    read(5, “# / e t c / h o s t s \n# \n# “.., 8192) = 8192
    lseek(5, 18446744073709549410, SEEK_CUR) = 5986
    close(5) = 0
    socket(AF_INET, SOCK_STREAM, 0) = 5
    connect(5, 0x60000000001f0680, 16) = 0
    getsockname(5, 0x9fffffffffffb750, 0x9fffffffffffb740) = 0
    getsockopt(5, SOL_SOCKET, SO_SNDBUF, 0x9fffffffffffb890, 0x9fffffffffffb894) = 0
    getsockopt(5, SOL_SOCKET, SO_RCVBUF, 0x9fffffffffffb890, 0x9fffffffffffb894) = 0
    setsockopt(5, 0x6, TCP_NODELAY, 0x9fffffffffffb89c, 4) = 0

     

    But on 11g after reading the nsswitch.conf  library “libnss_dns.so” is loaded then /etc/resolv.conf (which specifies available domain name servers) is read and much later a socket for TCP/IP (SOCK_STREAM) is opened. :

    open(“/etc/nsswitch.conf”, O_RDONLY|0x800, 0666) = 5
    ioctl(5, TCGETA, 0x9fffffffffffa360) ERR#25 ENOTTY
    read(5, “# \n# / e t c / n s s w i t c “.., 8192) = 92
    read(5, 0x60000000001c9058, 8192) = 0
    close(5) = 0
    open(“/usr/lib/hpux64/libnss_dns.so.1”, O_RDONLY|0x800, 0) = 5
    fstat(5, 0x9fffffffffff9de0) = 0
    pread(5, “7fE L F 0202010101\0\0\0\0\0\0\0”.., 1024, 0) = 1024
    stat(“/usr/lib/hpux64/dpd”, 0x9fffffffffff9390) = 0
    open(“/usr/lib/hpux64/dpd/libnss_dns.so.1.bpd”, O_RDONLY|0x800, 0) ERR#2 ENOENT
    getuid() = 305 (305)
    getgid() = 303 (303)
    mmap(NULL, 49440, PROT_READ|PROT_EXEC, MAP_SHARED|MAP_SHLIB, 5, 0) = 0xc00000000b054000
    mmap(NULL, 800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_SHLIB, 5, 65536) = 0x9fffffffbf68d000
    close(5) = 0
    getuid() = 305 (305)
    getgid() = 303 (303)
    getuid() = 305 (305)
    getgid() = 303 (303)
    open(“/test/local/oracle/11.1.0.7/lib/libdl.so.1”, O_RDONLY|0x800, 0) ERR#2 ENOENT
    open(“/test/local/oracle/11.1.0.7/lib32/libdl.so.1”, O_RDONLY|0x800, 0) ERR#2 ENOENT
    getuid() = 305 (305)
    getgid() = 303 (303)
    open(“/usr/lib/hpux64/libdl.so.1”, O_RDONLY|0x800, 0) = 5
    fstat(5, 0x9fffffffffff9cf0) = 0
    read(5, “7fE L F 0202010101\0\0\0\0\0\0\0”.., 64) = 64
    close(5) = 0
    socket(AF_INET, SOCK_DGRAM, 0) = 5
    ioctl(5, SIOCGIFNUM, 0x9fffffffffff9680) = 0
    ioctl(5, SIOCGIFCONF, 0x9fffffffffff9690) = 0
    socket(AF_INET6, SOCK_DGRAM, 0) = 6
    ioctl(6, SIOCGLIFNUM, 0x9fffffffffff9684) = 0
    ioctl(6, SIOCGLIFCONF, 0x9fffffffffff96a0) = 0
    ioctl(5, SIOCGIFFLAGS, 0x9fffffffffff96b0) = 0
    close(5) = 0
    close(6) = 0
    gettimeofday(0x9fffffffffff7dd0, NULL) = 0
    getpid() = 22968 (22967)
    open(“/etc/resolv.conf”, O_RDONLY|0x800, 0666) = 5
    ioctl(5, TCGETA, 0x9fffffffffff7da0) ERR#25 ENOTTY
    read(5, “d o m a i n t e s t . c o m \n”.., 8192) = 453
    read(5, 0x60000000001dddf8, 8192) = 0
    close(5) = 0
    …………….

    ………….

    socket(AF_INET, SOCK_STREAM, 0) = 5
    connect(5, 0x60000000001eba50, 16) = 0
    getsockname(5, 0x9fffffffffff9da0, 0x9fffffffffff94c0) = 0
    getsockopt(5, SOL_SOCKET, SO_SNDBUF, 0x9fffffffffffa000, 0x9fffffffffffa004) = 0
    getsockopt(5, SOL_SOCKET, SO_RCVBUF, 0x9fffffffffffa000, 0x9fffffffffffa004) = 0
    setsockopt(5, 0x6, TCP_NODELAY, 0x9fffffffffffa00c, 4) = 0

     

    Changes

    Nothing was changed in the configuration of the OS, only the upgrade from Oracle 10g to 11g was done.

    Cause

    What was changed between the two is the way Oracle resolves hostnames to IPs, more specifically the system call used to do that.

    Oracle 11g use now getaddrinfo() while 10g used gethostbyname().

    These system functions requires different configuration in /etc/nsswitch.conf.

    gethostbyname() require the use of keyword “hosts” while getaddrinfo() the use the keyword “ipnodes”

     

    Notes:
    Even though this has only been observed on HP-UX and Solaris, this may be UNIX generic.
    With Solaris, ipnodes has a different meaning (specify a file for IPV6 addresses resolution, gethostbyname and getaddrinfo both use hosts or ipnodes file).
    Linux on the other hand does not use ipnodes in nsswitch.conf

    Solution

    Add a line in the /etc/nsswitch.conf file similar to the following:
       
    ipnodes: files [NOTFOUND=continue] dns

    The line starting with keyword “hosts” must not be deleted.

    This way calls made by getaddrinfo() will search first in /etc/hosts then, if the name is not found, will contact the dns server.
    Thus there will be no connection delay for any lookup of host names existing in the local /etc/hosts file.

     

  2. admin says

    Sqlnet connection via TCP Hangs for about 10seconds Before Connection is Established

    Applies to:

    Oracle Server – Enterprise Edition – Version: 10.1.0.2 to 11.2.0.1.0 – Release: 10.1 to 11.2
    Information in this document applies to any platform.

    Symptoms

    10 nodes RAC cluster, sqlnet connection via TCP consistently takes 10 seconds before the connection with the server is finally established.

    $date;sqlplus apps/xxx@dwnodea @time.sql
    Tue Feb  2 17:00:39 CST 2010

    SQL*Plus: Release 11.1.0.7.0 – Production on Tue Feb  2 17:00:39 2010

    Copyright (c) 1982, 2008, Oracle. All rights reserved.

    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 – 64bit Production
    With the Partitioning, Real Application Clusters, OLAP and Real Application Testing options

    TO_CHAR(SYSDATE,’YYYY-MM-DDHH24:MI:SS’)
    ————————————————————————————————————————
    2010-02-02 17:00:49

    Sqlnet tracing (both server and client) against sqlplus connection with level 16 shows:

    Server side trace:
    2010-02-02 17:00:39.432277 : nttcnp:exit
    2010-02-02 17:00:39.432308 : snlinGetAddrInfo:entry

    *** 2010-02-02 17:00:49.440
    2010-02-02 17:00:49.440655 : snlinGetAddrInfo:getaddrinfo() failed with error -2
    2010-02-02 17:00:49.440736 : snlinGetAddrInfo:exit
    2010-02-02 17:00:49.440751 : nttcon:exit
    2010-02-02 17:00:49.440767 : nsopen:transport is open
    2010-02-02 17:00:49.440785 : nsoptions:entry
    2010-02-02 17:00:49.440802 : nsoptions:lcl[0]=0x0, lcl[1]=0x0, gbl[0]=0x0, gbl[1]=0x0, cha=0x0

    Client side trace:
    2010-02-02 17:00:39.418598 : nscon:recving a packet
    2010-02-02 17:00:39.418714 : nsprecv:entry
    2010-02-02 17:00:39.418833 : nsprecv:reading from transport…
    2010-02-02 17:00:39.418949 : nttrd:entry
    2010-02-02 17:00:49.444643 : nttrd:socket 12 had bytes read=8

    2010-02-02 17:00:49.444832 : nttrd:exit

    Changes

    This is a new installation.

    Cause

    This issue is caused by invalid nameserver entry in /etc/resolv.conf.

    Examine system configuration files show:

    resolv.conf
    ——–
    nameserver 192.168.1.20
    nameserver 192.168.4.12

    nsswitch.conf
    ———-
    #hosts: db files ldap nis dns
    hosts: files dns

    These are incorrect nameserver per customer configuration, they do not exist. Once they are removed, there is no longer any delay in sqlnet connection.

    Solution

    1. Remove incorrect setting for nameserver in /etc/resolv.conf if DNS is not in use.
    or
    2. Consult with the network administrator to provide correct nameserver IP address if DNS is in use.

  3. admin says

    =========================================================================
    Interim Patch for Base Bug: 9593134
    =========================================================================

    Date: Tue Aug 10 18:18:48 2010
    ————————————————————————-
    Platform Patch for : Linux-x86
    Product Patched : SQLNET
    Product Version # : 11.1.0.7.0
    RAC Rolling Installable : YES

    Bugs Fixed by this patch:
    ————————-
    9593134: DNS OR NIS MIS-CONFIGURATION CAN CAUSE SLOW DATABASE CONNECTS

    Hdr: 8307164 11.1.0.7 NET 11.1.0.7 PRODID-115 PORTID-197
    Abstract: TNSPING 11G USING DNS AND NOT HOSTS FILE

    *** 03/04/09 05:24 am ***
    TAR:
    —-

    Problem:
    ——–
    Customer has 2 oracle homes on the same HP-UX server,
    one 10g and one 11.1.0.7.
    He execute tnsping in both versions, and truss them truss -o out
    tnsping

    In 10g truss shows that the hostname from the connect descriptor is resolved
    using /etc/hosts file but in 11g the same hostname is resolved using dns.
    The same behavior is observed for sqlplus.
    Because of this connections to the database ar slow because of the slow
    hostname resolution using the dns.

    Versions:
    ———

    Diagnostic Analysis:
    ——————–
    nsswitch.conf
    ==============
    hosts: files [NOTFOUND=continue] dns

    Reproducibility:
    —————-
    on customer site, also I observered same behavior on
    internal HP-UX server with 11g database

    Test Case:
    ———-

    Workaround:
    ———–
    none

  4. Hdr: 5047830 9.2.0.6 NET 9.2.0.6 PRODID-115 PORTID-59 2728394
    Abstract: TNSPING HANGS AND CONSUMES MEMORY UUNTIL THE SERVER LOCKS UP

    Problem:
    ——–
    Problem Description
    ~~~~~~~~~~~~~~~~~~~~
    Firstly, the problem with tnsping hanging when locating a bad tnsnames
    alias was reported in Bug2728394 “TNSPING HANGS AND DOES NOT RETURN ERROR
    WHEN SERVICE NAME IS WRONG”
    This was shown for all 9.2.0.x versions.

    HOWEVER, the customer has responded saying that the “workaround” of making
    sure the tnsnames.ora aliases are not corrupt is unacceptable for the
    following severe reason:

    If a tnsping is inadvertantly issued on a production server against a “bad”

    tnsnames.ora alias, the tnsping command hangs, BUT memory resources leak
    and eventually, after a few minutes (depending on the resources available)

    the OS runs out of memory and causes a production outage.
    The solution is to kill any hanging tnsping attempt but this could happen
    without the DBA’s knowledge until the situation is severe.

    Versions:
    ———
    SQL*Net / Net Services 9.2.0.6 and 9.2.0.7
    Tested against HP-UX 11 (customer’s platform) = 9.2.0.6
    Tested against Solaris 2.9 = 9.2.0.7.
    Same results each time.

    Diagnostic Analysis:
    ——————–
    1. Manually create a “bad” tnsnames.ora alias.
    2. Run tnsping against the alias.
    3. Run ‘top’ or some other tool to check memory leak.
    4. Kill the hung process

    Reproducibility:
    —————-
    Everytime in-house and customer’s site.

    Test Case:
    ———-
    Testcase uploaded as testcase.txt

    Workaround:
    ———–
    Kill the hanging tnsping process.
    However, not easy if there is no knowledge of the tnsping that was run.

  5. Lunar says

    感谢ML,学习了,真长知识

Comment

*

沪ICP备14014813号-2

沪公网安备 31010802001379号