How to Diagnose Node Reboot Problems (Oracle Official Blog)

How do you analyze this kind of problem? Start with the system log. His system is HP-UX, so the system log is /var/adm/syslog/syslog.log; on AIX, use errpt.

In the system log I saw:

Nov 11 18:43:57 rx8640c syslog: Oracle CSS family monitor shutting down. 3

Nov 11 18:43:59 rx8640c su: + tty root-oracle

Nov 11 18:43:59 rx8640c syslog: Cluster Ready Services completed waiting on dependencies.

Comparing this with the alert log shows that the system rebooted at roughly this time:

Wed Nov 11 18:43:28 2009

Trace dumping is performing id=[cdmp_20091111184328]

Wed Nov 11 18:57:17 2009

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

On AIX you can check with last shutdown; I do not know whether the same command works on HP-UX.
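The reboot window can also be read from the alert log itself, since the instance writes nothing while the host is down. Below is a minimal sketch of that idea, assuming the timestamp lines look exactly like the ones quoted above and that the script runs in an English/C locale. The function names and the five-minute threshold are illustrative choices, not anything Oracle provides.

```python
from datetime import datetime, timedelta

# Format of alert-log timestamp lines, e.g. "Wed Nov 11 18:43:28 2009".
# Assumption: day and month names are English (C locale).
ALERT_TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_alert_timestamps(lines):
    """Yield a datetime for every line that is a bare alert-log timestamp."""
    for line in lines:
        try:
            yield datetime.strptime(line.strip(), ALERT_TS_FORMAT)
        except ValueError:
            continue  # ordinary message line, not a timestamp


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (before, after) pairs of consecutive timestamps further apart than threshold."""
    stamps = list(parse_alert_timestamps(lines))
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]


# Lines copied from the alert log excerpt above.
sample = [
    "Wed Nov 11 18:43:28 2009",
    "Trace dumping is performing id=[cdmp_20091111184328]",
    "Wed Nov 11 18:57:17 2009",
    "Starting ORACLE instance (normal)",
]

for before, after in find_gaps(sample):
    print(f"alert log silent from {before} to {after} ({after - before})")
```

A gap that ends in "Starting ORACLE instance" is a good candidate for the reboot window; a gap alone only shows that nothing was logged.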

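Here, syslog.log shows the CSS process shutting down (this reading is my own guess). When CSS shuts down or fails, the host is rebooted automatically, which matches what happened here.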

The next step is to analyze the ocssd log under ORA_CRS_HOME:

[ CSSD]2009-11-11 18:39:18.460 [13] >WARNING: clssgmAssignMemberNo(): grock(#CSS_CLSSOMON) memberNo(1) already assigned

[ CSSD]2009-11-11 18:39:34.313 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 50% heartbeat fatal, eviction in 14.807 seconds

[ CSSD]2009-11-11 18:39:35.313 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 50% heartbeat fatal, eviction in 13.807 seconds

[ CSSD]2009-11-11 18:39:42.313 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 75% heartbeat fatal, eviction in 6.807 seconds

[ CSSD]2009-11-11 18:39:45.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:45.314 [14] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)

[ CSSD]2009-11-11 18:39:46.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:46.314 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 90% heartbeat fatal, eviction in 2.807 seconds

[ CSSD]2009-11-11 18:39:47.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:47.314 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 90% heartbeat fatal, eviction in 1.807 seconds

[ CSSD]2009-11-11 18:39:48.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:48.314 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 90% heartbeat fatal, eviction in 0.807 seconds

[ CSSD]2009-11-11 18:39:49.133 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:49.134 [14] >TRACE: clssnmPollingThread: Eviction started for node rx8640c (1), flags 0x000f, state 3,

The log makes it obvious: the private-network heartbeat was lost and the node was evicted.
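When the ocssd log is long, the heartbeat countdown and the eviction line can be pulled out mechanically. The following is a rough sketch that assumes each log entry sits on a single line (wrapped lines must be joined first) and that the message wording matches the excerpt above. The regular expressions and the scan_ocssd name are my own, not part of any Oracle tool.

```python
import re

# Timestamp as it appears right after the "[ CSSD]" tag.
_TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"

# "... node rx8640c (1) at 50% heartbeat fatal, eviction in 14.807 seconds"
HEARTBEAT_RE = re.compile(
    r"\[\s*CSSD\]" + _TS + r".*?node (?P<node>\S+) \(\d+\) at (?P<pct>\d+)% "
    r"heartbeat fatal, eviction in (?P<secs>\d+\.\d+) seconds"
)

# "... Eviction started for node rx8640c (1), ..."
EVICTION_RE = re.compile(
    r"\[\s*CSSD\]" + _TS + r".*?Eviction started for node (?P<node>\S+) \(\d+\)"
)


def scan_ocssd(lines):
    """Return (warnings, evictions) found in ocssd log lines.

    warnings:  list of (timestamp, node, percent, seconds_left)
    evictions: list of (timestamp, node)
    """
    warnings, evictions = [], []
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            warnings.append((m["ts"], m["node"], int(m["pct"]), float(m["secs"])))
            continue
        m = EVICTION_RE.search(line)
        if m:
            evictions.append((m["ts"], m["node"]))
    return warnings, evictions


# Entries taken from the excerpt above, with wrapped lines joined.
sample = [
    "[ CSSD]2009-11-11 18:39:34.313 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 50% heartbeat fatal, eviction in 14.807 seconds",
    "[ CSSD]2009-11-11 18:39:45.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig",
    "[ CSSD]2009-11-11 18:39:49.134 [14] >TRACE: clssnmPollingThread: Eviction started for node rx8640c (1), flags 0x000f, state 3,",
]

warnings, evictions = scan_ocssd(sample)
for ts, node, pct, secs in warnings:
    print(f"{ts} {node}: {pct}% of heartbeat timeout used, {secs}s to eviction")
for ts, node in evictions:
    print(f"{ts} {node}: eviction started")
```

The first warning timestamp gives the time to hand to the network team: the interconnect had already been silent for a while when the 50% message was written.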

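As for why the private network failed and the heartbeat was lost, I do not think that is something a DBA can resolve. Write up a report and hand it to the network team.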

Incidentally, three processes can cause a node reboot: OCSSD, OPROCD, and OCLSOMON.

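In general, OCSSD triggers a reboot because of a lost heartbeat (a problem with the network heartbeat or the voting disk), because the CSS process cannot get CPU resources, or because of a bug. OPROCD and OCLSOMON trigger one because a process cannot get CPU resources or because of a bug.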

In his case, an ORA-600 error was also reported just before the node rebooted:

Wed Nov 11 18:43:27 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_24884.trc:

ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []

This is confirmed as Bug 5486074:

ORA-600 [keltnfy-ldminit] can occur in the Server Generated Alert subsystem when it cannot determine the Host Name or Network Address. This can be caused by DNS server being unavailable.

I looked it up: nothing says this error can kill CSS or reboot the host, and the error should be one raised on the client side...

At the very least, it confirms that the network had a problem.
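Because the bug text ties the error to the host name or network address not being resolvable, a quick resolution check on the node is a cheap follow-up. This is only a sketch using Python's standard socket module; the resolve helper is a name I made up, and a successful lookup now says nothing about whether DNS was reachable at the time of the incident.

```python
import socket


def resolve(name):
    """Return the IPv4 address for name, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except (OSError, UnicodeError):
        # socket.gaierror and socket.herror are subclasses of OSError;
        # UnicodeError covers names that cannot be encoded as a host name.
        return None


if __name__ == "__main__":
    host = socket.gethostname()
    addr = resolve(host)
    if addr is None:
        print(f"cannot resolve local host name {host!r}")
    else:
        print(f"{host} resolves to {addr}")
```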

Errors were also reported at startup:

Wed Nov 11 18:58:06 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_7203.trc:

ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []

Wed Nov 11 18:58:07 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_7203.trc:

ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] [Address not mapped to object] [0x000001004] [] []

ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []

Wed Nov 11 18:58:08 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_7203.trc:

ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] [Address not mapped to object] [0x000001004] [] []

ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] [Address not mapped to object] [0x000001004] [] []

ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []

ORA-07445 [kgscDump] corresponds to Bug 5508574 - OERI[504] / OERI[99999] / Dump [kgscdump] with > 31 CPUs, but this system has only 15 CPUs with 30 cores.

For ORA-00600 [ksprlspeeq3] I could not find a bug related to 10.2.0.3, so I am leaving it alone for now.
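Before searching for bugs, it helps to group the alert-log errors by error code and first argument, since that pair is what the bug titles are keyed on. Here is a small sketch, assuming the ORA- lines look like the ones quoted above; count_ora_errors is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the error code and the first bracketed argument, e.g.
# "ORA-00600: internal error code, arguments: [ksprlspeeq3], ..."
ORA_RE = re.compile(r"^(ORA-\d{5})\b.*?\[([^\]]*)\]")


def count_ora_errors(lines):
    """Count occurrences of each (error code, first argument) pair."""
    counts = Counter()
    for line in lines:
        m = ORA_RE.match(line.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Lines copied from the startup excerpt above.
sample = [
    "ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []",
    "ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] [Address not mapped to object] [0x000001004] [] []",
    "ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []",
]

for (code, arg), n in count_ora_errors(sample).most_common():
    print(f"{code} [{arg}] x{n}")
```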

I recommend a MetaLink note: 4.1. It is the former Knowledge section and contains many categorized articles along with a list of tools.