Archive for July, 2009

TOD – Top 5 reasons the cluster will not start on a node.

I have mentioned before the top two, but here are the top 5 most common reasons that the CRS, CSS, EVM, etc, processes will not start on a node:

1. Network issue – NIC card might be unavailable, VIP assigned to a node on a different system, switch may be bad, etc.
2. Storage issue – verify that all the voting and OCR disks/luns are accessible on all nodes and are owned appropriately (OCR – root:dba & 640, Voting Disks – oracle:dba * 640). You can test by dumping with dd and piping to strings – dd if=/dev/raw/raw2 | strings .
3. Too many sockets – look in /tmp/.oracle and /var/tmp/.oracle and remove any and all socket files. These can build up between reboots after many failed attempts to start the cluster.
4. Filesystem full – if the filesystem containing CRS_HOME is full, or becomes full during the start of a cluster, it will just hang for the 600 second timeout and die. If it can’t write a log, it won’t start up.
5. /etc/oracle missing or corrupt. The /etc/oracle directory is created when root.sh is run for CRS. it contains the OCR location and other files needed by the cluster. If CRS cannot find the location of the OCR files, it will look in default locations and then simply hang while searching for the default 600 seconds.

As always, search out the logs in $CRS_HOME/log/. Expecially the alert.log and the files in crsd, cssd, and client will, while sometimes difficult to decipher, help you find the reason for the cluster problems.

Keep on RAC’n!

, , ,

No Comments

New format for Grumpy-DBA.com

I am experimenting with a new format on my blog site. I had received some comments that with the three column format, long text boxes were being cut off. I started using this theme with only one side column to allow for a wider post format.

No Comments

TOD – ASM diskgroup space from any database

Have you ever noticed that v$asm views are available in all databases? This is true whether using ASM or not because the ASM home and software is the same as the database software. On this note, while the DB and ASM can share a home, they should not in order to preserve the ability to upgrade ASM separately from the database home. Simply, ASM should always be at or above the highest database software version, but that is another post.

Ok, now that I am back from that tangent. I have had users call me in a panic when they run ASM scripts in a database instead of the ASM instance. When querying v$asm_disk and checking on the physical disk/luns, it will always report 100% used:

  SELECT
      NVL(a.name, '[CANDIDATE]')                       disk_group_name
    , b.path                                           disk_file_path
    , b.total_mb                                       total_mb
    , (b.total_mb - b.free_mb)                         used_mb
    , ROUND((1- (b.free_mb / b.total_mb))*100, 2)      pct_used
  FROM
      v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
  ORDER BY
     a.name;
Disk Group Name Path              File Size (MB) Used Size (MB) Pct. Used
--------------- ----------------- -------------- -------------- ---------
DGROUP1         ORCL:VOL001               33,791         33,791    100.00
                ORCL:VOL002               33,791         33,791    100.00
                ORCL:VOL003               33,791         33,791    100.00
                ORCL:VOL008               33,791         33,791    100.00
                ORCL:VOL005               33,791         33,791    100.00
                ORCL:VOL006               33,791         33,791    100.00
                ORCL:VOL007               33,791         33,791    100.00
                ORCL:VOL004               33,791         33,791    100.00
***************                   -------------- --------------
                                         270,328        270,328
 
                                  -------------- --------------
Grand Total:                             270,328        270,328

Which can induce great panic as it appears that ASM is full, but when properly run from ASM:

Disk Group Name Path              File Size (MB) Used Size (MB) Pct. Used
--------------- ----------------- -------------- -------------- ---------
DGROUP1         ORCL:VOL001               33,791         24,925     73.76
                ORCL:VOL002               33,791         24,911     73.72
                ORCL:VOL003               33,791         24,910     73.72
                ORCL:VOL008               33,791         24,921     73.75
                ORCL:VOL005               33,791         24,921     73.75
                ORCL:VOL006               33,791         24,905     73.70
                ORCL:VOL007               33,791         24,932     73.78
                ORCL:VOL004               33,791         24,919     73.74
***************                   -------------- --------------
                                         270,328        199,344
 
                                  -------------- --------------
Grand Total:                             270,328        199,344

Some users, including my own developers, need to know how much space is available on the diskgroup from within their own database. It was thought that we had to query asm via SQL*net or have ASM populate a file to be brought into the database to determine how much space was available. This idea has several problems, it is clumsy and complicated, and it is not up to date and must be refreshed often. I noticed, however, what while you cannot query v$asm_disk and get accurate results, you can query v$asm_diskgroup and get accurate results from any database.

  SELECT
      name                                     group_name
    , type                                     type
    , total_mb                                 total_mb
    , (total_mb - free_mb)                     used_mb
    , ROUND((1- (free_mb / total_mb))*100, 2)  pct_used
  FROM
      v$asm_diskgroup
  ORDER BY
     name;
 
Disk Group
Name             Type   Total Size (MB) Used Size (MB) Pct. Used
---------------- ------ --------------- -------------- ---------
DGROUP1          EXTERN         270,328        199,344     73.74
                        --------------- --------------
Grand Total:                    270,328        199,344

This query will report the same data from both the ASM instance as well as any databases on the same server. The interesting thing is, even if the database is NOT using ASM, it will report as above since every (10g or 11g) database on a host communicates with ASM via its own “mini-cluster” (ocssd) if not RAC and the RAC cluster if it is.

No Comments

TOD – Recovering a database past the current controlfile

Unless you have a great many database backups to manage, most DBAs using RMAN use the control file as a catalog instead of a catalog database. This simplifies management, but can make some types of recovery difficult. In a typical non-catalog RMAN recovery, the control file is restored and the instance is started for an RMAN restore and recovery. This method will provide a complete recovery to the point that the control file was backed up.

What happens, however, if that control file is lost? You can recover to the previous day’s backup (if you have it), but subsequent backups of the archive logs and then a full backup will be stored in a new control file backup. This is exactly what happened at a client of mine a few weeks ago. The backup script failed to back up the final archive log after the full backup. Due to Murphy’s Law, that was the exact backup that was needed to restore after a storage issue. Without the final archive log, the database threw the classic error:

ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: ‘+DGROUP1/dmlive/datafile/system.294.689876999′

I had no previous copy of the control file except that of the night before, I did, however, have all of the archive logs in an RMAN backup set. But how to tell the control file that they exist? With a catalog database, all the backupsets already recorded, but the control file method is only as good as its last writing by RMAN. There is a method, however.

Using the previous nights control file, use RMAN to restore and recover all database and archivelogs. Do NOT try to open the database. Find the filename of the backupset that contains the archivelogs that are needed, and run the following command withing RMAN:

catalog backuppiece 
       '/backups/DMLIVE_9497_1_689621187_2009_Mon_full';

Once this is complete, restart the recovery using the same control file to roll forward through all the newly cataloged archive logs. Once complete, open the database with the resetlogs option and you are all set.

Happy recoveries!

No Comments