Thursday, May 21, 2015

VI “Cheat” Sheet

VI “Cheat” Sheet

File management commands:
:w name Write edit buffer to file name
:wq Write to file and quit
:q! Quit without saving changes
ZZ Same as :wq
:sh Execute shell commands (<ctrl>d)

Window motions:
<ctrl>d Scroll down (half a screen)
<ctrl>u Scroll up (half a screen)
<ctrl>f Page forward
<ctrl>b Page backward
/string Search forward
?string Search backward
<ctrl>l Redraw screen
<ctrl>g Display current line number and file information
n Repeat search
N Repeat search reverse
G Go to last line
nG Go to line n
:n Go to line n
z<CR> Reposition window: cursor at top
z. Reposition window: cursor in middle
z- Reposition window: cursor at bottom

Cursor motions:
H Upper left corner (home)
M Middle line
L Lower left corner
h Back a character
j Down a line
k Up a line
^ Beginning of line
$ End of line
l Forward a character
w One word forward
b Back one word
fc Find c
; Repeat find (find next c)
Input commands (end with Esc):
a Append after cursor
i Insert before cursor
o Open line below
O Open line above
:r file Insert file after current line

Change commands (Input mode):
cw Change word (Esc)
cc Change line (Esc) - blanks line
c$ Change to end of line
rc Replace character with c
R Replace (Esc) - typeover
s Substitute (Esc) - 1 char with string
S Substitute (Esc) - Rest of line with text
. Repeat last change

Changes during insert mode:
<ctrl>h Back one character
<ctrl>w Back one word
<ctrl>u Back to beginning of insert

Deletion commands:
dd or ndd Delete n lines to general buffer
dw Delete word to general buffer
dnw Delete n words
d) Delete to end of sentence
db Delete previous word
D Delete to end of line
x Delete character

Recovering deletions:
p Put general buffer after cursor
P Put general buffer before cursor
Undo commands:
u Undo last change
U Undo all changes on line

Regular expressions (search strings):
^ Matches beginning of line
$ Matches end of line
. Matches any single character
* Matches any previous character
.* Matches any character

Tuesday, May 19, 2015

Configure RMAN

How to configure Oracle RMAN backup for the first time

Step # 1: Connect to Target database(Target DB: The database on which Backup & Recovery to be performed) as sysdba.

[oracle@centos ~]$ sqlplus "/ as sysdba"

SQL*Plus: Release 11.2.0.1.0 Production on Fri Jan 3 11:28:24 2014
Copyright (c) 1982, 2009, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>

Step # 2: Ensure the database has been configured with ARCHIVELOG mode or not?

SQL> select log_mode from v$database;

LOG_MODE
------------
NOARCHIVELOG

Database is in NOARCHIVELOG mode.

Step # 3: If the database has been configured with ARCHIVELOG mode then skip the Step number 3 to 6, If not then Shutdown the database.

SQL> shutdown immediate;

Database closed.
Database dismounted.
ORACLE instance shut down.

Step # 4: Startup the database in mount state.

SQL> startup mount;

ORACLE instance started.
Total System Global Area 308981760 bytes
Fixed Size 2212896 bytes
Variable Size 163580896 bytes
Database Buffers 138412032 bytes
Redo Buffers 4775936 bytes
Database mounted.

Step # 5: Configure database in ARCHIVELOG mode.

SQL> alter database archivelog;

Database altered.

Step # 6: Alter database to open state.

SQL> alter database open;

Database altered.
SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ WRITE

Step # 7: Ensure ARCHIVELOG destination.

SQL> archive log list

Database log mode Archive Mode
Automatic archival Enabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 2
Next log sequence to archive 4
Current log sequence 4

In case you wish to change default archive log destination then issue the following command.

SQL> alter system set log_archive_dest_1='location=/home/oracle/arch' scope=both;

System altered.
SQL> archive log list
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /home/oracle/arch
Oldest online log sequence 2
Next log sequence to archive 4
Current log sequence 4

Step # 8: Ensure the flash/fast recovery area location.

SQL> show parameter db_recovery_file_dest

Step # 9: Connect to RMAN prompt with target database.

[oracle@centos ~]$ rman target /

Recovery Manager: Release 11.2.0.1.0 - Production on Fri Jan 3 11:46:22 2014
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
connected to target database: ORCL (DBID=1363580714)

RMAN>

Step # 10: Configure RMAN with controlfile auto-backup feature that will be auto-backup controlfile in case of major changes done in database.

RMAN> configure controlfile autobackup on;

Step # 11: To enable backup optimization run the following command, by default backup optimization has been configured OFF.

RMAN> configure backup optimization on;

Step # 12: Configure retention policy for backup.

RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;

Step # 13: Connect to the recovery catalog database(RMAN Repository) & Create a tablespace to store RMAN catalog database objects.

[oracle@centos ~]$ sqlplus "/ as sysdba"

SQL> select global_name from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
CATALOGD

SQL> create tablespace catalogtbs datafile '/home/oracle/dbfile/catalogtbs1.dbf' size 100M autoextend on maxsize unlimited;

Tablespace created.

Step # 14: Create a RMAN user, assign RMAN tablespace to RMAN user as a default & grant recovery catalog owner,connect & resource privileges to RMAN user.

SQL> create user recoveryman identified by recoveryman;

User created.

SQL> alter user recoveryman default tablespace catalogtbs temporary tablespace temp;

User altered.

SQL> grant recovery_catalog_owner to recoveryman;

Grant succeeded.

SQL> grant connect,resource to recoveryman;

Grant succeeded.

Step # 15: Connect to RMAN on target and recovery catalog database.

[oracle@oracle ~]$ rman target / catalog recoveryman/recoveryman@catalogdb

Recovery Manager: Release 11.2.0.1.0 - Production on Sat Jan 4 14:30:28 2014
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
connected to target database: ORCL (DBID=1363580714)
connected to recovery catalog database

Step # 16: create catalog by issuing the following command in RMAN prompt.

RMAN> create catalog;

recovery catalog created

Step # 17: After creating catalog, Ensure RMAN repository tables by logging into repository database as RMAN user. This is only for the first time.

[oracle@oracle ~]$ sqlplus "recoveryman/recoveryman@catalogdb"

SQL> show user;

USER is "RECOVERYMAN"

SQL> select table_name from user_tables;

Step # 18: Register database with recovery catalog.

RMAN> register database;

database registered in recovery catalog

starting full resync of recovery catalog

full resync complete

Step # 19: Check whether registration was successful.

RMAN> report schema;

Report of database schema for database with db_unique_name ORCL

List of Permanent Datafiles

===========================

File Size(MB) Tablespace RB segs Datafile Name

---- -------- -------------------- ------- ------------------------

1 670 SYSTEM YES /home/oracle/app/oracle/oradata/orcl/system01.dbf

2 490 SYSAUX NO /home/oracle/app/oracle/oradata/orcl/sysaux01.dbf

3 30 UNDOTBS1 YES /home/oracle/app/oracle/oradata/orcl/undotbs01.dbf

4 5 USERS NO /home/oracle/app/oracle/oradata/orcl/users01.dbf

List of Temporary Files

=======================

File Size(MB) Tablespace Maxsize(MB) Tempfile Name

---- -------- ----------- --------- -------------------------------

1 20 TEMP 32767 /home/oracle/app/oracle/oradata/orcl/temp01.dbf

RMAN> LIST INCARNATION OF DATABASE;

List of Database Incarnations

DB Key Inc Key DB Name DB ID STATUS Reset SCN Reset Time

------- ------- -------- ---------------- ------------ --- --------

89 102 ORCL 1363580714 PARENT 1 15-AUG-09

89 90 ORCL 1363580714 CURRENT 945184 02-JAN-14

Target database is registered with the RMAN.

Now you can backup your target(registered) database as per your convenience.

Monday, May 18, 2015

Undo Space Explained

Oracle Database Undo space explained

Undo something

The Undo tablespace is used for several features: ROLLBACK, READ CONSISTENCY and FLASHBACK technology.

Rollback

Rollback is easy to understand, if you are not happy with some data modifications, you want to ‘undo’ it: Rollback.
The original (non modified) information within a transaction is stored in a separate Undo tablespace, because the database is designed for COMMIT to be fast, not rolling back.

Read Consistency

Another mechanism Undo information is used for is Read Consistency, which means if you run a query at 9:00 for 10 minutes, you want all the data to be from 9:00. You don’t want it to read data that has been modified at 9:02 and 9:06 or data that hasn’t been committed yet.
So, to support Read Consistency, Oracle must keep the original data (committed or not) for these 10 minutes until the query is finished.
The problem is, you actually don’t know how long the query will run for, so the general rule is to set this ‘keep-old-data-period’ to the longest running query. This is because you also want your longest running query to read consistent data.
This ‘keep-old-data-period’ is called ‘UNDO_RETENTION’ and defaults to 900 seconds, which means the database tries to keep all old changed information for 900 seconds.

Flashback

Some Oracle features are build based upon using Undo information, meaning undo is more utilized.
Because ‘old’ data is stored for a certain time (UNDO_RETENTION), one can access this information to have look at data back in time by using FLASHBACK features: ‘How did the contents of this table looked like ten minutes ago?’. This information can be used for recovery from user-errors.
Flashback features using Undo are:

Flashback Query (based on time)
Flashback Versions Query (based on SCN)
Flashback Transaction Query (based on period)
Flashback Table (based on time)

Flashback Drop and Flashback Database do not use Undo information. Flashback Drop is using ‘not yet recycled segment and extents’ and Flashback Database is a separate mechanism using the Flash/Fast Recovery Area, by taking ‘snapshots’ and redo information.

Undo Lifetime

Undo information has different states during it’s lifecycle, depending on running transactions and retention settings.
There are three states or types of extents in the Undo tablespace: ACTIVE, EXPIRED and UNEXPIRED. Oracle is still using Rollback segments, but with Automatic Undo Management these are completely controlled by Oracle.

ACTIVE

Active undo extents are used by transactions and will always be active, because they are needed for Rollback. The UNDO_RETENTION setting is not used here, because one can not say something like: ‘after 900 seconds you are not allowed to rollback anymore…’
You will get ‘ORA-30036 unable to extend segment in Undo tablespace‘ errors when no more space is left to store ACTIVE Undo. This will automatically rollback the transaction causing it. The NOSPACEERRCNT column in V$UNDOSTAT is a good indication how many times this has occurred.

EXPIRED

Expired extents are not used by transactions, the data in these extends is committed and the UNDO_RETENTION time has passed, so it is not needed for Read Consistency.

UNEXPIRED

Unexpired extents are non-active extents that still honour UNDO_RETENTION. The transactions belonging to these undo extents are committed, but the retention time has not passed: You still want/need these for Read Consistency!
When the Undo mechanism requires more extents for ACTIVE extents, it is allowed to steal UNEXPIRED extents when there are no EXPIRED extents left for reuse and it can not allocate more free extents (autoextend maxsize reached or fixed tablespace size). One can check the steal-count in UNXPSTEALCNT in V$UNDOSTAT.
You will get ‘ORA-01555 snapshot too old‘ errors if no Read Consistency information for a query is available. The SSOLDERRCNT in V$UNDOSTAT will show a count of these errors.

Summary

Active undo is used by active transactions: rollback and read consistency.
Expired undo is old and can be reused.
Unexpired undo is used for read consistency, retention time has not passed yet.
Unexpired undo can be stolen for Active undo. If this is happening you can get ORA-01555 before Undo retention has passed.
Unexpired undo can be secured by setting the RETENTION GUARANTEE option when creating the Undo tablespace (see Undo Sizing).

Undo extent status examples

With the next query you go through the contents of the Undo tablespace and sum the extent types:

select status,
  round(sum_bytes / (1024*1024), 0) as MB,
  round((sum_bytes / undo_size) * 100, 0) as PERC
from
(
  select status, sum(bytes) sum_bytes
  from dba_undo_extents
  group by status
),
(
  select sum(a.bytes) undo_size
  from dba_tablespaces c
    join v$tablespace b on b.name = c.tablespace_name
    join v$datafile a on a.ts# = b.ts#
  where c.contents = 'UNDO'
    and c.status = 'ONLINE'
);

It will sum the three types of extents and shows the distribution of them within the Undo tablespace. ‘Free’ extents are not shown.

‘Normal’ operation

STATUS            MB       PERC
--------- ---------- ----------
ACTIVE            10          4
EXPIRED          110         43
UNEXPIRED         25         10

This is an example of ‘normal’ contents of the Undo tablespace. The system is using ACTIVE extents, some are UNEXPIRED used for read consistency and there are EXPIRED extents which can be reused.

Out of Free/EXPIRED extents

STATUS            MB       PERC
--------- ---------- ----------
ACTIVE           230         90
EXPIRED            0          0
UNEXPIRED         26         10

When the system is under load and the EXPIRED extents are near 0%, the total of ACTIVE and UNEXPIRED is near 100% and the Undo tablespace is not able to extend, Oracle will steal UNEXPIRED extents for ACTIVE extents. If this is the case you might expect ORA-01555 errors, because Undo retention can not be met.

Out of Undo space

STATUS            MB       PERC
--------- ---------- ----------
ACTIVE           255        100
EXPIRED            0          0
UNEXPIRED          1          0

When the system is under load and the ACTIVE extents are near 100%, the total of EXPIRED and UNEXPIRED is near 0% and the Undo Tablespace is not able to extend, Oracle is not able to allocate free extents or steal UNEXPIRED extents for ACTIVE extents. If this is the case you might expect ORA-30036 errors.

Retention to large or UNDO to small?

STATUS            MB       PERC
--------- ---------- ----------
ACTIVE             2          1
EXPIRED            0          0
UNEXPIRED        254         99

In this case, all undo extents are used for the retention period. It might be the retention is to large, or the UNDO tablespace is to small. A DBA must investigate this and take a decision!

Undo Sizing

Storing undo data for a certain amount of time will need space and based on the activity on the database system, it is written at a certain ‘rate’.
From this you can deduct an equation: RATE x RETENTION = SPACE. Some overhead must be added, but that varies between database versions used and data types stored.
If you look at the undo equation, the Undo tablespace size or the retention time can be fixed. A fixed rate can not be set, because it depends on database load.
Since Oracle 10g, the database will be more efficient if the same record is updated more than once in a transaction, it will re-use those ACTIVE extents.

Fixed Size

When the Undo tablespace size is fixed (datafile autoextend=NO), Oracle tunes the Retention Time for the amount of Undo data it is generating to fit into the Undo tablespace. The UNDO_RETENTION parameter will now be used as a minimum, but may automatically be tuned larger when enough space is available.
One can check the tuned Undo retention time in V$UNDOSTAT, using the TUNED_UNDORETENTION column.
In Oracle 9i, it seems Oracle is not actually tuning this, but is only trying to maintain the Undo retention time. Also the TUNED_UNDORETENTION column is absent in 9i.
When you choose the Undo tablespace to be fixed, you can use the Undo Advisor to estimate the needed sizing.

Fixed Size, out of UNEXPIRED extents? Check TUNED_UNDORETENTION!

STATUS            MB       PERC
--------- ---------- ----------
ACTIVE             2          1
EXPIRED            0          0
UNEXPIRED        254         99

Because Oracle is able to extend the retention time, more UNEXPIRED extents are created. In this case, if the Undo tablespace is full, check the TUNED_UNDORETENTION against UNDO_RETENTION. If the tuned retention is much larger, 99% full does not mean a problem!
Take a look at the following query, it will calculate the UNDO total with the following assumption: ACTIVE takes what is needs, EXPIRED ‘is empty’ and UNEXPIRED will be re-calculated against the division of UNDO_RETENTION/TUNED_UNDORETENTION.

BREAK ON REPORT
COMPUTE SUM OF MB ON REPORT
COMPUTE SUM OF PERC ON REPORT
COMPUTE SUM OF FULL ON REPORT

select status,
 round(sum_bytes / (1024*1024), 0) as MB,
 round((sum_bytes / undo_size) * 100, 0) as PERC,
 decode(status, 'UNEXPIRED', round((sum_bytes / undo_size * factor) * 100, 0),
                'EXPIRED',   0,
                             round((sum_bytes / undo_size) * 100, 0)) FULL
from
(
 select status, sum(bytes) sum_bytes
 from dba_undo_extents
 group by status
),
(
 select sum(a.bytes) undo_size
 from dba_tablespaces c
 join v$tablespace b on b.name = c.tablespace_name
 join v$datafile a on a.ts# = b.ts#
 where c.contents = 'UNDO'
 and c.status = 'ONLINE'
),
(
 select tuned_undoretention, u.value, u.value/tuned_undoretention factor
 from v$undostat us
 join (select max(end_time) end_time from v$undostat) usm
    on usm.end_time = us.end_time
 join (select name, value from v$parameter) u
    on u.name = 'undo_retention'
);

When running this query, the next result will show when UNDO_RETENTION = 900 and TUNED_UNDORETENTION is about 1800 seconds:

STATUS            MB       PERC       FULL
--------- ---------- ---------- ----------
ACTIVE             2          1          1
EXPIRED            0          0          0
UNEXPIRED        254         99         50
          ---------- ---------- ----------
sum              256        100         51

Unexpired at 99% is not really a problem here, because the tuned retention is twice as large as the desired retention!
Since 10gR2, a maximum retention is introduced. The longest period of tuned undo I have seen is 96 hours. Automatic tuning retention can also be turned off using the hidden ‘_undo_autotune=false’ parameter (don’t use until Oracle suggested this hidden parameter). See also My Oracle Support Note: Full UNDO Tablespace In 10gR2 [ID 413732.1].

Fixed/Auto Retention

If the Undo tablespace is configured with the autoextend option for the data files, Oracle sets the Retention Time to the time it takes for the longest-running query to run. This can result in a large Undo tablespace if there are un-tuned queries running on your system.
Again in 9i, even though it is called Automatic Undo Management, UNDO_RETENTION parameter seems always ‘fixed’, but it does mean you don’t have to bother about Rollback Segments.

Shrink Undo tablespace

The Undo tablespace can only grow larger, but it can not shrink by itself. If you want to shrink the Undo tablespace, create a new one and set the UNDO_TABLESPACE parameter to the new Undo tablespace.

Retention Guaranteed

When you create the Undo tablespace with the RETENTION GUARANTEE option, UNEXPIRED Undo information will never get stolen. Set this if you want to guarantee Read Consistency or when you want to use Flashback with a guaranteed point-in-time!
Beware that when this is set, the chance of ORA-30036 errors increases. It’s your choice: ORA-30036 or ORA-01555…

Setting the UNDO_RETENTION parameter to the longest running query

A good practice is to set the UNDO_RETENTION parameter to the longest running query, to avoid ORA-01555 (read consistency) errors. To get a good indication about the longest running query in the last 7 days, try:

select max(maxquerylen) from v$undostat;

One can also try V$SESSION_LONGOPS and V$TRANSACTION.
If you want to increase your Flashback period, take the largest of these two.

How much Undo will this generate?

Again take a look at V$UNDOSTAT and the UNDOBLKS column in particular.
Multiply these UNDOBLKS (per 10 minutes by default) times your BLOCKSIZE times the MAXQUERYLEN.
For a worst case scenario size you can calculate much undo would have been generated when you multiply the highest rate with the longest query:

select
  round(max(undoblks/600)*8192*max(maxquerylen)/(1024*1024)) as "UNDO in MB"
from v$undostat;

But, it could be your longest running query will not run when the most undo is generated…

Undo Advisor

The Undo Advisor can be found in the Oracle Enterprise Manager or by using the DBMS_ADVISOR package.

Undo Advisor

When opening the Undo advisor, it will show the current retention time and tablespace settings, but also shows analysis results, potential problems and recommendations.
These results are based on a 7 days period, analysing longest query or Flashback duration and Undo generation rates. This can be made visible thought the Undo graph.

Undo Retention Graph

New: When unchanged, this shows the current retention time setting. You can change the UNDO_RETENTION to a new value by selecting a dot on the line in the graph.
Auto-tuned Undo retention: This is the retention time Oracle can currently hold. It will use the UNDO_RETENTION as a minimum.
Best Possible Undo retention: With the current size (fixed) or maxsize (autoextend) of the Undo tablespace, this is the retention time it possibly could hold.
Oracle uses the statistics in the V$UNDOSTAT view to tune the Undo mechanism. A DBA can also use this view, together with V$ROLLSTAT to get a good indication of current workload. The DBA_HIST_UNDOSTAT view contains statistical snapshots of V$UNDOSTAT information.
All this information is based on the selected Analysis Time Period with the Undo generation rate within that period.

LOBs

Automatic Undo Management not supported for LOBs. Undo information for LOBs is not stored Undo tablespace, but in the segment itself. For LOBs, the database uses the UNDO_RETENTION as a minimum, but when space becomes a problem, the UNEXPIRED Undo information for the LOB may be reused.

Conclusion

Even though Automatic Undo Management is able to tune itself, one needs to keep an eye on transaction duration, retention time and space consumed.
With the addition of Flashback technology to the database, the Undo tablespace is now also used to recover from user errors. When Retention Guaranteed is used, more priority is given to support Read Consistency and Flashback operations, because Oracle will not steal UNEXPIRED extents.
Furthermore the Undo Advisor can be used to visualise retention time vs. space consumed, but in case of Undo related errors, a DBA still needs to analyse V$UNDOSTAT and related views to solve problems other than ‘just enlarge’ the Undo tablespace.
Information in this article is based on my own experience and derived from articles and documentation found on the internet.

ASM Rebalance Operation

When will my rebalance complete

This has to be one of the top ASM questions people ask me. But if you expect me to respond with a number of minutes, you will be disappointed. After all, ASM has given you an estimate, and you still want to know when exactly is that rebalance going to finish. Instead, I will show you how to check if the rebalance is actually progressing, what phase it is in, and if there is a reason for concern.

Understanding the rebalance

As explained in the rebalancing act, the rebalance operation has three phases - planning, extents relocation and compacting. As far as the overall time to complete is concerned, the planing phase time is insignificant so there is no need to worry about it. The extent relocation phase will take most of the time, so the main focus will be on that.I will also show what is going on during the compacting phase.

It is important to know why the rebalance is running. If you are adding a new disk, say to increase the available disk group space, it doesn't really matter how long it will take for the rebalance to complete. OK maybe it does, if your database is hung because you ran out of space in your archive log destination. Similarly if you are resizing or dropping disk(s), to adjust the disk group space, you are generally not concerned with the time it takes for the rebalance to complete.

But if a disk has failed and ASM has initiated rebalance, there may be legitimate reason for concern. If your disk group is normal redundancy AND if another disk fails AND it's the partner of that disk that has already failed, your disk group will be dismounted, all your databases that use that disk group will crash and you may lose data. In such cases I understand that you want to know when that rebalance will complete. Actually, you want to see the relocation phase completed, as once it does, all your data is fully redundant again.

Extents relocation

To have a closer look at the extents relocation phase, I drop one of the disks with the default rebalance power:

SQL> show parameter power

NAME TYPE VALUE
------------------------------------ ----------- ----------------
asm_power_limit integer 1

SQL> set time on
16:40:57 SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06;

Diskgroup altered.

Initial estimated time to complete is 26 minutes:

16:41:21 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

INST_ID OPERA STAT POWER SOFAR EST_WORK EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
3 REBAL WAIT 1
2 REBAL RUN 1 516 53736 2012 26
4 REBAL WAIT 1

About 10 minutes into the rebalance, the estimate is 24 minutes:

16:50:25 SQL> /

INST_ID OPERA STAT POWER SOFAR EST_WORK EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
3 REBAL WAIT 1
2 REBAL RUN 1 19235 72210 2124 24
4 REBAL WAIT 1

While that EST_MINUTES doesn't give me much confidence, I see that the SOFAR (number of allocation units moved so far) is going up, which is a good sign.

ASM alert log shows the time of the drop disk, the OS process ID of the ARB0 doing all the work, and most importantly - that there are no errors:

Wed Jul 11 16:41:15 2012
SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
...
NOTE: starting rebalance of group 1/0x6ecaf3e6 (DATA1) at power 1
Starting background process ARB0
Wed Jul 11 16:41:24 2012
ARB0 started with pid=41, OS id=58591
NOTE: assigning ARB0 to group 1/0x6ecaf3e6 (DATA1) with 1 parallel I/O
NOTE: F1X0 copy 3 relocating from 0:2 to 55:35379 for diskgroup 1 (DATA1)
...

ARB0 trace file should show which file extents are being relocated. It does, and that is how I know that ARB0 is doing what it is supposed to do:

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_58591.trc
...
ARB0 relocating file +DATA1.282.788356359 (120 entries)
*** 2012-07-11 16:48:44.808
ARB0 relocating file +DATA1.283.788356383 (120 entries)
...
*** 2012-07-11 17:13:11.761
ARB0 relocating file +DATA1.316.788357201 (120 entries)
*** 2012-07-11 17:13:16.326
ARB0 relocating file +DATA1.316.788357201 (120 entries)
...

Note that there may be lot of arb0 trace files in the trace directory, so that's why we need to know the OS process ID of the ARB0 actually doing the rebalance. That information is in the alert log of the ASM instance performing the rebalance.

I can also look at the pstack of the ARB0 process to see what is going on. It does show me that ASM is relocating extents (key functions on the stack being kfgbRebalExecute - kfdaExecute - kffRelocate):

# pstack 58591
#0 0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9 0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003d4beaa in kfgbRebalExecute ()
#19 0x0000000003d39627 in kfgbDriver ()
#20 0x00000000020e8d23 in ksbabs ()
#21 0x0000000003d4faae in kfgbRun ()
#22 0x00000000020ed95d in ksbrdp ()
#23 0x0000000002322343 in opirip ()
#24 0x0000000001618571 in opidrv ()
#25 0x0000000001c13be7 in sou2o ()
#26 0x000000000083ceba in opimai_real ()
#27 0x0000000001c19b58 in ssthrdmain ()
#28 0x000000000083cda1 in main ()

After about 35 minutes the EST_MINUTES dropps to 0:

17:16:54 SQL> /

INST_ID OPERA STAT POWER SOFAR EST_WORK EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
2 REBAL RUN 1 74581 75825 2129 0
3 REBAL WAIT 1
4 REBAL WAIT 1

And soon after that, the ASM alert log shows:

Disk emptied
Disk header erased
PST update completed successfully
Disk closed
Rebalance completed

Wed Jul 11 17:17:32 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:17:41 2012
GMON updating for reconfiguration, group 1 at 20 for pid 38, osid 93832
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk DATA1_CD_06_CELL06 emptied
NOTE: erasing header on grp 1 disk DATA1_CD_06_CELL06
NOTE: process _x000_+asm2 (93832) initiating offline of disk 0.3916039210 (DATA1_CD_06_CELL06) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 21 for pid 38, osid 93832
NOTE: PST update grp = 1 completed successfully
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x7e, op = clear
GMON updating disk modes for group 1 at 22 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: DATA1_CD_06_CELL06
NOTE: PST update grp = 1 completed successfully
GMON updating for reconfiguration, group 1 at 23 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: (not open) DATA1_CD_06_CELL06
NOTE: group 1 PST updated.
Wed Jul 11 17:17:41 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 24 for pid 19, osid 38421
GMON querying group 1 at 25 for pid 19, osid 38421
NOTE: Disk in mode 0x8 marked for de-assignment
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

So the estimated time was 26 minutes and the rebalance actually took about 36 minutes (in this particular case the compacting took less than a minute so I have ignored it). That is why it is more important to understand what is going on, then to know when will the rebalance complete.

Note that the estimated time may also be increasing. If the system is under heavy load, the rebalance will take more time - especially with the rebalance power 1. For a large disk group (many TB) and large number of files, the rebalance can take hours and possibly days.

If you want to get an idea how long will a drop disk take in your environment, you need to test it. Just drop one of the disks, while your system is under normal/typical load. Your data is fully redundant during such disk drop, so you are not exposed to a disk group dismount in case its partner disk fails during the rebalance.

Compacting

In another example, to look at the compacting phase, I add the same disk back, with rebalance power 10:

17:26:48 SQL> alter diskgroup DATA1 add disk '/o/*/DATA1_CD_06_celll06' rebalance power 10;

Diskgroup altered.

Initial estimated time to complete is 6 minutes:

17:27:22 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

INST_ID OPERA STAT POWER SOFAR EST_WORK EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
2 REBAL RUN 10 489 53851 7920 6
3 REBAL WAIT 10
4 REBAL WAIT 10

After about 10 minutes, the EST_MINUTES drops to 0:

17:39:05 SQL> /

INST_ID OPERA STAT POWER SOFAR EST_WORK EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
3 REBAL WAIT 10
2 REBAL RUN 10 92407 97874 8716 0
4 REBAL WAIT 10

And I see the following in the ASM alert log

Wed Jul 11 17:39:49 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:39:58 2012
GMON updating for reconfiguration, group 1 at 31 for pid 43, osid 115117
NOTE: group 1 PST updated.
Wed Jul 11 17:39:58 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 32 for pid 19, osid 38421
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

That means ASM has completed the second phase of the rebalance and is compacting now. If that is true, the pstack should show kfdCompact() function. Indeed it does:

# pstack 103326
#0 0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9 0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003c4b737 in kfdCompact ()
#19 0x0000000003c4c6d0 in kfdExecute ()
#20 0x0000000003d4bf0e in kfgbRebalExecute ()
#21 0x0000000003d39627 in kfgbDriver ()
#22 0x00000000020e8d23 in ksbabs ()
#23 0x0000000003d4faae in kfgbRun ()
#24 0x00000000020ed95d in ksbrdp ()
#25 0x0000000002322343 in opirip ()
#26 0x0000000001618571 in opidrv ()
#27 0x0000000001c13be7 in sou2o ()
#28 0x000000000083ceba in opimai_real ()
#29 0x0000000001c19b58 in ssthrdmain ()
#30 0x000000000083cda1 in main ()

The tail on ARB0 trace file now shows relocating just 1 entry at the time (another sign of compacting):

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_103326.trc
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
...

The V$ASM_OPERATION keeps showing EST_MINUTES=0 (compacting):

17:42:39 SQL> /

INST_ID OPERA STAT POWER SOFAR EST_WORK EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
3 REBAL WAIT 10
4 REBAL WAIT 10
2 REBAL RUN 10 98271 98305 7919 0

The X$KFGMG shows REBALST_KFGMG=2 (compacting):

17:42:50 SQL> select NUMBER_KFGMG, OP_KFGMG, ACTUAL_KFGMG, REBALST_KFGMG from X$KFGMG;

NUMBER_KFGMG OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
1 1 10 2

Once the compacting phase completes, the alert log shows "stopping process ARB0" and "rebalance completed":

Wed Jul 11 17:43:48 2012
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)

In this case, the extents relocation took about 12 minutes and the compacting took about 4 minutes.

The compacting phase can actually take significant amount of time. In one case I have seen the extents relocation run for 60 minutes and the compacting after that took another 30 minutes. But it doesn't really matter how long it takes for the compacting to complete, because as soon as the second phase of the rebalance (extent relocation) completes, all data is fully redundant and we are not exposed to disk group dismount due to partner disk failure.

Changing the power

Rebalance power can be changed dynamically, i.e. during the rebalance, so if your rebalance with the default power is 'too slow', you can increase it. How much? Well, do you understand your I/O load, your I/O throughput and most importantly your limits? If not, increase the power to 5 (just run 'ALTER DISKGROUP ... REBALANCE POWER 5;') and see if it makes a difference. Give it 10-15 minutes, before you jump to conclusions. Should you go higher? Again, as long as you are not adversely impacting your database I/O performance, you can keep increasing the power. But I haven't seen much improvement beyond power 30.

The testing is the key here. You really need to test this under your regular load and in your production environment. There is no point testing with no databases running and on a system that runs off different storage system.

DB Health Check

--############# Start Verify Instance Status #############
set linesize 500
set pagesize 500
column HOST_NAME format a15
column STATUS format a8
column INSTANCE_NAME format a13
column STIME format a35
column uptime format a55
select Host_Name, Status, database_status, Instance_Name
      ,'Started At: ' || to_char(startup_time,'DD-MON-YYYY HH24:MI:SS') stime
      ,'Uptime: ' || floor(sysdate - startup_time) || ' days(s) ' ||
       trunc( 24*((sysdate-startup_time) -
       trunc(sysdate-startup_time))) || ' hour(s) ' ||
       mod(trunc(1440*((sysdate-startup_time) -
       trunc(sysdate-startup_time))), 60) ||' minute(s) ' ||
       mod(trunc(86400*((sysdate-startup_time) -
       trunc(sysdate-startup_time))), 60) ||' seconds' uptime
from v$instance
/
--############# END Verify Instance Status #############
--#
--#
--############# START Verify ASM Diskgroup Sizes #############
set linesize 132
set pagesize 100
select name,state,total_mb,free_mb,round(((TOTAL_MB - FREE_MB) / TOTAL_MB)*100) as PCT_USED from v$asm_diskgroup
where TOTAL_MB <> 0
/
--############# END Verify ASM Diskgroup Sizes #############
--#
--#
--############# START Verify FLASHBASK AREA USAGE #############
set lines 100
col name format a20
select    name
,    floor(space_limit / 1024 / 1024) "Size MB"
,    ceil(space_used / 1024 / 1024) "Used MB"
,round((floor(space_limit / 1024 / 1024) - ceil(space_used / 1024 / 1024) )/ floor(space_limit / 1024 / 1024)*100,2) as PCT_FREE
,round((ceil(space_used / 1024 / 1024) / floor(space_limit / 1024 / 1024))*100,2) as PCT_USED
from    v$recovery_file_dest
order by name
/
--############# END Verify FLASHBASK AREA USAGE #############
--#
--#
--############# START ASM Candidate Disks Check #############
set linesize 132
set pagesize 100
col disk for a20
select path DISK,OS_MB "Size in MB",header_status "Disk Type"   from v$asm_disk where header_status IN('CANDIDATE','FORMER') order by os_mb desc
/
--############# END ASM Candidate Disks Check #############
--#
--#
--############# START Verify RMAN BACKUPS #############
set pagesize 200
set linesize 200
COL STATUS FORMAT a25
COL INPUT_BYTES_DISPLAY FORMAT a20
COL OUTPUT_BYTES_DISPLAY FORMAT a20
COL hrs    FORMAT 999.99
SELECT SESSION_KEY, INPUT_TYPE, STATUS,
       TO_CHAR(START_TIME,'mm/dd/yy hh24:mi') start_time,
       TO_CHAR(END_TIME,'mm/dd/yy hh24:mi')   end_time,
       ELAPSED_SECONDS/3600                   hrs,
       INPUT_BYTES_DISPLAY,
       OUTPUT_BYTES_DISPLAY
FROM V$RMAN_BACKUP_JOB_DETAILS
where START_TIME between sysdate -30 and sysdate
ORDER BY SESSION_KEY
/
--############# END Verify RMAN BACKUPS #############
--#
--#
--############# START Tablespace Free Report #############
set pagesize 500
column tablespace_name                format a20        heading 'Tablespace'
column used_pct_of_max                format 999        heading 'Used % of Max'
column actual_free_gb                 format 999        heading 'Actual Free GB'
with
tbs_auto as
     (select distinct tablespace_name, autoextensible
      from dba_data_files
      where autoextensible = 'YES'),
files as
     (select tablespace_name, count (*) tbs_files,
             sum (bytes) total_tbs_bytes
      from dba_data_files
      group by tablespace_name),
fragments as
     (select tablespace_name, count (*) tbs_fragments,
             sum (bytes) total_tbs_free_bytes,
             max (bytes) max_free_chunk_bytes
      from dba_free_space
      group by tablespace_name),
autoextend as
     (select tablespace_name, sum (size_to_grow) total_growth_tbs
      from (select tablespace_name, sum (maxbytes) size_to_grow
            from dba_data_files
            where autoextensible = 'YES'
            group by tablespace_name
            union
            select   tablespace_name, sum (bytes) size_to_grow
            from dba_data_files
            where autoextensible = 'NO'
            group by tablespace_name)
      group by tablespace_name)
select a.tablespace_name,
       round((((files.total_tbs_bytes - fragments.total_tbs_free_bytes)/autoextend.total_growth_tbs)*100)) used_pct_of_max,
       round(autoextend.total_growth_tbs/1024/1024/1024 - round((files.total_tbs_bytes - fragments.total_tbs_free_bytes)/1024/1024/1024)) actual_free_gb
       from dba_tablespaces a, files, fragments, autoextend, tbs_auto
where a.tablespace_name = files.tablespace_name
and a.tablespace_name = fragments.tablespace_name
and a.tablespace_name = autoextend.tablespace_name
and a.tablespace_name = tbs_auto.tablespace_name(+)
-- and round((((files.total_tbs_bytes - fragments.total_tbs_free_bytes)/autoextend.total_growth_tbs)*100)) >=$pct_thresh
order by used_pct_of_max desc
/
--############# END Tablespace Free Report #############
--#
--#
--############# START Review ALERT LOG for ERRORS #############
SELECT TO_CHAR (ORIGINATING_TIMESTAMP) ORIGINATING_TIMESTAMP, MESSAGE_TEXT MESSAGE_TEXT
FROM X$DBGALERTEXT
WHERE ORIGINATING_TIMESTAMP > (SYSDATE - 1)
AND MESSAGE_TEXT LIKE '%ORA-%'
UNION ALL
SELECT TO_CHAR (ORIGINATING_TIMESTAMP) ORIGINATING_TIMESTAMP, MESSAGE_TEXT MESSAGE_TEXT
FROM X$DBGALERTEXT
WHERE ORIGINATING_TIMESTAMP > (SYSDATE - 1)
AND MESSAGE_TEXT LIKE '%Global Enqueue Services Deadlock detected%'
ORDER BY ORIGINATING_TIMESTAMP DESC
/
select ORIGINATING_TIMESTAMP, message_text from X$DBGALERTEXT
where ORIGINATING_TIMESTAMP >= sysdate -1
order by ORIGINATING_TIMESTAMP desc
/
select ORIGINATING_TIMESTAMP, message_text from X$DBGALERTEXT
where ORIGINATING_TIMESTAMP BETWEEN '03-JUL-13 7:00:00.000 AM -04:00' AND '03-JUL-13 10:30:00.000 AM -04:00'
order by ORIGINATING_TIMESTAMP
/
--############# END Review ALERT LOG for ERRORS #############