Data Corruption is something that no one really wants. Yet every now and then, we see some customers run into this. Any such issues are almost always the result of disk or SAN level issues. You will see errors like this:

SQL1655C The operation could not be completed due to an error accessing data on disk

The options to get to a consistent state are very intricate and involve a lengthy process.
One of the DB2 gurus I work with compared it to building your own ladder and climbing one step at a time versus taking an elevator (backup and restore). Anyways, here’s what this whole process entails. Please go grab a beverage or something while you read this..

First, you need to run a db2 inspect to find out which objects are in fact corrupted within the database

db2 connect to sample
db2 inspect check database results keep sample_inspect

the output from this will be written to where ever your db2dump is located. cd to that location

cd /db2inst1/sqllib/db2dump/

then use the db2instpf to analyze this. I used basic awk. You can choose to do this however you like:

cp sample_inspect t1
db2inspf t1 t1.txt -e -w
grep "Error: In page" t1.txt | awk '{print $10}' | sort | uniq

this will give you a unique set of results with the object ID’s that are corrupted. Then you can use the following db2 select statement to find the names of those objects.

db2 "select char(tabname,20), char(tabschema,20) from syscat.tables where tableid=20 and tbspaceid=4"

After you have done this for all the object ID’s that you found on the inspect output, you will have a list of what you are working with here. We found 12 tables and considered ourselves fortunate that it was just that.

Now you need to use db2dart to export data from these objects:

db2dart sample /ddel
The following is an example output generated by the previous command:
Connecting to Buffer Pool Services...

Table object data formatting start.
Please enter
Table ID or name, tablespace ID, first page, num of pages:
260,3,0,0

11 of 11 columns in the table will be dumped.
Column numbers and datatypes of the columns dumped:
0 SMALLINT
1 INTEGER
2 DECIMAL
3 FLOAT
4 REAL -SINGLE PRECISION FLOATING-POINT
5 BIGINT
6 CHAR() -FIXED LENGTH CHARACTER STRING
7 VARCHAR() -VARIABLE LENGTH CHARACTER STRING
8 DATE
9 TIME
10 TIMESTAMP
Default filename for output data file is TS3T260.DEL,
do you wish to change filename used? y/n
n

Filename used for output data file is TS3T260.DEL. If existing file, data will be appended to it.

Formatted data being dumped ...
Dumping Page 0 ....
Dumping Page 1 ....
Dumping Page 2 ....
...
Dumping Page 832 ....
Dumping Page 833 ....
Dumping Page 834 ....
Dumping Page 835 ....
Dumping Page 836 ....
Dumping Page 837 ....
Dumping Page 838 ....
Dumping Page 839 ....
Table object data formatting end.

Note that any data that db2dart will dump will exclude the pages where the corruption actually existed, so there will likely be missing data.

The next step is to take a db2look output for all the objects that we initially identified. Make sure you use all the options to grab all associated grants etc. Here’s what I used:

db2look -d sample -t <tabname> -e -xd 

After this, you need to drop these objects one at a time, recreate empty tables and load with the data from the db2dart dumps.
We got to this point and were fairly confident we can get to a point where the application can be brought back up with no issue. Then I got an error while trying to drop the 10th table on my list.

db2 drop table db2inst1.PGMFILE
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL1655C The operation could not be completed due to an error accessing data
on disk. SQLSTATE=58030

I tried to do a load from /dev/null to truncate it so i can load, but that didn’t work. Then I tried to rename the table and leave it alone and create an empty table so I can load it with the db2dart dump. But that can’t be done as well. The problem with doing anything with this table is that if the table header and such is corrupted, db2 does not know how to drop it.

To be able to overcome this problem, we contacted IBM, just in case they had a solution for this issue. A really smart IBM specialist got back really quick and proposed to do this:

db2dart <db-name> /MT /OI <obj-id> /TSI <tablespace-id> /PW <service-password>

Please know that to be able to do this, you will have to contact them so they can create a service password for you. They will need db2level info to be able to do this.
This command essentially puts the table in drop pending state:

$ db2dart gen9 /MT /OI 781 /TSI 4 /PW xyz
Mark table phase start.
Set drop-pending (unavailable) state for table objects.
Set state for DAT object 781 tablespace 4.
Modification for page (obj rel 0, pool rel 2720) of pool ID (4) obj ID (781), written out to disk successfully.
Modification written out successfully.
Set state for INX object 781 tablespace 3.
Modification for page (obj rel 0, pool rel 2720) of pool ID (3) obj ID (781), written out to disk successfully.
Modification written out successfully.
Mark table phase end.

The requested DB2DART processing has completed successfully!
Complete DB2DART report found in:
/lawprodrecover/db2inst1/sqllib/db2dump/DART0000/GEN9.RPT

As you can see, this was a special command which is used with a service password that they created for us to be able to do this. The password is valid only for 7 days. Anyways, I dropped the PGMFILE table, recreated it and loaded it with the db2dart dump that we had.
That’s it. When you are done with this process for all the corrupt objects that you identified, your database should be in good shape.

As a workaround, I also created an empty database, called samplenew, then used the db2move utilitity to export the entire original database and then loaded the new database using db2move. The tables that are corrupted will not be loaded, since db2move most likely wasn’t able to export them. You will need the db2dart dumps to load those tables.