Wednesday, December 21, 2011

Who Done It?

Have you ever had a production application or database go offline for no apparent reason? What would you do?

I recently encountered a situation with a client where a production application was not running and the database appeared to be inaccessible. Predictably enough, the developers were surrounding the DBA trying to get him to restart SQL services or possibly the entire server. It fixed it last time…

sp_who2

sp_who2 had been run, but would not return positive results. Instead, it gave output similar to the following:
Msg 952, Level 16, State 1, Procedure sp_who2, Line 89
Database 'DATABASENAME' is in transition. Try the statement
later.

Then, when you tried to expand the view for this database in SSMS you would get the following error:

Database Not Accessible

dbcc checkdb had been run, and gave these results:
Msg 2520, Level 16, State 5, Line 1
Could not find database 'DATABASENAME'. The database either does
not exist, or was dropped before a statement tried to use it.
Verify if the database exists by querying the sys.databases
catalog view.

They were getting ready to look at restoring from a backup. Wait a second hoss…

DMVs to the Rescue

When I was asked to look at this, I went straight to the DMVs for help.  In fact, if you go back and read the error message, it tells you to do what I'm about to do next.

First, I queried sys.databases to verify the status of this database:

[sql]

-- check status of databases
select name, state_desc
from sys.databases
where name = 'DATABASENAME';
go

[/sql]

This confirmed for me that the database was present, but offline.

Suspicious Minds

I have to admit, I was suspicious of the developers; they kept letting out little hints that they knew the database was offline before anyone had looked at it. So I did a little more digging…

Next, I ran another DMV query to find out what activity was happening on the server:

[sql]

-- current activity on the server
select r.session_id, s.host_name, s.program_name, s.login_name,
DB_NAME(r.database_id) as 'Database', r.status,
r.command, r.percent_complete, st.text
from sys.dm_exec_requests r
join sys.dm_exec_sessions s
on r.session_id = s.session_id
cross apply sys.dm_exec_sql_text(r.sql_handle) st
where s.is_user_process = 1
and r.session_id <> @@SPID
order by r.session_id;
go

[/sql]

These results showed me that someone had just issued the following command:

[sql]

-- the offending command
alter database [DATABASENAME] set offline;
go

[/sql]

Connecting the Dots

I was able to tell that a user had run this command from a desktop computer, using SSMS, and was logged in with the application’s SQL login credentials.  Suddenly I'm thinking of CLUE...

The reason we were getting the ‘in transition’ error is because several developers had multiple sessions connected and were blocking each other from finishing.

Once we had this basic information, we were able to have the Network and Server teams track down who the user was. At this point in time, all the developers that were surrounding the DBA disappeared and disavowed all knowledge of anyone running anything.

Kill 'Em All, Let the DBA Sort 'Em Out

Next, we killed the offending session:

[sql]

-- kill the offending session
kill SESSIONID;
go

[/sql]

After the session was killed, it was time to check the consistency of the database.

Run checkdb:

[sql]

-- check database consistency
dbcc checkdb('DATABASENAME')
with all_errormsgs, no_infomsgs;
go

[/sql]

Everything looked good, so we were ready to bring the database back online:

[sql]

-- bring the database back online
alter database [DATABASENAME] set online;
go

[/sql]

And of course, run a new FULL backup and save off the existing FULL and LOG backups in case they were needed later.

Post Mortem

In the post mortem, we discovered a few things...

The login passwords for DEV and PRD were set to the same value. This allowed the developers to obtain front door access to the production servers. Needless to say, this was immediately changed.

This also prompted some interesting discussions around security best practices and the need to verify backup files in case you ever need them.

Remember, disaster can strike at any moment. The key is to remain calm and work your way through the problem. Take every setback as a learning experience, and use it to make the customer’s environment better.

Forgiveness

Also, resist the urge to throw someone under the bus. Rather, try to help each other understand the reasons for security and separation of duties and environments. Look at situations like this as a learning opportunity, not a burning opportunity.

2 comments:

  1. i check all process but no process shown in sysprocesses and Sp_who/sp_who2.
    I had check database id from.
    select * from sys.databases
    then
    select * from sys.sysprocesses WHERE dbid = 84

    SELECT * FROM sys.dm_exec_requests WHERE database_id = 84

    select * from sys.dm_tran_locks where resource_database_id=84
    After that you will get procees id.

    then kill corresponding session and check database status.

    ReplyDelete
  2. Big thx. Great blog.

    ReplyDelete