Monday, June 3, 2013

Remote Trace Slowing Down Server

Recently, I was troubleshooting a server where users were complaining about slow response times. Some folks had already looked over the server and didn't find anything. When looking at task manager, the CPU and memory looked fine.

However, when you tried to connect SQL Server Management Studio it took nearly 20 seconds to connect, open a query, and run a simple DMV select query.

So I decided to look at the SQL Server wait stats; they showed trace-related types as the top three wait types.

Remote Trace Total Waits

This is a fairly typical query, but it focuses on the total wait time. I was curious how long each wait was taking. So I worked up a new query that gave me the average wait times.

Remote Trace Average Waits

This not only confirmed that the trace was causing things to slow down, but it also showed that the application was experiencing nearly 10 second wait times. Turns out, there was a piece of auditing software in the mix that was writing a trace to a file share on a remote virtual server.

Going one step further, I adjusted the query one more time to see which were the top resource waits. You either have signal waits, or resource waits. Signal waits indicate CPU pressure. Resource waits mean the database is waiting on a resource.

Remote Trace Resource Waits

After much nail biting and cajoling we were able to redefine the trace to write to a local drive. Once we did this, the wait times dropped immediately, and the application responsiveness returned to what it had been previously.

Remote Trace Fixed Waits

When looking at SQL Server wait stats, be sure to slice and dice the numbers a few different ways to make sure you are seeing the complete picture. Don't just look at the total wait times, also check the average wait times, resource waits, and signal waits.


Here is the final query...


[sql]

-- Steven Ormrod
-- 2/25/13
-- waits - top waits
-- Version: 3.3
-- Source: http://stevenormrod.com/
-- License: Creative Commons. Attribution-NonCommercial CC BY-NC
-- http://creativecommons.org/licenses/by-nc/3.0/

/*
-- clear wait stats
dbcc sqlperf('sys.dm_os_wait_stats', clear)
*/-- wait stats, cumulative, filtered

select
wait_type,
waiting_tasks_count as 'count',
wait_time_ms as 'total wait ms',
max_wait_time_ms as 'max wait ms',
cast(wait_time_ms * 1.0 / waiting_tasks_count as numeric(10, 2)) as 'avg wait ms',
wait_time_ms - signal_wait_time_ms as 'resource wait ms',
signal_wait_time_ms as 'signal wait ms',
cast(100.0 * wait_time_ms / sum(wait_time_ms) over() as numeric(5,2)) as 'percent wait time',
cast(100.0 * (wait_time_ms - signal_wait_time_ms) / sum(wait_time_ms - signal_wait_time_ms) over() as numeric(5,2)) as 'percent resource time',
cast(100.0 * signal_wait_time_ms / sum(signal_wait_time_ms) over() as numeric(5,2)) as 'percent signal time'
into #waits
from sys.dm_os_wait_stats
where waiting_tasks_count > 0
-- filter the 'harmless' waits
and wait_type not in (
'BAD_PAGE_PROCESS', 'BROKER_EVENTHANDLER', 'BROKER_RECEIVE_WAITFOR', 'BROKER_TASK_STOP', 'BROKER_TO_FLUSH', 'BROKER_TRANSMITTER',
'CHECKPOINT_QUEUE', 'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT', 'CLR_SEMAPHORE',
'DBMIRROR_EVENTS_QUEUE', 'DBMIRRORING_CMD', 'DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT', 'FT_IFTSHC_MUTEX',
'KSOURCE_WAKEUP',
'LAZYWRITER_SLEEP', 'LOGMGR_QUEUE',
'ONDEMAND_TASK_QUEUE',
'PREEMPTIVE_OS_AUTHENTICATIONOPS', 'PREEMPTIVE_OS_GETPROCADDRESS',
'REQUEST_FOR_DEADLOCK_SEARCH',
'RESOURCE_QUEUE',
'SLEEP_BPOOL_FLUSH', 'SLEEP_SYSTEMTASK', 'SLEEP_TASK', 'SQLTRACE_BUFFER_FLUSH', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'SQLTRACE_LOCK', 'SQLTRACE_WAIT_ENTRIES',
'TRACEWRITE',
'WAITFOR',
'XE_DISPATCHER_JOIN', 'XE_DISPATCHER_WAIT', 'XE_TIMER_EVENT'
)
go

select top 5
wait_type, [avg wait ms]
-- , [max wait ms]
--, *
from #waits
order by 'avg wait ms' desc;
go

select top 5
wait_type, [percent resource time]
--, *
from #waits
order by 'percent resource time' desc;
go

drop table #waits
go

[/sql]