After a recent SharePoint 2007 migration exercise at a large customer, we started experiencing performance issues reported by end users when the system was under load. In the Windows Event Logs we saw the following event:

Event Type: Warning
Event Source: W3SVC-WP
Event ID: 2262
Description:
ISAPI 'C:\WINDOWS\Microsoft.NET\Framework64\v2.0.50727\aspnet_isapi.dll' reported itself as unhealthy for the following reason: 'Deadlock detected'.

IIS 6.0 would shortly recycle the web application pool, but during that period users directed to that WFE server by the load balancer weren't able to have their requests served. After looking through obvious causes, we ended up obtaining a dump file of the worker process. We used the process of orphaning a worker process but a dump file can just as easily be obtained using common tools such as IISState or IIS DebugDiag.

Once the dump file was obtained I performed the following analysis.

Start WinDBG and point it the dump file. WinDBG is part of the Debugging Tools for Windows

Since this is a SharePoint issue, I suspected that we may need to investigate both managed and native code. To assist with debugging managed code we need to do a bit of extra preparatory work:

  • Load the SOS (Son of Strike) extension: .load SOS
  • Make available a copy of mscordacwks.dll to WinDBG. Note that this has to match the version on the server where the dump file was taken from. It's easiest to just copy this from the server (from the %systemroot%\Microsoft.Net\Framework\<frameworkVersion> folder. It will have to be renamed to mscordacwks_xxx_xxx_2.0.50727.yyyy.dll - where XXX is either I386 or AMD64 depending on the platform, and yyyy is the version number (this can be found by right-clicking on the .dll file and checking the version number). WinDBG will tell you what the actual filename you need is if you get it wrong. You can just copy-n-paste that value to assist in renaming the file
  • To make available this file copy it into a folder of your choosing and add it to the WinDBG symbols path. I copied it into c:\temp, and to add that to the path type: .sympath+ c:\temp
  • Add the Microsoft public symbol server to the path (if it isn't already). To do this type: .symfix
  • Now acquire symbols for Microsoft modules in the dump by typing in: .reload

Now we can begin analysis of the dump file. The first command gives us an indication of how long threads have been running in the process:

0:031> !runaway
 User Mode Time
  Thread       Time
  13:1218      0 days 0:00:56.437
  37:138c      0 days 0:00:32.718
  31:b80       0 days 0:00:32.406
  43:1208      0 days 0:00:30.125
  33:17ec      0 days 0:00:23.578
  38:fa8       0 days 0:00:14.937
  36:147c      0 days 0:00:13.609
  40:1680      0 days 0:00:11.468
  15:153c      0 days 0:00:08.359
  44:15a4      0 days 0:00:07.968
  16:12e4      0 days 0:00:06.968
  27:1554      0 days 0:00:05.750
  25:16b0      0 days 0:00:05.656
  24:15f8      0 days 0:00:05.453
  35:17a8      0 days 0:00:05.406
  26:d98       0 days 0:00:05.359
  41:cb0       0 days 0:00:05.296
  11:d50       0 days 0:00:05.031

We can see here that we have quite a few very long running threads (there were 71 threads in total in this dump file - I have truncated the list).

The next step is to have a look at the longest running threads and see what they are doing. To change to a thread type: ~<threadnumber> s. In this case we change to thread 13 (the longest running thread) and then dump the unmanaged stack:

0:013> ~13 s
ntdll!NtWaitForSingleObject+0xa:
00000000`77ef0a3a c3              ret
0:013> k
Child-SP          RetAddr           Call Site
00000000`0309aa68 000007ff`771d5280 ntdll!NtWaitForSingleObject+0xa
00000000`0309aa70 000007ff`7731173e mswsock!WSPRecv+0x66b
00000000`0309abb0 000007ff`770f3518 ws2_32!WSARecv+0x166
00000000`0309ac60 000007ff`72499c0e wsock32!recv+0x38
00000000`0309acc0 000007ff`5fe51ae7 dbnetlib!ConnectionRead+0x4fe
00000000`0309ada0 000007ff`5fe52f2d sqloledb!CDataSource::ConnectionTransact+0xf7
00000000`0309ae20 000007ff`5feaac34 sqloledb!CDBConnection::SendPacket+0x25d
00000000`0309aee0 000007ff`5fe44b69 sqloledb!CStmt::SQLExecRPC+0x4d4
00000000`0309aff0 000007ff`5fe45df3 sqloledb!CCommand::ExecuteHelper+0x2a9
00000000`0309b090 000007ff`5fe46cc7 sqloledb!CCommand::Execute+0xc73
00000000`0309b180 000007ff`5aa0e4cc sqloledb!CImpICommandText::Execute+0x187
00000000`0309b200 000007ff`5aa0f65b oledb32!CCommandText::DoExecute+0x4fc
00000000`0309b440 00000000`0a913a72 oledb32!CCommandText::Execute+0x8ab
00000000`0309b6f0 00000000`0a84241f STSWEL!Voledb::ExecQuery+0x37e
00000000`0309b850 00000000`0a8846bd STSWEL!VdocumentStore::httpGetDocument+0xbdf
00000000`0309c850 00000000`0a886720 STSWEL!VhttpManager::loadFileCore+0x5c5
00000000`0309d0f0 00000000`3569fa9b STSWEL!VhttpManager::loadFileAndMetaInfo+0xc4
00000000`0309d1b0 00000000`356aaa8c OWSSVR!GetExtensionVersion+0x890af
00000000`0309d570 00000642`7f600887 OWSSVR!GetExtensionVersion+0x940a0
00000000`0309d8f0 00000642`806070ff mscorwks!DoCLRToCOMCall+0x177

Note: to dump the managed stack type !clrstack (this requires SOS extension to be loaded). I did dump the managed stack but it isn't relevant in this case. The oldest item on the native stack is a call from managed code into native code (mscorwks!DoCLRToCOMCall) and examining the managed stack doesn't tell us anything.

From the native stack we can see a few things (read a stack from the bottom up):

  • We have some SharePoint related components (OWSSVR and STSWEL) that are called.
  • Eventually these call into OleDb
  • OleDb then appears to call into the SQL Server OleDb Provider (sqloledb)
  • The SQL Server OleDb Provider then calls into WinSock to send a command across the network to a remote SQL Server
  • The stack ends with the Windows Socket in a receive state awaiting a response (ws2_32!WSARecv)

I examined most of the threads running in the process, and a vast majority of ASP.NET worker threads where in a similar state. Our next step is to try to figure out what SQL command or stored procedure we are calling.

To do this, I took a bit of a guess (I suppose I could have look at the definitions of the OleDb APIs etc). I suspected that a function such as oledb32!CCommandText::Execute would probably have, as one of it's parameters, the actual command to be executed (mirroring the OleDb connection and command objects' .Execute method)

Using the kb command we can get parameter information. The address of the third parameter was 00000000`0309bc40 (note this dump is from an x64 system).

Our next step is to examine some memory around that parameter address. It turns out that approximately 240 bytes further we can find the stored procedure being called:

0:013> dc 00000000`0309bc40+0x240
00000000`0309be80  0ef44048 00000000 110a0008 00000000  H@..............
00000000`0309be90  0f04afe8 00000000 00000000 00000000  ................
00000000`0309bea0  0309bea8 00000000 003f007b 0063003d  ........{.?.=.c.
00000000`0309beb0  006c0061 0020006c 00720070 0063006f  a.l.l. .p.r.o.c.
00000000`0309bec0  0046005f 00740065 00680063 006f0044  _.F.e.t.c.h.D.o.
00000000`0309bed0  00460063 0072006f 00740048 00700074  c.F.o.r.H.t.t.p.
00000000`0309bee0  00650047 00280074 002c003f 002c003f  G.e.t.(.?.,.?.,.
00000000`0309bef0  002c003f 002c003f 002c003f 002c003f  ?.,.?.,.?.,.?.,.

Looking through the other threads, the majority are calling this stored procedure, with a few calling a different one. it maybe that this one sproc is blocking itself, or the interaction of these two are blocking each other. Or possible some other issue in the cluster. For now, our job is done, and we hand over to the DBA team to do some investigation into what is happening in SQL Server.

Note: Due to blog spam, comments are disabled. Please use the contact form for questions.