Friday, June 12, 2009

Gathering Boot Events and Calculating and Reporting Uptime

Uptime! Part of the Trinity, along with Single Sign On and ..... Ok so its a big topic. Something some clever people write theses about and do PhD's about.

In our case we have a reasonably simple requirement. We need to have 99% uptime for servers. Uptime is between support hours, during the week, excluding public holidays. Great.

Lets first start with what data we will use to calculate uptime. We are using the Windows event logs, in particular the System event log and EventID's 6006, 6008 and 6009. So 1st off I go and grab these events from a servers event log. I use Log Parser. The reason I use Log Parser is I can filter on EventID easily. Herewith the PS code, to get the last 45 days worth of boot events:


function Get-BootEvents ($svr)
{
$evt = 'D:\Data\BootEvents\'+$svr+'_BootEvents.csv'
$myQuery = new-object -com MSUtil.LogQuery
$objInputFormat = New-Object -com MSUtil.LogQuery.EventLogInputFormat
$objOutputFormat = New-Object -com MSUtil.LogQuery.CSVOutputFormat
$strQuery = "Select ComputerName, TimeGenerated, EventID, EventTypeName, Message INTO "+$evt+" FROM \\"+$svr+"\System WHERE TimeWritten >= SUB( TO_LOCALTIME(SYSTEM_TIMESTAMP()), TIMESTAMP('0000-02-15', 'yyyy-MM-dd')) AND (EventID = '6006' OR EventID = '6008' OR EventID = '6009') ORDER BY TimeGenerated"
$myQuery.ExecuteBatch($strQuery, $objInputFormat, $objOutputFormat)
}


I run this for each server in a list.

$sl = gc 'D:\Powershell\servers.txt'
foreach ($svr in $sl) {write-host "Getting boot events for $svr" -foregroundcolor "Green"; Get-BootEvents $svr}


Then I copy these csv files to my DBA SQL Server and Bulk Insert them into a BootEvents Table. BootEvents Table code:


CREATE TABLE [dbo].[BootEvents](
[ComputerName] [varchar](128) NULL,
[TimeGenerated] [datetime] NULL,
[EventID] [int] NULL,
[EventTypeName] [varchar](100) NULL,
[Message] [varchar](500) NULL
)



The bulk insert function is below:



function BulkInsert-BootEvents ($fl)
{
$SqlConnection = New-Object System.Data.SqlClient.SqlConnection
$SqlConnection.ConnectionString = "Server=DBASQL1;Database=master;Integrated Security=True"
$SqlCmd = New-Object System.Data.SqlClient.SqlCommand
#$SqlCmd.CommandText = "use DBA; TRUNCATE TABLE BootEvents;"
#$SqlCmd.Connection = $SqlConnection
#$SqlConnection.Open()
#$sqlCmd.ExecuteNonQuery()
#$SqlConnection.Close()
$SqlCmd.CommandText = "BULK INSERT DBA..BootEvents FROM '"+$fl+"' WITH (FIELDTERMINATOR = ',', FIRSTROW = 2, ROWTERMINATOR = '\n')"
$SqlCmd.Connection = $SqlConnection
$SqlConnection.Open()
$sqlCmd.ExecuteNonQuery()
$SqlConnection.Close()
}



I run the following to Bulk Insert each csv into my BootEvents Table:

cpi d:\data\bootevents\*bootevents.csv \\DBASQL1\d$
$l = gci '\\DBASQL1\d$\*' -include *bootevents.csv | select fullname
foreach ($fl in $l) {BulkInsert-BootEvents $fl.fullname}



Now I have all the boot events for all my servers in a SQL Table. The 1st thing I do is create some T-SQL functions to work out the number of work seconds of downtime between boot events or for a hang boot event, and also if a particular date is a public holiday. I also have a function to convert seconds to a string of hours, minutes and seconds, which is more readable.

Convert Seconds to string

--drop function ms

CREATE function ufn_ConvertSec
(@s bigint)
RETURNS VARCHAR(50)
as
BEGIN
DECLARE @st VARCHAR(50)
DECLARE @m bigint
DECLARE @h int
DECLARE @d int

BEGIN
SET @h = @s/3600
SET @s = @s - (@h*3600)
SET @m = @s/60
SET @s = @s - (@m*60)
SET @st = CAST(@h AS VARCHAR(10))+' hour(s); '+ CAST(@m AS VARCHAR(15))+' minute(s); '+ CAST(@s AS VARCHAR(5))+' second(s)'
END
RETURN @st
END

--select dbo.ufn_ConvertSec(29098)



Check if Date is Public Holiday


CREATE FUNCTION ufn_IsHoliday (@Dt DATETIME)
RETURNS INT

AS

BEGIN
DECLARE @IsHol INT
DECLARE @PublicHolidays TABLE
(
DATE DATETIME,
HolidayName VARCHAR(50)
)
INSERT @PublicHolidays
VALUES('2009/04/10', 'Good Friday')
INSERT @PublicHolidays
VALUES('2009/04/13', 'Easter Monday')
INSERT @PublicHolidays
VALUES('2009/04/22', 'Voting Day')
INSERT @PublicHolidays
VALUES('2009/04/27', 'Freedom Day')
INSERT @PublicHolidays
VALUES('2009/05/01', 'Workers Day')
INSERT @PublicHolidays
VALUES('2009/06/16', 'Youth Day')
INSERT @PublicHolidays
VALUES('2009/08/10', 'Womens Day')
INSERT @PublicHolidays
VALUES('2009/09/24', 'Heritage Day')
INSERT @PublicHolidays
VALUES('2009/12/16', 'Day of Reconciliation')
INSERT @PublicHolidays
VALUES('2009/12/25', 'Christmas Day')

--DECLARE @Dt DATETIME
--SET @Dt = '2009/06/17 07:17:16'
--SELECT @Dt
SET @DT = SUBSTRING(CONVERT(VARCHAR(30), @Dt, 111), 1, 10) + ' 00:00:00.000'
--SELECT @Dt

IF (SELECT COUNT(*) FROM @PublicHolidays WHERE DATE = @DT) > 0
SET @IsHol = 1
ELSE
SET @IsHol = 0
RETURN(@IsHol)
END

select dbo.ufn_IsHoliday('2009/06/16 07:17:16')



Calculate Work Seconds between two dates

CREATE FUNCTION ufn_WorkSec (@Down datetime, @Up DATETIME)
RETURNS int
AS
BEGIN

DECLARE @WorkSec INT
SET @WorkSec = 0

WHILE @Down <= @Up
BEGIN
IF dbo.ufn_IsHoliday(@Down)=1
BEGIN
GOTO TICK
END
ELSE
IF (DATEPART(weekday, @Down) < 6 AND (DATEPART(hour, @Down) BETWEEN 7 AND 17))
BEGIN
WHILE (DATEPART(hour, @Down) < 18 AND @Down <= @Up)
BEGIN
SET @WorkSec = @WorkSec + 1
SET @Down = DATEADD(second, 1, @Down)
END
END

TICK:
SET @Down = DATEADD(day, 1, SUBSTRING(CONVERT(VARCHAR(30), @Down, 111), 1, 10) + ' 06:59:59.000')
SET @Down = DATEADD(second, 1, @Down)
END

RETURN(@WorkSec)
END



Now I'm ready to calculate the uptime percentage and insert the summarized data into a summary table called UptimeSummary. Code for table:


CREATE TABLE UptimeSummary
(
ComputerName VARCHAR(128),
Downtime VARCHAR(100),
UptimeSince DATETIME,
UptimePercent Decimal(18,4)
)


Now comes the T-SQL magic to use all the above code and churn out the uptime percentage:


declare @bootevents TABLE
(ID INT IDENTITY(1,1),
ComputerName VARCHAR(128),
TimeGenerated DATETIME,
EventID INT,
EventTypeName VARCHAR(100),
Message VARCHAR(500)
)
DECLARE @UptimeSummary TABLE
(
ComputerName VARCHAR(128),
DowntimeSec INT,
WorkSec INT,
UptimePercent Decimal(18,4)
)

INSERT @bootevents
select * from bootevents

DECLARE @StartDate DATETIME
DECLARE @EndDate DATETIME
DECLARE @WorkSec INT
DECLARE @Downtime INT
DECLARE @Up DATETIME
DECLARE @Down DATETIME
DECLARE @EventID INT

SET @StartDate = GETDATE()-45
SET @EndDate = GETDATE()
SET DATEFIRST 1
SET @WorkSec = (SELECT dbo.ufn_WorkSec(@StartDate, @EndDate))
DECLARE @Server VARCHAR(128)
DECLARE ServerC CURSOR
FOR SELECT DISTINCT ComputerName FROM @BootEvents ORDER BY ComputerName ASC
OPEN ServerC
FETCH NEXT FROM ServerC INTO @Server
WHILE @@FETCH_STATUS = 0
BEGIN

DECLARE @ComputerName VARCHAR(128)
DECLARE @ID INT
DECLARE @TimeGenerated DATETIME
DECLARE DownUp Cursor
FOR SELECT ID, TimeGenerated, ComputerName FROM @BootEvents WHERE EventID = 6006 AND ComputerName = @Server ORDER BY TimeGenerated ASC
OPEN DownUp
FETCH NEXT FROM DownUp INTO @ID, @TimeGenerated, @ComputerName
WHILE @@FETCH_STATUS = 0
BEGIN
SET @Downtime = 0
SET @Down = (SELECT TimeGenerated FROM @BootEvents WHERE ID = @ID)
SET @ID = @ID + 1
SET @Up = (SELECT TimeGenerated FROM @BootEvents WHERE ID = @ID AND EventID = 6009)
SET @Downtime = @Downtime + (SELECT dbo.ufn_WorkSec(@Down, @Up))
FETCH NEXT FROM DownUp INTO @ID, @TimeGenerated, @ComputerName
END
CLOSE DownUp
DEALLOCATE DownUp

DECLARE @Message VARCHAR(500)
DECLARE DownUpHang Cursor
FOR SELECT ID, TimeGenerated, ComputerName, Message FROM @BootEvents WHERE EventID = 6008 AND ComputerName = @Server ORDER BY TimeGenerated ASC
OPEN DownUpHang
FETCH NEXT FROM DownUpHang INTO @ID, @TimeGenerated, @ComputerName, @Message
WHILE @@FETCH_STATUS = 0
BEGIN
SET @Up = (SELECT TimeGenerated FROM @BootEvents WHERE ID = @ID)
SET @Down = (SELECT dbo.ufn_DateUnexpected([message]) FROM @BootEvents WHERE ID = @ID)
SET @Downtime = @Downtime + (SELECT dbo.ufn_WorkSec(@Down, @Up))
--SELECT @ID2, @TimeGenerated2, @ComputerName2, @Message, @Up, @Down, @Downtime
FETCH NEXT FROM DownUpHang INTO @ID, @TimeGenerated, @ComputerName, @Message
END
CLOSE DownUpHang
DEALLOCATE DownUpHang

INSERT @UptimeSummary
SELECT @ComputerName, @Downtime AS DowntimeSec, @WorkSec AS WorkSec,
CAST((@WorkSec-@Downtime) AS Decimal(18,4))/CAST(@WorkSec AS Decimal(18,4))*100 AS UptimePercent

FETCH NEXT FROM ServerC INTO @Server
END
CLOSE ServerC
DEALLOCATE ServerC

TRUNCATE TABLE UptimeSummary

INSERT UptimeSummary
SELECT ComputerName, dbo.ufn_ConvertSec(DowntimeSec) AS Downtime, @StartDate AS UptimeSince, UptimePercent
FROM @UptimeSummary
ORDER BY UptimePercent ASC


Now that I have Summarized uptime data for each server in a SQL Table I can write some Reporting Services reports to show this data. I've written two reports. One that shows the contents of the UptimeSummary table and one that shows Boot Events for a particular server. From the Summary report you can drill through to the Boot Events report for a server by clicking on the server name.

All the above, with the inclusion of "Hangtime" in the downtime calculation, prompted us to monitor for currently hanging servers. So we don't only reactively report on servers that hung, but now proactively report on servers that are currently hanging. I'll cover this in a subsequent post.

One Excon to another: "I thought you was hung?" "I is!!"

5 comments:

  1. Great idea, bro! But where's the definition for "ufn_DateUnexpected"? Thanks.

    ReplyDelete
  2. Nevermind. I figured it out and now the whole thing works perfectly. Thank you very much, Farmer Dave!

    ReplyDelete
  3. OK. I haven't figured it out yet. Where's the definition for ufn_DateUnexpected?

    ReplyDelete
  4. Got it. Took me a little longer than most.

    ReplyDelete