So, XML data compression. From the reading you may have done already, you don't get too much return from conventional compression techniques like native data compression in SQL 2008. Some folk have used transparent data encryption and got a 30% compression. More PT than its worth. What really grates me is the repetition of data and elements in each XML record. The nature of the beast. This was the primary target for my compression technique. Basically, I use xpath query and show the XML element names of the XML string. So if an XML string has 15 elements, the names of these will be repeated twice, once at the beginning of the element and once at the end. I then replace each of these element names with a 2 character string, made up of special characters. So element name PostalAddress may be replaced with ## for example. I iterate through the element names with a cursor. Its the simplest way I could think of doing this. There are probably smarter ways, but my method works quite efficiently.
One thing to note, not all XML data returns great compression using the above technique. After much testing I found that XML of length 4000 - 6000 bytes returned the most compression. In some cases as much as 140 fold compression!!! Yes, thats correct. So a 10000 record sample originally used 33256 KB, after compression this was 272 KB. The trick is to be able to get the compressed XML string and the entire record to smaller than 4000 bytes, and then apply Table compression at the page level.
Another note, the element names are replaced with 2 byte special character strings. This mapping of original element name to 2 byte string is saved in a mapping table, almost like an encryption key. Lose this table and you lose your mapping. The data in the XML string is not changed, only the XML schema, so if you know the schema you can still extract the data.
The 1st bit of code you'll need is a view that generates a few thousand 2 character strings:
CREATE VIEW uv_Charr AS WITH charr(CharNumber, CharVal) AS ( SELECT 192 AS CharNumber, CHAR(192) AS CharVal UNION ALL SELECT 193 AS CharNumber, CHAR(193) AS CharVal UNION ALL SELECT 194 AS CharNumber, CHAR(194) AS CharVal UNION ALL .... SELECT 255 AS CharNumber, CHAR(255) AS CharVal ) Select c1.CharVal+c2.CharVal Charr FROM charr c1 CROSS JOIN charr c2 --Avoid 215 and 248 go all the way up to 255. This will give you 3721 2 byte stringsThe rest of the code basically gets the element names, joins these to the data from the above view and then updates the XML string. The below code saves the mapping of element names to 2 byte strings to the MQMessageDetailLogNodeMap. The query below is against a table called MQMessageDetailLog
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MQMessageDetailLogNodeMap]') AND type in (N'U')) DROP TABLE [dbo].[MQMessageDetailLogNodeMap] GO SET NOCOUNT ON DECLARE @StartTime DATETIME DECLARE @EndTIME DATETIME SET @StartTime = GETDATE() DECLARE @Char TABLE (ID INT IDENTITY(1,1), CharVal CHAR(2)) INSERT @Char SELECT Charr FROM uv_Charr DECLARE @Xs VARCHAR(MAX) DECLARE @Xso VARCHAR(MAX) DECLARE @XX XML DECLARE @Nv1 TABLE (NodeName XML) DECLARE @Nv2 TABLE (ID INT IDENTITY(1,1), NodeName XML) SET @Xso = (SELECT TOP 1 CAST(MessageData AS VARCHAR(MAX)) FROM MQMessageDetailLog) SET @XX = @Xso INSERT @Nv1 SELECT Nodes.Name.query('local-name(.)') FROM @XX.nodes('//*') As Nodes(Name) INSERT @Nv2 SELECT DISTINCT CAST(NodeName AS VARCHAR(255)) FROM @Nv1 SELECT CAST(NodeName AS VARCHAR(255)) AS NodeName, c.CharVal INTO MQMessageDetailLogNodeMap from @Nv2 n inner join @Char c on n.ID = c.ID WHERE LEN(CAST(NodeName AS VARCHAR(255))) > 5 ORDER BY LEN(CAST(NodeName AS VARCHAR(255))) DESC DECLARE @MessageLogID BIGINT DECLARE @MessageXML XML DECLARE UpdateAll CURSOR FOR ELECT MQMessageHeaderLogID, MessageData FROM dbo.MQMessageDetailLog OPEN UpdateAll FETCH NEXT FROM UpdateAll INTO @MessageLogID, @MessageXML WHILE @@FETCH_STATUS = 0 BEGIN DECLARE @NodeName VARCHAR(255) DECLARE @CharVal CHAR(2) DECLARE Upd CURSOR FOR SELECT CAST(NodeName AS VARCHAR(255)) AS NodeName, c.CharVal from @Nv2 n inner join @Char c on n.ID = c.ID OPEN Upd FETCH NEXT FROM Upd INTO @NodeName, @CharVal WHILE @@FETCH_STATUS = 0 BEGIN SET @Xso = REPLACE(REPLACE(REPLACE(@Xso, ' ', ' '), ' ', ' '), ' ', ' ') SET @Xso = REPLACE(@Xso,@NodeName, @CharVal) UPDATE dbo.MQMessageDetailLog SET MessageData = CAST(@Xso AS VARCHAR(MAX)) WHERE MQMessageHeaderLogID = @MessageLogID FETCH NEXT FROM Upd INTO @NodeName, @CharVal END CLOSE Upd DEALLOCATE Upd FETCH NEXT FROM UpdateAll INTO @MessageLogID, @MessageXML END CLOSE UpdateALL DEALLOCATE UpdateALL SET @EndTime = GETDATE() SELECT DATEDIFF(ms, @StartTime, @EndTime) ALTER INDEX idx_MQMessageHeaderLogID ON [dbo].MQMessageDetailLog REBUILD PARTITION = ALL WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, ONLINE = OFF, SORT_IN_TEMPDB = OFF, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)If you need any more information on the above, give me a shout.
No comments:
Post a Comment