I am just a medium, SQL Server the Goal

Monthly Archives: October 2012

SQL Server Record Structures–Part 4

Previous Posts in this series.

  1. SQL Server Record Structures-Part 1
  2. SQL Server Record Structures-Part 2
  3. SQL Server Record Structures-Part 3

In the pervious posts I have discussed the structure of the data record in SQL (without compression or sparse columns). We also discussed a few special cases which affect how the record is stored in SQL. In the ensuing post, I would be talking about record structures when data compression is enabled on the table/index.

Data compression was first introduced in SQL with SQL Server 2005, in the form of VarDecimal Storage, where in any column with Decimal/Numeric data type was converted into VarDecimal storage in order to utilize only the required amount of bytes. This obviously was subject to enabling VarDecimal Storage on the DB and the table.

With SQL Server 2008, two new compression techniques were introduced. VarDecimal Storage was deprecated in this version of SQL as row compression achieves the almost the same results.

Row Compression

When Row compression is enabled on the table/Index all columns in the table are converted to use variable length storage, using only enough bytes to store the data. For example if a column is declared as Char(50) and has the following string as data “SQLUninterrupted” instead of using the 50 bytes of storage( as would be the case when no compression is enabled), SQL only uses 16 bytes when compression is enabled.

Some points to keep in mind when enabling Row Compression on a table/Index

  1. Compression does not change the Max Size limits on the record or Index keys.
  2. Compression cannot be implemented if the maximum size of the record, plus the Compression Information overhead exceed 8060 bytes.
  3. Compression can be enabled on individual partitions of a Partitioned table/Index.
  4. Changing the compression setting of a heap table requires a rebuilt of all the non-clustered indexes on the table.

Page Compression

When Page Compression is enabled on a Table/Index it takes a 3 level approach. First row compression is enabled on the page, followed by a Prefix and Dictionary compression. Details of how Prefix and dictionary compression work is available in Book Online topic “Page Compression Implementation”. When Page Compression is enabled, a special record called the Compression Information Record is added to the page. This CI record stores the Prefix/Dictionary compression information.

Some points to keep in mind when enabling Page Compression on a table/index.

  1. Page Compression only kicks in when the Page is full and a new row is being inserted. If enough space can be saved to accommodate the new record and the compression information on the page, the page will be compressed. Otherwise it would not be compressed.
  2. Enabling Page Compression on a heap table does not enforce compression on the pages until the heap is rebuilt.
  3. New pages being added to the heap are also not compressed until the heap is rebuilt.
  4. Non-Leaf pages of an index cannot be page compressed. This is to avoid the overhead of uncompressing the record during index operations.

Compression Record Structure

When compression is enabled on a table, the data records are stored in a new format commonly called the CD (Column Descriptor) format. This format is entirely different from the regular record structure used when there is no compression or no sparse columns defined in the table. The Structure of the record takes the following format.


Record Header Region

This is a 1 byte section with the following information

  • Bit 0 – Type of Record. This is set to 1 for CD format Records
  • Bit 1 – Indicates if the record has versioning information.
  • Bit 2-4 – Indicates the formation stored in the row. Details of these bits can be found in the Microsoft® SQL Server® 2008 Internals by Kalen Delaney, Paul S. Randal, Kimberly L. Tripp , Conor Cunningham , Adam Machanic
  • Bit 6 – Indicates if the record contains a long data region or not.
  • Bit 7 – Not used.

Column Descriptor Region

This region is composed of two parts, the first part is either 1 byte or 2 bytes and indicated the number of columns in the table. The second part is 4 bit (per column) descriptor which indicates the length of the column. The 4 bits can be used to represented 16 different values, of which only 13 are used with SQL 2008. For details of the descriptors refer the SQL Server 2008 Internals book mentioned above.


Short Data Region

Contains columns which are less than equal to 8 bytes in length. Since the length of the columns are not directly stored in this region, in tables with a lot of columns, it can be expensive to access the last column, as this would require processing all the previous columns (length of each column) and then computing the start and end of the desired column. To mitigate this, the columns are divided into clusters of 30 columns each. A Short Data Cluster Array(byte array) is then created in this region to store the length of each cluster. Since there are only 30 columns in each cluster (with each column being max of 8 bytes), a single cluster can have a max length 240 bytes.


Long Data Region

All columns with data length grater than 8 bytes are stored in the long data region. Since the length of these columns is not stored in the CD regions (each CD descriptor is only 4 bits, whereas the column length can be anything), the Long Data region requires a column offset array to correctly locate the column data in the Long Data Region.

Again, in order to reduce the cost of locating a column value in this region, we have a Long Data Cluster array, which  stores the number of columns in each clusters (again each cluster represents up to 30 columns).


Note: This illustration has been taken from the Microsoft® SQL Server® 2008 Internals by Kalen Delaney, Paul S. Randal, Kimberly L. Tripp , Conor Cunningham , Adam Machanic

Special Information Section

Contains optional information about forwarding/back pointer or the Versioning information.

Row Compression Example

Create Table RowCompressionExample


Col1 int, 

Col2 bigint, 

Col3 char(40), 

Col4 char(30), 

col5 numeric(18,7), 

Col6 varchar(300), 

col7 datetime, 

col8 varchar(400), 

col9 char(100), 

col10 char(100)



-- Enable Row Compression on the table



----- Insert 2 records into the table using

insert into RowCompressionExample values

(10, 345678345, 'Sourabh', 'Agarwal', 123345456.3456, 'This is first Long data', getdate(), 

'This is a Second long Data', 'This is a third long Data', 'short')


insert into RowCompressionExample values

(76854, 2000, 'Short1', 'LongDataValue1', 3847.34, 'This is first Long data', getdate(), 

'This is a Second long Data', 'This is a third long Data', 'LongDataRegion')


-- Use DBCC IND to display the Pages used for this Table.. 

DBCC IND(10,'RowCompressionExample',-1)


--- Use DBCC PAGE to check the records structure on the Page

DBCC TRACEON (3604, 1)

DBCC PAGE(10,1,285,3)

I have created a sample table with 10 columns of varying size and then enabled Row Compression on the Table. I then inserted two records in the table. Now Lets look at the record structure from the output of the DBCC Page command.


As indicated by the DBCC PAGE output, the total record length is only 128 bytes as compared to 362 bytes that might have been required to store the same record in the uncompressed format.

Next we also have the CD Array displayed in the output. Since there are 10 Columns in the table, there are 10 entries in the CD array.

As can be seen, the columns Col6, Col8 and Col 9 have values longer than 8 bytes and hence are marked as Long in the CD array. Also notice that since there are only 10 columns, all the columns are part of the same cluster (cluster 0).

Page Compression Example

I used a similar table structure  as above, to demonstrate Page Compression. As I mentioned earlier, Page compression does not kick in until the Page is full and a new record has to be inserted into the table.

I created the following tabled and enabled Page Compression on the Table.

Create Table PageCompressionExample


Col1 int, 

Col2 bigint, 

Col3 char(40), 

Col4 char(30), 

col5 numeric(18,7), 

Col6 varchar(300), 

col7 datetime, 

col8 varchar(400), 

col9 char(100), 

col10 char(100)



-- Enable Row Compression on the table


Next I inserted 64 records into the table using the following script

declare @count int = 0

While(@count < 32)


insert into PageCompressionExample values

(10, 345678345, 'Sourabh', 'Agarwal', 123345456.3456, 'This is first Long data', getdate(), 

'This is a Second long Data', 'This is a third long Data', 'short')

insert into PageCompressionExample values

(76854, 2000, 'Short1', 'LongDataValue1', 3847.34, 'This is first Long data', getdate(), 

'This is a Second long Data', 'This is a third long Data', 'LongDataRegion')

set @count = @Count+1


At this point, there are two data pages in the table with row compression only. So why did the Page Compression not take effect?

The reason is that on a heap, PAGE Compression would not take effect until the heap table has been rebuilt. So lets rebuild the table..


Now there is only one Page in the table with Page Compression enabled..

When Page Compressed, SQL Server adds a new record to the top of the page (right after the Header Info) called the Compression Information record. Additionally SQL server adds information in the page header to indicate that the page is “Page Compressed”. If the value of m_typeFlagBits = 0x80, then the page is Page Compressed.

In the next part of this blog series I will talk about the Compression Information record structure.


SQL Server Record Structures–Part 3

In the previous posts in this series I had discussed about Regular Data Record structures and the special case of Row Forwarding. In this post I will talk about two other special cases and how the record structures are changed or impacted with they are present.

  • Row Versioning
  • Ghost Records/Ghost Cleanup

Row Versioning

Row Versioning was first introduced in SQL Server 2005. Several SQL Server features (Online Index Rebuild, CheckDB, etc.) and the 2 new row versioning based Isolations Levels (Snapshot Isolation and Read Committed Isolation level) in SQL use Row Versioning. There is a plethora of content on MSDN and other blogs about how the row versioning based Isolations work, it is for this very reason I wont be talking about these features in this post. Instead, what I would concentrate is how row versioning based isolation levels change the record structure in SQL Server.

Row Versioning is applicable to all types of records (data, Index, text) in SQL. When any of the row versioning based Isolation level is enabled on the database, for any update on an existing record in the table, SQL Server creates a last committed copy of the record and put this copy in the version store in the Tempdb database. Any other operation that attempt to read this record (i.e. before the update transaction has committed) would read the record from the version store.

The original record, which remains in the database is modified to append a 14 Byte Versioning Tag. This 14 byte is used to present the Time Stamp of when the record was created and a pointer to the pervious version of the record in the Version Store.


--- create a table for testing for the versioning, insert two records in the table. 


Create Table VersionedRecords (Col1 int Identity(1,1), Col2 varchar(100), Col3 Datetime, Col4 char(50))


insert into VersionedRecords(col2, col3, Col4) values (Replicate ('Sourabh', 10), getdate(), Replicate('AAA', 10))


insert into VersionedRecords(col2, col3, Col4) values (Replicate ('Agarwal', 14), getdate(), Replicate('BBB', 10))



-- Now lets check the Record sizes for this Table.


DBCC IND ('DatabaseName', 'VersionedRecords',-1)

DBCC PAGE(10, 1, 281, 3)

the Record Sizes are as below….


The sizes for the records 143 and 171 bytes respectively, which is in accordance with what I discussed in the part 1 of this blog series.

Now lets enable the Snapshot Isolation on the database.

-- Enable Snapshot Isolation in the DataBase.

Alter Database DatabaseName Set ALLOW_SNAPSHOT_ISOLATION  ON

Once the Database option was enabled, I ran an Update Statement on the Table using another connection.

-- Run the Update transaction so that the Versioned Records are created

begin transaction 

update VersionedRecords set Col2 = Replicate('Sourabh', 14) where Col1 = 1

now if we dump the data page again, we see that the record size for the 1st record changes, while the one for 2nd record remains the same.


we can see that there is a increase of 42 bytes in the record (28 extra bytes to store the new value + 14 bytes for the versioning tag). details of the versioning information (14 byte) will be discussed in another blog at a later time.

Side Pointers

  1. Existing records in a table are not immediately modified when Row Versioning is enabled on the data, they are only done when a subsequent update operation happens on the record.
  2. All new records added to the table (after changing the database setting) would have the Versioning bytes appended to them.
  3. Versioning store is located in the tempdb, and can cause of I/O activity in the TempDB. This can lead to performance issues on the server, if Tempdb storage is not carefully planned.
  4. Multiple updates can be done on the same record which can lead to a chain of versioning records. Any subsequent read operation has to traverse this chain to get the correct version of the record. This can be time consuming and hence can lead to performance issues.
  5. Version store cleanup is a background process and there can be scenarios where the rate of cleanup is less than the rate of generation of version records. This can lead to increased Tempdb usage.

Ghost Records/Ghost Cleanup

In SQL Server when a record is deleted from the table, the physical storage for the record is not immediately destroyed or deleted, instead the record is marked for deletion and the actual deletion happens at a much later point in time. These records which are marked for Deletion are called Ghost Records. This process significantly increases the performance of the delete operations. How? I will leave it up to you to figure that out.

In the 2 byte record header, the combination of bits 1-3 is used to represent whether the record is ghost record or not.

Bits 1 through 3 Taken as a three-bit value

5 indicates a ghost index record,

6 indicates a ghost data record, and

7 indicates a ghost version record.

The actual deletion of the records is performed by the Ghost Cleanup Task, which runs at an interval 10 seconds (SQL 2008 and above) and 5 seconds (SQL 2005 and below). The deleted records are added to the delete queue of the ghost cleanup task, during a table scan.

Another important thing to remember is that the records are not really deleted, just the space is marked as not being used.

SQL Server Record Structures–Part 2

In the first part of this series, I had discussed about the regular record structure used in SQL Server. In this post I would be talking about a special case, Row Forwarding and how they effect the record structures in SQL.

In order to understand Row Forwarding, it’s important to first understand how non clustered indexes created on a Heap table works.

in case of heap tables, non-clustered indexes on their leaf pages have the RID along with the index key values. This RID value helps link the non-clustered index to the heap table during a scan or a seek. Consider the following example. This RID value is combination of PageID:RowID identifying a physical record in the table.

Assume we have a heap table with the following columns (Col1, Col2, Col3, Col4, Col5). Also assume there is a non-clustered index on the column (col1).

The index leaf page would have a similar structure (this is just an illustration!)

Index Key(a)  RID
1 222:0
2 222:1
3 222:2
4 222:3
5 222:4
6 222:5
7 222:6


now assume if we are running the following query, against the table

Select Col1, Col2, Col3 from Table1 

where Col1 = SomeValue

this query would can potentially use the Non Clustered Index (depending on whether the cost of using the index is lesser than table scan on not). If the query uses the NCI, then it can get the values of Col1 quiet easily, but for the values of the columns Col2 and Col3, it has to piggy back on the RID value to reach the actual data record in the table (PageID and RowID) and get the values from there.

This operation in SQL is called the Bookmark Lookup (SQL 2000) or the Lookup Operation (SQL 2005 and onwards).

Now assume we have a table with some Variable length columns. When the values in the variable length columns are updated, the update might result in an increase in the size of the column. SQL server might not be able to fit this new record on the same page and may cause a Page Split, thereby moving the current record and potentially other records on the page to a different page.

Now if there was a Non-clustered index on this table, then the non-clustered index would have to able to be modified to reflect movement of the rows. This would make the update operations very expensive.

So instead of having to update the non-clustered index, SQL server creates a pointer/stub at the initial location of the record to point to the new address of the record. This way, when the NCI scan or seek reaches the record, it simply reads the pointer record and reach the new location of the Row. The pointers are called forwarding Pointers/records and the actual record is called the Forwarded Records.

The same record can be modifies multiple times and a new forwarded record might have to be created. This could potentially lead to having a chain of forwarding/forwarded records. In reality this does not happen. What actually happens is that the Forwarded Record also contains a back-pointer to the forwarding record. So when the multiple changes are being made to the record, the engine just takes the new location of the record and updates the original forwarding record, to point to this new location.

Row Forwarding is bad for performance. Also, row forwarding only happens in HEAP Tables.

Example: Examining the Forwarding Record and Forwarded Record

CREATE TABLE ForwardingRecord


Col1 int NOT NULL,

Col2 char(1000) NOT NULL,

Col3 varchar(3000) NULL,

Col5 varchar(4100) NOT NULL



Insert into ForwardingRecord values (1, Replicate('a',1000), replicate('b',1000),replicate('b',1000))

Insert into ForwardingRecord values (2, Replicate('a',1000), replicate('b',1000),replicate('b',1000))


DBCC IND('DatabaseName','ForwardingRecord',-1)


DBCC PAGE(10,1,228,3)

The result of the DBCC Page, would show that there are two records on the page. Each record is about 3017 bytes in size.

now lets update the second record in just a way that it causes a Page Split.

Update ForwardingRecord set Col5 = Replicate('v', 4100) where col1=2

Dumping the same page again and looking at the Slot 1 (the original record was here) we see,


As can be seen, the forwarding record has the information about where the forwarded record exists. Let’s now try to dump the Page 280 and check the record.


as can be seen that the Forwarded record has information about the forwarding record. As Paul Randal mentions in this blog, the back pointer is 10 bytes in size.

SQL Server Record Structures–Part 1

Recently while talking about the SQL Server Storage engine at one of the customer workshops, I was asked about how data is actually stored in the tables and indexes. With the exception of Column Store Indexes (*introduced in SQL 2012), data in SQL Server is stored in rows.

A data row in SQL Server can be stored in one of 3 formats, depending on what SQL Server feature is enabled in the database.

  • Regular Data Record: The regular Data row structure which have been in use since the early days of SQL Server.
  • Compression Row Structure : When Row/Page compression is enabled on the Table.
  • Sparse Column Row Structure: When Sparse Columns are defined in the table.

Index Records are in two formats

  • Leaf record structure
  • Non-Leaf record structure

Additionally SQL Server has a separate type of record structure to store LOB data. LOB data can either be stored as

  • Off Row Data
  • Row-overflow data

In addition to the above mentioned data structures, SQL also has a Versioning Records (for all data, Index, text types) when row versioning is enabled or as used by some internal SQL features.

In the first part of the blog series I would be talking about the regular data row structure as has been used since the early days of SQL Server, the other record structures for compression and or sparse columns would be discussed in later posts. The Data records is part of a Heap Table (tables with no clustered index) or the leaf level of a clustered index. A Data Record is composed from the values of all the columns in the table. Another important point to keep in mind is that in SQL Server, the terms Row/Record/Slots are all synonymous and can be used interchangeably.


The record can be divided into 8 sections as depicted in the picture above.

  1. 2 Bytes Record Status (Record header) : Currently only 9 bits of this is being used. The header contains information about what kind of record it is. Whether it is versioned or ghosted record and so on. For more information on these 2 bytes refer, Microsoft® SQL Server® 2008 Internals by Kalen Delaney, Paul S. Randal, Kimberly L. Tripp , Conor Cunningham , Adam Machanic
  2. The second 2 bytes (Length of Fixed Length Columns) indicate the total length of fixed length columns (int, date, float, money char etc.)
  3. The 3rd section of N bytes is the actual storage for the Fixed length columns. Yes, you got it right, SQL Server stores all the fixed length columns together (irrespective of where they are defined in the table).
  4. 2 Bytes for Number of Columns – 2 bytes are used to store the number of columns in the record. In the default scenario (no sparse columns) a table in SQL 2008 can have up to 1024 column, which means at least 10 bits would be required to store this information.
  5. Null Bitmap (1 bit per column in the table): 1 bit would be used per column, to indicate whether the column can have a null value or not. This is done for all the columns in the table.
  6. 2 bytes – Number of variable length columns in the table: A count of the variable length columns in the table.
  7. Variable length Column offset (2 bytes per variable length column): This offset stores the ending offset for each variable length column in the table. This allows SQL Server to efficiently calculate the start,end and length of each variable length columns. 2 bytes is required because the column offset can be anywhere on the 8KB page.
  8. N bytes – For the variable length columns storage: Storage for the column values.

To understand this further, lets take few examples. In all the examples below, the following DBCC PAGE/DBCC TRACEON commands would be used to display the page/record structure.

Example 1: Table With All fixed length columns

Consider a SQL Table with the following structure

create table FixedLenghtColumns


col1 int,

col2 float, 

col3 datetime, 

col4 char(25)



-- Insert a record into the table

Insert into FixedLenghtColumns values (1, 1.0001, getdate(), 'Fixed Length Columns')


-- DBCC IND Command to display the allocations for this table

DBCC IND('Database Name','FixedLenghtColumns',-1)


-- DBCC PAGE Command to display the Page structure

DBCC TRACEON (3604,-1)

DBCC PAGE (10, 1, 276, 3)

Lets look at the page and record structure from the DBCC Page Output.


The fields marked in Red boxes are important.

pminlen – Indicates the minimum length of a record in the table. This table has 4 columns with a total length of 45 bytes. Add to it the Row header and 2 bytes for the length of the fixed length columns.

m_slotCnt: This field indicate the number of slots or records on the page.

Record Size/Record Attributes: This indicate the actual size of the this record and what are the attributes that this record has. For example this record contains a NULL_BITMAP field.

The “Slot 0 Offset 0x60 Length 52” section indicate that this is the first record on the page, and the record starts at offset 0x60 (decimal 96). Which is valid since the Page Header occupies the first 96 bytes of a SQL Server Page. Please note all offsets start at 0.

Now lets see why the record occupies 52 bytes. The first 4 bytes of the record are fixed for the record header and the length of fixed length columns. Next the total length of the fixed length columns is 45 bytes. Then we need 2 bytes for the number of columns and 1 byte for the Null Bitmap (4 bits for the four columns, but since it all byte allocation, we need 1 byte). This is visible in the record details we see in the DBCC PAGE output.


As can be seen, the first column value starts at offset 0x4 (decimal-4). Which is after the first 4 bytes for the Row header and the length of the fixed length columns. The other column values follow the similar pattern.

Example 2: Tables with mix of variable and fixed length columns

Lets consider another example, where the table contains a mix of fixed and variable length data types.

CREATE TABLE MixDataTypeColumns


Col1 int NOT NULL,

Col2 char(25) NOT NULL,

Col3 varchar(60) NULL,

Col4 money NOT NULL,

Col5 varchar(20) NOT NULL



Insert into MixDataTypeColumns values (10, 'Sourabh Kumar Agarwal', 'This is a test', 3764.846, 'last column')


DBCC IND('DatabaseName','MixDataTypeColumns',-1)


DBCC PAGE(10,1,274,3)

The output of the DBCC PAGE indicates the following…


The Record length here is 75 which can be summed up as follows

2 bytes Record Header+ 2 bytes for the length of fixed length data types columns + 37 bytes for the fixed length data types columns + 2 Bytes of No of Columns + 1 bytes (*5 bits) for NULL_BITMAP + 2 bytes for number of Variable length data type columns + 2*2 bytes for the Variable length column offset + 25 bytes used for the variable length column values.

Also as visible from the column offset information from the snapshot above, we can see that the fixed length columns are stored first (notice the column offsets) and then the variable length columns are stored, Which is

  • Col1 (first fixed length Column) – 0x4 – decimal 8
  • Col2 (second fixed length Column) – 0x8 – decimal 8
  • Col4 (last fixed length Column) – 0x21 – decimal 33
  • Col3 (first variable length Column) – 0x32 – decimal 50
  • Col5 (first fixed length Column) – 0x40 – decimal 64

In the next post, I will talk about 3 special cases involving versioning, ghost records and row forwarding.