In databases designed without close oversight and tight discipline, it is common to see prolific, if not abusive, use of varchar and nvarchar columns with excessive maximum lengths: varchar(8000), nvarchar(4000), and both ASCII and Unicode with the max option. The business analyst might be non-committal, or might want to allow for unspecified very long values. The developer who does not know to demand prudence might think that a column with a fat maximum length is magically handled by SQL Server without any apparent cost or penalty. Note that not many developers are aware of the maximum index key size. But we should know that "Magic always comes with a Price, Dearie."
Let's start with the test table:
CREATE TABLE dbo.Products (
  GroupID int NOT NULL,
  ProductID int NOT NULL IDENTITY(1,1),
  ProductName nvarchar(2000) NOT NULL,
  ProductAlt nvarchar(4000) NOT NULL,
  ProductDesc nvarchar(max) NOT NULL,
  IntroDate date NOT NULL,
  ID int NOT NULL,
  INDEX UCX UNIQUE CLUSTERED (GroupID, ProductID)
)
The table's string columns are populated with values of short to intermediate length, 5 to 30 characters on average. Executing a query that groups on each of the columns auto-generates statistics on that column. Statistics are then explicitly updated with FULLSCAN for test consistency and determinism.
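As a minimal sketch of that setup (the grouping query triggers auto-created column statistics, and the FULLSCAN update makes the numbers deterministic):

  -- Grouping on a column causes SQL Server to auto-create statistics on it
  -- (assuming AUTO_CREATE_STATISTICS is on, the default).
  SELECT ProductName FROM dbo.Products GROUP BY ProductName;
  -- Refresh all statistics on the table with a full scan for repeatability.
  UPDATE STATISTICS dbo.Products WITH FULLSCAN;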
Consider queries to each of the string columns of the following form:
SELECT ID, ProductName FROM dbo.Products WHERE GroupID = 1
The execution plan is a simple clustered index seek:
The index seek details are below:
Note the Estimated Row Size of 2015 B, third row from the bottom. This comprises 11 bytes of overhead, 4 bytes for the integer NOT NULL column, and 2000 bytes for the nvarchar(2000) column. In Unicode, each character is 2 bytes, so the assumption appears to be that the average length is one-half the maximum length.
The nvarchar(4000) column's Estimated Row Size is 4015 B, comprising 11 bytes of overhead, 4 bytes for the integer, and 4000 bytes, again one-half the column's maximum length. For the nvarchar(max) column, the Estimated Row Size is 4039 B, an extra 24 bytes over the nvarchar(4000) estimate.
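Stated as a rule (my reading of the plan numbers, not a documented formula):

  Estimated Row Size = 11 B overhead + 4 B (int) + (declared max characters / 2) x 2 B per Unicode character

  nvarchar(2000): 11 + 4 + 2000 = 2015 B
  nvarchar(4000): 11 + 4 + 4000 = 4015 B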
From the data distribution statistics, SQL Server knows the average length of the ProductName column: the actual value under FULLSCAN, or an estimate in the case of a partial sample. Here it is 21 bytes.
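The average length can be read directly from the statistics object; the density vector output of DBCC SHOW_STATISTICS includes an Average Length column, and the column name can serve as the target since the auto-created statistics cover it:

  -- Average Length (in bytes) appears in the density vector rows.
  DBCC SHOW_STATISTICS ('dbo.Products', ProductName) WITH DENSITY_VECTOR;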
I was advised that the SQL Server query optimizer can in fact use knowledge from data distribution statistics if it has some other reason to access them, but it will not do so solely to estimate row size. When statistics access is not otherwise needed, the rule method is employed: one-half the maximum length, and 4000-plus bytes for the max option.
The approach suggested for evaluating this was to apply an additional WHERE clause argument of ISDATE(Column) IN (0,1), which is always true and does not affect either estimated or actual rows.
SELECT ID, ProductName FROM dbo.Products WHERE GroupID = 1 AND ISDATE(ProductName) IN (0,1)
This approach did not change the Estimated Row Size in my test case; the behavior may be version specific (SQL Server 2019 on the test system).
Edit: I was actually supposed to use WHERE 'real SARG' AND (ProductName <> 'NonExistentValue' OR ISDATE(ProductName) IN (0,1)). Because ISDATE always returns 0 or 1, the OR'd predicate is always true, and so the string inequality does not affect the result rows.
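Applied to the test query, that corrected form would look as follows (N'NonExistentValue' stands in for any value known not to be present in the column):

  SELECT ID, ProductName
  FROM dbo.Products
  WHERE GroupID = 1
  AND (ProductName <> N'NonExistentValue' OR ISDATE(ProductName) IN (0,1))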
The approach that did affect Estimated Row Size is as below:
SELECT ID, ProductName FROM dbo.Products WHERE GroupID = 1 AND ProductName <> 'Rumpelstiltskin'
For this approach to produce correct query results, the comparison value must not exist in the column, and it is only valid for NOT NULL columns (the inequality would eliminate NULL rows). The Estimated Row Size is now 36 B, corresponding to 11 bytes of overhead, 4 bytes for the integer, and 21 bytes for the string.
The queries above have zero SerialRequiredMemory, SerialDesiredMemory, GrantedMemory, and MaxUsedMemory because each row is sent to the client as it is produced, with no need for retention.
<MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" GrantedMemory="0" MaxUsedMemory="0" />
The following queries need internal memory for intermediate results:
SELECT ID, MAX(ProductName) FROM dbo.Products WHERE GroupID = 1 GROUP BY ID

SELECT ID, MAX(ProductName) FROM dbo.Products WHERE GroupID = 1 AND ISDATE(ProductName) IN (0,1) GROUP BY ID

SELECT ID, MAX(ProductName) FROM dbo.Products WHERE GroupID = 1 AND ProductName <> 'Rumpelstiltskin' GROUP BY ID
For the first and second queries, the Memory Grant Info is mostly the same. The first is:
<MemoryGrantInfo SerialRequiredMemory="1024" SerialDesiredMemory="252632" RequiredMemory="9416" DesiredMemory="261064" RequestedMemory="261064" GrantWaitTime="0" GrantedMemory="261064" MaxUsedMemory="260296" MaxQueryMemory="3004400" LastRequestedMemory="0" IsMemoryGrantFeedbackAdjusted="No: First Execution" />
The second differs only insignificantly, with a MaxUsedMemory of 260040, slightly lower than the 260296 of the first. The desired memory of 261064 is about 1.3X the size of the 101,694 rows of 2,015 bytes each, for room to store the data plus working space? It would appear that SQL Server is very good at reusing workspace and not allocating more memory than necessary. The IsMemoryGrantFeedbackAdjusted value is "No: First Execution".
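As a rough check on that ratio (the showplan memory grant figures are reported in kilobytes): 101,694 rows x 2,015 B ≈ 200,111 KB of data, and 261,064 KB / 200,111 KB ≈ 1.30.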
The third query's memory grant is below: a much lower desired memory of 15944, and MaxUsedMemory lower still at 8504. That is about 2.37X the data size, and this execution has a Sort operation, whereas the first two employed a Hash Match.
<MemoryGrantInfo SerialRequiredMemory="512" SerialDesiredMemory="11104" RequiredMemory="5320" DesiredMemory="15944" RequestedMemory="15944" GrantWaitTime="0" GrantedMemory="15944" MaxUsedMemory="8504" MaxQueryMemory="3004400" LastRequestedMemory="0" IsMemoryGrantFeedbackAdjusted="No: First Execution" />
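The same check for the third query, at the 36 B estimated row size: 101,694 rows x 36 B ≈ 3,575 KB, and 8,504 KB / 3,575 KB ≈ 2.38, consistent with the 2.37X figure above.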
The plan XML does in fact confirm the use of statistics for the string column:
<StatisticsInfo Database="[TestDB]" Schema="[dbo]" Table="[Products]" Statistics="[_WA_Sys_00000003_25869641]" ModificationCount="0" SamplingPercent="100" LastUpdate="2021-12-02T09:57:41.05" />
In the second execution of the above set, memory grant feedback increased the memory allocation for the first two queries; the third query's request was good in the first place.
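For anyone who wants to watch the grants change across executions, one option (a sketch; this DMV reports only queries currently waiting for or holding a grant, so query it from a second session while the test statement runs) is:

  -- Active memory grants, in KB, per session.
  SELECT session_id, requested_memory_kb, granted_memory_kb, max_used_memory_kb
  FROM sys.dm_exec_query_memory_grants;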
Summary
SQL Server has, or could have, reasonably good information with which to formulate a good execution plan, including memory grant assessments. There are data correlation scenarios in which a combination of individual column averages can diverge significantly from specific actual queries, but such is life.
The decision not to investigate data distribution statistics solely for row size estimation is a good one for transactional queries (highly selective indexes, low number of rows touched). The rule method of one-half the maximum length is reasonable for an expertly designed database, but perhaps not for one from a nitwit designer.
Statistics should be employed for high row count queries beyond some threshold level, as this could have great out-of-the-box impact in data warehouse and reporting environments.
There are query hints for the minimum and maximum memory grant (MIN_GRANT_PERCENT and MAX_GRANT_PERCENT), but I have not had success in using them. Perhaps the undocumented subtleties elude me for now.
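For reference, a minimal sketch of the hint syntax against the test query (the percentages are arbitrary illustrations, not recommendations):

  SELECT ID, MAX(ProductName)
  FROM dbo.Products
  WHERE GroupID = 1
  GROUP BY ID
  OPTION (MIN_GRANT_PERCENT = 1, MAX_GRANT_PERCENT = 10);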
Another possibility is yet another query hint to direct the optimizer to access data distribution statistics for column size information. Microsoft tries very hard to make SQL Server work well out of the box, without requiring deep knowledge of its internals. When they accomplish this, and people actually upgrade to that version, it will be time to put me into retirement or out to pasture.
Would it be a good idea to incorporate memory grant size into the plan cost model? This would favor plans having lower memory grants. Note, the idea is to favor, not mandate.
Per the kindly advice from Darth Vader to Director Krennic: "Don't choke on your own ambition." Every unnecessarily fat column adds to the risk of choking on a misguided ambition to meet stupid requirements.