When is too many columns too many columns? – SQLServerCentral Forums (2024)

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    January 12, 2018 at 3:25 pm

    #334478

    I know you can have some insane number of columns in SQL Server... but just because you could, doesn't mean you should. Right now I am battling with a table that has almost 900 columns. 34 of those columns are dedicated to integer values which define which user is assigned to which part of the process. 280 columns are dedicated to statuses of the particular thing. It is actually 140 datetime values, but when it was built, each datetime was split into two columns, one for date and one for time.

    Locking is the first thing that comes to mind, as a high-traffic table is going to lock the entire thing up if you are updating one or many of the columns...

  • Jeff Moden

    SSC Guru

    Points: 1003577

    More actions

    January 12, 2018 at 6:29 pm

    #1975820

    Kevlarmpowered - Friday, January 12, 2018 3:25 PM

    I know you can have some insane number of columns in SQL Server... but just because you could, doesn't mean you should. Right now I am battling with a table that has almost 900 columns. 34 of those columns are dedicated to integer values which define which user is assigned to which part of the process. 280 columns are dedicated to statuses of the particular thing. It is actually 140 datetime values, but when it was built, each datetime was split into two columns, one for date and one for time.

    Locking is the first thing that comes to mind, as a high-traffic table is going to lock the entire thing up if you are updating one or many of the columns...

    There's actually nothing wrong with a 900 column table IF the rules of normalization were applied... which also usually means that you won't have a table with 900 columns. 😉

    I've worked with tables with similar insane numbers of columns. Most of those had to deal with CDRs (Call Detail Records). For those, it made no sense to split the columns off into separate tables.

    If your status/datetime columns have a lot of NULLs in them, it may be worthwhile to build a "sister" table in the form of an EAV to store those statuses. A whole lot of people will tell you that EAVs are the wrong way to go, but this is one of those places where they can really shine if you put some thought into it.
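
    For illustration only, a minimal sketch of what such an EAV "sister" table might look like (the table and column names here are invented, not taken from the actual schema):

    -- One row per status that actually occurred, instead of ~280 mostly-empty columns.
    -- Statuses that never happened simply have no row, so no fake 1900-01-01 dates are needed.
    CREATE TABLE dbo.ThingStatus
    (
        ThingID    INT          NOT NULL,  -- key of the row in the wide table
        StatusCode VARCHAR(50)  NOT NULL,  -- which of the ~140 statuses this row represents
        StatusDT   DATETIME2(0) NOT NULL,  -- single datetime instead of split date/time columns
        CONSTRAINT PK_ThingStatus PRIMARY KEY CLUSTERED (ThingID, StatusCode)
    );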

    --Jeff Moden

    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Grant Fritchey

    SSC Guru

    Points: 398535

    More actions

    January 15, 2018 at 7:33 am

    #1975895

    I'm with Jeff. The right number of columns is the number of columns needed to support the object stored inside the database, whether normalized or in a star schema, whatever. There isn't a number that's too many.

    That said, based on your description, it sounds like you're not dealing with a properly structured table. 180 columns to define a user and 280 columns to define a thing... That can't possibly be normalized or a fact table. It sounds like multiple functions have been jammed into a single table. In that case, it's not that there are too many columns, but that the structure is wrong. Bad structures will absolutely cause performance headaches, maintenance problems and even development and testing problems. Just overloading a column causes problems, let alone overloading a table.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2022 Query Performance Tuning, 6th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Eirikur Eiriksson

    SSC Guru

    Points: 182862

    More actions

    Grant Fritchey - Monday, January 15, 2018 7:33 AM

    I'm with Jeff. The right number of columns is the number of columns needed to support the object stored inside the database, whether normalized or in a star schema, whatever. There isn't a number that's too many.

    That said, based on your description, it sounds like you're not dealing with a properly structured table. 180 columns to define a user and 280 columns to define a thing... That can't possibly be normalized or a fact table. It sounds like multiple functions have been jammed into a single table. In that case, it's not that there are too many columns, but that the structure is wrong. Bad structures will absolutely cause performance headaches, maintenance problems and even development and testing problems. Just overloading a column causes problems, let alone overloading a table.

    +100
    😎

    There are exceptions to this though, e.g. CDRs, surveys, etc. In those cases I'd recommend looking into the sparse column feature and its limitations.
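
    As a rough sketch of the syntax (column names are made up), SPARSE makes NULLs nearly free to store at the cost of slightly more expensive non-NULL values, and an optional column set lets you work with all the sparse columns generically:

    CREATE TABLE dbo.Survey
    (
        SurveyID      INT IDENTITY(1,1) PRIMARY KEY,
        Question1Date DATE    SPARSE NULL,  -- mostly NULL, so SPARSE saves space
        Question1Time TIME(0) SPARSE NULL,
        Question2Date DATE    SPARSE NULL,
        Question2Time TIME(0) SPARSE NULL,
        AllAnswers    XML COLUMN_SET FOR ALL_SPARSE_COLUMNS  -- optional: exposes the sparse columns as one XML value
    );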

  • T.Ashish

    SSCarpal Tunnel

    Points: 4248

    More actions

    January 15, 2018 at 11:31 pm

    #1975957

    A 900-column table must have a lot of indexes and a complex primary key. What will happen to Update and Delete performance on such tables?

  • Grant Fritchey

    SSC Guru

    Points: 398535

    More actions

    January 16, 2018 at 6:13 am

    #1975974

    T.Ashish - Monday, January 15, 2018 11:31 PM

    A 900-column table must have a lot of indexes and a complex primary key. What will happen to Update and Delete performance on such tables?

    Lots of activity.

    Seriously though, pretty much what you would expect. You have to write the data once to the store (heap or clustered) and then you have to do the appropriate writes to each applicable index. Any of those writes could lead to the addition of new pages or page splits, with a whole bunch more writes. You do that every single time.

    I've said it before, I'll say it again: a properly normalized database enhances performance within an RDBMS. If you really only need a flat file, use a flat file storage system or a document database, not an RDBMS. An RDBMS will work better with relational data than with non-relational data, and better with a star schema than with a flat file layout. The whole "flat tables are faster" myth makes me insane. It generally comes from "flat tables are faster than my horrifically bad normalization/poor choice in fact table definition." Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2022 Query Performance Tuning, 6th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Jeff Moden

    SSC Guru

    Points: 1003577

    More actions

    January 16, 2018 at 7:20 am

    #1975980

    Grant Fritchey - Tuesday, January 16, 2018 6:13 AM

    If you really only need a flat file, use a flat file storage system or a document database, not an RDBMS.

    I'll say, "It Depends". Such a thing is certainly better than storing large XML or JSON junk and may be entirely necessary because an RDBMS will still handle files (once imported into a table) better than a whole lot of other things especially if it is data such as CDRs or similar information.

    Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    +1000 to that!

    --Jeff Moden

    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden

    SSC Guru

    Points: 1003577

    More actions

    January 16, 2018 at 7:25 am

    #1975982

    Sorry... posted to wrong thread on this post and moved it.

    --Jeff Moden

    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    January 16, 2018 at 12:43 pm

    #1976044

    Thanks all for your input... unfortunately, I didn't build these tables. They were built as data warehouse tables because they didn't want users writing SQL, so to make life easier, they made tables that had lots of columns so end users could basically select * from factTable where createDate >= dateadd(year,-2,getdate()). To avoid having to do "joins" they ETL everything into single tables for users too. So there is lots of repetition. This table likely has 900 columns of which a bunch are duplicated in other tables... but because they say "joins are bad" they ETL everything into one giant flat table. Sometimes it's values... sometimes it is just a flag 0 or 1.

    My thought on tackling this was to create a sister table with a 1:1 relationship with the original fact table so that I could offload the user assignment and date and time columns to a separate table, but that's still pushing 314 columns into a table. My next thought was to pivot those columns and have a table dedicated to statuses. So for every "thing" I would have a 1:many relationship. Each "thing" could have multiple users assigned (bonus: it would allow multiple people with the same role to be assigned to a thing, which can't be done now -- if multiple people are assigned to a thing, the most recent assignment overwrites the existing assignment). As it stands, eventually all of those 34 user assignment columns will be populated and each of those 280 date and time columns will have values in them as well.
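
    Something like the following is roughly what that vertical design could look like (all names here are invented for illustration):

    -- One row per status event instead of 280 date/time columns.
    CREATE TABLE dbo.ThingStatusEvent
    (
        ThingID  INT          NOT NULL,
        StatusID SMALLINT     NOT NULL,  -- lookup to a status reference table
        StatusDT DATETIME2(0) NOT NULL,
        CONSTRAINT PK_ThingStatusEvent PRIMARY KEY CLUSTERED (ThingID, StatusID, StatusDT)
    );

    -- One row per user assignment instead of 34 user columns; keeping every row
    -- also preserves history instead of overwriting the previous assignment.
    CREATE TABLE dbo.ThingAssignment
    (
        ThingID    INT          NOT NULL,
        RoleID     SMALLINT     NOT NULL,
        UserID     INT          NOT NULL,
        AssignedDT DATETIME2(0) NOT NULL,
        CONSTRAINT PK_ThingAssignment PRIMARY KEY CLUSTERED (ThingID, RoleID, AssignedDT)
    );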

    I was just trying to wrap my head around this because yes, you can have 1000+ columns per table... but my brain says it's a bad idea because of locking. As the activity on the table grows, each update statement is going to lock the entire row. If we are rapidly reassigning and tagging dates/times, that's going to be a problem with growth. Going vertical would increase the number of rows, but would lessen the row locking as things are inserted/updated.

    For explanation purposes... a "thing" is assigned to a user. A user can do many different steps, which is why there are 34 user assignment columns and 240 date and time columns. One user could do 10-12 steps to a particular "thing".

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    February 1, 2018 at 3:25 pm

    #1978277

    Grant Fritchey - Tuesday, January 16, 2018 6:13 AM

    The whole "flat tables are faster" myth makes me insane. It generally comes from "flat tables are faster than my horrifically bad normalization/poor choice in fact table definition." Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    I'm 100% going to use this as I came across a table today with 1060 columns...

    I think if I recommend abandoning our RDBMS for a flat file system, my mugshot will end up on a wall with the title "wanted dead or alive".

  • Grant Fritchey

    SSC Guru

    Points: 398535

    More actions

    February 2, 2018 at 7:14 am

    #1978347

    You might want to explore putting this into a columnstore index, see if that helps. It really depends on the types of queries being run against it, but pivoted storage, which you're looking at anyway, plus compression... it might make a big difference (might not).
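
    For what it's worth, trying that against a copy of the table is a one-liner (the table and column names below are assumed). A clustered columnstore replaces the rowstore entirely, while a nonclustered columnstore can be layered on top of the existing table for reporting queries:

    -- Option 1: convert the whole table to columnstore storage
    -- (an existing clustered rowstore index has to be dropped or converted first).
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactTable ON dbo.FactTable;

    -- Option 2: keep the rowstore and add a columnstore over the columns used for reporting.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactTable_Reporting
        ON dbo.FactTable (CreateDate, Status1Date, Status1Time, Status2Date, Status2Time);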

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2022 Query Performance Tuning, 6th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • RonKyle

    SSC-Dedicated

    Points: 31531

    More actions

    February 2, 2018 at 8:29 am

    #1978355

    That can't possibly be normalized or a fact table.

    Without a review of the data, this cannot be definitively established. That said, that is a lot of columns. I've never had anything close to that.

    If it is a fact table, which isn't clear to me that it is, and it is otherwise correctly modelled, the number of columns could be reduced with the creation of one or more junk dimensions. If it's an OLTP table and normal forms have been followed, and there are columns that don't apply to many or most records, these could be broken out into a smaller table. [The theoretical example of a company car comes to mind: if most people don't have one, but no one has more than one, it's probably better that this is its own table even though it could be in the employee table.]
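
    As a toy example of that company-car split (purely illustrative names): only employees who actually have a car get a row in the side table, instead of mostly-NULL car columns on the employee table.

    CREATE TABLE dbo.Employee
    (
        EmployeeID INT IDENTITY(1,1) PRIMARY KEY,
        FullName   NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.EmployeeCompanyCar
    (
        EmployeeID   INT NOT NULL
            CONSTRAINT PK_EmployeeCompanyCar PRIMARY KEY  -- at most one car per employee (1:0..1)
            CONSTRAINT FK_EmployeeCompanyCar_Employee REFERENCES dbo.Employee (EmployeeID),
        PlateNumber  NVARCHAR(20) NOT NULL,
        AssignedDate DATE NOT NULL
    );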

    As you say, you didn't build the tables, so you may be stuck. Hopefully the table has a proper key and you can walk through the normal forms if it's an OLTP table or establish that it's all at the same level of granularity if an OLAP table.

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    February 2, 2018 at 11:27 am

    #1978396

    It is a fact table...

    Before I came across this, I had never seen a table as wide as this, so I just wanted to consult with the online community to see what the experts say and whether I was crazy or there was something I was missing. I personally would have broken it down to be more normalized. For example, there are 200 columns dedicated to status... each "thing" could have 100 statuses defined by a datetime (but since they split the datetime into two columns, one date and one time... it's 200 columns). If the "thing" does not reach that status, they populate the columns with a fake null 19000101 date and 000000 time. If the "thing" reaches the status, the date and time columns get populated. If the thing hits a status twice, the newer status overwrites the older status date and time columns.

    Imagine this over 200 columns:

    ID     Status1Date  Status1Time  Status2Date  Status2Time  Status3Date  Status3Time  Status4Date  Status4Time
    12345  20180202     27900        19000101     0            19000101     0            20170715     0
  • RonKyle

    SSC-Dedicated

    Points: 31531

    More actions

    February 2, 2018 at 11:45 am

    #1978398

    Breaking out the date and time is the correct approach if both items are dimensionalized. Otherwise the row count gets out of hand quickly.

    As for the dates, if it's an accumulating snapshot fact table (and it sounds as if it is), lots of dates are not uncommon. In fact, it's a typical characteristic. I've designed ones that captured the major milestones: create date/create time, customer contact date/customer contact time, scheduled date/scheduled time, closed date/closed time. Not all the dates had time dimensions associated with them. Not all statuses have independent columns, however. There are two dates that map to 'N/A' and 'Unknown.' An open order does not have a closed date (NULL in the OLTP system, but 'N/A' in the warehouse). If it's in one of the closed statuses and does not have a closed date, the NULL value is then mapped to Unknown. These are the only two dates with a string that is not a date. All the other dates have string values that display the date.

    Based on what you're saying, on its face it doesn't sound wrong. Without reviewing the data that can't be known for certain. Hopefully this helps you decide. You could always try a fact table that has a granularity of item status change. If that meets your reporting needs it would save a lot of space. But that would make some things harder to report on than your current model.
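
    A stripped-down accumulating-snapshot layout along those lines might look like this (names are invented; here the -1 and -2 surrogate keys stand in for the 'Unknown' and 'N/A' rows of the date dimension):

    CREATE TABLE dbo.DimDate
    (
        DateKey   INT         NOT NULL PRIMARY KEY,  -- e.g. 20180202; -1 = 'Unknown', -2 = 'N/A'
        DateLabel VARCHAR(20) NOT NULL
    );

    -- One row per order, updated as it passes each milestone.
    CREATE TABLE dbo.FactOrderSnapshot
    (
        OrderID                INT NOT NULL PRIMARY KEY,
        CreateDateKey          INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        CustomerContactDateKey INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        ScheduledDateKey       INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        ClosedDateKey          INT NOT NULL REFERENCES dbo.DimDate (DateKey)  -- -2 ('N/A') while the order is still open
    );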

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    February 2, 2018 at 12:12 pm

    #1978401

    Oh... I already reverse engineered it and unpivoted it back to a vertical layout for my data model. There are many more rows, but the import and rendering in Power BI work significantly faster/better. This added overhead worries me, though, should they decide to change something in the base layer; I'll have to chase it down in my reverse engineering again.

    I also took out the fake nulls and the data size dropped significantly. I handle the NULLs on the application side, converting NULL to either "not applicable" or "not available". Moving less data helped the import process... there was a 40% data saving by removing the fake nulls. It was tricky to determine when n/a was actually a NULL versus when the user did not answer the question: "not available" versus "not answered".
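
    A cut-down version of that unpivot, assuming the column naming pattern from the example table above, that the date columns are a date type and the time columns hold seconds after midnight (adjust the conversions if the real types differ), and filtering out the 1900-01-01 fake nulls:

    SELECT  f.ID,
            s.StatusName,
            DATEADD(SECOND, s.StatusTime, CAST(s.StatusDate AS DATETIME2(0))) AS StatusDT
    FROM    dbo.FactTable AS f
    CROSS APPLY (VALUES ('Status1', f.Status1Date, f.Status1Time),
                        ('Status2', f.Status2Date, f.Status2Time),
                        ('Status3', f.Status3Date, f.Status3Time),
                        ('Status4', f.Status4Date, f.Status4Time)
                ) AS s (StatusName, StatusDate, StatusTime)
    WHERE   s.StatusDate <> '19000101';  -- drop the fake-null placeholder rows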


As someone deeply immersed in the realm of SQL Server development and T-SQL, I can attest to the complexities and challenges associated with database design, especially when it comes to handling a vast number of columns. The discussion revolves around a table with almost 900 columns, sparking a debate on whether such a design is optimal and the potential issues it may introduce.

The primary concern raised by the original poster, Kevlarmpowered, is the impact of having an extensive number of columns on SQL Server performance, particularly in high-traffic scenarios where locking becomes a significant consideration. Jeff Moden, an experienced contributor, introduces the concept of normalization and emphasizes that there's nothing inherently wrong with a table having many columns if normalization rules are appropriately applied. However, he acknowledges that such tables are rare in practice.

Grant Fritchey supports the idea that the right number of columns is determined by the needs of the object stored in the database. He highlights the importance of proper table structure and normalization, suggesting that the described table with 900 columns may indicate a structural issue rather than a problem with the number of columns per se.

Eirikur Eiriksson agrees with Jeff and Grant, emphasizing that the key is to have the number of columns needed to support the object stored, whether in a normalized or star schema. He acknowledges exceptions, such as Call Detail Records (CDRs), where a large number of columns may be justifiable.

T.Ashish raises concerns about the potential impact on performance, particularly with a table of 900 columns having numerous indexes and a complex primary key. Grant Fritchey responds that the behaviour is what you would expect: the data must be written once to the store (heap or clustered index) and then to each applicable index, and any of those writes can add new pages or cause page splits.

The discussion delves into the importance of proper database design, normalization, and the impact on performance. Jeff Moden also suggests the use of Entity-Attribute-Value (EAV) tables for scenarios with a high number of NULLs in certain columns, despite the common criticism of EAVs.

Kevlarmpowered provides context about the origin of the table, explaining that it was built as a data warehouse table to simplify queries for end-users who were discouraged from writing SQL. This led to the creation of tables with numerous columns and data denormalization for ease of querying.

The conversation also touches upon the idea of "flat tables are faster" being a myth, with Grant Fritchey expressing frustration about the misconception. He suggests exploring columnstore indexes for the table in question.

RonKyle contributes by suggesting that the wide table described may be a fact table but questions its normalization and structure. Kevlarmpowered further explains the nature of the table, detailing the columns dedicated to status and the challenges of dealing with numerous datetime columns.

In summary, the experts in the discussion emphasize the importance of proper database design, normalization, and considering the specific needs of the stored objects when deciding on the number of columns. The challenges of performance, especially in scenarios with a large number of columns, are acknowledged, and alternative approaches, such as EAV tables or columnstore indexes, are suggested based on specific use cases.
