When is too many columns too many columns? – SQLServerCentral Forums (2024)

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    January 12, 2018 at 3:25 pm

    #334478

    I know you can have some insane number of columns in SQL Server... but just because you could, doesn't mean you should. Right now I am battling with a table that has almost 900 columns. 34 of those columns are dedicated to integer values which define which user is assigned to which part of the process. 280 columns are dedicated to statuses of the particular thing. It is actually 140 datetime values, but when it was built, each datetime was split into two columns, one for date and one for time.

    Locking is the first thing that comes to mind, as a high-traffic table is going to lock the entire thing up if you are updating one or many of the columns...

  • Jeff Moden

    SSC Guru

    Points: 1003577

    More actions

    January 12, 2018 at 6:29 pm

    #1975820

    Kevlarmpowered - Friday, January 12, 2018 3:25 PM

    I know you can have some insane number of columns in SQL Server... but just because you could, doesn't mean you should. Right now I am battling with a table that has almost 900 columns. 34 of those columns are dedicated to integer values which define which user is assigned to which part of the process. 280 columns are dedicated to statuses of the particular thing. It is actually 140 datetime values, but when it was built, each datetime was split into two columns, one for date and one for time.

    Locking is the first thing that comes to mind, as a high-traffic table is going to lock the entire thing up if you are updating one or many of the columns...

    There's actually nothing wrong with a 900 column table IF the rules of normalization were applied... which also usually means that you won't have a table with 900 columns. 😉

    I've worked with tables with similar insane numbers of columns. Most of those had to deal with CDRs (Call Detail Records). For those, it made no sense to split the columns off into separate tables.

    If your status/datetime columns have a lot of NULLs in them, it may be worthwhile to build a "sister" table in the form of an EAV to store those statuses. A whole lot of people will tell you that EAVs are the wrong way to go, but this is one of those places where they can really shine if you put some thought into it.
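
    For illustration only, a minimal sketch of what such an EAV "sister" table might look like (the table and column names here are invented, not taken from the actual schema):

    -- One row per status that actually occurred, instead of ~280 mostly-empty columns.
    -- Statuses that never happened simply have no row, so no fake 1900-01-01 dates are needed.
    CREATE TABLE dbo.ThingStatus
    (
        ThingID    INT          NOT NULL,  -- key of the row in the wide table
        StatusCode VARCHAR(50)  NOT NULL,  -- which of the ~140 statuses this row represents
        StatusDT   DATETIME2(0) NOT NULL,  -- single datetime instead of split date/time columns
        CONSTRAINT PK_ThingStatus PRIMARY KEY CLUSTERED (ThingID, StatusCode)
    );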

    --Jeff Moden

    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Grant Fritchey

    SSC Guru

    Points: 398535

    More actions

    January 15, 2018 at 7:33 am

    #1975895

    I'm with Jeff. The right number of columns is the number of columns needed to support the object stored inside the database, whether normalized or in a star schema, whatever. There isn't a number that's too many.

    That said, based on your description, it sounds like you're not dealing with a properly structured table. 180 columns to define a user and 280 columns to define a thing... That can't possibly be normalized or a fact table. It sounds like multiple functions have been jammed into a single table. In that case, it's not that there are too many columns, but that the structure is wrong. Bad structures will absolutely cause performance headaches, maintenance problems and even development and testing problems. Just overloading a column causes problems, let alone overloading a table.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2022 Query Performance Tuning, 6th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Eirikur Eiriksson

    SSC Guru

    Points: 182862

    More actions

    Grant Fritchey - Monday, January 15, 2018 7:33 AM

    I'm with Jeff. The right number of columns is the number of columns needed to support the object stored inside the database, whether normalized or in a star schema, whatever. There isn't a number that's too many.

    That said, based on your description, it sounds like you're not dealing with a properly structured table. 180 columns to define a user and 280 columns to define a thing... That can't possibly be normalized or a fact table. It sounds like multiple functions have been jammed into a single table. In that case, it's not that there are too many columns, but that the structure is wrong. Bad structures will absolutely cause performance headaches, maintenance problems and even development and testing problems. Just overloading a column causes problems, let alone overloading a table.

    +100
    😎

    There are exceptions to this though, e.g. CDRs, surveys, etc. In those cases I'd recommend looking into the sparse column feature and its limitations.
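
    As a rough sketch of the syntax (column names are made up), SPARSE makes NULLs nearly free to store at the cost of slightly more expensive non-NULL values, and an optional column set lets you work with all the sparse columns generically:

    CREATE TABLE dbo.Survey
    (
        SurveyID      INT IDENTITY(1,1) PRIMARY KEY,
        Question1Date DATE    SPARSE NULL,  -- mostly NULL, so SPARSE saves space
        Question1Time TIME(0) SPARSE NULL,
        Question2Date DATE    SPARSE NULL,
        Question2Time TIME(0) SPARSE NULL,
        AllAnswers    XML COLUMN_SET FOR ALL_SPARSE_COLUMNS  -- optional: exposes the sparse columns as one XML value
    );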

  • T.Ashish

    SSCarpal Tunnel

    Points: 4248

    More actions

    January 15, 2018 at 11:31 pm

    #1975957

    A 900-column table must have a lot of indexes and a complex primary key. What will happen to Update and Delete performance on such tables?

  • Grant Fritchey

    SSC Guru

    Points: 398535

    More actions

    January 16, 2018 at 6:13 am

    #1975974

    T.Ashish - Monday, January 15, 2018 11:31 PM

    A 900-column table must have a lot of indexes and a complex primary key. What will happen to Update and Delete performance on such tables?

    Lots of activity.

    Seriously though, pretty much what you would expect. You have to write the data once to the store (heap or clustered) and then you have to do the appropriate writes to each applicable index. Any of those writes could lead to the addition of new pages or page splits, with a whole bunch more writes. You do that every single time.

    I've said it before, I'll say it again: a properly normalized database enhances performance within an RDBMS. If you really only need a flat file, use a flat file storage system or a document database, not an RDBMS. An RDBMS will work better with relational data than with non-relational data, and better with a star schema than with a flat file layout. The whole "flat tables are faster" myth makes me insane. It generally comes from "flat tables are faster than my horrifically bad normalization/poor choice in fact table definition." Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2022 Query Performance Tuning, 6th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Jeff Moden

    SSC Guru

    Points: 1003577

    More actions

    January 16, 2018 at 7:20 am

    #1975980

    Grant Fritchey - Tuesday, January 16, 2018 6:13 AM

    If you really only need a flat file, use a flat file storage system or a document database, not an RDBMS.

    I'll say, "It Depends". Such a thing is certainly better than storing large XML or JSON junk and may be entirely necessary because an RDBMS will still handle files (once imported into a table) better than a whole lot of other things especially if it is data such as CDRs or similar information.

    Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    +1000 to that!

    --Jeff Moden

    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden

    SSC Guru

    Points: 1003577

    More actions

    January 16, 2018 at 7:25 am

    #1975982

    Sorry... posted to wrong thread on this post and moved it.

    --Jeff Moden

    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    January 16, 2018 at 12:43 pm

    #1976044

    Thanks all for your input... unfortunately, I didn't build these tables. They were built as data warehouse tables because they didn't want users writing SQL, so to make life easier, they made tables that had lots of columns so end users could basically select * from factTable where createDate >= dateadd(year,-2,getdate()). To avoid having to do "joins" they ETL everything into single tables for users too. So there is lots of repetition. This table likely has 900 columns of which a bunch are duplicated in other tables... but because they say "joins are bad" they ETL everything into one giant flat table. Sometimes it's values... sometimes it is just a flag 0 or 1.

    My thought on tackling this was to create a sister table with a 1:1 relationship with the original fact table so that I could offload the user assignment and date and time columns to a separate table, but that's still pushing 314 columns into a table. My next thought was to pivot those columns and have a table dedicated to statuses. So for every "thing" I would have a 1:many relationship. Each "thing" could have multiple users assigned (bonus: it would allow multiple people with the same role to be assigned to a thing, which can't be done now -- if multiple people are assigned to a thing, the most recent assignment overwrites the existing assignment). As it stands, eventually all of those 34 user assignment columns will be populated and each of those 280 date and time columns will have values in them as well.
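
    Something like the following is roughly what that vertical design could look like (all names here are invented for illustration):

    -- One row per status event instead of 280 date/time columns.
    CREATE TABLE dbo.ThingStatusEvent
    (
        ThingID  INT          NOT NULL,
        StatusID SMALLINT     NOT NULL,  -- lookup to a status reference table
        StatusDT DATETIME2(0) NOT NULL,
        CONSTRAINT PK_ThingStatusEvent PRIMARY KEY CLUSTERED (ThingID, StatusID, StatusDT)
    );

    -- One row per user assignment instead of 34 user columns; keeping every row
    -- also preserves history instead of overwriting the previous assignment.
    CREATE TABLE dbo.ThingAssignment
    (
        ThingID    INT          NOT NULL,
        RoleID     SMALLINT     NOT NULL,
        UserID     INT          NOT NULL,
        AssignedDT DATETIME2(0) NOT NULL,
        CONSTRAINT PK_ThingAssignment PRIMARY KEY CLUSTERED (ThingID, RoleID, AssignedDT)
    );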

    I was just trying to wrap my head around this because yes, you can have 1000+ columns per table... but my brain says it's a bad idea because of locking. As the activity on the table grows, each update statement is going to lock the entire row. If we are rapidly reassigning and tagging dates/times, that's going to be a problem with growth. Going vertical would increase the number of rows, but would lessen the row locking as things are inserted/updated.

    For explanation purposes... a "thing" is assigned to a user. A user can do many different steps, which is why there are 34 user assignment columns and 240 date and time columns. One user could do 10-12 steps to a particular "thing".

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    February 1, 2018 at 3:25 pm

    #1978277

    Grant Fritchey - Tuesday, January 16, 2018 6:13 AM

    The whole "flat tables are faster" myth makes me insane. It generally comes from "flat tables are faster than my horrifically bad normalization/poor choice in fact table definition." Yes, writing mediocre code runs faster than writing bad code. Well done. What if we actually write good code?

    I'm 100% going to use this as I came across a table today with 1060 columns...

    I think if I recommend abandoning our RDBMS for a flat file system, my mugshot will end up on a wall with the title "wanted dead or alive".

  • Grant Fritchey

    SSC Guru

    Points: 398535

    More actions

    February 2, 2018 at 7:14 am

    #1978347

    You might want to explore putting this into a columnstore index, see if that helps. It really depends on the types of queries being run against it, but pivoted storage, which you're looking at anyway, plus compression... it might make a big difference (might not).
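
    For what it's worth, trying that against a copy of the table is a one-liner (the table and column names below are assumed). A clustered columnstore replaces the rowstore entirely, while a nonclustered columnstore can be layered on top of the existing table for reporting queries:

    -- Option 1: convert the whole table to columnstore storage
    -- (an existing clustered rowstore index has to be dropped or converted first).
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactTable ON dbo.FactTable;

    -- Option 2: keep the rowstore and add a columnstore over the columns used for reporting.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactTable_Reporting
        ON dbo.FactTable (CreateDate, Status1Date, Status1Time, Status2Date, Status2Time);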

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2022 Query Performance Tuning, 6th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • RonKyle

    SSC-Dedicated

    Points: 31531

    More actions

    February 2, 2018 at 8:29 am

    #1978355

    That can't possibly be normalized or a fact table.

    Without a review of the data, this cannot be definitively established. That said, that is a lot of columns. I've never had anything close to that.

    If it is a fact table, which isn't clear to me that it is, and it is otherwise correctly modelled, the number of columns could be reduced with the creation of one or more junk dimensions. If it's an OLTP table and normal forms have been followed, and there are columns that don't apply to many or most records, these could be broken out into a smaller table. [The theoretical example of a company car comes to mind: if most people don't have one, but no one has more than one, it's probably better that this is its own table even though it could be in the employee table.]
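
    As a toy example of that company-car split (purely illustrative names): only employees who actually have a car get a row in the side table, instead of mostly-NULL car columns on the employee table.

    CREATE TABLE dbo.Employee
    (
        EmployeeID INT IDENTITY(1,1) PRIMARY KEY,
        FullName   NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.EmployeeCompanyCar
    (
        EmployeeID   INT NOT NULL
            CONSTRAINT PK_EmployeeCompanyCar PRIMARY KEY  -- at most one car per employee (1:0..1)
            CONSTRAINT FK_EmployeeCompanyCar_Employee REFERENCES dbo.Employee (EmployeeID),
        PlateNumber  NVARCHAR(20) NOT NULL,
        AssignedDate DATE NOT NULL
    );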

    As you say, you didn't build the tables, so you may be stuck. Hopefully the table has a proper key and you can walk through the normal forms if it's an OLTP table or establish that it's all at the same level of granularity if an OLAP table.

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    February 2, 2018 at 11:27 am

    #1978396

    It is a fact table...

    Before I came across this, I had never seen a table as wide as this, so I just wanted to consult with the online community to see what the experts say and whether I was crazy or there was something I was missing. I personally would have broken it down to be more normalized. For example, there are 200 columns dedicated to status... each "thing" could have 100 statuses defined by a datetime (but since they split the datetime into two columns, one date and one time... it's 200 columns). If the "thing" does not reach that status, they populate the columns with a fake null 19000101 date and 000000 time. If the "thing" reaches the status, the date and time columns get populated. If the thing hits a status twice, the newer status overwrites the older status date and time columns.

    Imagine this over 200 columns:

    ID     Status1Date  Status1Time  Status2Date  Status2Time  Status3Date  Status3Time  Status4Date  Status4Time
    12345  20180202     27900        19000101     0            19000101     0            20170715     0
  • RonKyle

    SSC-Dedicated

    Points: 31531

    More actions

    February 2, 2018 at 11:45 am

    #1978398

    Breaking out the date and time is the correct approach if both items are dimensionalized. Otherwise the row count gets out of hand quickly.

    As for the dates, if it's an accumulating snapshot fact table (and it sounds as if it is), lots of dates are not uncommon. In fact, it's a typical characteristic. I've designed ones that captured the major milestones: create date/create time, customer contact date/customer contact time, scheduled date/scheduled time, closed date/closed time. Not all the dates had time dimensions associated with them. Not all statuses have independent columns, however. There are two dates that map to 'N/A' and 'Unknown.' An open order does not have a closed date (NULL in the OLTP system, but 'N/A' in the warehouse). If it's in one of the closed statuses and does not have a closed date, the NULL value is then mapped to Unknown. These are the only two dates with a string that is not a date. All the other dates have string values that display the date.

    Based on what you're saying, on its face it doesn't sound wrong. Without reviewing the data that can't be known for certain. Hopefully this helps you decide. You could always try a fact table that has a granularity of item status change. If that meets your reporting needs it would save a lot of space. But that would make some things harder to report on than your current model.
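
    A stripped-down accumulating-snapshot layout along those lines might look like this (names are invented; here the -1 and -2 surrogate keys stand in for the 'Unknown' and 'N/A' rows of the date dimension):

    CREATE TABLE dbo.DimDate
    (
        DateKey   INT         NOT NULL PRIMARY KEY,  -- e.g. 20180202; -1 = 'Unknown', -2 = 'N/A'
        DateLabel VARCHAR(20) NOT NULL
    );

    -- One row per order, updated as it passes each milestone.
    CREATE TABLE dbo.FactOrderSnapshot
    (
        OrderID                INT NOT NULL PRIMARY KEY,
        CreateDateKey          INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        CustomerContactDateKey INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        ScheduledDateKey       INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        ClosedDateKey          INT NOT NULL REFERENCES dbo.DimDate (DateKey)  -- -2 ('N/A') while the order is still open
    );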

  • Kevlarmpowered

    Say Hey Kid

    Points: 713

    More actions

    February 2, 2018 at 12:12 pm

    #1978401

    Oh... I already reverse engineered it and unpivoted it back to a vertical layout for my data model. There are many more rows, but the import and rendering in Power BI work significantly faster/better. This added overhead worries me, though, should they decide to change something in the base layer; I'll have to chase it down in my reverse engineering again.

    I also took out the fake nulls and the data size dropped significantly. I handle the NULLs on the application side, converting NULL to either "not applicable" or "not available". Moving less data helped the import process... there was a 40% data saving by removing the fake nulls. It was tricky to determine when n/a was actually a NULL versus when the user did not answer the question: "not available" versus "not answered".
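
    A cut-down version of that unpivot, assuming the column naming pattern from the example table above, that the date columns are a date type and the time columns hold seconds after midnight (adjust the conversions if the real types differ), and filtering out the 1900-01-01 fake nulls:

    SELECT  f.ID,
            s.StatusName,
            DATEADD(SECOND, s.StatusTime, CAST(s.StatusDate AS DATETIME2(0))) AS StatusDT
    FROM    dbo.FactTable AS f
    CROSS APPLY (VALUES ('Status1', f.Status1Date, f.Status1Time),
                        ('Status2', f.Status2Date, f.Status2Time),
                        ('Status3', f.Status3Date, f.Status3Time),
                        ('Status4', f.Status4Date, f.Status4Time)
                ) AS s (StatusName, StatusDate, StatusTime)
    WHERE   s.StatusDate <> '19000101';  -- drop the fake-null placeholder rows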


As someone deeply immersed in the realm of SQL Server development and T-SQL, I can attest to the complexities and challenges associated with database design, especially when it comes to handling a vast number of columns. The discussion revolves around a table with almost 900 columns, sparking a debate on whether such a design is optimal and the potential issues it may introduce.

The primary concern raised by the original poster, Kevlarmpowered, is the impact of having an extensive number of columns on SQL Server performance, particularly in high-traffic scenarios where locking becomes a significant consideration. Jeff Moden, an experienced contributor, introduces the concept of normalization and emphasizes that there's nothing inherently wrong with a table having many columns if normalization rules are appropriately applied. However, he acknowledges that such tables are rare in practice.

Grant Fritchey supports the idea that the right number of columns is determined by the needs of the object stored in the database. He highlights the importance of proper table structure and normalization, suggesting that the described table with 900 columns may indicate a structural issue rather than a problem with the number of columns per se.

Eirikur Eiriksson agrees with Jeff and Grant, emphasizing that the key is to have the number of columns needed to support the object stored, whether in a normalized or star schema. He acknowledges exceptions, such as Call Detail Records (CDRs), where a large number of columns may be justifiable.

T.Ashish raises concerns about the potential impact on performance, particularly with a table of 900 columns having numerous indexes and a complex primary key. Grant Fritchey responds that the behaviour is what you would expect: the data must be written once to the store (heap or clustered index) and then to each applicable index, and any of those writes can add new pages or cause page splits.

The discussion delves into the importance of proper database design, normalization, and the impact on performance. Jeff Moden also suggests the use of Entity-Attribute-Value (EAV) tables for scenarios with a high number of NULLs in certain columns, despite the common criticism of EAVs.

Kevlarmpowered provides context about the origin of the table, explaining that it was built as a data warehouse table to simplify queries for end-users who were discouraged from writing SQL. This led to the creation of tables with numerous columns and data denormalization for ease of querying.

The conversation also touches upon the idea of "flat tables are faster" being a myth, with Grant Fritchey expressing frustration about the misconception. He suggests exploring columnstore indexes for the table in question.

RonKyle contributes by suggesting that the wide table described may be a fact table but questions its normalization and structure. Kevlarmpowered further explains the nature of the table, detailing the columns dedicated to status and the challenges of dealing with numerous datetime columns.

In summary, the experts in the discussion emphasize the importance of proper database design, normalization, and considering the specific needs of the stored objects when deciding on the number of columns. The challenges of performance, especially in scenarios with a large number of columns, are acknowledged, and alternative approaches, such as EAV tables or columnstore indexes, are suggested based on specific use cases.
