What Do You Mean, SQL Can't Do Big Data? (2024)

You keep reading and hearing that SQL is not suitable for developing big data systems. It supposedly doesn’t have the performance or scalability that big data systems require. Some even define the term big data by stating that big data is data that’s too big for SQL.

These authors and speakers don’t always say SQL; some use the term relational instead. It’s incorrect to use the two terms ‘relational’ and ‘SQL’ interchangeably. First of all, few relational systems exist that don’t support SQL, so to which products are they referring? Secondly, most SQL systems are not 100% relational, but let’s not get into that old discussion. They say relational, but they mean SQL.

Here are a couple of such quotes:

Let’s be clear: these statements make no sense, for several reasons.

First of all, SQL is a language, not a product. A language doesn’t have performance, scalability, or a price. A specific SQL product has a performance level and may or may not have problems supporting big data. For example, some SQL products, such as SQLite, have a very small footprint, which makes them suitable for running on small devices. Such SQL systems are definitely not built for big data systems. But at the other end of the scale there are SQL systems developed for storing and analyzing big data, such as Amazon Redshift, Exasol, HP/Vertica, IBM PureData Systems for Analytics (Netezza), Kognitio, and the Teradata databases and Teradata Aster.
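To make the language-versus-product distinction concrete, here is a minimal sketch in Python. It assumes a hypothetical `sales` table and sample rows invented for illustration, and it uses SQLite only because it ships with Python. The query text is plain SQL; a large-scale SQL product would be expected to accept much the same statement, although dialects differ in details.

```python
import sqlite3

# The query is ordinary SQL text. Nothing in it says which product runs it.
# Table name, column names, and data are hypothetical, for illustration only.
QUERY = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
"""


def run_query(conn):
    """Run the aggregate query on a DB-API 2.0 connection and return all rows."""
    cur = conn.cursor()
    cur.execute(QUERY)
    return cur.fetchall()


def demo():
    """Load a tiny in-memory SQLite database and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (?, ?)",
        [("east", 10), ("west", 5), ("east", 7)],
    )
    rows = run_query(conn)
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())  # [('east', 17), ('west', 5)]
```

Performance and scalability are properties of whatever engine sits behind `conn`, not of the text in `QUERY`.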

Second, SQL systems have proven that they can be used in big data systems. For example, the amount of data eBay processes every day adds up to an astonishing 50 petabytes. And they use Teradata. There are many more organizations that use SQL products to run big data systems.

Evidently, there are big data use cases for which specific SQL products are not the right data storage technology and where, for example, Hadoop or NoSQL products make more sense. But the opposite is true as well: for some big data use cases a specific SQL product is preferred.

The point is that you can’t make those kinds of generalized remarks about SQL, just as you can’t say that movies are too long or that books are difficult to read. You have to be specific: you have to indicate which products you mean. Not all SQL products are created equal.

And don’t forget that more and more massive big data systems developed on Hadoop use a SQL-on-Hadoop engine, such as Apache Hive or Impala. Isn’t a SQL-on-Hadoop engine running on Hadoop a SQL product? A study by TDWI indicated that 28% of organizations already use Hadoop and 22% use SQL-on-Hadoop. In addition, 36% of organizations plan to use Hadoop within three years, and approximately the same percentage plan to use a SQL-on-Hadoop engine. In other words, thanks to these SQL-on-Hadoop engines, more and more SQL products are used for running big data systems.

Let’s stop saying that SQL is not suitable for big data. It’s a pointless statement. Big data systems can be developed, and are being developed, with SQL products. Alongside NoSQL products and Hadoop, SQL products are an effective technology for supporting big data systems.

SQL vs. Relational Systems:

The article points out that critics conflate SQL with relational databases. SQL is a language for interacting with databases and is not tied exclusively to relational databases. While most relational databases support SQL, not all SQL-based systems are strictly relational. Some systems offer a SQL interface but use alternative data storage and retrieval methods.

Performance and Scalability:

The argument that relational databases or SQL are inherently incapable of handling big data due to performance or scalability issues is inaccurate. The capability of handling large volumes of data and scaling effectively depends on the specific implementation of the SQL-based system or database product being used. Numerous SQL-based systems have been engineered explicitly for big data processing, providing high performance and scalability.

SQL Products for Big Data:

Several SQL-based systems, such as Amazon Redshift, Teradata, and the others mentioned above, are purpose-built for managing and analyzing massive volumes of data. These systems offer optimized architectures, distributed computing, and parallel processing capabilities tailored for big data workloads.

Real-World Examples:

eBay, which processes vast amounts of data with Teradata, shows that SQL-based systems can handle enormous datasets effectively. Many other organizations also use SQL-based systems for their big data operations.

SQL in the Context of Hadoop:

The emergence of SQL-on-Hadoop engines, such as Apache Hive and Impala, further blurs the line between traditional SQL and big data ecosystems. These engines enable SQL querying on data stored in Hadoop, making SQL an integral part of big data processing.

Evolving Trends:

The TDWI study cited above indicates that 22% of organizations already use a SQL-on-Hadoop engine and that roughly 36% plan to use one within three years. This trend underscores the growing relevance of SQL-based systems in the big data landscape.

Conclusion:

In essence, the notion that SQL is unsuitable for big data systems lacks nuance. SQL-based systems, including purpose-built solutions and SQL engines on Hadoop, are viable options for managing and processing large-scale datasets. The choice of technology depends on specific use cases, and SQL has increasingly become an effective component in the toolkit for building and managing big data systems alongside NoSQL databases and Hadoop.

This analysis underscores the importance of understanding the nuanced landscape of database technologies, debunking the blanket statement that SQL is unfit for big data and emphasizing the role of SQL-based systems in supporting diverse big data use cases.