Deleting a Large Number of Records in SQL Tables

Hello there… Have you ever had to delete a few million records from a table in a SQL Server database? If you ever find yourself in that situation, there are a few facts you should know about what you should and should not do. Let’s dive right in.

If someone asks you to delete all expired data based on the “Expire_Time” column, the first option that comes to mind is to execute a DELETE query with a WHERE clause like the one below.

DELETE FROM MySQLTable WHERE (DATE(Expire_Time) < DATE(CURRENT_TIMESTAMP()));

But if your table has a large number of expired rows, this is the worst method to delete them. It is obviously fine for deleting a few rows ;) Let’s discuss what causes it to be the worst method for deleting a large number of rows from a SQL table.

  • Updating or deleting a large number of rows in MySQL will create locks that paralyze other queries running in parallel. The lock mechanism is used by the database engine to ensure data integrity when multiple users want to access the same data at the same time. During a delete operation, the lock manager locks the necessary objects, and the types of locks taken depend on the isolation level of the transactions.
  • Deleting millions of rows without enough transaction log space can cause all sorts of problems in trying to complete the deletion job. Most databases are sized with enough disk space to handle the typical cross-section of DML (Data Manipulation Language) activity on the database. If you attempt to delete a large number of rows in one transaction, the DELETE job will keep filling the transaction log until the DELETE is committed. There are workarounds to the transaction log growing and filling the disks: you can provision extra disk space or create additional transaction log files. But you may find these steps don’t solve the problem.
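
If you want to see how close the transaction log is to filling up before or during a big delete, SQL Server’s DBCC SQLPERF command reports log size and usage per database:

-- Report transaction log size and percentage used for every database on the instance
DBCC SQLPERF(LOGSPACE);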

Anyway, if you try the DELETE query above on a table with millions of expired rows, the server will hang for hours with no indication of how much progress it has made.

As far as I know, the best method to delete millions of rows from a SQL table is to complete the DELETE in batches. By deleting in batches, the job uses smaller amounts of log space, committing and flushing to disk as it goes. You can keep reusing the same disk space for the log by setting the database to the BULK-LOGGED recovery model. The following MS SQL Server query shows a very simple example of batch-wise deletion of rows in a table.

This MS SQL Server query performs the following five steps; a sketch of it appears after the list.

  1. Set a cutoff date (only records from before this date will be deleted)
  2. Loop until all records from before the cutoff date have been deleted
  3. Delete the oldest 10,000 records from the table
  4. Info log to show the progress
  5. Go back to step 2
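
The query itself appears as a screenshot in the original post; here is a hedged sketch of the same idea, assuming a hypothetical MySQLTable with an Expire_Time column and a batch size of 10,000 rows.

-- Step 1: set a cutoff date; only records older than this will be deleted
DECLARE @cutoffDate DATETIME = GETDATE();
DECLARE @rowsDeleted INT = 1;

-- Steps 2 and 5: loop until no expired records remain
WHILE @rowsDeleted > 0
BEGIN
    -- Step 3: delete up to 10,000 expired records
    -- (TOP without ORDER BY removes an arbitrary 10,000 of the matching rows)
    DELETE TOP (10000) FROM MySQLTable
    WHERE Expire_Time < @cutoffDate;

    SET @rowsDeleted = @@ROWCOUNT;

    -- Step 4: info log to show the progress
    PRINT CONCAT('Deleted a batch of ', @rowsDeleted, ' rows');
END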

One of the advantages of this method is that you can cancel query execution at any time, and the job can resume later from where it left off. Keeping a backup table for the deleted rows is always a wise idea💡. Once you have confirmed that the previous cleanup didn’t remove any required data, you can drop the backup table before starting the next cleanup, so it is ready to receive the next backup.

Note: If your DB has expired data rows that should be deleted, you will keep getting more expired rows in the future. As a best practice, if you expect expired data in a table, write a stored procedure to clean it up and schedule that job to run weekly or monthly from the time the table is created.
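
For example, on SQL Server the schedule can be set up with a SQL Server Agent job. This is only a hedged sketch: the procedure name CLEANUP_EXPIRED_DATA_SP, the database name, and the schedule values are placeholders for whatever fits your environment.

USE msdb;
GO
-- Create a job that runs a (hypothetical) cleanup stored procedure
EXEC dbo.sp_add_job @job_name = N'WeeklyExpiredDataCleanup';

EXEC dbo.sp_add_jobstep
    @job_name = N'WeeklyExpiredDataCleanup',
    @step_name = N'Run cleanup procedure',
    @subsystem = N'TSQL',
    @command = N'EXEC dbo.CLEANUP_EXPIRED_DATA_SP;',
    @database_name = N'MyDatabase';

-- Run it every Sunday at 02:00
EXEC dbo.sp_add_jobschedule
    @job_name = N'WeeklyExpiredDataCleanup',
    @name = N'WeeklySunday',
    @freq_type = 8,              -- weekly
    @freq_interval = 1,          -- Sunday
    @freq_recurrence_factor = 1, -- every week
    @active_start_time = 020000; -- 02:00:00

EXEC dbo.sp_add_jobserver @job_name = N'WeeklyExpiredDataCleanup';
GO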

Recently, I noticed that many people find this tutorial useful but they hesitate to clap :p

And… make sure you have already followed me before jumping into the next topic ;) If any of the images in the original post are not readable, you can visit my personal blog (ravinduramesh.blogspot.com) to read the same article with clearer images and text.

Let’s check a data cleanup stored procedure I wrote in T-SQL (Transact-SQL) to clean expired JWT IDs in the IDN_OIDC_JTI table of WSO2 Identity Server. The related issue for this cleanup query is “IDN_OIDC_JTI Table is Growing”. Read the issue first, because it will help you understand the code better. Click here to view that open-source data cleanup stored procedure I wrote in T-SQL. Go through it carefully, and then we can discuss what it does step by step.


First, it declares some variables and sets values for them. Some important variables are:

  • chunkSize — The number of rows in each chunk, which will be divided into batches later.
  • batchSize — The number of rows in each batch, kept small to avoid table locks.
  • checkCount — Terminates the cleanup process when the last chunk’s row count is less than this value.
  • backupTables — Indicates whether the cleaned data should be backed up.
  • sleepTime — The duration to sleep after each batch deletion to avoid table locks.
  • enableAudit — Indicates whether to keep track of all the deleted tokens.
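
As a minimal sketch, the declarations could look something like this in T-SQL (the actual values and data types in the procedure may differ):

-- Tuning parameters for the cleanup (illustrative values only)
DECLARE @chunkSize INT = 500000;            -- rows selected into each chunk
DECLARE @batchSize INT = 10000;             -- rows deleted per batch within a chunk
DECLARE @checkCount INT = 100;              -- stop when the last chunk holds fewer rows than this
DECLARE @backupTables BIT = 1;              -- 1 = back up the data before deleting it
DECLARE @sleepTime VARCHAR(8) = '00:00:02'; -- delay between batches (used with WAITFOR DELAY)
DECLARE @enableAudit BIT = 0;               -- 1 = record deleted JWT IDs in the audit table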

Then the backup process starts. The loop “WHILE @@FETCH_STATUS = 0” iterates as long as the previous fetch, which reads the next table name from @backupTablesCursor, was successful. Inside the loop, the script performs the table backup operations. First, it checks whether a backup table already exists for the table name held in @cusrBackupTable; if it does, it generates a DROP TABLE statement to remove it. The script then builds a dynamic SQL statement (@SQL) that creates the backup table with SELECT INTO, copying all the data from the original table (@cusrBackupTable) into the backup table (@backupTable). After that backup is completed, “FETCH NEXT FROM backupTablesCursor INTO @cusrBackupTable” is executed again to fetch the next table name, and the loop continues. Once the loop finishes, the cursor is closed.
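
A simplified sketch of that loop, assuming backupTablesCursor is already declared over the list of table names and that backup tables are named by prefixing BAK_:

DECLARE @cusrBackupTable SYSNAME, @backupTable SYSNAME, @SQL NVARCHAR(MAX);

OPEN backupTablesCursor;
FETCH NEXT FROM backupTablesCursor INTO @cusrBackupTable;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @backupTable = 'BAK_' + @cusrBackupTable;

    -- Drop any backup table left over from a previous run
    IF EXISTS (SELECT 1 FROM sys.tables WHERE name = @backupTable)
    BEGIN
        SET @SQL = 'DROP TABLE dbo.' + QUOTENAME(@backupTable);
        EXEC sp_executesql @SQL;
    END;

    -- Copy all rows from the original table into the backup table
    SET @SQL = 'SELECT * INTO dbo.' + QUOTENAME(@backupTable)
             + ' FROM dbo.' + QUOTENAME(@cusrBackupTable);
    EXEC sp_executesql @SQL;

    FETCH NEXT FROM backupTablesCursor INTO @cusrBackupTable;
END;

CLOSE backupTablesCursor;
DEALLOCATE backupTablesCursor;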


Then we come to the audit log table creation section. It checks whether a table named “AUDITLOG_IDN_OIDC_JTI_CLEANUP” exists in the database. If it doesn’t exist, it creates the audit log table using the query “SELECT * INTO dbo.AUDITLOG_IDN_OIDC_JTI_CLEANUP FROM dbo.IDN_OIDC_JTI WHERE 1=2;”. The condition “WHERE 1=2” is always false, so the table is created with the same schema but no data is copied into it.
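
In code, that check looks roughly like this:

-- Create the audit table with the same columns as IDN_OIDC_JTI but no rows
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'AUDITLOG_IDN_OIDC_JTI_CLEANUP')
BEGIN
    SELECT * INTO dbo.AUDITLOG_IDN_OIDC_JTI_CLEANUP
    FROM dbo.IDN_OIDC_JTI
    WHERE 1 = 2;  -- always false, so only the schema is copied
END;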


Then it calculates the total number of JWT IDs currently in the “IDN_OIDC_JTI” table, and prints the number of rows that will be deleted by the cleanup process and the number of rows that will be retained.
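
A hedged sketch of that step (assuming expiry is decided by comparing EXP_TIME with the current UTC time; the actual procedure may use a different reference time):

DECLARE @totalCount INT, @expiredCount INT;

SELECT @totalCount   = COUNT(*) FROM dbo.IDN_OIDC_JTI;
SELECT @expiredCount = COUNT(*) FROM dbo.IDN_OIDC_JTI WHERE EXP_TIME < GETUTCDATE();

PRINT CONCAT('Total JWT IDs       : ', @totalCount);
PRINT CONCAT('Rows to be deleted  : ', @expiredCount);
PRINT CONCAT('Rows to be retained : ', @totalCount - @expiredCount);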


Now we come to the most important part of the query: the batch-wise data cleanup. The process starts with an outer loop that breaks when the row count of the current chunk is less than the check count (@checkCount). Inside the loop, the query checks whether a table named “CHUNK_IDN_OIDC_JTI” exists and drops it if it does. After dropping the table (if it existed), a new “CHUNK_IDN_OIDC_JTI” table is created with a single column named “JWT_ID”. Then we populate the “CHUNK_IDN_OIDC_JTI” table with up to @chunkSize expired records from the “IDN_OIDC_JTI” table, selected based on the “EXP_TIME” column.
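
A simplified sketch of the chunk-creation part of that loop (the JWT_ID column type is assumed here; the inner batch loop is shown in the next section):

DECLARE @chunkCount INT;

WHILE (1 = 1)
BEGIN
    -- Rebuild the chunk table for this iteration
    IF EXISTS (SELECT 1 FROM sys.tables WHERE name = 'CHUNK_IDN_OIDC_JTI')
        DROP TABLE dbo.CHUNK_IDN_OIDC_JTI;

    CREATE TABLE dbo.CHUNK_IDN_OIDC_JTI (JWT_ID VARCHAR(255));

    -- Pick the next @chunkSize expired JWT IDs from the main table
    INSERT INTO dbo.CHUNK_IDN_OIDC_JTI (JWT_ID)
    SELECT TOP (@chunkSize) JWT_ID
    FROM dbo.IDN_OIDC_JTI
    WHERE EXP_TIME < GETUTCDATE();

    SET @chunkCount = @@ROWCOUNT;

    -- Stop once the remaining expired rows fall below the check count
    IF @chunkCount < @checkCount
        BREAK;

    -- ... inner batch-deletion loop runs here (see below) ...
END;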


Inside the outer loop there is another loop that processes data in batches within each chunk; it breaks when the row count of the current batch reaches 0. Similar to the chunk creation step, the query checks whether a table named “BATCH_IDN_OIDC_JTI” exists and drops it if it does, then creates a new “BATCH_IDN_OIDC_JTI” table with a single column named “JWT_ID”. We populate the “BATCH_IDN_OIDC_JTI” table with up to @batchSize records from the “CHUNK_IDN_OIDC_JTI” table. If the batch row count (@batchCount) is greater than 0, we delete the records from the “IDN_OIDC_JTI” table whose “JWT_ID” matches the values in the “BATCH_IDN_OIDC_JTI” table, which holds the IDs of the current batch. We then run another DELETE to remove the same records from the “CHUNK_IDN_OIDC_JTI” table. Finally, we give the DB server a relaxation period😜 by waiting for the @sleepTime we defined earlier. The process then continues with the next iteration of the inner loop, performing batch deletions until there are no more records to process.
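
A matching sketch of the inner loop (same assumptions as above):

DECLARE @batchCount INT;

WHILE (1 = 1)
BEGIN
    -- Rebuild the batch table for this iteration
    IF EXISTS (SELECT 1 FROM sys.tables WHERE name = 'BATCH_IDN_OIDC_JTI')
        DROP TABLE dbo.BATCH_IDN_OIDC_JTI;

    CREATE TABLE dbo.BATCH_IDN_OIDC_JTI (JWT_ID VARCHAR(255));

    -- Move the next @batchSize IDs of the current chunk into the batch table
    INSERT INTO dbo.BATCH_IDN_OIDC_JTI (JWT_ID)
    SELECT TOP (@batchSize) JWT_ID FROM dbo.CHUNK_IDN_OIDC_JTI;

    SET @batchCount = @@ROWCOUNT;
    IF @batchCount = 0
        BREAK;

    -- Delete this batch from the main table ...
    DELETE FROM dbo.IDN_OIDC_JTI
    WHERE JWT_ID IN (SELECT JWT_ID FROM dbo.BATCH_IDN_OIDC_JTI);

    -- ... and from the chunk table so the same IDs are not picked again
    DELETE FROM dbo.CHUNK_IDN_OIDC_JTI
    WHERE JWT_ID IN (SELECT JWT_ID FROM dbo.BATCH_IDN_OIDC_JTI);

    -- Give the server a short rest between batches
    WAITFOR DELAY @sleepTime;
END;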

If something goes wrong, we can terminate the stored procedure execution and restore all the cleaned data using the “BAK_IDN_OIDC_JTI” table we created. I’m not going to explain that code, since I hope you can now read and understand a T-SQL query by yourself. So… click here to view the data restoration stored procedure I wrote to restore the data from the backup table created above.
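
Conceptually, the restore only needs to put back the rows that the cleanup removed. A minimal sketch under that assumption (the real procedure linked above may do more, such as restoring in batches):

-- Re-insert backed-up rows that are no longer present in the live table
INSERT INTO dbo.IDN_OIDC_JTI
SELECT bak.*
FROM dbo.BAK_IDN_OIDC_JTI AS bak
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.IDN_OIDC_JTI AS live
    WHERE live.JWT_ID = bak.JWT_ID
);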

That’s all for today. I hope you learned something useful for your software engineering or DB administration career. Ask any questions you have and share your concerns, thoughts, and ideas about this article in a comment below. Happy coding!

