SQL Server performance issues

bridgeburner · Nov 20, 2023

I've been tasked with looking into some database performance issues with an application that is using Azure SQL.
Basically, this is an ETL application that aggregates "raw" data, which is nearing approximately 800 million records. It calculates various things, mainly what's known as "monthly totals" which consist of various summations; like total "TotalAmountReimbursed" etc. These calculations span date ranges - so it calculates your TotalAmountReimbursed over a period of time. This is done in an Azure Durable Function and the results are persisted to a multi-tenanted SQL Server table with an ordered clustered column store index. This is then exposed through an API and served to the UI. The function/batch job creates around 50 million records in this table if its starting out from scratch - i.e it regenerates ALL the raw data.

This function runs daily on a cron job and has "incremental" runs and this is where the issues come in. Whenever new data is ingested into the "raw" unaggregated data, the function has to recalculate what these monthly totals are based on the given date range. It has some fairly complex code that calculates these ranges, but that isn't important. The important bit is that it currently needs to first delete the existing aggregations for that date range, redo all the calculations and then insert them into the table. I've seen that the delete performance has been absolute trash and seems to be a serious bottleneck for the function. Last week this took 20 odd hours to do with each batch delete of 1 000 rows sometimes taking minutes to complete and it could potentially delete up to 2 million records at a time...

Now I know column store indexes are bad for transactional work mainly due to compression. But I believe it was chosen because it maximises query performance as the UI has number of different query combinations which make designing an effective B-Tree index difficult. But the chickens have kind of come home to roost here.

So my question here is what can we do to make this better? Is this a fundamental architecture issue or are there some quick wins available? One thing I have done already in the immediate term which seems to have improved things is increase the batch deletes from 1 000 rows to 4 000 rows and reduce the concurrency as there seemed to be severe table locking going on. The inserts are also batched, and its performance isn't too bad.

An idea I have is implementing a table switching solution - so on each run, function loads up the data NOT TO BE DELETED into a staging table, the new aggregations get inserted into it too, and when complete, drop the "old" table and rename the staging table. This runs after hours so not concerned about schema locks at this point. I'm just concerned about how long it would take to copy all the NOT TO BE DELETED if the date range isn't wide as that means most of the data would need to be copied over - i.e. just shifting the problem from deletes to inserts..

I know about table partitioning but not sure how to effectively partition the data since the date ranges vary so wildly it could possibly span partitions if you partition on date.

Curious to know if anyone got any experience with this kind of thing?

Charly · Nov 20, 2023

I hope you find your answers, but this is the sort of thing that companies have to spend big bucks on cleaning up if they don't already have the expertise inhouse.

Pho3nix · Nov 20, 2023

Charly said:
I hope you find your answers, but this is the sort of thing that companies have to spend big bucks on cleaning up if they don't already have the expertise inhouse.

Was going to say, this kinda issue I’d be sending an invoice to resolve.

Keen to see other views

Harmonic · Nov 20, 2023

A lot to read, gave it a very quick skim but can you not cache the totals event sourcing style? Goal being to update totals with diffs and not the whole DB

swakop_toe · Nov 20, 2023

So many variables that limit what can be done.
My go-to is always, check/implement indexing.
You mention it has, so assume it is properly done.
Next I thought was "drop table", but you need to keep existing data, so there goes that option.

Is the SQL instance run of the fastest IOPS platform there is? If not, is there budget to move to a faster or fastest facility?

Then, perhaps a 2nd SQL instance, DB2_table2, that builds the new table in a separate database parallel with the existing/DB1_table1 one. Once populated, then insert the data from the DB1_tabke1 "that need to be kept", into the DB2_table2. Drop the table1 in DB1, ready for the next build. Getting the Apps/queries to read and generate the output alternating between DB1 and DB2 should not be that difficult either.

Regardless, I fear it will require not just your time but additional OPEX to speed things up.
Above based on what I gathered from your post, that is.
Keep us posted on progress?

Kosmik · Nov 20, 2023

This problem sounds Ike a proper SSAS problem.

For starters though, you really should NOT be loading database computations to a function. It would probably be quicker and more efficient to do this through databricks etc and process the results into your "cubes" or static storage buckets for date.

ToxicBunny · Nov 21, 2023

Kosmik said:
This problem sounds Ike a proper SSAS problem.

For starters though, you really should NOT be loading database computations to a function. It would probably be quicker and more efficient to do this through databricks etc and process the results into your "cubes" or static storage buckets for date.

Yeah, I was also reading that post going, this sounds suspiciously like a technology choice issue.

Also with that amount of data, some of the calculations would be better suited to being done within SQL itself if it were properly structured.

Kosmik · Nov 21, 2023

ToxicBunny said:
Yeah, I was also reading that post going, this sounds suspiciously like a technology choice issue.

Also with that amount of data, some of the calculations would be better suited to being done within SQL itself if it were properly structured.

Pretty much all of it. Ingest the data to a db, cross ingest into a cube/model/crunch db/platform, store as static sliced and dimensioned data, api's to hit said static so no load client side or db side in trying to crunch numbers.

SSAS, Databricks, scala, hadoop etc etc are the right types of ways to deal with all this.

Also far quicker to create and truncate vs delete, especially in sql, log files are going to scream.

Stonemason · Nov 21, 2023

If this was a few years back I would have said Oracle would have been a better choice based on the number of records.

Without knowing the exact architecture of the db and without knowing why it was set up that way, It needs to be said that a good rule of thumb in cases where aggregates are needed, is to do it in views or reports rather than in database tables that needs constant inserts and deletes.

ToxicBunny · Nov 21, 2023

Kosmik said:
Pretty much all of it. Ingest the data to a db, cross ingest into a cube/model/crunch db/platform, store as static sliced and dimensioned data, api's to hit said static so no load client side or db side in trying to crunch numbers.

SSAS, Databricks, scala, hadoop etc etc are the right types of ways to deal with all this.

Also far quicker to create and truncate vs delete, especially in sql, log files are going to scream.

100% agree with you in reality, but for shytes and giggles I reckon it should all be done in Excel you know.. 800m rows is easy for excel to handle, and probably do it faster than their current method

Kosmik · Nov 21, 2023

ToxicBunny said:
100% agree with you in reality, but for shytes and giggles I reckon it should all be done in Excel you know.. 800m rows is easy for excel to handle, and probably do it faster than their current method

Oh God, don't say that too loud, accountants might hear you.

Baxteen · Nov 21, 2023

welcome to the hell that is Azure SQL.

anyone with experience in T-SQL will think completely differently.
there are no page files. instead Azure creates 60 nodes for the data to be stored in for each table.
this means that any join with indexing will simply not work.

I am not going to be able to explain it well enough.
watch this instead

ToxicBunny · Nov 21, 2023

Kosmik said:
Oh God, don't say that too loud, accountants might hear you.

OOooooo Pivot tables and Pivot Charts to provide an overview of those calculated averages would be brilliant, don't even need a dev team for that.

Kosmik · Nov 21, 2023

Stonemason said:
If this was a few years back I would have said Oracle would have been a better choice based on the number of records.

Without knowing the exact architecture of the db and without knowing why it was set up that way, It needs to be said that a good rule of thumb in cases where aggregates are needed, is to do it in views or reports rather than in database tables that needs constant inserts and deletes.

Yep

Operational/Transactional Data -> ops bases reporrts/Dashboard
|
Completed Data shift to lake -> Delayed Reporting with views/store procedures or stored ( static reports depending on complexity )
|
Lake to Cube -> Multi Dimensioned data in robust cubes with static storage for quick retriveal or data investigating.

Kosmik · Nov 21, 2023

Baxteen said:
welcome to the hell that is Azure SQL.

anyone with experience in T-SQL will think completely differently.
there are no page files. instead Azure creates 60 nodes for the data to be stored in for each table.
this means that any join with indexing will simply not work.

I am not going to be able to explain it well enough.
watch this instead

Elastic pools behave differantly to dedicated instances, pools have a nasty tempdb operation thats bitten a few people.

Pho3nix · Nov 21, 2023

OP have you worked with SSAS before or are you a DBA Dev Primarily?
@Kosmik gave you the perspective and approach I’d follow tbh.

Baxteen · Nov 21, 2023

Kosmik said:
Elastic pools behave differantly to dedicated instances, pools have a nasty tempdb operation thats bitten a few people.

exactly. based on the bahaviour in OP with delete taking forever it sounds like the pools are what is causing the problem.
records sit clustered in one pool so the compute is not spreading load.

Baxteen · Nov 21, 2023

@bridgeburner what I would reccomend with your delete process.

instead of deleting records.
create a Staging table and rebuild the table in the opperation. so if the eventual table name was going to be dbo.tableData you make a dbo.stg_TableData
after the staging table is populated, you drop dbo.tableData and then you can rename the staging table:

EXEC SP_RENAME '[dbo].[STG_tableData]', 'TableData'

Kosmik · Nov 21, 2023

Baxteen said:
@bridgeburner what I would reccomend with your delete process.

instead of deleting records.
create a Staging table and rebuild the table in the opperation. so if the eventual table name was going to be dbo.tableData you make a dbo.stg_TableData
after the staging table is populated, you drop dbo.tableData and then you can rename the staging table:

or just truncate.

newklear · Nov 21, 2023

Perhaps implement Redis where you have bottlenecks.

Join the MyBroadband community

Get started

SQL Server performance issues

Well-Known Member

Senior Member

The Legend

Executive Member

Senior Member

Honorary Master

Oi! Leave me out of this...

Honorary Master

Expert Member

Oi! Leave me out of this...

Honorary Master

Honorary Master

Oi! Leave me out of this...

Honorary Master

Honorary Master

The Legend

Honorary Master

Honorary Master

Honorary Master

Expert Member