Improve data load performance

VG008

Senior Member
Joined
Dec 9, 2010
Messages
803
Reaction score
41
Location
Johannesburg
Hi Guys

We receive a csv file, daily (8am) with around ~5 million records that needs to be loaded into our database. The file has sensitive data in it, so we need to encrypt certain columns. The data needs to be appended to the existing data in the table.

This data is queried by our dispute department via a website and needs to be available as soon as possible.

Currently the data takes ~1 hour to load, but it takes 4 mins to load the data into a non-indexed table.
So the problem seems to be with the indexes which slows the load.

We were thinking of disabling the indexes, load the data, and them rebuild the index (all takes 15mins), however people are querying this table and will be unusable via the website.

Any ideas on how to improve the load speed?

Tech used:
MS SQL Server 2012
SSIS

Thanks!
 
You could look at introducing a delta table which would be identical to the table in question but would only hold the current day's data. You would then query both tables and UNION the results. You could consolidate the data during low usage periods and truncate the delta table. This would make your data query'able within 4 minutes and you would defer the 1 hour consolidation.

You also mentioned disabling indexes. In a replicated, master-slave environment this would be feasible but not recommended. You could disable indexes and update the data on the master (after disabling replication to prevent indexes from being deleted on the slaves) while still letting reads hit the slave(s). However, replaying the inserts on the slave(s) will still incur the 1 hour update with some potentially scary consistency issues so I'd steer clear of that.

That said, this is probably indicative of a bigger problem and these solutions would only treat the symptom and not the cause. If you're inserting 5 million rows daily, you should investigate whether sharding your database is feasible. Eventually you'll find yourself in a position where your queries (and not just your batch imports) are exceeding acceptable limits, regardless of your indexes. How you shard depends on how you query your database, but it sounds like it could potentially be date/time-based.
 
Last edited:
Some other things to look at:

1. Ensure you're using (and re-using) a prepared statement.
2. Ensure you're batching inserts in a transaction and limiting commits (if you can get away with a single commit at the end - great).
3. Hardware. SSD is your friend here.
4. Most importantly, figure out where that time is being spent. How much is writing to disk? How much is rebuilding indexes? How much is encryption? How much is parsing the CSV? What effect would multiple threads parsing, processing and encrypting parts of the CSV have? There may be some low hanging fruit and you should focus your efforts on areas where the biggest potential wins are.
 
Top
Sign up to the MyBroadband newsletter
X