
Educate me a bit if you will, on MS SQL indexes.

Homerboy

Lifer
I'm not 100% sure this is the right location for this post, but I figure there are a lot of SQL gurus here that may be able to help.

I have a fairly large table of about 10 million rows, with about 150,000 rows added each day (these are added in bulk at the end of the day, not 1-by-1 through the course of the day).

On this table there are 4 Indexes.
I will also submit that the machine is nothing special hardware-wise at all, but I'm hoping to remedy that for 2012.

As you can imagine, when those 150K rows get added, with the 4 indexes on the table, it takes a fair amount of time... 90mins in fact. Which is too long for us for numerous reasons. Not to mention it will only get worse and worse as more rows are appended.

Of these 4 indexes, I'm not sure if I need them all, but not sure how to tell. I will admit, while I understand on a very basic level what indexes do and why they are "needed", I don't understand all the intricacies and intertwining.

The columns in each index are as follows:

Index 1: FILENO, TRANS_DATE, CODE
Index 2: FILENO, TRANS_DATE, User, TRANS_TIME
Index 3: FILENO, TRANS_DATE, User
Index 4: TRANS_DATE, User, TRANS_TIME, CODE, NOTE, RECORD

These are all non-unique, non-clustered indexes.

Are any of these redundant? Not needed? Could they be removed? Combined? Changed? Modified?... I don't even know what to really ask.

What it comes down to is that I am looking for a way to improve the import time of those 150K rows, and to get a better understanding of what I am working with here and how it works.

Any help/ideas/thoughts would be appreciated.

Thanks in advance.
 
Hello Homerboy. First, I'd like to make it clear that I'm an Oracle DBA, not a SQL Server guy (they're different, even if they seem similar). Still, there are similarities.

First, I hope you're using bulk inserts.

In SQL Server, the default is autocommit: each DML statement (such as an INSERT) commits automatically. If you have 150,000 rows, don't do 150,000 individual inserts! You can do it in one!
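As a sketch of what a single set-based load might look like (the file path, table name, and delimiters here are all made-up placeholders for illustration):

```sql
-- One set-based statement instead of 150,000 singleton INSERTs.
-- The path and table name are hypothetical. TABLOCK requests a
-- table-level lock, which allows minimal logging in some recovery models.
BULK INSERT dbo.trans_table
FROM 'C:\loads\daily_load.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
```

The same effect can be had from SSIS or bcp; the point is one big operation, one commit.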

Second, your table sounds like an ideal table to be partitioned.

Assuming you have the license for the feature, I would definitely partition the table and the index. The outcome would be scanning only a fraction of the table.

For the indexes (since they're split), it's much quicker since there is less data to sort.
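A rough sketch of what date-based partitioning could look like here (this was an Enterprise Edition feature at the time; the table name, column choice, and boundary dates are all assumptions for illustration):

```sql
-- Partition by month on TRANS_DATE; the boundary dates are illustrative.
CREATE PARTITION FUNCTION pf_trans_date (DATETIME)
AS RANGE RIGHT FOR VALUES ('2011-10-01', '2011-11-01', '2011-12-01');

-- Map every partition to the same filegroup for simplicity.
CREATE PARTITION SCHEME ps_trans_date
AS PARTITION pf_trans_date ALL TO ([PRIMARY]);

-- Building the clustered index on the scheme partitions the table itself.
CREATE CLUSTERED INDEX cix_trans_date ON dbo.trans_table (TRANS_DATE)
ON ps_trans_date (TRANS_DATE);
```

With this in place, queries that filter on TRANS_DATE can touch only the relevant partitions instead of the whole table.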


As for your existing indexes, it all depends on the needs of the application. All I see are
composite (multi-column) indexes. If a query touches only these columns, then no table I/O is performed at all, since the index can supply all the data needed.


Hope I've helped.
 
This might be long, but please bear with me. It's quite important to understand indexes, because not knowing will only lead to problems for everybody: developers, end users, etc.

Let me start off with a simple table. A person table. All this table does is store name, and a few other tidbits of information:

Code:
	CREATE TABLE [dbo].[person]
	(

		id_person					INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,

		[sys]						BIT NOT NULL DEFAULT 0,
		date_added					DATETIME NOT NULL DEFAULT GETDATE(),
		date_updated				DATETIME,
		inactive					BIT NOT NULL DEFAULT 0,


		name_prefix					NVARCHAR(10),
		name_last					NVARCHAR(50),
		name_first					NVARCHAR(50),
		name_middle					NVARCHAR(50),
		name_suffix					NVARCHAR(10)

	)

	CREATE NONCLUSTERED INDEX NameLast_NameFirst_NameSuffix ON [dbo].person
	(name_last, name_first, name_suffix) INCLUDE(date_updated);

Understand that this table we just created has 2 indexes.

The primary key column id_person, which is the clustered index (via the CLUSTERED keyword there), and then another index that contains last name, first name, and suffix, and includes date_updated.

A table can only have one clustered index. A clustered index means this: sort the underlying table by this value, and store all the information for the row in the clustered index. Think of it as the main data file for the table. Because id_person is an identity column, the value is auto-generated: the first insert into this table gets id_person = 1, the next gets id_person = 2, and so on. Since that is the clustered index, the table is sorted by that one column, so if you search for id_person = 2, the database can easily find the row, and all the information from its columns is available quickly.

A non-clustered index is basically any other index on the table. It can be used to find information stored in the non-clustered index itself, or to help the database find the row in the CLUSTERED index. Non-clustered indexes are also sorted by their columns, so my non-clustered index will sort by last name, first name, and suffix.

Now. When we query:

Code:
SELECT name_last, name_first, name_suffix FROM person where name_last = 'BOB' and name_first = 'JOE'

It can get all that information from the non clustered index. Which is fast, because all the information you need is sorted in the non clustered index, and all the data/columns you are selecting are in the non clustered index.

If we search by:

Code:
SELECT name_last, name_first, name_suffix, date_updated FROM person where name_last = 'BOB' and name_first = 'JOE'

That is also fast, because all the information is in that index. The INCLUDE portion of the index means: include the value in the index, but do not sort or search by that column (so the database doesn't spend CPU considering date_updated when searching, but the value is there in case you need it).

If we search by:

Code:
SELECT name_last, name_first, name_suffix, date_added FROM person where name_last = 'BOB' and name_first = 'JOE'

This is going to be slower.

Why? Because date_added is not part of the index. The database can use the non-clustered index to find the row you want, but the column you need is not in the index, so it does a thing called a "key lookup." Every non-clustered index MUST carry the key of the clustered index, so the index effectively looks like the following (the primary key is hidden when you define a non-clustered index, but it's no different than doing the below):

Code:
	CREATE NONCLUSTERED INDEX NameLast_NameFirst_NameSuffix ON [dbo].person
	(name_last, name_first, name_suffix) INCLUDE(date_updated, id_person);

Now that it has the CLUSTERED key in the index, when pulling information that is not in the non-clustered index, it uses id_person to find the columns you want from the CLUSTERED index. That means every such query essentially performs two searches internally: one on the non-clustered index to get the primary key, then another on the clustered index (the main table data), because that is where all the data lives.
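One way to avoid that second search, when a query is important enough, is to make the index cover the extra column. A minimal sketch, reusing the person table above (the new index name is made up for illustration):

```sql
-- Adding date_added to the INCLUDE list means the earlier query that
-- selects date_added can be answered entirely from this index,
-- with no key lookup into the clustered index.
CREATE NONCLUSTERED INDEX NameLast_NameFirst_NameSuffix_Covering
	ON [dbo].person (name_last, name_first, name_suffix)
	INCLUDE (date_updated, date_added);
```

The trade-off is the theme of this whole thread: every included column makes the index wider, which means fewer rows per page and more work on every insert.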

That is the quick rundown on clustered/nonclustered.

Now the question might be... What if you never want to search on id_person, and only search on last name, first name, and suffix? Why not have that as your clustered index?

The answer to that is simple: because when you INSERT things into the database, it has to find a spot to place the row, since the data is sorted. All the data is stored in 8K pages. With the columns above at 361 bytes per row (remember the NVARCHAR columns are Unicode, so twice the size), you can fit about 22 rows of data in each page. So id_person 1 to 22 are in the first page, 23 to 44 in page 2, and so on. Now let's say you insert something that belongs in page 1 (not that this is possible with an identity column, but for the sake of explanation, imagine inserting an id_person of 10.5). The database has to create 2 new pages: each new page gets half of what was in page 1, the new row lands on one of the two, and page 1 becomes essentially "blank", used only to point at these new pages.

That is very inefficient. My guess is something like this is happening with your database, though it is sometimes necessary.

That is why I designed the person table with an identity CLUSTERED index: every new person is always added to the end of the table, which prevents those page splits on the main clustered index. The non-clustered indexes are designed to be small, so they hold more rows per page, page splits happen less frequently, and scanning is faster.

Now back to your problem:

If nothing is searching by those indexes, they can be removed. That reduces the work necessary when inserting new rows by preventing page splits in the indexes and all the unnecessary overhead.

But my guess is something is using those indexes. There must be queries out there searching by FILENO. If there are, and the number of rows per FILENO is fairly small (like 10,000 or less), then I'd remove all the indexes that lead with FILENO and create just a single FILENO index with no additional columns: searches will still be fast, but you won't have the other columns in that index causing page splits.
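In sketch form, that suggestion looks like this (the table and index names are made up, since the thread never gives the real table name):

```sql
-- A narrow single-column index: FILENO seeks stay fast, but each index
-- row is only a few bytes, so far more rows fit per page and the nightly
-- load causes far fewer page splits than the wide composite indexes.
CREATE NONCLUSTERED INDEX IX_trans_fileno ON dbo.trans_table (FILENO);
```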

I'd suggest you figure out what uses this table and how it searches for data; that will help determine which indexes can be removed. I also suggest putting a clustered index of some sort on the table (one that minimizes page splits) if it does not have one already. Because if you have no clustered index, every single time the database does a key lookup (pulling columns not in the non-clustered index) it will scan the entire main data file (which is not sorted) to find your columns. (Actually this last sentence might not be true; SQL might store the page number in the index rather than the clustered key, not sure.)

But without knowing more about the nature of the searches, I am not going to suggest removing any indexes.
 
Of your listed indexes, only Index 3 is redundant: it is covered by Index 2. Chances are, if you dropped 3, any queries using it would just switch to using 2. You'd want to do some testing to see if that is indeed the case.

If you are on SQL 2005 or later you can ask the system for index usage stats and actually see which indexes are being used over time. However, don't do this once and expect the results to be definitive. You'll want to check the index usage stats probably daily for a while to see which ones are being used.

Google for "sql server" check index usage and any of the top links there will have a nice little script to run; just pick whichever one you like best.
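For reference, the sort of script those links give you usually queries sys.dm_db_index_usage_stats, roughly like this (a sketch for SQL 2005+; note the counters reset when the instance restarts, which is another reason to check over a period of time):

```sql
-- Seeks/scans/lookups are reads that used the index;
-- user_updates is write work spent just maintaining it.
SELECT OBJECT_NAME(s.[object_id]) AS table_name,
       i.name                     AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups,
       s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.[object_id] = s.[object_id] AND i.index_id = s.index_id
WHERE s.database_id = DB_ID();   -- current database only
```

An index showing zero seeks, scans, and lookups but lots of updates is a strong drop candidate.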

Another option to look at is dropping the indexes before you do the huge insert, and then recreating the indexes once the insert is done. If this takes less time than the insert with the indexes in place, then you win in the end. You get freshly defragged indexes each day as a bonus.
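The drop-then-recreate pattern is only a few statements per index; a sketch using a made-up name for the first index listed earlier:

```sql
-- Before the nightly load: drop the index so the 150K inserts
-- don't have to maintain it row by row.
DROP INDEX IX_fileno_date_code ON dbo.trans_table;

-- ... run the bulk insert here ...

-- After the load: recreate it. Building it in one pass is often cheaper
-- than 150K incremental updates, and it comes back fully defragmented.
CREATE NONCLUSTERED INDEX IX_fileno_date_code
    ON dbo.trans_table (FILENO, TRANS_DATE, CODE);
```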
 
To all the answers thus far: THANK YOU.

This is what makes AT forums just awesome. No BS, no holier-than-thou, just honest solid answers.

I will ingest everything here and see what I can do.
Like I said too, I REALLY need to get updated hardware for 2012. The history of our SQL Server here is short, and I'm at the core of it. Getting others to buy in has been hard, but they are now seeing the power of what can be done with it (and I am only learning as I go).

Getting them to understand why we need better hardware is a different battle altogether.
 

specs? 😉
 

Well it was installed on an older, retired server we had here. (one I actually purchased about 6 years ago when I was IT Manager).

The rub too is that it wasn't even a clean install of the OS, just the same hacked-up install from 2007ish. I called our IT Manager out on it and he just responded with "it shouldn't affect anything SQL is doing". (This is the same guy that tried to shoot down my procurement of SQL by saying the software cost $40k... that's for the Enterprise edition.)

Specs are single dual core Xeon (I think 2.0ghz... maybe 1.8?)
Dual 180GB SATA in RAID1 (yeap... mirror)
and... drum roll please... 2GB of RAM!

🙂
 

ouch... i don't mean to brag (although it's going to sound like it), but we just ordered 12 new servers from our IT department for a major project next year...

ibm x3850 m2's:
dual xeon x7940 hex core 2.66ghz
default is 24gb's of ram (4 are coming with 128)

we don't really use local storage... each server has like 20-30gb local disk.

these servers will get a total of 70tb of raid5 emc storage...



now, as for your server, you need a new one 'cause there's no way the current one is still covered by warranty...
but check your needs.
what's your bottleneck? do you plan to grow? the ability to do things in parallel is gratifying (but you need a strong enough server to do it with)...

don't just spec your server without consideration. if your company doesn't have a virtual server infrastructure (from my experience, most sql server databases can be virtualized), then you still need to consider your average utilization, as well as think about peak times.

what's the purpose of this server? can it be down? is it critical? only you can answer that and decide how important performance is. also, if you've managed thus far, i guess you don't really need one heck of a server... 🙂
 
a quick way is to disable the indexes as part of the import process. i do this a few times a week when we run batch processes that affect tens of millions of rows.

just run ALTER INDEX with the DISABLE option (check BoL for exact syntax) and then rebuild the indexes after the job is done
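The pattern looks roughly like this (table and index names are assumptions; one caution worth knowing is that disabling a clustered index makes the whole table inaccessible, so only disable the non-clustered ones):

```sql
-- Disable before the load: the index definition stays in the catalog,
-- but SQL Server stops maintaining it during the inserts.
ALTER INDEX IX_fileno_date_code ON dbo.trans_table DISABLE;

-- ... run the nightly bulk insert here ...

-- REBUILD re-enables the index by rebuilding it from scratch.
ALTER INDEX IX_fileno_date_code ON dbo.trans_table REBUILD;
```

Compared with dropping and recreating, you don't have to keep the CREATE INDEX definitions in your job script.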


if you look in BoL there are DMVs that tell you which indexes are used and how often. in the last 2 years i've deleted hundreds of unused indexes on our SQL servers
 
Disabling the indexes is basically what I was suggesting, though I had forgotten that you could simply disable them instead of dropping them before you ran the big insert. Thanks for the reminder.
 
Do you mind posting the insert statement that does the insert at the end of the day? Are you using the bulk insert statement, SSIS, DTS or a compiled application?

Also, are there any other processes running during the insert? I would also suggest setting up perfmon and monitoring read and write disk queueing, page life expectancy, CPU, pages/sec, total server memory, target server memory, transactions/sec, blocks, and deadlocks. The number of records you are inserting is rather small, and the indexes should only be adding references at the end of the tree. I suspect that something else is competing for resources while this is going on, or that the server is spending a lot of time paging data in and out of memory during the insert.

You may also want to run Profiler during this time to see what exactly is being executed against the server. Did you check the server logs during the import to see if you have IO errors?
 