database questions

JC0133 · Oct 31, 2018

Can some one please explain to me the difference and the pros and cons of a HEAP FILE, HEAP BASED INDEX, CLUSTERED and UNCLUSTERED INDEX?

I haven't been getting clear answers from google which probably means I am really confused in general.

KB · Nov 1, 2018

Is this your homework? Hope not... hopefully this helps

CLUSTERED INDEX - rows are stored in the table in order of the indexed column (fastest for querying by indexed column, slower for insert)
UNCLUSTERED INDEX - index stored separately from the table rows as row pointers (second fastest for querying by indexed column, slow on insert because it has to touch multiple places. increases database size.)
HEAP FILE - rows stored in any order by the database engine, typically last row added is last row of the table (fastest for insert as no indexes need to be updated, but slowest for querying as every row must be scanned. Used primarily with small tables or when tables must be queried with multiple and varying parameters)
HEAP BASED INDEX - seems to be the same as HEAP FILE

Ken g6 · Nov 1, 2018

I'm not a DB guy, but are you sure "heap" doesn't refer to https://en.wikipedia.org/wiki/Heap_(data_structure), an implementation of a priority queue?

Cogman · Nov 1, 2018

KB's answer is correct. But some more.

Clustered index == how the rows are physically stored on disk. Often times, this is simply by some "id" column. But it can make sense to expand or shrink that data based on how the data is queried. You generally want this index to both be generally relevant, but also not too big. A side, note, it should be unique (or else your DB will do that for you via an invisible rn column). You also want to pick your columns from general applicability to specific. For example, a clustered index of (ID, Client) is silly, because ID is first, Client will never be utilized in queries. On the flip side (Client, ID) can be very useful since client would jump you to the relevant ids quickly.

But do note, when you have a clustered index, it might require that an insert causes things to reorder (especially if it isn't just on ID, for example). This can be both good and bad for performance. For example, in our applications, we often will only write data for a given "Client" at a time, so when the db goes to lock data and increase lock size, it is fairly unlikely that it will end up deadlocking. If we locked based on ID, then the inserts and updates can arbitrarily start interfering with each other, which increases the rate of deadlocks. The cost of having to relocate a "page" is inconsequential compared to the deadlock cost. (Modern DBs are basically giant B-Trees where rows are stored in "pages" The table itself is basically a bunch of pointers to those pages).

Another thing to note, All non-clustered indexes contain references to the clustered index. That is how they look things up. If you reach for something outside of either the clustered or nonclustered index, that causes the db to have to go and look up the actual row. However, if your columns are limited to only things in the clustered index/nonclustered index then it is just a fast lookup against only the index.

NonClustered Indexes are basically just invisible tables clustered by the nonclustered key which contain the non-clustered index columns + clustered index columns. All the rules about page location/etc still apply to them, but usually in a lesser extent because they generally have less data in them (they don't contain the full row). This is why adding non-clustered indexes will decrease write performance, but can also introduce deadlocking in some cases.

Hopefully not to confusing. I find, though, when you remove the mysticism around what is happening then it becomes clearer as to why some things are faster than other things. (note, this is all based on my experience with MSSQL... However, I believe most other sql DBs operate in a similar fashion).

JC0133 · Nov 1, 2018

KB said:
Is this your homework? Hope not... hopefully this helps

CLUSTERED INDEX - rows are stored in the table in order of the indexed column (fastest for querying by indexed column, slower for insert)
UNCLUSTERED INDEX - index stored separately from the table rows as row pointers (second fastest for querying by indexed column, slow on insert because it has to touch multiple places. increases database size.)
HEAP FILE - rows stored in any order by the database engine, typically last row added is last row of the table (fastest for insert as no indexes need to be updated, but slowest for querying as every row must be scanned. Used primarily with small tables or when tables must be queried with multiple and varying parameters)
HEAP BASED INDEX - seems to be the same as HEAP FILE

Not homework, something I need to understand for a test coming up.

Search

database questions

JC0133

Senior member

KB

Diamond Member

Ken g6

Programming Moderator, Elite Member

Cogman

Lifer

JC0133

Senior member

TRENDING THREADS