Identity Column Overflow

ivan2

Diamond Member
Mar 6, 2000
5,772
0
0
www.heatware.com
Not a real problem but wanted to get some ideas since google couldn't offer me a solution.

How can I fix an identity column that is going to overflow, or is on the brink of overflowing? Let's also say that this could be the result of careless design and that warehousing is not an option.

Please provide the solution AS IF you are the following:

A DBA who bought this piece of shit of a program from a vendor, so it is not possible to introduce another composite primary key, and you are already looking at an identity column of type bigint.

A programmer who designed this :eek: (swear to god it's not me)
 

ivan2

Diamond Member
Mar 6, 2000
5,772
0
0
www.heatware.com
that's valid, but what if it's SQL Server or Sybase and there's no unsigned type...

Sorry, maybe I haven't been explicit; my thought was more along the lines of: what happens once the absolute biggest value has been reached for the largest datatype (hence the bigint)?
 

KLin

Lifer
Feb 29, 2000
30,951
1,079
126
Originally posted by: ivan2
Originally posted by: KLin
Changing the datatype to bigint should fix the issue. Are you saying there's a table that is reaching 2,147,483,647(the upper limit of the int datatype storage) records?

http://msdn.microsoft.com/en-us/library/ms187745.aspx

Yes, if my identity column is already bigint, will I be sol?

Only if the identity column is anywhere near 9,223,372,036,854,775,807. I sure hope not. :Q
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: KLin
Originally posted by: ivan2
Originally posted by: KLin
Changing the datatype to bigint should fix the issue. Are you saying there's a table that is reaching 2,147,483,647(the upper limit of the int datatype storage) records?

http://msdn.microsoft.com/en-us/library/ms187745.aspx

Yes, if my identity column is already bigint, will I be sol?

Only if the identity column is anywhere near 9,223,372,036,854,775,807. I sure hope not. :Q

Exactly. And what about -9,223,372,036,854,775,808 to zero? You exhausted that, too?
 

ivan2

Diamond Member
Mar 6, 2000
5,772
0
0
www.heatware.com
Originally posted by: Dhaval00
Originally posted by: KLin
Originally posted by: ivan2
Originally posted by: KLin
Changing the datatype to bigint should fix the issue. Are you saying there's a table that is reaching 2,147,483,647(the upper limit of the int datatype storage) records?

http://msdn.microsoft.com/en-us/library/ms187745.aspx

Yes, if my identity column is already bigint, will I be sol?

Only if the identity column is anywhere near 9,223,372,036,854,775,807. I sure hope not. :Q

Exactly. And what about -9,223,372,036,854,775,808 to zero? You exhausted that, too?

I don't know if the DB will start using negative numbers; I was under the impression that SQL Server and Sybase won't, at least.

It's just a question I dream about, not one about how likely it is. Plus, changing the ID to another datatype will surely force changes to the persistence layer, at which point I could basically revise the whole thing or even generate the PK myself.
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
You can reset the SEED (at least in SQL Server) to a negative number, but that is bound to cause issues later - especially under the assumption that you've exhausted bigint. That makes me curious as to what kind of data is this DB serving? DBCC CHECKIDENT will let you change the identity seed to a value that already exists in the table, so be very sure as to what you're doing.
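For reference, the reseed described here looks like the following in SQL Server (the table name dbo.Orders is hypothetical; RESEED sets the current identity value, so the next insert gets that value plus the increment):

```sql
-- Inspect the current identity value without changing anything
DBCC CHECKIDENT ('dbo.Orders', NORESEED);

-- Restart at the bottom of the bigint range; with an increment of 1,
-- the next insert gets -9223372036854775807
DBCC CHECKIDENT ('dbo.Orders', RESEED, -9223372036854775808);
```

As the post warns, nothing here checks whether the reseeded range collides with key values already in the table, so a later insert can fail on a duplicate PK.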

No offense, but I still find it hard to believe that this DB exhausted bigint.
 

JasonCoder

Golden Member
Feb 23, 2005
1,893
1
81
Seems like most of these solutions will be a stopgap. It's going to occur again at some point. You really should look at using guids after your temp fix.
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: JasonCoder
Seems like most of these solutions will be a stopgap. It's going to occur again at some point. You really should look at using guids after your temp fix.

I still have a hard time coming to terms with GUIDs. I have seen various designs with GUIDs and my biggest argument against them is the size of the index (at least within the data warehousing realm). GUIDs are 16 bytes long (INT is 4 while BIGINT is 8) and the values take longer to search and sort (very important on your PK!). This performance hit pretty much cascades throughout the DB and is highly visible in reporting systems.

A good estimate of your index size is around 30-40% of your DB size. With GUIDs (and assuming you have the capability to exhaust bigint), this can easily increase to 50%.

I am no DBA, but these are things I have learned from DBA's who have maintained humongous, disparate DB's for years. Again, I am still interested in finding out what kind of DB the OP is dealing with.
 

Evadman

Administrator Emeritus / Elite Member
Feb 18, 2001
30,990
5
81
Originally posted by: ivan2
I don't know if the DB will start using negative number, I was under impression that SQL and Sybase won't, at least.
MS SQL supports negative identity numbers, and on an existing table you can reseed the current value with DBCC CHECKIDENT (changing the increment, though, means rebuilding the column). Set the seed to 0 and the increment to -1. You can also use GUIDs, as was suggested, or switch to a decimal datatype. A decimal in MS SQL spans 38 digits, and numeric(38,0) gives you 38 places to the left of the decimal point. A bigint is just short of 19 digits, while a regular int is just short of 10.
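Both options sketched in T-SQL (table and column names invented for illustration):

```sql
-- Option 1: identity counting down from 0 through the negative range
CREATE TABLE dbo.EventsDown (
    EventId bigint IDENTITY(0, -1) PRIMARY KEY,
    Payload varchar(100) NOT NULL
);

-- Option 2: a decimal identity with 38 digits of headroom
CREATE TABLE dbo.EventsWide (
    EventId numeric(38, 0) IDENTITY(1, 1) PRIMARY KEY,
    Payload varchar(100) NOT NULL
);
```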

If that table is really reaching the integer limit on rows, that table should have been horizontally partitioned long ago. No possible way it can be used. Even with a 4-byte key and 12 bytes of data per row (probably as small as possible and still be useful), that is 2,147,483,647 × 16 ≈ 34,359,738,352 bytes of raw row data, and exhausting bigint at the same row size would mean nearly 300 exabytes. Are you seriously telling me that a database is going to look through that much data to find a single row? Forget it. Even with intelligent searching through the b-tree, a single-row insert would crawl. Could you imagine typing 'Evadman' and waiting minutes for an insert?

Originally posted by: Dhaval00
I still have a hard time coming to terms with GUIDs.
GUID's should only be used where GUID's make sense. GUID's don't make sense in the vast majority of cases.

By the way, if you have some 'rule' that states indexes should be in the 30-40% range, then you need a new rule. If a database is 40% index, that is a very, VERY crappily designed database. That means there is nearly as much index as actual data. An insert will take forever. I can think of only a few instances where an index footprint that big would be a good idea: specialized reporting databases, or tables that are extremely small.

<edit>
int max rows (full signed range): 2^32 = 4,294,967,296
bigint max rows: 2^64 = 18,446,744,073,709,551,616
GUID max rows: 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456
decimal(38,0) max rows: (999,999,999,999,999,999,999,999,999,999,999,999,999 × 2) + 1

I can't calculate the decimal max range on any calculator I have; it breaks them all or forces scientific notation.
 

ivan2

Diamond Member
Mar 6, 2000
5,772
0
0
www.heatware.com
Thank you all for the insightful comments. Sorry Evadman, I don't have a real-life example here =( But I can imagine someone reaching that point in the following case:

There could be "fragmentation" within an identity field: deletes and failed inserts burn values, so the PK range can become exhausted without the table ever holding much data (a transactional table, for example, where every new row keeps taking the next largest number as its PK...).

Is it possible, or recommended, to have some sort of "compacting" mechanism run on the PK periodically?
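As a rough way to see how much of the range has been burned (hypothetical table name again), compare the last identity value handed out with the live row count; the difference is the fragmentation:

```sql
-- Values handed out vs. rows actually kept
SELECT IDENT_CURRENT('dbo.Orders') AS LastIdentityValue,
       COUNT_BIG(*)                AS LiveRows
FROM dbo.Orders;
```

True "compacting" (renumbering existing keys) would mean rewriting every foreign key that references them, which is why archiving and reseeding into an unused part of the range is usually preferred.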
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: Evadman
By the way if you have some 'rule' that states indexes should be in the 30-40% range, then you need a new rule. If a database is 40% index, than that is a very VERY crappily designed database.

^I did mention within the data warehousing realm^. And just to clarify, it is not a rule - obviously, each DB has its own unique structure. I was simply generalizing the number... more of a ballpark range. Additionally, I wasn't being specific to a single index; this would be a properly indexed DB, clustered and non-clustered (which in most cases will be covering indexes).
 

JasonCoder

Golden Member
Feb 23, 2005
1,893
1
81
GUIDs may not make sense from a DBA's perspective, but trust me, from the developer's perspective they make all kinds of sense. There are lots of relationships that are much easier with GUIDs.

Example: say you have a metadata table that can be related to any of three other entities in your system. Let's say a book, a publisher, and an author each could be tracked (be related to) in our metadata table. With GUIDs you need only one column (let's call it EntityGUID), and I can now do all sorts of wonderfully easy joins. If it were an int, the joins get more complicated: '789' could relate to any or all of the three. Store which one in a separate column? Yuck. Yeah, there are other ways, but I prefer GUIDs.

Another example: say you want to merge two tables. How do you work out the identity sequencing? Just create a new table and dump everything into it with new identity sequences? Shit, hope none of your apps were counting on those columns keeping their values. With GUIDs this is a non-issue: you merge easily because a GUID is guaranteed to be unique.

I could go on, but I'll leave it to Google. Maybe these types of cases aren't valid in data warehouses, I don't know. But I'm having trouble with the complaint about searching and indexing a 16-byte value. Systems search and index character data much, much larger than 16 bytes all the time. An authors table that gets queried by FName often should have an index on that varchar(255) column.
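A minimal sketch of that first example (all names invented): each entity table defaults its key to NEWID(), and the metadata table needs only the single EntityGUID column to reference any of them.

```sql
CREATE TABLE dbo.Book (
    BookGuid uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
    Title    varchar(200) NOT NULL
);

CREATE TABLE dbo.Author (
    AuthorGuid uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
    FName      varchar(255) NOT NULL
);

-- One polymorphic reference column instead of one column per entity type
CREATE TABLE dbo.Metadata (
    MetadataId int IDENTITY(1, 1) PRIMARY KEY,
    EntityGuid uniqueidentifier NOT NULL,  -- may point at a Book or an Author
    MetaKey    varchar(100) NOT NULL,
    MetaValue  varchar(400) NULL
);
```

The tradeoff is exactly the one debated below: the polymorphic column cannot carry a real foreign-key constraint, and the 16-byte key widens every index that carries it.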
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Been there, done that. I was sold on the whole notion of GUIDs until I saw the huge performance degradation in a couple of our data warehouses. Simply put, it comes down to the scope of the uniqueness a specific solution requires. To me, your first example is a normalization oversight (unless I missed what you were getting at).

Nevertheless, I never said GUIDs are useless - they have their uses (like CRM and invoice-tracking systems where acquisitions and mergers are quite common). I was trying to stick to the OP's problem and to emphasize the performance hit, which may not be visible in smaller DB's, but just imagine the amount of data he/she is dealing with if he/she is capable of exhausting BIGINT. With regard to performance, refer to the thread http://forums.anandtech.com/me...Functional+Programming for an example of the kind of systems I deal with.

Specifically, in warehousing we do things like "lookups" to transition Type 2 dimension changes - this forces us to bring the PK into memory [or cache on-file]. To throw out numbers, a BIGINT PK with 10 million rows (clustered) would take about 2.5 minutes to import into the SSIS process, and that excludes things like reconciling with the data warehouse dimensions. If I am dealing with GUIDs, I am forced to consider the size and performance overhead (which includes JOINs across tables and reconciliation across databases). The SSIS package that used to take 15 minutes to complete will all of a sudden take 25 minutes (we have some older DB's that used GUIDs when M$ initially came out with them). And yes, all of these numbers are pretty realistic.

So let's say you have a table that uses GUIDs as the PK and has a clustered index on it. The clustered index key is used by all nonclustered indexes as the lookup key and is therefore stored in each nonclustered index leaf entry - so every record refers to its position in the table via the GUID. Since a GUID is 16 bytes and an INT is 4 (BIGINT is 8), every index carrying the GUID is much larger and will therefore be slower. This is what M$ workshops taught me, and it is very consistent with the way I see these systems perform. It is also in line with my earlier statement that the performance hit of a GUID ripples through the entire DB.

My general observation is that it is developers who are sold on the whole notion of GUIDs, because [in most cases] they don't have to worry about performance, indexes, etc. Try stepping into a DBA's shoes [or even an ETL developer's] and then try to hold on to your support for GUIDs :). As a generalized rule, 90% of the DB's out there survive without GUIDs - including M$'s own DB's and various others like Wikipedia and Slashdot.
 

JasonCoder

Golden Member
Feb 23, 2005
1,893
1
81
Originally posted by: Dhaval00
So let's say you have a single table which uses GUIDs as the PK and has a clustered index on it.

I've already said I wasn't sure about DW, but I have no idea why you would do this in your DW... why wouldn't you index things however the heck you want in your DW and leave the OLTP system to use GUIDs or whatever it wants? Or are you saying the SSIS process that moves data into your DW bogs down?

By this same logic, someone could complain about every index in the entire database that isn't on an INT. That's just silly. I index varchar and spatial data all the time... and for good reason - app perf.

I wasn't trying to sell the idea that GUIDs are the only way to go, just a very viable way. Sounds like we're in agreement that GUIDs make sense in a lot of cases, but apparently not data warehousing :cookie: Tons of data - yeah, I'd be looking to squeeze every bit of perf out too.

 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Just to clarify, SSIS itself won't bog down - on a 64-bit system, a process spawned from within SSIS has no hard upper limit; each process is bounded only by the resources available at the OS level.
 

Evadman

Administrator Emeritus / Elite Member
Feb 18, 2001
30,990
5
81
Originally posted by: Dhaval00
The SSIS package that used to take 15 minutes to complete will all of sudden take 25 minutes.

Honestly, it sounds like you need a faster box. The box wasn't sized correctly for the load, or the load changed from when it was spec'd. You are exactly correct that queries take longer the more bytes are in the PK (and a GUID is 2x a bigint), but that is easily remedied with horizontal partitions or faster disk. I deal with databases in the PB range and tables in the TB range. Heck, one of my daily jobs imports a set of files that varies between about 10 and 47 GB, runs a crapload of verification and double-checks on it for invalid data or corrupted sections, and does it in about 8 minutes. If it passes 10 minutes, about 30 people get paged, because at 21 minutes it will hold up other processes that cost more per minute than my yearly pay.

As a generalized rule, 90% of the DB's out there survive without GUIDs - this includes M$'s own DB's and various others like Wikipedia and Slashdot.
Absolutely. GUID's are not required in the vast majority of cases they get used in. Even the first example JasonCoder used is easily done with an incrementing integer in a single column. He will also run into problems with table merges, because GUID's are NOT guaranteed to be unique, even though lots of programmers and DBA's think they are. GUID's are also more expensive to create than the next integer in a sequence.
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Originally posted by: Evadman
Originally posted by: Dhaval00
The SSIS package that used to take 15 minutes to complete will all of sudden take 25 minutes.

Honestly, it sounds like you need a faster box. The box wasn't sized correctly for the load, or the load changed from when it was spec'd.

I wasn't trying to highlight a problem in any of my existing systems. The thought was more along the lines of "what could be": if you replace INTs with GUIDs and your performance degrades, it is mostly because of the size of the GUID column.

As a side note, I did run benchmark tests on a project last year (enlightening!) that looked at the overhead of dealing with GUID PKs (my earlier example).
 

Rangoric

Senior member
Apr 5, 2006
530
0
71
Originally posted by: Dhaval00
Originally posted by: Evadman
Originally posted by: Dhaval00
The SSIS package that used to take 15 minutes to complete will all of sudden take 25 minutes.

Honestly, it sounds like you need a faster box. The box wasn't sized correctly for the load, or the load changed from when it was spec'd.

I wasn't trying to highlight a problem in any of my existing systems. The thought was more along the lines of 'what could be' - meaning, if you replace INTs with GUIDs, and your performance degrades, it mostly is because of the size of the GUID column.

As a side note, I did run benchmark tests on a project last year (enlightening!) which looked at the overhead of dealing with GUID PKs (my earlier example).

Were you using newsequentialid() or newid() to generate the GUIDs?

newsequentialid() should be much better in terms of performance than newid(), though still worse than identity. It wouldn't be the huge hit up front, though: newsequentialid() and identity work better for an index because values are created in order, which helps the engine place the new rows. If you do an online search, newsequentialid() came from COMBs, and there are a few good references for that online.

Not sure which DBs support it, but I know SQL 2005 and later do for sure, and making a function to generate one yourself is trivial (search for COMBs again).
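The COMB construction being referenced (Jimmy Nilsson's technique) overwrites the tail of a fresh GUID with the current time, so values generated close together also sort close together. A sketch, not production code:

```sql
-- COMB GUID: 10 random bytes from NEWID() plus 6 bytes of current
-- datetime, so consecutive values are roughly ascending for the index
SELECT CAST(
    CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6))
    AS uniqueidentifier) AS CombGuid;
```

Note that newsequentialid() itself can only be used as a column DEFAULT, which is one reason a hand-rolled COMB function like this still gets used.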
 

Evadman

Administrator Emeritus / Elite Member
Feb 18, 2001
30,990
5
81
newsequentialid() will drastically cut down on B-tree fragmentation in an OLTP system, since new pages are added at the end instead of causing page splits. But once the data is in a warehouse, it shouldn't matter whether the IDs are in order, except during import processing. Warehouse indexes should have a fill factor as close to 100% as possible anyway, so that disk latency is minimized.
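The fill factor is set per index; 100 packs leaf pages completely full, which suits read-mostly warehouse tables (index and table names here are illustrative):

```sql
-- Pack leaf pages completely: good for append-only / read-mostly data,
-- bad for random-key OLTP inserts (it would force page splits)
CREATE INDEX IX_FactSales_SaleDate
    ON dbo.FactSales (SaleDate)
    WITH (FILLFACTOR = 100);
```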