
Should commonly used derived tables be placed into a TVF?

ICRS

TVF = Table-Valued Function.
When I say derived table I mean:

Select ______ From
(Select ________ From ________) As DerivedTbl_1 Inner Join ....

I have the same derived table being used all over the place, for different reasons. To prevent redundant code, I thought it should be placed into a table-valued function. That way, if this one query needs to be updated, I won't have to go into each stored procedure and update it - I'll be able to update it in one place.
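As a sketch (all table, column, and function names below are made up for illustration), the derived table above could be wrapped in an inline TVF and then joined to like a regular table:

CREATE FUNCTION dbo.fn_ActiveOrders ()
RETURNS TABLE
AS
RETURN
(
    SELECT OrderID, CustomerID, OrderTotal
    FROM dbo.Orders
    WHERE Status = 'Active'
);
GO

-- Anywhere the derived table used to appear:
SELECT c.CustomerName, o.OrderTotal
FROM dbo.fn_ActiveOrders() AS o
INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;

Because it's an inline (single-statement) function, the Optimizer expands it into the outer query, so you get the single point of maintenance without a plan penalty.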

 
It depends on how complex the query is. I just got burned on one of my projects last week for using table variables [similar to TVFs] in my stored procedures. There will always be issues that you only find in production because the data just keeps growing, especially if the system is an archival DB. In any case, there are loads of articles out there that tell you to use TVFs, table variables, or CTEs (Common Table Expressions), but they hardly ever mention the upper limits of these constructs.

Going back to my example of my use of table variables: my stored procedures get called from within Integration Services, and until the DB had around ~200 million rows, everything was good. The run time of the queries then was probably about 10 minutes, and the work included hitting 7-8 tables, most of them hosting around a couple million transactions each, plus the biggest table with ~200 million rows. Of course, all of these have proper indexes [covering + non-covering]. Data would be queried every day to generate user reports, and data would also be moved between the transactional DB and a trimmed version of a data warehouse. All of a sudden, these tasks, which used to take about 10 minutes, took 2-3 hours to finish [this happened overnight]. After loads of debugging and monitoring, I discovered that disk I/O was taking a substantial hit when I was doing work on the data in the table variables.
The logic was something along these lines: it wouldn't be feasible to loop through 200 million rows and other large tables every night. So I made use of table variables and a staging DB, which also allowed for incremental updates, to query the tables on an incremental basis. That way I would be dealing with only a few thousand transactions as opposed to millions. It turned out that there is a 'threshold' in SQL Server beyond which the Query Optimizer just gives up on table variables - when I say 'give up,' I mean the execution plan it produces is terrible. I was doing things like updating the table variable once it was populated, which forced SQL Server to start an ambient transaction. This in turn meant writing the logs of these updates to the log file, which is huge because of the millions of records - this is where SQL Server went into a suspended state, under the wait type PAGEIOLATCH_SH; simply put, the hard disk wasn't able to keep up with what SQL Server wanted to do. If you don't know, PAGEIOLATCH_SH indicates a resource contention issue with the disk system. Overall, because our DB hit this threshold, the time for my queries increased from a few minutes to a few hours overnight.

So the solution? Temporary tables! Even though some articles advertise the use of table variables and TVFs, they don't take such performance issues into account. In general, temporary tables can have indexes and statistics, which help the Query Optimizer cache the best execution plans - you don't get this liberty with table variables and TVFs. So my advice to you is to look at the execution plans. If you have updates/inserts in your logic, you probably want a temporary table (you can create indexes on it, then drop the table at the end). It also depends on the overall size of your tables. And if your queries are complex, ensure you have proper indexes defined [assuming it is a transactional + reporting DB]. One place where you may get hosed is when you need to pinpoint the bottleneck in your queries; I use PRINT statements to print the current time in my stored procedures at each step, which lets me monitor roughly how much time each query takes.
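To illustrate the pattern described above (all names here are hypothetical), a temp table gives you indexes plus PRINT-based timing:

CREATE TABLE #Staging
(
    OrderID INT NOT NULL PRIMARY KEY,   -- clustered index for free
    CustomerID INT NOT NULL,
    OrderTotal MONEY NOT NULL
);

PRINT CONVERT(VARCHAR(30), GETDATE(), 121) + ' - loading #Staging';

INSERT INTO #Staging (OrderID, CustomerID, OrderTotal)
SELECT OrderID, CustomerID, OrderTotal
FROM dbo.Orders
WHERE ModifiedDate >= @LastRunDate;     -- incremental load

CREATE NONCLUSTERED INDEX IX_Staging_CustomerID ON #Staging (CustomerID);

PRINT CONVERT(VARCHAR(30), GETDATE(), 121) + ' - #Staging loaded';

-- ... joins/updates against #Staging here ...

DROP TABLE #Staging;

Unlike a table variable, #Staging gets real statistics, so the Optimizer can pick a sensible plan.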

Hope this helps.

Edit: By the way, after switching to temporary tables, my run times have now gone down to 2.5 minutes. LOL.
 

We actually have tried using temporary tables with indexes. For our queries this ended up being significantly slower, even though none of the result sets should ever exceed 500,000 rows - in fact, hardly any exceed 1,000 rows.
 
Well, what's your hardware spec? Is the OS 32-bit or 64-bit? How big are the tables? Are you doing only SELECTs? And is the .ldf log file on the same physical drive?

In my case, my queries specifically target incremental loads - so the most we select is 3,000-8,000 records into temporary tables. My server's running dual quad-core with 8 GB of RAM.
 

Where I work, we programmers have zero control over any server. We actually have no idea what the specs are. Only the people in the server group have access to that information.
 
Do you have control over the queries and indexes? I was thinking about your previous post [where you mentioned temp tables were too slow] and wanted to find out what kind of indexes you have on the main tables. The reason things could have been slower is that your main tables don't have proper indexes. I have learned to treat temp tables as catalysts, not the main ingredients of the chemical reaction - if your main tables have improper indexes, most of the query time will be wasted on filling the temp table (assuming the source tables are huge). If this is the case, switching to TVFs and/or table variables will be even slower, for all the reasons [possibly more] I mentioned earlier. Have you run tests measuring the time it takes each query to finish? It might be helpful if you can pinpoint which query is taking the most time to run.

Did the queries become slower over time, or have they been slow from the beginning? Do you know if your indexes are defragmented regularly?
 

None of the queries themselves are slow. I just wanted to know what best practice is.

Most of our queries take under 200 milliseconds, and even the longest one doesn't take over 30 seconds when done using a TVF. I only tested a stored proc on a few of our queries; it was slower, but not disastrously so - e.g., ones that took 100 ms now took around 150 ms.

Each table has an ID column (int type) which is set as the PK and clustered index. We then used the query analyzer to suggest nonclustered indexes, tested to see which ones improved our queries meaningfully, and kept those.
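For what it's worth, that setup would look something like this (hypothetical table and index names):

CREATE TABLE dbo.Orders
(
    ID INT IDENTITY(1, 1) NOT NULL,
    CustomerID INT NOT NULL,
    OrderDate DATETIME NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (ID)
);

-- The kind of nonclustered index the tuning tools suggest,
-- covering a frequent lookup:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID) INCLUDE (OrderDate);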
 
Do you know approximately how big your tables are? As I mentioned in my first post, the Query Optimizer will wait until it hits a 'threshold' (I am still trying to find out what that threshold is; I suspect it relates to the way pages are maintained plus index fragmentation) before it starts falling apart. So right now, if your tables are smaller, there might be a significant performance boost - that's the whole point of TVFs and table variables, i.e., to use them for smaller datasets. Once you go past the threshold, your performance hit won't be linear - it will be exponential. Unless you have a data-sampling tool like Integration Services that lets you simulate production-type data loads, there is no sure way of knowing [at this very moment].

Just for your knowledge: if you have UPDATEs and INSERTs, performance might be affected by the fact that these actions have to be logged - which means they get written to the transaction log. Assuming this is a huge transactional DB, SQL Server will need to compensate for the growth; if the Query Optimizer decides one plan is better than another, your query time may go up from a few milliseconds to a few minutes.

Edit: There was an oversight on my part... I think you're doing only SELECTs. I deviated a bit by pulling table variables into the whole conversation. In any case, if you have a single statement, you can put it in an inline TVF and then use it - the Optimizer is smart enough to see the inline TVF as a 'macro,' expand the statement, and exploit the performance gain. However, if you have a multi-line TVF [which most production DBs are likely to have], the Optimizer can't determine the cardinality. As mentioned before, it can't maintain statistics in such cases. The Optimizer will take its best guess, which may well be the worst guess in most cases because of the complexity and size of the data. You can easily create an inline TVF and a multi-line TVF, look at the execution plans, and see the difference. If you still decide to use TVFs, keep these limitations in mind: (a) you can't change the DB's state - no CRUD; (b) you can't call other sprocs; and (c) you can't create and leverage temp tables (I know I keep talking about temp tables, but they have saved me on too many occasions).
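To make the inline-vs-multi-line contrast concrete (hypothetical names again), compare:

-- Inline TVF: a single SELECT; the Optimizer expands it like a macro.
CREATE FUNCTION dbo.fn_OrdersInline (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT OrderID, OrderTotal
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
);
GO

-- Multi-line TVF: fills and returns a table variable, so the Optimizer
-- has no statistics and must guess the cardinality.
CREATE FUNCTION dbo.fn_OrdersMultiLine (@CustomerID INT)
RETURNS @Result TABLE (OrderID INT, OrderTotal MONEY)
AS
BEGIN
    INSERT INTO @Result (OrderID, OrderTotal)
    SELECT OrderID, OrderTotal
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID;
    RETURN;
END;

Run the same outer query against both and compare the actual execution plans - the estimated row counts on the multi-line version are where the bad guesses show up.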
 

Our tables range in size from 10 to 100,000 rows, with the largest query returning no more than 500,000 rows.

Oh yes, all of the functions are inline TVFs. We aren't allowed to use multi-line TVFs here for some reason.
 
500,000 rows is significant during operational loads. The best option is to look at the cost and the row-size statistics in the execution plan. Wherever you have a cost of 50% or so, put a TVF in place and see what happens. Again, be cautious of the DB size and the indexes; for some of the reasons I pointed out previously, you'll have to monitor these things. TVFs might decrease your work right now, but later on they could turn out to be bottlenecks. It may help to think of these things outside the OOP realm (modularization may decrease work initially, but not without affecting performance later on). In terms of SQL Server, the Optimizer is the workhorse - keep it happy, and it'll keep you happy.
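One cheap way to get cost numbers without staring at the graphical plan (the query below is just a hypothetical placeholder):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT c.CustomerName, SUM(o.OrderTotal) AS Total
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerName;

The Messages tab then shows logical reads and CPU/elapsed time per statement, which makes before/after TVF comparisons easy.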
 

I can't believe I forgot one of the issues I ran into with stored procs and temp tables versus inline TVFs.

We have a query A, which I placed in stored proc A.

Query B needs to select from A, so in the stored proc for B I have a temp table that inserts from stored proc A and adds indexes.

Query B, however, is needed for query C.

I can't call stored proc B and insert its result into a temporary table, because you can't insert the result of a stored procedure into a temporary table if that stored procedure itself contains an INSERT ... EXEC statement. The restriction I discovered is that a stored proc containing an INSERT ... EXEC can't be called from another stored proc's INSERT ... EXEC.
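A minimal repro of that restriction (hypothetical procs), as I understand it:

CREATE PROCEDURE dbo.ProcB
AS
BEGIN
    CREATE TABLE #A (OrderID INT, OrderTotal MONEY);
    INSERT INTO #A EXEC dbo.ProcA;   -- first level of INSERT ... EXEC
    SELECT OrderID, OrderTotal FROM #A;
END;
GO

-- This second level then fails with
-- "An INSERT EXEC statement cannot be nested.":
CREATE TABLE #B (OrderID INT, OrderTotal MONEY);
INSERT INTO #B EXEC dbo.ProcB;

So only one INSERT ... EXEC can be in flight at a time on a call chain.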
 