Error and Transaction Handling in SQL Server

This is an appendix to the three-part series Error and Transaction Handling in SQL Server. More precisely, this appendix is an extension to the chapter Special Contexts in Part Two and it covers how error and transaction handling in SQL Server works when you access linked servers. When reading Part Two you may have been appalled by the great level of inconsistency in SQL Server, and you may have been confused by all the possible actions that SQL Server can take in case of an error. When you use linked servers, the stakes raise even higher. If you also use distributed transactions, you may find yourself in situations where it is difficult to have any civilised error handling at all.

I assume in this appendix that you have read Part One in whole and the first six chapters of Part Two, and that you have gathered a basic understanding of the material. If you have arrived at this page through a link or a web search, I recommend that you read these two parts first.

Index of All Error-Handling Articles

All the articles above are for SQL 2005 and later. For those who still are on SQL 2000, there are two older articles:

Some General Notes on Linked Servers

This chapter gives a general introduction to linked severs and distributed transactions. It also introduces a setup for the examples in this appendix.

When you access a linked server, SQL Server never talks to the remote data source directly. Instead it talks to an OLE DB provider that accesses the data source. Most OLE DB providers are product-specific. For instance, the OLE DB provider for SQL Server is SQL Server Native Client, SQLNCLI. There are OLE DB providers for data sources such as Oracle, Postgres, Access/Jet and Active Directory. There is also the generic provider MSDASQL that talks to an ODBC driver, which makes it possible to set up a linked server to a data source for which there exists an ODBC driver, but no OLE DB provider.

It seems to me that the most common use of linked servers is to connect to another SQL Server instance, and this is what I will focus on in this appendix. Given the query syntax in SQL where a linked server is just one more dot, you can easily be lured to think that connecting to a remote server is no different than accessing an object in a different database in the local instance. But for various reasons this is not the case. SQL Server takes a fairly agnostic view on the linked server, and generally sees it as an "OLE DB data source". There do not seem to be very many shortcuts if the other instance is also running SQL Server. This results in a somewhat clunky behaviour with restrictions that are difficult to understand if you incorrectly think of the feature as a means to connect two instances running SQL Server. As one example, you cannot invoke a table-valued function on a linked server.

Interesting enough, in the particular field of error handling, there appears to be such shortcuts. To wit, as we shall see in this appendix, SQL Server has a tailored error handling if the provider is SQLNCLI and a more generic behaviour with other providers.

Note: In total there no less five providers to connect to SQL Server through OLE DB. They are not five distinct ones, but rather, I assume, development of the same code base and only the name has changed over the years:

SQLOLEDB, this was the original version that shipped with SQL 7 and SQL 2000 and still ships as part of Windows for legacy reasons. When you set up a linked server to use SQLOLEDB, SQL Server will actually use the matching version of SQL Server Native Client.
SQLNCLI. The first release of SQL Server Native Client that shipped with SQL 2005.
SQLNCLI10. The version of SQL Server Native Client that shipped with SQL 2008.
SQLNCLI11. The version of SQL Server Native Client that shipped with SQL 2012 to SQL 2017. No new versions came with SQL 2014 and later, because OLE DB had been deprecated for connections to SQL Server.
MSOLEDBSQL. Eventually, Microsoft undeprecated OLE DB, and released a new version of the OLE DB provider out of band. If it will ship with versions later than SQL 2017 remains to see.

To keep things simple, I use the name SQLNCLI all throughout this article. I believe that it actually maps to the appropriate provider which could be SQLNCLI10 or SQLNCLI11. Of this writing (August 2018) MSOLEDBSQL is relatively new, and it appears some glue is missing to make linked servers using this provider work flawlessly. I assume that this is something that will be addressed by time, and I am not covering these oddities. If you want to try the demos in this article with MSOLEDBSQL, you may see different output than I present here.

Setting Up a Linked Server

If you want to define a linked server to the SQL Server instance ZAPPA running on the server FRANK, this is as simple as

You can now access objects and perform actions on this instance, in four different ways:

Note: The access to the linked server is like any other login to that server. That is, you need to have a valid login on that server or set up login mapping with sp_addlinkedsrvlogin. For brevity, I'm assuming that your login is valid on both instances.

Just because the remote instance is known as FRANK\ZAPPA, you don't have to use that name in your code. The linked server as such is really only an alias. For instance, this defines a linked server FZ that maps to the physical instance FRANK\ZAPPA:

(For a complete discussion of the parameters to sp_addlinkedserver, please see Books Online.) I will use FZ as the name of the linked server in the examples in this article. If you want to run the examples yourself, define FZ to point to an instance in your environment. To be able to invoke stored procedures on the linked server, you need to run one more statement:

If you want to change FZ to point to a difference instance, first drop the current definition with sp_dropserver:

(The second argument states that you also wants to drop any login mappings you have set up.)

If you only have access to one single SQL Server instance, you can still run the examples by setting up a loopback instance like this:

SQL Server will connect back to itself, thinking that it is talking a to different data source. This will work with most of the examples in the article, but there is one exception, which I will point out when we arrive there. An alternative is to add a second instance to your machine, for instance by downloading and installing an instance of SQL Server Express Edition.

Note: There is always a server definition for the local instance itself, and if your server is MYSERVER you can run things like

SELECT * MYSERVER.master.sys.objects
EXEC MYSERVER.master.sys.sp_who

However, SQL Server will recognise what is going on and shortcut these to local operations, so the quirky issues with linked servers will not exhibit. It is not smart enough, though, to look through a loopback server as set up as above.

Also, in preparation of the exercises, put error_handler_sp in tempdb on both instances as we use it all through the article. If you already have gotten your feet wet with SqlEventLog, I encourage you install it as well in tempdb, and if you like you can use it in place of error_handler_sp in the example scripts.

Distributed Transactions

When you access a linked server and you have a transaction in progress – either an explicit one started with BEGIN TRANSACTION, or an implicit one defined by a single local statement – the transaction is promoted to be a distributed transaction. This can be turned off in SQL 2008 and later versions with sp_serveroption by setting the option remote proc transaction promotion to false.

SQL Server does not implement distributed transactions itself, but it relies on an operating system component, Microsoft Distributed Transaction Coordinator (DTC). To be able to run some of the demos, you will need to know how to control DTC. Typically, DTC is not automatically started on desktop operating systems. To check the status, go the Control Panel, select Administrative Commands and in this folder select Services. Scroll down the list until you come to Distributed Transaction Coordinator. (Note that there may be other services of which the name also starts in Distributed...) You can right-click the service to start or stop it from the context menu.

When you run a distributed transaction over two data sources that are on different machines, you need to have DTC running on both. For the two DTC services to talk to each other, they need to be able to log on to each other's operating systems. If you are in a domain and there are no trust issues, this should work out smoothly. On the other hand, if you are in a workgroup, it can be about impossible to get it to work. (It may work if you change the service accounts for DTC to be accounts with the same name and password on both machines.) For the examples in this article, you can avoid this problem if you define FZ to point to another instance on the same machine, as in this case you only need one instance of DTC.

How Errors from Linked Servers are Communicated

When accessing a linked server, an error can occur in any of these three places:

In this section we will look at how these errors are reported locally and how you can trap them.

Errors Occurring Locally

Msg 7202, Level 11, State 2, Line 1

Could not find server 'NOSUCHSERVER' in sys.servers. Verify that the correct server name was specified. If necessary, execute the stored procedure sp_addlinkedserver to add the server to sys.servers.

These errors are just like any other local SQL Server error, and there is not much to say about them.

Errors in the OLE DB Provider

When errors occur in the OLE DB provider, they are typically not raised as errors as such. Instead the raw error messages from the provider are relayed as informational messages, which means that you cannot trap them in SQL Server with TRY-CATCH and the error_xxx functions. However, they are interpreted by SQL Server that raises its own generic error message so that your CATCH handler can fire.

We will look at some examples of OLE DB errors, starting with connectivity errors. For this particular example set up FZ to point to a non-existing machine:

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "Login timeout expired".

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.".

Msg 53, Level 16, State 1, Line 0

Named Pipes Provider: Could not open a connection to SQL Server [53].

In total there are three messages, whereof two are said to come from the OLE DB provider, and the last one looks like an SQL Server message. Looks like? Well, this situation is a little odd, which we see more clearly if we change FZ to point to a local non-existing instance:

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "Login timeout expired".

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** <dynamic SQL>, Line 0. Errno -1: SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF].

There is one thing that is odd with the error message: the error number is -1. Error numbers produced by SQL Server itself are supposed to be between 1 and 49999. On SQL 2005 and SQL 2008 the error number is different, but not less odd: 65535. What happens in these two cases is that SQL Server takes one of the error messages from the OLE DB provider and dresses it up an SQL Server message, using the return code from OLE DB as its error number. If you open a command-line window and run

you will see the same messages as above, but in somewhat different order and this time no error looks like an SQL Server error – all three clearly come from the client API.

For the next example with OLE DB, define FZ to point to an SQL Server instance on a different machine. (This example also gives you an occasion to test whether you are able to get this scenario working in your environment). Run this:

Recall that since the INSERT statement defines a system transaction, the stored procedure will always execute in the context of that transaction. And if this is a stored procedure on a linked server, the transaction is promoted to a distributed transaction. If it all works out smoothly and the procedure executes successfully, you will see this message:

The object 'sys.objects' does not have any indexes, or you do not have permissions.

I was less lucky. When I first ran the script, I got this message because I had forgotten to start DTC on the other server.

There is no OLE DB error here, and why it is so, we will learn in the next section. Once I had started DTC on the other machine, I got this output:

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "The partner transaction manager has disabled its support for remote/network transactions.".

(0 row(s) affected)

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** <dynamic SQL>, Line 6. Errno 7391: The operation could not be performed because OLE DB provider "SQLNCLI11" for linked server "FZ" was unable to begin a distributed transaction.

If this happens to you, you have a problem with your DTC setup. The error message may lure you to think it is a configuration issue with DTC, so you may try both this and that. Not that I recommend it. I have tried myself without much success. I believe that real issue in an authentication failure.

There are a few things we can note about the output. The message from SQL Server is a generic one, but this time it is one of its own messages. The OLE DB provider has a more detailed error message. Unfortunately, neither that message is particularly clear, and this is nothing unique – error messages you get with linked servers are often very obscure, and this is far from the worst.

Errors in the Remote Data Source

When the errors occurs in the remote data source, this is presented completely differently depending on whether the remote data source is another SQL Server instance or not. If the linked server is another SQL Server instance, you cannot even tell from the message whether the error occurred locally or remotely. Make sure that you have FZ pointing to an existing SQL Server instance and run this:

Msg 15010, Level 16, State 1, Procedure sp_helpdb, Line 43

The database 'NoSuchDb' does not exist. Supply a valid database name. To see available databases, use sys.databases.

Msg 259, Level 16, State 1, Line 1

Ad hoc updates to system catalogs are not allowed.

If you remove FZ from the statements so that they are local to the server, the output is the same. This is possible, because the OLE DB provider for SQL Server has a provider-specific interface which exposes the seven components of an error message from SQL Server. Therefore the local SQL Server can dress up the error from the remote server as if it had occurred locally.

If the remote data source is not known to be an SQL Server instance, the output is radically different. It is possible to test this without getting a license for Oracle, learning Access or installing MySQL. You can set up FZ in this way:

This sets up a linked server using the generic OLE DB-over-ODBC provider MSDASQL with the old SQL Server ODBC driver. In the example above, replace FRANK\ZAPPA with the name of your remote SQL Server instance.

Note: I only show the possibility to set up a linked server over ODBC to be able show you how the error handling works with another provider than SQLNCLI. While everything seems to work when you set up a linked server this way, Microsoft does not support running linked servers against other SQL Server instances over ODBC; they only support linked servers using SQLNCLI.

Also note that for simplicity's sake, I'm using the old ODBC driver that came with SQL 2000 and that ships with the operating system. You should never use this driver for your applications or anything else. Just like MSOLEDBSQL, new versions of the ODBC driver are released out of band, and you can find information about recent releases on the Microsoft SQLNCli team blog. Newer versions of the ODBC driver may give different results than I show here, but the takeaway of these results is not the exact output, but that is is different to SQLNCLI.

If you run the script above with this definition of FZ, you get this output instead on SQL 2016 and lower:

OLE DB provider "MSDASQL" for linked server "FZ" returned message "[Microsoft][ODBC SQL Server Driver][SQL Server]The database 'NoSuchDb' does not exist. Supply a valid database name. To see available databases, use sys.databases. ".

Msg 7212, Level 17, State 1, Line 1

Could not execute procedure 'sp_helpdb' on remote server 'FZ'.

OLE DB provider "MSDASQL" for linked server "FZ" returned message "[Microsoft][ODBC SQL Server Driver][SQL Server]Ad hoc updates to system catalogs are not allowed.".

Msg 7343, Level 16, State 2, Line 1

The OLE DB provider "MSDASQL" for linked server "FZ" could not INSERT INTO table "[FZ].[tempdb].[sys].[objects]".

Because MSDASQL does not expose the components from an SQL Server message like SQLNCLI does, SQL Server uses its standard method to relay error messages from linked servers. The actual errors from the remote instance appears as informational messages from the OLE DB provider and they are passed directly to the client with no chance for us to pick them up in a CATCH block with the error_xxx functions. Instead SQL Server produces a generic message which only tells us that the operation failed. As long as you only want to trap the error to avoid further execution and roll back any open transaction, this is good enough. But if you also want to log the error into something like SqlEventLog, this is somewhat meagre. Note also the severity level for the first error: it is 17 which is pretty vile; this is above the range for regular user errors.

On SQL 2017 and higher, the output is the same for the first statement, but the output for the attempt to insert into sys.objects is different:

Msg 7399, Level 16, State 1, Line 9

The OLE DB provider "MSDASQL" for linked server "FZ" reported an error. The provider did not give any information about the error.

Msg 7343, Level 16, State 2, Line 9

The OLE DB provider "MSDASQL" for linked server "FZ" could not INSERT INTO table "[FZ].[tempdb].[sys].[objects]". Unknown provider error.

That is, the informational message with the real error message is not there, but instead there is a second error message telling us that there was no message from the provider. Which is very strange, since it is the same version of MSDASQL and the ODBC driver in both cases. It does not however seem to be a generic issue with SQL 2017. I ran a test against Postgres, using the PGNP OLE DB provider. This statement:

OLE DB provider "PGNP" for linked server "POSTGRES" returned message "ERROR: null value in column "a" violates not-null constraint

DETAIL: Failing row contains (null).

Msg 7343, Level 16, State 2, Line 5

The OLE DB provider "PGNP" for linked server "POSTGRES" could not INSERT INTO table "[POSTGRES]..[public].[tbl]".

So in this case the error message from OLE DB (and eventually the remote server) is not lost.

Before you move on, restore FZ to be defined with SQLNCLI as a regular linked server for SQL Server. Save your script for using MSDASQL, however, because we will occasionally come back to this setup.

Calling Remote Procedures

In this chapter and the next ones, we will run commands both on the local server and the remote server FZ. Here in the text, I will show things piece by piece for clarity. To make it easier to run the examples, I have also prepared scripts with the same contents that I show in the text. These scripts all follow this structure:

In this way you can run an example in one go without having to jump forth and back between query windows connected to different instances. As the examples are variations of each other, the same script is good for several examples, but you will have to make small changes as we move on. The first script to work with is linkedserverdemo1.sql and we will work with this script until I tell you to switch to the next.

Trying to Catch Remote Errors

We will first look at how we can catch remote errors. On the remote server FZ we have this table and stored procedure:

Before we run anything, let's try to understand what will happen when we run local_sp. The first call to remote_sp will insert two rows into remotetbl with the values (1, 1) and (2, 0). The second call will attempt to insert the rows (1, 2) and (3, 0). However, the row with (1, 2) will cause a primary-key violation. Had remote_sp been a stored procedure in the same database, or at least a stored procedure in a database on the same instance, the PK violation would have fired the CATCH block in local_sp, and thus neither row in the second call would have been inserted.

But what happens when we try this on a linked server? The same? Something else? Are you ready to place your bets? OK, let's try it!

The statement has been terminated.

Msg 2627, Level 14, State 1, Procedure remote_sp, Line 2

Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E22AA2996'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

a b

----------- -----------

1 1

2 0

3 0

(3 row(s) affected)

We can observe two things: 1) The row with (3, 0) was inserted into to remotetbl. 2) The CATCH handler in local_sp was not invoked. (The text Entering CATCH handler in local_sp was not printed, and the error message is not reformatted by error_handler_sp.) Instead the error message went directly to the client. And on the remote side, execution continued after the PK violation, which only is a statement-terminating error.

So this means that errors raised in remote stored procedures cannot be trapped by TRY-CATCH? Don't jump to conclusion. Change the second call in local_sp to read:

This will cause a batch-aborting conversion error when trying to insert the first row. So the row (3, 0) should not be inserted this time. This is the output:

Entering CATCH handler in local_sp

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [remote_sp], Line 2. Errno 245: Conversion failed when converting the varchar value 'b' to data type int.

a b

----------- -----------

1 1

2 0

(2 row(s) affected)

As expected, the row (3, 0) is not there, but what is more interesting, this time the CATCH handler in local_sp fired. This may seem like one of those willy-nilly things with SQL Server error handling, but for once there is a rational explanation. Before I come to it, let's look at one more thing. Change the second call to remote_sp to once again read:

Furthermore, change the setting in local_sp to read SET XACT_ABORT ON and try again. Now the output is:

Entering CATCH handler in local_sp

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** [remote_sp], Line 2. Errno 2627: Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E47DBAE45'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

a b

----------- -----------

1 1

2 0

(2 row(s) affected)

This time the row with (3, 0) was not inserted and the local CATCH handler was invoked. Let consider the fate of (3, 0) first. We changed the XACT_ABORT setting locally, but what says that this would affect the remote procedure? However, that is precisely what is happening. (If you want a proof of this, set up a Profiler trace to see what is going on at FZ.) Maybe there is a pattern? If the error on the remote side is batch-aborting, the local CATCH handler is invoked, but not if the remote error was only statement-terminating? That is indeed a correct observation. Remains to understand why.

There is a logical lapse that is easy to make: you may think that the call to the remote procedure is just like any other call, but it isn't. When the PK violation occurs in remote_sp, the FZ instance first checks if there is a CATCH handler within the procedure. There isn't and then it goes further up the call stack. Except that there is not much of a call stack, since remote_sp is the top procedure. To wit, FZ knows nothing about local_sp or whatever CATCH handler the client may have. You see, for FZ the local server is just another client. So this is the rub: there is not one call stack, but there are two, one on each server. This explains why execution continued in remote_sp in the first case: there is no CATCH handler, and the error is statement-terminating.

Let's now move over to the local server. The local server sees that there is an error coming from the linked server. A gut reaction would be to jump to the CATCH handler. But wait, the remote procedure is still executing. So at very least, local_sp has to wait until remote_sp has completed, as it may produce more output, which could be result sets or more error messages. How should the local instance deal with these possibilities? Invoking the CATCH handler would give a false promise that the error in remote_sp was taken care of and that no further disaster happened. Then again, ignoring the CATCH handler makes TRY-CATCH seem to be unreliable.

When I first discovered this behaviour I debated with myself, and eventually I arrived on the opinion that I did not like it. I was about to file a bug, but I found that Jack J Zou had already beaten me to it. Microsoft responded to his item, saying that the behaviour by design and that it is also documented in an older version of Books Online in the topic Handling Errors in Server-to-Server Remote Stored Procedures.

Before we move on, let's look at what happens when we use MSDASQL for the linked server. Change local_sp back to have SET XACT_ABORT OFF. The output is:

OLE DB provider "MSDASQL" for linked server "FZ" returned message "[Microsoft][ODBC SQL Server Driver][SQL Server]The statement has been terminated.".

OLE DB provider "MSDASQL" for linked server "FZ" returned message "[Microsoft][ODBC SQL Server Driver][SQL Server]Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E1AD3FDA4'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).".

Entering CATCH handler in local_sp

Msg 50000, Level 17, State 1, Procedure error_handler_sp, Line 20

*** [local_sp], Line 7. Errno 7212: Could not execute procedure 'remote_sp' on remote server 'FZ'.

a b

----------- -----------

1 1

2 0

3 0

(3 row(s) affected)

In this case the CATCH handler was fired. Here, SQL Server does not have that special knowledge that the other instance is an SQL Server instance, and raised an error despite that procedure continued executing on the other side. We can clearly see that the behaviour when the linked server is known to be SQL Server is a special case, and as noted above it is by design.

Change local_sp to have SET XACT_ABORT ON and run again. The output is the same. That is, this time the setting for XACT_ABORT ON is not propagated to the other side; that's another special case that only happens with SQLNCLI. (And that is not very surprising. With the exception of Sybase, a close relative of SQL Server, I would not expect any other database product have the command SET XACT_ABORT or something similar.)

Before you leave this section, restore the definition of FZ to the normal setting with SQLNCLI, as we will continue to use this script.

Did the Remote Procedure Succeed or Not?

It may seem that we have the answer: when we have XACT_ABORT in place, we can rely on TRY-CATCH, but alas not all errors abort the batch, even when XACT_ABORT is in effect. The most noticeable of them is RAISERROR, which the error from the remote procedure very well may have been produced by. Not the least if the remote procedure uses a standard CATCH handler with error_handler_sp or SqlEventLog. The simple remote_sp we had in the previous section is certainly not the norm.

Here is a version of remote_sp where we use a standard CATCH hander with some variations commented out.

To call this new remote procedure, we have this local procedure which is the same as local_sp, except that it has XACT_ABORT ON and one more diagnostic PRINT:

Both remote_trycatch_sp and local_sp2 are included in linkedserverdemo1.sql. All you need to change is the test section to read:

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 4. Errno 2627: Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E03BB8E22'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

local_sp2 jogs along

a b

----------- -----------

1 1

2 0

(2 row(s) affected)

We can't tell from looking at the error message whether it was passed directly from FZ to the client, or if it was reraised anew in local_sp. However, we can tell from the diagnostic message local_sp2 jogs along, that the CATCH handler in local_sp2 was not invoked.

If your remote server is running SQL 2012 or later, you can test the effect of uncommenting the ;THROW command in remote_trycatch_sp:

Entering CATCH handler in local_sp2

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 4. Errno 2627: Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198EC5142E37'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

a b

----------- -----------

1 1

2 0

(2 row(s) affected)

;THROW aborts the batch when there is no CATCH handler higher up on the call stack, which is why the CATCH handler in local_sp2 is invoked this time. (Please recall that remote_trycatch_sp and local_sp2 are on different call stacks in different processes). In the main parts of this series of articles on error handling, I have favoured the solutions based on RAISERROR over ;THROW, but it seems that it's time to chalk one up for ;THROW here.

Unfortunately, there are complications. Uncomment that diagnostic PRINT, so that CATCH handler in remote_trycatch_sp reads:

Msg 2627, Level 14, State 1, Procedure remote_trycatch_sp, Line 4

Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198EE7FBB646'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

The CATCH handler in remote_trycatch_sp was entered

local_sp2 jogs along

a b

----------- -----------

1 1

2 0

(2 row(s) affected)

What? As I was cooking up the demos for this section, I added that PRINT statement to make it clearer which CATCH handler that was entered. When I ran the test with ;THROW, my jaw dropped when the CATCH handler in local_sp2 was not invoked. To be honest, I have no idea what is going on, but note that there is another small mystery in the output: The message The CATCH handler in remote_trycatch_sp was entered comes after the error message, although the messages are produced in reverse order. It appears that something somewhere changes the order and that lures the local SQL Server to think that the message from ;THROW was not batch-aborting, and therefore it cannot invoke the CATCH handler.

And to take this confusion one step further, comment out the statement SET NOCOUNT ON in remote_trycatch_sp and run again. This time the CATCH handler does fire! So more in general, if you have a remote procedure that runs with SET NOCOUNT ON and first produces an informational message (for instance with PRINT) and then produces an error, you cannot trap this error with TRY-CATCH in your local procedure.

From what we have seen here, we cannot rely on solely on TRY-CATCH to determine whether the remote procedure failed or not, but we need to look for alternatives. The reader may at this point object that a PRINT message in a stored procedure is quite much of an edge case, and that is true. However:

If TRY-CATCH cannot be relied on, what is left to us are return codes and @@error. We first run an exploratory test with local_retcode_ataterror_sp so that we get a feeling of what we can expect:

We try this procedure with different CATCH handlers in remote_trycatch_sp. We can stay with the one we tested last, that is:

We get these values for @ret and @@error (I'm ignoring the error message for brevity):

@ret @@error

----------- -----------

0 0

@ret @@error

----------- -----------

NULL 0

@@error contains nothing of use, but we can see that @ret is explicitly is set to NULL for the call that fails. That is, we know that it was 0 prior to the call, so it was not a matter of @ret just retaining its value. (If that sounds funny, recall that this can happen in procedure calls, as I discussed in the section RETURN and the Return Values from Stored Procedures in Part Two.)

In this case we got the value specified the RETURN statement. Still no value in @@error. What happens if we comment out the RETURN statement, or leave it blank? This time the result is:

The value of -4 may seem surprising, but this is something that I discussed in the section RETURN and the Return Values from Stored Procedures. When there is no explicit RETURN value, but an error has been raised during the execution of the stored procedure, the return value is 10 minus the severity level of the error.

With this knowledge, we can write a local procedure which is able to detect more remote errors. (All remote errors? Stay tuned!)

That is, we check @ret after each call to the remote procedure, and if the value is anything but zero, we raise an error to take us to the CATCH block. As you see, I also set @ret to a non-zero value before each call. This may be overly ambitious. I have not seen a case when calling a remote store procedure where the return value has been unchanged. Or more precisely, I have not seen such a case when I also have TRY-CATCH. But I certainly did not try everything, so I cannot be sure that the pre-assignment of @ret is unnecessary.

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 4. Errno 2627: Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E6E8B6712'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

Entering CATCH handler in local_retcode_sp

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [local_retcode_sp], Line 13. Errno 50000: remote_trycatch_sp failed. Return status -4

a b

----------- -----------

1 1

2 0

Query Timeout on Linked Servers

Here is a really ugly one. For this example, you should comment out the creation and drop of remotetbl from linkedserverdemo1.sql. Instead run this script in a query window directly connected to your linked server:

The implication of the serializable isolation level is that if you re-run a query in the same transaction, you are guaranteed to get back the exactly same result. That is, no rows will have been modified, deleted or inserted. In other words, when we run any of the remote procedures in the demo, they will be blocked as long as this transaction is open.

As you may recall from Part Two, many client APIs has a default query timeout of 30 seconds. If no result has been produced within this amount of time, they send an attention signal to SQL Server and communicate an error to the application. When SQL Server accesses a linked server, it also employs a query timeout. However, in SQL Server the timeout defaults to ten minutes, and thus the risk you would experience a timeout is less. For this test, we don't want to wait ten minutes, so the script includes this command to set the timeout to five seconds:

Note: the default value for the server option is 0, but that does not mean "wait forever", but instead SQL Server uses the server-wide configuration parameter remote query timeout (s), which by default is 600, that is, ten minutes.

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "Query timeout expired".

@ret @@error

----------- -----------

NULL 0

No error was raised! All we get is an informational message from the OLE DB provider. This is no less than horrifying! This means that if you are not prepared for this, not only will your local stored procedure be unaware of that the call to the remote procedure failed, but furthermore the application will think that everything is nice and cosy. This must be construed as a bug, and I have reported it on Uservoice. The response so far from Microsoft is non-committal, but the item is at least still open.

Nevertheless, there are some good news. While there is no error message, we can note that the return value is NULL, so that way we can tell that something went wrong, and we can verify this by running local_retcode_sp:

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "Query timeout expired".

Entering CATCH handler in local_retcode_sp

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [local_retcode_sp], Line 8. Errno 50000: remote_trycatch_sp failed. Return status (null)

a b

----------- -----------

(0 row(s) affected)

It seems that this bad behaviour is confined to the case when the OLE DB provider is SQLNCLI, that is, when then the remote server is known to be another SQL Server instance. To wit, if you change FZ to be defined through MSDASQL instead, you get this output:

OLE DB provider "MSDASQL" for linked server "FZ" returned message "[Microsoft][ODBC SQL Server Driver]Query timeout expired".

Entering CATCH handler in local_retcode_ataterror_sp

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [local_retcode_ataterror_sp], Line 6. Errno 7214: Remote procedure time out of 600 seconds exceeded. Remote procedure 'remote_trycatch_sp' is canceled.

In this case, SQL Server produces a generic error message. (Never mind that the number 600 is incorrect; apparently it takes the value from the server-level configuration parameter and ignores the setting for the linked server.)

Commit that open transaction and make sure that you drop remotetbl before you move on. Also make sure that you restore the definition of the linked server to use SQLNCLI. Finally, restore the statements in the script to create and drop remotetbl.

Using Synonyms

Alas, there are also a situation where return codes fails, something Robin Otte made me aware of. He had a synonym for his remote procedure like this:

Using synonyms for objects on other servers and databases is definitely good practice, since if the remote database is moved elsewhere, you only need to retarget the synonyms, but you don't need to mess with the actual code.

The script includes a procedure local_synonym_sp which is the same as local_retcode_sp, except that the procedure calls are through the synonym:

We change the test to run this procedure, and if you have not changed remote_trycatch_sp since the previous test, you get this output:

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 4. Errno 2627: Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E5FB28569'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

Entering CATCH handler in local_synonym_sp

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [local_synonym_sp], Line 14. Errno 50000: remote_trycatch_sp failed. Return status -4

a b

----------- -----------

1 1

2 0

But now change the CATCH handler again, so that the PRINT and the ;THROW statements are activated. Also make sure that the initial SET NOCOUNT ON is active. This time when you run, this happens:

Msg 2627, Level 14, State 1, Procedure remote_trycatch_sp, Line 4

Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198EEC38D725'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

The CATCH handler in remote_trycatch_sp was entered

The 'remote_trycatch' procedure attempted to return a status of NULL, which is not allowed. A status of 0 will be returned instead.

a b

----------- -----------

1 1

2 0

This time, we get 0 as the return value. The only thing that tells us that something went wrong is that informational message about the procedure attempting to return NULL and because this is not permitted, 0 will be returned instead. But since that message goes to the client and cannot be trapped, there is no way that we programmatically tell that remote_trycatch_sp failed.

The message itself seems confusing. If the procedure was permitted to return NULL when called from local_retcode_sp, why not this time? But if you look closer at the message, you see that it is not the procedure name that appears there – it is the name of the local synonym. Why this happens, I don't know, but supposedly a synonym to an object on remote server is implemented as a some sort of hidden stored procedure. (From what I can tell, you don't get this warning, when the synonym refers to a procedure on the same server.)

What do about it? I'm sorry, but I will have to wave the white flag and say when you use synonyms for your remote procedures, you will not be able detect that the procedure failed with a timeout error. Or, as in this test case, there is a PRINT before the error message is raised. Since is quite bad, I have filed a bug on Uservoice. I'm not really holding my breath that Microsoft will fix this, though.

Stored Procedure Calls in Distributed Transactions

By now, we have changed so much forth and back in linkedserverdemo1.sql, that we will leave that script behind and move on to linkedserverdemo2.sql. This script contains the same tables and procedures as linkedserverdemo1.sql with the addition of BEGIN and COMMIT TRANSACTION in the local stored procedures. Here is the new version of local_sp:

Entering CATCH handler in local_sp

Msg 50000, Level 16, State 3, Procedure error_handler_sp, Line 20

*** [local_sp], Line 7. Errno 8501: MSDTC on server 'NATSUMORI\CATORCE' is unavailable.

That is, I had forgotten to start DTC. I put in that message, in case you are equally oblivious. Having started DTC, I got this output:

The statement has been terminated.

Msg 2627, Level 14, State 1, Procedure remote_sp, Line 2

Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198EA3F8EDDC'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

a b

----------- -----------

1 1

2 0

3 0

(3 row(s) affected)

The outcome was the same as without a transaction. That is, the error was not caught in local_sp, but execution continued in remote_sp and the transaction was committed at the end. If you change local_sp to have SET XACT_ABORT ON, the error in remote_sp is caught, the transaction is rolled back and the tables are empty. There is nothing radical about this; but the behaviour is quite expected. As long as XACT_ABORT is OFF in remote_sp, it will continue execution on a statement-terminating error like a PK violation.

Let's now try with local_sp2, which looks like this after adding a user-defined transaction:

We run local_sp2 and then look at what's in remotetbl. This is the output you get when the local server runs SQL 2005:

Msg 266, Level 16, State 2, Procedure remote_trycatch_sp, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 1, current count = 0.

Entering CATCH handler in local_sp2

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 4. Errno 2627: Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198EE91D9F5C'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

a b

----------- -----------

(0 row(s) affected)

(If you get a completely different message, comment out BEGIN and COMMIT TRANSACTION and run local_sp2. Once you have done this, restore BEGIN and COMMIT and run again; you should now see an output similar to the above. We will return to that different message later.)

Do you recall what happened when we ran local_sp2 without a transaction? That time the CATCH handler in local_sp2 was not fired, but the SP jogged along. This time, however, the local CATCH handler was fired. Why, I don't know exactly, but the explanation may be first in the output: there is an extra error message due to that remote_trycatch_sp rolled back the transaction which caused a trancount mismatch. This message is sent directly to the client, while the PK error is visible in the CATCH handler. If you run local_sp2 on SQL 2008 or higher, no message is sent to the client directly; all you see is the PK violation which is caught by the CATCH handler.

So this looks quite good then? We get the error message and even our CATCH handler works now, and there is no need to look at return codes? May I remind you of that 1974 smash-hit from Bachman Turner Overdrive, You Ain't Seen Nothing Yet!

Notice that the first line in remote_trycatch_sp says SET NOCOUNT ON. Change that to SET NOCOUNT OFF and try again. This is the output on SQL 2005:

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 4. Errno 2627: Violation of PRIMARY KEY constraint

PK__remotetb__3BD0198EDA073A2C'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

Entering CATCH handler in local_sp2

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 0. Errno 266: Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 1, current count = 0.

a b

----------- -----------

(0 row(s) affected)

The messages are reversed! (It's a little confusing to read, but the message about the PK violation comes from the call to error_handler_sp on the remote server FZ.) That we could possibly live with, but look what happens on SQL 2008:

Entering CATCH handler in local_sp2

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 0. Errno 266: Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 1, current count = 0.

Msg 0, Level 11, State 0, Line 0

A severe error occurred on the current command. The results, if any, should be discarded.

a b

----------- -----------

(0 row(s) affected)

The error about the PK violation is nowhere to be seen. Instead there is this other message about a severe error, which presumably is due to that SQL Server sends bad TDS. I say so, because when the local server runs SQL 2012 or later this message is not there:

Entering CATCH handler in local_sp2

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 0. Errno 266: Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 1, current count = 0.

a b

----------- -----------

(0 row(s) affected)

Confusing, eh? I think I can explain the difference between SQL 2005 and the others. On SQL 2008, all error messages from the remote procedure are handled in the same way, but since only one message can be displayed the error_xxx functions, you can only see one of them. You can convince yourself that both are there if you are running local_sp2 on SQL 2012 or later by inserting a ;THROW to the CATCH handler in local_sp2. You will see both error messages, but the order will be different depending on the NOCOUNT setting in remote_trycatch_sp.

But what has NOCOUNT to do with it? I think I have hunch of what is going on, but the explanation goes far beyond the scope of this series of articles, so I let it suffice to say that it has to do with the internals of the TDS protocol and the OLE DB provider.

If we try to escape our confusion and instead look at the practical implications, we can see that there is quite a serious one. There is a fifty-fifty chance that the error message we see is message 266 and that the real error is hidden. This means that our possibilities to understand what went wrong are seriously diminished.

Entering CATCH handler in local_sp2

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 11. Errno 8525: Distributed transaction completed. Either enlist this session in a new transaction or the NULL transaction.

a b

----------- -----------

(0 row(s) affected)

This time the output is the same on all versions of SQL Server. There are not two messages, because this is a batch-aborting message that occurs in the CATCH block in remote_trycatch_sp before the original message is reraised.

More precisely the error occurs when error_handler_sp is invoked. Why? I don't have the full story, but judging from the error message we are in the grip of DTC. The transaction has been rolled back, but if we try to start some local SQL Server transaction, DTC tells us to start a new distributed transaction or an explicit NULL transaction (which I don't think is possible from T‑SQL, but please tell me if you find out). In this particular case, what triggers the wrath of DTC is the attempt to compile a stored procedure which is not in the plan cache.

As long as we are using error_handler_sp, we could possibly consider this to be a small problem, because the odds are decent that the procedure already has a plan in the cache. However, in Part Three of this series, I introduce SqlEventLog and the stored procedure slog.catchhandler_sp. I have not been able to use this procedure successfully in remote_trycatch_sp, but I get the error above, even if there is a plan for the procedure in the cache. When trying to understand why, I have found that all of these trigger error 8525:

To sum this up, if you are inside a transaction and call a remote stored procedure that uses proper error handling, that is, a CATCH block as suggested in this series of articles, all you may see is this very unhelpful message. To find the original error message, you would need to use Profiler on the remote server, and include the events Error:Exception and Error:User message in the trace. You could also do the corresponding in an X-Event session. Whichever, this requires that you are able to reproduce the problem. If it was something that only happens intermittently, you are left in the dark of what is going on.

This behaviour is utterly bad, and when I discovered it I filed a bug, although I had little hope that it would be fixed; it was more like a protest action. Indeed the bug was closed as Won't Fix. (They first tried with By Design, but I did not buy that design.) The situation would have been somewhat better if it had been possible to detect that you are in a distributed transaction. You can tell this from the DMV sys.dm_session_transactions, but to access this DMV requires the permission VIEW SERVER STATE, so it is not good to put in application code. I have a feedback item calling for something better.

If you are calling a stored procedure which you have written to be called over a linked server, there is a simple strategy you can apply to avoid this problem: don't use any CATCH block, but rely solely on SET XACT_ABORT ON (which you should have in the remote procedure).

But the problem is that in many cases you want to call an existing stored procedure, which might be calling other procedures in its turn, and therefore you don't have any control over how they do their error handling. I'm afraid that I don't have any advice for you.

What happens if we use a linked server with MSDASQL for the examples in this section? I'm not showing the output, but here is a quick summary. As long as the plan for error_handler_sp is in the cache, we get both messages from remote_trycatch_sp, but we get them as informational messages, so the client must take care of them and not drop them on the floor. The case where error_handler_sp is not the cache is not any different; we get the same error message about a completed distributed transaction.

To conclude this section on distributed transactions on the lighter side, here is what happens when you have SET NOCOUNT ON in remote_trycatch_sp, and you uncomment the diagnostic PRINT and the ;THROW statement in the CATCH handler:

Msg 2627, Level 14, State 1, Procedure remote_trycatch_sp, Line 4

Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E09AB25D0'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

The CATCH handler in remote_trycatch_sp was entered

local_sp2 jogs along

Entering CATCH handler in local_sp2

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [local_sp2], Line 9. Errno 3930: The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.

a b

----------- -----------

(0 row(s) affected)

This time, the CATCH block in local_sp2 was not fired directly, but the SP tried to jog along. However when it tried to commit the transaction, it was detected that the transaction was doomed, and there was not any actual harm.

Accessing Remote Objects in Queries

We will now turn to accessing remote objects directly in queries, by four-part notation or through OPENQUERY.

Linked Servers and Deferred Name Resolution

In this section we will look at what happens when we try to access non-existing objects. There is no pre-cooked script for this section.

In Part Two, you learnt that SQL Server has deferred name resolution, which has the consequence that you can get compilation errors at run-time. That is, if you run this:

The procedure is created successfully whereupon it fails at run-time, because the table is missing. With linked servers it is a completely different matter. If you try this:

Msg 7202, Level 11, State 2, Procedure ForCalvin, Line 2

Could not find server 'AndHis' in sys.servers. Verify that the correct server name was specified. If necessary, execute the stored procedure sp_addlinkedserver to add the server to sys.servers.

What if the server exists, but not the table? Assuming that you still have FZ defined try this:

Msg 7314, Level 16, State 1, Procedure ForCalvin, Line 2

The OLE DB provider "SQLNCLI11" for linked server "FZ" does not contain the table ""Next"."Two"."HitchHikers"". The table either does not exist or the current user does not have permissions on that table.

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "Deferred prepare could not be completed.".

Msg 208, Level 16, State 1, Line 1

Invalid object name 'Next.Two.HitchHikers'.

I find this ironic. While I don't like deferred name resolution in general, I could see a point in having it for linked servers. If I run a large deployment script on my production server in a tight maintenance window, and the remote server is down because they are applying a service pack, do I want my script to fail? I guess the reason for this inconsistence is that when deferred name resolution was introduced in SQL 7, linked servers were scoped out because of time constraints or some other reason. But it is in moments like this you easily start to think that SQL Server has been designed for maximum confusion.

Say now that you are able to create the procedure, but at run-time the table is missing for whatever reason. What happens in this case? Here is a script that creates a table at the remote server, creates a local procedure, and then drops the table before the procedure is executed:

Msg 7314, Level 16, State 1, Procedure Awrighty, Line 5

The OLE DB provider "SQLNCLI11" for linked server "FZ" does not contain the table ""tempdb"."dbo"."Cleetus"". The table either does not exist or the current user does not have permissions on that table.

That is, since the characteristic output of the error_handler_sp is not there, we can tell that the CATCH handler is not invoked, not too different from when a local table is missing. But if you look closer, it's not exactly the same. With a local table, the procedure starts running, and you get the error when you reach the SELECT statement. But since the message Awrighty starting is not printed, we can tell that we never entered the TRY block. No wonder that the CATCH block is not fired! The error occurs already when the optimiser builds a plan for the procedure.

What if the procedure is in the cache when the table is dropped? If this happens with a table in a database on the local server, SQL Server directly invalidates all plans referring to this table. But that cannot happen when a table is dropped on the remote server, so the local instance will run the procedure and not discover until it reaches the statement that the table is gone. This is exactly what happens if we uncomment the first call to Awrighty in the script above:

Awrighty starting

Awreety

-----------

Awrighty starting

Entering the CATCH handler

Msg 50000, Level 16, State 1, Procedure error_handler_sp, Line 20

*** [Awrighty], Line 5. Errno 7314: The OLE DB provider "FZ" for linked server "FZ" does not contain the table ""tempdb"."dbo"."Cleetus"". The table either does not exist or the current user does not have permissions on that table.

Indeed, the procedure starts executing, and it tries to run the SELECT statement which fails, et voilà! The error is caught and handled like any other run-time error. (Which, as you may recall, is in contrast to when a table local to the server is missing. In this case, the error cannot be caught in the scope where it occurs, only in outer scopes.) If you peel off the TRY block, add a transaction and run some more tests, you will find that this is a batch- and transaction-aborting error. I'm not including an example to show this, but I leave it as an exercise to the reader.

The remote table being dropped is only one example of what could cause the query to fail. Columns could have been dropped or renamed. Or for that matter the database itself. The server could be down. These situations result in different errors, but as long as the plan is the cache already, the effect is the same: you get a run-time error that you can trap with TRY-CATCH. On the other hand, if the definition of the linked server is dropped, this flushes the plan cache, and you get a scope-aborting compilation error that you cannot trap in the local scope.

Errors When Updating Remote Objects

Here are some examples of what can happen if you attempt to update remote objects through four-part notation. A full script for this section is available in the file linkedserverdemo3.sql.

The statement has been terminated.

Entering CATCH handler in EatThatQuestion.

This is the CATCH handler of Blessed Relief

Msg 50000, Level 14, State 1, Procedure error_handler_sp, Line 20

*** <dynamic SQL>, Line 1. Errno 2627: Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E6CD828CA'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).

The output is not too unexpected, but we can note that the procedure name and line number are incorrect. This is not that mysterious; the local server connects to the remote server and it is the remote server that runs the query, presumably through some sort of ad-hoc batch. The local server then only relays what it gets. (Recall that the text about dynamic SQL is added by error_handler_sp when error_procedure() returns NULL.) We can also note that this is different from when we called a remote stored procedure: in that case a PK violation went directly to the client and it was not trappable with TRY-CATCH. But in this case, SQL Server has full control over what it emits to the remote server, and knows that this is a single operation, so it is safe to invoke the CATCH handler.

Now let's uncomment the BEGIN and COMMIT TRANSACTION statements and see what happens. This is a case where it matters whether you have a loopback server, or whether your linked server is a different instance. If your linked server is a loopback server, you get this message:

Entering CATCH handler in EatThatQuestion.

This is the CATCH handler of Blessed Relief

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** <dynamic SQL>, Line 1. Errno 3910: Transaction context in use by another session.

Mysterious as it may seem, this falls under the category you are not supposed to do that. If you have set up FZ to point to a different instance, you will instead see this output:

OLE DB provider "SQLNCLI10" for linked server "FZ" returned message "Cannot start more transactions on this session.".

Entering CATCH handler in EatThatQuestion.

This is the CATCH handler of Blessed Relief

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** [EatThatQuestion], Line 7. Errno 7395: Unable to start a nested transaction for OLE DB provider "SQLNCLI10" for linked server "FZ". A nested transaction was required because the XACT_ABORT option was set to OFF.

Here is a case where you first get a message from the OLE DB provider and another from SQL Server. The message from SQL Server is quite clear: we need to change EatThatQuestion to have SET XACT_ABORT ON. So let's do that. You might see this output:

This message indicates an internal error somewhere, either in SQL Server or the client API. Often when the level is 11, the error has occurred is in the client API, although the bug does have to be there; often it is SQL Server that has emitted bad TDS. If you try SQLCMD or OSQL instead of SSMS, you may get no output at all – despite that we know that EatThatQuestion generates a PK violation.

If you run the script a number of times, you may sometimes see this message instead:

There are all sorts of mysteries here. None of the CATCH blocks have been entered, and the message comes directly from SQL Server, without passing through error_handler_sp. The severity level is 18, beyond the normal range for user errors which 11 to 16. If you count lines in EatThatQuestion, what is on line 10? This statement:

That is, the PK violation is trapped and execution is sent to the CATCH block, but upon entering the CATCH block something really bad happens, and an entirely uncatchable error is produced. (It does not help if you wrap EatThatQuestion in an outer procedure.) And not only that: again we face a situation where the original error is masked by a new error, meaning that we have very little clue about went wrong originally. If the situation is easily reproducible, the workaround is the same as above: run a trace or an X-event session and include the events for exceptions and user messages.

When I discovered this, I reported a bug. Microsoft closed it with Won't fix but they were kind to offer an explanation, which I reproduce here:

In their response, Microsoft also referred to the topic Using TRY...CATCH in Transact-SQL in Books Online, which contains this relevant passage in a note:

The race condition with multiple processes at play, two SQL Server instances and one or two DTC instances, explains why the output is not always the same. The full story about A severe error... is not revealed, but maybe SQL Server has started sending something when the attention signal interrupts and so the client gets get an incomplete message.

In their initial response to the bug, Microsoft say that they handle the situation for stored procedure calls correctly, so supposedly we should not see this error when calling remote stored procedures. (But as we have seen, there are other issues.)

Interesting enough, if I run this scenario with FZ set up to use MSDASQL, I consistently get the correct error message on SQL 2016 and lower:

OLE DB provider "MSDASQL" for linked server "FZ" returned message "[Microsoft][ODBC SQL Server Driver][SQL Server]The statement has been terminated.".

OLE DB provider "MSDASQL" for linked server "FZ" returned message "[Microsoft][ODBC SQL Server Driver][SQL Server]Violation of PRIMARY KEY constraint 'PK__remotetb__3BD0198E6383C8BA'. Cannot insert duplicate key in object 'dbo.remotetbl'. The duplicate key value is (1).".

Entering CATCH handler in EatThatQuestion.

This is the CATCH handler of Blessed Relief

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** [EatThatQuestion], Line 8. Errno 7343: The OLE DB provider "MSDASQL" for linked server "FZ" could not INSERT INTO table "[FZ].[tempdb].[dbo].[remotetbl]". The data violated the integrity constraints for one or more columns.

However, since there is a race condition here, this does not really prove anything. It might be working by chance only, and you may see a different result in your environment. On SQL 2017, I don't get the two messages from MSDASQL, but I get the other messages.

Query Timeouts

For completeness sake, let's look how this works with direct queries. On FZ run in tempdb (after having created remotetbl):

Make sure that the BEGIN and COMMIT TRANSACTION commands in EatThatQuestion are commented out and run BlessedRelief. This was the output I got:

OLE DB provider "SQLNCLI11" for linked server "FZ" returned message "Query timeout expired".

Entering CATCH handler in EatThatQuestion.

This is the CATCH handler of Blessed Relief

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** [EatThatQuestion], Line 6. Errno 7343: The OLE DB provider "SQLNCLI11" for linked server "FZ" could not INSERT INTO table "[FZ].[tempdb].[dbo].[remotetbl]".

...except that on SQL 2017, the first message, Query timeout expired, is not produced. Since it is bad to lose the original error message, I have reported this as a bug on Uservoice.

Errors While Returning a Result Set

In this chapter we will look at what happens when an error is produced while a result set is returned to the client. If you have read Part Three to the end, you may recall the section The Virtue of Getting All Result Sets, where we saw that this is a situation where you can lose the error message if you don't write your client code properly. Here we will explore how this works, when SQL Server is the client to a remote server.

The script for this chapter is linkedserverdemo4.sql. You can run the script as a whole, but we will look at it piece by piece. The first batch is the same as in the other scripts: it sets up the linked server. Likewise does the last batch clean up the only object created.

The script first sets up remotetbl on FZ. However, this is a little different from before:

As previously, there are two remote stored procedures, remote_sp without TRY-CATCH and remote_trycatch_sp:

a b

----------- ----

1 9

Msg 0, Level 11, State 0, Line 10

A severe error occurred on the current command. The results, if any, should be discarded.

This is a message from SSMS that things went wrong, and since the severity level 11, the error presumably occurred in SSMS itself. However, as above, I suspect that the root cause is that SQL Server sends bad TDS. In any case the consequence is that we don't know what the real error is. (And nor we are not able to tell whether execution continued on the remote server.) Not that I really expect Microsoft to fix this, but I have reported it as a bug on Uservoice.

a b

----------- ----

1 9

(1 row affected)

Msg 50000, Level 16, State 2, Procedure error_handler_sp, Line 20

*** [remote_trycatch_sp], Line 3. Errno 220: Arithmetic overflow error for data type tinyint, value = 999.

This is a different situation for the local SQL Server, as the error message does not arrive in the middle of the result set, but comes as a different "result set" in the TDS stream.

Let's now look at what happens when we call these procedures through OPENQUERY instead:

a b

----------- ----

1 9

Msg 220, Level 16, State 2, Line 40

Arithmetic overflow error for data type tinyint, value = 999.

This is exactly the output we expect. However, if we call the procedure with TRY-CATCH, it does not go so well:

The error message is lost entirely! Undoubtedly, if used causually, this can have disastrous consequences. The result set is returned to the user, who thinks the data is correct, and from this makes a business decision which puts the company at stake.

You may at this point expect a link to my bug report on Uservoice, but there isn't any. Indeed, this is exactly the mistake I warn you for in the section The Virtue of Getting All Result Sets, but in this case I say it this is "by design". OPENQUERY is designed to return one result set and ignore the rest. Furthermore, OPENQUERY is supposed to passed a pass-through query. And then I mean a query, not a procedure call. If you want to call a stored procedure on a remote data source to use the result in a query, you are better of inserting the result in a temp table with INSERT-EXEC. Even if that means that you need to write a CREATE TABLE statement with umpteen columns, although you only care about three or four of them.

Conclusion

We have now looked at the peculiarities with error and transaction handling when you use linked servers and we have focused on the most common case: the linked server is another SQL Server instance and we use the SQLNCLI provider. We have also made some tests using MSDASQL + the SQL Server ODBC driver to get a feeling for what you can expect when you access other data sources than SQL Server.

We have seen that in the case with SQLNCLI, SQL Server takes some shortcuts and presents errors on from the remote instance as if they are local messages. This does not always fall out well, and you can face situations where you lose the original error message, not the least if you engage in distributed transactions. And there is the really ugly case when you get a timeout in a remote procedure and this does not raise any error at all.

In contrast, when you use MSDASQL, the behaviour is squared: the error message you can trap in SQL Server is a generic one telling you that some operation failed. The real error messages always go to the client as informational messages. For some cases, like the query timeout, the behaviour with MSDASQL is certainly better than with SQLNCLI. However, please keep in mind that the examples with MSDASQL + SQL Server were for demonstrational purposes only. Microsoft does not support linked servers with MSDASQL for connection to other SQL Server instances; they only support SQLNCLI.

You have seen several examples of confusing behaviour in this appendix. Have you gotten the full story? I cannot make any such promises. This appendix came about when I started to write something a lot briefer intended to go into the chapter Special Contexts in Part Two. However, as I explored various combinations, I found more and more confusing things, and I quickly had more material than I could fit into the main article. And at some point I just had to stop. Thus, I am quite sure that if you use linked servers a lot that you will find more absurdities than you have seen here.

If you are working with a remote data source other than SQL Server, I strongly recommend that you test various error scenarios to find out how they work. Keep in mind that you are not only at prey of the funky stuff in SQL Server, but there may also be glitches in the OLE DB provider you use. And if you use MSDASQL + ODBC you have one more boundary in the mix where confusion can appear.

Overall, I am not a friend of linked servers and even less of distributed transactions. They often mean hassle, and then I am not only thinking of error and transaction handling, but more of things in general like connectivity. Not only may the remote server be down, but it can stop working because changes in the AD which prevents delegation from working. Performance can also be a problem, since optimisation of distributed queries is quite a challenge for the optimiser. Distributed transactions also incur several restrictions. You cannot use snapshot isolation, and you cannot use the command SAVE TRANSACTION. Distributed transactions are not supported with mirroring, and nor were they supported with availability groups before SQL 2016 SP2. On top of all, you cannot easily check whether you are in a distributed transaction.

What are the alternatives? Well, if you can put the databases on the same instance do so – that will make your life so much easier. If the databases must be on different servers, look into asynchronous solutions with Service Broker or similar. Or have a client program to talk to both servers. True, many such solutions require more effort up front. But linked servers are one of those things that look simple in the beginning but keep on causing grief as you use them.

We are the end of Appendix 1. If you have questions on the contents, or you have noticed any spelling or grammar errors, you are welcome to mail me on [email protected]. If you have problems related to something you are working with, I recommend that you ask in a public forum for quicker response. For a list of acknowledgements, please see the end of Part Three. Below is a revision history for this appendix.

Error and Transaction Handling in SQL Server
Appendix 1: Linked Servers

Introduction

Index of All Error-Handling Articles

Some General Notes on Linked Servers

Setting Up a Linked Server

Distributed Transactions

How Errors from Linked Servers are Communicated

Errors Occurring Locally

Errors in the OLE DB Provider

Errors in the Remote Data Source

Calling Remote Procedures

Trying to Catch Remote Errors

Did the Remote Procedure Succeed or Not?

Query Timeout on Linked Servers

Using Synonyms

Stored Procedure Calls in Distributed Transactions

Accessing Remote Objects in Queries

Linked Servers and Deferred Name Resolution

Errors When Updating Remote Objects

Query Timeouts

Errors While Returning a Result Set

Conclusion

Revision History

Error and Transaction Handling in SQL Server Appendix 1: Linked Servers

Introduction

Index of All Error-Handling Articles

Some General Notes on Linked Servers

Setting Up a Linked Server

Distributed Transactions

How Errors from Linked Servers are Communicated

Errors Occurring Locally

Errors in the OLE DB Provider

Errors in the Remote Data Source

Calling Remote Procedures

Trying to Catch Remote Errors

Did the Remote Procedure Succeed or Not?

Query Timeout on Linked Servers

Using Synonyms

Stored Procedure Calls in Distributed Transactions

Accessing Remote Objects in Queries

Linked Servers and Deferred Name Resolution

Errors When Updating Remote Objects

Query Timeouts

Errors While Returning a Result Set

Conclusion

Revision History

Error and Transaction Handling in SQL Server
Appendix 1: Linked Servers