Error and Transaction Handling in SQL Server
Appendix 3: Service Broker

An SQL text by Erland Sommarskog, SQL Server MVP. Latest revision: 2015-05-03.
Copyright applies to this text. See here for font conventions used in this article.

Introduction

This article is an appendix to the three-part series Error and Transaction Handling in SQL Server. More precisely, it is a continuation of chapter six, Case Study: the Simple Order Database in Part Three. This appendix discusses how the import of external orders could be implemented with Service Broker.

This appendix is not intended to be a full coverage of all error handling you need in a Service Broker implementation. It turns a blind eye to the error messages you can receive in a dialog. The focus is how to handle errors in the T‑SQL that processes the messages, and particularly how to prevent a so-called poison message from disabling a queue. That is, when there has been a rollback five consecutive times on a queue, SQL Server by default disables that queue.

If you arrived at this page through a web search, beware that this appendix is written under the assumption that you have read Parts One and Two in their entirety and Part Three up to chapter six. In case that was a little while ago, I'm including quick recaps of SqlEventLog and the simple order database, as they are featured in the demo which is the centrepiece of this appendix. It goes without saying that this article assumes that you have a basic understanding of Service Broker.

Table of Contents

Introduction

Index of All Error-Handling Articles

Recaps: SqlEventLog and the Simple Order Database

SqlEventLog

The Simple Order Database

Sending the Files as Service Broker Messages

One Message per Order

Demo Setup

Tackling the Problem

ProcessOrders1

Multiple Queue Readers

ProcessOrders2

ProcessOrders3

Closing Remarks

The Case of DoOneOrderFromActivation

A Final Word of Caution

Feedback and Revision History

Index of All Error-Handling Articles

Here follows a list of all articles in this series:

Part One – Jumpstart Error Handling.

Part Two – Commands and Mechanisms.

Part Three – Implementation.

Appendix 1 – Linked Servers. (Extends Part Two.)

Appendix 2 – CLR. (Extends both Parts Two and Three.)

Appendix 3 – Service Broker. (This article.)

All the articles above are for SQL 2005 and later. For those who are still on SQL 2000, there are two older articles:

Error Handling in SQL Server 2000 – a Background.

Implementing Error Handling with Stored Procedures in SQL 2000.

Recaps: SqlEventLog and the Simple Order Database

SqlEventLog

In Part Three, I present SqlEventLog as a general error-logging utility. All components in SqlEventLog exist in the slog schema. SqlEventLog exposes two public stored procedures, slog.sqleventlog_sp and slog.catchhandler_sp.

The procedure sqleventlog_sp permits you to raise an error message and at the same time log the message in the table slog.sqleventlog. The procedure has two mandatory parameters: @procid (for which you always pass @@procid) and @msgtext which is the error message. The message can be parameterised with %1 to %6 as placeholders, and you pass the actual values in the parameters @p1 to @p6.
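
For instance, a parameterised call could look like this. (The call is only an illustration of the interface; the message text is made up and not taken from the demo.)

EXEC slog.sqleventlog_sp @@procid,
     'Order %1 does not exist.', @p1 = @OrderID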

The procedure has a couple of optional parameters, of which I make use of one in this appendix: @raiserror. When set to 0, sqleventlog_sp does not raise any error, but only logs the message to the table.

The procedure slog.catchhandler_sp is intended to be called from a CATCH handler. It logs the error that fired the CATCH handler to the sqleventlog table and reraises the error, unless told not to. The reraised error is augmented with the error number, procedure and line number from the original error. The procedure has this signature:

CREATE PROCEDURE slog.catchhandler_sp
                 @procid        int = NULL,
                 @reraise       bit = 1,
                 @errno         int = NULL OUTPUT,
                 @errmsg        nvarchar(2048) = NULL OUTPUT,
                 @errmsg_aug    nvarchar(2048) = NULL OUTPUT AS

For @procid you always pass @@procid (the parameter is optional to avoid errors in case you forget it). By passing @reraise = 0, you suppress the error from being reraised. @errno returns the number of the original error. @errmsg is the original error message, while @errmsg_aug returns the error message augmented with error number, procedure name and line number.

You find the code for SqlEventLog in the file sqleventlog.zip. To install it, run build_sqleventlog.bat. You need to configure the server and the database in the file. Please see the section Installing SqlEventLog in Part Three for more details on installation.

The Simple Order Database

The setting is that there are internal orders added to the database from some application, and then there are orders coming from an external system, implemented by a different team and vendor. Thus, we should not trust them to always send data in the correct format. In the demo in the main article, the orders come through XML files. A file contains many orders, but each order should be handled separately. That is, if there is an error with one order, the rest of the orders in that file must still be stored in the database. The external system is permitted to modify or delete an existing order as long as the warehouse has not started to process the order.

The core of the database consists of four tables: Orders, OrderDetails, Customers and Products. There are a couple of stored procedures to enter data into these tables and some validation procedures to go with them. Then there are two tables, OrderImportFiles and OrderImportFileItems. The first holds header information about the file; the second has one row per order, with the order still stored in XML format. There is one procedure, AddOrderImportFile, that receives an XML document and stores the data in the two import tables. The procedure ProcessImportFiles loops over all files that have not been fully processed, and ProcessOneImportFile runs a loop where each order is processed. When an order entry has been processed, the row in OrderImportFileItems is updated with status, success or failure, and any error message. A different process, not included in the demo, is responsible for sending back a response file to the external system.

There are two script files with tables, stored procedures and sample data: SimpleOrderDatabase.sql (which only runs on SQL 2012 and later) and SimpleOrderDatabase-2008.sql (which runs on all versions from SQL 2005 and on). The files are largely the same, but the SQL 2012 version makes use of the new function try_convert(). To make things simple, the Service Broker demo builds on the SQL 2008 version. The script files include 32 test orders, of which 26 result in errors when they are processed.

Sending the Files as Service Broker Messages

The simple order database lends itself well to discussions about Service Broker. Rather than sending a file through FTP or whatever, the external order system could send the orders as Service Broker messages.

Let's first study a design that resembles the original case. That is, the external system sends a number of orders assembled into a single XML document. Just like in the original example, there is a Guid attribute in the root element that identifies the "file". (And I will refer to this GUID as a file GUID in this appendix as well.) Since the business rules mandate that orders should be handled separately, the XML document needs to be persisted in the database, and the tables OrderImportFiles and OrderImportFileItems serve that purpose well. This leads to an activation procedure that is very similar to AddOrderImportFile. That is, a procedure that performs these steps:

  1. Receive the message from the queue.
  2. Extract the file GUID and other header data from the XML document.
  3. Validate the document.
  4. Insert the data into OrderImportFiles and OrderImportFileItems.
  5. Send a response message back to the sender and commit.

Since these files only come a few times per day, there would be little reason to have the WHILE loop that is typical for activation procedures.

The interesting bullet for this article is the third, that is, the validation. As a recap, here are the validations AddOrderImportFile performs:

  1. There is a single Orders top node in the XML document.
  2. The Guid attribute is present in the Orders node.
  3. The value of the Guid attribute can be converted to uniqueidentifier.

The original AddOrderImportFile raises an error and rolls back any transaction if any of these validations fail. That would not be particularly intelligent with a Service Broker queue, as the bad XML document would remain in the queue and after five attempts the queue would be disabled. No, the correct behaviour would be to send back an error message that alerts the sender that the message they sent was invalid.

Thus, your code would look something like this:

IF @nodecnt <> 1
BEGIN
   SELECT @errmsg = 'Malformed XML, there should be a single Orders top node.'
   GOTO Error_exit
END
...
Error_exit:
   EXEC slog.sqleventlog_sp @@procid, @errmsg, @raiserror = 0
   SELECT @error_response = (SELECT @errmsg FOR XML PATH('ERROR'))
   ; SEND ON CONVERSATION @DialogHandle
      MESSAGE TYPE OrderFileError (@error_response)
   COMMIT TRANSACTION
   RETURN

That is, you set the error message and then you go to an error exit where you log the message to sqleventlog if you like to. Observe that the parameter @raiserror must be set to 0 as you only want to log the error. There is some formatting of the error message to XML, which is commonly used for Service Broker messages. The error response is sent back to the originator. Finally the transaction is committed.

Why not raise an error and have the SEND command in the CATCH handler? No, that does not fly. We run with XACT_ABORT ON, in which case all errors doom the transaction, and in a doomed transaction we are not able to perform any SEND command, much less commit. Well, we could roll back and then SEND the error response. But if we roll back, the bad message is put back on the queue, which is the last thing we want.
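
To spell it out, this is the pattern that does not fly. (The CATCH handler below is a sketch of the rejected approach, not code from the demo.)

BEGIN CATCH
   -- Does not work: with XACT_ABORT ON the transaction is doomed,
   -- so both the SEND and the COMMIT fail.
   SELECT @error_response = (SELECT error_message() FOR XML PATH('ERROR'))
   ; SEND ON CONVERSATION @DialogHandle
      MESSAGE TYPE OrderFileError (@error_response)
   COMMIT TRANSACTION
END CATCH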

The activation procedure would still have a standard CATCH handler:

BEGIN CATCH
   IF @@trancount > 0 ROLLBACK TRANSACTION
   EXEC slog.catchhandler_sp @@procid
   RETURN 55555
END CATCH

If we end up here, something unexpected happened in our code. Since the error in this case is on our side, there is little reason to keep the queue alive. Deadlock? Ah, in an activation procedure you get the deadlock retry for free: the message is still there in the queue, and hopefully we have better luck next time.

Now, the observant reader may have noticed that there is a detail I have passed over: the validation of the file GUID. On SQL 2012, we can use try_convert(), but you may recall that in the SQL 2008 version of AddOrderImportFile this was the solution:

BEGIN TRY
   SELECT @FileGuid = convert(uniqueidentifier, @FileGuidStr)
END TRY
BEGIN CATCH
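   -- Deliberately empty: if the conversion fails, the error is swallowed
   -- and @FileGuid is simply left as NULL.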
END CATCH

This worked well in Part Three where the input came from a file, and the above was performed outside a transaction. But in an activation procedure there is always a transaction, and it will be doomed if the validation fails. How to handle this? Here are four alternatives:

  1. Upgrade to SQL 2012 or later and use try_convert.
  2. Turn a blind eye to the error, that is, assume that it is too unlikely to happen, and don't have any TRY-CATCH at all.
  3. Implement your own try_convert; see the sketch after this list. (For uniqueidentifier you might be able to do this in T‑SQL. For many other data types, a CLR function is the only realistic alternative.)
  4. Roll back the message and receive it anew and this time send an error response? Works – but only if there is a single queue reader. We will come back to this theme.
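
For uniqueidentifier, alternative 3 is indeed feasible in pure T‑SQL. Here is a minimal sketch of what such a function could look like. (The function is my own illustration, not part of the demo, and it only accepts the canonical 36-character GUID format.)

CREATE FUNCTION dbo.try_convert_uniqueidentifier (@str nvarchar(400))
   RETURNS uniqueidentifier AS
BEGIN
   DECLARE @hex     varchar(20),
           @pattern nvarchar(600)

   -- LIKE has no repetition quantifiers, so build the canonical
   -- 8-4-4-4-12 pattern from single-character hex classes.
   SELECT @hex = '[0-9abcdefABCDEF]'
   SELECT @pattern = replicate(@hex, 8) + '-' + replicate(@hex, 4) + '-' +
                     replicate(@hex, 4) + '-' + replicate(@hex, 4) + '-' +
                     replicate(@hex, 12)

   -- Mimic try_convert: return NULL rather than raising an error.
   RETURN CASE WHEN @str LIKE @pattern
               THEN convert(uniqueidentifier, @str)
          END
END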

By the way, there is one more validation you should perform, at least if you are on SQL 2012 where try_convert is your best friend: to wit, that the message is valid XML. In AddOrderImportFile that check was implemented by the data type of the parameter. For a Service Broker queue, it is common to set up the contracts so that the message type says WELL_FORMED_XML, and in that case the sender can't get bad XML onto the queue. However, Remus Rusanu recommends in his blog post Error Handling and Activation that you should always receive the message body as binary and then cast it to XML. (If you have never read Remus's blog, you should. There is a lot to learn about Service Broker there.)

The scenario here is very simple: just receive a message off a queue, perform a minimum of validations and store it in a local table for further processing. Because we are in full control of what errors may occur, we can permit ourselves a very simple strategy to avoid poison messages disabling the queue. But the more complex your logic is, the more likely it is that you will need better error handling to prevent the queue from being disabled. And that is exactly what we will look into next.

One Message per Order

Let's now assume that the sender sends the orders continuously all through the day, one by one. This leads to quite a different implementation. In this case, we don't want to store the messages in an intermediate table, but we want to process an order message directly and send back a response saying that the order was successfully received in our system – or that it was invalid.

Demo Setup

To discuss the alternatives, I've cooked up a demo which you find in the file ServiceBroker-demo.sql. (This script runs on all versions of SQL Server from SQL 2005 and up.) To run this demo, you need to make some preparations. First you need to make sure that you have SqlEventLog installed in your database. (I recommend that you use tempdb.) Next you need SimpleOrderDatabase-2008.sql to create and populate the simple order database. You should run this script only down to this line:

EXEC AddOrderImportFile @orderdoc, 'C:\Temp\Orderfile2.xml'
go

That is, you should load OrderImportFiles and OrderImportFileItems with the test orders, but you should not process them. In this Service Broker demo, these two tables have changed sides and are now part of the external system.

Note: On SQL 2005, the script generates two errors saying Incorrect syntax near WHERE. You can ignore these errors. They are due to two filtered indexes that are not essential for the demo.

You will repopulate the order database several times during this demo, so you may want to make a copy of the script where you put the clean-up section on top, and then delete everything below this line:

------------------------------- Process the test files! -------------------------

This permits you to recreate the order tables by running the entire query window.

Looking inside ServiceBroker-demo.sql, this script first sets up two services, one for order requests and one for order responses, with the associated queues, message types and one contract. There is no activation here; this is added to the queue OrderRequests later. Next follows the stored procedure SendAndReceiveOrders. This procedure serves as our token external system. It reads the orders in OrderImportFileItems and sends a message for each order on the OrderRequests service. It then enters a loop where it listens to messages on the OrderResponses queue. The procedure has a table variable where it maps the conversation handles to the key columns in OrderImportFileItems, so that it can update the columns InternalOrderID, Status and Message with the data in the response messages. (A normal external order system would probably rely on the order ID in the orders, but you may recall that quite a few of the test orders are faulty, and some do not even have the requested OrderID attribute.) SendAndReceiveOrders takes a parameter @WaitTime which controls how long to wait for messages from the OrderResponses queue. Once the timeout has expired, it exits.

I'm not showing any of the code in SendAndReceiveOrders here, because it is merely a procedure to produce test cases and drive the demo, and there is nothing particularly interesting in terms of error handling in the procedure. Nor does it aspire to show any best practices for Service Broker.

Next in the script comes a function, get_waittime, which is also a helper for the demo. I will return to the purpose of this function later.

The procedure that follows, DoOneOrderFromActivation, is somewhat more interesting as it is part of our order system. It is not an activation procedure itself, but it is called from the different activation procedures in this demo. It is drawn from the SQL 2008 version of ProcessOneImportFile, or more precisely the body of the WHILE loop in that procedure. The procedure has this signature:

CREATE PROCEDURE DoOneOrderFromActivation @data         xml,
                                          @OrderGuidStr nvarchar(400),
                                          @OrderID      int OUTPUT AS

@data is the XML document received in the message. @OrderGuidStr is the OrderID attribute in the XML document, which is extracted prior to calling the procedure. @OrderID is the order ID in the Orders table, which is returned as it is needed in the response message. There are no parameters for status and error message, because this procedure only returns in a normal way if the order is processed successfully. If there is an error with an order, the procedure raises an error which is captured by a CATCH block in the caller.

The first executable statement in the procedure is this check:

   IF @@trancount = 0
   BEGIN
      EXEC slog.sqleventlog_sp @@procid,
           'This procedure should be called from a transaction'
      RETURN 55555
   END

Since this procedure is intended to be called from an activation procedure, things are seriously wrong if there is no transaction in progress. We should have a transaction to keep the conversation group locked until the message has been processed.

The rest of the procedure is very similar to ProcessOneImportFile and I am not going to repeat it, although there are a few small variations. The CATCH handler in DoOneOrderFromActivation is a standard one:

BEGIN CATCH
   IF @@trancount > 0 ROLLBACK TRANSACTION
   EXEC slog.catchhandler_sp @@procid
   RETURN 55555
END CATCH

You may recall that in ProcessOneImportFile, I tacked on the ID for the current product to the error message. I did not find a clean way of doing this here, so I simply dropped that level of ambition.

Next follow three different activation procedures, ProcessOrders1, ProcessOrders2 and ProcessOrders3, that demonstrate different techniques. These three procedures are the main meat of this demo, and I cover them in detail below.

The activation procedures are followed by two batches that perform the part that actually runs the demo. You find these batches towards the end of the file, just before the clean-up section. This is how they look:

ALTER QUEUE OrderRequests WITH 
      STATUS = ON,
      ACTIVATION (STATUS = ON,
                  PROCEDURE_NAME = ProcessOrders1,
                  MAX_QUEUE_READERS = 1,
                  EXECUTE AS OWNER)
go
DECLARE @logid int
SELECT @logid = coalesce(MAX(logid), 0) FROM slog.sqleventlog
EXEC SendAndReceiveOrders 5000
SELECT * FROM OrderImportFileItems ORDER BY FileID, RowNo
SELECT * FROM slog.sqleventlog WHERE logid > @logid ORDER BY logid

That is, I first define which activation procedure to use and with how many queue readers. This is the part that we will change as the demo proceeds.

In the second batch, I run SendAndReceiveOrders, and after the execution I display the contents of OrderImportFileItems and the messages logged to the table sqleventlog during the execution. SendAndReceiveOrders takes one parameter, which is a timeout in milliseconds for how long it should wait for all the order responses to come back. You may have to adjust the timeout to get results that resemble what I describe. The expected outcome of a successful execution is that the contents of OrderImportFileItems are the same as after running the full SimpleOrderDatabase script. (With some minor differences, like the product ID not being tacked on.) The messages are also listed in this table in Part Three.

Tackling the Problem

Now that we have the demo script laid out, let's look at the actual problem. We receive orders from an external source we don't trust. We can explicitly check for the errors we think are the most likely to occur, but in a complex system it is unlikely that we are able to think of everything beforehand, and there can be run-time errors like constraint violations, conversion errors etc. that force us to roll back the transaction. Because we run with SET XACT_ABORT ON, the transaction will always be doomed. This means that the message goes back to the queue, and if this happens five times, Service Broker disables the queue. Thus, we need to find a way to prevent this.

First a non-starter. In some books and articles, you may see this pattern:

BEGIN TRANSACTION

RECEIVE TOP (1) ...

SAVE TRANSACTION after_receive

BEGIN TRY
   -- Processing goes here
   COMMIT TRANSACTION
END TRY
BEGIN CATCH
   ROLLBACK TRANSACTION after_receive
   -- Send error response
   COMMIT TRANSACTION
END CATCH

That is, after having received the message, you define a savepoint, and in case of an error you roll back to the savepoint, so that all that has happened in the transaction is that you have received the message. It would be nice and elegant, had savepoints only worked. But as we learnt in Part Two and Part Three, they don't. Or more precisely, we learnt that we cannot rely on being able to roll back to a savepoint, as the transaction may be doomed. In fact, when XACT_ABORT is ON, the transaction is always doomed. So this is a completely dead end.

ProcessOrders1

We need to find something other than savepoints. The message goes back to the queue, this is inevitable. But we know that the message has failed, and we can make use of this information. As long as there is only one queue reader, and we are not using broker priorities, we can take a simple-minded approach. We know that the bad message will come back directly, so we can simply save the error status in a variable and test for that variable when the message is taken off the queue. And this is precisely what ProcessOrders1 does.

CREATE PROCEDURE ProcessOrders1 AS
SET XACT_ABORT, NOCOUNT ON

DECLARE @DialogHandle  uniqueidentifier,
        @MessageType   sysname,
        @binarydata    varbinary(MAX),
        @xmldata       xml,
        @Response      xml,
        @OrderGuidStr  nvarchar(400),
        @OrderID       int,
        @ErrorMessage  nvarchar(2048),
        @ErrorNumber   int

WHILE 1 = 1
BEGIN TRY
   SELECT @DialogHandle = NULL, @MessageType = NULL, @binarydata = NULL,
          @xmldata = NULL, @OrderGuidStr = NULL, @OrderID = NULL,
          @ErrorMessage = NULL, @ErrorNumber = NULL

   DECLARE @WaitTime datetime
   SELECT @WaitTime = dbo.get_waittime()
   WAITFOR DELAY @WaitTime

   BEGIN TRANSACTION

   WAITFOR (
      RECEIVE TOP (1) @DialogHandle = conversation_handle,
                      @MessageType  = message_type_name,
                      @binarydata   = message_body
      FROM    OrderRequests
   ), TIMEOUT 1000

   IF @DialogHandle IS NULL
   BEGIN
      ROLLBACK TRANSACTION
      RETURN
   END

   IF coalesce(@MessageType, '') <> 'OrderRequest'
   BEGIN
      EXEC slog.sqleventlog_sp @@procid,
           'Unexpected message type "%1".', @p1 = @MessageType, @raiserror = 0

      END CONVERSATION @DialogHandle

      COMMIT TRANSACTION
      CONTINUE
   END

   IF @Response IS NULL
   BEGIN TRY
      SELECT @xmldata = convert(xml, @binarydata)

      SELECT @OrderGuidStr = @xmldata.value('Order[1]/@OrderID', 'nvarchar(400)')

      EXEC DoOneOrderFromActivation @xmldata, @OrderGuidStr, @OrderID OUTPUT

      SELECT @Response =
         (SELECT 'O'            AS [@Status],
                  @OrderGuidStr AS [@OrderID],
                  @OrderID      AS [@InternalOrderID]
          FOR XML PATH('OrderResponse'), TYPE)
   END TRY
   BEGIN CATCH
      IF @@trancount > 0 ROLLBACK TRANSACTION

      EXEC slog.catchhandler_sp @@procid, @reraise = 0,
                                @errno      = @ErrorNumber OUTPUT,
                                @errmsg_aug = @ErrorMessage OUTPUT

      IF @ErrorNumber <> 1205
      BEGIN
         SELECT @Response =
            (SELECT 'E'            AS [@Status],
                     @ErrorMessage AS [@Message],
                     @OrderGuidStr AS [@OrderID]
            FOR XML PATH('OrderResponse'), TYPE)
      END
   END CATCH

   IF @@trancount > 0
   BEGIN
      ; SEND ON CONVERSATION @DialogHandle MESSAGE TYPE OrderResponse (@Response)

      SELECT @Response = NULL

      END CONVERSATION @DialogHandle

      COMMIT TRANSACTION
   END
END TRY
BEGIN CATCH
   IF @@trancount > 0 ROLLBACK TRANSACTION
   EXEC slog.catchhandler_sp @@procid
   RETURN 1
END CATCH

The procedure has the standard WHILE loop which is typical for activation procedures. As long as there are messages in the queue, we keep on executing, but if no message has arrived after a grace period (here one second), we exit. The body of the WHILE loop is a TRY-CATCH block, of which the CATCH block has a standard error handler. This CATCH block is responsible for handling errors in the procedure outside the processing of the message. For instance, RECEIVE could raise an error because the queue is disabled, and in such a case we should leave on the spot.

The first step in the loop is to initialise all variables, save one: @Response. This variable may contain a canned error response from the previous iteration. Next follow three lines of code of which the last is WAITFOR. This is nothing you would normally have in an activation procedure; it appears here for testing purposes only. I will return to why later.

At this point the transaction starts and we attempt to receive a message. I have set up the message types in the contract with VALIDATION = WELL_FORMED_XML, but in line with the advice from Remus Rusanu, I nevertheless receive the message as binary. The only errors I ever want with RECEIVE are errors that are due to problems with the queue, not with the message. I have a timeout with RECEIVE, and if it expires without any message being received, I roll back the (empty) transaction and exit.

This is followed by a token check on the message type. I only put in this code in case Service Broker should start to send me system messages. I have opted to log the error without raising it and then commit and move on. In a real Service Broker application you should explicitly check for the system message types http://schemas.microsoft.com/SQL/ServiceBroker/Error and http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog and act accordingly. Exactly what "accordingly" entails is beyond the scope of this article, and I would encourage you to review the result of this search on Remus Rusanu's blog to learn more about it.

We now come to the part which is the real focus of this article, to wit the processing of the message. If @Response is NULL, we have a fresh message to process, and the IF statement is followed by a TRY-CATCH block. The TRY block has the message processing:

      SELECT @xmldata = convert(xml, @binarydata)
      SELECT @OrderGuidStr = @xmldata.value('Order[1]/@OrderID', 'nvarchar(400)')
      EXEC DoOneOrderFromActivation @xmldata, @OrderGuidStr, @OrderID OUTPUT
      SELECT @Response =
         (SELECT  'O'           AS [@Status],
                  @OrderGuidStr AS [@OrderID],
                  @OrderID      AS [@InternalOrderID]
          FOR XML PATH('OrderResponse'), TYPE)

As I have already discussed, most of the message processing is in the procedure DoOneOrderFromActivation. Here I only convert the binary string to XML and then extract the Order ID from the order, since I need it in the response message. If the order is successfully processed, I form a positive response message.

If the order processing fails, I end up in the CATCH handler where I perform three actions:

  1. I roll back the transaction. (Most likely DoOneOrderFromActivation has already rolled back the transaction, but let's make sure.)
  2. I log the error to SqlEventLog, but do not reraise the error. I retrieve the error number and the error message.
  3. I form an error response, unless the error was a deadlock, in which case we should try once more.

The last thing that happens in the WHILE loop is this:

IF @@trancount > 0
BEGIN
   ; SEND ON CONVERSATION @DialogHandle
   MESSAGE TYPE OrderResponse (@Response)

   SELECT @Response = NULL

   END CONVERSATION @DialogHandle

   COMMIT TRANSACTION
END

The key here is that if the transaction is gone, there was an error; the message is back in the queue and we have created a response message for the next iteration (unless it was a deadlock). On the other hand, if the transaction is still alive, the message is out of the queue. It was either successfully processed, or we have a canned error response that we are sending now. Once the message is sent, we clear @Response and commit. In this example, I close the dialog, since my test system, SendAndReceiveOrders, has one dialog per order. (In a real high-volume application you should probably not do this, but recycle conversations for better performance.)

If you run the full script in ServiceBroker-demo.sql, you will see that the outcome is more or less the same as the output when you run SimpleOrderDatabase-2008.sql in its entirety. Furthermore, you should see 26 error messages in the table sqleventlog. If you don't see this, but some orders retain the status N, this means that SendAndReceiveOrders did not wait long enough for the responses and you need to increase the timeout. (That is, the parameter 5000 passed to SendAndReceiveOrders.)

Multiple Queue Readers

This is a fairly simple-minded solution, and as I mentioned it is designed for the case when you have a single queue reader. So there is no reason to expect it to work with multiple queue readers, but just for fun, reload the tables in the simple order database as I described above. (Drop the tables and re-run SimpleOrderDatabase-2008.sql down to the second execution of AddOrderImportFile.) Then change the ALTER QUEUE statement to read:

ALTER QUEUE OrderRequests WITH 
      STATUS = ON,
      ACTIVATION (STATUS = ON,
                  PROCEDURE_NAME = ProcessOrders1,
                  MAX_QUEUE_READERS = 15,
                  EXECUTE AS OWNER)

Before I ask you to run this in full, this is the place where I'm going to explain the function get_waittime and the funny WAITFOR in ProcessOrders1. If you were to take out the WAITFOR, most likely the result would be the same as with MAX_QUEUE_READERS = 1, for the simple reason that Service Broker would not find it worthwhile to fire up another queue reader. The processing of those 32 orders is simply too little work for that to happen. Since I did not feel like composing more orders, I added the WAITFOR to encourage Service Broker to add more readers. It is unlikely that you will get 15 readers, but get_waittime computes the wait time from the number of queue readers, so that the higher MAX_QUEUE_READERS is, the longer the wait time.
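
The idea could be realised like this; this is only my sketch of the principle, and the function in ServiceBroker-demo.sql may compute the delay differently:

CREATE FUNCTION dbo.get_waittime() RETURNS datetime AS
BEGIN
   DECLARE @maxreaders int

   -- Read the configured number of queue readers from the catalog.
   SELECT @maxreaders = max_readers
   FROM   sys.service_queues
   WHERE  name = 'OrderRequests'

   -- Scale the delay with the number of permitted readers, here
   -- 200 ms per reader. (WAITFOR DELAY accepts a datetime value
   -- where only the time portion matters.)
   RETURN dateadd(ms, 200 * coalesce(@maxreaders, 1), '19000101')
END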

If you run ServiceBroker-demo.sql with MAX_QUEUE_READERS set to 15, what happens? Since there is concurrency, the result is not fully deterministic, and the exact result tends to differ from execution to execution. However, there is a recurring pattern. Looking only at the Message column in OrderImportFileItems, you may see this for the last three orders:

{547} Procedure ModifyExternalOrderDetail, Line 25
The UPDATE statement conflicted with the CHECK constraint ...
{547} Procedure ModifyExternalOrderDetail, Line 25
The UPDATE statement conflicted with the CHECK constraint ...
{547} Procedure ModifyExternalOrderDetail, Line 25
The UPDATE statement conflicted with the CHECK constraint ...

That is, the three orders all have the same error message, which is incorrect. As a matter of fact, the error message does not apply to any of these orders, and the third-to-last order is one of the few in the test data that should be processed successfully. It is not difficult to understand why this happens. The message goes back to the queue, and while the reader is stuck in the WAITFOR, another reader picks up the message which fails anew. Once the first reader is done with the WAITFOR, it picks up a completely different message from the queue and sends the canned response message on that dialog. Since the test sender uses the dialog handle to map back to the source table, the wrong row gets updated.

You could argue that these errors occur because of my funky test setup, but don't be fooled. The WAITFOR helps to emphasise the concurrency problems, but in a real-world high-volume application they would occur every now and then. Nor is the fact that the test originator relies on the dialog handle significant. The major flaw is that the activation procedure discards messages as erroneous without even looking at them. If the originator looked at the order GUID rather than the dialog handle, as a real-world sender would do, it would never see any responses at all for the discarded orders, but they would be left with status N in OrderImportFileItems.

Clearly, ProcessOrders1 does not work with multiple queue readers. Does this mean that it is a bad procedure? No... and yes. There are applications where having a single queue reader makes perfect sense, because the semantics are such that the messages must be processed in order. As a matter of fact, this applies to our simple order system, where an order may be modified or deleted. If the modifications are processed out of order, this can lead to undesired results. (Although the correct solution for such a situation is probably to have the same conversation or conversation group for all messages related to the same order. Keep in mind that when a message in a conversation group is retrieved, the entire conversation group remains locked until the transaction ends.) So from that perspective ProcessOrders1 is alright. Still, I don't like it. Someone too smart for his own good could reconfigure the queue, and things would go south at the speed of light.

ProcessOrders2

Thus, even if you only plan to have a single queue reader, you need something better. Enter stage ProcessOrders2. While also designed for a single queue reader, it works better (but not well) with multiple queue readers. ProcessOrders2 follows the same basic pattern as ProcessOrders1, but since it is a better example, I first include it in full, and later I highlight the differences.

CREATE PROCEDURE ProcessOrders2 AS
SET XACT_ABORT, NOCOUNT ON

DECLARE @DialogHandle  uniqueidentifier,
        @MsgSeqNo      bigint,
        @MessageType   sysname,
        @binarydata    varbinary(MAX),
        @xmldata       xml,
        @Response      xml,
        @OrderGuidStr  nvarchar(400),
        @OrderID       int,
        @ErrorMessage  nvarchar(2048),
        @ErrorNumber   int

DECLARE @BadMessages TABLE
        (DialogHandle  uniqueidentifier NOT NULL,
         MsgSeqNo      bigint           NOT NULL,
         Response      xml              NULL,
         PRIMARY KEY (DialogHandle, MsgSeqNo) WITH (IGNORE_DUP_KEY = ON)
)


WHILE 1 = 1
BEGIN TRY
   SELECT @DialogHandle = NULL, @MsgSeqNo = NULL, @MessageType = NULL,
          @binarydata = NULL, @xmldata = NULL, @Response = NULL,
          @OrderGuidStr = NULL, @OrderID = NULL, @ErrorMessage = NULL,
          @ErrorNumber = NULL

   DECLARE @WaitTime datetime
   SELECT @WaitTime = dbo.get_waittime()
   WAITFOR DELAY @WaitTime

   BEGIN TRANSACTION

   WAITFOR (
      RECEIVE TOP (1) @DialogHandle = conversation_handle,
                      @MsgSeqNo     = message_sequence_number,
                      @MessageType  = message_type_name,
                      @binarydata   = message_body
      FROM    OrderRequests
   ), TIMEOUT 1000

   IF @DialogHandle IS NULL
   BEGIN
      ROLLBACK TRANSACTION
      RETURN
   END

   IF coalesce(@MessageType, '') <> 'OrderRequest'
   BEGIN
      EXEC slog.sqleventlog_sp @@procid,
           'Unexpected message type "%1".', @p1 = @MessageType, @raiserror = 0

      END CONVERSATION @DialogHandle

      COMMIT TRANSACTION
      CONTINUE
   END

   SELECT @Response = Response
   FROM   @BadMessages
   WHERE  DialogHandle = @DialogHandle
     AND  MsgSeqNo     = @MsgSeqNo

   IF @Response IS NULL
   BEGIN TRY
      SELECT @xmldata = convert(xml, @binarydata)

      SELECT @OrderGuidStr = @xmldata.value('Order[1]/@OrderID', 'nvarchar(400)')

      EXEC DoOneOrderFromActivation @xmldata, @OrderGuidStr, @OrderID OUTPUT

      SELECT @Response =
         (SELECT  'O'           AS [@Status],
                  @OrderGuidStr AS [@OrderID],
                  @OrderID      AS [@InternalOrderID]
          FOR XML PATH('OrderResponse'), TYPE)
   END TRY
   BEGIN CATCH
      IF @@trancount > 0 ROLLBACK TRANSACTION

      EXEC slog.catchhandler_sp @@procid, @reraise = 0,
                                @errno      = @ErrorNumber OUTPUT,
                                @errmsg_aug = @ErrorMessage OUTPUT

      IF @ErrorNumber <> 1205
      BEGIN
         INSERT @BadMessages(DialogHandle, MsgSeqNo, Response)
            SELECT @DialogHandle, @MsgSeqNo,
                     (SELECT 'E'            AS [@Status],
                              @ErrorMessage AS [@Message],
                              @OrderGuidStr AS [@OrderID]
                      FOR XML PATH('OrderResponse'), TYPE)
      END
   END CATCH

   IF @@trancount > 0
   BEGIN
      ; SEND ON CONVERSATION @DialogHandle
      MESSAGE TYPE OrderResponse (@Response)

      END CONVERSATION @DialogHandle

      COMMIT TRANSACTION
   END
END TRY
BEGIN CATCH
   IF @@trancount > 0 ROLLBACK TRANSACTION
   EXEC slog.catchhandler_sp @@procid
   RETURN 1
END CATCH

The key difference is the table variable @BadMessages. When the processing of a message fails, we store the dialog handle and the message sequence number together with the error response for the message. That is, we no longer assume that the failed message comes back immediately, but we permit another message to arrive in-between. (Which could happen even with a single queue reader, if you use broker priorities.) There are three things to note with this table:

  1. DialogHandle is not the primary key alone. In this demo, we never receive more than one message on the same dialog, but in the general case there can be multiple messages in a dialog, and for this reason I have added the message sequence number to the primary key.
  2. The Response column is nullable. There is no reason why there would ever be a NULL value in this column, and normally you should not design tables in this way; you should only permit NULL if NULL makes sense and you know how to handle it. However, this is a table for error handling, and we want our error handling to be as resilient as possible. That is, we don't want an INSERT into this table to fail because of a NOT NULL violation.
  3. The primary key has IGNORE_DUP_KEY = ON, an option I presented in Part Two. Here I use the option merely as a safety measure. There is no reason why there would ever be a duplicate inserted into this table, but if it happens, we don't want it to cause a failure.

In the code itself, the first thing to note is the statement that initialises the variables at the beginning of the loop:

SELECT @DialogHandle = NULL, @MsgSeqNo = NULL, @MessageType = NULL,
       @binarydata = NULL, @xmldata = NULL, @Response = NULL,
       @OrderGuidStr = NULL, @OrderID = NULL, @ErrorMessage = NULL,
       @ErrorNumber = NULL

This time there is no special trick with @Response; it is initialised to NULL like all the other variables.

When receiving the message, we also need the message sequence number:

RECEIVE TOP (1) @DialogHandle = conversation_handle,
                @MsgSeqNo     = message_sequence_number,
                @MessageType  = message_type_name,
                @binarydata   = message_body
FROM    OrderRequests

To see if we have seen this message before, there is this code:

SELECT @Response = Response 
FROM   @BadMessages 
WHERE  DialogHandle = @DialogHandle
  AND  MsgSeqNo     = @MsgSeqNo

IF @Response IS NULL
BEGIN TRY 

The TRY block is the same as in ProcessOrders1, but the CATCH block is of course different, as we now insert the response into @BadMessages. The final part where we send the response message is the same, except that the statement to clear @Response is absent.

All in all, ProcessOrders2 is a cleaner solution. (I don't really like code where variables linger from one iteration to another in a loop, as such code is more prone to bugs when making changes.)

Reload the simple order database and change the ALTER QUEUE statement to read:

ALTER QUEUE OrderRequests WITH 
      STATUS = ON,
      ACTIVATION (STATUS = ON,
                  PROCEDURE_NAME = ProcessOrders2,
                  MAX_QUEUE_READERS = 1,
                  EXECUTE AS OWNER)

The outcome should be the same as with ProcessOrders1 and a single queue reader. What happens if you run ProcessOrders2 with MAX_QUEUE_READERS = 15? I recommend that you try this several times, because when I have tested it, the outcome has varied. Sometimes all orders in OrderImportFileItems are updated with the correct status and error message. In this case there are more than 26 error messages in sqleventlog, and several messages appear multiple times, as they have been logged by multiple readers. At other executions, several orders still have status N, that is, SendAndReceiveOrders has not received any response message for these orders. In this case, the last message in sqleventlog, occurring once per queue reader, is The service queue "OrderRequests" is currently disabled. That is, there were more than five consecutive rollbacks on the queue. (If you see messages with Status = N, but the last message in sqleventlog is something else, the issue is that the timeout passed to SendAndReceiveOrders is not large enough.)

The fact that the queue gets disabled is not surprising at all, given that the test data produces so many errors. Rather, I am surprised when the queue survives the ordeal from start to end. But if we consider a more normal message flow, where errors only happen once in every thousand messages or so, it seems that the solution in ProcessOrders2 could work well as long as there are not more than four queue readers. (Since by the fifth attempt, all queue readers will have the canned response in their instance of @BadMessages.) Only if two bad messages come directly after each other is there a risk of the queue being disabled. Ah, but what says that bad messages cannot come in bursts (because of some bug on the sending side)? No, ProcessOrders2 is still a solution for a single queue reader, but it is better than ProcessOrders1, because it does not produce bogus responses when there are multiple queue readers.

ProcessOrders3

What would a solution that works with multiple queue readers look like? Answer: something very similar to ProcessOrders2. Make @BadMessages a permanent table and you have ProcessOrders3. With a permanent table, one queue reader can benefit from the work of other readers. Message A fails on reader 1, and the message goes back to the queue. Reader 2 picks up the message, but finds the canned error response in BadMessages, so rather than trying to process the message, it sends the error response and commits, taking message A off the queue for good.
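
Concretely, the table variable from ProcessOrders2 turns into something like the permanent table below. (This is a sketch derived from the table variable above; the actual definition is in ServiceBroker-demo.sql.)

CREATE TABLE BadMessages
      (DialogHandle  uniqueidentifier NOT NULL,
       MsgSeqNo      bigint           NOT NULL,
       Response      xml              NULL,
       PRIMARY KEY (DialogHandle, MsgSeqNo) WITH (IGNORE_DUP_KEY = ON)
)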

ProcessOrders3 retains all messages in BadMessages, which permits us to easily see what message errors have occurred. But it would also be reasonable to delete a message from the table once the canned response has been sent, as sketched below.
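
Such a variant could be as simple as adding this DELETE directly after the SEND of the response. (Hypothetical code, not in the demo.)

-- The canned response has gone out; the row has served its purpose.
DELETE BadMessages
WHERE  DialogHandle = @DialogHandle
  AND  MsgSeqNo     = @MsgSeqNo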

Is this the final solution? Probably, but it is not perfect. If you run ProcessOrders3 with 15 queue readers a few times, you may find that there are more than 26 messages in sqleventlog. Sometimes the extra message is a deadlock, which is alright since a deadlock causes a retry. But you may also see multiple occurrences of the same error message, which implies that the order was processed by multiple queue readers. How could that happen? Elementary, my dear Watson. The transaction is rolled back at some point, and only later do we write the response message to BadMessages. Thus, there is a window where another queue reader can pick up the message without finding it in BadMessages, because the first queue reader has not yet written that row.

Speaking of concurrency, note that in ProcessOrders3 the option IGNORE_DUP_KEY for the primary key serves a real purpose. When the same message is processed twice, both readers attempt to write the error response to BadMessages. This could be addressed with IF EXISTS, but that requires the serializable isolation level to be safe, and my experience of SERIALIZABLE is that it is very prone to deadlocks. (As this is part of our error-handling code, we don't want any errors, including deadlocks.) Thus, I decided to use IGNORE_DUP_KEY, as this is both safer and easier to implement, even if I normally frown upon its usage in permanent tables.

As we have seen, even with ProcessOrders3 a bad message can be processed more than once. Is this cause for concern? For the queue to be disabled, no less than five queue readers would have to attempt to process this message; how likely is that? It may sound extremely unlikely, but a possible scenario is a table lock on the sqleventlog table, so that all processes get stuck when trying to log the error message. This could occur if someone casually runs a SELECT with a WHERE clause on a non-indexed column (or no WHERE clause at all) on that table. Since SqlEventLog is intended to be used all over the database, the table could very well acquire some size.

Note: I thought for a while that I had a solution: call a loopback procedure to insert data into BadMessages before rolling back the transaction. But later I realised I had overlooked an important fact: the rollback could occur in a common application procedure, for instance ModifyExternalOrder in our example, which does not have (and should not have) any knowledge that it may be called from Service Broker.

Another concern is that bad messages may come in bursts. If all messages are bad for a while, and there are many queue readers, it is very likely that five readers each get a bad message before any other reader catches the rebound and takes a bad message off the queue with a COMMIT.

Starting with SQL 2012, there is a way out: you can disable poison-message handling:

ALTER QUEUE que WITH POISON_MESSAGE_HANDLING (STATUS = OFF)

With this setting, the queue will never be disabled because of rollbacks. But this is not an ideal solution. Even if you have implemented bullet-proof handling of errors in the messages, there may be other catastrophic errors where there is little reason to keep the queue alive. (For instance, say that someone has inadvertently dropped a stored procedure.) One could wish that Service Broker offered a knob to control that limit of five; something like 1.2 * MAX_QUEUE_READERS seems like a good pick.

Closing Remarks

The Case of DoOneOrderFromActivation

When you look at the interplay of DoOneOrderFromActivation and ProcessOrders1-3, it almost looks like an idea: let the activation procedure be a shell procedure that handles the queues and services, and put the actual processing in an inner procedure. But I am not sure that this is a pattern to follow. DoOneOrderFromActivation exists in this demo simply because I did not want to repeat the logic in three different activation procedures. Normally, you only have one single activation procedure; I have three because I'm presenting a demo.

Nevertheless, let's pursue the idea for a few seconds, to see where the problems lie, as they are related to error handling. Ideally, the activation procedure would have no knowledge about the data it receives. From that perspective, I was not so fond of putting the line

SELECT @OrderGuidStr = @xmldata.value('Order[1]/@OrderID', 'nvarchar(400)')

in the main procedure, as that logic belongs to the inner procedure. It would be nicer if the signature of DoOneOrderFromActivation would be:

CREATE PROCEDURE DoOneOrderFromActivation @data     xml,
                                          @Response xml OUTPUT AS

That is, no matter whether the message is processed successfully or not, we would get back a response message of some sort, and the shell procedure would only be responsible for sending it. However, this does not square with the normal pattern where the inner procedure reraises errors in the CATCH handler, because the OUTPUT parameter @Response would not be set in this case. (See the discussion on TRY-CATCH in Part Two for an example.)

OK, but why would DoOneOrderFromActivation reraise the error? It sets @Response, and the return value of the stored procedure would be non-zero in case of an error, so the shell procedure would know that things have gone sour. There is one fine detail: if there is an error, DoOneOrderFromActivation has to roll back the transaction (because it is doomed). And what happens then on exit? Error 266, trancount mismatch, is raised. That particular error does not prevent @Response from being set in the activation procedure. But we would still need TRY-CATCH around the call to DoOneOrderFromActivation so that error 266 does not reach the outer CATCH handler, which exits the activation procedure. Yes, you can do this, as in the sketch below, but for my part I found it too complex to be worth it and retained the solution I had.
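
For completeness, here is roughly what that wrapper would look like. (A sketch using the alternative signature above; I did not implement this in the demo.)

DECLARE @ret int
BEGIN TRY
   EXEC @ret = DoOneOrderFromActivation @xmldata, @Response OUTPUT
END TRY
BEGIN CATCH
   -- Error 266 (trancount mismatch) is expected when the inner
   -- procedure has rolled back; swallow it, but let catchhandler_sp
   -- reraise anything else so that it reaches the outer CATCH handler.
   IF error_number() <> 266
      EXEC slog.catchhandler_sp @@procid
END CATCH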

A Final Word of Caution

I need to make the confession that my main experience of the ideas I have put forth in this appendix is exactly the demo we have studied. That is, I cannot boast of actually having implemented error handling with Service Broker in a high-volume environment. Thus, if you are considering implementing something like ProcessOrders3 in a mission-critical Service Broker application, I urge you to conduct further testing to verify that what I have presented actually holds water.

Feedback and Revision History

For a list of acknowledgements for the entire series, please see the end of Part Three. If you have questions or opinions related to this article, please contact me on [email protected]. This very much applies if you spot grammar or spelling errors as well! If you have questions related more to your own work than to the examples I discuss in this text, I strongly recommend that you ask in a public forum for a quicker response.

Below follows a revision history for this appendix.

2015-05-03
First version.

Back to my home page.