Network Working Group                                          R. Sparks
Internet-Draft                                               dynamicsoft
Expires: August 8, 2003                                 February 7, 2003


    Considerations for the Session Initiation Protocol's non-INVITE
                              Transaction
                     draft-sparks-sip-noninvite-00

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 8, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   This draft explores several issues with the Session Initiation
   Protocol's non-INVITE transaction. It focuses on the use of
   provisional responses and on problems related to transaction
   timeouts.

Sparks                   Expires August 8, 2003                 [Page 1]
Internet-Draft       SIP non-INVITE Considerations         February 2003

Table of Contents

   1.    Use of provisional responses
   1.1   NITs must complete as soon as possible
   1.2   Provisional responses can delay recovery from lost final
         responses
   1.3   Not responding can temporarily blacklist an element
   2.    408 for non-INVITE is not useful
   3.    Non-INVITE timeouts doom forking proxies
   4.    Mismatched timer values
   5.    Proposals
   5.1   Proposal 1. Disallow non-100 provisionals to non-INVITE
   5.2   Proposal 2. Disallow 100 Trying to non-INVITE before Timer E
         reaches T2
   5.3   Proposal 3. Allow 100 Trying after Timer E reaches T2
   5.4   Proposal 4. Disallow 408 to non-INVITE requests
   5.5   Proposal 5. Absorb late non-INVITE responses
   5.5.1 Proposal interdependencies
   6.    A more radical alternate proposal
   7.    Acknowledgments
         References
         Author's Address
         Full Copyright Statement

1. Use of provisional responses

   The SIP [1] specification states that a UAS SHOULD NOT issue a
   provisional response within a non-INVITE transaction (herein, a
   NIT). This is motivated by two factors:

   o  Because of race conditions, a NIT must complete as soon as
      possible (Section 1.1).

   o  Provisional responses in a NIT can damage recovery from transport
      error (Section 1.2).

   It has been proposed to disallow provisional responses within a NIT
   altogether based on these motivations and a desire to reduce
   unnecessary network traffic. However,

   o  Late final responses are the same as no response at all
      (Section 1.1).

   o  Sending no response at all is likely to cause an element to
      temporarily stop receiving new requests (Section 1.3).

   So, it may be important in some situations to issue a provisional
   response within a NIT to prevent an element from being incorrectly
   avoided.

   This document will explore each of the above assertions, noting the
   problems that follow from them, and propose changes to NIT
   processing that will address these problems.

1.1 NITs must complete as soon as possible

   The non-INVITE transaction is designed to have a fixed and finite
   duration (dependent on T1). A consequence of this design is that
   participants must strive to complete the transaction as quickly as
   possible. Consider the race condition shown in Figure 1.

---------------------------------------------------------------------

                UAC           UAS
                 |   request   |
             --- |---.         |
              ^  |    `---.    |
              |  |         `-->| ---
              |  |             |  ^
              |  |             |  |
           64*T1 |             |  |
              |  |             |  |
              |  |             | 64*T1
              |  |             |  |
              |  |             |  |
              v  |             |  |
timeout <=== --- |   200 OK    |  |
                 |         .---|  v
                 |    .---'    | ---
                 |<--'         |

                    Figure 1: NI Race Condition

---------------------------------------------------------------------

   The UAS in this figure believes it has responded to the request in
   time, and that the request succeeded. The UAC, on the other hand,
   believes the request has timed out, and hence failed. No longer
   having a matching client transaction, the UAC core will ignore what
   it believes to be a spurious response. As far as the UAC is
   concerned, it received no response at all to its request. The
   ultimate result is that the UAS and UAC have conflicting views of
   the outcome of the transaction.

   Therefore, a UAS cannot wait until the last possible moment to send
   a final response within a NIT. It must, instead, send its response
   so that it will arrive at the UAC before that UAC times out.
   Unfortunately, the UAS has no way to accurately measure the
   propagation time of the request or predict the propagation time of
   the response. The uncertainty it faces is compounded by each proxy
   that participates in the transaction. Thus, the UAS's only choice is
   to send its final response as soon as it possibly can and hope for
   the best.

   This result constrains the set of problems that can be solved with a
   single NIT.
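
   The race in Figure 1 can be illustrated with a small calculation.
   The following Python sketch is not part of the SIP specification;
   the function name and the latency values are assumptions chosen
   only for illustration. It compares the time at which the final
   response reaches the UAC with the UAC's 64*T1 timeout.

```python
def uac_sees_response(t1, request_latency, processing_delay, response_latency):
    """Return True if the final response reaches the UAC before it times out.

    All arguments are in seconds. The UAC's clock starts when it sends the
    request; the UAS's clock starts only when the request arrives.
    """
    timeout_at = 64 * t1  # the UAC abandons the transaction at 64*T1
    arrival = request_latency + processing_delay + response_latency
    return arrival < timeout_at


T1 = 0.5  # default T1, so 64*T1 is 32 seconds

# The UAS answers 31.9 s after receiving the request. By its own clock
# that is inside 64*T1, but the propagation time in each direction makes
# the response late at the UAC.
print(uac_sees_response(T1, 0.1, 31.9, 0.1))

# A prompt answer comfortably wins the race.
print(uac_sees_response(T1, 0.1, 1.0, 0.1))
```

   In the first call the UAS responds within 64*T1 of receiving the
   request, yet the UAC has already timed out when the response
   arrives. In the second call the response arrives well before the
   timeout.
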
   Any delay introduced during processing of a request increases the
   probability of losing the race. If the timing characteristics of
   that processing are not predictable and controllable, a single NIT
   is an inappropriate model for handling the request. One viable
   alternative is to accept the request with a 202 and send the
   ultimate results in a new request in the reciprocal direction.

   In specialized networks, a UAS might have some reliable knowledge of
   inter-hop latency and could use that knowledge to determine if it
   has time to delay its final response in order to perform some
   processing such as a database lookup while mitigating its risk of
   losing the race in Figure 1. Establishing this knowledge across
   arbitrary networks (perhaps using resource reservation techniques
   and deterministic transports) is not currently feasible.

1.2 Provisional responses can delay recovery from lost final responses

   The non-INVITE client transaction state machine provides reliability
   for NITs over unreliable transports (UDP) through retransmission of
   the request message. Timer E is set to T1 when a request is
   initially transmitted. As long as the machine remains in the Trying
   state, each time Timer E fires, it will be reset to twice its
   previous value (capping at T2) and the request is retransmitted.

   If the non-INVITE client transaction state machine sees a
   provisional response, it transitions to the Proceeding state, where
   retransmission continues, but the algorithm for resetting Timer E is
   simply to use T2 instead of doubling at each firing. (Note that
   Timer E is not altered during the transition to Proceeding.)

   Making the transition to the Proceeding state before Timer E is
   reset to T2 can cause recovery from a lost final response to take
   extra time. Figure 2 shows recovery from a lost final response with
   and without a provisional message during this window.
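
   The Timer E behavior described above can be sketched in a few lines
   of Python. This is only an illustrative model under stated
   assumptions, not an implementation of the transaction state machine:
   the function name, the provisional_at parameter, and the horizon
   parameter are hypothetical, and the model ignores final responses
   and treats a provisional that arrives exactly when Timer E fires as
   already received.

```python
def timer_e_firings(t1, t2, provisional_at=None, horizon=None):
    """Times (seconds after the initial transmission) at which Timer E
    fires and the request is retransmitted.

    provisional_at: time at which a provisional response is received,
                    or None if no provisional is ever received.
    horizon:        stop listing firings after this time; defaults to
                    64*T1, when the transaction times out.
    """
    if horizon is None:
        horizon = 64 * t1
    firings = []
    interval = t1  # Timer E is set to T1 at the initial transmission
    now = interval
    while now <= horizon:
        firings.append(now)
        in_proceeding = provisional_at is not None and provisional_at <= now
        if in_proceeding:
            interval = t2  # Proceeding: always reset to T2
        else:
            interval = min(2 * interval, t2)  # Trying: double, capped at T2
        now += interval
    return firings


T1, T2 = 0.5, 4.0  # default timer values

print(timer_e_firings(T1, T2, horizon=8.0))                      # no provisional
print(timer_e_firings(T1, T2, provisional_at=0.2, horizon=8.0))  # early provisional
print(timer_e_firings(T1, T2, provisional_at=5.0, horizon=8.0))  # late provisional
```

   With these defaults the model retransmits at 0.5, 1.5, 3.5, and 7.5
   seconds when no provisional is received, but only at 0.5 and 4.5
   seconds when a provisional arrives before the first firing. A
   provisional that arrives after the interval has already reached T2
   leaves the schedule unchanged.
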
   Recovery occurs within 2*T1 in the case without the provisional.
   With the provisional, recovery is delayed until T2, which by default
   is 8*T1. In practical terms, a provisional response to a NIT in
   currently deployed networks can delay transaction completion by up
   to 3.5 seconds.

---------------------------------------------------------------------

              UAC       UAS                   UAC        UAS
               |         |                     |          |
          ---  |----.    |                ---  |----.     |
           ^   |     `-->|                 ^   |     `--->|
       E = T1  |         |             E = T1  |    .-----|(provisional)
           v   |         |                 v   |<--'      |
          ---  |----.    |                ---  |----.     |
           ^   |     `-->|                 ^   |     `--->|
           |   |   X<----|(lost final)     |   |   X<-----|(lost final)
           |   |         |                 |   |          |
     E = 2*T1  |         |                 |   |          |
           |   |         |                 |   |          |
           |   |         |                 |   |          |
           v   |         |                 |   |          |
          ---  |----.    |                 |   |          |
               |     `-->|                 |   |          |
               |   .-----|(final)          |   |          |
               |<-'      |                 |   |          |
               |         |                 |   |          |
              \/\       /\/               /\/ /\/        /\/
                                       E = T2
              \/\       /\/               /\/ /\/        /\/
               |         |                 |   |          |
               |         |                 v   |          |
               |         |                ---  |----.     |
               |         |                     |     `--->|
               |         |                     |    .-----|(final)
               |         |                     |<--'      |
               |         |                     |          |

              Figure 2: Provisionals can harm recovery

---------------------------------------------------------------------

   No additional delay is introduced if the first provisional response
   is received after Timer E has reached its maximum reset interval of
   T2.

1.3 Not responding can temporarily blacklist an element

   A SIP element's use of SRV is specified in RFC 3263 [2]. That
   specification discusses how SIP assures high availability by having
   upstream elements detect failure of downstream elements. It proceeds
   to define several types of failure detection and instructions for
   failover. Two of the behaviors it describes are important to this
   document:

   o  Within a transaction, transport failure is detected either
      through an explicit report from the transport layer or through
      timeout. In either case, the request is retried at the next
      element from the sorted results of the SRV query.

   o  Between transactions, locations reporting temporary failure
      (through 503/Retry-After for example) are not used until their
      requested black-out period expires.

   The specification notes the benefit of caching locations that are
   successfully contacted, but does not discuss how such a cache is
   maintained. It is unclear whether an element should stop using
   (temporarily blacklist) a location returned in the SRV query that
   results in a transport error. If it does, when should such a
   location be removed from the blacklist?

   Without such a blacklist (or equivalent mechanism), the intended
   availability mechanism fails miserably. Consider traffic between two
   domains. Proxy pA in domain A needs to forward a sequence of
   non-INVITE requests to domain B. Through DNS SRV, pA discovers pB1
   and pB2, and the ordering rules of [2] and [3] indicate it should
   use pB1 first. The first request to pB1 times out. Since pA is a
   proxy and a NIT has a fixed duration, pA has no opportunity to retry
   the request at pB2. If pA does not remember pB1's failure, the
   second request (and every subsequent non-INVITE request until pB1
   recovers) is doomed to the same failure. Caching would allow the
   subsequent requests to be tried at pB2.

   Since miserable failure is not acceptable in deployed networks, we
   should anticipate that elements will, in fact, cache timeout
   failures between transactions. Then the race in Figure 1 becomes
   important. If an element fails to respond "soon enough", it has
   effectively not responded at all, and will be blacklisted at its
   peer for some period of time.

   (Note that even with caching, the first request timeout results in a
   timeout failure all the way back to the original submitter. The
   failover mechanisms in [2] work well to increase the resiliency of a
   given INVITE transaction, but do nothing for a given non-INVITE
   transaction.)

2. 408 for non-INVITE is not useful

   Consider the race condition in Figure 1 when the final response is
   408 instead of 200.
   Unless the UAC has special knowledge (as discussed in Section 1.1),
   it is forced into losing the race. Indeed, most existing endpoints
   will emit a 408 for a non-INVITE request 64*T1 after receiving the
   request if they haven't emitted an earlier final response. Such a
   408 is guaranteed to arrive at the next upstream element too late to
   be useful.

   In fact, in the presence of proxies, these messages are even
   harmful. When the 408 arrives, each proxy will have already
   terminated its associated client transaction due to timeout. So,
   each proxy must forward the 408 upstream statelessly. The forwarded
   408, in turn, is guaranteed to arrive too late. As Figure 3 shows,
   this can ultimately result in bombarding the original requester with
   spurious 408s. (Note that the proxy's client transaction state
   machine never enters the Completed state, so Timer K does not enter
   into play.)

---------------------------------------------------------------------

               UAC        P1         P2         P3         UAS
                |          |          |          |          |
          ---  ===---.     |          |          |          |
           ^    |     `-->===---.     |          |          |
           |    |          |     `-->===---.     |          |
           |    |          |          |     `-->===---.     |
         64*T1  |          |          |          |     `-->===
           |    |          |          |          |          |
           |    |          |          |          |          |
           v    |          |          |          |          |
(timeout) ---  ===         |          |          |          |
                |    .-408===         |          |          |
                |<--'      |    .-408===         |          |
                |    .-408-|<--'      |    .-408===         |
                |<--'      |    .-408-|<--'      |    .-408===
                |    .-408-|<--'      |    .-408-|<--'      |
                |<--'      |    .-408-|<--'      |          |
                |    .-408-|<--'      |          |          |
                |<--'      |          |          |          |
                |          |          |          |          |

                Figure 3: late 408s to non-INVITEs

---------------------------------------------------------------------

   This response bombardment is not limited to the 408 response, though
   it only exists when participating client transaction state machines
   are timing out. Figure 4 generalizes Figure 1 to include multiple
   hops. Note that even though the UAS responds "in time" to P3, the
   response is too late for P2, P1 and the UAC.
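
   The multi-hop case can be checked hop by hop. The Python sketch
   below is illustrative only; the function name, the symmetric per-hop
   latencies, and the sample values are assumptions, not measurements.
   For each element that sent the request (the UAC, then each proxy),
   it reports whether the final response arrives at or after that
   element's own 64*T1 timeout.

```python
def response_is_late(t1, hop_latencies, uas_delay):
    """Return one boolean per sending element, ordered from the UAC
    toward the UAS. True means the final response arrives at or after
    that element's 64*T1 timeout.

    hop_latencies: one-way latency of each hop in seconds, assumed to be
                   the same in both directions.
    uas_delay:     seconds the UAS waits before sending its final response.
    """
    late = []
    response_sent = sum(hop_latencies) + uas_delay
    for i in range(len(hop_latencies)):
        request_sent = sum(hop_latencies[:i])   # when element i sent the request
        timeout_at = request_sent + 64 * t1     # when element i gives up
        arrival = response_sent + sum(hop_latencies[i:])
        late.append(arrival >= timeout_at)
    return late


# UAC -> P1 -> P2 -> P3 -> UAS with 0.25 s per hop; the UAS answers
# 31.25 s after receiving the request (inside its own 64*T1 = 32 s).
print(response_is_late(0.5, [0.25, 0.25, 0.25, 0.25], 31.25))
# Results are ordered UAC, P1, P2, P3.
```

   With these sample values only P3 receives the response before its
   timeout; the response is late at P2, P1 and the UAC.
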

---------------------------------------------------------------------

               UAC        P1         P2         P3         UAS
                |          |          |          |          |
          ---  ===---.     |          |          |          |
           ^    |     `-->===---.     |          |          |
           |    |          |     `-->===---.     |          |
           |    |          |          |     `-->===---.     |
         64*T1  |          |          |          |     `-->===
           |    |          |          |          |          |
           |    |          |          |          |          |
           v    |          |          |          |          |
(timeout) ---  ===         |          |          |          |
                |    .-408===         |          |    .-200-|
                |<--'      |    .-408===   .-200-|<--'      |
                |    .-408-|<--'.-200-|<--'     ===         |
                |<--'.-200-|<--'      |          |         ===
                |<--'      |          |          |          |
                |          |          |          |          |

            Figure 4: Additional timeout related error

---------------------------------------------------------------------

3. Non-INVITE timeouts doom forking proxies

   A single branch with a delayed or missing final response will
   dominate the processing at a proxy that receives no 2xx responses to
   a forked non-INVITE request, since this proxy is required to allow
   all of its client transactions to terminate before choosing a "best
   response". This forces the proxy's server transaction to lose the
   race in Figure 1. Any response it ultimately forwards (a 401 for
   example) will arrive at the upstream elements too late to be used.
   Thus, if no element among the branches would return a 2xx response,
   failure of a single element (or its transport) dooms the proxy to
   failure.

   This document currently only notes that this problem exists and
   contains no proposals to address it.

4. Mismatched timer values

   There are many failure scenarios due to misconfiguration or
   misbehavior that the SIP specification does not discuss. One is
   placing two elements with different configured values for T1 and T2
   on the same network. Review of Figure 1 illustrates that the race
   failure is only made more likely in this misconfigured state. (It
   may appear that shortening T1 at the element behaving as a UAS
   improves this particular situation, but remember that these elements
   may trade roles on the next request.)
   Since the protocol provides no mechanism for discovering/negotiating
   a peer's timer values, exceptional care must be taken when deploying
   systems with non-defaults to ensure they will _never_ directly
   communicate with elements with default values.

5. Proposals

5.1 Proposal 1. Disallow non-100 provisionals to non-INVITE

   Non-INVITE transactions must complete rapidly (Section 1.1). Any
   information beyond "I'm here" which can be provided by a 100 Trying
   can be just as usefully delayed to the final response. Sending
   non-100 provisionals wastes bandwidth and may harm recovery from
   transport error (Section 1.2).

   Counterpoint: Although reliable provisionals are not defined for
   non-INVITE in [4], it explicitly notes that extensions defining
   dialog creating requests might use them. Such reliable provisionals
   could provide a HERFP-solution for a non-INVITE request (allowing
   the originating UAC to learn the identity and path to each
   responding UAS). Note, however, that SUBSCRIBE did not choose this
   solution and provides proof that another option exists.

5.2 Proposal 2. Disallow 100 Trying to non-INVITE before Timer E
    reaches T2

   As shown in Section 1.2, sending a provisional response inside a NIT
   before Timer E reaches T2 damages recovery from failure of an
   unreliable transport.

5.3 Proposal 3. Allow 100 Trying after Timer E reaches T2

   Without a provisional, a late final response is the same as no
   response at all and will likely result in blacklisting the late
   responding element (Section 1.3). Sending a 100 Trying after Timer E
   reaches T2 prevents this blacklisting without damaging recovery from
   unreliable transport failure. Note that a non-responding element
   will be blacklisted regardless of the transport used.

5.4 Proposal 4. Disallow 408 to non-INVITE requests

   A 408 to non-INVITE will always arrive too late to be useful
   (Section 2). The client already has full knowledge of the timeout.
   The only information this message would convey is whether or not the
   server believed the transaction timed out. However, with the current
   design of the NIT, a client can't do anything with this knowledge.
   Thus, the 408 simply wastes network resources and contributes to the
   response bombardment illustrated in Figure 3.

5.5 Proposal 5. Absorb late non-INVITE responses

   Modify the non-INVITE client state machine to continue to live after
   Timer F fires to absorb late responses. This is similar to what is
   already provided by Timer K for absorbing retransmitted responses,
   but the absorption behavior must exist even for reliable transports.
   (Perhaps it would be sufficient to move the Timer F transition to
   the Completed state and always set Timer K regardless of transport.)

   The advantage of this approach is suppressing late final responses,
   such as the 200 in Figure 4, at the element where they first become
   useless.

5.5.1 Proposal interdependencies

   Proposals 1, 2 and 3 should be taken as a unit, either all accepted
   or all rejected.

   Proposals 4 and 5 may at first seem to be alternatives for solving
   one problem, but they in fact address discrete issues and should be
   considered separately. A proxy that implemented Proposal 5 but not 4
   could still emit a useless 408.

6. A more radical alternate proposal

   This section sketches an approach that makes more fundamental
   changes to the protocol. It has not been carefully analyzed, but no
   major flaws have been uncovered in initial investigations.
   Obviously, much more work will be required if this path is pursued.

   The root causes of the problems this document attempts to address
   are the fixed-length NIT (which causes the race condition of
   Figure 1) and the extra mechanics for providing reliability over
   unreliable transports.

   If we deprecate the use of UDP, the problems addressed by Proposals
   1 and 2 simply go away.
   If we then change the definition of the non-INVITE transaction to
   allow it to pend indefinitely (remove Timer F), the race condition
   in Figure 1 goes away as well. Clients would send CANCEL for pending
   non-INVITEs to stimulate a final response when they are through
   waiting, similar to INVITE. This alleviates the problems in
   Section 2 and Section 3. The 408 response would become meaningful
   once again, and proxies can wait until all branches complete,
   forcing a branch to complete with CANCEL if necessary.

   ACK is not needed for this pending non-INVITE because we have
   removed unreliable transports. (ACK was originally needed for INVITE
   because reliability over UDP became the server's responsibility
   after its first response and the server needed to know when to stop
   retransmitting. Other uses for ACK have evolved (such as carrying
   answer SDP), so it would still need to exist for INVITE even if UDP
   is abandoned.)

   This change is backwards-safe, if not completely backwards
   compatible.

   o  Existing client, proposed server: The client's experience is
      unchanged. It will still abandon the transaction after Timer F
      fires. The failure scenarios are exactly those we currently have.
      The server will need to protect itself against never receiving a
      CANCEL.

   o  Proposed client, existing server: The behavior here is an
      improvement over the existing client-server behavior. The 408
      emitted by an existing server would become meaningful to the
      proposed client. New methods that take advantage of the
      indefinite-pending property will be rejected by the existing
      server with a 501. Existing servers might not be expecting CANCEL
      to non-INVITEs, but are not compliant with the existing
      specification if such a CANCEL induces incorrect behavior.

   We would need to add a constraint, similar to that already on the
   INVITE transaction, binding clients that receive no response within
   a short time to abandon the transaction instead of pending
   indefinitely to account for server failure.

   It might be possible to adapt this proposal to work with UDP as
   well. In that case, Proposals 1 and 2 need to be applied in addition
   to allowing non-INVITE transactions to pend. The downside of this
   approach is that responsibility for reliability will always remain
   at the client so retransmissions of the request cannot be squelched.
   This might call for ensuring that future non-INVITE methods continue
   to be designed to complete "quickly" even though the transaction can
   pend indefinitely.

7. Acknowledgments

   This document attempts to capture many conversations about
   non-INVITE issues. Significant contributors include Ben Campbell,
   Steve Donovan, Rohan Mahy, Adam Roach, Jonathan Rosenberg, and Dean
   Willis.

References

   [1]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
        Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP:
        Session Initiation Protocol", RFC 3261, June 2002.

   [2]  Rosenberg, J. and H. Schulzrinne, "Session Initiation Protocol
        (SIP): Locating SIP Servers", RFC 3263, June 2002.

   [3]  Gulbrandsen, A., Vixie, P. and L. Esibov, "A DNS RR for
        specifying the location of services (DNS SRV)", RFC 2782,
        February 2000.

   [4]  Rosenberg, J. and H. Schulzrinne, "Reliability of Provisional
        Responses in Session Initiation Protocol (SIP)", RFC 3262,
        June 2002.

Author's Address

   Robert J. Sparks
   dynamicsoft
   5100 Tennyson Parkway
   Suite 1200
   Plano, TX  75024

   EMail: rsparks@dynamicsoft.com

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.