Internet Draft                SIP for Video            February 22, 2001

Internet Engineering Task Force
Internet Draft                                                  O. Levin
draft-levin-sip-for-video-00.txt                          RADVision Inc.
February 22, 2001
Expires: August 2001

SIP Requirements for support of Multimedia and Video

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document outlines requirements for a call control protocol for real-time multimedia support over IP. As a part of its broader scope, the document examines important aspects of interactive video communications that need to be addressed by a call control protocol for real-time multimedia support and discusses techniques currently used to deal with them.

This document examines the way SIP/SDP/RTP/RTCP can be used today to support multimedia sessions and to deal with some of the presented requirements. In a small number of cases, this document mentions possible directions to enhance SIP in order to add new required functionality or to provide the same functionality in a more efficient way.

1. General

The specific motivation underlying this document is to help define a framework that enables SIP/SDP/RTP/RTCP systems to provide real-time multimedia services across global networks, with quality video that delivers a satisfying end user experience.

Videoconferencing can serve as an example application and a test case for a service that uses SIP as a basic call control protocol for managing a call leg, and uses complementary technologies (such as protocols for floor control, camera control, etc.) to achieve other functionality.

It is important to emphasize that, while each application would use various basic tools to address its specific needs, a basic tool should be effective and convenient for general use in its area of expertise, so that it is also appropriate for fulfilling the needs of emerging applications.

In some cases, when discussing a specific desired functionality, this document refers to SIP and to the SIP/SDP/RTP/RTCP suite of protocols interchangeably. In many cases the discussed functionality exists (or may be achieved in the future) as a result of these protocols working together and complementing each other. Since this document presents requirements rather than specific solutions, it is helpful to present the requirements from a system point of view before mapping them onto the protocol (or protocols).

At the end, this document lists other multimedia protocols besides SIP, together with the systems they are used in. This information is presented to highlight some possible obstacles and interoperability problems that need to be considered on the way towards the desired network convergence.

2. Multimedia Application Requirements

2.1 General

Each device has its own set of capabilities in terms of CPU processing power, additional hardware characteristics, and the algorithms it supports.
During a session's lifetime, its characteristics may change dynamically, both as a result of network conditions and as part of a broader application (such as videoconferencing).

All the requirements presented in this chapter are needed to provide basic reliable services, meaning the establishment of a session of a certain quality, based on a capabilities agreement reached during session establishment and sustained throughout the duration of the session.

The problem can be described as a lack of expressiveness in the following three areas:

- Capabilities Specification
- Resource Reservation
- Media Stream Control

The desired ultimate goal is to:

- Express the capabilities (i.e. supported media, CODEC algorithms, bandwidth, etc.) without the need for configuration
- Signal the total resources required for a specific session (probably in terms of capabilities)
- Within a session, explicitly open and close media streams, and modify the parameters of a certain stream within the boundaries of previously announced capabilities and reserved resources

2.2 Capabilities Specification

2.2.1 Bit Rate

Each CODEC (COder DECoder) algorithm is defined to work at a certain rate or rates, measured in bits per second.

SDP [2] has a concept of "Application-Specific Maximum Bandwidth" that can be applied to an "m" (media) line and is specified in kilobits per second.

The required functionality is to express CODEC capabilities both as ranges of rates and as a discrete number of rates. The "discrete number of rates" requirement has a number of purposes:

- Efficient resource allocation
- Multipoint applications where the rates from different sources should be matched
- Interworking with terminals using multiplexing schemes, such as H.320 [12] and H.324 [13], in which only a discrete number of bit rates are available

The defined capabilities may be used both for resource reservation and for the actual control of a media stream.
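The two forms of rate capability described in this section can be illustrated with a minimal Python sketch. It is not part of SIP or SDP. The RateCapability class, its field names, and the rate values are all assumptions made for illustration. The sketch shows a side limited to discrete rates (as with H.320 interworking) being matched against a side that accepts a continuous range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCapability:
    """Hypothetical bit-rate capability for one CODEC, in kilobits per second."""
    ranges: tuple = ()      # inclusive (low, high) pairs, e.g. ((96, 512),)
    discrete: tuple = ()    # individually supported rates, e.g. (64, 128, 384)

    def supports(self, rate_kbps: int) -> bool:
        """True if the rate is listed explicitly or falls inside a range."""
        if rate_kbps in self.discrete:
            return True
        return any(low <= rate_kbps <= high for low, high in self.ranges)


def common_discrete_rates(discrete_side: RateCapability,
                          other_side: RateCapability) -> list:
    """Rates from the discrete side that the other side can also handle."""
    return sorted(r for r in discrete_side.discrete if other_side.supports(r))


# A gateway restricted to a few fixed rates (illustrative values only).
gateway = RateCapability(discrete=(64, 128, 384))
# An IP terminal that accepts any rate within a continuous range.
terminal = RateCapability(ranges=((96, 512),))

usable = common_discrete_rates(gateway, terminal)
print(usable)  # [128, 384]
```

Under these assumed values, only the rates that both sides can handle remain as candidates for resource reservation and for later stream control.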
2.2.2 Advanced CODEC schemes support

Video CODEC algorithms (among them H.263 [9]) have a concept of CIF (Common Intermediate Format) with its derivatives: QCIF, 4CIF, etc. This definition of resolution implies the number of pixels and the format of the composed picture.

The challenge of providing quality video services over imperfect networks has led to the invention of new coding algorithms with numerous optional operation modes suited to various environments.

A well-known example is the latest H.263 Recommendation ("H.263+") [10], which specifies a coded representation that can be used for compressing the moving picture component of audio-visual services at low bit rates and has numerous standard options described in its Annexes.

All of the mentioned characteristics (bit rate, resolution, H.263+ options) are examples of CODEC capabilities.

Many CODEC implementations are capable of changing their mode of operation in real time, as a result of changing conditions, and of signaling the new mode within the RTP payload header. RTP defines a specific RTP payload header for each CODEC scheme, carrying both the configuration and the dynamically changing segmentation information that describes the format of the transmitted RTP packet.

In order to use various options dynamically during a session's lifetime without the risk of dropping a call, a "capabilities announcement" mechanism should exist that expresses the modes of operation supported by each of the sides.

In the future, SDPng work [22] may define a language expressive enough to describe different modes of operation. SDP may be expanded to carry some of the parameters. SIP extensions may be needed to separate the "capabilities announcement" phase from the actual opening of media streams.
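The following minimal Python sketch models the idea of a capabilities announcement that is separate from stream opening. A sender checks a proposed mode change against what the receiver announced earlier. The class names, fields, and option labels are hypothetical and do not correspond to any SDP or SDPng syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoMode:
    """One operating mode of a video CODEC (hypothetical model)."""
    resolution: str                     # e.g. "QCIF", "CIF"
    options: frozenset = frozenset()    # labels of optional modes in use


@dataclass(frozen=True)
class AnnouncedCapabilities:
    """What a receiver announced before any media stream was opened."""
    resolutions: frozenset
    options: frozenset

    def allows(self, mode: VideoMode) -> bool:
        """A mode is acceptable if its resolution and all its options were announced."""
        return (mode.resolution in self.resolutions
                and mode.options <= self.options)


receiver = AnnouncedCapabilities(
    resolutions=frozenset({"QCIF", "CIF"}),
    options=frozenset({"F", "J"}),      # illustrative option labels only
)

# A switch within the announced set is safe; one outside it is not.
switch_ok = receiver.allows(VideoMode("CIF", frozenset({"F"})))
switch_bad = receiver.allows(VideoMode("4CIF", frozenset({"F"})))
print(switch_ok, switch_bad)  # True False
```

In this model the encoder may change modes mid-session only when the check passes, which is the property the announcement mechanism is meant to guarantee.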
2.2.3 Lip Synchronization

One of the basic requirements for a multimedia session is the synchronization between audio and video streams when presenting them to the user.

2.2.3.1 Skew

The different timing of two media streams usually derives from the unequal processing time required for the encoding of the streams by the originating device. This difference is referred to as skew.

Skew is defined as the maximum time by which the two media streams are delayed relative to each other as delivered to the transport network. Skew is usually measured in milliseconds.

Using the RTP [3] timestamping service, the receiving device is capable of computing the skew during an active multimedia session and adjusting its buffers accordingly, in order to present video and audio to the end user in a synchronized manner.

An additional useful function that helps to deal with the lip synchronization problem is providing the receiving device with a skew metric before the actual media streams are transmitted. Knowing the maximum skew value in advance allows the receiving device, based on its capabilities, either to reject the call or to allocate its buffers accordingly, increasing the probability of a high quality call and a pleasant end user experience.

Currently this functionality is missing from the SIP/SDP/RTP/RTCP definitions.

2.2.3.2 Association of media streams

A single multimedia session may consist of multiple video or multiple audio streams, addressing multilingual requirements or CODEC multi-rate requirements respectively.

In order to implement lip synchronization, an association between a certain video stream and its corresponding audio stream is required.

Currently this functionality is missing from the SIP/SDP/RTP/RTCP definitions.

2.3 Resource Reservation

An end user may support more than a single CODEC scheme for a specific media type.
The trivial reason for choosing a specific CODEC (out of a list of supported CODECs) is to match the CODEC scheme supported by the other end.

In multimedia applications, it is desirable to explicitly choose a certain combination of audio and video CODECs without exceeding CPU processing power limitations or the available network bandwidth.

Using the same system (with a defined CPU processing power and support for various CODEC schemes), a user would like to be able to receive different types of multimedia applications. At one time the user might want to receive a classical concert program; at another time the user might want to participate in a medical surgery session. In this example, the application running on the user's computer represents its CODEC capabilities, CPU constraints and, possibly, the local network limitations. The other side, i.e. the "service provider", knows the minimal CODEC requirements for providing a service with a certain quality.

One of the possible ways to express "resources ability" is by grouping the "capabilities" discussed above.

It is an open question whether SIP, as a peer call control protocol, requires actual "resources reservation commands" to be imposed on the other side.

2.4 Media Stream control

This section presents requirements for controlling a particular media stream effectively and automatically. These requirements do not refer to manually driven commands such as floor control or camera control.

It is worth mentioning that, although user commands are not in the scope of a call control protocol (such as SIP), the correlation between them and the media streams should be taken into consideration during the design of both protocols. For example, a particular camera (managed by some application protocol) should be associated with the video stream it originates, which is managed by SIP.

Video requires broader control than voice.
Some examples of video specific control are listed towards the end of this section. It is believed that a convenient means of media stream control is required for both voice and video, although supporting video applications obviously complicates the problem.

Currently, in SIP/SDP/RTP/RTCP systems, the media control commands are divided between SIP/SDP conventions and commands defined for certain CODECs and carried by "reverse" RTCP control packets, as in [4].

2.4.1 An ability to reference a specific media stream

The first basic requirement is the ability to reference a particular media stream within a session. This functionality does not currently exist within SDP. In order to signal a specific change within a single media stream, the whole block of media descriptors (i.e. "m" lines) has to be retransmitted. Moreover, the receiving side has to match each one of the "m" lines against the old information in order to recognize the changes. This should be fixed for the "old" SDP version, without waiting for the results of the SDPng work.

2.4.2 Effective addressing of collision conditions during asynchronous operations

In most cases, a change of media stream parameters is asynchronous in nature (see examples below). Therefore, both sides may issue a request for a conflicting change or command "simultaneously", generating a so-called "race condition".

Currently, SIP solves this problem by introducing the "Retry-After" header field in the INVITE message. One of the disadvantages of this approach is that it locks the whole session, even when the collision exists in a certain media stream only.

2.4.3 Explicit start and stop of data transmitting in a certain direction

Explicit start and stop of data transmission in a certain direction is the first basic "control command" out of the set of media controls. Its necessity becomes obvious during the design of PSTN interworking.
Early establishment of a media path in one direction only and strict billing regulations are just two of the examples.

Today SIP addresses this functionality by "putting media streams on hold", that is, by setting the "c" destination address to the value 0.0.0.0.

2.4.4 Bandwidth changes

Applications involving video are particularly prone to frequent bandwidth changes, which cause packet loss, error conditions, etc.

The first cause of frequent changes is changing network conditions. Future IP based wireless networks will become a real test bed for SIP services.

Similar changing conditions would frequently be caused by the "multimedia nature" of the video services. Some examples are presented below.

Today, in many integrated services, "multimedia communication" includes a data session (such as T.120 [16]) bundled together with voice and video services. Opening and closing the data stream may significantly change the desired parameters of the media streams.

The same effect may exist in multimedia applications where, instead of using QCIF derivatives, video streams are presented to the user in separate windows. This mode of operation may be advantageous to the user, who has better control over the session and can use network resources more effectively.

An additional example of an "application condition" is a multi-conferencing service, where adding a participant may result in a change of the transmitted stream parameters or in a reconsideration of the conference capabilities.

2.4.5 Video/CODEC Specific Commands

Various video specific techniques are used in today's networks in order to cope with the conditions mentioned above with minimum service degradation and as seamlessly as possible for the users. Some examples are given below.

The H.261 and H.263 video CODECs have a notion of a picture's building blocks: "full picture", GOB and MacroBlock (MB).
The decoder would be able to recognize synchronization degradation and explicitly request a "full picture", a whole GOB or a whole MacroBlock from the encoder.

In SIP/SDP/RTP/RTCP systems, the only analogous functionality is defined in RFC 2032, "RTP Payload Format for H.261 Video Streams" [4], which defines a "Full INTRA-frame Request" (FIR) to be carried in an RTCP "reverse" control packet. This technique is clearly an exception to normal RTCP design and therefore does not work in all cases. No corresponding functionality has been defined for the H.263 CODEC in [5] and [6].

A simple example of a video specific command is a request to "freeze a picture", in this case originated by the encoder towards the decoder. If the encoder is aware of upcoming massive changes in the transmitted picture, it would request the decoding side to stop presenting the changes until a new stable image is encoded and transmitted.

Another example of a video specific command is a request to change the tradeoff between temporal and spatial resolution, i.e. the tradeoff between the rate of the samples and the resolution of the picture. This request would be originated by the decoder towards the encoder (if the encoder has the capability to change the tradeoff dynamically).

2.4.6 Transmission of media stream commands

Today the media stream commands are transmitted by RTCP. The "FIR" command defined for H.261 and described above is an example of this technique.

The benefit of this approach is that the commands follow the same path as the media stream itself and are therefore synchronized with it in time.

On the other hand, the RTCP approach has the following drawbacks:

- RTCP is an unreliable transport channel
- RTCP mechanisms were originally designed to work primarily in a multicast environment. This may introduce complications for stream control commands issued in both directions.
- Difficulty for RTCP in making use of capabilities defined by SDP (or SDPng)

3. Interoperability with existing video systems

Today several protocols are defined to support multimedia systems for both Circuit Switched and Packet networks. Some of them (such as H.320 [12] and H.323 [11]) are deployed; others (such as H.324M) are intended to be used in future networks.

It is important to be aware of the architecture of these systems and the interworking challenges they introduce. These specifications are listed below.

3.1 H.320

ITU-T Recommendation H.320 [12], "Narrow-band visual telephone systems and terminal equipment", is defined for use over ISDN networks. Its deployment is especially successful in Europe.

The H.320 Recommendation defines an umbrella system using the following protocols: H.221, H.242, H.230, H.224, H.281, T.120.

3.2 H.324 and H.324M

ITU-T Recommendation H.324 [13], "Terminal for low bit-rate multimedia communication", is defined for use over regular PSTN networks.

H.324M is the profile of H.324 defined for Mobile networks. H.324M is the 3GPP choice for Videoconferencing over Circuit Switched Networks.

The H.324 Recommendation defines an umbrella using the following protocols: H.223, H.245, T.120.

3.3 H.323

ITU-T Recommendation H.323 [11], "Packet-based multimedia communications systems", is defined for use over Packet Based Networks which may not provide a guaranteed Quality of Service.

The H.323 Recommendation defines an umbrella for use of the following protocols: H.225.0, H.245, H.282, H.283, H.224, H.281, T.120.

4. Conclusion

This draft is a first attempt to present and summarize the issues that need to be addressed for video services support in SIP/SDP/RTP/RTCP systems.

Some of the solutions may be defined in a short period of time. More advanced features or complicated problems will be resolved in the future by SDPng and the SIP extensions related to it.
It is important to be aware of the current limitations and open issues in the standard because, depending on how pressing the requirements are, SIP allows for extensions that add functionality in a standard, interoperable manner.

A possible alternative approach might be the definition of conventions for certain SIP based multimedia systems.

5. Security Considerations

This document does not introduce new security requirements to existing SIP/SDP/RTP/RTCP systems.

6. References

[1] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP: Session Initiation Protocol", RFC 2543, IETF, March 1999.

[2] M. Handley and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, IETF, April 1998.

[3] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, IETF, January 1996.

[4] T. Turletti and C. Huitema, "RTP Payload Format for H.261 Video Streams", RFC 2032, IETF, October 1996.

[5] C. Zhu, "RTP Payload Format for H.263 Video Streams", RFC 2190, IETF, September 1997.

[6] Bormann, Cline, Deisher, Gardos, Maciocco, Newell, Ott, Sullivan, Wenger, Zhu, "RTP Payload Format for the 1998 Version of ITU-T Rec. H.263 Video (H.263+)", RFC 2429, IETF, October 1998.

[7] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, IETF, June 1999.

[8] ITU-T Recommendation H.261 (1993), Video codec for audiovisual services at p x 64 kbit/s.

[9] ITU-T Recommendation H.263 (1996), Video coding for low bit rate communication.

[10] ITU-T Recommendation H.263 (1998), Video coding for low bit rate communication.

[11] ITU-T Recommendation H.323 (2000), Packet-based multimedia communications systems.

[12] ITU-T Recommendation H.320 (1997), Narrow-band visual telephone systems and terminal equipment.
[13] ITU-T Recommendation H.324 (1996), Terminal for low bit rate multimedia communication.

[14] ITU-T Recommendation H.225.0 (1999), Call signalling protocols and media stream packetization for packet based multimedia communication systems.

[15] ITU-T Recommendation H.245 (2000), Control protocol for multimedia communication.

[16] ITU-T Recommendation T.120 (1996), Data protocols for multimedia conferencing.

[17] ITU-T Recommendation H.223 (1996), Multiplexing protocol for low bit rate multimedia communication.

[18] ITU-T Recommendation H.224 (1994), A real time control protocol for simplex applications using the H.221 LSD/HSD/MLP channels.

[19] ITU-T Recommendation H.281 (1994), A far end camera control protocol for video conferences using H.224.

[20] ITU-T Recommendation H.282 (1999), Remote Device Control Protocol for Multimedia Applications.

[21] ITU-T Recommendation H.283 (1999), Remote Device Control Logical Channel Transport.

[22] Kutscher/Ott/Bormann, "Requirements for Session Description and Capability Negotiation", draft-kutscher-mmusic-sdpng-req-01.txt, IETF, November 2000.

7. Acknowledgements

Sasha Ruditsky, Yair Miranda, Eli Doron, Itamar Gilad and Danny Levin participated in earlier discussions on this topic.

8. Authors' Addresses

Orit Levin
RADVision Inc.
575 Corporate Drive, Suite 420
Mahwah, NJ 07430
Phone: +1 201 529 4300
Email: orit@radvision.com

Full Copyright Statement

Copyright (c) The Internet Society (2000). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.