Internet Engineering Task Force                                   SIP WG
Internet Draft                                 J.Rosenberg,H.Schulzrinne
draft-rosenberg-sip-conferencing-models-00.txt   dynamicsoft,Columbia U.
November 17, 2000
Expires: May, 2001

Models for Multi Party Conferencing in SIP

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as work in progress.

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

The Session Initiation Protocol (SIP) can support multi-party conferencing in many different ways. In this draft, we define the various multi-party conferencing models, and for each, discuss how they are used and then analyze their relative benefits and drawbacks.

1 Introduction

The Session Initiation Protocol (SIP) [1] has been defined for the establishment, maintenance, and termination of calls between one or more users. However, despite its origins as a large scale multiparty conferencing protocol, SIP is used today primarily for point to point calls. This configuration is the focus of the SIP specification and most of its extensions. As a result, there is a lot of confusion about how SIP supports multi-party conferencing.

J.Rosenberg,H.Schulzrinne                                     [Page 1]
Internet Draft                   conf                 November 17, 2000

We seek to remedy this problem by describing, in a consistent and complete fashion, the various multi-party conferencing models supported by standard SIP.
For each model, we discuss:

o How the model works.

o How users are invited to join.

o How users can join an existing conference without being invited.

o How well the model scales.

o Which entities need to be aware of the model.

o How participants learn about each other.

We also identify missing pieces and recommend standards activity to fill them in. This document itself does not define any new extensions of any kind.

2 End System Mixing

The first model we call "end system mixing". In this model, user A calls user B, and they have a conversation. At some point later, A decides to conference in user C. To do this, A calls C, using a completely separate SIP call. This call uses a different Call-ID, different tags, etc. There is no call set up directly between B and C. A receives media streams from both B and C, and mixes them. A sends a stream containing A's and C's streams to B, and a stream containing A's and B's streams to C. This model is depicted graphically in Figure 1.

Basically, user A handles both signaling and media mixing. B and C are unaware of the multi-party call, from a SIP perspective at least. From an RTP perspective, A is a mixer, and so the RTCP reports from A will contain SDES information that indicates the existence of an additional party in the media stream.

Note that this model has the serious drawback that the conference ends when the mixing UA leaves the call.

                          +-----+
            SIP call      |     |
          +---------------|  B  |
          |  .............|     |
+-----+   |  .   RTP      +-----+
|     |---+  .
|  A  |.......
|     |---+  .
+-----+   |  .   RTP      +-----+
          |  .............|     |
          +---------------|  C  |
            SIP call      |     |
                          +-----+

Figure 1: Three Way Calling using End System Mixing

2.1 Inviting Users to Join

Any user in the conference can invite another user to join, so long as they are capable of performing the required mixing and signaling functions.
To invite a new user to join, a user in the conference simply calls them using normal SIP procedures. The only difference is that the stream sent to that new user contains the streams received from the other parties in the call.

In fact, it is perfectly acceptable for complex connectivity graphs to be constructed, as a result of different users inviting other users to join. For example, take our case of A calling B, and then calling C. If, later on, C calls D, C will perform the mixing of the streams it gets from A (which actually contain media from A and B), along with its own stream, and send that to D. This results in a connectivity graph that looks like:

A------B
|
|
C------D

Note, however, that there is a possibility of loops. From here, if D calls B, and brings that stream into the conference, a loop is created. This loop can be detected using the mechanisms described in the RTP specification [2]. However, we expect these conditions to be extremely rare. Presumably, D knows B is in the conference already, and so would not likely call B and invite them in.

2.2 Users Joining

In this model, there is no explicit conference "identifier" that can be used to join. This conference model, by its nature, is built around ad-hoc conferences. However, it is still possible for a user to join in the following way. Let's say a new user, E, simply calls B, unaware, even, that B is in a conference (E might actually be aware, but the SIP messaging is no different). B's softphone, recognizing that B is already in a conference, asks B if E should be brought into the conference right away. If B clicks "yes", the call from E is answered. The media stream sent to E contains media from B, along with the media B is already receiving from A.

If B instead clicks "no", E can easily be added to the conference later. No SIP signaling at all is needed to do this. B simply starts sending the mixed media to E.
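The mixing rule of this section (each peer receives the mixer's own audio plus everything the mixer receives from its other peers) can be sketched as follows. This is only an illustration under stated assumptions: audio is modeled as equal-length blocks of 16-bit linear PCM samples, and the function names `mix` and `outgoing_streams` are hypothetical, not part of SIP or RTP.

```python
from typing import Dict, List


def mix(streams: List[List[int]]) -> List[int]:
    """Sum equal-length blocks of linear PCM samples, clipping to 16 bits."""
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*streams)]


def outgoing_streams(own: List[int],
                     received: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each peer, mix our own audio with what we receive from the OTHER peers."""
    out = {}
    for peer in received:
        others = [pcm for name, pcm in received.items() if name != peer]
        out[peer] = mix([own] + others)
    return out


# A holds two separate SIP calls, one with B and one with C.
a_own = [100, 200, 300]
from_peers = {"B": [10, 20, 30], "C": [1, 2, 3]}
to_peers = outgoing_streams(a_own, from_peers)
# to_peers["B"] carries A + C, to_peers["C"] carries A + B.
```

A peer never gets its own audio back, which is why the mixer has to produce a different stream per neighbor; this is the source of the per-edge encoding cost discussed in Section 2.3.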
2.3 Scalability

A drawback of this model is its scalability. Viewing the conference from a graph perspective, if the number of edges touching a vertex (its degree) equals N, the user corresponding to that vertex has to perform up to N separate media stream encodings. We say "up to", as it depends on the number of participants who are talking at once. If only one participant is talking, the non-talking "mixer" endpoints don't need to do any additional encoding. If everyone is talking, it is N encodes. Since encoding is generally a complex process, a typical workstation these days can handle two or three simultaneous encodes using a low rate codec like G.723.1. The problem can be mitigated somewhat by distributing the mixing responsibilities (making the graph deep rather than wide). However, this requires a conscious effort of the participants regarding who is to make the call to add a new user. This is unlikely to happen in practice.

Another limitation to scalability is bandwidth. If the degree of a vertex is N, the user needs enough bandwidth to send and receive up to N streams, for a total of 2N. On a 56K modem, using a G.723.1 codec, this limits the degree to two (remember RTP overheads). This limitation exists even if only one user is talking. In this case, a mixing host receives the encoded packet stream, and needs to send a copy to each participant it is connected to.

For these reasons, this conferencing model is ideal for three-way conferences (i.e., degrees of two), but doesn't scale up much higher.

2.4 Location of Service Logic

This model does not require any extension to SIP in order to work. It does require knowledge of this mechanism within the UA performing the mixing. Non-mixing participants do not need to know anything special.

2.5 Discovering Participant Identities

The identities of other participants in the conference are NOT known through SIP. Rather, they are learned through RTP.
UAs with degrees greater than one are RTP mixers. As such, they take the RTCP SDES of the streams they mix, and aggregate them into the RTCP stream sent out. Since RTCP messages are sent infrequently, there may be a delay between when a user joins, and when their presence is known to the other participants.

3 Large-Scale Multicast Conferences

Large-scale multicast conferences were the original motivation for both the Session Description Protocol (SDP) [3] and SIP. In a large-scale multicast conference, one or more multicast addresses are allocated to the conference (more than one may be needed if layered encodings are in use). Each participant joins those multicast groups, and sends their media to those groups. Signaling is not sent to the multicast groups. The sole purpose of the signaling is to inform participants of which multicast groups to join.

Large-scale multicast conferences are usually pre-arranged, with specific start and stop times (which is why this information exists in SDP). Protocols such as the Session Announcement Protocol (SAP) [4] are used to announce these conferences. However, multicast conferences do not need to be pre-arranged, so long as a mechanism exists to dynamically obtain a multicast address. SAP itself was originally used for this purpose; this has been supplanted by the malloc architecture [5], still under development.

So, if there are N participants, there will be point to point SIP relationships between pairs of participants. Each participant sends a single media stream to the group, and receives up to N-1 streams at any time. Note that the number of streams that a user will receive depends on who is actually sending at any given time. If the stream is audio, and silence suppression is utilized, the number of streams a user will receive at any given time is equal to the number of users talking at any given time.
Even for very large conferences, this is usually just a small number of users.

3.1 Inviting Users to Join

Inviting users to join is simple. Any user may invite any other user to join. The SIP INVITE request contains SDP that indicates multicast addresses for each media line. The SDP in the 200 OK response may actually be empty. From Section B.3 of RFC2543:

   For multicast, receive and send multicast addresses are the same and all parties use the same port numbers to receive media data. If the session description provided by the caller is acceptable to the callee, the callee can choose not to include a session description or MAY echo the description in the response.

The called party then joins the multicast groups indicated in the SDP, using multicast protocols such as IGMP [6].

Note that it is not even necessary for users to send each other BYE messages when the conference is over, especially for large-scale, pre-arranged conferences that have explicit end times indicated in SDP. SDP aside, a participant can simply leave the conference at any time by leaving the multicast groups. No SIP signaling is needed to accomplish this.

3.2 Users Joining

Users can join a conference of this type without being invited. All they need is the multicast addresses, ports, and codecs being used. These can be obtained through any number of means, including SAP. SDP conference descriptions can even be obtained from web pages, for example. Once the addresses are obtained, the user simply joins the appropriate multicast groups. Note that absolutely no SIP signaling is required in this case.

3.3 Scalability

The scalability of conferences of this type can be excellent, especially for audio conferences. However, it is scalable under the assumption that multicast itself can scale to very large groups.
Indeed, in local networks, protocols like DVMRP [7] and PIM-DM (the so-called dense modes) have tremendous scalability for conferences with very large numbers of members. Given the existence of scalable multicast, the primary bottleneck to scalability of this conference type is the periodicity of RTCP reporting. Work has been done on improving the problematic cases [8] so that conferences with well over a million members are possible.

Scaling is a bit harder for video conferences. Voice can use silence suppression, so that no data is sent during periods of inactivity; video has no equivalent. This makes it hard to scale without flooding users with lots of video packets.

Security is also hard for multicast conferences. Group key management, especially when users leave the group, is very complex.

Unfortunately, multicast has not been widely deployed across backbones (some, like Internet2, do support it, but they are the exception rather than the rule). The MBone has collapsed, for all intents and purposes. Very few ISPs support multicast. As a result, wide area conferences are not really viable using multicast. However, these conferences are very suitable for LAN or enterprise conferences, where multicast is often deployed.

3.4 Location of Service Logic

This conferencing model does not require any SIP extensions. It does require that SIP UAs are prepared to receive SIP invitations with multicast addresses in the SDP. These UAs need to be prepared to mirror the SDP in the response. They should also be prepared to never receive a BYE for the conference.

3.5 Discovering Participant Identities

The identity of the participants in the session is learned entirely through RTCP. Each user in a group multicasts RTCP packets with their name, email address, and so on.
Note, however, that in large conferences, there may be significant amounts of time between a participant joining, and sending of their first RTCP SDES packet (this is for receivers only; senders will become known much faster).

4 Dial-In Conference Servers

Dial-In conference servers closely mirror dial-in conference bridges in the traditional PSTN.

A dial-in conference server acts as a normal SIP UA. Users call it, and the server maintains point to point SIP relationships with each user that calls in. The server takes the media from the users who dial into the same conference, mixes them, and sends out the appropriate mixed stream to each participant separately.

                +-----+
                |     |
                |  A  |
                |     |
                +-----+
                  | .
                  | .
                  | .
              +---------+
+-----+       |         |       +-----+
|     |-------|  Conf.  |-------|     |
|  D  |       |  Server |       |  B  |
|     |.......|         |.......|     |
+-----+       |         |       +-----+
              +---------+
                  | .
                  | .
                  | .
                +-----+
                |     |
                |  C  |
                |     |
                +-----+

Figure 2: Dial-In Conference Servers

The model is depicted in Figure 2. Note that each UA (A,B,C,D) has a point to point SIP and RTP relationship with the conference server. Each call has a different Call-ID. Each user sends their own media to the server. The media delivered to user A by the server is the media mixed from users B, C and D. The media delivered to user B by the server is the media mixed from users A, C and D. The media delivered to user C by the server is the media mixed from users A, B and D. The media delivered to user D is the media mixed from users A, B and C.

The conference is identified by the request URI of the calls from each participant. This provides numerous advantages from a services and routing point of view [9]. For example, one conference on the server might be known as sip:conference34@servers.com. All users who call sip:conference34@servers.com are mixed together.
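The way a dial-in server groups calls by request URI, and decides whose media goes into each participant's mix, can be sketched as below. This is a toy model and not a SIP implementation: the class and method names are hypothetical, and the second conference URI (sip:conference35@servers.com) is invented purely to show that conferences stay separate.

```python
from typing import Dict, List, Set


class ConferenceServer:
    """Toy model: one set of callers per conference request URI."""

    def __init__(self) -> None:
        self.conferences: Dict[str, Set[str]] = {}

    def on_invite(self, request_uri: str, caller: str) -> None:
        """An INVITE arrived; the request URI selects the conference."""
        self.conferences.setdefault(request_uri, set()).add(caller)

    def mix_sources(self, request_uri: str, listener: str) -> List[str]:
        """Participants whose media is mixed into the stream sent to `listener`."""
        return sorted(self.conferences.get(request_uri, set()) - {listener})


server = ConferenceServer()
conf34 = "sip:conference34@servers.com"
for user in ("A", "B", "C", "D"):
    server.on_invite(conf34, f"sip:{user}@example.com")

# A call to a different (hypothetical) URI lands in a different conference.
server.on_invite("sip:conference35@servers.com", "sip:E@example.com")

sources_for_a = server.mix_sources(conf34, "sip:A@example.com")
```

Here `sources_for_a` lists B, C and D, matching the description of Figure 2; E is not included because E called a different request URI.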
Dial-In conference servers are usually associated with pre-arranged conferences. However, the same model applies to ad-hoc conferences. An ad-hoc conference server creates the conference state when the first user joins, and destroys it when the last one leaves. The SIP and RTP interfaces are identical to the pre-arranged case.

Since conferencing servers are nothing more than SIP UASes, they can use any of the procedures SIP allows a UAS to use. This includes authentication. So, for example, a specific conference may have a password associated with it. Users who join are challenged (with a 401) using digest authentication. The realm, in this case, would identify the conference. The INVITE that comes back would have an Authorization header that includes the response to the challenge: the name of the user trying to join the conference, and the conference password, hashed as defined in [10].

Conferences can also limit the number of participants. When a new user tries to join, but the conference is full, the conference server can just reject the request with a "500 Conference Full" response.

4.1 Inviting Users to Join

Inviting users to join is done using the SIP REFER message [11]. If user A wishes to ask user B to join, A would send B a REFER that looks like:

   REFER sip:B@example.com SIP/2.0
   From: sip:A@example.com
   To: sip:B@example.com
   Refer-To: sip:conference34@servers.com

This would cause B to send an INVITE message to the conference server:

   INVITE sip:conference34@servers.com SIP/2.0
   From: sip:B@example.com
   To: sip:conference34@servers.com
   Referred-By: sip:A@example.com

Since the request URI identifies the conference, this will cause B to get added to conference 34.

4.2 Users Joining

Joining is easy. The participant that wishes to join simply sends an INVITE to the conference server, with the conference ID in the request URI.
The conference ID (which is a SIP URL) can be learned by any number of means, including having it on a web page, receiving it in an email, etc. For example, if B wishes to join sip:conference34@servers.com, B would send the following request:

   INVITE sip:conference34@servers.com SIP/2.0
   From: sip:B@example.com
   To: sip:conference34@servers.com

4.3 Scalability

The scalability of this model is limited by the bandwidth and processing power of the conference server. If there are N participants in a conference, M of which are sending media streams, the server will need to manage N signaling relationships, perform N RTP stream decodes, and N RTP stream encodes (assuming M > 0). The encoding is the primary processing bottleneck, and the sending of the N media streams is the primary bandwidth bottleneck.

However, conference servers can be built using heavy duty hardware, and have high bandwidth access. Furthermore, since we are using the request URI to name the conferences, we can use standard SIP techniques for distributing conferences across servers [9].

4.4 Location of Service Logic

The SIP UA of the conference participants does not require any special processing. The RTP implementation in those clients, however, should support RTCP and be prepared to receive contributing sources. All of the new logic for providing this service resides in the conferencing server. No SIP extensions are needed, simply logic that resides above the SIP stack to manage the conferencing service.

4.5 Discovering Participant Identities

The identities of other participants in the conference are NOT known through SIP. Rather, they are learned through RTP. The conference server is an RTP mixer. As such, it takes the RTCP SDES of the streams it mixes, and aggregates them into the RTCP stream sent out. This will allow participants to gradually (over a few seconds) learn the identities of the other participants.
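The SDES aggregation performed by the mixer can be pictured as a table from source identifier to description. The sketch below is a simplified model, not an RTCP implementation: the class name, the SSRC values and the CNAME strings are hypothetical, and leaving the recipient's own entry out of what is sent to it is an assumption made for the illustration.

```python
from typing import Dict


class MixerSdesTable:
    """Toy model of the SDES items a mixer has collected from its sources."""

    def __init__(self) -> None:
        self.cname_by_ssrc: Dict[int, str] = {}

    def on_sdes(self, ssrc: int, cname: str) -> None:
        """Record (or refresh) the description received from one source."""
        self.cname_by_ssrc[ssrc] = cname

    def on_leave(self, ssrc: int) -> None:
        """Forget a source that has left the conference."""
        self.cname_by_ssrc.pop(ssrc, None)

    def sdes_for(self, recipient_ssrc: int) -> Dict[int, str]:
        """Descriptions to include in RTCP sent to one recipient (assumed: all but its own)."""
        return {s: c for s, c in self.cname_by_ssrc.items() if s != recipient_ssrc}


table = MixerSdesTable()
table.on_sdes(0x0A, "A@example.com")
table.on_sdes(0x0B, "B@example.com")
table.on_sdes(0x0C, "C@example.com")
known_to_a = table.sdes_for(0x0A)
```

Because the table is only refreshed as RTCP packets arrive, a participant's view of the membership lags behind the actual joins and departures, which is the gradual learning described above.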
5 Ad-hoc Centralized Conferences

In an ad-hoc centralized conference, two users A and B start with a normal SIP call. At some point later, they decide to add a third party. Instead of using end system mixing, they would prefer to use a conference server, as defined in Section 4. This model corresponds roughly to the centralized multipoint conference model of H.323.

One of the participants takes responsibility for transitioning to a conference server. The first step in this process is the discovery of a conference server that supports ad-hoc conferences. This can be done through static configuration, or through any of a number of standard service discovery protocols, such as the Service Location Protocol [12].

Once the server is discovered, a conference ID is chosen. This ID must be globally unique. The conference ID is then prepended to the server's host name, and a SIP URL for the ad-hoc conference is formed. For example, if the server "a.servers.com" is used, and the unique ID is "a7hytaskp09878a", the SIP URL for this conference is sip:a7hytaskp09878a@a.servers.com.

The user who is performing the transition (say, user A) then sends an INVITE to this URL. This creates the initial conference state in the server. A then sends a REFER to the other party in the call (say B), referring them to sip:a7hytaskp09878a@a.servers.com. B sends an INVITE to this address, and is added to the conference. Once the 200 OK response to the REFER is sent from B to A, A hangs up its call with B. A and B are now in a conference using a conference server. From here, operation is identical to the system described in Section 4.

It is also possible to transition from an end system mixed conference (even one with a complex connection topology) to a centralized conference server. One user takes responsibility for initiating the transition. It proceeds as described above.
However, the REFER request is sent to all SIP peers adjacent to the user. In addition, when a SIP UA receives a REFER, it must not only act on it as described above, but also generate a REFER to each of its other adjacent SIP peers. In essence, the REFER message is propagated along the connection graph, starting at the root (which is the user who initiates the transition). The transition will work so long as the graph has no cycles (which is needed anyway, as discussed above), and so long as only one user attempts to initiate the transition. If multiple users attempt to initiate the transition at the same time, the conference will break into two disjoint ad-hoc conferences, with membership depending on the temporal dynamics of the REFER propagation.

5.1 Inviting Users to Join

Once the ad-hoc conference has been created on the server, inviting users proceeds as defined in Section 4.1.

5.2 Users Joining

Once the ad-hoc conference has been created on the server, joining proceeds as defined in Section 4.2.

5.3 Scalability

The scalability of this conference model is identical to that of dial-in conference servers, as described in Section 4.3.

5.4 Location of Service Logic

The logic for handling the transition process must be located in at least one UA in the conference. All UAs that are mixers in an end system mixed conference must know to propagate the REFER requests they receive during the transition.

5.5 Discovering Participant Identities

Once the ad-hoc conference is established, conference identities are determined through RTCP, as in the dial-in case.

6 Dial-Out Conferences

Dial-out conferences are a simple variation on dial-in conferences. Instead of the users joining the conference by sending an INVITE to the server, the server chooses the users who are to be members of the conference, and then sends them the INVITE.
Typically, dial-out conferences are pre-arranged, with specific start times and an initial group membership list.

Once the users accept or reject the call from the dial-out server, the behavior of this system is identical to the dial-in server case of Section 4. Thus, a dial-out conference server will generally need to support dial-in access for the same conference, if it wishes to allow joining after the conference begins.

Note that, from the participants' perspective, they will learn the conference identity (the URL) from the From field in the INVITE messages received from the server.

6.1 Inviting Users to Join

Once the conference is established, inviting users to join is identical to the scenario described in Section 4.1. Note that the URL to be used in the REFER is obtained from the From field of the INVITE received from the dial-out server.

6.2 Users Joining

Once the conference is established, joining is identical to the scenario described in Section 4.2. Note that the URL to be used in the INVITE of new participants is obtained from the From field of the INVITE received from the dial-out server by the initial participants.

6.3 Scalability

The scalability of this conference model is identical to that of dial-in conference servers, as described in Section 4.3.

6.4 Location of Service Logic

The SIP UA of the conference participants does not require any special processing. The RTP implementation in those clients, however, should support RTCP and be prepared to receive contributing sources. All of the new logic for providing this service resides in the conferencing server. No SIP extensions are needed, simply logic that resides above the SIP stack to manage the conferencing service.

6.5 Discovering Participant Identities

Once the conference is established, conference identities are determined through RTCP, as in the dial-in case.
7 Centralized Signaling, Distributed Media

In this conferencing model, there is a centralized controller, as in the dial-in and dial-out cases. However, the centralized server handles signaling only. The media is still sent directly between participants, using either multicast or multi-unicast. Multi-unicast is when a user sends multiple packets (one for each recipient, addressed to that recipient). This is referred to as a "Decentralized Multipoint Conference" in H.323.

Interestingly, this conference model is possible with baseline SIP. It works through third party call control [13]. The conference server sends a re-INVITE to each participant when a new one joins. The re-INVITEs add a media stream that gets sent to the new participant (and similarly in the reverse direction).

Let us assume for the moment that a conference already exists with three participants. In this state, each participant is sending media directly to each other. This is because the SDP that the conference server has given to each participant contains three media lines, each of type audio, with connection addresses and ports corresponding to each of the three users.

The call flow from here is shown in Figure 3. A new participant (D) joins the conference. It does so by sending an INVITE (1) to the server, with the conference ID in the request URI. The SDP in the INVITE contains a single media stream, with an IP address and port where D would like to receive media. The 200 response from the conference server (2) contains a single media line with an IP address of 0.0.0.0 and a random port, indicating hold.

The next step is for the server to obtain two more addresses where the new participant will be receiving media (it already has one from the original INVITE). To do this, it sends a re-INVITE to the new participant (4). This re-INVITE contains two additional media streams (for three total), all three of which are on hold.
The 200 response to the re-INVITE (5) contains two additional IP addresses and ports where the user is willing to receive media.

Now the server needs to inform the other parties that they should begin sending media to the new user. It first sends a re-INVITE to user C (7). This re-INVITE adds an additional media stream to the two that C has already been sending. This new media stream uses one of the three connection addresses and ports returned by D in message (5). Call this address/port D1. The other two are D2 and D3. The 200 OK response from user C (8) contains the address and port where C is willing to receive a new, third media stream. Call this port C3. The server holds on to this port, as it will use it later on, sending it to D, so that D sends media there. At this point, however, C can begin sending media to D.

This re-INVITE process happens for B and for A as well. In the re-INVITE to B (10), the server adds an additional media line (above the two already in use by B) using address/port D2. The response (11) contains a new address/port to send media to B. Call this port B3. In the re-INVITE to A (13), the server adds an additional media line using address/port D3. The response (14) contains a new address/port to send media to A. Call this port A3.

Finally, the server sends a re-INVITE (16) to the new party. This re-INVITE takes all three streams off hold, and updates their connection addresses and ports with C3, B3, and A3, respectively. The 200 OK response (17) returns the same ports and addresses returned in message (5) (as noted in [13], these addresses/ports MUST NOT change). Now, D can send media to A, B and C.

The result of these manipulations is, indeed, a full mesh of unicast RTP streams between all participants. Unlike the case of end system mixing, the stream sent by any participant to all of the others is identical.
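The sequence of INVITE transactions the server drives when a newcomer joins can be enumerated with a short sketch. This is only bookkeeping, not SIP or SDP processing: the function name and the textual descriptions are hypothetical, and the newcomer's receive addresses are simply numbered in the order the existing participants are re-INVITEd (D1 for C, D2 for B, D3 for A in the example above).

```python
from typing import List, Tuple

# (sender, receiver, what the INVITE or re-INVITE carries)
Transaction = Tuple[str, str, str]


def join_transactions(existing: List[str], newcomer: str) -> List[Transaction]:
    """List the INVITE transactions needed to add `newcomer` to the full mesh."""
    n = len(existing)
    txns: List[Transaction] = [
        (newcomer, "server", "INVITE with one media line; answered on hold"),
        ("server", newcomer,
         f"re-INVITE with {n} held media lines; answer lists {n} receive addresses"),
    ]
    for i, peer in enumerate(existing, start=1):
        txns.append(("server", peer,
                     f"re-INVITE adding a line towards {newcomer}{i}; "
                     f"answer gives a new receive address on {peer}"))
    txns.append(("server", newcomer,
                 "re-INVITE taking all lines off hold, filled with the peers' addresses"))
    return txns


flow = join_transactions(["C", "B", "A"], "D")
invite_count = len(flow)          # N + 3 with N = 3 existing participants
message_count = 3 * invite_count  # INVITE, 200 and ACK per transaction
```

For three existing participants this yields six INVITE transactions and eighteen messages, consistent with the numbering (1) through (18) in Figure 3 and with the N+3 figure given in Section 7.3.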
Each participant needs to mix, but it mixes the media it receives, and plays that out the speakers. This is normal behavior for multiple streams of the same type. Note that the SIP relationship is still point-to-point. There are four calls at the end of Figure 3, one from each participant to the server, each with a different Call-ID.

Note that hybrids are easily possible. Certain users can instead be mixed (sending audio to the conference server), while others are set to send audio to each other.

|              |              |              |(1) INV D         |
|              |              |              |----------------->|
|              |              |              |(2) 200 hold      |
|              |              |              |<-----------------|
|              |              |              |(3) ACK           |
|              |              |              |----------------->|
|              |              |              |(4) INV 3held     |
|              |              |              |<-----------------|
|              |              |              |(5) 200 3recv     |
|              |              |              |----------------->|
|              |              |              |(6) ACK           |
|              |              |              |<-----------------|
|              |              |(7) INV +D1   |                  |
|              |              |<--------------------------------|
|              |              |(8) 200 +C3   |                  |
|              |              |-------------------------------->|
|              |              |(9) ACK       |                  |
|              |              |<--------------------------------|
|              |(10) INV +D2  |              |                  |
|              |<-----------------------------------------------|
|              |(11) 200 +B3  |              |                  |
|              |----------------------------------------------->|
|              |(12) ACK      |              |                  |
|              |<-----------------------------------------------|
|(13) INV +D3  |              |              |                  |
|<--------------------------------------------------------------|
|(14) 200 +A3  |              |              |                  |
|-------------------------------------------------------------->|
|(15) ACK      |              |              |                  |
|<--------------------------------------------------------------|
|              |              |              |(16) INV A3,B3,C3 |
|              |              |              |<-----------------|
|              |              |              |(17) 200          |
|              |              |              |----------------->|
|              |              |              |(18) ACK          |
|              |              |              |<-----------------|
|              |              |              |                  |
A              B              C              D               Server

Figure 3: Centralized Signaling, Decentralized Media

7.1 Inviting Users to Join

Inviting users to join works identically to the dial-in conference server scenario of Section 4.1.

7.2 Users Joining

A user joins in the same way described in Section 4.2.

7.3 Scalability

The scalability of this conferencing model depends on many factors. From a media perspective, the conference server never even touches a single media stream. However, for N participants, each participant needs to be able to receive, decode, and mix N-1 media streams. For users accessing the server through dial-in modems, this will severely limit the sizes of these conferences. However, the processing burden is much less than that of the end system mixing model. This is because each end user needs to decode N-1 streams, but only encode 1. Decoding is much, much cheaper than encoding, so supporting many decodes is not necessarily a problem. This is especially the case when silence suppression is in use. In that case, streams are only sent by talking users. This means any given user only needs to decode (and receive) as many streams at a time as there are users talking. This can vastly improve scalability of the conference.

There is a signaling burden on the server, however. If there are N users in the conference, addition of a new user (the N+1th) requires N+3 INVITE transactions, each of which has three messages. Similarly, departure of a user requires N BYE transactions, each of which has 2 messages. For large N, and highly dynamic conferences, this can represent a potential burden. However, we believe this bottleneck is much farther out than the processing and bandwidth bottlenecks at the end users.

For these reasons, we believe this conference model is ideal in corporate enterprises, where bandwidth is more plentiful and PCs are generally faster.

7.4 Location of Service Logic

Nearly all of the logic for implementing this conferencing service lives in the server itself. The only requirement from the end users is that they support multiple, parallel media streams of the same type, and that they be prepared to mix those streams together.
End users must also support the third party call control primitives [13], which do not require anything beyond baseline SIP, but are unlikely to be supported unless explicit steps are taken to do so. It is this combination, no need for media processing in the server together with no need for specialized SIP processing in the end systems, that makes this model attractive.

7.5 Discovering Participant Identities

Participant identities are discovered through RTCP. Each user will receive N-1 RTP streams, each of which has its own RTCP channel that carries the participant identification.

8 Summary of Models

Table 1 shows a summary of the differences between the various models.

Table 1: Summary of Models

Name        signaling  media     inviting   joining    discovering  scale
-------------------------------------------------------------------------
End-Mixing  tree       tree      normal     normal     RTCP         small
                                 invite     invite
Multicast   pairs      m-cast    normal     multicast  RTCP         large
                                 invite     join
Dial-Up     star       star      refer      normal     RTCP         medium
                                            invite
Ad-Hoc      star       star      refer      normal     RTCP         medium
                                            invite
Dial-Out    star       star      refer      normal     RTCP         medium
                                            invite
Decentral   star       fullmesh  refer +    normal     RTCP         medium
                                 server     invite     and
                                 messaging             server msg.

9 What's Missing - Full Mesh

The sections above cover a wide range of conferencing models, but not all of them. One model, in particular, is not supported by SIP. That model is the fully distributed multiparty model. In this conferencing model, each user has a point-to-point SIP relationship with every other user. Each user also has a point-to-point RTP relationship with every other user, as is done in the decentralized conference of Section 7.

Two earlier drafts were written on the subject, but they specified protocols that were overly complex and still had race conditions and unhandled cases. The primary difficulty is that the model requires every participant to learn the identity of every other participant.
As participants come and go, this requires some kind of state flooding mechanism that causes this information to propagate, and eventually converge, across participants. While these kinds of distribution mechanisms have been built for multiparty conferences [14], fitting such a mechanism into SIP is not trivial, especially with the complex requirements that were initially targeted. Furthermore, the distributed nature of the signaling makes enforcement of any kind of conference policy close to impossible.

Failures can also result in unusual conditions. Specifically, it is fairly easy for the conference mesh to break in certain places, resulting in a graph where every user hears most of the other users, but not all. This can happen, for example, if user A is invited into a conference, but is rejected by one of the users already in the conference. Because the SIP relationships are point-to-point, a new user needs to establish a SIP call with all existing participants, and any one of those calls can fail. With large conferences, this becomes a very real possibility. Earlier work tried to avoid such conditions.

We believe a solution can be found by simplifying the requirements. For example, we will abandon the requirement to only add a user to the conference if all other users agree to add them. We will also try to achieve gradual convergence in shared state, rather than the rapid convergence proposed in previous work. We will not worry about message efficiency or message frequency. The primary design objective should be KISS.

As a baseline model, we believe that each INVITE, 200 OK response, and ACK should simply contain a header called Members. This header is a list of URLs, and for each URL, there is a parameter that indicates either that the user is in the conference right now, and when they joined, or that the user was previously in the conference, and when they left.
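
One possible reading of the Members proposal, given purely as a sketch, is that each UA keeps a table keyed by URL and folds every received list into it. The field names, the numeric timestamp, and the rule that the most recent report for a URL wins are all assumptions made for illustration (as is the premise that participants' clocks are roughly comparable); this draft defines neither a syntax nor a merge procedure for the header.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class MemberEntry:
    """One element of a hypothetical Members list."""
    url: str       # participant's SIP URL
    present: bool  # True: in the conference now; False: has left
    since: int     # join time if present, departure time otherwise


def merge_members(local: Dict[str, MemberEntry],
                  received: Iterable[MemberEntry]) -> bool:
    """Fold a received Members list into the local view.

    Assumed rule: for each URL, the entry with the later timestamp wins.
    Returns True if the local view changed, i.e. the UA learned something new.
    """
    changed = False
    for entry in received:
        known = local.get(entry.url)
        if known is None or entry.since > known.since:
            local[entry.url] = entry
            changed = True
    return changed


if __name__ == "__main__":
    view: Dict[str, MemberEntry] = {}
    merge_members(view, [MemberEntry("sip:a@example.com", True, 100)])
    merge_members(view, [MemberEntry("sip:a@example.com", False, 250)])
    print(view["sip:a@example.com"].present)
```

Because stale reports are ignored and repeated reports change nothing, views built this way drift toward agreement as lists are exchanged, which is the gradual convergence the simplified requirements aim for.
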
A UA simply performs a re-INVITE whenever it receives new information. A periodic re-INVITE (a la session timer [15]) will also be needed to heal partitions and deal with other conditions that may arise. More work is needed to validate the model and to see what other capabilities are needed.

10 Security Considerations

The use of a server that performs the mixing on behalf of other users, which is the case for all but one of the conference models described here, introduces security risks. That entity must be trusted by the others to mix the media properly, for example by not omitting a stream. As such, it is recommended that participants in a conference authenticate the identity of the server. In the dial-in, dial-out, and decentralized conferences, this will require authentication of responses by participants.

Mixing in a server also eliminates the privacy that is possible when media is transported end to end and mixed in the receivers. Such privacy is still possible in the large-scale multicast conferences, but requires shared keying material for the conference. Doing this for highly dynamic groups is still an open research problem.

11 Conclusion

In this draft, we have shown how to use baseline SIP (assuming endpoints that support the mixing and/or third party call control feature sets) to construct several multiparty conferencing models. These include end system mixing, large-scale multicast conferences, dial-in conference servers, dial-out conferences, ad-hoc centralized conferences, and centralized signaling, distributed media conferences. We note that this covers all of the multipoint conferencing models described in H.323v1 [16]. Further work is needed to see how (and if) to support the hierarchical conference bridges defined in H.323v2 [17].

12 Authors' Addresses

Jonathan Rosenberg
dynamicsoft
200 Executive Drive
Suite 120
West Orange, NJ 07052
email: jdrosen@dynamicsoft.com

Henning Schulzrinne
Columbia University
M/S 0401
1214 Amsterdam Ave.
New York, NY 10027-7003
email: schulzrinne@cs.columbia.edu

13 Bibliography

[1] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP: session initiation protocol," Request for Comments 2543, Internet Engineering Task Force, Mar. 1999.

[2] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: a transport protocol for real-time applications," Request for Comments 1889, Internet Engineering Task Force, Jan. 1996.

[3] M. Handley and V. Jacobson, "SDP: session description protocol," Request for Comments 2327, Internet Engineering Task Force, Apr. 1998.

[4] M. Handley, C. Perkins, and E. Whelan, "Session announcement protocol," Request for Comments 2974, Internet Engineering Task Force, Oct. 2000.

[5] D. Thaler, M. Handley, and D. Estrin, "The internet multicast address allocation architecture," Request for Comments 2908, Internet Engineering Task Force, Sept. 2000.

[6] W. Fenner, "Internet group management protocol, version 2," Request for Comments 2236, Internet Engineering Task Force, Nov. 1997.

[7] D. Waitzman, C. Partridge, and S. E. Deering, "Distance vector multicast routing protocol," Request for Comments 1075, Internet Engineering Task Force, Nov. 1988.

[8] J. Rosenberg and H. Schulzrinne, "Timer reconsideration for enhanced RTP scalability," in Proceedings of the Conference on Computer Communications (IEEE Infocom), (San Francisco, California), March/April 1998.

[9] J. Rosenberg, P. Mataga, and H. Schulzrinne, "An application server component architecture for SIP," Internet Draft, Internet Engineering Task Force, Nov. 2000. Work in progress.

[10] J. Franks, P. Hallam-Baker, J. Hostetler, S. Lawrence, P. Leach, A. Luotonen, and L. Stewart, "HTTP authentication: Basic and digest access authentication," Request for Comments 2617, Internet Engineering Task Force, June 1999.

[11] R. Sparks, "SIP call control," Internet Draft, Internet Engineering Task Force, Sept. 2000. Work in progress.

[12] E. Guttman, C. Perkins, J. Veizades, and M. Day, "Service location protocol, version 2," Request for Comments 2608, Internet Engineering Task Force, June 1999.

[13] J. Rosenberg, H. Schulzrinne, and J. Peterson, "Third party call control in SIP," Internet Draft, Internet Engineering Task Force, Mar. 2000. Work in progress.

[14] C. Elliott, "A 'sticky' conference control protocol," Internetworking: Research and Experience, Vol. 5, pp. 97-119, 1994.

[15] S. Donovan and J. Rosenberg, "SIP session timer," Internet Draft, Internet Engineering Task Force, Oct. 2000. Work in progress.

[16] International Telecommunication Union, "Visual telephone systems and equipment for local area networks which provide a non-guaranteed quality of service," Recommendation H.323, Telecommunication Standardization Sector of ITU, Geneva, Switzerland, May 1996.

[17] International Telecommunication Union, "Packet based multimedia communication systems," Recommendation H.323, Telecommunication Standardization Sector of ITU, Geneva, Switzerland, Feb. 1998.