Internet Engineering Task Force SIP WG Internet Draft Rosenberg/Mataga/Schulzrinne draft-rosenberg-sip-app-components-00.txt dynamicsoft/Columbia U. November 15, 2000 Expires: May 2001 An Application Server Component Architecture for SIP STATUS OF THIS MEMO This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as work in progress. The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract An application server is defined as an entity that is capable of providing advanced features to users. Examples of features include call forwarding, call screening, debit card calling, web interactive voice response, etc. However, the set of functions needed to enable a broad range of such applications is quite large - it includes speech recognition, DTMF recognition and digit collection, text-to-speech synthesis, database interfacing, audio and video coding and decoding, audio and video bridging and mixing, and signaling, to name a few. Supporting such a large set of functions on the same box presents a major challenge. To solve this problem, the industry is proposing a decomposition of the application server into two components - a media server that handles the media component, and an application server that handles the call control, data, and signaling. The interface that has been proposed between these two elements is a control mechanism along the lines of MGCP or Megaco. In this paper, we propose an orthogonal decomposition, which breaks an application Rosenberg/Mataga/Schulzrinne [Page 1] Internet Draft AS Components November 15, 2000 server into application server components. Each component represents a application server in its own right, but it provides a well defined component that by itself may be a complete, but simpler, application. 1 Introduction An observable trend in VoIP systems is the continuing decomposition of monolithic elements into component subparts, with the corresponding development of standardized interfaces between components. This kind of decomposition can be observed in the MGCP/megaco [1] gateway decomposition of a large gateway into a signaling gateway (SG), media gateway (MG) and media gateway controller (MGC), often referred to as a softswitch. Following that decomposition, the softswitch was further decomposed into a pure call control component (still referred to as a softswitch) and an application server (AS), which provides features and services. The AS was then decomposed, breaking it into a signaling piece (still referred to as an application server), and a media server (MS), which provides the media components of applications. Protocols like MGCP [2] and Megaco [3] have been proposed as the interface between an AS and MS. This paper proposes an additional decomposition of an application server into application server components (ASCs). This decomposition is orthogonal to the MS/AS decomposition, and differs significantly in its goals and benefits. The primary motivation is the recognition that most complex (and interesting) applications require a common set of core pieces - speech recognition and text-to-speech, translation services, conference servers, messaging servers, etc. Each of these components is complex and a full-fledged application in its own right. In most cases, a complex application really doesn't care about the details of the operation of the component. In many cases, these components run on separate servers, and often, would be provided by separate providers. What is needed, then, is a well-defined, distributed interface to these application server components. Here, we motivate a distributed decomposition of applications into components, and then show why, for many of these, the interface is ideally suited for a distributed, session establishment and termination interface that follows a standardized pattern of addressing and parameter passing. We believe the Session Initiation Protocol (SIP) [4] is ideally suited for such an interface. 2 Why Decompose The first question to address is "why decompose an application server". Rosenberg/Mataga/Schulzrinne [Page 2] Internet Draft AS Components November 15, 2000 Decomposition is the act of breaking a large, monolithic system into a number of smaller compoents that interact according to specified behaviors. Decomposition of large components offers a number of benefits: Scale. As systems need to serve more and more users, there are two approaches to scaling up. One is to buy increasingly faster hardware, so that the monolithic servers can keep up with increasing use. The second is to distribute the work across components, so that multiple servers perform the work. Distribution is fundamentally cheaper, since the cost of large monolithic systems increases exponentially with capacity, compared to the linear increase in cost with multiple, smaller units. Distribution of work can be done through load balancing, where each server remains homogeneous, but the work is spread across numerous servers, or it can be done through specialization, where the work is split into separate functions, and each function placed on a separate server. Specialization is ideal in cases where the work has different requirements for it to be completed. As an example, a component of an application may require special purpose hardware. This component can distributed to a specialized processor, with a normal off the shelf processor handling the more generic software tasks. Several of the components that we are describing fit into this category (such as the TTS server). Sharing of resources. By decomposing a server into components, a many-to-many interaction between them becomes possible. This means that one component can provide services to many other components. This provides for sharing of resources, which ultimately results in cost reduction. Expertise. Building a complex application requires expertise in call control, media services, compression, web, speech recognition, etc. It is highly unlikely that one organization will have enough expertise in all of these to build them all. By decomposing an application server into subpieces, organizations with expertise in one particular piece can build that one. The result is that the complete system can be composed of best in breed components. Speed of deployment. By decomposing, upgrading existing applications and deploying new ones becomes simpler. The decomposition provides isolation. This isolation means that Rosenberg/Mataga/Schulzrinne [Page 3] Internet Draft AS Components November 15, 2000 one component can be changed or improved without affecting others. That makes it easy to add new features to an application, or to deploy a new one by using components already deployed. Decomposition does have its drawbacks. Primary amongst them is security. In general, the more boxes in a system, and the more they interact with each other, the more complex the security is. As a result, any distributed system has inherently more complex security issues. Another drawback is reliability. A system with multiple boxes, where the system requires all boxes to work in order to function, is less reliable than a system with a single box which must work. 3 Tightly Coupled Decomposition As an example of decomposition, it has been proposed to break the application server into a signaling and control component (the AS), plus a media server component (the MS). This decomposition is shown in Figure 1. Calls arrive at the AS component over SIP. The AS then accesses the MS using MGCP, and learns the IP address and port where the media for the call can be sent. This is returned in the 200 OK response by the AS. The AS then begins to instruct the MS to perform specific functions - collect digits, play tones and announcements, and to report the digits and tones back to the AS for further processing. Typically, the MGCP interface between the two devices is fairly "busy"; there is a lot of messaging for complex applications. In this model, there is a tightly coupled relationship between the MS and AS. The MS cannot function without the AS, and the AS needs to perform tight, low-level controls over the detailed operation of the media server. To some degree, breaking of an application server into these two components represents an implementation detail of how one builds a large, monolithic application server. It is not generally possible for the two components to be owned by separate providers. In fact, it has yet to be shown that complete interoperability and integration is possible with two components from different vendors, let alone different providers. This decomposition also does not provide a true separation of function. Most applications that require media interaction (IVR, credit card and debit card, etc.) have very cleanly separated media phases and signaling phases. The details of the media interactions Rosenberg/Mataga/Schulzrinne [Page 4] Internet Draft AS Components November 15, 2000 .................... . . . +-------------+ . . | | . SIP . | | . -------------+ AS | . . | | . . | | . . | | . . +-------------+ . . | . . | . . | . . |MGCP . . | . . | . . | . . +-------------+ . . | | . . | | . RTP . | | . -------------+ MS | . . | | . . | | . . +-------------+ . . . .................... Complete Application Server Figure 1: MGCP-based decomposition are usually not important to the signaling component, and vice a versa. As an example, consider a debit card application. The application starts with the user making a call. As part of the call processing, interaction is needed with the user via the media stream to determine the debit card number. The precise set of menu operations and interactions used to obtain this number aren't important to the call/signaling processing piece; only the result (the number), is important. Once the number is returned, media processing ceases, and data and call processing commence. The debit card is looked up in a subscriber database, and if enough time Rosenberg/Mataga/Schulzrinne [Page 5] Internet Draft AS Components November 15, 2000 remains, the call is completed. The signaling component monitors the call, and when the card has run out of minutes, the call is terminated. Consider the case where the application provider decides that the menus presented for debit card collection are confusing, and they need to be changed. This change really affects the media processing only; ideally, we would like to have no change whatsoever in the data processing and signaling part of the application. However, in the decomposition afforded by MGCP, the AS component contains both the signaling and call control, in addition to the control of the IVR menus and and processing. Thus, the AS needs to be updated, even though what has changed is really an IVR component. The MGCP decomposition also presents a burden for software developers on the AS. They need to understand, and program, the detailed interactions with the MS that are provided by MGCP, in addition to the detailed signaling and data processing operations. The developers will also need to build and manage the low level state representing the controlled entity, which can be painful. The result is longer development times, less code reuse, and slower innovation. It has been argued that one of the benefits of the MGCP decomposition is that it offloads the "burden" of call control from the media server. However, from a complexity standpoint, the MGCP processing required is probably on par with (if not more than), the simple amount of call control and SIP processing needed if SIP were used directly. From a reliability perspective, an MGCP style decomposition is less desirable. Since the components are strongly coupled, the system will fail so long as any of the pieces fail. Failure can also be introduced because of additional network resources needed for communications between the boxes. The result is that the MGCP decomposition may actually increase the probability of failure, as compared to no decomposition at all. Another decomposition that has been proposed is to break a proxy into a routing and call control component, plus a services component. The interface between the two is then a transactional interface for services, similar in concept to INAP, based upon state transitions within a call model. This is another form of tight coupling, since it requires the services component to have detailed knowledge of the operational model of the call control component. We believe that this decomposition is limiting, for the same reasons the AS/MS decomposition is limiting. 4 The Decoupled Model Rosenberg/Mataga/Schulzrinne [Page 6] Internet Draft AS Components November 15, 2000 4.1 Architecture As a result of this, we see the master/slave decomposition as being ideal for a single vendor to build a large system. However, this decomposition does not solve the other distribution needs we have motivated above. As a result, we propose that the AS be decomposed into an application component responsible for coordinating the overall execution of the application (called the controller), and application server components that provide pieces of the overall application. These components are only loosely coupled with the coordinating application server. The loose coupling implies that the interaction between them is the same as the interaction between the user and the coordinating application server, which is, in turn, the same as the interation between the application server components and other application server components. The components can easily be from separate vendors, and the interactions support the needed security and routing features to allow them to be owned by separate providers, even. The architecture is shown in Figure 2. The goal of the decoupling is to break the application into as coarse-grained pieces as possible. Each component (the coordinator included) should need to know as little as possible about the detailed operations performed by other components. A coarse-grained decomposition means that there is a clean and simple break in the functionality provided by the components. This enables significantly simpler interfaces between those components. Each component is really interested in passing a request for service to another, letting the other component perform its task, and then getting the final result of the task back as an output. From a software engineering perspective, this represents the classic function call; the call signaling component is making a function call to the media part. It is interested only in the return value - the debit card number, for example - and does not really care about the implementation of it. From a protocol perspective, this is a classic client-server system. The client makes a request of the server, and the server does whatever it needs to do to return the final response. The problem more closely resembes the client-server system than the function call, however. This is because we need the interaction to be across the network, rather than between code within the same process. This is because one of the key concepts here is that components can be provided by separate service providers. In such a model, where does the state for the sessions live? Here, we define a session as the complete set of interactions amongst all Rosenberg/Mataga/Schulzrinne [Page 7] Internet Draft AS Components November 15, 2000 components for the delivery of the service. Thus, a session might span multiple protocols, and even multiple calls. Not surprisingly, session state is distributed amongst the components, and the distribution follows the architectural model of Figure 2. The top level server, the controller, maintains the high level pieces of state that deal with overall delivery of the service, and the state required to coordinate the interactions with the component servers. Each component server maintains only the state needed to execute their component, and to manage interactions with components below them. A component server does not know about the complete service being delivered, and does not know about sibling servers. This aspect of our model - hierarchical distribution of session state, leads to one of the primary benefits of the architecture - ease of development. Someone building a new application by reusing existing components only needs to manage the high level state for delivery of the service. State related to the details of operation of one of the components - timings between digits in an IVR server, for example, is not relevant to the coordinator, and does not need to be managed. The difference between classic RPC or client/server interactions and the interactions between the components here is that the relationship between the components represents a long lived association (i.e., a session), during which a session level service is being provided, rather than a simple input/output service. As an example, consider a component providing continuous real-time text-to-speech translation services. The application coordinator that wishes to use this service acts as a client, initiating a request for service to the server (in this case, the TTS server). However, the text is not passed as an "argument" to the TTS server, it is continually streamed for the duration of an active session, and the TTS server would continuously stream back the speech version of the text, which is the output of the service. Another example is a voice messaging server. The messaging server provides basic services like message drop, message retrieve, and message management. Each of these represent procedures that can be executed by a client component. To drop a message, for example, the client component would initiate a session with the messaging server. A prompt would be played over that session, something like "please record your message for Joe now", and then the component takes the media input stream, records it, and saves it. When it is done, the session is terminated. In some cases, the session may require a "side channel" over which intermediate data is passed, needed to control the session interactions from that point forward. IVR is the classic example. In some cases the coordinating application server can kick off the IVR script, and then only get back the final result - a menu option, a Rosenberg/Mataga/Schulzrinne [Page 8] Internet Draft AS Components November 15, 2000 +-----------+ | | | | | AS | |coordinator| | | | | +-----------+ SIP, -- \ --- RTP? -- \ ---- SIP, -- \ ---- RTP? -- \ SIP, ---- -- \ RTP? ---- -- \ -- +----------+ +-----\----+ +----------+ | | | | | | | | | | | | | | | | | | | ASC | | ASC | | ASC | | | | | | | | | | | | | +----------+ +----------+ +----------+ \ / / \\ SIP, / / SIP, \ RTP? // / RTP? \\ / SIP, / \ / RTP? / +----------+ +----------+ | | | | | | | | | | | | | ASC | | ASC | | | | | | | | | +----------+ +----------+ Figure 2: Decoupled Architecture Rosenberg/Mataga/Schulzrinne [Page 9] Internet Draft AS Components November 15, 2000 credit card number, or what have you. In other cases, the coordinating component may need to get intermediate results, so that it can guide the operation of the IVR moving forward. This requires a companion control channel that provides data output from the component server back to the client, and then returns further high level instructions from the client back to the server. There is a thin line in some cases between this control channel and the tightly coupled interactions of a master-slave MGCP relationship. However, the loosely coupled nature of the interaction can be maintained by using coarse-grained data passing over a distributed client-server protocol, such as HTTP or Corba. From this architectural description, it is clear that a client-server session establishment protocol, which allows for passing of parameters that describe service, is the ideal mechanism to coordinate the interaction between components. Clearly, SIP is perfect in such a role. Following the example above, an IVR application server component would be completely responsible for the execution of the IVR piece of an application, including both the media and the signaling call control. It would know the menus to maneuver through, and it would know when to collect digits and present prompts. The coordinating application server would request service from the IVR component by initiating a call to it (possibly using third party call control [5] to direct the media directly to the IVR without passing through itself; more on that below). The application component takes the media from the incoming call, running it against the IVR application. When the IVR is done, the final result - in this case, the credit card number, is passed back to the coordinating AS, possibly throug an HTTP POST operation. The coordinating AS then terminates the call with the IVR. 4.2 Benefits of the Decoupling This decoupled interaction between components provides several important benefits: Separation of Businesses. The decoupled interaction between components is needed to allow the components to be provided by separate providers. Master-slave control interactions do not work well across service providers, let alone across vendors. By allowing separate providers to offer the components, new businesses can be created that specialize in the piece they are providing. Rosenberg/Mataga/Schulzrinne [Page 10] Internet Draft AS Components November 15, 2000 Rapid Development. Since the components can easily be placed in separate boxes from separate vendors, or even in separate providers, we achieve a separation of function that allows each piece to be developed in complete isolation. We also get reuse of components for new applications. This allows for rapid service creation. Better Interoperability. It can be argued that the decoupled interaction between components is more like to be interoperable that a master-slave mechanism. This is largely based on the assumption that a master-slave interaction requires a lot more messaging and exchange between the components, whereas the decoupled client-server mechanism requires less. The fewer information that passes back and forth, the easier it is to interoperate. Architectural Flexibility. The loose coupling of the components means that a server, such as a conferencing application or IVR, need not be implemented as an actual server. Rather, complex networks of components, with proxies providing routing of requests in arbitrarily complex ways, can be built to provide a service. Since the interaction is SIP, the application controller accessing the service doesn't know whether it is communicating with a single server or a network built in this fashion. That allows ASPs flexibility in how they can construct their service networks. Reliability The loose coupling of the components improves reliability compared to a tight coupling. Thats because the system can probably still continue to operate in the failure of a single component. For example, if a TTS server fails during a session, an application server can use a server from a completely different provider, or it can use a media server instead, converting the text to VoiceXML scripts. Depending on the service, the TTS component could possible be skipped altogether. Note, however, that the reliability is still not as good as a monolithic system. Having ten identical boxes each running a complete set of services is better than spreading the service across ten boxes, where some subset cause total failure. 5 Architecture for the Interfaces Up to now, we have been fairly vague about exactly how such an Rosenberg/Mataga/Schulzrinne [Page 11] Internet Draft AS Components November 15, 2000 interface would work in practice. We have argued that it is SIP, but not described in detail how SIP is actually used for this function. SIP (along with SDP [6]) clearly provides the facilities for initiation and termination of the sessions between the controller and components, and for specification of the media addresses to and from which media is sent. However, SIP leaves a lot of flexibility in terms of naming, additional message content, session duration, and control. Here, we discuss each of these in turn. 5.1 Naming In any remote procedure call system, a key component is naming. The identified resource must be properly addressed so that the underlying message passing system can properly determine where the request should go. The same is true in SIP. Messages are routed based on the request URI, as it serves as the primary naming tool for routing messages. In its application to AS component interaction, the request URI serves as the primary tool to identify the resource to which the session is addressed. A critical piece of defining a session level service that can be accessed by SIP is defining the naming of the resources within that service. This point cannot be understated. As an example, consider a conferencing service. In this case, the primary resource that is being accessed is a mixing service. We would like to have a way to identify which conference is being addressed by any given call. All calls for the same conference are all bridged together. By default, the bridging would operate in an N-1 configuration (that is, each user receives a mixed media stream that represents all of the other users besides themself). Conferences can be set up in two ways - ad-hoc, which are not pre-established at all, and exist so long as there is a participant in them, and scheduled, where they exist for a certain period of time. One might imagine that a conferencing service breaks its URI namespace into two pieces - one piece that represents ad-hoc conferences, and another that represents scheduled conferences. Ad- hoc conferences are addressed using a URI of the form .adhoc@conferences.com. All users who initiate a call to the URI sip:as9dahas89.adhoc@conferences.com are bridged together. The conference state is established when the first call to a conference occurs, and destroyed when the last call terminates. In contrast, scheduled conferences might be named by .scheduled@conferences.com, so that a call to sip:conference12.scheduled@conferences.com allows a user access to a pre-arranged conference. Rosenberg/Mataga/Schulzrinne [Page 12] Internet Draft AS Components November 15, 2000 There are several benefits to naming ad-hoc conferences vs. scheduled ones in this fashion. The primary one is convenience; the name makes it the type of conference apparent to any entities that are interested. Secondly, it can avoid certain misconfigurations. Let's say there are no conventions for naming of ad-hoc versus scheduled conferences. I am asked to join a scheduled conference (conf2321@conferences.com), but I mis-type the URL in my browser (conf2123@conferences.com). I don't want this to drop me into an ad- hoc conference where I sit for 15 minutes thinking others will eventually join. If ad-hoc conferences are named differently, a call to cond2123@conferences.com is never going to be an ad-hoc conference, and so my call will be rejected immediately. For an application server to use a conferencing service as a component, the AS must know the URI namespace conventions used to identify the various conferences. The above information, for example, would be provided by the conferencing provider to its customers. This same concept of using the request URI as a service identifier has been described in detail for voicemail systems [7]. The great advantage of using the request URI as a service identifier comes because of the combination of two facts. First, unlike in the PSTN, where numbers are limited, URIs come from an infinite space. They are plentiful, and they are free. Secondly, the primary function of SIP is call routing through manipulations of the request URI. In the traditional SIP application, this URI represents people. However, the URI can also represent services, as we propose here. This means we can apply the routing services SIP provides to routing of calls to services. The result - the problem of service invocation and service location becomes a routing problem, for which SIP provides a scalable and flexible solution. Since there is such a vast namespace of services, we can explicitly name each service in a finely granular way. This allows the distribution of services across the network. In the conferencing example above, since we have separated the names of ad-hoc conferences from scheduled conferences, we can program proxies to route calls for ad-hoc conferences to one set of servers, and calls for scheduled ones to another, possibly even in a different provider. In fact, since each conference itself is given a URI, we can distribute conferences across servers, and easily guarantee that calls for the same conference always get routed to the same server. This is in stark contrast to conferences in the telephone network, where the equivalent of the URI - the phone number - is scarce. An entire conferencing provider generally has one or two numbers. Conference IDs must be obtained through IVR interactions with the caller, or through a human attendant. This makes it difficult to distribute conferences across servers all over the network, since the Rosenberg/Mataga/Schulzrinne [Page 13] Internet Draft AS Components November 15, 2000 PSTN routing only knows about the dialed number. Care must be taken not to push this concept too far. Naming of services should not become so fine-grained that all parameters associated with the service simply become encoded into the request URI as well. The right level of granularity can be determined based on routing. If a service is represented by multiple URLs, but requests for each of those URLs are always routed in the same way, the naming is too fine-grained. 5.2 Additional Message Content Sometimes, connecting to a service requires the service to know additional information that is not appropriate for the request URI. As an example, the conferencing server might need to know the name, address, phone number, company, and email address of the participants, which it converts to speech and uses as an announcement when the user joins and leaves the bridge. This kind of content can easily be carried in the body of the SIP messages used to establish and manage the session with the service. For simple data, SIP headers may be appropriate. In the conferencing example above, the conferencing service might mandate that a vCard be attached to all INVITEs, in order to provide that information. When existing data formats (like a vCard) are not defined to provide the needed information, it can be encoded in an XML document, for example, and carried along in the INVITE. Each service would need to specify the content that it needs in order to process the session invitation. 5.3 Session Duration The duration of the session that is established with a server depends entirely on the nature of the service. For example, for a conference, the initiation of the call begins the mixing service for that user, and the termination of the call results in that user leaving the conference. For an IVR service, the INVITE request begins the interaction with the service. Once the INVITE transaction completes, the IVR would play out the initial prompt, and begin collecting data from the caller. How the IVR terminates depends on its usage. When the initiator of the service is an application server, we would argue that in almost all cases, it should be the responsibility of the controller to determine when the interaction is complete (and thus terminate the call with a BYE). However, when the initiator is an end Rosenberg/Mataga/Schulzrinne [Page 14] Internet Draft AS Components November 15, 2000 user, the IVR will usually be the one to terminate the session. We discuss IVR interactions in more detail below in Section 6.1. 5.4 Third Party Call Control Third party call control, as defined in [5], plays an integral role in this architecture. In many cases, the controller orchestrating a service wishes to invoke the resources of an IVR or conferencing server. However, the AS is not the actual source of the media that drives the IVR. The source of the media is the end user that initiated the call to the controller. What is needed, then, is a way for the AS to call the IVR or conferencing server, and pass it the media information of the end user. Similarly, the media address of the IVR server (described in the SDP from the media server), needs to be passed to the end user that initiated the call. By using third party call control, an application server can direct the media of the end user to and from the components that it is using to provide the application. Once one service is complete, the controller can move the media to a different component. SIP re-INVITEs also allow the controller to request the caller to send multiple media streams, one, for example, containing only DTMF and tones. This allows for DTMF control of services without carrying DTMF in SIP itself. Figure 3 shows how we use a component server to collect DTMF input for a service; specifically, a simple (and perhaps useless) service that allows a caller to press '1' to indicate that they want to put the call on hold. The service is, in principal, useless, since hold is so common that the end user can do this themselves. However, it is useful for example purposes. The caller sends an INVITE request to the called party (1), which is routed to a server handling calls for the domain of the called party. In this case, the server is an application server. The AS decides that it would like to offer the caller advanced services based on DTMF events sent mid-call. As a result, it decides to invoke the services of a media server component. The AS will use third party call control mechanisms to have the caller send any DTMF related media to the media server, in addition to sending its media to the called party. To accomplish this, the AS sends an INVITE to the media server (2), with an indication that the media stream is send only (this is accomplised using the sendonly SDP attribute [6]). The request URI of this INVITE binds that session to a service that looks for any in-band DTMF, and reports it back to the AS through an HTTP GET or POST operation. In section 6.1, we show how this is easily done with a VoiceXML driven IVR server. Rosenberg/Mataga/Schulzrinne [Page 15] Internet Draft AS Components November 15, 2000 The media server responds with a 200 OK (3) that contains SDP with the address where the media should be sent to. The application server ACKs this response (4), and holds on to that SDP. The AS then proxies the original INVITE request (5), and the called party answers the call (6). This acceptance is proxied upstream (7), and then acknowledged (8,9). At this point, media is flowing between the caller and called party (10). The next step for the AS is to get a stream of DTMF digits to flow from the caller to the media server. To do this, it sends a re-INVITE to the caller (11). This re-INVITE contains the same SDP as the response (6) from the called party, but with the addition of a new media line. This media line is audio, and contains a single codec, the RTP payload format for DTMF and tones [8]. The connection address and port are from the SDP returned from the media server. This tells the caller to send an additional media stream to the media server, using only the DTMF codec. The result is that RTP packets are sent only when the caller presses a button on the phone. The caller accepts this re-INVITE (12), and the AS acknowledges it (13). Now, DTMF only RTP is flowing between the caller and the media server (14). At some point later, the caller presses the 1 key (which, for example, might imply call hold). This is processed by the media server, and the result is an HTTP request being sent to the AS (15). The HTTP request contains the value of the collected digit. The AS receives this request, and knows that the user keyed in a 1. Recognizing this input as call hold, the AS sends a re-INVITE to the called party (17). The SDP in this re-INVITE is the same as the SDP in the original INVITE from the called party (1), except that the connection address is set to zero, indicating call hold. The called party accepts the re-INVITE (18), and this is ACKed by the AS (19). The called party is now on hold. Note that the call flow remains unchanged if the stimulus were based on voice recognition instead of DTMF. The only difference would be that a general purpose codec, such as G.711, would be used instead of RFC 2833 for communications between the caller and the media server. This achieves an important unification. Independent of the type of stimulus - voice, DTMF, or, in fact, direct http requests from the caller (if they were using a softphone), the service execution code is unchanged. Others have proposed that DTMF digits be carried in SIP directly from the caller to the AS [9,10]. However, this approach does not work for anything beyond DTMF, while our approach works for DTMF, speech, and web interfaces. Another drawback of the DTMF-in-SIP approach is that all entities on the call signaling path will receive any DTMF digits dialed by the called party. Furthermore, since the caller doesn't know if there is an entity interested in DTMF, it is required Rosenberg/Mataga/Schulzrinne [Page 16] Internet Draft AS Components November 15, 2000 Caller Coordinator Media Server Callee | | | | |(1) SIP INV | | | |--------------->|(2) SIP INV | | | |----------------->| | | |(3) 200 OK | | | |<-----------------| | | |(4) SIP ACK | | | |----------------->| | | |(5) SIP INV | | | |----------------------------------->| | |(6) 200 OK | | |(7) 200 OK |<-----------------------------------| |<---------------| | | |(8) SIP ACK | | | |--------------->|(9) SIP ACK | | | |----------------------------------->| |(10) RTP | | | |.....................................................| | | | | |(11) SIP INV | | | |<---------------| | | |(12) 200 OK | | | |--------------->| | | |(13) SIP ACK | | | |<---------------| | | |(14) RTP | | | |...................................| | | | | | | |(15) HTTP GET | | | |<-----------------| | | |(16) 200 OK | | | |----------------->| | | | | | | |(17) SIP INV | | | |------------------+---------------->| | |(18) 200 OK | | | |<-----------------+-----------------| | |(19) SIP ACK | | | |------------------+---------------->| | | | | | | | | | | | | | | | | | | | | Figure 3: Call Flow for DTMF Enabled Hold Service Rosenberg/Mataga/Schulzrinne [Page 17] Internet Draft AS Components November 15, 2000 to send DTMF within SIP messages all the time, even if no entity is interested. There have been proposals for adding a subscription/notification mechanism on top of this to avoid this problem. However, this further complicates the system by adding a requirement for the caller to support a subscription and notification service just for DTMF. Our approach fits well within the existing SIP framework, and requires no additional work from the end users. Furthermore, it transparently supports multiple application server components receiving DTMF. This is because an AS is able to send a DTMF stream to a component by adding a new media line to the list of media streams being sent by the caller. The list of media streams being sent by the caller is observed by each AS through the initial INVITE, along with any subsequent re-INVITEs which might modify it. Consider the situation with two application servers, A and B, depicted in Figure 4. The original call setup starts with the caller, flows through A, then B, then the called party. At some point later, A sends a re-INVITE (10) to the caller, adding a media stream, just as described in Figure 3. The SDP in this INVITE will be the same as provided by the caller in message (1), plus the additional DTMF stream. Note that this re-INVITE does not pass through B. Now, B decides to add a media stream for DTMF. So, it sends a re-INVITE (13). This goes first to A. As far as A is concerned, this re-INVITE is from the called party. A computes the difference between what it believes the called party should perceive as the set of media streams, and what is in the re-INVITE (13). This difference (the additional DTMF stream added by B) is added to the SDP that A had sent to the caller previously (10), and the result is sent in a re- INVITE to the caller (14). This SDP now contains the media streams meant for the actual called party, along with two DTMF streams; one for A, and one for B. The caller thus sends DTMF to both servers. A further advantage of our approach is that the DTMF can even be sent using multicast, since it is being sent in RTP rather than as part of SIP. This allows for tremendous scalability, if needed, in the number of entites receiving the DTMF streams. 5.5 Side Channels Side channels are used for passing of events from the application server components back to the client, and for passing control commands from the client to the application server component. Unfortunately, side channels complicate the simple session level interface between components. It is our belief, at least for the Rosenberg/Mataga/Schulzrinne [Page 18] Internet Draft AS Components November 15, 2000 components described here, that only minimal side channels are needed. Specifically, the only service below that requires one to be effective is the IVR service, for which HTTP forms an ideal side channel. If the side channel becomes so complex as to introduce extensive synchronization, bandwidth, and transactional issues, the relationship between the components becomes tightly coupled once more, and the benefits we are espousing here begin to disappear. As such, we believe that a reasonable side channel for decoupled server interactions is defined as follows: o The event reporting and control components have no real time requirements. o Event reporting from the component back to the client accessing it are infrequent; specifically, the intervals are much larger than the round trip times between the client and the component. o Control from the client to the component is infrequent; specifically, the intervals are much larger than the round trip times between the client and component. o Event reporting is coarsely granular, so that the client does not need to explicitly subscribe to specific events in order to avoid be overwhelmed with data. o The amount of data passed in both the events and in the control is small. o There are no requirements for transaction support. Note that protocols like MGCP and megaco do not meet these requirements, as they require tight timing, synchronization, and explicit subscriptions. HTTP, as used in VoiceXML, however, does meet these requirements. 6 Patterns for Accessing Components In this section, we propose a set of patterns that define the interaction of a controller with an application server component. These patterns manifest themselves in the description of the service invoked when a session is initiated, a discussion of the naming conventions of the service, and a description of any back channel used for control and data passing. 6.1 Interactive Voice Response Services Rosenberg/Mataga/Schulzrinne [Page 19] Internet Draft AS Components November 15, 2000 Caller A B Callee | | | | |(1) SIP INV | | | |-------------->|(2) SIP INV | | | |--------------->|(3) SIP INV | | | |---------------->| | | |(4) 200 OK | | |(5) 200 OK |<----------------| |(6) 200 OK |<---------------| | |<--------------| | | |(7) SIP ACK | | | |-------------->|(8) SIP ACK | | | |--------------->|(9) SIP ACK | | | |---------------->| |(10) SIP INV | | | |<--------------| | | |(11) 200 OK | | | |-------------->| | | |(12) SIP ACK | | | |<--------------| | | | | | | | |(13) SIP INV | | |(14) SIP INV |<---------------| | |<--------------| | | |(15) 200 OK | | | |-------------->|(16) 200 OK | | | |--------------->| | | |(17) SIP ACK | | |(18) SIP ACK |<---------------| | |<--------------| | | | | | | | | | | | | | | | | | | | | | | Figure 4: Multiple Application Servers and DTMF We have touched upon the basics of the interaction between a controller and an IVR server. The controller initiates a call to the server, the server executes some kind of IVR service, and data is Rosenberg/Mataga/Schulzrinne [Page 20] Internet Draft AS Components November 15, 2000 A number of questions still need to be answered, however: 1. How is the IVR service identified? 2. How can the controller specify the details of the dialog the IVR carries out with the user? 3. How does data from the IVR get passed back to the controller? 4. How is intermediate control performed (e.g., to interrupt or reset IVR based on some event at the controller, in this case)? We believe that VoiceXML [11] represents the ideal partner for SIP in the development of distributed IVR servers. VoiceXML is an XML based scripting language for describing IVR services at an abstract level. VoiceXML supports DTMF recognition, speech recognition, text-to- speech, and playing out of recorded media files. The results of the data collected from the user are passed to a controlling entity through an HTTP form POST operation. The controller can then return another script, or terminate the interaction with the IVR server. From a naming perspective, the primary issue is how a request URI is associated with a script to invoke when the call is answered. We see three primary mechanisms: 1. There is a one-to-one binding of the address in the request URI to a script to execute. These bindings are published by the provider of the IVR service. 2. The initial script to execute is actually carried as content in the body of the SIP INVITE request. The request URI indicates that the desired service is execution of content in the request (i.e., sip:executebody@servers.com). 3. The initial script to execute is fetched by the VoiceXML server; the URL to fetch it from is passed in the SIP INVITE message that initiates the IVR session. This can be accomplished either with the application/uri MIME type as a body, or using the new *-Info headers [12] which provide references to content to fetch. We believe that the third approach is probably the best one. SIP is not the ideal transfer mechanism. Passing a URI allows a far better transfer tool, namely HTTP, to be used to actually fetch the script back from the controller. Rosenberg/Mataga/Schulzrinne [Page 21] Internet Draft AS Components November 15, 2000 HTTP is then also used to pass back form data from the IVR to the controller. The results of the HTTP POST can also contain additional VoiceXML scripts to execute. It represents the side channel discussed in section 5.5 Note that in some cases, there needs to be interactions between the HTTP server that receives the HTTP POST requests, and the controller that initiates and terminates the SIP sessions with the IVR. This is the case when the data collected by the VoiceXML server is used to guide signaling behavior. For example, a pre-paid calling application might use the IVR to collect the users PIN code. The PIN code is looked up, and the number of minutes remaining is determined. This amount of time must be known to the SIP controller, as it will need to hang up the call once this time expires. Some kind of session sharing mechanism is needed between the SIP controller and the HTTP server in this case. Figure 5 shows the interaction between an application server acting in a coordinating role, and an IVR server component. In this example, consider an application where the user makes a call, but the system needs additional information to determine where to forward it to. The user is prompted for the info, and once the name of the desired called party is obtained and looked up, the call is completed to the requested destination. First, in step (1), the caller sends an INVITE to the controller. The controller then creates a brand new call to the IVR application server (2), using the SDP from the INVITE in (1). The IVR accepts the call (3), and the SDP from that acceptance is returned in a 183 response to the caller (4). The call to the IVR is acked (5), and now a media stream exists between the caller and the IVR server. The IVR server, in step (6), fetches the initial VoiceXML script to execute, which is returned by the controller (7). The prompts are played to the caller, and the identity of the called party is collected. This is passed to the controller through another POST (8), which returns an empty VoiceXML script (9)[1] complete, the controller hangs up with it (10 and 11). The information the controller got in the POST (8) is used to determine the next hop SIP server, and the initial INVITE is proxied there (12). Its important to observe the all call control related to executing the service lives within the controlling application server. The IVR _________________________ [1] Note that it is unusual for an empty script to be returned; this is because we want the AS to maintain control of the call signaling Rosenberg/Mataga/Schulzrinne [Page 22] Internet Draft AS Components November 15, 2000 application server deals strictly with the media component. This division of work, as we have discussed above, allows for independent evolution of the call control and media components of services. For example, if the desired called party did not have a reachable SIP address, but they did have an email address, the call could be redirected to a mailto URL. To support this twist, only the controlling application server code need change. The media component remains completely and totally unchanged. Readers familiar with VoiceXML will observe that VoiceXML almost achieves this perfect separation. It lacks any call control excepting a two - for call transfer and call termination. These tags are clearly not sufficient for many services. Our architecture would argue that instead of adding call control to VoiceXML, all control should be removed, so that call control can be left to other server components. The separation of the control from the media component also allows the media component to change without affecting the control component. In fact, because of the http interface between the two, the media server can be completely removed and replaced with a normal web browser, with only a small effect on the call control component. As an example, if the calling party was coming from a web enabled SIP client (known by the presence of the Accept header with text/html as a value in the INVITE request), the controller could return an HTTP URL in the 183 with an actual web form that gets filled out by the caller. This would be instead of using an IVR server to collect the data. Interestingly, the representation of the collected data is identical in both cases. Both use an HTTP POST operation to send the data to the controller. This allows the data collection code in the controller to be unified across both voice access and web access. 6.2 Conferencing Servers Conferencing servers today vary in type and complexity. Some are dialup only, supporting IVR access. Others support ad-hoc conferencing with web interfaces. Others still support three way calling as part of a PBX system. We observe once more that all of these conferencing "servers" are really conferencing applications that are just bundled as a server. These conferencing applications can be decomposed into components in exactly the way we have described above. At the core of each of these conferencing applications is a mixing service. This service is responsible for taking N audio or video streams, mixing them according to some matrix, and returning the mixed stream to each participant. Issues such as conference policy, provisioning of conferences, and authentication are all completely separate and Rosenberg/Mataga/Schulzrinne [Page 23] Internet Draft AS Components November 15, 2000 | INVITE (1) | | |------------------------>| | | | INVITE (2) | | |------------------------->| | | 200 OK (3) | | |<-------------------------| | 183 (4) | | |<------------------------| | | | ACK (5) | | |------------------------->| | MEDIA | | |----------------------------------------------------| | | | | | HTTP GET (6) | | |<-------------------------| | | HTTP 200 OK (7) | | |------------------------->| | | | | | | | | | | | | | | HTTP GET (8) | | |<-------------------------| | | | | | HTTP 200 OK (9) | | |------------------------->| | | | | | BYE (10) | | |------------------------->| | | 200 OK (11) | | |<-------------------------| | | | | | INVITE (12) | | |---------------------------------------> | | | | | | | | | | | | | | | | | | | | | Caller Controller IVR Server Figure 5: Interaction of App Server and IVR Component Rosenberg/Mataga/Schulzrinne [Page 24] Internet Draft AS Components November 15, 2000 outside of this basic mixing component. For this reason, we argue that a large variety of conferencing applications can be easily constructed by having the mixing service as separate application server component. What does the interface to such a mixing server look like? For the call control interface, users would join a conference by calling the server. The server would answer the call, thus appearing as a SIP UAS. The media sent from the user is mixed with other users in the conference, and the media sent back to the user is the mixed stream. The user can leave the conference by sending a BYE to the server, and the server can kick a user out of the conference by sending the user a BYE. Since the primary resource being accessed is a conference, it is no surprise that we would argue that the request URI of an incoming call defines the conference a user is mixed in to. In other words, all users that call the server with the same request URI, are all mixed together. The conferences are not defined by Call-ID or other SIP header fields. Using the request URI has tremendous advtanges from a routing and naming perspective, as we have discussed more generally above. It is not neccesary (in fact, not even advisable), for the conferencing server to require that the URIs that define the conference be set up ahead of time. Conference lifecycles in the mixing server are very simple. Conference state is created when the first call arrives for a particular URI, and ends when the last user with a call to that URI hangs up. This model allows the same mixing server to support both ad-hoc conferences, and pre-arranged conferences too. Pre-arranged conferences are handled through policy and control in a coordinating server external to the mixing server. This server lives entirely in the call control and signaling plane, not in the media plane. SIP (and RTP, of course) alone is not sufficient for complete usage of a conferencing server. Media mixing policies (effectively, the matrix indicating which users hear which other users, and with what relative volumes) need to be set. Information on the status of the conference, such as the identity of the current speaker, number of users currently being mixed, etc., may need to be reported back to some control entity. These represent the requirements for the side channel. In IVR servers, the side channel used HTTP. We argue that to unify these concepts, HTTP is ideally suited here as well. Updates to the mixing policy can be made through HTTP POST requests against the mixing server, using well defined interfaces (possibly SOAP). Similarly, information about the status of the conference can be Rosenberg/Mataga/Schulzrinne [Page 25] Internet Draft AS Components November 15, 2000 obtained through HTTP GET operations against the mixing server. The side channel here meets the requirements outlined in Section 5.5; it is not real time in nature, does not reuqire transactional support, and passes relatively infrequent data and control. In fact, such a side channel will often not be needed at all. In 90 default mixing policy (the so-called N-1 matrix, where each user hears everyone but themselves, all at equal volume, with no floor control) will suffice. Fans of the INFO method [13] will argue that instead of using HTTP for the control, why not INFO? This would eliminate the need for an additional protocol, after all. The answer is the same as to why SIP should not simply replace HTTP - the two have different strengths and weakenesses. SIP is a poor data transfer protocol. It has insufficent support for transfer of medium to large data sets, which is important here. Furthermore, we may want to allow an entity separate from the one that initiated the session to control the session. Usage of INFO would only work from the same device (because of the sequence numbering). In the next few sections, we show how this basic application server component can be used, along with a controller and other components, to build more complex conferencing applications. 6.2.1 Web Scheduled Conference Services In this application, we'd like a conferencing service where all conferences must be pre-scheduled. The pre-scheduling is done through a web page. At the page, the user will enter the start time (but not mandatory stop time) of the conference, the maximum number of attendees, and the identities of the attendees (if known). Once entered in a form, the server returns a SIP URL representing the conference. To implement this, we use an coordinating application server that has a SIP and HTTP interface, along with the mixing application server just described. Figure 6 shows a call flow for this service. A web client is first used to submit the information. Let us suppose a simple case where the conference can have up to two participants, and the conference starts immediately. The HTTP POST representing the form data is sent to the controller (1). It stores the information for the conference in a local data store, and chooses a SIP URL for the conference. This URL can be anything, so long as it is different from any URLs handed out so far by the controller. The URL is returned to the web client in step (2). As an additional convenience feature, the URL could be emailed to the participants. This would require the controller to Rosenberg/Mataga/Schulzrinne [Page 26] Internet Draft AS Components November 15, 2000 have an SMTP interface, in addition to HTTP and SIP. Note that this SIP URL points to the controller, NOT the mixing server. A few moments later, the first participant calls in using a SIP INVITE (3). The call is routed to the controller. It checks the conference ID. It finds that the policy permits up to two participants (not a practical example, but simplifies the call flow). It stores data indicating that one participant has now joined, and the proxies the INVITE request in step (4) to the mixer. The request URI in this request will have the same user part as (3), but the host part now represents the mixer. The mixer receives the INVITE, creates the initial conference state (as this is the first call for that URL), and returns a 200 OK (5), which is forward to the caller (6), and then ACKed (7 and 8). In step (9), the second caller calls in. The controller sees that only one participant is on the call so far, so the second call is accepted. The controller stores the fact that there are now 2 participants, and proxies the INVITE (10). The INVITE is accepted by the mixer (11), and the response forwarded to the second caller (12), and then ACKed (13 and 14). The two participants A and B can now hear each other. A third caller then calls in (15). The controller checks its records, and notices that this conference is now full. So, it rejects the INVITE (16), which is acknowleged (17). The astute reader will observe that, strictly speaking, the HTTP server does not really need to be co-resident with the SIP server in the controller. The initial conference setup can be stored in a database by a web server, and the controller can simply read this database. However, in more complex cases, we may wish to have web access to learn dynamic information about the conference as it progresses (for example, which users are in the conference). For this kind of dynamic session state, using a shared database between components is cumbersome. Rather, an integrated HTTP/SIP server is much better suited, where integrated implies only that it has built in mechanisms for session state sharing between the SIP and HTTP components. For this simple conferencing service, it was sufficient for the controller to act as a proxy. Thats because it does not need to forcibly kick anyone out of the conference once they are in. To support that kind of functionality, third party call control is needed. Let us examine a more complex service in the next section. 6.2.2 Web Scheduled, IVR supported, Time Limited Conference Rosenberg/Mataga/Schulzrinne [Page 27] Internet Draft AS Components November 15, 2000 | | | | (1) HTTP POST | | |--------------------------->| | | | | | (2) 200 OK | | |<---------------------------| | | | | | | | | | | | (3) INVITE | | | |----------------------->| (4) INVITE | | | | | |--------------------->| | | | | | (5) 200 OK | | | | | (6) 200 OK |<---------------------| | |<-----------------------| | | | | | (7) ACK | | | |----------------------->| (8) ACK | | | | | |--------------------->| | | | | | | | | | | (9) INVITE | | | | |------------------->| (10) INVITE | | | | | |--------------------->| | | | | | (11) 200 OK | | | | | (12) 200 OK |<---------------------| | | |<-------------------| | | | | | (13) ACK | | | | |------------------->| (14) ACK | | | | | |--------------------->| | | | | | | | | | | (15) INVITE | | | | | |--------------->| | | | | |(16) 500 Full | | | | | |<---------------| | | | | |(17) ACK | | | | | |--------------->| | | | | | | | | | | | | | | | | | | | Web A B C Controller Mixer Figure 6: Web Scheduled Conference Services Rosenberg/Mataga/Schulzrinne [Page 28] Internet Draft AS Components November 15, 2000 In this more complex example, we once again wish to use a web interface to set up the conferences. However, we wish to add a stop time. If there are participants in the conference when the stop time arrives, a warning announcement is played 10 minutes prior, and then they are kicked off. In addition, when a user joins the conference, before they are added, they hear an announcement that states the name of the person that set up the conference, and what the start and stop times are. They are then asked to speak their name. Then, they are dropped in. The conference server then speaks their name, so that everyone knows who just joined. This seemingly complex service is very easily constructed by adding an IVR server as described above. Now, we have a controller, a mixing server, and an IVR server, all working together to build the service. Each provides a specific component towards the overall solution, yet each is an application server in its own right, with both signaling and media interfaces. We assume that the web setup is done as above. This time, the stop time is provided, along with the name of the person setting up the conference. The call flow for the initial participant is shown in Figure 7. The initial participant sends an INVITE, which is forwarded to the controller. The controller matches the request URI against the conference that the user wishes to join. The controller recognizes that it needs to play an announcement. So, in step (2), it initiates a call to an IVR server. This call is accepted in step (3), and the resulting SDP is passed back to the UAC in step (4) in a provisional response. After ACKing the call with the IVR in step (5), the controller receives an HTTP GET to fetch the root VoiceXML script in step (6). The controller dynamically generates the VoiceXML script, whose content will cause the server to read out "Welcome to the conference, Bob. The call will start at 10 am, and end at 11am.". The name of the caller, Bob, is extracted from the INVITE (1). Once the prompt has been played, the IVR server prompts the caller for their name, and the result is recorded into a file. Then, the VoiceXML server attempts to fetch the next VoiceXML script from the controller (8). Before responding, the controller reconnects the media stream from the media server into the conference bridge. To do this, it first sends an INVITE to the conferencing server, using SDP indicating send only (9). The server accepts (10), and the controller ACKs (11). The SDP from the acceptance (10) is passed in a re-INVITE (12) to the IVR server. The IVR server then accepts (13) and the controller ACKs (14). Now, a unidirectional media stream from the IVR Rosenberg/Mataga/Schulzrinne [Page 29] Internet Draft AS Components November 15, 2000 server into the conference bridge is set up. The controller returns the next VoiceXML script (15), which tells the IVR server to play the previously recorded file into the conference, announcing the joining user. Once this is done, the IVR server fetches the next script (16), and gets back an empty response (17). The controller then disconnects from the IVR server (18,19). Finally, the controller re-INVITEs the conference server (20), updating the SDP to be that from the initial INVITE (1). The SDP from the acceptance (21) is passed on to the caller (22). Now, the caller is connected to the mixer as the first user in the conference. The second user would join in much the same way. Approximately 10 minutes before the end of the conference, a timer fires inside of the controller. It is time to play a warning announcement into the conference. The call flow for this is shown in Figure 8. The basic idea is to initiate a call to the IVR server and mixer, connect them using third party call control, and then have the IVR server play the announcement into the conference. The controller then hangs up. In step (1), the controller sends an INVITE to the mixer with a single audio stream on hold (i.e., "empty"). The request URI of the request is that of the conference. The mixer returns a 200 OK in step (2), and an ACK is sent in (3). The SDP from (2) is then used in step (4) to call the IVR server, which answers with its SDP in step (5). This is used in a re-invite (7,8,9) to the mixer to update the IP address and port as that of the IVR server. The IVR server then fetches the root VoiceXML document from the controller (11). This document instructs the server to read out some kind of conference warning - "Warning, your conference will end in 10 minutes". Once this is done, the IVR server fetches the next document (13), which is empty. The controller then hangs up with both the mixer (17) and the IVR server (19), disconnecting the IVR server from the conference. These examples demonstrate the component model we are proposing. The mixing component does not have application level intelligence. It has a call control interface, allowing it to exist anywhere (and be provided by any ASP service) and yet be a callable resource by other application server components. By combining a controller with an IVR server and the mixing server, complex and useful applications can be constructed in a distributed fashion. 6.3 Continuous Text-to-Speech Rosenberg/Mataga/Schulzrinne [Page 30] Internet Draft AS Components November 15, 2000 Caller Controller IVR Server Mixing Server | | | | | (1) INVITE | | | |-------------->| (2) INVITE | | | |----------------->| | | | (3) 200 OK | | | (4) 183 |<-----------------| | |<--------------| | | | | (5) ACK | | | |----------------->| | | | (6) HTTP GET | | | |<.................| | | | (7) 200 OK | | | |.................>| | | | | | | | (8) HTTP GET | | | |<.................| | | | (9) INVITE | | | |------------------------------------->| | | (10) 200 OK | | | |<-------------------------------------| | | (11) ACK | | | |------------------------------------->| | | (12) INVITE | | | |----------------->| | | | (13) 200 OK | | | |<-----------------| | | | (14) ACK | | | |----------------->| | | | | | | | (15) 200 OK | | | |.................>| | | | (16) HTTP GET | | | |<.................| | | | (17) 200 OK | | | |.................>| | | | (18) BYE | | | |----------------->| | | | (19) 200 OK | | | |<-----------------| | | | (20) INVITE | | | |------------------------------------->| | | (21) 200 OK | | | (22) 200 OK |<-------------------------------------| |<--------------| | | | (23) ACK | | | |-------------->| (24) ACK | | | |------------------------------------->| | | | | | | | | | | | | Caller Controller IVR Server Mixing Server | (1) INVITE empty SDP | | |---------------------->| | | (2) 200 OK SDP A | | |<----------------------| | | (3) ACK | | |---------------------->| | | | (4) INV SDP A | |------------------------------------------------->| | (5) 200 OK SDP B | | |<-------------------------------------------------| | | (6) ACK | |------------------------------------------------->| | (7) INV SDP B | | |---------------------->| | | (8) 200 OK SDP A | | |<----------------------| | | (9) ACK | | |---------------------->| | | | (11) HTTP GET | |<-------------------------------------------------| | | (12) 200 OK | |------------------------------------------------->| | | | | | | | | (13) HTTP GET | |<-------------------------------------------------| | | (14) 200 OK | |------------------------------------------------->| | | | | (15) BYE | | |------------------------------------------------->| | | (16) 200 OK | |<-------------------------------------------------| | (17) BYE | | |---------------------->| | | (18) 200 OK | | |<----------------------| | | | | | | | Controller Mixer IVR Server Figure 8: Advanced Web Scheduled Conference Service: Warning Rosenberg/Mataga/Schulzrinne [Page 32] Internet Draft AS Components November 15, 2000 Announcement Another example of an application server component is a continuous Text-to-Speech (TTS) converter. This kind of service allows a real time text stream (encapsulated in RTP using the RTP payload format for text [14] to be received, which is then converted to speech and returned as an audio stream encoded using a traditional speech codec, be it G.723.1, G.711, or what have you. Like the IVR server and mixing server, the TTS server acts as a user agent server. It answers incoming calls, and basically mirrors incoming text back as speech. It continutes to do so until the call is hung up by the initiating client. A TTS service can be done using VoiceXML with an IVR server, as in the examples above. However, the difference is that here, the text stream to be converted is in the data path, not the control path. The stream is likely to be generated by other entities in the system, not the controller. 6.3.1 Service Interface It is likely that the text-to-speech conversation process differs significantly depending on the language. As such, separate URIs SHOULD be used for language specific TTS services. Specifically, the convention sip:-@ is RECOMMENDED. The language tags SHOULD be selected from the set defined in RFC1766 [15]. One of the unfortunate limitations of SDP is that it is not currently possible for a single media stream to be composed of separate media formats in each direction. The text over RTP stream is, in fact, based on the top level text MIME type (text/t140). As a result, two media streams are needed for this service - a unidirectional audio stream and a unidirectional text stream. First, the client INVITEs the server. The SDP MUST indicate a two media streams. One stream MUST be of type audio. It SHOULD contain the set of audio codecs acceptable to the client. The stream MUST be marked as recv-only. The other stream MUST be of type text. It MUST contain a single codec, which is a dynamic payload number bound to text/t140. The stream MUST be marked as send-only. The 200 OK response from the TTS server that accepts the call has SDP with a two media lines, one of type audio, and one of type text, in the same order the streams appeared in the INVITE, as mandated by RFC2543. The audio stream SHOULD contain a subset of the codecs listed in the audio stream in the INVITE. The audio stream MUST be marked as send- Rosenberg/Mataga/Schulzrinne [Page 33] Internet Draft AS Components November 15, 2000 only. The text stream MUST contain a single codec, which is a dynamic payload type number bound to text/t140. The stream MUST be marked as receive-only. The client then ACKs the request. The TTS server SHOULD attempt to convert all text received on the incoming text stream to speech, and return the resulting speech on the outgoing audio stream. 6.3.2 Hearing Impaired Service The TTS server is extremely useful in supporting hearing impaired services. Examples of such services are described in describes a service where a controller accesses a TTS service. 6.4 Messaging Servers Another type of application server component is a messaging server. Messaging servers allow for callers to record audio messages for users on the system. Users can also call into the server to retrieve these messages, delete them, and file them. The system operates through the use of voice prompts combined with DTMF detection and/or speech recognition. The prompts that are played are context dependent. A messaging server can be viewed as a specialized version of an IVR server with an application specific controller associated with it. In fact, a messaging server can be implemented in this way exactly. However, the combination is also usefully viewed as a component in its own right, due to the frequent need for messaging components in more complex applications. 6.4.1 Service Interface The service interface for communicating with a messaging server is described in detail in [7]. The interface provides well known URIs for the most common resources within a messaging server - user specific message drops with a variety of drop conditions (called party busy, called party not there, etc.), message retrievals using a variety of authentication mechanisms (PIN, SIP level authentication), and message drops that are not user specific, so that the target user is queried for as part of the interface. 6.4.2 Web Enabled Message Drops An example usage of this application component is a web front end that allows users to leave voicemail for company employees through the company web page. The page has a URL for each company employee. If some user A clicks on a URL for employee B, A's phone rings. When A picks up, they hear a greeting to record a message for employee B. Rosenberg/Mataga/Schulzrinne [Page 34] Internet Draft AS Components November 15, 2000 The call flow for this application is the combination of third party call control combined with access to the service. It is shown in Figure 9. The caller, from a web page, clicks on the URL for the user they wish to leave a message for. The result is an HTTP request (1) to the controller. The URI in this request would be some controller-specific identifier that tells the controller what it needs to do. The controller then calls the user (3) using an SDP with a single media stream on hold initially. This is accepted (4), and the resulting SDP is used in an INVITE to the messaging server (6). The URI of this INVITE is that for message drop with standard greeting (sip:sub- jdrosen-deposit@voiceserver.com). The call is accepted (7) and the 200 OK is used in a re-INVITE to the caller (9) to set the address of the media stream to that of the voicemail server. After the call is accepted (10) and ACKed (11), the caller hears the voice drop prompt for the messaging server, and can record their message. 7 Security Considerations In many cases, authorization may need to be made to allow a caller access to a session level resource. Traditional SIP level authentication mechanisms can be used to accomplish this. Note, however, that in many cases the caller is the controller, which is acting as a third party call controller. In these cases, a two level trust model is really needed. The trust relationship in such situations is really between the session level resource and the controller (perhaps through an explicit business arrangement), and then between the controller and the caller. Thus, controllers should authenticate themselves to session resources they contact, rather than trying to proxy credentials from the caller. 8 Conclusion In this paper, we have argued that rapid deployment of complex communications applications will require a distributed model where application components are spread across the network. These components could be offered by separate providers, for example, enabling an ASP component model to evolve. We have observed that many of the components can be described as having some kind of session level resource that can be communicated with, usually in an automated fashion. Access to these resources is typically parameterized. As a result, SIP access, using the request URI as a service indicator, is an ideal way to communicate across these components. To validate this model, we examined the specific service interfaces that would be defined by IVR servers, conferencing servers, text-to- Rosenberg/Mataga/Schulzrinne [Page 35] Internet Draft AS Components November 15, 2000 | | | | | | (1) HTTP GET | | |-------------------->| | | | (2) 200 OK | | |<--------------------| | | | (3) INV | | | |<-------------| | | | (4) 200 OK | | | |------------->| | | | (5) ACK | | | |<-------------| | | | | (6) INV | | | |--------------------->| | | | (7) 200 OK | | | |<---------------------| | | | (8) ACK | | | |--------------------->| | | (9) INV | | | |<-------------| | | | (10) 200 OK | | | |------------->| | | | (11) ACK | | | |<-------------| | | | | | | | | | | | | | Web SIP Controller Messaging Caller Server Figure 9: Web Enabled Message Drops Rosenberg/Mataga/Schulzrinne [Page 36] Internet Draft AS Components November 15, 2000 speech servers and messaging servers. We gave call flows of complex applications built up from these components using the specified interfaces. 9 Author's Addresses Jonathan Rosenberg dynamicsoft 72 Eagle Rock Avenue First Floor East Hanover, NJ 07936 email: jdrosen@dynamicsoft.com Peter Mataga dynamicsoft 72 Eagle Rock Avenue First Floor East Hanover, NJ 07936 email: jdrosen@dynamicsoft.com Henning Schulzrinne Columbia University M/S 0401 1214 Amsterdam Ave. New York, NY 10027-7003 email: schulzrinne@cs.columbia.edu 10 Bibliography [1] N. Greene, M. Ramalho, and B. Rosen, "Media gateway control protocol architecture and requirements," Request for Comments 2805, Internet Engineering Task Force, Apr. 2000. [2] M. Arango, A. Dugan, I. Elliott, C. Huitema, and S. Pickett, "Media gateway control protocol (MGCP) version 1.0," Request for Comments 2705, Internet Engineering Task Force, Oct. 1999. [3] F. Cuervo, N. Greene, C. Huitema, A. Rayhan, B. Rosen, and J. Segers, "Megaco protocol 0.8," Request for Comments 2885, Internet Engineering Task Force, Aug. 2000. [4] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP: Rosenberg/Mataga/Schulzrinne [Page 37] Internet Draft AS Components November 15, 2000 session initiation protocol," Request for Comments 2543, Internet Engineering Task Force, Mar. 1999. [5] J. Rosenberg, H. Schulzrinne, and J. Peterson, "Third party call control in SIP," Internet Draft, Internet Engineering Task Force, Mar. 2000. Work in progress. [6] M. Handley and V. Jacobson, "SDP: session description protocol," Request for Comments 2327, Internet Engineering Task Force, Apr. 1998. [7] B. Campbell and R. Sparks, "Control of service context using SIP Request-URI," Internet Draft, Internet Engineering Task Force, Oct. 2000. Work in progress. [8] H. Schulzrinne and S. Petrack, "RTP payload for DTMF digits, telephony tones and telephony signals," Request for Comments 2833, Internet Engineering Task Force, May 2000. [9] V. Bharatia, E. Cave, and B. Culpepper, "SIP INFO method for event reporting," Internet Draft, Internet Engineering Task Force, Apr. 2000. Work in progress. [10] T. Choudhuri, C. Haun, P. Sollee, S. Orton, and S. Whynot, "SIP INFO method for DTMF digit transport and collection," Internet Draft, Internet Engineering Task Force, Apr. 2000. Work in progress. [11] VoiceXML Forum, "Voice extensible markup language (voicexml) version 1.00," voicexml forum specification, VoiceXML Forum, Mar. 2000. [12] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP: Session initiation protocol," Internet Draft, Internet Engineering Task Force, Aug. 2000. Work in progress. [13] S. Donovan, "The SIP INFO method," Request for Comments 2976, Internet Engineering Task Force, Oct. 2000. [14] G. Hellstrom, "RTP payload for text conversation," Request for Comments 2793, Internet Engineering Task Force, May 2000. [15] H. Alvestrand, "Tags for the identification of languages," Request for Comments 1766, Internet Engineering Task Force, Mar. 1995. Rosenberg/Mataga/Schulzrinne [Page 38] Internet Draft AS Components November 15, 2000 Table of Contents 1 Introduction ........................................ 2 2 Why Decompose ....................................... 2 3 Tightly Coupled Decomposition ....................... 4 4 The Decoupled Model ................................. 6 4.1 Architecture ........................................ 7 4.2 Benefits of the Decoupling .......................... 10 5 Architecture for the Interfaces ..................... 11 5.1 Naming .............................................. 12 5.2 Additional Message Content .......................... 14 5.3 Session Duration .................................... 14 5.4 Third Party Call Control ............................ 15 5.5 Side Channels ....................................... 18 6 Patterns for Accessing Components ................... 19 6.1 Interactive Voice Response Services ................. 19 6.2 Conferencing Servers ................................ 23 6.2.1 Web Scheduled Conference Services ................... 26 6.2.2 Web Scheduled, IVR supported, Time Limited Conference ..................................................... 27 6.3 Continuous Text-to-Speech ........................... 30 6.3.1 Service Interface ................................... 33 6.3.2 Hearing Impaired Service ............................ 34 6.4 Messaging Servers ................................... 34 6.4.1 Service Interface ................................... 34 6.4.2 Web Enabled Message Drops ........................... 34 7 Security Considerations ............................. 35 8 Conclusion .......................................... 35 9 Author's Addresses .................................. 37 10 Bibliography ........................................ 37 Rosenberg/Mataga/Schulzrinne [Page 39]