Internet Engineering Task Force SIPPING WG Internet Draft J. Rosenberg dynamicsoft draft-rosenberg-sipping-markup-00.txt April 24, 2002 Expires: October 2002 A Framework for Stimulus Signaling in SIP Using Markup STATUS OF THIS MEMO This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress". The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt To view the list Internet-Draft Shadow Directories, see http://www.ietf.org/shadow.html. Abstract In order for SIP applications to work, they will frequently need to collect user input and provide feedback to users. Traditionally, user input has been done in the PSTN through DTMF. Much work has occurred on extending these DTMF models into the domain of SIP, typically by transporting DTMF digits or user input through some SIP message to an application server. We propose a broader framework for stimulus using markup. The approach can support traditional DTMF user input, but also a rich variety of devices, user interfaces and stimulus that goes well beyond DTMF. J. Rosenberg [Page 1] Internet Draft markup April 24, 2002 Table of Contents 1 Introduction ........................................ 3 2 Framework ........................................... 3 2.1 Extensibility ....................................... 6 2.2 Lifecycle ........................................... 7 2.3 Security ............................................ 7 2.4 Feature Interaction ................................. 
8 3 DTMF Input Using DML ................................ 8 3.1 Overview ............................................ 8 3.2 DML Syntax .......................................... 9 4 Example ............................................. 10 4.1 Pre-Paid Calling Card ............................... 10 4.1.1 HTML ................................................ 11 4.1.2 VoiceXML ............................................ 11 4.1.3 DML ................................................. 12 4.2 Voice Recorder ...................................... 15 4.2.1 DML ................................................. 15 4.2.2 HTML Flow ........................................... 17 5 Requirements Analysis ............................... 17 6 Conclusion .......................................... 21 7 To Do ............................................... 21 8 Authors Addresses ................................... 22 9 Normative References ................................ 22 10 Informative References .............................. 22 J. Rosenberg [Page 2] Internet Draft markup April 24, 2002 1 Introduction Stimulus signaling is input provided by a user to a network application, where the user agent has no understanding of the semantics of that user input. It is merely passed blindly to the network application for processing. This is in contrast to functional signaling, where the user agent understands the semantic of the feature the user is trying to invoke, and explicitly requests the network to provide it. Much has been written on the relative pros and cons of both approaches. However, it would appear that both are needed in a complete SIP system. Stimulus signaling in the PSTN has traditionally been done through DTMF input and with speech recognition. In both cases, the user agent (the phone) has no awareness of the input, and merely passes it into the network for consumption. Not surprisingly, a great deal of attention has focused on providing these capabilities in SIP [1]. 
The IETF has standardized techniques for carrying DTMF within RTP [2]. Drafts have been written on how SIP applications, controlled by VoiceXML, can use the speech or DTMF carried in RTP to perform their functions [3] [4]. However, this approach requires that the network application receive the media stream, and process it in order to obtain the stimulus. There has been growing consensus that for DTMF- only applications, this is too heavyweight to be the sole solution for stimulus signaling within SIP. The result has been a number of drafts written on the transport of DTMF (or a generalization of DTMF to user key events) within SIP, rather than within RTP [5] [6]. Most recently, there has been generation of a requirements specification that details the problem that is to be solved [7]. In this draft, we propose a framework for meeting the requirements in [7]. However, we look at the problem more broadly than past solutions. Rather than considering just DTMF, or just a keyboard, or any specific form of user input, we provide a framework for stimulus signaling of any type, with any kind of user interface. The framework is based on the usage of markup languages, such as VoiceXML and HTML (indeed, both of those can be used with the framework). Section 2 discusses the proposed framework. Section 3 considers the problem of DTMF as user input, and proposes the DTMF Markup Language (DML) within the proposed framework. Section 4 provides some application examples using this framework. 2 Framework J. Rosenberg [Page 3] Internet Draft markup April 24, 2002 +------------+------------+ +-------------+ User | User | User | | | Interface |Presentation| Input | | Application | | | | | Server | +------------+------------+ | | | ^ Network V | | | | . Interface . | | | | . . 
| | | +-------------------------+ +---------+---+ ^ V ^ V | | | | | | | | | +----------------+ | | | +----------------------------------+ Figure 1: A Model for Stimulus In the most general sense, a system for providing stimulus signaling to applications can be modeled as shown in Figure 1. In this model, there is a user agent that has a user interface of some sort. This user interface has two components - a presentation component and a user input component. The display component (which may be non- existent in some agents) provides feedback to the user. This feedback could be through speech, it could be through text on a two-line LCD display, it could be text or graphics on a small display, or it could be through a web page on a PC. The presentation component provides sufficient context for the user to provide input back to the application. The user input component is responsible for taking the user input, and sending it to the network application. The user input could be in the form of speech, DTMF, a keypress on a keyboard, or a click on a hyperlink. As the user provides input, this may (or may not) result in a change in the user interface. J. Rosenberg [Page 4] Internet Draft markup April 24, 2002 When the user sends a SIP request that initiates a dialog, this request will be routed through the SIP network. It will potentially pass through one or more application servers en route to the recipient. In this context, an application server is a proxy or a B2BUA which provides one or more features for the benefit of the originator, recipient, or both. These application servers may require user interaction in order to deliver their features. Such interaction requires a user interface providing user input, and potentially providing a presentation component. Each application has its own independent user interface requirements. In this proposed framework, the user interface is provided through markups. 
These markups provide the user with a presentation of the feature, and provide the appropriate context for the user to provide input. User input is transmitted to the application using HTTP form posts. These posts may, in addition to posting user input, also return a new piece of markup for rendering. This is exactly the model used by existing markup languages, including HTML, WML, and VoiceXML.

The user initially obtains the markups through HTTP references provided in headers placed into SIP messages. (For the moment, we assume this is a new header, App-Info. It remains to be determined whether an existing header, specifically Call-Info, can be used here.) An application server that wishes to present the originator with a user interface places the reference in a response, and an application server that wishes to present the recipient with a user interface places the reference in a request. When the user agent invokes the reference to fetch the content, it will obtain the initial markup that presents the user interface.

There can be more than one reference in a SIP message. This can happen if there are multiple applications that wish to be involved in a single dialog. Each reference is marked with an identifier that indicates the name and owner of the application. This allows the UA to render each interface separately, and possibly to discard the reference if the user does not want to interact with that application.

The HTTP URLs handed out in the SIP messages are correlated back to the appropriate application server (and application instance) through implementation specific means. The key idea is that the element which hands out the URL decides how to format it so that the HTTP request routes back to the appropriate server and provides the appropriate context. No specific standardization activities are needed. The behavior of the server on receipt of these HTTP posts does not need to be standardized either.
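Since the correlation between a handed-out URL and an application instance is left to the implementation, the following Python fragment is only one possible sketch of it. The class name InteractionRegistry, the "ctx" query parameter, and the example URL are all hypothetical and are not defined by this framework. The sketch assumes the application server keeps a table keyed by an opaque token that it embeds in each URL.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs


class InteractionRegistry:
    """Hypothetical table mapping handed-out URLs to application context."""

    def __init__(self, base_url):
        self.base_url = base_url
        self._contexts = {}

    def issue_url(self, call_id, app_name):
        # The token is opaque to the user agent; only this server interprets it.
        token = secrets.token_urlsafe(16)
        self._contexts[token] = {"call_id": call_id, "app": app_name}
        return f"{self.base_url}?{urlencode({'ctx': token})}"

    def lookup(self, url):
        # Recover the context when the HTTP request for this URL arrives.
        query = parse_qs(urlparse(url).query)
        token = query.get("ctx", [None])[0]
        return self._contexts.get(token)


registry = InteractionRegistry("http://apps.example.com/ui")
url = registry.issue_url("a84b4c76e66710@example.com", "prepaid")
context = registry.lookup(url)
```

Any other encoding (a path segment, a per-instance host name) would serve equally well, which is why no standardization is proposed here.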
A post operation can result in SIP actions, as needed - hanging up the call, performing a re-INVITE, and so on, in addition to modifying the user interface itself by returning additional markup in the response. J. Rosenberg [Page 5] Internet Draft markup April 24, 2002 2.1 Extensibility It is critical that a variety of different markups can be supported, depending on the capabilities of the user agents. To support that, standard MIME negotiation features are used. In the initial request sent by the UAC, an Accept header is included. This header lists the content types of the markups which are supported by the UAC. It can also use q values to prioritize the ones it supports. For example, consider a VoIP phone with a large display. The phone supports HTML, WML, and DTMF input through DML. An INVITE request generated by the phone might look like, in part: INVITE sip:callee@example.com SIP/2.0 From: sip:caller@example.com Accept: text/html;q=1.0, text/wml;q=0.5, text/dml;q=0.2 This indicates a preference for html, followed by wml, followed by DML. Usage of the Accept header in the request allows the application server to return a URL that points to a markup using one of the supported formats of the UAC. Negotiating the type for usage by the UAS works differently. The application pre-emptively inserts an App- Info header into the request, with multiple URLs listed. Each URL is associated with the different types supported by the application, and is labeled with the type. A q value is used to provide the prioritization from the perspective of the application. The UAS can then select one, and use it. A Require header can be used to force the UAS to reject the request if none of the types are supported, in which case the application falls back to non-markup based methods for user input. This mechanism allows for a variety of different markups to be defined that are ideally suited to the particular user input. 
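The selection of a markup type from the Accept header in a request can be sketched as follows. This is a minimal illustration in Python, not a normative procedure. The function names and the URLs in the "offered" table are hypothetical, and the parser handles only the simple "type;q=value" form shown in the INVITE example.

```python
def parse_accept(header):
    """Return (media_type, q) pairs, most preferred first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # q defaults to 1.0 when the parameter is omitted
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        if media_type:
            prefs.append((media_type, q))
    # sorted() is stable, so types with equal q keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_markup(accept_header, offered):
    """Pick the UAC's most preferred type that the application can serve.

    `offered` maps a media type to the URL of the markup in that format.
    Returns (media_type, url), or None when there is no overlap.
    """
    for media_type, q in parse_accept(accept_header):
        if q > 0 and media_type in offered:
            return media_type, offered[media_type]
    return None


accept = "text/html;q=1.0, text/wml;q=0.5, text/dml;q=0.2"
offered = {
    "text/wml": "http://apps.example.com/ui.wml",
    "text/dml": "http://apps.example.com/ui.dml",
}
choice = choose_markup(accept, offered)
```

With the Accept header from the example and an application that offers only WML and DML, the sketch selects WML, since the phone ranks it above DML. A None result corresponds to the case where the application must fall back to non-markup methods.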
For example, many phones have a UI that supports a set of buttons, each of which is associated with a specific line of text on a display. A markup language can be specified for this environment which provides the text for each button, and a URL to be invoked when the button is pressed.

The usage of markups also facilitates application of the W3C Composite Capability/Preference Profiles (CC/PP) for the server to determine the detailed capabilities of the user agent. A CC/PP document can be included in the initial fetch of the markup, so that the server can tune the markup accordingly. In the case of our phone, it would learn how many buttons the phone has, what kind of display capability it has, and so on, in order to return the proper markup.

2.2 Lifecycle

The user's interaction with the application is based on a well defined lifecycle. The interaction with the application begins when the user agent fetches the first document using an HTTP URL provided to it in a request or response. The interaction continues until one of the following conditions occurs:

   o  The dialog associated with the message in which the first URL was obtained has terminated. The URL need not have been obtained in the request that created the dialog. It could have been provided in a mid-dialog request (an INFO, for example). However, once the dialog terminates, the interaction terminates.

   o  An HTTP POST request is made that generates an error response (any 4xx or 5xx response).

   o  A markup is provided to the user that has a format specific means for terminating the interaction.

In this context, "terminating" the interaction means that the URLs held by the user agent are no longer valid and can no longer be dereferenced. Any pop-up windows or other user interface artifacts should be cleared.

Once terminated, the interaction can be restarted. This is possible only if the dialog is still active.
If it is, the application server can send a new URL in a request or response. Once this URL is fetched, the interaction begins again. OPEN ISSUE: It would be very useful to be able to push URLs in MESSAGE requests [8] that initiate a user interaction with an application. However, these can (and probably should) occur outside of the dialog. How do the lifecycle rules work in that case? How does the user know what dialog the URL is associated with? Does it need to know? Can these be used to support user interfaces that occur totally outside of the scope of a dialog? For example, when the user registers, they get a MESSAGE request with a user interface for activating features, making calls, etc. Do we want to support that as part of the framework? There are big security issues. 2.3 Security It is critical that the user only interact with applications which are legitimately involved in the processing of the session. In this J. Rosenberg [Page 7] Internet Draft markup April 24, 2002 case, "legitimately" means that the interaction is with a server which was on the SIP request path that established the dialog. The benefit of this model is that it maps the authorization policy for application interaction to a known policy. Presumably, the SIP network has been set up to route the requests through servers that are authorized to process the call. Those authorization policies can be enforced through known SIP security mechanisms, such as server to server TLS and IPSec. Thus, a SIP network can already be built to ensure that requests and responses only flow through servers that are authorized to handle those requests and responses. If we specify that the policy for app interaction is the same - a user can only interact with servers that are authorized to handle the requests and responses that establish the dialog - the enforcement of this policy is a solved problem. 
So long as the HTTP URL for interaction with the application is obtained from a SIP message, and not through an out- of-band means, no further security tools are needed to enforce this policy. It is also necessary to secure the interaction with the HTTP server. The server may need to validate that the user of the HTTP client it is interacting with is the same user that was involved in the SIP signaling. Similarly, the HTTP client may need to validate that the server it is communicating with is the one that handed out the URL in the SIP message. These two authentication functions are readily performed with traditional web security techniques - HTTP basic or digest authentication over TLS/SSL. 2.4 Feature Interaction It is fundamental to this framework that the user can interact with multiple applications involved in the same dialog. There need not be any coordination between those applications. However, there are potential issues with feature interaction that are worth noting. One issue is how to determine which application should receive the user input. This is a non-problem with markups that can present a user interface. When the user presses a button or clicks a link, that input is directed to the markup that owns that button or link. The situation is more complex with DTMF input, where there is no way to determine which application the input is meant for. In this case, there is little choice but to send it to both applications. It is very likely that the wrong thing will happen in this case. Hopefully this will convince vendors to move towards user interfaces that don't have this problem. 3 DTMF Input Using DML 3.1 Overview J. Rosenberg [Page 8] Internet Draft markup April 24, 2002 The vast majority of voice devices today have no display, and have a 12 key keypad that allows the user to enter digits. This limited interface has been the model for interaction with communications applications for decades. 
So, while limited, it is critical that it can be well supported by the framework proposed here.

Our approach is to define a DTMF Markup Language (DML), which is extremely simple. It doesn't provide any kind of user interface. It merely presents the UA with a set of digit maps. These digit maps (defined in MGCP [9]) represent a series of digits that the application wishes to find a match against. Effectively, each is a condition that can be satisfied through a sequence of user input. Each digit map is associated with an HTTP URL that is to be invoked when that condition is met. Multiple conditions can be provided, each of which has its own URL. When one of the conditions is met, an HTTP POST is made, passing the digits collected.

There are two modes of operation - stop-and-wait, and immediate. In stop-and-wait, once a condition is matched, any additional user input is buffered locally. The result of the form POST is another DML document, which provides a new set of matching conditions. The buffered digits, along with ones collected subsequently, are then applied against the new conditions. The stop-and-wait approach is consistent with the behavior of most other markups (VoiceXML, HTML, WML), but results in a throughput of one match per RTT.

An application may need to receive each digit, one at a time, and might not be able to wait for the next markup to be returned. In immediate mode, a new markup is not obtained from the POST operation. Matching continues on the existing markup. An additional match causes another form POST, even if the existing one is in progress. A sequence number is provided as part of the POST operation, so that the POSTs can be properly ordered at the application.

It is proposed that DML is based on the digit maps specified in MGCP [9], and that the behavior of a UA interpreting a DML document is identical to that of an MGCP gateway matching digits to a digit map.
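To make the matching of conditions concrete, the following Python sketch matches key presses against a deliberately simplified digit map syntax. It is an illustration under stated assumptions, not the MGCP algorithm: it supports only literal keys, "x" for any digit, and bracketed ranges such as [1-7], and it omits the timer symbol "T" and the "." repetition operator. The names parse_digit_map, classify and DigitCollector, the example URLs, and the choice to discard input once no condition can match are all assumptions of this sketch.

```python
def parse_digit_map(pattern):
    """Return a list of sets, one set of allowed keys per key position."""
    positions = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "x":
            positions.append(set("0123456789"))
        elif ch == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            allowed = set()
            j = 0
            while j < len(body):
                if j + 2 < len(body) and body[j + 1] == "-":
                    lo, hi = ord(body[j]), ord(body[j + 2])
                    allowed.update(chr(c) for c in range(lo, hi + 1))
                    j += 3
                else:
                    allowed.add(body[j])
                    j += 1
            positions.append(allowed)
            i = end  # skip to the closing bracket
        elif ch in "0123456789*#":
            positions.append({ch})
        else:
            raise ValueError(f"unsupported digit map symbol: {ch!r}")
        i += 1
    return positions


def classify(positions, keys):
    """Return 'full', 'partial' or 'none' for the keys pressed so far."""
    if len(keys) > len(positions):
        return "none"
    if any(k not in allowed for k, allowed in zip(keys, positions)):
        return "none"
    return "full" if len(keys) == len(positions) else "partial"


class DigitCollector:
    """Collects key presses and reports the first condition that matches."""

    def __init__(self, conditions):
        # conditions: list of (digit_map, url) pairs
        self.conditions = [(parse_digit_map(p), url) for p, url in conditions]
        self.keys = ""

    def press(self, key):
        """Add one key; return (url, digits) on a match, otherwise None."""
        self.keys += key
        states = [(classify(pos, self.keys), url)
                  for pos, url in self.conditions]
        for state, url in states:
            if state == "full":
                digits, self.keys = self.keys, ""
                return url, digits
        if all(state == "none" for state, _ in states):
            self.keys = ""  # assumption: discard input that cannot match
        return None


collector = DigitCollector([
    ("[1-7]xxx", "http://apps.example.com/extension"),
    ("*xx", "http://apps.example.com/feature"),
])
results = [collector.press(k) for k in "*77"]
```

In this sketch the third key press completes the "*xx" condition, at which point a UA would perform the HTTP POST to the returned URL with the collected digits.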
The digit maps are not limited to just DTMF; other packages could be used for events like hookflash.

Usage of a markup, instead of a dedicated protocol, such as MGCP or MEGACO [10], allows for support of a far wider set of devices within the same framework. However, basing it on an existing approach gives good evidence of correctness and implementability. We are agnostic as to whether MEGACO or MGCP should be used as the base.

3.2 DML Syntax

A DML document is an XML document. All DML documents are encoded in US-ASCII (there is no need for UTF-8), and are well-formed. The top level tag is dml. The dml tag contains a set of conditions, each of which is expressed with a condition tag. The content of the condition tag is the digit map that is to be matched. The condition tag has an attribute, mode, which indicates whether immediate mode or stop-and-wait mode is being used. The default is stop-and-wait. There is also an href attribute that provides an HTTP URL to use for the HTTP form post. The digits collected as a result of the matching operation are posted using the URL parameter "digits".

The following example DML document represents the matching conditions described in the example in Section 2.1.5 of MGCP [9]:

   <dml>
     <condition href="http://server33.apps.example.com?case=1">0T</condition>
     <condition href="http://server33.apps.example.com?case=2">00T</condition>
     <condition href="http://server33.apps.example.com?case=3">[1-7]xxx</condition>
     <condition href="http://server33.apps.example.com?case=4">8xxxxxxx</condition>
     <condition href="http://server33.apps.example.com?case=5">#xxxxxxx</condition>
     <condition href="http://server33.apps.example.com?case=6">*xx</condition>
     <condition href="http://server33.apps.example.com?case=7">91xxxxxxxxxx</condition>
     <condition href="http://server33.apps.example.com?case=8">9011x.T</condition>
   </dml>

If the user pressed *77, the URL http://server33.apps.example.com?case=6&digits=*77 would be invoked by the user agent. This would return the next DML document to execute.

4 Example

The best way to illustrate the proposed framework is with examples.

4.1 Pre-Paid Calling Card

In this example, a user makes a SIP call using a prepaid calling card application. The application supports user input through DML, VoiceXML, and HTML. The HTML version is much richer, allowing the user to enter their prepaid card number through the web, and allowing them to see how much money is remaining through a web page.
They can also make another call by clicking a button, and then entering a new number.

When the client supports VoiceXML, the calling card information is collected through voice prompts and DTMF recognition done locally on the user agent. When the user is done with the call, they can hang up, or press a "long pound" to enter a new number.

When the client supports DML only, the user experience is identical to the VoiceXML case. However, the calling card information is collected by a VoiceXML server in the network. Once the call is established, the long pound is detected through DML running on the client.

4.1.1 HTML

The call flow for this scenario is shown in Figure 2. First, the caller sends an INVITE (1) which is routed to the application server providing the pre-paid calling card application. The INVITE contains an Accept header listing text/html as a supported type. So, the application server generates a 183 (2) with an App-Info header. This header has an HTTP URL which is routed back to the app server. The client fetches this URL (3) and gets back an HTML document (4). This document has a form, thanking the user for making a call, and asking for the destination number and prepaid calling card number. The user types these in, and clicks submit. This results in a form POST (5).

This form POST does two things. First, it returns another HTML document (6). This document indicates the number of minutes that the user has remaining. It also has buttons for determining the time and dollars remaining, and a button for hanging up. Second, the form POST causes the pre-paid application to send an INVITE (as a B2BUA) to a gateway, based on the information entered by the user (7). This call completes normally (8-11) and the user can talk.

At some point, they decide to end the call in order to dial someone else. So, they click the hangup button on the form. This results in another form POST (12).
This causes the application server to terminate the call with the gateway (14), and re-INVITE the caller to put them on hold (16-18). The form POST returns another page (13) which allows the user to enter a new number to call. The user enters a new number, and clicks submit. This results in another form POST (19). This returns an HTML document similar to the one returned in message 6, showing the user how much time remains, and providing a button to hang up. The POST also causes the application server to use third party call control to connect the user to the gateway once more in order to reach the second number (21-26).

4.1.2 VoiceXML

Interestingly, the call flow for this scenario is identical to that of Figure 2. In this case, the caller is calling from a SIP phone that supports local interpretation of VoiceXML. Or, it could be a user on a PSTN phone dialing through a gateway that supports VoiceXML interpretation. Either way, the UAC sends the INVITE request (1). This time, the INVITE contains an Accept header with the type text/vxml listed. The application server sends a 183 (2) with an HTTP URL that can be used to fetch the VoiceXML script. The UA fetches the script (3), and it is returned (4). The script asks the user to enter their calling card number and the phone number to reach. Here, the UA itself performs the text-to-speech (or, it might fetch recorded speech from a web server if that is what the VoiceXML script tells it to do). The UA also performs the DTMF recognition as specified by the VoiceXML script. Pushing this functionality into the UA allows for "voip-free" recognition. No media needs to be sent on the wire, and no compression or DTMF encoding needs to occur. The media is collected locally and interpreted locally, all on the same platform. When all of that is done, the UA performs an HTTP form POST to the URL in the VoiceXML script, which has been crafted to route back to the application.
This form POST can be structured identically to the one used in the HTML example. As such, the pre-paid application can work identically whether the input is collected through voice or a web page.

The application flow proceeds as in the HTML example. However, instead of ending the call by clicking a button on the web page, the user presses a long pound. The VoiceXML document returned in message 6 was written to wait for the user to enter a long pound. The resulting form POST (12) returns another VoiceXML script, which asks the user to enter the next number. This result is POSTed (19), and the call is established.

   Caller           App Server           Gateway
   |(1) INVITE         |                   |
   |------------------>|                   |
   |(2) 183 w. HTTP URL|                   |
   |<------------------|                   |
   |(3) HTTP GET       |                   |
   |------------------>|                   |
   |(4) HTTP 200 OK    |                   |
   |<------------------|                   |
   |(5) HTTP POST      |                   |
   |------------------>|                   |
   |(6) 200 OK         |                   |
   |<------------------|                   |
   |                   |(7) INVITE         |
   |                   |------------------>|
   |                   |(8) 200 OK         |
   |                   |<------------------|
   |                   |(9) ACK            |
   |                   |------------------>|
   |(10) 200 OK        |                   |
   |<------------------|                   |
   |(11) ACK           |                   |
   |------------------>|                   |
   |(12) HTTP POST     |                   |
   |------------------>|                   |
   |(13) HTTP 200 OK   |                   |
   |<------------------|                   |
   |                   |(14) BYE           |
   |                   |------------------>|
   |                   |(15) 200 OK        |
   |                   |<------------------|
   |(16) INVITE        |                   |
   |<------------------|                   |
   |(17) 200 OK        |                   |
   |------------------>|                   |
   |(18) ACK           |                   |
   |<------------------|                   |
   |(19) HTTP POST     |                   |
   |------------------>|                   |
   |(20) HTTP 200 OK   |                   |
   |<------------------|                   |
   |(21) INVITE no SDP |                   |
   |<------------------|                   |
   |(22) 200 OK SDP1   |                   |
   |------------------>|                   |
   |                   |(23) INVITE SDP1   |
   |                   |------------------>|
   |                   |(24) 200 OK SDP2   |
   |                   |<------------------|
   |(25) ACK SDP2      |                   |
   |<------------------|                   |
   |                   |(26) ACK           |
   |                   |------------------>|

   Figure 2: HTML Input Flow

4.1.3 DML

In this case, the user agent only supports DML. To collect the calling card information and dialed number, the application must use a network based VoiceXML server. However, once the call is established, the long pound is detected at the client with DML. Traditionally, detection of the long pound was done by "forking" the media at the UA, sending one stream to the VoiceXML server throughout the duration of the call. The VoiceXML server would hunt for the long pound throughout the call. In this approach, there is no forked media, and there is no involvement from the VoiceXML server after the call is established. This provides a substantial savings of DSP resources and of network bandwidth.

The call flow is shown in Figure 3. The initial INVITE (1) contains an Accept header that indicates support for text/dml. The application server acts as a B2BUA, and connects the caller to a VoiceXML server (2-6). The INVITE towards the VoiceXML server contained an HTTP URL in the request URI [3]. This causes the VoiceXML server to fetch the script by invoking that URL (7). The script that is returned (8) asks the user to enter their calling card number and the destination number. This input is collected over the RTP stream established with the caller. Once collected, the result is POSTed to the application server (9). Once done, the VoiceXML server is no longer needed. So, the application server terminates the dialog with it (11). It then re-INVITEs the caller, putting them on hold (13-15).

   Caller           App Server          VXML Server          Gateway
   |(1) INVITE         |                   |                   |
   |------------------>|                   |                   |
   |                   |(2) INVITE         |                   |
   |                   |------------------>|                   |
   |                   |(3) 200 OK         |                   |
   |                   |<------------------|                   |
   |                   |(4) ACK            |                   |
   |                   |------------------>|                   |
   |(5) 200 OK         |                   |                   |
   |<------------------|                   |                   |
   |(6) ACK            |                   |                   |
   |------------------>|                   |                   |
   |                   |(7) HTTP GET       |                   |
   |                   |<------------------|                   |
   |                   |(8) HTTP 200 OK    |                   |
   |                   |------------------>|                   |
   |                   |(9) HTTP POST      |                   |
   |                   |<------------------|                   |
   |                   |(10) HTTP 200 OK   |                   |
   |                   |------------------>|                   |
   |                   |(11) BYE           |                   |
   |                   |------------------>|                   |
   |                   |(12) 200 OK        |                   |
   |                   |<------------------|                   |
   |(13) INVITE        |                   |                   |
   |<------------------|                   |                   |
   |(14) 200 OK        |                   |                   |
   |------------------>|                   |                   |
   |(15) ACK           |                   |                   |
   |<------------------|                   |                   |
   |(16) INVITE no SDP |                   |                   |
   |w. App-Info        |                   |                   |
   |<------------------|                   |                   |
   |(17) 200 OK SDP1   |                   |                   |
   |------------------>|                   |                   |
   |                   |(18) INVITE SDP1   |                   |
   |                   |-------------------------------------->|
   |                   |(19) 200 OK SDP2   |                   |
   |                   |<--------------------------------------|
   |(20) ACK SDP2      |                   |                   |
   |<------------------|                   |                   |
   |                   |(21) ACK           |                   |
   |                   |-------------------------------------->|
   |(22) HTTP GET      |                   |                   |
   |------------------>|                   |                   |
   |(23) HTTP 200 w.DML|                   |                   |
   |<------------------|                   |                   |
   |(24) HTTP POST     |                   |                   |
   |------------------>|                   |                   |
   |(25) HTTP 200 OK   |                   |                   |
   |<------------------|                   |                   |
   |                   |(26) BYE           |                   |
   |                   |-------------------------------------->|
   |                   |(27) 200 OK        |                   |
   |                   |<--------------------------------------|
   |(28) INVITE no SDP |                   |                   |
   |<------------------|                   |                   |
   |(29) 200 OK SDP3   |                   |                   |
   |------------------>|                   |                   |
   |                   |(30) INVITE SDP3   |                   |
   |                   |------------------>|                   |
   |                   |(31) 200 OK SDP4   |                   |
   |                   |<------------------|                   |
   |(32) ACK SDP4      |                   |                   |
   |<------------------|                   |                   |
   |                   |(33) ACK           |                   |
   |                   |------------------>|                   |

   Figure 3: Prepaid Calling Card using DML

The next step is to connect them to the gateway using third party call control. The INVITE sent to the caller for the 3pcc flow (16) contains an App-Info header. This header contains a URL for a DML document. This document asks the UA to listen for a long pound. An example of how the document might look is:

   #L

After the 3pcc exchange, the caller fetches this document (22,23). The user has their conversation. At the end, instead of hanging up, the user enters a long pound.
This matches the condition in the DML document, causing an HTTP form POST to the application server (24). The application server hangs up with the gateway (26). Once again using third party call control, the application server now connects the caller with the VoiceXML server (28-33). The flow proceeds from here as it does in message 7 onwards. The VoiceXML server would collect the next phone number, and then the application server would connect the caller to the gateway.

4.2 Voice Recorder

Another example application is a voice recorder. The voice recorder application allows a user to record their conversation. We consider two forms of input. In the DML version, the user only uses DTMF to control the recording. They can press 1 to start it, and 2 to stop it. There are no voice prompts or greetings. The user needs to know to press 1 and 2 to start and stop. In the HTML version, the user gets a pop-up console that has buttons to start and stop recording.

The implementation is based on the app component model [4] and uses a conference server component and a recording server component.

4.2.1 DML

A call flow for the DML version of the application is shown in Figure 4. The caller sends an INVITE, which is routed to the application server (1). This INVITE has an Accept header listing text/dml as a supported type. The application server generates a 183 (2) with an App-Info header containing an HTTP URL for the DML document. The application server acts as a B2BUA, and completes the call to the called party (3-7). The caller fetches the DML document (8), which is returned to them (9). The document for this application is straightforward:

   <dml>
     <condition href="http://server33.example.com">1</condition>
   </dml>

At some point during the call, the user presses 1. This matches the first DML condition, causing an HTTP form POST to the URL http://server33.example.com?digits=1 (10). What follows is a series of third party call control exchanges.
These exchanges connect the caller to a conference server, the callee
to the conference server, and then connect a recording server
(controlled by RTSP) to the conference server. This is an
application-unaware conference server as described in [4], which
mixes together the media from all users in the same context. Messages
12-17 connect the caller to the conference server. Messages 18-23
connect the callee to the conference server. Messages 24-29 bring the
recording server into the conference. The controller then uses RTSP
(30) to instruct the recording server to record the media it is
receiving.

Message 11 will also have caused another DML document to be returned
to the caller:

   2

If the user presses 2, this is reported to the application server,
which can stop recording (not shown).

   RTSP usage is most definitely not quite right in this call flow.

4.2.2 HTML Flow

As the reader may expect by now, the call flow for the HTML version
of this application is identical to the DML version. The only
difference is that instead of returning DML documents, the HTTP
operations return HTML documents. These documents have buttons that
cause the appropriate form POSTs for starting and stopping recording.
They could also provide the user with other options, such as a link
to listen to the recorded audio.

5 Requirements Analysis

The following analyzes this framework against the general
requirements outlined in [7]:

R1: The mechanism must support collecting device/user input which is
associated with an established SIP session but must also support
collecting device/user input that is outside of any established
sessions.

   The framework supports collecting input associated with a SIP
   session. It can, without any changes, also support collection of
   input outside of a session, although more thought is needed on the
   security implications of doing so.

R2: The mechanism must transport user indications to network elements
independently of the media plane.

   The framework sends the indications using HTTP, outside of the
   media plane.

R3: The transport mechanism must be sensitive to the limited
bandwidth constraints of some signaling planes, for instance,
reliability through blind retransmission is not acceptable.

   Reliability is provided by TCP. The amount of bandwidth used is
   very small, since the HTTP requests need not contain anything but
   the request line.

Caller   App Server  Conf Server Record Server   Callee
|(1) INVITE  |            |            |            |
|----------->|            |            |            |
|(2) 183     |            |            |            |
|<-----------|            |            |            |
|            |(3) INVITE  |            |            |
|            |------------------------------------->|
|            |(4) 200 OK  |            |            |
|            |<-------------------------------------|
|            |(5) ACK     |            |            |
|            |------------------------------------->|
|(6) 200 OK  |            |            |            |
|<-----------|            |            |            |
|(7) ACK     |            |            |            |
|----------->|            |            |            |
|(8) HTTP GET|            |            |            |
|----------->|            |            |            |
|(9) HTTP 200|            |            |            |
|w. DML      |            |            |            |
|<-----------|            |            |            |
|(10) HTTP POST           |            |            |
|----------->|            |            |            |
|(11) 200 OK |            |            |            |
|<-----------|            |            |            |
|(12) INVITE |            |            |            |
|no SDP      |            |            |            |
|<-----------|            |            |            |
|(13) 200    |            |            |            |
|SDP1        |            |            |            |
|----------->|            |            |            |
|            |(14) INVITE |            |            |
|            |SDP1        |            |            |
|            |----------->|            |            |
|            |(15) 200 OK |            |            |
|            |SDP2        |            |            |
|            |<-----------|            |            |
|            |(16) ACK    |            |            |
|            |----------->|            |            |
|(17) ACK    |            |            |            |
|SDP2        |            |            |            |
|<-----------|            |            |            |
|            |(18) INVITE |            |            |
|            |no SDP      |            |            |
|            |------------------------------------->|
|            |(19) 200    |            |            |
|            |SDP3        |            |            |
|            |<-------------------------------------|
|            |(20) INVITE |            |            |
|            |SDP3        |            |            |
|            |----------->|            |            |
|            |(21) 200 OK |            |            |
|            |SDP4        |            |            |
|            |<-----------|            |            |
|            |(22) ACK    |            |            |
|            |----------->|            |            |
|            |(23) ACK    |            |            |
|            |SDP4        |            |            |
|            |------------------------------------->|
|            |(24) INVITE |            |            |
|            |no SDP      |            |            |
|            |------------------------>|            |
|            |(25) 200    |            |            |
|            |SDP5        |            |            |
|            |<------------------------|            |
|            |(26) INVITE |            |            |
|            |SDP5        |            |            |
|            |----------->|            |            |
|            |(27) 200 OK |            |            |
|            |SDP6        |            |            |
|            |<-----------|            |            |
|            |(28) ACK    |            |            |
|            |----------->|            |            |
|            |(29) ACK    |            |            |
|            |SDP6        |            |            |
|            |------------------------>|            |
|            |(30) RTSP RECORD         |            |
|            |------------------------>|            |
|            |(31) 200 OK |            |            |
|            |<------------------------|            |

        Figure 4: Voice Recorder App using DML

R4: The mechanism must support multiple network entities requesting
and receiving indications independently of each other.

   Each application server can independently request receipt of
   indications, by placing its own HTTP URL in an App-Info header.

R5: A network entity desiring user indications must be able to
request user indications from another network entity. The entity
receiving a request must be able to respond with its
capability/intent to transmit user indications.

   Requesting of user indications is done by passing an HTTP URL to
   the network entity (originator or recipient) from which
   indications are desired. Capability and intent to transmit user
   indications are indicated by fetching the URL.

R6: The mechanism must support filtering so that only user
indications of interest are transmitted.

   For HTML, VoiceXML and WML, user indications are directed, so that
   they are only provided to the application when that is what the
   user really wants. This is the ideal filtering scenario. For user
   interfaces that lack the ability to direct input, such as DTMF,
   the markup provides filtering. DML provides filtering capabilities
   equivalent to those of MGCP.

R7: User activity indications must not be generated unless implicitly
or explicitly requested by an entity.

   User input is only sent if an application has requested it by
   passing an HTTP URL referencing the markup for the interaction.

R8: The mechanism must support user indications via keys or buttons
and at the very least must define support for user interaction via a
standard, generic computer keyboard.

   The framework supports interactions using any kind of input.
   Specific support for DTMF input is specified using DML. If generic
   keyboard input is desired (it is not clear that it is), a markup
   can be defined for it.

R9: The mechanism must support the definition of device and/or
user-specific buttons.

   The framework supports interactions using any kind of input,
   including device or user-specific buttons. A markup would need to
   be defined for it. The author suspects that this will be far more
   complex than it would appear at first glance, given the potential
   variability in the user input capabilities of devices (buttons,
   switches, jog-dials, sliders, etc.).

R10: The mechanism must be extensible so that some non key-based user
indications can be supported in the future, for instance, sliders,
dials or wheels.

   The framework supports interactions using any kind of input,
   including sliders, dials, or wheels.

R11: A requestor must be able to determine the makeup/contents of the
user interface possessed by a target device.

   This is done through the Accept header in the request, which lists
   the set of supported markups. It can also be accomplished at a
   finer level of detail through CC/PP documents present in the form
   post used to retrieve the markup.

R12: The mechanism must support reliable delivery at least as good as
the session control protocol.

   The framework provides fully reliable delivery.

R13: For key-based indications, the mechanism must provide some form
of indication of key press duration.

   The DML capabilities are equivalent to those of MGCP. If key press
   duration is provided in MGCP, it is provided here. If not, the
   markup can be extended to support it.

R14: For key-based indications, the mechanism must provide some form
of indication of relative key-press start time (relative to other key
presses).

   The framework can support any kind of user input as long as a
   suitable markup is defined. If key-based input indications beyond
   DTMF are needed, these features can all be added to the markup.

R15: The receiving application must be able to detect user activity
indication loss due to packet loss from received user activity
indications.

   HTTP is sent over TCP, so user input indications are reliable.

R16: The mechanism must allow for end-to-end security/privacy between
source and destination.

   The exact requirements here need to be defined in more detail. A
   requirement of this level of generality cannot be usefully
   answered.

R17: Both entities must be able to authenticate each other.

   This is done using HTTP Basic/Digest over SSL.

D1: The mechanism should be simple to implement and execute on
devices with simple interfaces.

   The framework supports devices ranging from the profoundly stupid
   to the brilliantly complex.

D2: There should be a separation between the transport mechanism in
the signaling plane and the message syntax.

   Yes. Transport is done using HTTP form posts. The message syntax
   is a function of the markup, which is separate.
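
Because the message syntax depends on the markup (D2) and the
supported markups are advertised in the Accept header (R11), an
application server has to pick a markup before it hands out a URL.
The following sketch shows one way that choice might be made. It is a
simplification under stated assumptions: the function name
choose_markup and the server preference list are hypothetical, media
type parameters such as q-values are ignored rather than honoured,
and only text/dml (from this document) and text/html are used as
example types.

```python
from typing import Optional, Sequence


def choose_markup(accept_header: str,
                  server_preference: Sequence[str]) -> Optional[str]:
    """Pick the first server-preferred markup that the UA listed in Accept.

    Hypothetical helper, not defined by this document.
    """
    offered = set()
    for item in accept_header.split(","):
        # Drop parameters such as ";q=0.5"; this sketch does not rank by q-value.
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type:
            offered.add(media_type)
    for candidate in server_preference:
        if candidate.lower() in offered:
            return candidate
    # None of the markups this application can generate is supported.
    return None


if __name__ == "__main__":
    # A DTMF-only phone that advertises DML support in its INVITE.
    print(choose_markup("application/sdp, text/dml", ["text/html", "text/dml"]))
```

Under these assumptions, a phone that lists only text/dml would be
handed the URL of a DML document, while a UA that lists text/html
would be handed the HTML console. The HTTP transport is the same in
both cases.
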

D3: The mechanism should attempt to reduce recovery delays under
packet loss scenarios.

   The behavior is exactly that provided by TCP.

D4: The mechanism should support routing and identification that is
compatible with use in a SIP-based network.

   Since the HTTP URLs are handed out by the entity that the URLs
   need to route to, there are no routing issues we are aware of.

6 Conclusion

This document has proposed a framework for supporting stimulus for
SIP applications based on markup. This framework supports a broad
range of device capabilities and user input modes, leveraging
existing markup languages (such as HTML, VoiceXML and WML). To handle
simple phones that only support DTMF, we defined a simple DTMF markup
language that provides functionality equivalent to MGCP's digit maps.

7 To Do

   o  More details on DML.

   o  The 3pcc interactions aren't quite right; they use one of the
      flows that is not recommended, for simplicity. Need to upgrade
      them to the proper flows.

   o  More information on CC/PP.

8 Author's Address

   Jonathan Rosenberg
   dynamicsoft
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   email: jdrosen@dynamicsoft.com

9 Normative References

[1] J. Rosenberg, H. Schulzrinne, et al., "SIP: Session initiation
protocol," Internet Draft, Internet Engineering Task Force, Feb.
2002. Work in progress.

10 Informative References

[2] H. Schulzrinne and S. Petrack, "RTP payload for DTMF digits,
telephony tones and telephony signals," RFC 2833, Internet
Engineering Task Force, May 2000.

[3] J. Rosenberg, "A SIP interface to voiceXML dialog servers,"
Internet Draft, Internet Engineering Task Force, July 2001. Work in
progress.

[4] J. Rosenberg, P. Mataga, and H. Schulzrinne, "An application
server component architecture for SIP," Internet Draft, Internet
Engineering Task Force, Mar. 2001. Work in progress.

[5] B. Culpepper, R. Fairlie-Cuninghame, and J. Mule, "SIP event
package for keys," Internet Draft, Internet Engineering Task Force,
Mar. 2002. Work in progress.

[6] R. Mahy, "Signaled digits in SIP," Internet Draft, Internet
Engineering Task Force, Aug. 2001. Work in progress.

[7] B. Culpepper and R. Fairlie-Cuninghame, "Network application
interaction requirements," Internet Draft, Internet Engineering Task
Force, Mar. 2002. Work in progress.

[8] B. Campbell and J. Rosenberg, "Session initiation protocol
extension for instant messaging," Internet Draft, Internet
Engineering Task Force, Apr. 2002. Work in progress.

[9] M. Arango, A. Dugan, I. Elliott, C. Huitema, and S. Pickett,
"Media gateway control protocol (MGCP) version 1.0," RFC 2705,
Internet Engineering Task Force, Oct. 1999.

[10] F. Cuervo, N. Greene, A. Rayhan, C. Huitema, B. Rosen, and J.
Segers, "Megaco protocol version 1.0," RFC 3015, Internet Engineering
Task Force, Nov. 2000.

Full Copyright Statement

Copyright (c) The Internet Society (2002). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.