Internet Engineering Task Force                               SIPPING WG
Internet Draft                                              J. Rosenberg
                                                             dynamicsoft
draft-rosenberg-sipping-markup-00.txt
April 24, 2002
Expires: October 2002


         A Framework for Stimulus Signaling in SIP Using Markup

STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   To view the list of Internet-Draft Shadow Directories, see
   http://www.ietf.org/shadow.html.


Abstract

   In order for SIP applications to work, they will frequently need to
   collect user input and provide feedback to users. Traditionally, user
   input in the PSTN has been collected through DTMF. Much work has
   occurred on extending these DTMF models into the domain of SIP,
   typically by transporting DTMF digits or user input through some SIP
   message to an application server. We propose a broader framework for
   stimulus signaling using markup. The approach can support traditional
   DTMF user input, but also a rich variety of devices, user interfaces,
   and stimuli that go well beyond DTMF.








J. Rosenberg                                                  [Page 1]

Internet Draft                   markup                   April 24, 2002






                           Table of Contents



   1          Introduction ........................................    3
   2          Framework ...........................................    3
   2.1        Extensibility .......................................    6
   2.2        Lifecycle ...........................................    7
   2.3        Security ............................................    7
   2.4        Feature Interaction .................................    8
   3          DTMF Input Using DML ................................    8
   3.1        Overview ............................................    8
   3.2        DML Syntax ..........................................    9
   4          Example .............................................   10
   4.1        Pre-Paid Calling Card ...............................   10
   4.1.1      HTML ................................................   11
   4.1.2      VoiceXML ............................................   11
   4.1.3      DML .................................................   12
   4.2        Voice Recorder ......................................   15
   4.2.1      DML .................................................   15
   4.2.2      HTML Flow ...........................................   17
   5          Requirements Analysis ...............................   17
   6          Conclusion ..........................................   21
   7          To Do ...............................................   21
   8          Authors Addresses ...................................   22
   9          Normative References ................................   22
   10         Informative References ..............................   22























1 Introduction

   Stimulus signaling is input provided by a user to a network
   application, where the user agent has no understanding of the
   semantics of that user input. It is merely passed blindly to the
   network application for processing. This is in contrast to functional
   signaling, where the user agent understands the semantics of the
   feature the user is trying to invoke, and explicitly requests the
   network to provide it. Much has been written on the relative pros and
   cons of both approaches. However, it would appear that both are
   needed in a complete SIP system.

   Stimulus signaling in the PSTN has traditionally been done through
   DTMF input and with speech recognition. In both cases, the user agent
   (the phone) has no awareness of the input, and merely passes it into
   the network for consumption. Not surprisingly, a great deal of
   attention has focused on providing these capabilities in SIP [1]. The
   IETF has standardized techniques for carrying DTMF within RTP [2].
   Drafts have been written on how SIP applications, controlled by
   VoiceXML, can use the speech or DTMF carried in RTP to perform their
   functions [3] [4]. However, this approach requires that the network
   application receive the media stream, and process it in order to
   obtain the stimulus. There has been growing consensus that for DTMF-
   only applications, this is too heavyweight to be the sole solution
   for stimulus signaling within SIP.

   The result has been a number of drafts written on the transport of
   DTMF (or a generalization of DTMF to user key events) within SIP,
   rather than within RTP [5] [6]. Most recently, a requirements
   specification has been written that details the problem to be
   solved [7].

   In this draft, we propose a framework for meeting the requirements in
   [7]. However, we look at the problem more broadly than past
   solutions. Rather than considering just DTMF, or just a keyboard, or
   any specific form of user input, we provide a framework for stimulus
   signaling of any type, with any kind of user interface. The framework
   is based on the usage of markup languages, such as VoiceXML and HTML
   (indeed, both of those can be used with the framework).

   Section 2 discusses the proposed framework. Section 3 considers the
   problem of DTMF as user input, and proposes the DTMF Markup Language
   (DML) within the proposed framework. Section 4 provides some
   application examples using this framework.

2 Framework









                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
            +------------+------------+    +-------------+                
     User   |    User    |    User    |    |             |                
  Interface |Presentation|   Input    |    | Application |                
            |            |            |    |    Server   |                
            +------------+------------+    |             |                
            |     ^   Network  V      |    |             |                
            |     .  Interface .      |    |             |                
            |     .            .      |    |             |                
            +-------------------------+    +---------+---+                
                  ^            V                ^    V                    
                  |            |                |    |                    
                  |            |                |    |                    
                  |            +----------------+    |                    
                  |                                  |                    
                  +----------------------------------+                    
                                                                          



   Figure 1: A Model for Stimulus


   In the most general sense, a system for providing stimulus signaling
   to applications can be modeled as shown in Figure 1. In this model,
   there is a user agent that has a user interface of some sort. This
   user interface has two components - a presentation component and a
   user input component. The presentation component (which may be
   non-existent in some agents) provides feedback to the user. This
   feedback could be through speech, through text on a two-line LCD
   display, through text or graphics on a small display, or through a
   web page on a PC. The presentation component provides
   sufficient context for the user to provide input back to the
   application. The user input component is responsible for taking the
   user input, and sending it to the network application. The user input
   could be in the form of speech, DTMF, a keypress on a keyboard, or a
   click on a hyperlink. As the user provides input, this may (or may
   not) result in a change in the user interface.





   When the user sends a SIP request that initiates a dialog, this
   request will be routed through the SIP network. It will potentially
   pass through one or more application servers en route to the
   recipient. In this context, an application server is a proxy or a
   B2BUA which provides one or more features for the benefit of the
   originator, recipient, or both. These application servers may require
   user interaction in order to deliver their features. Such interaction
   requires a user interface providing user input, and potentially
   providing a presentation component. Each application has its own
   independent user interface requirements.

   In this proposed framework, the user interface is provided through
   markups. These markups provide the user with a presentation of the
   feature, and provide the appropriate context for the user to provide
   input. User input is transmitted to the application using HTTP form
   posts. These posts may, in addition to posting user input, also
   return a new piece of markup for rendering. This is exactly the model
   used by existing markup languages, including HTML, WML, and VoiceXML.

   The user initially obtains the markups through HTTP references
   provided in headers placed into SIP messages. For the moment, we
   assume this is a new header, App-Info; it remains to be determined
   whether an existing header (specifically, Call-Info) can be used
   instead. An application server that wishes to present the originator
   with a user interface places the reference in a response, and an
   application server that wishes to present the recipient with a user
   interface places the reference in a request. When the user agent
   invokes the reference to fetch the content, it will obtain the
   initial markup that presents the user interface. There can be more
   than one reference in a SIP message. This can happen if there are
   multiple applications that wish to be involved in a single dialog.
   Each reference is marked with an identifier that indicates the name
   and owner of the application. This allows the UA to render each
   interface separately, and possibly to discard the reference if the
   user does not want to interact with that application.

   The HTTP URLs handed out in the SIP messages are correlated back to
   the appropriate application server (and application instance) through
   implementation specific means. The key idea is that the element which
   hands out the URL decides how to format it so that the HTTP request
   routes back to the appropriate server and provides the appropriate
   context. No specific standardization activities are needed. The
   behavior of the server on receipt of these HTTP posts does not need
   to be standardized either. A post operation can result in SIP
   actions as needed (hanging up the call, performing a re-INVITE, and
   so on), in addition to modifying the user interface itself by
   returning additional markup in the response.
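
   As a minimal sketch of one such implementation-specific approach (the
   class name, the "ctx" parameter, and the base URL are all
   hypothetical and not defined by this framework), an application
   server could embed an opaque token in each URL it hands out and keep
   the associated context locally:

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit


class UrlIssuer:
    """Hands out HTTP URLs that map back to per-dialog application context.

    Hypothetical helper: the framework leaves URL formatting entirely to
    the element that hands out the URL.
    """

    def __init__(self, base_url):
        self.base_url = base_url
        self._contexts = {}  # opaque token -> application context

    def issue(self, call_id, app_name):
        # An unguessable token avoids exposing dialog state in the URL.
        token = secrets.token_urlsafe(16)
        self._contexts[token] = {"call_id": call_id, "app": app_name}
        return self.base_url + "?" + urlencode({"ctx": token})

    def resolve(self, url):
        # Recover the context for an incoming HTTP request, or None.
        query = parse_qs(urlsplit(url).query)
        token = query.get("ctx", [None])[0]
        return self._contexts.get(token)


issuer = UrlIssuer("http://apps.example.com/ui")
url = issuer.issue("a84b4c76e66710", "prepaid")
context = issuer.resolve(url)
```

   Any other encoding that lets the HTTP request find its way back to
   the right application instance would serve equally well.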






2.1 Extensibility

   It is critical that a variety of different markups can be supported,
   depending on the capabilities of the user agents. To support that,
   standard MIME negotiation features are used. In the initial request
   sent by the UAC, an Accept header is included. This header lists the
   content types of the markups which are supported by the UAC. It can
   also use q values to prioritize the ones it supports. For example,
   consider a VoIP phone with a large display. The phone supports HTML,
   WML, and DTMF input through DML. An INVITE request generated by the
   phone might look like, in part:


   INVITE sip:callee@example.com SIP/2.0
   From: sip:caller@example.com
   Accept: text/html;q=1.0, text/wml;q=0.5, text/dml;q=0.2



   This indicates a preference for HTML, followed by WML, followed by
   DML.
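
   The selection an application server might make from such an Accept
   header can be sketched as follows. This is an illustrative
   simplification, not a complete Accept parser: wildcards and
   parameters other than q are ignored, and the function names are
   hypothetical.

```python
def parse_accept(header_value):
    """Return (media_type, q) pairs ordered from most to least preferred."""
    prefs = []
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # a type listed without a q value defaults to 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        prefs.append((media_type, q))
    # sorted() is stable, so types with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_markup(header_value, server_types):
    """Pick the UAC's most preferred type that the application can produce."""
    for media_type, q in parse_accept(header_value):
        if q > 0 and media_type in server_types:
            return media_type
    return None  # no common markup; fall back to non-markup methods


accept = "text/html;q=1.0, text/wml;q=0.5, text/dml;q=0.2"
chosen = choose_markup(accept, {"text/wml", "text/dml"})
```

   Here an application that can only produce WML and DML would return a
   URL pointing at WML content, since the phone ranks WML above DML.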

   Usage of the Accept header in the request allows the application
   server to return a URL that points to a markup using one of the
   supported formats of the UAC. Negotiating the type for usage by the
   UAS works differently. The application pre-emptively inserts an App-
   Info header into the request, with multiple URLs listed. Each URL is
   associated with the different types supported by the application, and
   is labeled with the type. A q value is used to provide the
   prioritization from the perspective of the application. The UAS can
   then select one, and use it. A Require header can be used to force
   the UAS to reject the request if none of the types are supported, in
   which case the application falls back to non-markup based methods for
   user input.

   This mechanism allows for a variety of different markups to be
   defined that are ideally suited to the particular user input. For
   example, many phones have a UI that supports a set of buttons, each
   of which is associated with a specific line of text on a display. A
   markup language can be specified for this environment which provides
   the text for each button, and a URL to be invoked when the button is
   pressed. The usage of markups also facilitates application of the W3C
   Composite Capability/Preference Profiles (CC/PP) for the server to
   determine the detailed capabilities of the user agent. A CC/PP
   document can be included in the initial fetch of the markup, so that
   the server can tune the markup accordingly. In the case of our phone,
   it would learn how many buttons the phone has, what kind of display
   capability it has, and so on, in order to return the proper markup.





2.2 Lifecycle

   The user's interaction with the application is based on a
   well-defined lifecycle. The interaction with the application begins
   when the user agent fetches the first document using an HTTP URL
   provided to it in a request or response. The interaction continues
   until one of the following conditions occurs:

        o The dialog associated with the message in which the first URL
          was obtained has terminated. The URL need not have been
          obtained in the request that created the dialog. It could have
          been provided in a mid-dialog request (an INFO, for example).
          However, once the dialog terminates, the interaction
          terminates.

        o An HTTP POST request is made that generates an error response
          (any 4xx or 5xx response).

        o A markup is provided to the user that has a format specific
          means for terminating the interaction.

   In this context, "terminating" the interaction means that the URLs
   held by the user agent are no longer valid and can no longer be
   referenced. Any pop-up windows or other user interface artifacts
   should be cleared.

   Once terminated, the interaction can be restarted. This is possible
   only if the dialog is still active. If it is, the application server
   can send a new URL in a request or response. Once this URL is
   fetched, the interaction begins again.
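
   The lifecycle rules above can be summarized as a small state machine
   kept by the user agent. The following is a sketch only; the class,
   state, and method names are hypothetical, and a real user agent would
   also clear any user interface artifacts on termination.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"              # no URL fetched yet
    ACTIVE = "active"          # interaction in progress
    TERMINATED = "terminated"  # URLs held by the UA are no longer valid


class Interaction:
    def __init__(self):
        self.state = State.IDLE
        self.dialog_active = True
        self.urls = []

    def on_url_fetched(self, url):
        # Fetching a URL obtained from a SIP message starts (or restarts)
        # the interaction, but only while the dialog is still active.
        if not self.dialog_active:
            return False
        self.urls = [url]
        self.state = State.ACTIVE
        return True

    def on_post_response(self, status_code):
        # Any 4xx or 5xx response to an HTTP POST ends the interaction.
        if self.state is State.ACTIVE and 400 <= status_code <= 599:
            self._terminate()

    def on_markup_terminate(self):
        # The markup used a format-specific means to end the interaction.
        if self.state is State.ACTIVE:
            self._terminate()

    def on_dialog_terminated(self):
        self.dialog_active = False
        if self.state is State.ACTIVE:
            self._terminate()

    def _terminate(self):
        self.urls = []
        self.state = State.TERMINATED
```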


        OPEN ISSUE: It would be very useful to be able to push URLs
        in MESSAGE requests [8] that initiate a user interaction
        with an application. However, these can (and probably
        should) occur outside of the dialog. How do the lifecycle
        rules work in that case? How does the user know what dialog
        the URL is associated with? Does it need to know? Can these
        be used to support user interfaces that occur totally
        outside of the scope of a dialog? For example, when the
        user registers, they get a MESSAGE request with a user
        interface for activating features, making calls, etc. Do we
        want to support that as part of the framework? There are
        big security issues.

2.3 Security

   It is critical that the user only interact with applications which
   are legitimately involved in the processing of the session. In this
   case, "legitimately" means that the interaction is with a server
   which was on the SIP request path that established the dialog. The
   benefit of this model is that it maps the authorization policy for
   application interaction to a known policy. Presumably, the SIP
   network has been set up to route the requests through servers that
   are authorized to process the call. Those authorization policies can
   be enforced through known SIP security mechanisms, such as
   server-to-server TLS and IPsec. Thus, a SIP network can already be
   built to ensure that requests and responses only flow through servers
   that are authorized to handle those requests and responses. If we
   specify that the policy for application interaction is the same - a
   user can only interact with servers that are authorized to handle the
   requests and responses that establish the dialog - the enforcement of
   this policy is a solved problem. So long as the HTTP URL for
   interaction with the application is obtained from a SIP message, and
   not through out-of-band means, no further security tools are needed
   to enforce this policy.

   It is also necessary to secure the interaction with the HTTP server.
   The server may need to validate that the user of the HTTP client it
   is interacting with is the same user that was involved in the SIP
   signaling. Similarly, the HTTP client may need to validate that the
   server it is communicating with is the one that handed out the URL in
   the SIP message. These two authentication functions are readily
   performed with traditional web security techniques - HTTP basic or
   digest authentication over TLS/SSL.

2.4 Feature Interaction

   It is fundamental to this framework that the user can interact with
   multiple applications involved in the same dialog. There need not be
   any coordination between those applications. However, there are
   potential issues with feature interaction that are worth noting.

   One issue is how to determine which application should receive the
   user input. This is a non-problem with markups that can present a
   user interface. When the user presses a button or clicks a link, that
   input is directed to the markup that owns that button or link. The
   situation is more complex with DTMF input, where there is no way to
   determine which application the input is meant for. In this case,
   there is little choice but to send it to both applications. It is
   very likely that the wrong thing will happen in this case. Hopefully
   this will convince vendors to move towards user interfaces that don't
   have this problem.

3 DTMF Input Using DML

3.1 Overview





   The vast majority of voice devices today have no display, and have a
   12-key keypad that allows the user to enter digits. This limited
   interface has been the model for interaction with communications
   applications for decades. So, while limited, it is critical that it
   can be well supported by the framework proposed here.

   Our approach is to define a DTMF Markup Language (DML), which is
   extremely simple. It doesn't provide any kind of user interface. It
   merely presents the UA with a set of digit maps. These digit maps
   (defined in MGCP [9]) represent a series of digits that the
   application wishes to find a match against. Effectively, each is a
   condition that can be satisfied through a set of user input. Each
   digit map is associated with an HTTP URL that is to be invoked when
   that condition is met. Multiple conditions can be provided, each of
   which has its own URL. When one of the conditions is met, an HTTP
   POST is made, passing the digits collected. There are two modes of
   operation - stop-and-wait, and immediate. In stop-and-wait, once a
   condition is matched, any additional user input is buffered locally.
   The result of the form POST is another DML document, which provides a
   new set of matching conditions. The buffered digits, along with ones
   collected subsequently, are then applied against the new conditions.
   The stop-and-wait approach is consistent with the behavior of most
   other markups (VoiceXML, HTML, WML), but results in a throughput of
   one match per RTT. An application may need to receive each digit, one
   at a time, and might not be able to wait for the next markup to be
   returned. In immediate mode, a new markup is not obtained from the
   POST operation. Matching continues on the existing markup. An
   additional match causes another form POST, even if the existing one
   is in progress. A sequence number is provided as part of each POST
   operation, so that the POSTs can be properly ordered at the
   application.
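
   As an illustration of how an application might restore ordering in
   immediate mode, the sketch below buffers POSTs that arrive early.
   It assumes sequence numbers start at 1 and increase by one per POST,
   which this draft does not specify; the class name is hypothetical.

```python
class PostReorderer:
    """Releases immediate-mode POSTs to the application in sequence order."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq  # assumed starting value, see lead-in
        self.pending = {}          # seq -> digits received out of order

    def receive(self, seq, digits):
        """Record one POST; return the digit strings now deliverable."""
        if seq < self.next_seq:
            return []  # duplicate or stale POST, already delivered
        self.pending[seq] = digits
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


reorderer = PostReorderer()
first = reorderer.receive(2, "5")   # arrives early, held back
second = reorderer.receive(1, "4")  # fills the gap, releases both
```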

   It is proposed that DML is based on the digit maps specified in MGCP
   [9], and that the behavior of a UA interpreting a DML document is
   identical to that of an MGCP gateway matching digits to a digit map.
   The digit maps are not limited to just DTMF; other packages could be
   used for events like hookflash. Usage of a markup, instead of a
   dedicated protocol, such as MGCP or MEGACO [10], allows for support
   of a far wider set of devices within the same framework. However,
   basing it on an existing approach gives good evidence of correctness
   and implementability. We are agnostic as to whether MEGACO or MGCP
   should be used as the base.
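
   A simplified sketch of digit map matching follows. It translates
   each digit map into a regular expression and reports the first map
   that the collected input matches in full. It assumes "x" matches any
   digit, "." means zero or more repetitions of the preceding element,
   and the timer event is represented by a literal "T" in the input.
   The incremental matching and timer handling of a real MGCP gateway
   are not reproduced, and the function names are hypothetical.

```python
import re


def digit_map_to_regex(pattern):
    """Translate one digit map string into a compiled regular expression."""
    out = []
    in_range = False
    for ch in pattern:
        if in_range:
            out.append(ch)  # copy range contents such as 1-7 unchanged
            if ch == "]":
                in_range = False
        elif ch == "[":
            in_range = True
            out.append(ch)
        elif ch in "xX":
            out.append("[0-9]")  # any single digit
        elif ch == ".":
            out.append("*")      # repeat the preceding element
        else:
            out.append(re.escape(ch))  # literal digit, *, #, or T
    return re.compile("".join(out))


def first_match(digit_maps, collected):
    """Return the first digit map fully matched by the collected input."""
    for digit_map in digit_maps:
        if digit_map_to_regex(digit_map).fullmatch(collected):
            return digit_map
    return None


maps = ["0T", "00T", "[1-7]xxx", "8xxxxxxx", "#xxxxxxx", "*xx",
        "91xxxxxxxxxx", "9011x.T"]
matched = first_match(maps, "*77")
```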

3.2 DML Syntax

   A DML document is an XML document. All DML documents are encoded in
   US-ASCII (there is no need for UTF-8), and are well-formed. The top
   level tag is dml. The dml tag contains a set of conditions, each of
   which is expressed with a condition tag. The content of the condition
   tag is the digit map that is to be matched. The condition tag has an
   attribute, mode, which indicates whether immediate mode or stop-and-
   wait mode is being used. The default is stop-and-wait. There is also
   an href attribute that provides an HTTP URL to use for the HTTP form
   post. The digits collected as a result of the matching operation are
   posted using the URL parameter "digits".

   The following example DML document represents the matching conditions
   described in the example in Section 2.1.5 of MGCP [9]:


   <dml>
   <condition href="http://server33.apps.example.com?case=1">0T</condition>
   <condition
   href="http://server33.apps.example.com?case=2">00T</condition>
   <condition
   href="http://server33.apps.example.com?case=3">[1-7]xxx</condition>
   <condition
   href="http://server33.apps.example.com?case=4">8xxxxxxx</condition>
   <condition
   href="http://server33.apps.example.com?case=5">#xxxxxxx</condition>
   <condition
   href="http://server33.apps.example.com?case=6">*xx</condition>
   <condition
   href="http://server33.apps.example.com?case=7">91xxxxxxxxxx</condition>
   <condition
   href="http://server33.apps.example.com?case=8">9011x.T</condition>
   </dml>



   If the user pressed *77, the URL
   http://server33.apps.example.com?case=6&digits=*77 would be invoked
   by the user agent. This would return the next DML document to
   execute.
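
   A user agent's handling of such a document can be sketched as
   follows. The sketch parses the conditions and builds the URL to
   invoke once a match has been found. The attribute value
   "stop-and-wait" for the default mode and the percent-encoding of
   characters such as "#" are assumptions, and the function names are
   hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def parse_dml(text):
    """Return the conditions of a DML document as a list of dicts."""
    root = ET.fromstring(text)
    if root.tag != "dml":
        raise ValueError("top-level tag must be dml")
    conditions = []
    for cond in root.findall("condition"):
        conditions.append({
            "digit_map": (cond.text or "").strip(),
            "href": cond.get("href"),
            # mode defaults to stop-and-wait when the attribute is absent
            "mode": cond.get("mode", "stop-and-wait"),
        })
    return conditions


def post_url(href, digits):
    """Append the collected digits as the "digits" URL parameter."""
    separator = "&" if "?" in href else "?"
    # "*" is left as-is to mirror the example; "#" must be escaped
    # because it would otherwise start a URL fragment.
    return href + separator + "digits=" + quote(digits, safe="*")


SAMPLE = """<dml>
<condition href="http://server33.apps.example.com?case=5">#xxxxxxx</condition>
<condition href="http://server33.apps.example.com?case=6">*xx</condition>
</dml>"""

conditions = parse_dml(SAMPLE)
target = post_url(conditions[1]["href"], "*77")
```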

4 Example

   The best way to illustrate the proposed framework is with examples.

4.1 Pre-Paid Calling Card

   In this example, a user makes a SIP call using a prepaid calling card
   application. The application supports user input through DML,
   VoiceXML, and HTML. The HTML version is much richer, allowing the
   user to enter their prepaid card number through the web, and allowing
   them to see how much money is remaining through a web page. They can
   also make another call by clicking a button, and then entering a new
   number. When the client supports VoiceXML, the calling card
   information is collected through voice prompts and DTMF recognition
   done locally on the user agent. When the user is done with the call,
   they can hang up, or press a "long pound" to enter a new number. When
   the client supports DML only, the user experience is identical to the
   VoiceXML case. However, the calling card information is collected by
   a VoiceXML server in the network. Once the call is established, the
   long pound is detected through DML running on the client.

4.1.1 HTML


   The call flow for this scenario is shown in Figure 2. First, the
   caller sends an INVITE (1) which is routed to the application server
   providing the pre-paid calling card application. The INVITE contains
   an Accept header listing text/html as a supported type. So, the
   application server generates a 183 (2) with an App-Info header. This
   header has an HTTP URL which is routed back to the app server. The
   client fetches this URL (3) and gets back an HTML document (4). This
   document has a form, thanking the user for making a call, and asking
   for the destination number and prepaid calling card number. The user
   types these in, and clicks submit. This results in a form POST (5).
   This form POST does two things. First, it returns another HTML
   document (6). This document indicates the number of minutes that the
   user has remaining. It also has buttons for determining the time and
   dollars remaining, and a button for hanging up. Second, the form POST
   causes the pre-paid application to send an INVITE (as a B2BUA) to a
   gateway, based on the information entered by the user (7). This call
   completes normally (8-11) and the user can talk. At some point, they
   decide to end the call in order to dial someone else. So, they click
   the hangup button on the form. This results in another form POST
   (12). This causes the application server to terminate the call with
   the gateway (14), and re-INVITE the caller to put them on hold
   (16-18). The form POST also returns another page (13) which allows
   the user to enter a new number to call. The user enters a new number,
   and clicks submit. This results in another form POST (19). This
   returns an HTML document similar to the one returned in message 6,
   showing the user how much time remains, and providing a button to
   hang up. The POST also causes the application server to use third
   party call control to connect the user to the gateway once more in
   order to reach the second number (21-26).

4.1.2 VoiceXML

   Interestingly, the call flow for this scenario is identical to that
   of Figure 2! In this case, the caller is calling from a SIP phone
   that supports local interpretation of VoiceXML. Or, it could be a
   user on a PSTN phone dialing through a gateway that supports VoiceXML
   interpretation. Either way, the UAC sends the INVITE request (1).
   This time, the INVITE contains an Accept header with the type
   text/vxml listed. The application server sends a 183 (2) with an HTTP
   URL that can be used to fetch the VoiceXML script. The UA fetches the
   script (3), and it is returned (4). The script asks the user to enter
   their calling card number and the phone number to reach. Here, the UA
   itself performs the text-to-speech (or, it might fetch recorded
   speech from a web server if that is what the VoiceXML script tells it
   to do). The UA also performs the DTMF recognition as specified by the
   VoiceXML script. Pushing this functionality into the UA allows for
   "voip-free" recognition. No media needs to be sent on the wire, no
   compression or DTMF encoding needs to occur. The media is collected
   locally and interpreted locally, all on the same platform. When all
   of that is done, the UA performs an HTTP form POST to the URL in the
   VoiceXML script, which has been crafted to route back to the
   application. This form POST can be structured identically to the one
   used in the HTML example. As such, the pre-paid application can work
   identically whether the input is collected through voice or a web
   page.

   The application flow proceeds as in the HTML example. However,
   instead of ending the call by clicking a button on the web page, the
   user presses a long pound. The VoiceXML document returned in message
   6 was written to wait for the user to enter a long pound. The
   resulting form POST (12) returns another VoiceXML script, which asks
   the user to enter the next number. This result is POSTed (19), and
   the call is established.

4.1.3 DML

   In this case, the user agent only supports DML. To collect the credit
   card information and dialed number, the application must use a
   network based VoiceXML server. However, once the call is established,
   the long pound is detected at the client with DML. Traditionally,
   detection of the long pound was done by "forking" the media at the
   UA, sending one stream to the VoiceXML server throughout the duration
   of the call. The VoiceXML server would hunt for the long-pound
   throughout the call. In this approach, there is no forked media, and
   there is no involvement from the VoiceXML server after the call is
   established. This provides a substantial savings of DSP resources and
   of network bandwidth.
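
   A minimal sketch of client-side long-pound detection follows. It
   assumes the UA can measure how long each key is held. The two-second
   threshold and the KeyPress type are assumptions for illustration
   only, and the trailing "L" follows the "#L" notation used in the DML
   example in this section.

```python
from dataclasses import dataclass

# Assumed threshold; the draft does not say how long a "long" press is.
LONG_PRESS_SECONDS = 2.0


@dataclass
class KeyPress:
    key: str         # DTMF key, for example "5" or "#"
    duration: float  # seconds the key was held down


def to_dml_symbol(press: KeyPress) -> str:
    """Map a locally detected key press to the symbol matched by DML.

    A long press is written as the key followed by "L" (for example "#L");
    a short press is just the key itself.
    """
    if press.duration >= LONG_PRESS_SECONDS:
        return press.key + "L"
    return press.key
```

   Since the classification happens on the UA, no media needs to be
   forked towards a network server for the duration of the call.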


   The call flow is shown in Figure 3. The initial INVITE (1) contains
   an Accept header that indicates support for text/dml. The application
   server acts as a B2BUA, and connects the caller to a VoiceXML server
   (2-6). The INVITE towards the VoiceXML server contains an HTTP URL
   in the Request-URI [3]. This causes the VoiceXML server to fetch the







       Caller            App Server            Gateway
          |(1) INVITE         |                   |
          |------------------>|                   |
          |(2) 183 w. HTTP URL|                   |
          |<------------------|                   |
          |(3) HTTP GET       |                   |
          |------------------>|                   |
          |(4) HTTP 200 OK    |                   |
          |<------------------|                   |
          |(5) HTTP POST      |                   |
          |------------------>|                   |
          |(6) 200 OK         |                   |
          |<------------------|                   |
          |                   |(7) INVITE         |
          |                   |------------------>|
          |                   |(8) 200 OK         |
          |                   |<------------------|
          |                   |(9) ACK            |
          |                   |------------------>|
          |(10) 200 OK        |                   |
          |<------------------|                   |
          |(11) ACK           |                   |
          |------------------>|                   |
          |(12) HTTP POST     |                   |
          |------------------>|                   |
          |(13) HTTP 200 OK   |                   |
          |<------------------|                   |
          |                   |(14) BYE           |
          |                   |------------------>|
          |                   |(15) 200 OK        |
          |                   |<------------------|
          |(16) INVITE        |                   |
          |<------------------|                   |
          |(17) 200 OK        |                   |
          |------------------>|                   |
          |(18) ACK           |                   |
          |<------------------|                   |
          |(19) HTTP POST     |                   |
          |------------------>|                   |
          |(20) HTTP 200 OK   |                   |
          |<------------------|                   |
          |(21) INVITE no SDP |                   |
          |<------------------|                   |
          |(22) 200 OK SDP1   |                   |
          |------------------>|                   |
          |                   |(23) INVITE SDP1   |
          |                   |------------------>|
          |                   |(24) 200 OK SDP2   |
          |                   |<------------------|
          |(25) ACK SDP2      |                   |
          |<------------------|                   |
          |                   |(26) ACK           |
          |                   |------------------>|



   Figure 2: HTML Input Flow





       Caller            App Server          VXML Server           Gateway
          |(1) INVITE         |                   |                   |
          |------------------>|                   |                   |
          |                   |(2) INVITE         |                   |
          |                   |------------------>|                   |
          |                   |(3) 200 OK         |                   |
          |                   |<------------------|                   |
          |                   |(4) ACK            |                   |
          |                   |------------------>|                   |
          |(5) 200 OK         |                   |                   |
          |<------------------|                   |                   |
          |(6) ACK            |                   |                   |
          |------------------>|                   |                   |
          |                   |(7) HTTP GET       |                   |
          |                   |<------------------|                   |
          |                   |(8) HTTP 200 OK    |                   |
          |                   |------------------>|                   |
          |                   |(9) HTTP POST      |                   |
          |                   |<------------------|                   |
          |                   |(10) HTTP 200 OK   |                   |
          |                   |------------------>|                   |
          |                   |(11) BYE           |                   |
          |                   |------------------>|                   |
          |                   |(12) 200 OK        |                   |
          |                   |<------------------|                   |
          |(13) INVITE        |                   |                   |
          |<------------------|                   |                   |
          |(14) 200 OK        |                   |                   |
          |------------------>|                   |                   |
          |(15) ACK           |                   |                   |
          |<------------------|                   |                   |
          |(16) INVITE no SDP |                   |                   |
          |w. App-Info        |                   |                   |
          |<------------------|                   |                   |
          |(17) 200 OK SDP1   |                   |                   |
          |------------------>|                   |                   |
          |                   |(18) INVITE SDP1   |                   |
          |                   |-------------------------------------->|
          |                   |(19) 200 OK SDP2   |                   |
          |                   |<--------------------------------------|
          |(20) ACK SDP2      |                   |                   |
          |<------------------|                   |                   |
          |                   |(21) ACK           |                   |
          |                   |-------------------------------------->|
          |(22) HTTP GET      |                   |                   |
          |------------------>|                   |                   |
          |(23) HTTP 200 w.DML|                   |                   |
          |<------------------|                   |                   |
          |(24) HTTP POST     |                   |                   |
          |------------------>|                   |                   |
          |(25) HTTP 200 OK   |                   |                   |
          |<------------------|                   |                   |
          |                   |(26) BYE           |                   |
          |                   |-------------------------------------->|
          |                   |(27) 200 OK        |                   |
          |                   |<--------------------------------------|
          |(28) INVITE no SDP |                   |                   |
          |<------------------|                   |                   |
          |(29) 200 OK SDP3   |                   |                   |
          |------------------>|                   |                   |
          |                   |(30) INVITE SDP3   |                   |
          |                   |------------------>|                   |
          |                   |(31) 200 OK SDP4   |                   |
          |                   |<------------------|                   |
          |(32) ACK SDP4      |                   |                   |
          |<------------------|                   |                   |
          |                   |(33) ACK           |                   |
          |                   |------------------>|                   |





   Figure 3: Prepaid Calling Card using DML


   script by invoking that URL (7). The script that is returned (8) asks
   the user to enter their calling card number and the destination
   number. This input is collected over the RTP stream established with
   the caller. Once collected, the result is POSTed to the application
   server (9). Once done, the VoiceXML server is no longer needed. So,
   the application server terminates the dialog with it (11). It then
   re-INVITEs the caller, putting them on hold (13-15). The next step is
   to connect them to the gateway using third party call control. The
   INVITE sent to the caller for the 3pcc flow (16) contains an App-Info
   header. This header contains a URL for a DML document. This document
   asks the UA to listen for a long pound. An example of how the
   document might look is:


   <dml>
   <condition
   href="http://server33.example.com?case=2">#L</condition>
   </dml>



   After the 3pcc exchange, the caller fetches this document (22,23).
   The user has their conversation. At the end, instead of hanging up,
   the user enters a long pound. This matches the condition in the DML
   document, causing an HTTP form POST to the application server (24).
   The application server hangs up with the gateway (26). Once again
   using third party call control, the application server now connects
   the caller with the VoiceXML server (28-33). The flow proceeds from
   here as it does from message 7 onwards. The VoiceXML server would
   collect the next phone number, and then the application server would
   connect the caller to the gateway.
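
   The matching step at the UA can be sketched as follows. The sketch
   assumes that a condition's text is compared with the user input by
   exact string equality; DML may well define richer, digit-map style
   matching, which is not modeled here. The function name
   match_condition is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The DML document from the example above.
DML_DOC = """
<dml>
<condition
href="http://server33.example.com?case=2">#L</condition>
</dml>
"""


def match_condition(dml_text: str, user_input: str) -> Optional[str]:
    """Return the href of the first condition equal to user_input, if any.

    Exact string comparison is an assumption of this sketch.
    """
    root = ET.fromstring(dml_text.strip())
    for condition in root.iter("condition"):
        pattern = (condition.text or "").strip()
        if pattern == user_input:
            return condition.get("href")
    return None
```

   A UA would issue the HTTP form POST only when match_condition
   returns a URL; any other input stays local to the device.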

4.2 Voice Recorder

   Another example application is a voice recorder. The voice recorder
   application allows a user to record their conversation. We consider
   two forms of input. In one mode, the user only uses DTMF to control
   the recording. They can press 1 to start it, and 2 to stop it. There
   are no voice prompts or greetings. The user needs to know to press 1
   and 2 to start and stop. In the HTML version, the user gets a pop-up
   console that has buttons to start and stop recording. The
   implementation is based on the app component model [4] and uses a
   conference server component and a recording server component.

4.2.1 DML






   A call flow for the DML version of the application is shown in Figure
   4. The caller sends an INVITE, which is routed to the application
   server (1). This INVITE has an Accept header listing text/dml as a
   supported type. The application server generates a 183 (2) with an
   App-Info header containing an HTTP URL for the DML document. The
   application server acts as a B2BUA, and completes the call to the
   called party (3-7). The caller fetches the DML document (8), which is
   returned to them (9). The document for this application is
   straightforward:


   <dml>
   <condition
   href="http://server33.example.com">1</condition>
   </dml>



   At some point during the call, the user presses 1. This matches the
   first DML condition, causing an HTTP form POST to the URL
   http://server33.example.com?digits=1 (10). What follows are a series
   of third party call control exchanges. These exchanges connect the
   caller to a conference server, the callee to the conference server,
   and then connect a recording server (controlled by RTSP) to the
   conference server. This is an application-unaware conference server
   as described in [4], which mixes together the media from all users in
   the same context. Messages 12-17 connect the caller to the conference
   server. Messages 18-23 connect the callee to the conference server.
   Messages 24-29 bring the recording server into the conference. The
   application server then uses RTSP (30) to instruct the recording
   server to record the media it is receiving.

   Message 11 will have also caused another DML document to be returned
   to the caller:


   <dml>
   <condition
   href="http://server33.example.com">2</condition>
   </dml>



   If the user presses 2, this is reported to the application server,
   which can stop recording (not shown).
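
   The reporting and document-swapping steps can be sketched as below.
   Several details are assumptions: the class and function names are
   hypothetical, an existing query string is extended with "&", and
   returning a "press 1" document after recording stops is a guess,
   since that part of the flow is not shown. The third party call
   control and RTSP steps are reduced to a flag.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

DML_TEMPLATE = '<dml>\n<condition\nhref="{href}">{key}</condition>\n</dml>\n'


def report_url(href: str, digits: str) -> str:
    """Add the matched input to the condition's URL as a "digits" parameter."""
    parts = urlsplit(href)
    extra = urlencode({"digits": digits})
    query = f"{parts.query}&{extra}" if parts.query else extra
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment)
    )


class RecorderApp:
    """Hypothetical application-server state for the voice recorder."""

    def __init__(self, href: str) -> None:
        self.href = href
        self.recording = False

    def handle_report(self, digits: str) -> str:
        """Update state for a reported key and return the next DML document."""
        if digits == "1":
            self.recording = True   # a real server would start 3pcc and RTSP
        elif digits == "2":
            self.recording = False  # a real server would stop the recording
        next_key = "2" if self.recording else "1"
        return DML_TEMPLATE.format(href=self.href, key=next_key)
```

   With this shape, the UA never needs to know the application's state;
   it simply replaces its current DML document with whatever the last
   response carried.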







        RTSP usage is most definitely not quite right in this call
        flow.

4.2.2 HTML Flow

   It is hopefully not a surprise to the reader at this point to learn
   that the call flow for the HTML version of this application is
   identical to the DML version. The only difference is that instead of
   returning DML documents, the HTTP operations return HTML documents.
   These documents have buttons that cause the appropriate form POSTs
   for starting and stopping recording. They would also provide the user
   with other options, such as a link to listen to the recorded audio,
   for example.

5 Requirements Analysis

   The following analyzes this framework against the general
   requirements outlined in [7]:

        R1: The mechanism must support collecting device/user input
             which is associated with an established SIP session but
             must also support collecting device/user input that is
             outside of any established sessions.

             The framework supports collecting input associated with a
             SIP session. It can, without any changes, also support
             collection of input outside of a session, although more
             thought is needed on the security implications of doing so.

        R2: The mechanism must transport user indications to network
             elements independently of the media plane.

             The framework sends the indications using HTTP, outside of
             the media plane.

        R3: The transport mechanism must be sensitive to the limited
             bandwidth constraints of some signaling planes, for
             instance, reliability through blind retransmission is not
             acceptable.

             Reliability is done through TCP. The amount of bandwidth
             used is very small, since the HTTP requests need not
             contain anything but the request line.

        R4: The mechanism must support multiple network entities
             requesting and receiving indications independently of each
             other.








       Caller     App Server   Conf Server Record Server   Callee
          |(1) INVITE  |            |            |            |
          |----------->|            |            |            |
          |(2) 183     |            |            |            |
          |<-----------|            |            |            |
          |            |(3) INVITE  |            |            |
          |            |------------------------------------->|
          |            |(4) 200 OK  |            |            |
          |            |<-------------------------------------|
          |            |(5) ACK     |            |            |
          |            |------------------------------------->|
          |(6) 200 OK  |            |            |            |
          |<-----------|            |            |            |
          |(7) ACK     |            |            |            |
          |----------->|            |            |            |
          |(8) HTTP GET|            |            |            |
          |----------->|            |            |            |
          |(9) HTTP 200|            |            |            |
          |w. DML      |            |            |            |
          |<-----------|            |            |            |
          |(10) HTTP POST           |            |            |
          |----------->|            |            |            |
          |(11) 200 OK |            |            |            |
          |<-----------|            |            |            |
          |(12) INVITE |            |            |            |
          |no SDP      |            |            |            |
          |<-----------|            |            |            |
          |(13) 200    |            |            |            |
          |SDP1        |            |            |            |
          |----------->|            |            |            |
          |            |(14) INVITE |            |            |
          |            |SDP1        |            |            |
          |            |----------->|            |            |
          |            |(15) 200 OK |            |            |
          |            |SDP2        |            |            |
          |            |<-----------|            |            |
          |            |(16) ACK    |            |            |
          |            |----------->|            |            |
          |(17) ACK    |            |            |            |
          |SDP2        |            |            |            |
          |<-----------|            |            |            |
          |            |(18) INVITE |            |            |
          |            |no SDP      |            |            |
          |            |------------------------------------->|
          |            |(19) 200    |            |            |
          |            |SDP3        |            |            |
          |            |<-------------------------------------|
          |            |(20) INVITE |            |            |
          |            |SDP3        |            |            |
          |            |----------->|            |            |
          |            |(21) 200 OK |            |            |
          |            |SDP4        |            |            |
          |            |<-----------|            |            |
          |            |(22) ACK    |            |            |
          |            |----------->|            |            |
          |            |(23) ACK    |            |            |
          |            |SDP4        |            |            |
          |            |------------------------------------->|
          |            |(24) INVITE |            |            |
          |            |no SDP      |            |            |
          |            |------------------------>|            |
          |            |(25) 200    |            |            |
          |            |SDP5        |            |            |
          |            |<------------------------|            |
          |            |(26) INVITE |            |            |
          |            |SDP5        |            |            |
          |            |----------->|            |            |
          |            |(27) 200 OK |            |            |
          |            |SDP6        |            |            |
          |            |<-----------|            |            |
          |            |(28) ACK    |            |            |
          |            |----------->|            |            |
          |            |(29) ACK    |            |            |
          |            |SDP6        |            |            |
          |            |------------------------>|            |
          |            |(30) RTSP RECORD         |            |
          |            |------------------------>|            |
          |            |(31) 200 OK |            |            |
          |            |<------------------------|            |



   Figure 4: Voice Recorder App using DML



             Each application server can independently request receipt
             of indications, by placing its own HTTP URL in an App-Info
             header.

        R5: A network entity desiring user indications must be able to
             request user indications from another network entity. The
             entity receiving a request must be able to respond with its
             capability/intent to transmit user indications.

             Requesting of user indications is done by passing an HTTP
             URL to the network entity (originator or recipient) from
             which indications are desired. Capability and intent of
             transmission of user indications is indicated by fetching
             the URL.

        R6: The mechanism must support filtering so that only user
             indications of interest are transmitted.

             For HTML, VoiceXML and WML, user indications are directed,
             so that they are only provided to the application when that
             is what the user really wants. This is the ideal filtering
             scenario. For user interfaces that lack the ability to
             direct input, such as DTMF, the markup provides filtering.
             DML provides filtering capabilities equivalent to those of
             MGCP.

        R7: User activity indications must not be generated unless
             implicitly or explicitly requested by an entity.

             User input is only sent if an application has requested it
             by passing an HTTP URL referencing the markup for the
             interaction.

        R8: The mechanism must support user indications via keys or
             buttons and at the very least must define support for user
             interaction via a standard, generic computer keyboard.

             The framework supports interactions using any kind of
             input. Specific support for DTMF input is specified using
             DML. If generic keyboard input is desired (it is not clear
             that it is), a markup can be defined for it.

        R9: The mechanism must support the definition of device and/or
             user-specific buttons.

             The framework supports interactions using any kind of
             input, including device or user-specific buttons. A markup
             would need to be defined for it. The author suspects that
             this will be far more complex than would appear at first
             glance, given the potential variability in the user input
             capabilities of devices (buttons, switches, jog-dials,
             sliders, etc.).

        R10: The mechanism must be extensible so that some non key-based
             user indications can be supported in the future, for
             instance, sliders, dials or wheels.

             The framework supports interactions using any kind of
             input, including sliders, dials, or wheels.

        R11: A requestor must be able to determine the makeup/contents
             of the user interface possessed by a target device.

             This is done through the Accept header in the request,
             which lists the set of supported markups. It can also be
             accomplished at a finer level of detail through CC/PP
             documents present in the form post used to retrieve the
             markup.

        R12: The mechanism must support reliable delivery at least as
             good as the session control protocol.

             The framework provides fully reliable delivery.

        R13: For key-based indications, the mechanism must provide some
             form of indication of key press duration.

             The DML capabilities are equivalent to MGCP. If this is
             provided in MGCP, it is provided here. If not, the markup
             can be extended to support it.

        R14: For key-based indications, the mechanism must provide some
             form of indication of relative key-press start time
             (relative to other key presses).

             The framework can support any kind of user input as long as
             a suitable markup is defined. If key-based input
             indications beyond DTMF are needed, these features can all
             be added to the markup.

        R15: The receiving application must be able to detect user
             activity indication loss due to packet loss from received
             user activity indications.

             HTTP is sent over TCP, so user input indications are
             reliable.






        R16: The mechanism must allow for end-to-end security/privacy
             between source and destination.

             The exact requirements here need to be defined in more
             detail. A requirement of this level of generality cannot be
             usefully answered.

        R17: Both entities must be able to authenticate each other.

             This is done using HTTP Basic/Digest over SSL.

        D1: The mechanism should be simple to implement and execute on
             devices with simple interfaces.

             The framework supports devices ranging from the profoundly
             stupid to the brilliantly complex.

        D2: There should be a separation between the transport mechanism
             in the signaling plane and the message syntax.

             Yes. Transport is done using HTTP form posts. The message
             syntax is a function of the markup, which is separate.

        D3: The mechanism should attempt to reduce recovery delays under
             packet loss scenarios.

             The behavior is exactly that provided by TCP.

        D4: The mechanism should support routing and identification that
             is compatible with use in a SIP-based network.

             Since the HTTP URLs are handed out by the entity that the
             URLs need to route to, there are no routing issues we are
             aware of.
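
   The capability discovery described under R11 can be sketched as
   follows: the application reads the Accept header and picks the
   richest markup it is prepared to serve. The preference order and the
   function names are assumptions for illustration; only text/dml is
   named as a media type in this document, and media type parameters
   such as q-values are ignored here.

```python
from typing import List, Optional, Sequence


def supported_markups(accept_header: str) -> List[str]:
    """Split an Accept header value into lower-cased media types.

    Parameters after ";" (for example q-values) are dropped.
    """
    types = []
    for item in accept_header.split(","):
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type:
            types.append(media_type)
    return types


def choose_markup(accept_header: str,
                  preference: Sequence[str]) -> Optional[str]:
    """Return the first preferred markup the UA lists, or None."""
    offered = supported_markups(accept_header)
    for candidate in preference:
        if candidate in offered:
            return candidate
    return None


# Hypothetical application preference: richer interfaces first.
APP_PREFERENCE = ["text/html", "text/dml"]
```

   An application that gets None back would fall back to a network
   based server for user interaction, as in the DML calling card
   example.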

6 Conclusion

   This document has proposed a framework for supporting stimulus for
   SIP applications based on markup. This framework supports a broad
   range of device capabilities and user input modes, leveraging
   existing markup languages (such as HTML, VoiceXML and WML). To handle
   simple phones that only support DTMF, we defined a simple DTMF markup
   language that provides functionality equivalent to MGCP digit maps.

7 To Do

        o More details on DML.

        o The 3pcc interactions aren't quite right; they use one of the
          flows that is not recommended, for simplicity. Need to upgrade
          them to the proper flows.

        o More information on CC/PP.

8 Author's Address


   Jonathan Rosenberg
   dynamicsoft
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   email: jdrosen@dynamicsoft.com



9 Normative References

   [1] J. Rosenberg, H. Schulzrinne, et al., "SIP: Session initiation
   protocol," Internet Draft, Internet Engineering Task Force, Feb.
   2002.  Work in progress.

10 Informative References

   [2] H. Schulzrinne and S. Petrack, "RTP payload for DTMF digits,
   telephony tones and telephony signals," RFC 2833, Internet
   Engineering Task Force, May 2000.

   [3] J. Rosenberg, "A SIP interface to voiceXML dialog servers,"
   Internet Draft, Internet Engineering Task Force, July 2001.  Work in
   progress.

   [4] J. Rosenberg, P. Mataga, and H. Schulzrinne, "An application
   server component architecture for SIP," Internet Draft, Internet
   Engineering Task Force, Mar. 2001.  Work in progress.

   [5] B. Culpepper, R. Fairlie-Cuninghame, and J. Mule, "SIP event
   package for keys," Internet Draft, Internet Engineering Task Force,
   Mar. 2002.  Work in progress.

   [6] R. Mahy, "Signaled digits in SIP," Internet Draft, Internet
   Engineering Task Force, Aug. 2001.  Work in progress.

   [7] B. Culpepper and R. Fairlie-Cuninghame, "Network application
   interaction requirements," Internet Draft, Internet Engineering Task
   Force, Mar. 2002.  Work in progress.





   [8] B. Campbell and J. Rosenberg, "Session initiation protocol
   extension for instant messaging," Internet Draft, Internet
   Engineering Task Force, Apr. 2002.  Work in progress.

   [9] M. Arango, A. Dugan, I. Elliott, C. Huitema, and S. Pickett,
   "Media gateway control protocol (MGCP) version 1.0," RFC 2705,
   Internet Engineering Task Force, Oct. 1999.

   [10] F. Cuervo, N. Greene, A. Rayhan, C. Huitema, B. Rosen, and J.
   Segers, "Megaco protocol version 1.0," RFC 3015, Internet Engineering
   Task Force, Nov. 2000.


   Full Copyright Statement

   Copyright (c) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.










