SIP WG                                                          R. Mahy
Internet-Draft                                                N. Ismail
Expires: August 25, 2003                            Cisco Systems, Inc.
                                                      February 24, 2003

Media Policy Manipulation in the Conference Policy Control Protocol
draft-mahy-sipping-media-policy-control-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on August 25, 2003.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

The SIP conferencing framework defines a model for tightly-coupled conferencing in the Session Initiation Protocol (SIP), in which a Conference Policy Control Protocol is used to manipulate policies relevant to a specific conference, such as conference membership policy, authorization policy, and media policy. This document describes a logical model to describe media processing in a SIP conference. It also defines specific protocol semantics and a specific syntax to manipulate that model.




1. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119[2].




2. Overview

The SIP conferencing framework[3] defines a model for tightly-coupled conferencing in SIP[1], in which a Conference Policy Control Protocol is used to manipulate policies relevant to a specific SIP conference, such as conference membership policy, authorization policy, and media policy. While the Conference Policy Control Protocol covers many policies that are not media specific, such as membership policy and authorization policy, this document specifically addresses requirements to manipulate the way in which media in a SIP conference is selected, combined, and modified. It defines a logical model of this type of media processing using a "media topology graph". By manipulating the graph, authorized users can change the media processing behavior of the mixers associated with a specific SIP conference.

A media topology graph consists of individual media streams, logical groups of media streams, and functions or "operations" performed on those streams. These elements are typically associated with a specific subconference. A subconference simply defines a context which allows different groups of users to share a media topology and participant roster with a subset of the participants in a conference. Subconferences are defined in the conferencing framework, and are typically used to enable conferencing sidebars. For convenience purposes, subgraphs consisting of groups and operations, called collections, can be defined, instantiated, and manipulated just like individual elements. These elements and their properties are defined below.

2.1 Elements

First are Streams. These are the actual media streams sent and/or received by or on behalf of conference participants. Media streams are typically established when conference participants join a conference and are described by the SDP media lines in the offer/answer exchange between the participants and the focus. Within the media topology graph, each stream is described by a media type, direction and at least one identifier. Media types can be audio, video or text. Other media types can also be specified in the future. Direction is "in" for streams originating from the conference participants and "out" for streams terminating at the conference participant. Stream identifiers can be network identifiers or aliases. Network identifiers consist of an address family (IPv4 or IPv6), an IP address, and a port number.

Aliases can also be created for any of the streams, either automatically or manually. One automatic alias consists of a participant identifier and, optionally, either the media stream identification "mid" as specified in SDP or the position of the media line describing the stream in SDP. Another set of aliases is created automatically when per-media-line i-lines appear in the SDP.

Conference Policy servers provide clients with stream descriptions either as part of the SIP conferencing package or as responses to inventory requests specified in section x. Clients use the stream identifier that is part of the stream description to associate and connect the stream to other media topology elements in order to achieve the required media policy.

     OPEN ISSUE 2.1.1: Should we allow stream identifiers to be a URL?
     An example is notifications that are either stored locally by
     the focus or one of its controlled mixers, or stored on an
     external web page.

Next are Groups. Media groups are created by clients within the context of a sub-conference as specified in section y. A media group has a media type and a name. Groups can be connected to streams, operators and collections. An example is a group of audio streams that is associated with the main sub-conference and connected to an operator that determines the loudest speaker.

Next are Operators. Operators are basic elements that perform simple media operations. They select among media streams, combine streams, or perform other media processing. Each operator has a type, one or more inputs, one logical output, and an optional set of parameters. The type uniquely identifies the operator and specifies the media service offered. The input connectors specify the number and type of streams required by the operator to perform its operation. An input or output of an operator can be connected to streams, operators, collections or groups. The number and type of parameters depend on the type of the operator. Each type defines the semantics of the operation and any parameters. Parameters define aspects of the operator's function that can differ from one instance of the operator to another.

This specification defines a set of standard operators (see section z). Each standard operator has a unique type that will be registered with IANA and an XML schema describing the operator. Server implementations can support any subset of the standard operators. Implementors can also define their own operators and operator types. Each newly defined operator needs a unique type and a published XML schema. Clients make inventory requests to a Server to get the set of operators supported by that Server. Clients can then instantiate operators using the method specified in section y.

     OPEN ISSUE: Should we use name spaces with types to guarantee uniqueness?

Finally there are Collections. Collections are complex elements created by connecting different operators, groups, streams and other collections together. Each collection provides a sophisticated media service. Like operators, a collection has a type that uniquely identifies it and specifies its function. Each collection has one or more inputs, one logical output and an optional set of parameters. This specification defines a set of standard collections that offer the most common mixing and switching media functions. Each standard collection has a unique type that will be registered with IANA and an XML schema describing the collection. Server implementations can support any subset of the standard collections, and they can also define their own proprietary collections. Each newly defined collection needs a unique type and a published XML schema. Clients make inventory requests to a Server to get the set of collections supported by that Server. Clients can then instantiate collections using the method specified in section zz. Clients can also make their own collections to provide new media services by using the method specified in section y.

    OPEN ISSUE: Likewise should we use namespaces here?
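
As an informal illustration of the elements described in this section, the following Python sketch models streams, groups, and operators as simple records. It is not part of the protocol: the class and field names (NetworkId, Stream, Group, Operator, network_id, aliases, params) are hypothetical and chosen only for readability, and collections are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class NetworkId:
    """Network identifier: address family, IP address, and port."""
    family: str   # "IPv4" or "IPv6"
    address: str
    port: int


@dataclass
class Stream:
    """A media stream sent or received on behalf of a participant."""
    media: str                              # "audio", "video" or "text"
    direction: str                          # "in" = from participant, "out" = to participant
    network_id: Optional[NetworkId] = None
    aliases: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.direction not in ("in", "out"):
            raise ValueError("direction must be 'in' or 'out'")
        # Each stream is described by at least one identifier.
        if self.network_id is None and not self.aliases:
            raise ValueError("a stream needs at least one identifier")


@dataclass
class Group:
    """A named group of streams of one media type."""
    media: str
    name: str


@dataclass
class Operator:
    """A basic media operation with typed inputs and one logical output."""
    type: str
    inputs: List[str]                       # media type required by each input connector
    output: str                             # media type of the single logical output
    params: Dict[str, object] = field(default_factory=dict)
```

For example, Stream("audio", "in", NetworkId("IPv4", "192.0.2.10", 49170)) would describe an audio stream originating from a participant, while constructing a Stream with neither a network identifier nor an alias raises an error.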

2.2 Media Topology Graphs

Below is a diagram which shows a sample media topology with streams, collections, and groups.

     Audio and Video Conference with one Audio Sidebar

        (streams)              (streams)                (streams)

    A B   D E F  H  J      A   C D   F G H I            B   E   J
    | |   | | |  |  |      |   | |   | | | |            |   |   |
    | |   | | |  |  |      |   | |   | | | |            |   |   |
    V V   V V V  V  V      V   V V   V V V V            V   V   V
  +-------------------+   +-------------------+   +-------------------+
  | Main Video In     |   | Main Audio In     |   | Sidebar Audio In  |
  | (group)           |   | (group)           |   | (group)           |
  +-------------------+   +-------------------+   +-------------------+
           ||            //        ||                      ||
           ||           //         ||                      ||
           ||          //          ||           +-----+    ||
           ||         //           ||           |     |    ||
           \/        //            \/           |     V    \/
  ...................V.   ..................... | .....................
  .                   .   .                   . | .                   .
  .                   .   .                   . | .                   .
  .   vendor          .   .   standard        . | .   standard        .
  .   defined         .   .   conference      . | .   sidebar         .
  .   video           .   .   audio           . | .   audio           .
  .   collection      .   .   collection      . | .   collection      .
  .                   .   .                   . | .                   .
  .                   .   .                   . | .                   .
  .....................   ..................... | .....................
            ||                     ||           |          ||
            \/                     \/           |          \/
  +-------------------+   +-------------------+ | +-------------------+
  | Main Video Out    |   | Main Audio Out    | | | Sidebar Audio Out |
  | (group)           |   | (group)           | | | (group)           |
  +-------------------+   +-------------------+ | +-------------------+
    | | | | | |  |  |       |  | |  | | | |  |  |       |   |   |
    | | | | | |  |  |       |  | |  | | | |  |  |       |   |   |
    V V V V V V  V  V       V  V V  V V V V  +--+       V   V   V
    A B C D E F  H  J       A  C D  F G H I             B   E   J

        (streams)              (streams)                (streams)

Editor's Note: An additional ASCII art example of a topology graph is still needed (audio and video?). Perhaps show a tiled video example.

This document defines numerous standard operations (in section x.y) to facilitate interoperability. Implementors are free to extend this list of operations, and an IANA registration process is defined for this purpose. Note that specific conference servers MAY support as few or as many operations as they choose; however, each conference server MUST support at least one standard collection (these are defined in section x.y) per media type which the conference server is capable of handling.

Media manipulation is generally media-specific. When a subconference is created, an input group and an output group are automatically created for each media type supported by the conference server, and a specific collection can be instantiated (again, for each media type). Once instantiated, collections are simply operations and groups connected in some way. The resulting graph can be modified, attached, detached, and deleted without affecting the collection from which the graph was copied. Note also that more than one collection can be incorporated into the topology graph for a given subconference and media type.

Manipulating the topology graph for a SIP conference enables a number of useful features, many of which are described in the SIP conferencing high-level requirements[4] document. Noisy participants can be "muted" from a conference by disconnecting their audio from the appropriate input group.

    OPEN ISSUE: would you instead just set the media to send only 
    or inactive?  doesn't seem as elegant.  the participant gets 
    a new offer answer that way which does not seem desirable.

Participants can be moved to a sidebar by disconnecting their media streams (some or all of them) and reconnecting them to the input and output groups created for the corresponding subconference. Interaction with floor control is coordinated by including an operation which selects only media streams corresponding to participants who have the appropriate floor. The resulting logical output stream or group of streams can be connected to a suitable mixing or combining operation (for example tiling for video), or connected directly to a specific physical stream or streams.

Obviously, authorization is required to allow manipulation of media topology by multiple parties (participants and non-participants alike). The effects of manipulating the media topology graph can range from simple, benign changes which only affect the participant requesting the change, to complete failure of the conference. Clearly no one-size-fits-all policy can be applied. However it is useful to recognize several different categories or severities of impact.

The rest of the functions of the Conference Policy Control Protocol (CPCP for brevity) are mostly orthogonal to media manipulation and so they will be defined in a separate document. However it is important to mention the interaction between the media topology-specific and other aspects of the policy. Conferences and subconferences can be created and deleted by CPCP. Although not topology dependent, when these are created the media topology will change automatically to reflect this. Also, one participant may wish to invite several other participants to a subconference (sidebar), but the initiating participant may not have permission to change the stream connection properties of all of the participants. In this case, the initiator places the participant in a pending state. This informs the participant that the initiator would like the participant to join the sidebar. Then the participant (or an agent acting on his or her behalf) either makes the requested change to the media topology by connecting his or her streams to the appropriate groups (a media topology task), or removes himself or herself from the pending list (a non-media related task). Finally, in many cases authorized users can set authorization policy related to a variety of aspects of conference policy. While setting these policies is non-media related, many uses of these policies do affect the media topology. Note that because of this separation, it is possible to produce an implementation of CPCP which runs on two separate servers, one responsible for media topology and the other responsible for the balance of conference policy functions.




3. Media Policy Control High-Level Functions

Manipulations of the media topology graph are performed as transactions. This ensures that the media graph transitions from one consistent state to another. It should never be in a partially connected or disconnected state. Note that operations are automatically deleted unless they have at least one input connection and at least one output connection. As a result, a transaction which instantiates an operation must connect both an input and an output of that operation during the same transaction; otherwise adding the operation would have no effect.

We define here a concrete syntax in XML for specifying media topology manipulation transactions, collection management, and inventories.

3.1 Transactions

A <transaction> tag encloses one or more topology graph manipulation steps which must all succeed or all fail. Within the transaction, individual steps consist of either creating or instantiating elements or connecting them together. Note that there is an important distinction between groups and aliases on the one hand, and collections and operations on the other. Groups and aliases are created (they do not exist before they are created), while collections and operations are instantiated (a copy of the original is created). A summary of the steps within a transaction is below.

Transaction
   instantiateOperation
   instantiateCollection
   createGroup
   createAlias
   destroyOperation
   destroyCollection
   deleteGroup
   deleteAlias
   connect
   disconnect
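
The all-or-nothing behavior of a transaction, together with the automatic deletion of operations that lack an input or an output connection, can be sketched as follows. This is a minimal illustration under stated assumptions, not a definition of server behavior: the Graph class, the tuple representation of steps, and the function run_transaction are hypothetical, and only a few of the step types listed above are modeled.

```python
import copy


class TransactionError(Exception):
    """Raised when a single step of a transaction cannot be applied."""


class Graph:
    """Hypothetical in-memory media topology graph."""

    def __init__(self):
        self.operations = set()
        self.groups = set()
        self.edges = set()      # (source, sink) pairs

    def apply_step(self, step):
        kind = step[0]
        if kind == "instantiateOperation":
            self.operations.add(step[1])
        elif kind == "createGroup":
            self.groups.add(step[1])
        elif kind == "connect":
            _, src, dst = step
            for end in (src, dst):
                if end not in self.operations and end not in self.groups:
                    raise TransactionError("unknown element: " + end)
            self.edges.add((src, dst))
        elif kind == "disconnect":
            _, src, dst = step
            if (src, dst) not in self.edges:
                raise TransactionError("not connected")
            self.edges.discard((src, dst))
        else:
            raise TransactionError("unsupported step: " + kind)

    def prune(self):
        """Delete operations lacking an input or an output connection."""
        changed = True
        while changed:          # repeat, since one deletion may orphan another
            changed = False
            for op in list(self.operations):
                has_in = any(dst == op for _, dst in self.edges)
                has_out = any(src == op for src, _ in self.edges)
                if not (has_in and has_out):
                    self.operations.discard(op)
                    self.edges = {e for e in self.edges if op not in e}
                    changed = True


def run_transaction(graph, steps):
    """Apply all steps to a copy; return (new_graph, True) or (graph, False)."""
    trial = copy.deepcopy(graph)
    try:
        for step in steps:
            trial.apply_step(step)
    except TransactionError:
        return graph, False     # the original graph is left untouched
    trial.prune()
    return trial, True
```

Working on a copy and swapping it in only on success is one simple way to guarantee that the graph is never observed in a partially connected state; a real server might instead use a journal or a lock.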

3.2 Inventory and Collection Management

In addition, there are primitives for fetching inventories and manipulating Collections. A summary of these functions is below:

Inventory
   inventoryCollections
   showCollectionDetails
   inventoryOperations

Collection
   makeCollection
   deleteCollection



4. Semantics

4.1 Server Semantics--Media Topology operation semantics

Servers must maintain a list of all operator and collection types that can be used by Clients within a conference. Servers must return such a list to all authorized Clients in response to inventory queries. For operators and collections that have parameters, a list of acceptable parameter values must also be specified for each parameter.

For each transaction it receives, the Server must proceed with the steps that follow. For each request within the transaction, the Server must verify that the party initiating the request is authorized to initiate this specific request in the context of the sub-conference specified within the request. If the initiator is not authorized, the Server must not execute any part of the transaction and must return the appropriate "Authorization Failure" response to the initiator. For example, suppose user A requests to connect the input audio stream of user B to group X in sub-conference "sidebar-1" and the output audio stream of user B to group Y in sub-conference "sidebar-1". The Server must verify that user A is authorized to manipulate the media policy of user B and is authorized to manipulate "sidebar-1".

For each request, the Server must verify that any changes in the media policy of any participant that result from executing the request are authorized by the conference policy. If any party is not authorized for the media policy changes that result from the execution of any request within the transaction, then the Server must not execute any part of the transaction and must return the appropriate "Authorization Failure" response to the initiator. In the example above, the Server must verify that user B is authorized to join "sidebar-1".

The Server should verify that all requests to instantiate, create and/or connect elements conform to the XML schema and descriptions of the elements. If any request does not conform to the XML schema of the elements that it operates on, then the Server must not execute any part of the transaction and must return the appropriate "XML Schema Error" response to the initiator. For example, an operator that takes one input connector of type video cannot be connected to an audio stream.

The Server should verify that all the relevant mixers have enough resources to perform the actual media processing required as a result of the execution of the transaction. If not enough resources are available, the Server must not execute any part of the transaction and must return the appropriate "No Available Resources" response to the initiator. Note that resources needed for trans-coding and trans-rating should be accounted for. Editor's Note: More details and some examples need to be provided to explain this section, especially the last point.
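
The order of the server-side checks described above can be sketched as a single validation pass that runs before any part of the transaction is executed. This is only an illustrative sketch: the function check_transaction and its four predicate arguments are hypothetical placeholders for whatever authorization, schema, and resource machinery an implementation provides. The response strings are the ones named in this section.

```python
def check_transaction(requests, is_authorized, policy_allows, conforms, resources_ok):
    """Return an error response string, or None if the transaction may execute.

    requests      -- list of request records (dicts with an "initiator" key)
    is_authorized -- callable(initiator, request): may the initiator issue this request?
    policy_allows -- callable(request): are the resulting media policy changes
                     authorized for every affected participant?
    conforms      -- callable(request): does the request match the element schemas?
    resources_ok  -- callable(requests): can the mixers carry the resulting load?
    """
    # 1. The initiator must be authorized for every request.
    for req in requests:
        if not is_authorized(req["initiator"], req):
            return "Authorization Failure"
    # 2. Every affected participant must be authorized for the resulting changes.
    for req in requests:
        if not policy_allows(req):
            return "Authorization Failure"
    # 3. Every request must conform to the XML schema of the elements it touches.
    for req in requests:
        if not conforms(req):
            return "XML Schema Error"
    # 4. The mixers must have enough resources for the whole transaction.
    if not resources_ok(requests):
        return "No Available Resources"
    return None
```

Because every check runs before any change is made, a failure at any point leaves the media topology untouched, which matches the requirement that the Server not execute any part of a failed transaction.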

4.1.1 Connecting input and output media streams to groups, operators and collections

TBD [This section should specify where media streams should be connected when a participant first joins a conference or a sub-conference. It should also specify how Clients can use stream identifiers to connect streams as well as participants to other elements, and what the behaviour of the Server should be.]

4.1.2 Notifications of media policy changes

TBD [This section should specify the interactions needed between the media policy server and the notification server to report the relevant changes to media policy to the relevant parties. Note that the protocol should allow hidden transactions, for which no notifications will be sent as a result of the media policy change. All authorization considerations specified in section 5.2 should still be followed.]

4.2 Client Semantics

To be written.




5. Standard Operations

Much more to do. A partial list is below:

selectFloorHolder
smilLayout
xsltLayout
stereo2mono
selectByName
containsContributor
doesNotContainContributor
text2speech
speech2text
speech2gesture
speech2signlanguage

This section specifies a set of operators that are needed to provide the most common media processing operations used in conferencing today. Each operator performs a specific function. Each operator has a type to be registered with IANA and an XML schema that defines how Clients might use the operator. Each Server implementation is free to support any number of these operators as well as to define its own.

5.1 Audio

5.1.1 Loudest Speaker

The Loudest Speaker operator takes a set of audio streams as input and produces one output stream, which is that of the loudest speaker in the set. The operator has two optional parameters. The "rank" parameter takes a value from one to the number of input audio streams available to the operator. A rank value of "1" specifies the loudest speaker of the set, a rank value of "2" specifies the second loudest speaker of the set, and so on. The second parameter is the "timeShift" parameter. This parameter takes a value from "0" to maxTimeShift, where maxTimeShift is implementation dependent. A "timeShift" value of 0 specifies the current loudest speaker, whereas a "timeShift" value of 1 specifies the previous loudest speaker, and so on. The default rank parameter is 1 and the default timeShift parameter is 0. Below is the XML schema describing the operator, using a maxTimeShift value of 3.

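         <?xml version="1.0" encoding="UTF-8"?>
         <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <!-- the "audio" connector type is assumed to be defined elsewhere -->
          <xs:element name="operator">
           <xs:complexType>
            <xs:sequence>
             <xs:element name="input" type="audio" minOccurs="1" maxOccurs="unbounded"/>
             <xs:element name="output" type="audio"/>
            </xs:sequence>
            <xs:attribute name="type" type="xs:string" fixed="Loudest Speaker" use="required"/>
            <!-- rank: 1 up to the number of inputs (upper bound not expressible here) -->
            <xs:attribute name="rank" type="xs:positiveInteger" default="1"/>
            <!-- timeShift: 0 up to maxTimeShift, which is 3 in this example -->
            <xs:attribute name="timeShift" default="0">
             <xs:simpleType>
              <xs:restriction base="xs:integer">
               <xs:minInclusive value="0"/>
               <xs:maxInclusive value="3"/>
              </xs:restriction>
             </xs:simpleType>
            </xs:attribute>
           </xs:complexType>
          </xs:element>
         </xs:schema>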
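
The selection performed by the Loudest Speaker operator can be sketched in Python as follows. The text leaves the exact meaning of "timeShift" open, so this sketch makes an explicit assumption: it tracks the sequence of distinct streams that have held the requested rank over time, and "timeShift" counts back through that sequence. The function name, the snapshot representation (a mapping from stream identifier to measured level), and the use of None when no such speaker exists are all hypothetical.

```python
def loudest_speaker(snapshots, rank=1, time_shift=0, max_time_shift=3):
    """Pick a stream by loudness rank, optionally looking back in time.

    snapshots -- list of dicts (oldest first), each mapping a stream
                 identifier to its measured audio level for one interval
    rank      -- 1 selects the loudest stream, 2 the second loudest, ...
    time_shift-- 0 selects the current holder of that rank, 1 the
                 previous holder, and so on (assumed interpretation)
    """
    if not 0 <= time_shift <= max_time_shift:
        raise ValueError("timeShift out of range")
    holders = []                      # distinct holders of the rank, oldest first
    for levels in snapshots:
        if not 1 <= rank <= len(levels):
            raise ValueError("rank out of range")
        ranked = sorted(levels, key=levels.get, reverse=True)
        holder = ranked[rank - 1]
        if not holders or holders[-1] != holder:
            holders.append(holder)    # record only changes of holder
    if time_shift >= len(holders):
        return None                   # no such earlier speaker yet
    return holders[-1 - time_shift]
```

With three intervals in which stream A is loudest twice and stream B is loudest last, the default parameters select B, and a "timeShift" of 1 selects A.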

5.1.2 n-1 mixer

The n-1 mixer mixes a set of audio streams and distributes them so that participants do not receive their own stream. The operator has a set of audio input connectors and one "logical" output connector. Note that the logical connector can actually consist of up to n different streams. The first stream is distributed to the set of participants that are not currently participating in the mix. Each of the other n-1 streams is distributed to one of the participants that are currently contributing to the mix. Each of the n-1 mixes consists of the normal mix after subtracting the input stream of the participant to which the mix is distributed. Note that all the logistics of the n-1 mixing and the distribution of the right mix to the right participant are handled by the implementation of this operator. All the Client needs to worry about is instantiating the n-1 mixer, connecting its input connectors to a set of elements, and typically connecting its output connector to an audio group. Later we will see other implementations where the whole process of handling the n-1 mixing has to be described by the Client in detail. The n-1 mixer has one optional parameter which determines the number of input connectors it has (n). The values that "n" can take differ from one implementation to another. Below is the XML schema describing an n-1 mixer operator whose n parameter can be 3 or 4.


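         <?xml version="1.0" encoding="UTF-8"?>
         <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <!-- the "audio" connector type is assumed to be defined elsewhere -->
          <xs:element name="operator">
           <xs:complexType>
            <xs:sequence>
             <!-- the number of inputs equals "n"; XML Schema 1.0 cannot tie the
                  two together, so the allowed range is repeated here -->
             <xs:element name="input" type="audio" minOccurs="3" maxOccurs="4"/>
             <xs:element name="output" type="audio"/>
            </xs:sequence>
            <xs:attribute name="type" type="xs:string" fixed="n-1Mixer" use="required"/>
            <xs:attribute name="n">
             <xs:simpleType>
              <xs:restriction base="xs:integer">
               <xs:minInclusive value="3"/>
               <xs:maxInclusive value="4"/>
              </xs:restriction>
             </xs:simpleType>
            </xs:attribute>
           </xs:complexType>
          </xs:element>
         </xs:schema>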
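
The subtraction at the heart of n-1 mixing can be sketched as follows. This is a simplified illustration only: audio is represented as equal-length lists of integer samples, mixing is plain addition with no clipping or gain control, and the function name n_minus_one_mix is hypothetical.

```python
def n_minus_one_mix(inputs):
    """Compute the full mix and one "mix minus own input" per contributor.

    inputs -- dict mapping a participant to that participant's samples
              (lists of equal length) for the current interval
    Returns (full_mix, per_participant), where full_mix is intended for
    participants who are not contributing, and per_participant[p] is the
    mix that contributor p should receive.
    """
    frames = list(inputs.values())
    length = len(frames[0]) if frames else 0
    # Normal mix: sample-wise sum of all contributing inputs.
    full_mix = [sum(frame[i] for frame in frames) for i in range(length)]
    # Each contributor receives the normal mix minus their own input.
    per_participant = {
        participant: [total - own for total, own in zip(full_mix, samples)]
        for participant, samples in inputs.items()
    }
    return full_mix, per_participant
```

Computing the full mix once and subtracting each contributor's input avoids summing the other inputs separately for every contributor.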
         

5.1.3 n-mixer

TBD

5.1.4 n-1 filter

TBD

5.1.5 Gain control

TBD

5.2 Video

5.2.1 Video Tiling Operator

This operator combines the input video in a grid of x horizontal, by y vertical panes (each input corresponding to a single pane).
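
As an illustration, the pane geometry for an x by y grid can be computed as shown below. The text does not say how inputs are assigned to panes, so this sketch assumes row-major order (left to right, then top to bottom) and equal-sized panes; the function name tile_layout and the pixel-based output format are hypothetical.

```python
def tile_layout(x, y, width, height, inputs):
    """Assign each input to one pane of an x-by-y grid.

    Returns a dict mapping each input to (left, top, pane_width, pane_height)
    in pixels of the combined output picture.
    """
    if x < 1 or y < 1:
        raise ValueError("grid dimensions must be positive")
    if len(inputs) > x * y:
        raise ValueError("more inputs than panes")
    pane_w, pane_h = width // x, height // y
    layout = {}
    for index, name in enumerate(inputs):
        row, col = divmod(index, x)   # row-major placement (an assumption)
        layout[name] = (col * pane_w, row * pane_h, pane_w, pane_h)
    return layout
```

For a 2 by 2 grid on a 640 by 480 output, the first input occupies the top-left 320 by 240 pane and the third input occupies the bottom-left pane.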

5.2.2 Two Input Video Switch Operator (TIVS)

The two input video switch operator has two input connectors. The operator maps the stream connected to the first input connector to all streams connected to the output connector that match that input stream. The stream connected to the second input connector is mapped to the rest of the output stream(s).

         <?xml version="1.0" encoding="UTF-8"?>
         <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <!-- the "video" connector type is assumed to be defined elsewhere -->
          <xs:element name="operator">
           <xs:complexType>
            <xs:sequence>
             <xs:element name="input" type="video" minOccurs="2" maxOccurs="2"/>
             <xs:element name="output" type="video"/>
            </xs:sequence>
            <xs:attribute name="type" type="xs:string" fixed="TIVS" use="required"/>
           </xs:complexType>
          </xs:element>
         </xs:schema>


     OPEN ISSUE: Should the matching criteria be fixed or should 
     it be definable as a parameter to the operator? Current 
     suggestion is that it would be fixed as following: 
     "two streams match if they belong to the same participant". 
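
The mapping performed by the TIVS operator can be sketched as follows, assuming the matching rule suggested in the open issue above (two streams match if they belong to the same participant). Streams are represented here as (participant, stream identifier) pairs, and the function name tivs is hypothetical.

```python
def tivs(first, second, outputs):
    """Map each output stream to one of the two input streams.

    first, second -- input streams as (participant, stream_id) pairs
    outputs       -- output streams as (participant, stream_id) pairs
    An output that matches the first input (same participant) carries the
    first input; every other output carries the second input.
    """
    return {
        out: (first if out[0] == first[0] else second)
        for out in outputs
    }
```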

5.2.3 Voice Activated Switch Operator (VAS)

This is one of the most popular video operations. This operation has two input connectors; the first is typically connected to a group of video streams and the second is connected to an audio stream. The audio stream is typically the output of an audio operator (see Video collections) such as the "Loudest Speaker". If any of the streams connected to the first input connector matches the stream connected to the second input connector, then the operator maps that matching video stream to all the video stream(s) connected to the output connector.

         <?xml version="1.0" encoding="UTF-8"?>
         <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <!-- the "video" and "audio" connector types are assumed to be
               defined elsewhere. Note: XML Schema requires same-named sibling
               elements to share one type, so a final schema would need
               distinct element names or a common connector type. -->
          <xs:element name="operator">
           <xs:complexType>
            <xs:sequence>
             <xs:element name="input" type="video"/>
             <xs:element name="input" type="audio"/>
             <xs:element name="output" type="video"/>
            </xs:sequence>
            <xs:attribute name="type" type="xs:string" fixed="VAS" use="required"/>
           </xs:complexType>
          </xs:element>
         </xs:schema>



6. Standard Collections

To do.




7. Formal Syntax

The following syntax specification uses XML schema as described in xref target="***". This document includes several options for syntax, including an XML Schema, a SOAP encoding, and an XML RPC encoding. While drawing a conclusion at this time may be inappropriate, the XML Schema provides the largest amount of semantic data and is also the least verbose.

Example of the proposed SOAP encoding:

<?xml version="1.0"?>
<env:Envelope xmlns:env="http://www.w3.org/2002/12/soap-envelope">
  <env:Body>
    <transaction>
      <conf>main</conf>
      <instantiateOperation>
        <type>tileXbyY</type>
        <id>xyz</id>
        <x>2</x>
        <y>2</y>
      </instantiateOperation>
      <connect>
        <input>operation:main:tileXbyY:xyz</input>
        <output>group:main:video:main.out</output>
      </connect>
    </transaction>
  </env:Body>
</env:Envelope>

An XML schema encoding:


<transaction conf="main">
  <instantiateOperation type="tileXbyY" id="xyz" x="2" y="2" />
  <connect>
    <input kind="operation" type="tileXbyY" id="xyz" conf="main" />
    <output kind="group" media="video" name="main.out" conf="main"/>
  </connect>
</transaction>

And finally an XML RPC encoding.


<?xml version="1.0"?>
<methodCall>
  <methodName>mpcpTransaction</methodName>
  <params>
    <param><value>main</value></param>
    <param><value><struct>
      <member><name>action</name> <value>instantiateOperation</value></member>
      <member><name>type</name> <value>tileXbyY</value></member>
      <member><name>id</name> <value>xyz</value></member>
      <member><name>x</name> <value><int>2</int></value></member>
      <member><name>y</name> <value><int>2</int></value></member>
    </struct></value></param>
    <param><value><struct>
      <member><name>action</name> <value>connect</value></member>
      <member><name>input</name> <value>operation:main:tileXbyY:xyz</value></member>
      <member><name>output</name> <value>group:main:video:main.out</value></member>
    </struct></value></param>
  </params>
</methodCall>
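
To give a feel for how a server might consume the attribute-based XML encoding shown earlier in this section, the following Python sketch parses that example into a list of steps using the standard library. The function name parse_transaction and the tuple representation of steps are hypothetical; the element and attribute names are taken from the example, and no validation against a schema is attempted.

```python
import xml.etree.ElementTree as ET

# The attribute-based XML example from this section, repeated verbatim.
EXAMPLE = """<transaction conf="main">
  <instantiateOperation type="tileXbyY" id="xyz" x="2" y="2" />
  <connect>
    <input kind="operation" type="tileXbyY" id="xyz" conf="main" />
    <output kind="group" media="video" name="main.out" conf="main"/>
  </connect>
</transaction>"""


def parse_transaction(text):
    """Return (conference, steps) for one <transaction> document."""
    root = ET.fromstring(text)
    if root.tag != "transaction":
        raise ValueError("expected a <transaction> element")
    steps = []
    for child in root:
        if child.tag in ("connect", "disconnect"):
            # Connection steps carry their endpoints as child elements.
            source = dict(child.find("input").attrib)
            sink = dict(child.find("output").attrib)
            steps.append((child.tag, source, sink))
        else:
            # Other steps carry their arguments as attributes.
            steps.append((child.tag, dict(child.attrib)))
    return root.get("conf"), steps


conference, steps = parse_transaction(EXAMPLE)
```

Here conference is "main", the first step is the instantiateOperation with its four attributes, and the second step is a connect whose source is the tileXbyY operation and whose sink is the "main.out" video group.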




8. Security Considerations

Need to write some real text. Authorization rules are discussed in Section *n*.




9. IANA Considerations

This document defines an IANA registry




10. Acknowledgments

This work was the result of discussions among the SIP Conferencing Design Team.




Normative References

[1] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.



Informational References

[3] Rosenberg, J., "A Framework for Conferencing with the Session Initiation Protocol", draft-rosenberg-sipping-conferencing-framework-01 (work in progress), February 2003.
[4] Levin, O., "Requirements for Tightly Coupled SIP Conferencing", draft-levin-sipping-conferencing-requirements-02 (work in progress), November 2002.



Authors' Addresses

  Rohan Mahy
  Cisco Systems, Inc.
  101 Cooper St
  Santa Cruz, CA 95060
  USA
  EMail: rohan@cisco.com

  Nermeen Ismail
  Cisco Systems, Inc.
  170 W Tasman Dr
  San Jose, CA 95134
  USA
  EMail: nismail@cisco.com

