Document: draft-ietf-avt-rtp-atrac-family-16.txt
Reviewer: Scott Brim
Review Date: 24 June 2008
IESG Telechat date: 02 July 2008

Summary:

This draft is on the right track, but has open issues, described in the review.

Comments:

This is being submitted as a proposed standard, so I am asking that it be very clear.  My concerns are mainly with what I see as some ambiguities and some possible errors in documenting protocol behavior.
There aren't many, so I have left them in the order they occur in the draft instead of categorizing them.


1. Introduction

 >    The need for real-time streaming of audio data has grown, and
 >    this document details our efforts in increasing the product and
 >    application space for the ATRAC family of codecs.

This is a draft for a proposed standard technical specification.
Whether it is motivated by a desire to increase product and application space is irrelevant.  I would delete this sentence.


4.5.2 Scalable Multi-Session Streaming

 >    While there may be alternative methods for synchronization of the
 >    layers, it is RECOMMENDED that the timestamp will be used for
 >    synchronizing the base layer with its enhancement. Applications

"It is RECOMMENDED" does not conform to RFC 2119.  This should be a
SHOULD, along with an explanation of the conditions under which it is
reasonable not to implement (so that implementors are not left
guessing).

 >    If the enhancement layer's session data cannot arrive until the
 >    presentation time, the decoder SHALL decode the Base layer
 >    session's data only, ignoring the enhancement layer's data.

Change SHALL to MUST globally.


5.1  Global Structure of Payload Format

 >    The structure of ATRAC Payload is illustrated in Figure 3.  The
 >    RTP payload following the RTP header contains three octet-aligned
 >    data sections.

Only two data sections are described.  Do you mean that the RTP header 
plus ATRAC header section plus payload section form three sections?


5.3.1 Usage of ATRAC Header Section

 >    Fragment Number (FrgNo): 3 bits
 >    In the event of data fragmentation, this value is one for the
 >    first packet, and increases sequentially for the remaining
 >    fragmented data packets. This value SHOULD be zero for an
 >    unfragmented frame.

Earlier it was said: "The ATRAC codec can handle very large frames.  As 
most IP networks have significantly smaller MTU sizes than the frame 
sizes ATRAC can handle ...".  If there can be such a significant 
difference -- and if you want to allow for larger frames in the future 
-- is there special handling for when this 3-bit counter rolls over 
(more than 7 fragments)?  If not, at least mention that you do not 
expect it to roll over -- or that you expect the receiver to be able to 
handle rollovers.
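
To make the rollover concern concrete, here is a minimal Python sketch.
It assumes a sender that simply masks a running counter to 3 bits.  The
draft does not say this; the function names and the frame and payload
sizes are hypothetical.

```python
FRGNO_BITS = 3                      # width of the FrgNo field
FRGNO_MASK = (1 << FRGNO_BITS) - 1  # largest value the field can hold: 7


def fragment_count(frame_size: int, max_payload: int) -> int:
    """Number of packets needed to carry one frame (ceiling division)."""
    return -(-frame_size // max_payload)


def frgno_values(n_fragments: int) -> list[int]:
    """FrgNo per packet, ASSUMING a naive counter masked to 3 bits.

    0 marks an unfragmented frame; fragments count up from 1.
    """
    if n_fragments == 1:
        return [0]
    return [i & FRGNO_MASK for i in range(1, n_fragments + 1)]


# Hypothetical sizes: a 12000-octet frame over 1400-octet payloads.
n = fragment_count(12000, 1400)
print(n, frgno_values(n))
```

With these sizes the frame needs 9 packets.  Under the masking
assumption the eighth packet carries FrgNo 0, which looks like an
unfragmented frame, and the ninth carries FrgNo 1, which looks like a
first fragment.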


5.3.2.2  Frame Fragmentation

 >    However, if even a single ATRAC frame will not fit into a
 >    complete RTP packet, the ATRAC frame SHOULD be fragmented.

What is the alternative to fragmenting it?  If there is no
alternative, make the SHOULD a MUST.  If there is an alternative, what
is it and under what conditions is it acceptable to do it?  For
example, you might say: ... "the ATRAC frame SHOULD be fragmented unless 
the receiver is non-compliant and has indicated it is incapable of 
receiving fragments, in which case the session MUST be terminated."

 >    As subsequent packets do not contain any new frames, the Number
 >    of Frames field SHOULD be ignored.

Should this SHOULD be a MUST?  I would think so.  If not, under what 
conditions is it acceptable NOT to ignore the Number of Frames field?


6.1  Example Multi-frame Packet

First, NFrames=5 means there are 6 frames in the packet but only 5 are 
shown.
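
The count behind this comment can be sketched as follows.  It assumes
the NFrames field carries the number of frames minus one, which is my
reading of the draft; the function name is hypothetical.

```python
def frames_in_packet(nframes_field: int) -> int:
    """Frames carried in a packet, ASSUMING the field holds (count - 1)."""
    if nframes_field < 0:
        raise ValueError("NFrames cannot be negative")
    return nframes_field + 1


frames_shown = 5  # frames drawn in the Section 6.1 example
print(frames_in_packet(5), frames_shown)
```

If the field is instead meant to hold the count itself, the example is
fine and the field definition needs to say so.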

Second, up in 4.5.1 you said: "In multiplexed streaming, the base
layer and enhancement layer are coupled together in each packet,
utilizing only one session as illustrated in Figure 1.  While the
packet may begin with either layer type, the two layer types MUST
interleave."  In this example you show 3 base layer frames, an 
enhancement frame, and then a base layer frame.  Since

   - you have begun interleaving in the middle of a packet, and

   - interleaving can begin with either layer type, and

   - there are no frame numbers,

how can you tell that the enhancement layer frame is not the
_beginning_ of the interleaving, and that it is not associated with
the _following_ base layer frame?  There seem to be some implicit
assumptions that should be made explicit, so that implementors can
avoid incompatibility.
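
A small Python sketch of the ambiguity, under stated assumptions: "B"
and "E" stand for base and enhancement frames, and the two pairing
rules are hypothetical receiver behaviors, neither of which the draft
specifies.

```python
def pair_with_preceding(layers):
    """Attach each enhancement frame to the base frame before it."""
    pairs, last_base = [], None
    for i, layer in enumerate(layers):
        if layer == "B":
            last_base = i
        elif layer == "E":
            pairs.append((last_base, i))
    return pairs


def pair_with_following(layers):
    """Attach each enhancement frame to the base frame after it."""
    pairs, pending = [], []
    for i, layer in enumerate(layers):
        if layer == "E":
            pending.append(i)
        elif layer == "B":
            pairs.extend((i, e) for e in pending)
            pending = []
    return pairs


# Frame order as I read it in the Section 6.1 example.
example = ["B", "B", "B", "E", "B"]
print(pair_with_preceding(example))  # (base index, enhancement index)
print(pair_with_following(example))
```

The first rule pairs the enhancement frame (index 3) with base frame 2,
the second with base frame 4.  Two receivers that each pick a different
rule would decode the same packet differently.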


7.5.2  For Media subtype ATRAC-X

 >       The "baseLayer" parameter MUST be the first entry on this
 >       line.  It is RECOMMENDED that the "channelID" parameter be the
 >       next entry.

Again, make this a SHOULD, and explain under what conditions it is
acceptable not to do so.  Why are you allowing implementors NOT to
have channelID be second?  And why do you want it second at all?


7.5.3  For Media subtype ATRAC Advanced Lossless

 >    o  The Media subtype (payload format name) goes in SDP "a=rtpmap"
 >       as the encoding name.  This SHOULD be followed by the
 >       "sampleRate" (as the RTP clock rate), and then the actual
 >       number of channels regardless of the channelID parameter.

What is the problem if this order isn't followed?  If you have a SHOULD,
it's good to tell implementors under what conditions it is acceptable
for them not to do it.  Otherwise you get inconsistent implementations.
Some implementors just ignore all SHOULDs.

 > It is RECOMMENDED

Make it a SHOULD, with explanation.

The same comment applies to the uses of RECOMMENDED that follow.  I'll
stop mentioning them.

7.6  Offer-Answer Model Considerations

 >    In order to establish an interoperable transmission framework, an
 >    Offer-Answer negotiation in SDP SHOULD observe the following
 >    considerations.

Under what conditions is it acceptable not to?


7.6.3  For Media subtype ATRAC-X

 >    o  When creating an offer with considerably high requirements
 >       (such as 8 channels at 96kHz), it is RECOMMENDED that the
 >       offer also contain a configuration with lower requirements
 >       (such as a stereo only option).  Although multiple alternative
 >       configurations may be offered, care SHOULD be taken not to
 >       offer too many payload types.

I'm not sure what this SHOULD means.  If this is just a general bit of
advice, make the SHOULD lower case should -- or perhaps just delete
it.  If this is an important guide to implementation, then should the
SHOULD be a MUST?  If so, what specifically do you mean by "too many"?
Is it possible for the offerer to know?  If it should be a SHOULD,
what is the impact of offering too many?  Under what conditions is it
acceptable to offer too many?  When the receiver's capabilities are
not known?

 >       For best performance, we suggest an answer SHALL NOT contain
 >       any values requiring further capabilities than the offer
 >       contains,

"suggest ... SHALL NOT".  Either they MUST NOT or SHOULD NOT, but I
wouldn't just "suggest" a requirement.  What happens if an answer _does_
require further capabilities than the offer contains?
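
If this becomes a firm requirement, the check a receiver of the answer
would need might look like the Python sketch below.  It assumes each
parameter is numeric and that a larger value means a greater capability;
the dictionary layout, the "channels" key, and the function name are
hypothetical, and only "sampleRate" is a name taken from the draft.

```python
def params_exceeding_offer(offer: dict, answer: dict) -> list:
    """Names of answer parameters that ask for more than the offer.

    ASSUMES numeric values where larger means more demanding.
    Parameters absent from the offer are not judged here.
    """
    return [
        name
        for name, value in answer.items()
        if name in offer and value > offer[name]
    ]


offer = {"sampleRate": 48000, "channels": 2}
answer = {"sampleRate": 96000, "channels": 2}
print(params_exceeding_offer(offer, answer))
```

Here the answer asks for a higher sample rate than was offered, so
"sampleRate" is reported.  The draft should say what the offerer does
next in that case.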