Date: Mon 21 Mar 94 05:51:23 PST
From: Stephen Casner
Subject: AVT WG Agenda and Proposed RTP Changes
To: rem-conf@es.net

To the Audio/Video Transport WG:

You may be wondering about the status of AVT in general and of the Real-time Transport Protocol (RTP) in particular. Just before the November IETF meeting in Houston, the RTP spec was submitted to the Area Director with a request for "IESG Last Call". At the AVT meeting, comments were solicited on the spec but none were offered. However, behind the scenes, some objections were raised with the Area Directorate regarding the classification of RTP as a proposed standard and some aspects of the specification. I have received some feedback on the Area Directorate's review of RTP through discussions with Allison Mankin and John Wroclawski, but I have not received a written response yet. I was hoping to get that first, but I'll try to cover the issues with this message anyway.

Subsequently, I've had two discussions with Van Jacobson and Ron Frederick in which we've taken the vat and nv programs as examples against which to test the issues listed below. As a further verification, Van and Ron plan to implement test versions of vat and nv incorporating RTP with the changes proposed below (there are hopes for this to be done before IETF).

Please read about the proposed changes below, and if you can send comments before IETF, it would be much appreciated. This will be the main topic for Seattle.

The current released RTP draft is draft-ietf-avt-rtp-04.txt and .ps even though its listed expiration date is past. There is a newer version on gaia.cs.umass.edu, in pub/hgschulz/rtp, files rtp.txt and profile.txt with updates for the timestamp change proposed below.

                                                -- Steve Casner


Audio/Video Transport WG

A G E N D A

Tuesday, March 29, 1330-1530

  - Status of RTP and the working group
  - Presentation on changes proposed in response to Area Directorate review:
      1. Control and data on separate ports
      2. Remove Channel ID, put multiplexing into encapsulations
      3. Replace data options with a bit field
      4. Media-specific timestamps
      5. Application-specific sync marker bit
      6. Global source identifiers
      7. Control packet format / Reception reports
  - Report on test implementations of proposed changes
  - Discussion of proposed changes -- what issues and problems may have been overlooked?

Wednesday, March 30, 1330-1530

  - Continuation of discussion of proposed changes as needed
  - Presentation on RTP and Synchronization Protocol
  - Presentation of new encoding specifications over RTP (may include: JPEG; MPEG1/MPEG2; Cell-B; new version of H.261)
  - Assess schedule for RTP completion and other work needed


Proposed Changes to RTP

My take on the AVT process is that we have designed a protocol that is well specified and certainly functional if not optimal. However, part of the criticism was that RTP follows a "classical" protocol style rather than the principles of Application Level Framing (ALF) and Integrated Layer Processing (ILP) proposed by David Clark and David Tennenhouse in their SIGCOMM '90 paper. I have some trouble with this claim because I believe RTP already adheres to at least the major points of these principles. For example, each packet is typically one application data unit and includes the identification required to enable processing of application data units out of order. So, the fit is really a matter of degree.
The following guidelines are relevant:

  - Minimize the number of multiplexing points
  - Minimize the number of inband control operations

As applied to RTP, the following changes are proposed (though not all of these are exactly motivated by ALF):

  - Carry the control and data traffic on separate ports
  - Remove the application-level multiplexing of the Channel ID and move it to an encapsulation for the cases where it's needed
  - Minimize the use of options
  - Make the definition of some fields application-specific (in particular, the timestamp clock rate and sync marker)
  - Use global rather than local IDs, to be able to detect loops
  - Specify more precisely how reception reports should be provided

Each of these items is described in more detail below.


1. Control and data on separate ports

This change removes RTCP functions from RTP data packets and puts them into a separate packet stream on another port to streamline data processing and to allow monitoring programs to receive the control information without having to sort through the data. The relationship of port numbers does not have to be an even/odd pair, only some arithmetic algorithm (such as +1) so that the mapping can be calculated in either direction. This is so that a set of ports can be identified as being in use when either control or data traffic is observed.

Advantages:

  - Allows monitors to collect feedback from receivers without having to sort through the data packets.
  - Moves multiplexing of control and data from option decoding to the existing multiplexing point of port number to streamline data processing.
  - Enables elimination of options in data packets.

Disadvantages:

  - Consumes more port number space.
  - It may be harder to convince some firewall administrators to allow multiple ports through the firewall than one (although this becomes moot if port numbers are randomly allocated).


2. Remove Channel ID, put multiplexing into encapsulations as needed

There was quite a bit of debate about the Channel ID, with Christian Huitema being notable among those who opposed it. ILP says the number of multiplexing points should be minimized, which argues for removal of the Channel ID. The argument for the Channel ID was not strong, but two cases were identified where the Channel ID was needed:

  - When multiple RTP units are carried in one UDP packet (to reduce packet overhead); this is indicated by a profile that specifies an additional encapsulation that currently includes only a frame length. The frame length is 32 bits for purposes of alignment, but really only 16 are needed. The other 16 bits could carry a demultiplexing field to take the place of the Channel ID.
  - For some applications, the multiplexing of the Channel ID might be sufficient without UDP, so RTP could be carried directly over IP. Without the Channel ID, such applications would have to use UDP, or a new encapsulation (perhaps 32 bits) could be defined.

Advantages:

  - Eliminates a multiplexing point.
  - Frees bits for indication of optional fields.

Disadvantages:

  - No multiplexing provided for RTP-over-IP case.


3. Replace data options with a bit field

With the RTCP operations separated into another packet stream on a different port, there are only a few optional fields to be selected in data packets. That allows the optional fields to be indicated by bits in the fixed header. If the bits are also grouped together, it is possible to interpret them as a "packet format", which is a single demultiplexing point for processing as a set of fixed formats if appropriate. (If the options were not orthogonal, a more compact encoding of the packet format would be possible.) Or, rather than replicating code, the option bits may be processed individually, which is more likely for the existing applications.
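
To make the two processing styles concrete, here is a minimal sketch in Python. The bit positions and names (FLAG_SSRC, FLAG_ENC, FLAG_MIC) are purely hypothetical; this message does not assign bit positions, and the flags are just the SSRC, encryption, and authentication indications discussed in this section.

```python
# Hypothetical layout for illustration only: three presence bits grouped
# together so they can be tested one by one or read as a single code.
FLAG_SSRC = 0x4  # sync source identifier follows
FLAG_ENC = 0x2   # packet is encrypted
FLAG_MIC = 0x1   # authentication field is present


def options_individually(flags: int) -> list[str]:
    """Test each option bit on its own; the tests may run in any order."""
    present = []
    if flags & FLAG_SSRC:
        present.append("ssrc")
    if flags & FLAG_ENC:
        present.append("enc")
    if flags & FLAG_MIC:
        present.append("mic")
    return present


# Alternative: treat the grouped bits as one "packet format" code, so a
# single table lookup selects the layout for that combination.
PACKET_FORMATS = {code: options_individually(code) for code in range(8)}


def options_as_format(flags: int) -> list[str]:
    """Single demultiplexing step on the combined flag bits."""
    return PACKET_FORMATS[flags & 0x7]
```

With orthogonal options the table simply enumerates every combination, which is why processing the bits individually is likely to be the smaller amount of code.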
The order of the bits and the corresponding fields is fixed, but the receiver may process them in any order. This reduces the validation required.

The RTP options would be replaced as follows:

CSRC (content source) -- A bit field of 5 or 6 bits in the fixed header to count the number of content source identifiers to follow, for packets produced by a bridge (mixer). Zero indicates that the packet did not come from a bridge. This count takes the place of the length field in the current CSRC option.

SSRC (synchronization source) -- One bit to indicate that a sync source identifier follows. The CSRC count and SSRC bit could be combined into a "source ID" count if the first source ID in the list is defined to identify the sync source. That is, a packet produced by a bridge would always have to identify the bridge explicitly as the sync source, so the count would be at least 2, for the sync source ID and one content source ID. This extra overhead of the explicit sync source ID may be justified if global identifiers are used (see below) and the implicit source ID is reserved for traffic generated by the bridge host itself.

BOS (beginning of sync unit) -- This option carries the sequence number of the first packet of a synchronization unit. It has not been used yet, and it would be eliminated as a generic RTP option. A profile could define this field to be included after the RTP header.

APP (application-specific) -- Application specific functions could be defined in profiles as extensions to the RTP header, but there is no provision for one implementation of an application to define its own optional information in a way that other implementations can simply ignore.

SDST (sync destination, or reverse-path option) -- The same mechanism that is used for the SSRC on a forward packet can be used for SDST on a reverse path packet because the directions are distinguished by the arrival port (the current SSRC and SDST options could have been one).
However, if global identifiers are used, it may not be possible to implement reverse path packets (some would say that's a good thing). This is because translators would not need to keep a table that would allow mapping source identifiers back into transport addresses.

ENC (encryption) -- Two methods:

  A. When encryption is used, the whole RTP unit is encrypted (all of the RTP header and data). The receiver depends upon header validity checks (version, format, sequence number, and timestamp having reasonable values) to discard packets that should have been decrypted (or decrypted with a different key). There is no provision for an explicit initialization vector; instead zero would be used with random initial values for the sequence number and timestamp to deter a known plaintext attack, or a shared secret could be used. Since it would not be possible for a translator to insert or modify the SSRC field, the SSRC would always have to be inserted before encryption, and the local identifier scheme would not work. The key and encryption algorithm would be specified by out of band information; key switching can be done by trying the possible keys one at a time to decrypt the header and make the validity checks.

  B. The fixed portion of the RTP header (64 bits) and the SSRC field (if present) would not be encrypted in order to allow translators to insert or modify the SSRC field (so this would reduce header size in the normal case, and would work with local identifiers). One bit in the header would indicate that the packet was encrypted. As with the current RTP ENC option, the initialization vector could either be the first 64 bits of the RTP header, or an explicit value generated randomly for each packet, but the choice would be specified only by the out-of-band information. In the case of the explicit IV, the 64-bit field would be inserted after the SSRC (if present), and encryption would begin after the IV field.

MIC (authentication) -- Two methods:

  A. An authenticated packet would be indicated by a bit in the header which would indicate the presence of an authentication field later in the header. To work with locally unique identifiers, which must be updated when a packet goes through a translator, the SSRC field would have to be excluded from the authentication; this means it could be faked by a translator. With the global identifier scheme, the SSRC could be authenticated to have been set by one of the sources (but could still be false). For either scheme, the SSRC field must always be included in case the packet has to go through a translator, or alternatively the SSRC flag bit could be excluded from the authentication. The authentication method (covered by ENC, keyed, symmetrically encrypted, or asymmetrically encrypted), algorithm and key (if any) would be known from out-of-band information. As with the method (A) for encryption, key changes could be accomplished by trying with old and new keys in succession. Alternatively, a key descriptor could be included at the start of the authentication field. To allow some receivers to ignore the authentication without knowing the out-of-band information, a length field would be needed at the start of the authentication field.

  B. Since authentication may in the future be provided at a layer below RTP, it would be advantageous if no bits were wasted once authentication within RTP became unused. This could be accomplished by using a separate encapsulation header prepended to the RTP header and distinguished from the RTP header by a different version number or some combination of bits in the first word that was sufficiently distinct from a valid RTP header. One possibility would be version 0, which would otherwise indicate vat protocol, but with unused flag bits set. The first word of the authentication encapsulation should include a length field so receivers that didn't care about authentication could skip it.
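
The idea in MIC method (A) of handling key changes by trying old and new keys in succession might look like the following sketch. HMAC-SHA256 from the Python standard library is used only as a stand-in keyed check; it is an assumption of this example, not an algorithm named anywhere in this proposal, and find_matching_key is a hypothetical helper.

```python
import hashlib
import hmac
from typing import Optional, Sequence


def find_matching_key(packet: bytes, mic: bytes,
                      keys: Sequence[bytes]) -> Optional[int]:
    """Return the index of the first candidate key that verifies, else None.

    `keys` would hold the keys known from out-of-band information,
    e.g. [new_key, old_key] while a key change is in progress.
    """
    for index, key in enumerate(keys):
        # Stand-in keyed check (assumption): HMAC-SHA256 over the packet.
        expected = hmac.new(key, packet, hashlib.sha256).digest()
        if hmac.compare_digest(expected, mic):
            return index
    return None
```

The cost is one extra computation per candidate key for each packet that fails the first check, which is the price of not carrying an explicit key descriptor.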
The other fields in the first 32 bits of the RTP header are the version number, format, sync marker, and sequence number. It is proposed that the version number be bumped to 2 if these proposed changes are adopted. In that case, the version field could be reduced from two bits to one if we were willing to sacrifice the possibility of defining a version 3. The format field would remain unchanged. The sync marker definition might change (see below), but would remain one bit. The sequence number should stay at 16 bits for arithmetic convenience, but could be trimmed if necessary. So, the resulting header bit count would be:

    min  max
      5    6   for CSRC count
      0    1   for SSRC present
      0    1   for encryption
      0    1   for authentication
      1    2   for version
      6    6   for format
      1    1   for sync marker
      8   16   for sequence number
    ---  ---
     21   34   (so 2 must be trimmed)

Advantages:

  - Options may be processed collectively as a packet format, although that seems unlikely for the options defined here.
  - Option bit field takes less space than the option code + length format (though longer global ID's would consume some of the space).
  - More streamlined data processing, though the difference may not be noticeable compared to the data manipulation in existing apps.
  - Allows processing of SSRC first, which was identified as a problem with current scheme unless SSRC option was required to be first.

Disadvantages:

  - There may not be any spare bits for new options, but it may be argued that there just aren't that many variations that must be accommodated in the common part of supporting communication across a packet net (vs. application or media specific details which go later in the packet)
  - Encryption and authentication key changes can't be indicated explicitly.


4. Media-specific timestamps

A more careful analysis of the technique given in the RTP spec for conversion between RTP timestamps and sample clocks has identified two problems:

  - It is possible to construct unusual pathological cases that result in an off-by-1 error due to floating point arithmetic, so we can't claim it's always accurate even though it would probably work fine for all the real examples.
  - The example code in the spec is insufficient because it does not point out that the received 32-bit RTP timestamp must always be extended with the high 16 bits of the NTP timestamp in order to accurately reconstruct a 32-bit sample counter. This requires that the sender and receiver both know roughly what time it is, and we believe it is not reasonable to establish that as a requirement for all uses of the RTP protocol, even though it might be a reasonable requirement within a given profile.

Also, the fact that the RTP timestamp uses the middle bits of an NTP timestamp has led some implementations to "ask the system" for an NTP timestamp as each packet is being prepared. This technique will produce timestamps with too much jitter to be valid for audio samples. Furthermore, the sampling instant, rather than the time of packet preparation, is what's really needed for synchronization.

Therefore, it is proposed that the rate at which the clock ticks, instead of always being 65536Hz, would become a parameter of the format. For some formats, such as the variable frame rate video we are now using, it may make sense to retain the 65536Hz rate, while for most audio formats, the clock rate would be set to be the same as the sampling rate, as is done for the timestamp in vat. For predefined format codes, the clock rate would be defined in the profile spec. For dynamically defined format codes, it is proposed that the clock rate be specified as a reduced rational number, with a 32-bit numerator and a 32-bit denominator.
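
As a rough illustration of the rational clock rate, the sketch below uses Python's fractions module. The wire layout (numerator first, network byte order) and the function names are assumptions made for this example only; the proposal fixes only the two 32-bit widths.

```python
import struct
from fractions import Fraction


def pack_clock_rate(rate: Fraction) -> bytes:
    """Encode a clock rate in Hz as a 32-bit numerator and denominator."""
    # Fraction always keeps itself in lowest terms, i.e. "reduced".
    # Assumed layout: numerator first, both fields big-endian.
    return struct.pack("!II", rate.numerator, rate.denominator)


def unpack_clock_rate(data: bytes) -> Fraction:
    """Decode the 8-byte field produced by pack_clock_rate."""
    numerator, denominator = struct.unpack("!II", data)
    return Fraction(numerator, denominator)


def ticks_to_seconds(ticks: int, rate: Fraction) -> Fraction:
    """Exact elapsed time for a tick count, with no floating point."""
    return Fraction(ticks) / rate
```

For example, a rate of 30000/1001 Hz round-trips through the 8-byte field unchanged, and 30000 ticks at that rate come out as exactly 1001 seconds.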
Assuming that all sampling rates are rational, this would be exact and would avoid the complications of incompatible floating point formats.

A second aspect of the current RTP timestamp is that senders with synchronized time sources (e.g., using NTP) are supposed to periodically adjust the timestamp calculation to correct for drift between the sampling oscillator and the nominal sampling rate. This is to aid in intermedia synchronization, in particular for sources from multiple sites.

In the new timestamp scheme, it would still be possible to have the sample counter be relative to the same t0 as NTP and periodically adjust it at the sender to correct for drift if the sender knows the time of day. By conveying the relationship of media timing to real time in the timestamp itself, no separate explicit communication of the relationship would be needed. However, independent of what clock rate is chosen for the timestamp, receivers would need some indication whether or not the senders knew the time of day (and how accurately) to determine whether they could use the timestamp to synchronize. If a periodic indication would be required anyway, then one could just as well communicate the relationship to real time periodically so the receiver can make the necessary adjustments (since it must be adaptive anyway). This would remove the requirement for the sender to make adjustments, which is an advantage for continuous transmission.

Adjustments to the sample clock may cause audible glitches. If the sender must adjust for drift between the sampling clock and real time, and the receiver must adjust for drift between real time and the playout clock, the number of adjustments (and glitches) may be more than twice as many as if the receiver just made adjustments for drift between the sampling clock and the playout clock.
(In the extreme case, if the sampling clock and playout clock happened to be exactly the same but both skewed with respect to real time, then synchronizing with real time would require adjustments at both sender and receiver, whereas not synchronizing to real time would require no adjustments).

Therefore, it is also proposed to remove the requirement for senders to make timestamp adjustments. For example, the timestamps would just follow the input device's sample counter. For senders that do know the time of day, control packets would carry both an RTP timestamp (sample clock) and the corresponding full 64-bit NTP timestamp to establish the relationship. Specific applications such as the talking clocks might choose to still initialize the sample counter timestamp relative to t0 and adjust periodically to keep the timestamps synchronized to real time since the data is artificially generated anyway.

Advantages:

  - Avoids time conversion calculations at the sender and receiver, and the potential errors at the receiver.
  - Doesn't require sender to know what time it is.
  - Reduced number of drift corrections for continuous transmission.

Disadvantages:

  - Relationship to real time must be communicated periodically.
  - Monitoring and recording tools would need to know about the predefined formats and their clock rates, whereas before they could be format-independent.
  - A receiver must wait for the periodic control packet before synchronization can be done. In particular, a recorder would need to back calculate the real time corresponding to the first packet of a recording to know how to play back the first packets in sync with another medium, for example.


5. Application-specific sync marker bit

In the RTP specification, the sync bit was chosen to mark the last packet of a synchronization unit because that allows the receiver to also determine the first packet of the next talkspurt by its sequence number.
If the sync bit marked the first packet of a synchronization unit, it would not be possible in real time to establish the end of the previous synchronization unit.

However, some applications of RTP may only care about the start of a synchronization unit. For them to successfully determine the first packet of a synchronization unit requires that both the last packet of the previous unit and the first packet of the next one be received intact. This is a disadvantage, so those applications would prefer to mark the first packet.

If you think about it, it's not necessary for the main RTP specification to define the meaning of the sync marker because there is no generic processing of that bit to be done for all applications. In fact, within a given application, it might make sense for the definition of the marker bit to be specific to a particular encoding. Therefore, it might be claimed that the sync marker does not belong in the common RTP header at all, but there are a couple of reasons for it to remain:

  - Most applications will want to mark something, and to allocate a bit elsewhere may require allocating 32. Since we don't want to make the header any longer than necessary, we should try to provide that space in the common header.
  - The marker bit may be useful for media-independent monitoring because loss may often occur in some relationship to the marker. It may be possible to draw some useful information from that relationship without knowing the specific definition of the marker.


6. Global source identifiers

RTP currently uses locally unique 16-bit source identifiers to keep track of distinct sources when packets flow through translators (when the original transport address is replaced by that of the translator) and bridges (when packets from multiple sources are mixed).
These identifiers have the advantage of small size, but there are some disadvantages as well:

  - Translators and bridges must maintain a table of incoming streams (identified by transport address and possibly an incoming SSRC identifier) to be mapped to outgoing identifiers.
  - If a translator or bridge went down, the mapping would probably be lost, which means downstream receivers would be temporarily confused about sources.
  - Local identifiers can't be used to detect a delivery loop. For example, if there is a translator from multicast to unicast followed by a translator from unicast to multicast, and somehow the two multicast trees get hooked together, a loop may form.

Earlier in the RTP design process, truly globally unique identifiers were considered. These were constructed with [type,length,value] structures to use unique network addresses of various forms. This idea was rejected because these identifiers are too large.

Van Jacobson proposes a point in between with the scheme used by vat and wb. This scheme uses 32-bit identifiers unique within a particular medium in a particular session. In other words, one site (source) may use the same identifier for each of several media in a session. In the current Internet, the 32-bit identifiers may be derived simply by using the IPv4 address of the host if there is only one source per host (as is common for workstations), or otherwise chosen at random from the "class F" IP address space (26 bits).

There are some obvious difficulties in this scheme:

  - Two sources may choose the same random number. Fortunately, if the number of sources in a session that need to choose random numbers is small compared to the square root (8192) of the size of the space (2**26), the probability is "negligibly small" according to the analysis of the Birthday Problem.
  - If hosts on a private internet reuse IP addresses that are assigned to hosts in the Internet, and if the two internets are connected by some kind of translator (e.g., a firewall), then the applications running in the private internet would have to be configured to use random identifiers when communicating across the translator. This could be messy.

There are two mechanisms that might be used for conflict resolution:

  - During the relatively long startup period when a participant first joins a conference, the IDs of the other participants are likely to be heard before the new participant transmits, so there is a chance for the new participant to choose another random number if a conflict is heard.
  - If a new site begins using an ID in conflict with an existing one, then any site in the session can send a message challenging the new participant (since the owner of the ID might not be in the session at that moment). Randomized delays would be used to prevent an implosion of responses. This challenge mechanism would need to be specified in the protocol.

If the probability of unresolved conflict is smaller than the probability of lightning striking your workstation, it probably doesn't matter.

This scheme would impose a limit on session size, but the practical limit for the "lightweight session" mechanism is somewhere between 1K and 10K anyway. Beyond that, the scaling of control packet rate backoff stops working (it becomes slower than about 2 minutes, and this gets unstable if participants come and go at similar time scales). For larger sessions, such as cable TV distribution, some means of aggregation would have to be specified, and that mechanism could provide a partitioning of the identifier space as well.

For some applications, such as wb, the identifiers should be stable across invocations to avoid loss of ownership of previously generated information. If a random ID must be chosen then it must be remembered in persistent storage (e.g., a file).
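
The Birthday Problem estimate cited above for random identifiers in the 26-bit space can be checked with a few lines of Python. This is only a sketch using the standard approximation 1 - exp(-n(n-1)/2N); the function name and the default of 26 bits are assumptions of the example.

```python
import math


def collision_probability(sources: int, id_bits: int = 26) -> float:
    """Approximate chance that at least two random identifiers collide.

    `sources` counts only the participants that pick identifiers at
    random; `id_bits` is the size of the random space in bits.
    """
    space = 2 ** id_bits
    pairs = sources * (sources - 1) / 2
    # Birthday approximation: 1 - exp(-pairs / space), computed with
    # expm1 so that very small probabilities keep their precision.
    return -math.expm1(-pairs / space)
```

With 100 randomly chosen identifiers the estimate is below one in ten thousand, while at 8192 sources (the square root of the space) it rises to roughly 0.39, which is why the number of sources needs to stay well under that figure.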
Rules such as this for the use of identifiers might be part of an application-specific profile.

Multiple sources participating in the same session on one host would need some means to coordinate which one gets to use the IP address and which ones must select random identifiers. Although vat does not do this yet, I think Van has a plan.

Advantages:

  - Loop detection.
  - Less work in translators and bridges.
  - May be required for some encryption or authentication schemes under the new option mechanism.

Disadvantages:

  - 32 bits instead of 16.
  - Must be convinced by probability.


7. Control packet format / Reception reports

With control and data being carried on separate ports, the functions of the RTCP options would be moved into the control packets. The primary functions are:

  - Providing information about the sender, e.g., name
  - Providing reception reports for all sources received
  - Relating the sender's media timestamp to real time, and also marking the time of the reception reports

In previous AVT meetings, it has been suggested that RTCP might be removed from the RTP spec entirely. However, Van Jacobson states the position that in order for RTP to be used on a large scale, we must provide mechanisms for network service providers as well as users to evaluate the distribution quality, and the mechanism is to monitor the reception reports generated from the multicast data itself as the test traffic. This mechanism needs to be considered a fundamental part of RTP having to do with using the network for distribution; its use should not be considered optional. Since the reception report mechanism is independent of particular media, it goes in the base RTP spec rather than a profile. The spec should be written such that any application using RTP will work in multicast mode, with unicast as a special case.

The format of control packets has not been as well defined as the other items in this proposed collection of changes.
The header format for the control packets need not be exactly the same as for the data packets, but it may be useful to keep them similar. The following information is proposed:

    Sender info:
        media timestamp
        NTP timestamp
        sender's packet count
        sender's byte count
        sender's name

    Reception info:
        number of reception reports
        array of report structures:
            source identifier
            count of packets received
            count of bytes received
            variance of interarrival time
            last timestamp received from this source in a session packet
            delay time since that session packet was received

The reception report structure includes multiple reports per packet (all sources heard from since the last report). If the conference is large such that the control packet rate is slowed down, and if there are so many sources generating traffic that the reports will not all fit into a packet, then a different set of sites would be picked each time in round robin order, for as many as will fit.

The last two items in the reception report can be used in an NTP-like algorithm to figure the round-trip propagation delay and then divide by two. Then one can make an ordering of the sources based on propagation delay as an approximation of distance, and can cluster the sites based on error rate and on distance.

The reception report should contain absolute information rather than deltas so it does not matter if a report is missed. The NTP timestamp in the sender information is useful for media synchronization, but it is also needed as the timestamp at which the counts in the reception report were generated. This double duty requires that the media timestamp have sufficient resolution that one can be generated to correspond to the NTP timestamp at the time a reception report is generated; alternatively, the reception report must be generated at an instant that corresponds to a media timestamp tick. Having two reception reports with timing information allows the counts to be translated into rates.
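
Turning two reports into rates is simple arithmetic, sketched below in Python. The ReceptionReport class and its field names are hypothetical and cover only a subset of the fields listed above; the NTP timestamp is simplified to seconds as a float, and counter wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ReceptionReport:
    """Hypothetical subset of one report; all counts are absolute."""
    ntp_seconds: float     # time at which the counts were sampled
    packets_received: int  # total packets received from the source
    bytes_received: int    # total bytes received from the source


def rates_between(earlier: ReceptionReport,
                  later: ReceptionReport) -> Tuple[float, float]:
    """Return (packets per second, bytes per second) between two reports."""
    interval = later.ntp_seconds - earlier.ntp_seconds
    if interval <= 0:
        raise ValueError("reports must be in increasing time order")
    packet_rate = (later.packets_received - earlier.packets_received) / interval
    byte_rate = (later.bytes_received - earlier.bytes_received) / interval
    return packet_rate, byte_rate
```

Because the counts are absolute, the two reports need not be consecutive: if an intermediate report is lost, the same calculation just yields an average over a longer interval.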
The RTP QOS option includes minimum and maximum delay measures. These are not included above because an outlier can make the value useless as a description of the distribution.

The general guideline for the construction of the control packet is to put more common information first so that application-independent monitors can process all the common information without having to know anything about the format of the application- or media-dependent information. So, the packet would be assembled as:

    sender info:
        common
        application dependent
        media dependent

    reception report:
        common, for all sources received
        application dependent, for all sources received
        media dependent, for all sources received

There is one field that tells the number of reception reports, and then the reception info is contained in several parallel arrays, all with the same number of entries (some of which may be structs) and all indexed by the same number. The application dependent part would be defined by a profile and the main spec wouldn't say what the format was.

In addition to the sender description and reception report information, the existing RTCP defines the following options. These options may need to be incorporated into the control packet structure in some way:

FMT (format description) -- Allows format codes to be defined dynamically. This may also be accomplished with a higher-layer session protocol, but some applications may not include such a protocol.

BYE (goodbye) -- Indicates that a source is terminating its participation in a session. Since no source description or reception report information is required, this could be a separate (trivial) control packet format.

APP (application-specific controls) -- Application-specific sections of the sender information and reception reports in the control packet provide a means to carry application-specific information defined by a profile.
However, in cases where multiple implementations of a single application interoperate but may have their own control information to communicate, an additional option mechanism may be appropriate.

[end]