Video Interoperability in Lync 2013
July 23, 2012 by Jeff Schertz
The covers have been removed from Lync ‘Wave 15’ and the general public can now download and install a preview version of Lync Server 2013. This presents the opportunity to finally discuss a topic which has been broadly misunderstood throughout the industry for quite some time: what exactly does “H.264 AVC/SVC” support mean in the upcoming Lync platform, and how might it relate to the rest of the video conferencing industry? Microsoft has been talking for some time about introducing a standards-based video codec as a replacement for its proprietary Real-Time Video (RTV) codec, and there have even been announcements released and specification documents published in the past year, but until last week those statements were never officially confirmed. A brief look at the newly released features shows quite a focus on the video experience in the next Lync platform.
Yet the simple declaration that “Lync 2013 supports H.264 video” tends to lead one to believe that the next version of Lync will be able to connect to any existing standards-based video conferencing system, doing away with the need for expensive gateways. That assumption would be completely incorrect, as there is much more to the story than simply providing a new video codec in the product.
But before going into too much detail on how Microsoft is utilizing H.264 it is important to understand some of the history and background around video conferencing in the industry.
In order to establish a video call between endpoints a series of events must be able to successfully take place over a variety of protocols and systems. One of the most basic concepts to understand is that these communication sessions are generally comprised of two separate types of communication streams: signaling and media. Signaling is primarily control traffic that typically (and specifically in Lync) will traverse some sort of server component that facilitates the discussion between 2 or more endpoints. Media is a separate session that may follow the same path as the signaling traffic (in the event of multi-party conference calls hosted on a server) or may be transmitted directly between both endpoints completely bypassing any central components (in the event of a native peer-to-peer call). In some video integration scenarios both the signaling and media will traverse through one or more gateways which may also even perform some level of transcoding in the media payload. In other scenarios a third-party system may be capable of communicating directly to the server component in native fashion without the need for signaling translation and/or media transcoding.
For Lync interoperability any endpoint supporting native integration (from Microsoft or a third-party) must be able to speak Microsoft’s specific implementation of Session Initiation Protocol (SIP) in order to interact with Lync. Although Microsoft Lync is based on SIP standards, there are a number of extensions included in the stack which are unique to the platform. For a gateway-based integration any endpoints which are not able to speak directly to Lync are connected through at least one gateway which handles the translation of Microsoft SIP to other standards-based signaling protocols which support video, like H.323 or SIP.
So the key concept to understand in this section is that unless a third-party endpoint or solution can natively speak the specific Microsoft ‘dialect’ of SIP then a gateway which can speak Microsoft SIP is required.
The actual negotiation of the media session is controlled by Session Description Protocol (SDP) information embedded in the SIP signaling, which allows the two endpoints to decide which media codecs to use and what IP addresses and ports to send their outbound media streams to. But the established media session is made up of more than just the media itself (e.g. video frames). Media sessions in Lync utilize Real-time Transport Protocol (RTP) and Real-time Transport Control Protocol (RTCP). While RTP carries the actual media stream, RTCP is used to facilitate the handshake, control the media flow, provide statistics about the media sessions, and more. So even if two systems share a common video codec, that does not automatically mean that the transport protocols used by each are also compatible.
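To make the negotiation step more concrete, here is a minimal Python sketch of how two endpoints might settle on a video codec from the lists each one advertises. The codec lists and the `pick_codec` helper are assumptions made up for illustration; this is not how Lync actually parses SDP, and it ignores the separate question of RTP/RTCP compatibility.

```python
from typing import Optional, Sequence


def pick_codec(offered: Sequence[str], supported: Sequence[str]) -> Optional[str]:
    """Return the first codec in the offerer's preference order that the
    answerer also supports, or None when there is nothing in common."""
    supported_set = {name.upper() for name in supported}
    for name in offered:
        if name.upper() in supported_set:
            return name
    return None


# Hypothetical codec lists, for illustration only.
lync_2010_client = ["RTV", "H263"]
legacy_room_system = ["H264", "H263", "H261"]
h264_only_system = ["H264"]

print(pick_codec(lync_2010_client, legacy_room_system))  # H263
print(pick_codec(lync_2010_client, h264_only_system))    # None
```

A result of `None` corresponds to the case where some intermediary would have to transcode between the two sides.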
If an endpoint supports native registration to Lync and contains a compatible media codec and transport protocol then the media session would be able to travel directly between the two endpoints. This is a completely native interoperability scenario which partners like Polycom and Lifesize leverage.
But if there is no compatible codec in common, or the transport protocols are not compatible then additional gateway services will need to be leveraged to transcode the media session between the two endpoints. For example if a Lync client is attempting to initiate a peer-to-peer video call with a third-party video conferencing system and they have no video codecs in common then a gateway service will need to transcode the video session between those two codecs. Clearly that gateway would need to support both codecs and thus act as an intermediary. The media path in this scenario is also proxied through the gateway and is not peer-to-peer. Third-party vendors like Cisco (Tandberg) and Avaya (RADVISION) utilize this gateway approach to provide either separate signaling and media gateways, or a single gateway that handles both types of traffic.
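The native-versus-gateway decision described above can be summarized in a short Python sketch. The `Endpoint` fields and the `required_gateways` function are hypothetical names chosen for this example, and the three checks are a simplification of the signaling, codec, and transport requirements discussed in this section.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Endpoint:
    speaks_ms_sip: bool            # can talk to Lync's SIP dialect natively
    video_codecs: FrozenSet[str]   # codecs the endpoint can encode and decode
    compatible_transport: bool     # RTP/RTCP usage matches what Lync expects


def required_gateways(lync_codecs: FrozenSet[str], other: Endpoint) -> Set[str]:
    """Return the gateway roles needed for a call; an empty set means the
    call can be completely native with media flowing directly."""
    needed: Set[str] = set()
    if not other.speaks_ms_sip:
        needed.add("signaling")
    if not (lync_codecs & other.video_codecs) or not other.compatible_transport:
        needed.add("media")
    return needed


lync = frozenset({"RTV", "H263"})
native_partner = Endpoint(True, frozenset({"RTV", "H264"}), True)
standards_room = Endpoint(False, frozenset({"H264"}), False)

print(required_gateways(lync, native_partner))          # set()
print(sorted(required_gateways(lync, standards_room)))  # ['media', 'signaling']
```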
From the time that video was first introduced in the Communications Server platform there have only ever been two video codecs supported: Real Time Video (RTV or VC-1) and H.263, the latter only being available on the Windows client and not on the Mac client or any servers. For additional details on Video functionality in Lync see this previous article, but the key point here is that RTV can support a range of video resolutions from low quality QCIF through high-definition 720p, while the legacy H.263 codec support provided in the Windows Lync client is limited to low-resolution CIF.
In the traditional video conferencing industry a host of codecs have been supported throughout the decades, the most common being H.261, H.262 (MPEG-2), H.263, and most recently H.264 (MPEG-4 AVC). (The upcoming H.265 codec is in draft status and is still under development.) The remainder of this article will focus on H.264 and its various capabilities.
H.264 Advanced Video Coding
The H.264 AVC (Advanced Video Coding) family of standards is commonly used today for everything from video conferencing to Blu-ray discs to even YouTube videos. But its many versions are not all the same and there are specific rules for compatibility and backwards-compatibility between these various iterations. Also, be careful with the term “standards-based” as this can be used to mean different things depending on the context. Just as Microsoft’s specific implementation of SIP is ‘based on a standard’ without necessarily being an industry standard implementation, the same can be true of a video codec; more on this point later.
This standard includes a variety of different profiles, levels, and versions. Some of the specific versions also introduce additional modes as well, so keeping it all straight can be a daunting task. As of earlier this year the H.264 video standard is comprised of no less than 16 different versions, with the initial approved version supporting the most basic capabilities throughout three unique profiles: Baseline, Main and Extended. The large majority of current generation video conferencing equipment today supports the Baseline profile, and while some devices may be capable of further encoding and decoding higher profiles like Main or Extended it is not very common. In version 3 a new profile called High was introduced and provides for even further video compression, at the cost of some additional processing load, but realistically can reduce the media bandwidth requirements by up to 50%. (On a daily basis I participate in video conferences from my home office at 720p/30fps over bandwidth as low as 512kbps to 768kbps using a Polycom HDX which leverages High Profile. By comparison a properly equipped Lync client capable of sending and receiving 720p video using RTV could utilize up to 1.5 Mbps for the same call.)
H.264 Scalable Video Coding
Where things start to get interesting is with version 8 when Scalable Video Coding (SVC or Annex G) was introduced into the family. The SVC extension adds new profiles and scalability modalities, the latter of which are defined as: Temporal, Spatial, SNR/Quality/Fidelity, or Combined. Some or all of these different formats can be leveraged in a specific implementation to provide a desired functionality.
The new scalability behavior provides for an enriched conferencing experience by allowing users to interactively view varying levels of video quality on-demand and not require real time decoding and re-encoding of video streams. This approach places most of the processing on the endpoints (they are already encoding the streams anyway) but by introducing more intelligence into the endpoint the need to centrally decode and re-encode video can be eliminated. The biggest advantage to this approach is providing different endpoints the ability to display the resolution and frame rate best suited for a given scenario. A central conferencing engine can then change roles from a traditional encoding bridge into an advanced relay agent, sending the desired streams (or portions of streams) to the appropriate endpoints.
Basically the way that SVC works is that the media stream is comprised of individual, complementary layers. It starts with a Base Layer that provides the lowest usable resolution and frame rate, which any other endpoint compatible with SVC should be able to display. Higher quality options are then provided by one or more Enhancement Layers which are included in the same stream on top of the base layer.
These individual layers are additive, meaning that there is no unneeded duplication of information between the layers. The total bandwidth required to send a video stream should be roughly the same as a non-SVC stream of the same maximum quality.
Note: The actual parameters of the individual layers are dependent on the specific application so the resolutions and rates shown in the following diagram are simply examples used for illustration and in no way define the actual capabilities of different endpoints within Lync or any other SVC-compatible solution available today. The concept of ‘building-up’ additional layers to provide multiple levels of quality is what is important to take away from this section.
In the following example an encoding endpoint capable of sending up to 1080p resolution is connected to a conference hosted on a Multipoint Control Unit (MCU). Also in the same conference call are four other participants with various supported endpoints capable of displaying different levels of video. The base layer is encoded at a low resolution of 180p at only 15 frames per second, while multiple enhancement layers are added for additional resolutions all the way up to 1080p at 30fps.
In a multi-party conferencing scenario which utilizes SVC each endpoint can ask the MCU for only the highest resolution, frame rate, or quality it can support for a given stream. The MCU will then send the desired layer and all lower layers to each requesting participant. As the mobile phone might be limited to displaying low resolution in this scenario it is sent only the base layer and no additional data, whereas the laptop capable of viewing 720p is sent the first three layers. Multiple layers are sent because each higher enhancement layer only includes the delta of data above the lower layers, so all of the lower layers are required to reassemble the complete frame for that specific layer. Only higher layers are stripped away by the MCU or ignored by an endpoint.
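The relay behavior described here can be sketched in a few lines of Python. The layer ladder and its bitrates are invented for illustration (they are not Lync's or any vendor's real values), and `layers_for` is a hypothetical helper that assumes the layers are listed from lowest to highest.

```python
from typing import List, NamedTuple


class Layer(NamedTuple):
    name: str
    height: int   # vertical resolution in pixels
    fps: float
    kbps: int     # additional bitrate contributed by this layer only


# Example ladder only; these values are made up for illustration.
LAYERS: List[Layer] = [
    Layer("base", 180, 15, 150),
    Layer("enh-1", 360, 30, 350),
    Layer("enh-2", 720, 30, 1000),
    Layer("enh-3", 1080, 30, 2000),
]


def layers_for(max_height: int, layers: List[Layer] = LAYERS) -> List[Layer]:
    """Return the base layer plus every enhancement layer up to the highest
    one the receiver can display. The base layer is always included."""
    selected = [layers[0]]
    for layer in layers[1:]:
        if layer.height > max_height:
            break
        selected.append(layer)
    return selected


def bitrate_kbps(selected: List[Layer]) -> int:
    """Layers are additive, so the relayed bitrate is the sum of the deltas."""
    return sum(layer.kbps for layer in selected)


phone = layers_for(180)
laptop = layers_for(720)
print([l.name for l in phone], bitrate_kbps(phone))    # ['base'] 150
print([l.name for l in laptop], bitrate_kbps(laptop))  # ['base', 'enh-1', 'enh-2'] 1500
```

Note that the relay never re-encodes anything in this sketch; it only decides which layers to pass along, which is the point of the architecture.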
Note: This diagram is an over-simplification of the layering process in SVC as depending on the specification there may actually exist multiple simultaneous streams, each with its own individual layers. This example could be a single stream of four layers or two separate simultaneous streams of multiple layers. Depending on the type of scaling supported and the range of different options required not all of the data may always be provided in a single stream.
Just because the source encoding endpoint in this example supports sending as high as 1080p/30 does not mean that it will do so all the time. If the MCU is aware that no other participant either supports that resolution or is currently asking for it then the sending endpoint does not need to waste bandwidth by sending a layer to the MCU which is not being consumed. Only the enhancement layer providing the highest level of information currently requested, and all layers below that, would need to be sent to the MCU.
The immediate benefit of this architecture is that the MCU no longer has to perform any transcoding of the media streams in order to repackage each outgoing stream in the best possible configuration for every individual endpoint. This reduces the processing load on the central ‘bridge’ and turns the system into a Media Relay. Different adaptations of SVC may still include traditional transcoding capabilities for interoperability with non-SVC participants, while systems without transcoding abilities would typically be limited to providing a “lowest common denominator” experience where all participants would see the video sent by the encoding endpoint at whatever resolution it is able to send at, regardless of their own receiving capabilities.
As more and more disparate endpoints are able to connect to the same video conferences the advantages of scalable video become further evident, yet as with anything there is always a trade-off. This scalable approach does place a higher processing cost on all systems involved as there is clearly a lot more intelligence involved here. In addition to the increased encoding work the endpoints are on the hook for sending any and all resolutions desired (and supported) in the call. In a traditional conference the heavy lifting can be placed on the MCU, providing transcoding and upscaling, to reduce the load on the endpoints and network.
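As a rough sketch of that sender-side decision, the Python below picks the layers an encoder would need to upload given what receivers are currently requesting. The ladder values and the `layers_to_send` name are assumptions for illustration, as is the choice to keep sending the base layer when nobody has requested anything.

```python
from typing import Iterable, List

# Hypothetical layer ladder, lowest first, identified by vertical resolution.
LADDER: List[int] = [180, 360, 720, 1080]


def layers_to_send(requested_heights: Iterable[int],
                   ladder: List[int] = LADDER) -> List[int]:
    """Return the layers the sender must upload: everything up to the highest
    layer any receiver is currently asking for. With no requests, only the
    base layer is kept (an assumption of this sketch)."""
    highest = max(requested_heights, default=ladder[0])
    return [h for h in ladder if h <= highest] or ladder[:1]


print(layers_to_send([180, 720, 360]))  # [180, 360, 720]
print(layers_to_send([]))               # [180]
```

In the first call no receiver wants 1080p, so the top layer is never uploaded even though the encoder is capable of producing it.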
Yet displacing the majority of the workload from a centralized MCU out to the endpoints, as well as providing multiple active video sources to endpoints, does often attach some sticker shock to claims of “1080p multiparty video”. How can a single endpoint support seeing multiple video streams at high resolutions without crippling the network? The answer is simply that this does not happen, as viewing resolution is always limited by real estate. Whether video is viewed on a 15” laptop screen or a 50” LCD display, if both are capable of displaying 1920×1080 pixels then this is the upper limit of pixels which can be used to show video. It is silly to think that this monitor would be able to display four continuous streams of HD video, as the maximum vertical dimension of 1080 pixels is only divisible by 720 pixels (the vertical pixel value for 720p HD) 1.5 times. This means that only one 720p video and only half of a second video window would fit on this screen at full resolution. And when dealing with desktop video conferencing, unless there is a secondary monitor attached to display the video window there is no room left to see any other running applications.
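The real-estate argument is simple arithmetic, sketched below in Python under the assumption of a 1920×1080 display and a plain grid layout with no scaling. The `tiles_that_fit` helper is a made-up name for this example.

```python
def tiles_that_fit(display_w: int, display_h: int, tile_w: int, tile_h: int) -> int:
    """Number of whole tiles of the given size that fit on the display
    without scaling, laid out in a simple grid."""
    return (display_w // tile_w) * (display_h // tile_h)


# Assumed example: a 1920x1080 display.
print(tiles_that_fit(1920, 1080, 1280, 720))  # 1 full-size 720p window
print(tiles_that_fit(1920, 1080, 640, 360))   # 9 windows at 640x360
print(1080 / 720)                             # 1.5
```

Only one full 720p window fits, which is why multiple simultaneous streams are in practice displayed (and therefore requested) at lower resolutions.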
Note: The term “High Definition” (HD) is thrown around often to mean different things, but technically it defines a video resolution ‘higher than that of standard definition’ (SD). The aspect ratio alone cannot always be used to clarify the level of definition as SD can include some widescreen (16:9) formats. The overall pixel count and interlacing modes are more important to identifying the definition type of a given resolution.
Traditionally in video conferencing VGA (640×480) and 480i (720×480) are where SD ends and 720p (1280×720) is where HD begins. There is also a mid-tier resolution of 480p (720×480) which is classified as “Enhanced Definition” but is typically not used in video conferencing.
The following diagram illustrates how viewing multiple active video streams does not exponentially increase bandwidth utilization as additional participants join the conference. Using an example resolution of 720p (1280×720 pixels) there are a limited number of pixels (921,600) available for which to display video.
On the left side a traditional MCU has encoded the four separate participants into a single video stream at 720p resolution and the receiving endpoint is able to display the video in full resolution.
On the right side an SVC-enabled conference with the same participants is shown, where this time each tile is actually a separate VGA stream delivered directly to the endpoint. The four individual sessions can be laid out on the screen based on the client software capabilities.
Where things start to get more complicated is how the overall bandwidth could be calculated and compared between these calls. If in this example RTV was used then the single 720p session would require between 900-1500 kbps while VGA (640×480) sessions in RTV typically consume between 300-600 kbps. This works out to roughly 2.5 to 3 times more bandwidth required when moving from standard definition VGA to high definition 720p. Yet a single 720p stream would be a maximum of 1500 kbps while four simultaneous VGA streams could be up to 2400 kbps at maximum (600 * 4).
The actual mathematics behind calculating video bandwidth is much more complex than this basic explanation, but the concepts are still applicable. So exactly how ‘scalable’ is scalable video coding? As more participants join a conference and attempt to send media the experience will require even more media sessions and additional bandwidth. However, given that the same original piece of screen real estate is used, each participant’s video representation will need to shrink in order to fit more participants on the screen, allowing each stream to utilize an even lower resolution. For example moving up to 16 concurrent video sessions would limit the maximum usable resolution to 320×240 pixels per tile, yet the signaling and processing involved to negotiate that level of concurrency may be more than a given platform could be designed to handle.
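The bandwidth comparison above can be reproduced with a few lines of Python. The bitrate ranges are the RTV figures quoted in the text; the `total_kbps` helper is just an illustrative name, and as noted the real calculation is more involved than simple multiplication.

```python
from typing import Tuple

# RTV bitrate ranges quoted in the text, in kbps.
RTV_KBPS = {"720p": (900, 1500), "VGA": (300, 600)}


def total_kbps(resolution: str, streams: int) -> Tuple[int, int]:
    """Return the (min, max) aggregate bitrate for `streams` concurrent
    streams at the given resolution."""
    low, high = RTV_KBPS[resolution]
    return (low * streams, high * streams)


print(total_kbps("720p", 1))  # (900, 1500)
print(total_kbps("VGA", 4))   # (1200, 2400)
```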
Back in April 2010 a number of industry leaders (Microsoft, HP, Polycom, LifeSize) got together to form the Unified Communications Interoperability Forum (UCIF) focused on creating a set of specifications and guidelines in which companies can utilize to build and adapt their solutions around a common set of interoperable protocols. The forum’s vision is to deliver a rich and reliable Unified Communications experience, initially focusing on the video experience.
What is important about this venture is that Microsoft has since announced they would be adopting H.264 SVC technology from Polycom. They have also published a specifications document entitled Unified Communications Specification for H.264 AVC and SVC UCConfig Modes, the latest draft (v1.1) of which can be found here.
Although this specification document contains a lot of information the most important part for this article is the defined UCConfig Modes which relate to the various scalability modalities introduced in this standard.
- UCConfig Mode 0: Non-Scalable Single Layer AVC Bitstream
- UCConfig Mode 1: SVC Temporal Scalability with Hierarchical P
- UCConfig Mode 2q: SVC Temporal Scalability + Quality/SNR Scalability
- UCConfig Mode 2s: SVC Temporal Scalability + Spatial Scalability
- UCConfig Mode 3: Full SVC Scalability (Temporal + SNR + Spatial)
What these multiple modes define are differing levels of video scalability with additional processing requirements placed on each higher level, but with the returned benefit of reduced overall bandwidth. Take note that the multiple streams at different resolutions are additive so that the overall bandwidth is not three separate complete streams, as the data from the lower quality stream is used to ‘fill in the gaps’ in the higher quality stream. So an endpoint asking to display a 720p stream would be relayed all streams underneath that resolution as well so it has the ‘complete picture’.
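One way to keep the modes straight is to treat each one as a combination of scalability types. The Python below is only an illustrative data structure built from the mode list above; the `Scalability` flag and `supports` helper are names invented for this sketch and do not come from the specification.

```python
from enum import Flag


class Scalability(Flag):
    NONE = 0
    TEMPORAL = 1
    QUALITY = 2
    SPATIAL = 4


# Mode names as listed in the UCConfig mode summary above.
UCCONFIG_MODES = {
    "0": Scalability.NONE,
    "1": Scalability.TEMPORAL,
    "2q": Scalability.TEMPORAL | Scalability.QUALITY,
    "2s": Scalability.TEMPORAL | Scalability.SPATIAL,
    "3": Scalability.TEMPORAL | Scalability.QUALITY | Scalability.SPATIAL,
}


def supports(mode: str, feature: Scalability) -> bool:
    """True when the given UCConfig mode includes the scalability type."""
    return bool(UCCONFIG_MODES[mode] & feature)


print(supports("1", Scalability.TEMPORAL))  # True
print(supports("1", Scalability.SPATIAL))   # False
print(supports("3", Scalability.QUALITY))   # True
```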
Mode 0 means that no scalability is provided. It still supports multiple independent simulcast streams generated by the encoder, so although the scalability features of SVC are not provided in this mode the specification allows for multiple streams for each resolution requested, at a specific frame rate.
Stream   Base Layer
1        720p 30fps
2        360p 30fps
3        180p 15fps
Mode 1 introduces Temporal Scaling which provides an endpoint the ability to send a single video stream per resolution for multiple frame rates. This means that if the endpoint is asked to send two separate resolutions (e.g. 720p and 360p) then it will send two separate video streams at the same time, but the receiving end can display either 30fps, 15fps, or 7.5fps by simply dropping entire frames of the video sequence. So if 30fps is the highest frame rate sent then the receiving end can display the video at 15fps by dropping every other frame, or display 7.5fps by dropping three out of every four frames. It should be noted that the basic H.264 AVC standard already supports some level of temporal video scaling but it was improved upon in the SVC version.
The following diagram depicts how a decoding endpoint would utilize Temporal Scaling to display either 30fps (blue arrows) or 15fps (red arrows). To display 7.5fps every fourth frame would be used (the first and last frame in this example).
Stream   Scalable Layers
1        720p 30fps, 720p 15fps, 720p 7.5fps
2        360p 30fps, 360p 15fps, 360p 7.5fps
3        180p 15fps, 180p 7.5fps
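Temporal scaling in Mode 1 can be illustrated with a short Python sketch that tags each frame of a 30fps sequence with a temporal layer and then keeps only the layers a receiver wants. This is a simplified model with three layers and a four-frame pattern, which is an assumption of the example rather than something taken from the specification.

```python
from typing import List


def temporal_layer(frame_index: int) -> int:
    """Assign a temporal layer to each frame of a 30fps sequence:
    layer 0 = every fourth frame (7.5fps), layer 1 adds the frames needed
    for 15fps, layer 2 adds the remaining frames for the full 30fps."""
    if frame_index % 4 == 0:
        return 0
    if frame_index % 2 == 0:
        return 1
    return 2


def frames_for(target_layer: int, frame_count: int) -> List[int]:
    """Indices a receiver keeps when it decodes up to `target_layer`."""
    return [i for i in range(frame_count) if temporal_layer(i) <= target_layer]


print(frames_for(2, 9))  # [0, 1, 2, 3, 4, 5, 6, 7, 8] -> 30fps
print(frames_for(1, 9))  # [0, 2, 4, 6, 8]             -> 15fps
print(frames_for(0, 9))  # [0, 4, 8]                   -> 7.5fps
```

Dropping a layer never requires re-encoding; the receiver (or a relay) simply discards the frames tagged with higher layers.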
Mode 2q first applies the benefits of Mode 1 to the stream and then introduces Quality Scaling which encodes additional levels of image quality at the same resolution and frame rate as Mode 1. There still exists a separate stream for each resolution (e.g. 3 streams to send 720p, 360p, and 180p) but each stream may include different qualities for a specific resolution and frame rate. A Quantization Parameter (QP) is used to define the level of processing applied to each stream.
The following images illustrate in a simplified manner the results between different QP values (a higher value indicates lower quality).
Stream   Scalable Layers
1        720p 30fps
2        360p 30fps
3        180p 30fps
Mode 2s, like 2q, applies Mode 1 first but where it differs is that instead of providing various Quality levels it leverages Spatial Scaling to intermix multiple adjacent resolutions into the same stream. This provides for more resolution choices (5 different resolutions in the example below) but does not increase to 5 separate streams.
The following images illustrate in a simplified manner the difference in resolutions provided in each individual stream.
Stream   Scalable Layers
1        720p 30fps, 480p 30fps, 480p 15fps
2        360p 30fps, 240p 30fps, 240p 15fps
3        180p 30fps, 180p 15fps
Modes 2s and 2q are not additive, so only one version or the other could be used at one time. Thus Mode 3 is defined as a combination of features included throughout the lower modes providing the benefits of temporal, quality, and spatial scaling across even fewer simultaneous streams. The table below shows a single stream which can now incorporate a mix of adjacent resolutions, multiple frame rates, and different quality levels.
Stream   Scalable Layers
1        720p 30fps
2        360p 30fps
The specification document also lists a large variety of supported resolutions, including both landscape and portrait orientations.
Aspect Ratio   Resolutions
16:9           1920 x 1080, 1280 x 720, 960 x 540, 848 x 480, 640 x 360, 480 x 270, 424 x 240, 320 x 180, 160 x 90
4:3            320 x 240, 160 x 120
20:3           1920 x 288, 1280 x 192, 960 x 144, 640 x 96
9:16           1080 x 1920, 720 x 1280, 540 x 960, 480 x 848, 360 x 640, 270 x 480, 240 x 424, 180 x 320, 90 x 160
3:4            480 x 640, 320 x 424, 240 x 320, 120 x 160
By comparison RTV is limited to providing only a few different resolution options across the same landscape aspect ratios, highlighting just how many new options are available in the UCConfig specification of H.264 SVC.
Aspect Ratio   Resolutions
16:9           1280 x 720
4:3            640 x 480
11:9           352 x 288, 176 x 144
Lync 2013 Video Capabilities
Clearly the UCConfig specification defines a large variety of modes and options, yet this does not mean that Lync 2013 utilizes the entire set of available features. In fact the implementation of H.264 AVC/SVC provided in Lync 2013 appears to only utilize some of the levels, according to this published documentation available in MSDN (under the RTCP specifications) which is related to the H.264 AVC/SVC UCConfig Mode Specification: Video Source Request (VSR).
In this section there exists a specific definition for the maximum UCConfig Mode value that a receiver can support and the valid values are either UCConfig Mode 0 or UCConfig Mode 1. Values of 2 or larger are specifically not allowed.
This would seem to indicate then that Lync 2013 will support UCConfig Mode 0 (for backwards compatibility with H.264 AVC-only compatible systems) while leveraging only Temporal Scaling in Mode 1 for multi-party video conferencing sessions using Lync 2013 clients and servers.
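Expressed as code, the restriction amounts to a simple range check. The Python below is a hypothetical validator written for illustration; it does not parse a real VSR message and the function name is made up.

```python
# Maximum UCConfig mode values a receiver may advertise, per the
# VSR description summarized above.
ALLOWED_MAX_UCCONFIG_MODES = (0, 1)


def validate_max_ucconfig_mode(value: int) -> int:
    """Return the value unchanged if it is allowed, otherwise raise."""
    if value not in ALLOWED_MAX_UCCONFIG_MODES:
        raise ValueError(f"unsupported maximum UCConfig mode: {value}")
    return value


print(validate_max_ucconfig_mode(1))  # 1
try:
    validate_max_ucconfig_mode(2)
except ValueError as err:
    print(err)  # unsupported maximum UCConfig mode: 2
```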
In the New Client Features chapter of the Lync Server 2013 Preview documentation the following items are stated under the Video Enhancements section.
- Video is enhanced with face detection and smart framing, so that a participant’s video moves to help keep him or her centered in the frame.
- High-definition video (up to 1080p resolution) is now supported in conferences.
- Participants can select from different meeting layouts: Gallery View shows all participants’ photos or videos; Speaker View shows the meeting content and only the presenter’s video or photo; Presentation View shows meeting content only; Compact View shows just the meeting controls.
- With the new Gallery feature, participants can see multiple video feeds at the same time. If the conference has more than five participants, video feeds of only the most active participants appear in the top row, and photos appear for the other participants.
- Participants can use video pinning to select one or more of the available video feeds to be visible at all times.
- Presenters can use the “video spotlight” feature to select one person’s video feed so that every participant in the meeting sees that participant only.
- With split audio and video, participants can add their video stream in a conference but dial into the meeting audio.
These items primarily focus on the new Gallery experience where a maximum of 5 video participants can be viewed at the same time when joining a Lync 2013 conference call. The first enhancement worth noticing is that video sessions in conference calls (hosted on the Lync AVMCU) can now scale up to 1080p resolution, which is not possible in Lync 2010 or OCS. Secondly multiple active-speaker video streams are supported, which was also limited to a single active-speaker in previous server versions.
These capabilities are made possible by leveraging SVC and not by the introduction of any increased transcoding abilities of the Lync AVMCU.
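The Gallery rule quoted from the documentation (video for the most active participants, photos for everyone else) could be modeled roughly as follows. The activity scores and the `gallery_layout` helper are assumptions for this sketch; the documentation does not describe how Lync actually ranks participants.

```python
from typing import Dict, List, Tuple

MAX_GALLERY_VIDEOS = 5  # limit described in the preview documentation


def gallery_layout(activity: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Split participants into (video feeds, photos). `activity` maps a
    participant name to a hypothetical 'how active' score."""
    ranked = sorted(activity, key=lambda name: activity[name], reverse=True)
    return ranked[:MAX_GALLERY_VIDEOS], ranked[MAX_GALLERY_VIDEOS:]


scores = {"A": 0.9, "B": 0.1, "C": 0.7, "D": 0.5, "E": 0.3, "F": 0.8, "G": 0.2}
videos, photos = gallery_layout(scores)
print(videos)  # ['A', 'F', 'C', 'D', 'E']
print(photos)  # ['G', 'B']
```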
In the New Server Features chapter the following items are also listed under the New Video Features section.
- HD video – users can experience resolutions up to HD 1080P in two-party calls and multiparty conferences.
- Gallery View – in video conferences that have more than two people, users can see videos of participants in the conference. If the conference has more than five participants, video of only the most active participants appear in the top row, and a photo appears for the other participants.
- H.264 video – the H.264 video codec is now the default for encoding video on Lync 2013 Preview clients. H.264 video supports a greater range of resolutions and frame rates, and improves video scalability.
The first statement seems a bit confusing as in the new client features section a resolution of 720p was stated as the limit in conferences, so this may indicate that 1080p resolution is supported in 2-party peer-to-peer calls only. This seems logical as Lync 2010 peer-to-peer calls could support 720p and conference calls were limited to VGA, so it appears that by leveraging H.264 AVC/SVC instead of RTV both of those limitations can be increased by some factor. This also may point to the use of some hardware acceleration to provide these higher resolutions that RTV was not capable of, but that is only an educated guess as there is no mention of hardware acceleration in the available 2013 documentation.
High definition video is now available when connecting to multi-party conferences hosted on a Lync 2013 server. This may not mean though that legacy clients which only support RTV will be able to leverage HD resolutions as RTV is still limited to VGA on the Lync AVMCU.
The Lync Server 2013 AVMCU does not display a single high resolution video stream incorporating multiple participants, traditionally called Continuous Presence, in the way that third-party video conferencing bridges can by decoding and re-encoding multiple streams. These conferencing solutions can provide for many more than 5 endpoints simultaneously in a single video stream comprised of various standards-based room systems, immersive telepresence rooms, mobile clients, and even Microsoft SIP clients like OCS or Lync. (The Lync 2013 native video conferencing experience and capabilities will be covered in more detail in a future article.)
At this point it should be pretty clear that even though Lync now supports the H.264 video codec, the signaling interoperability is still a key component. Without that level of connectivity first being addressed (via either native registration or gateways) there is no chance of negotiating any type of communication between the disparate systems.
Additionally, to revisit an earlier statement, although it appears that UCConfig Mode 0 support provides some interoperability with H.264 AVC, exactly what is that level of compatibility? While the media payload as defined by the UCConfig specification should be compatible with any other video conferencing system that already supports H.264 AVC Baseline Profile, that does not necessarily mean that the media control protocols are completely compatible. Thus even if a currently available signaling gateway or natively registered endpoint is able to connect to Lync Server 2013 it may still not be able to negotiate media even if both support H.264 Baseline Profile. An existing media gateway could be updated to address this but the traffic would still need to proxy through the gateway (for control protocol reasons, not media transcoding) in the event that native registration is not utilized.
Realistically this is what the creation of the UCI Forum and the publishing of the UCConfig specifications were designed for, so that third parties and partners can update their solutions to interoperate with this new specification. So although this is not a silver bullet for Lync and third-party video interoperability it is much closer than what was previously available, as by moving away from the proprietary RTV codec and towards the open-standards-based H.264 codec the amount of work required to design and qualify Lync-compatible systems is greatly reduced.
While these new features and capabilities in Lync Server 2013 will certainly bring video conferencing to the forefront of the desktop UC experience, Lync will still rely on partners to continue to provide interconnectivity to the masses of existing video conferencing solutions, as well as to add video conferencing value and performance to the overall platform. Microsoft has made the first strides in adopting an industry-standards-based scalable video codec and it is now up to the rest of the players in the industry to decide if they will follow suit. It is also worth mentioning that if other vendors go higher up the SVC scale to adopt capabilities available in Modes 2 or 3, then as long as the defined UCConfig specifications are adhered to those systems will still be backward compatible with the Mode 1 usage available in Lync 2013.
The bottom line is that traditional third-party video systems will still need to either continue supporting native registration or continue utilizing a signaling gateway to communicate with Lync 2013.