Creating a Production-Ready WebRTC App for Video Calls: 5 Considerations for Developers


Live video calling is revolutionizing the way we interact with people across the globe. But video calls aren’t just about live video. The ability to integrate communication into the workflow to achieve specific goals — to make it contextual to daily work — is crucial, and WebRTC is one proven platform to help achieve those goals.

Unlike Skype, which is a stand-alone communication platform, WebRTC is an open-source technology that lets you add real-time communication features, such as live video calling, directly to browser applications and websites. It is exposed as a set of JavaScript APIs for easy integration, so users do not have to download software or install plugins.

Before you tackle using WebRTC on your own, there are a few questions to consider and issues to understand, especially when deploying large-scale video call apps.

Do I Need A Media Server?

A media server is not required if you only need to connect two or three participants in a video call, but that assumes they are on a high-speed, uncongested network. Distribution and latency issues kick in once you move from a one-to-one video session to more than two participants, or when you anticipate large-scale webinars or broadcasts. And when you want to add features such as recording, transcoding, and connecting calls to the PSTN, the burden increases.

This is because WebRTC is built around direct peer-to-peer connections, so a multiparty call without a server forms a mesh topology: each participant in a session is connected directly to every other participant. In a call with n participants, each one has to send its stream to n - 1 peers and receive and decode n - 1 streams. When the number of participants increases, maintaining these direct connections (including encoding and decoding) becomes CPU-intensive and unsustainable (see diagram below). As a result, the quality of the video call degrades, in the form of frozen video and sound problems.
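To make the scaling problem concrete, here is a minimal sketch that counts the connections and streams in a full mesh. The `meshCost` helper and its field names are illustrative only and are not part of any WebRTC API.

```typescript
// Illustrative helper (not a WebRTC API): cost of a full-mesh call.
interface MeshCost {
  totalConnections: number;       // distinct peer-to-peer links in the call
  uplinkStreamsPerPeer: number;   // streams each peer sends
  downlinkStreamsPerPeer: number; // streams each peer receives and decodes
}

function meshCost(participants: number): MeshCost {
  if (participants < 2) {
    return { totalConnections: 0, uplinkStreamsPerPeer: 0, downlinkStreamsPerPeer: 0 };
  }
  const others = participants - 1;
  return {
    // Every pair of participants shares one connection: n * (n - 1) / 2.
    totalConnections: (participants * others) / 2,
    uplinkStreamsPerPeer: others,
    downlinkStreamsPerPeer: others,
  };
}

const fourPartyMesh = meshCost(4);
const tenPartyMesh = meshCost(10);
```

Going from four to ten participants raises the number of connections from 6 to 45, and each participant's uplink from 3 streams to 9, which is why mesh calls stop being practical quickly.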

Server-based topologies such as Selective Forwarding Unit (SFU) or Multipoint Control Unit (MCU) can help address these limitations.



Server-based topologies for scalability

An SFU-based topology is computationally less demanding than an MCU. It does not require transcoding and mixing, making it more scalable and economical. Each participant sends their video stream to the server. The server then forwards the packets of those streams to the other participants (the other subscribers’ browsers). This minimizes the uplink bandwidth needed from each participant, because each one uploads once instead of once per peer. Furthermore, with a server in place, developing other functions such as transcoding, recording, and Session Initiation Protocol (SIP) integration becomes simpler (these are discussed below).




Bonus Tip: Plan for the long term: even if your current requirement is only 1-to-1 calling, make sure your implementation can scale easily in the future. Also note that there is no QoS guarantee with the WebRTC stack, and that a TURN server is required to relay media when users are on a restricted network where a direct peer-to-peer connection cannot be established.
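As a rough sketch of what planning for restricted networks looks like, the object below has the shape that a browser's RTCPeerConnection constructor accepts for its ICE servers. The local type names, the `buildIceConfig` helper, the example.com hostnames, and the credentials are all placeholders assumed for illustration.

```typescript
// Local types mirroring the fields of the browser's RTCIceServer and RTCConfiguration.
interface IceServerEntry {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConnectionConfig {
  iceServers: IceServerEntry[];
}

// Hostnames, ports and credentials are placeholders, not real services.
function buildIceConfig(turnUser: string, turnPassword: string): PeerConnectionConfig {
  return {
    iceServers: [
      // STUN lets a client learn its public address so a direct path can be tried first.
      { urls: "stun:stun.example.com:3478" },
      // TURN relays media when no direct path can be established.
      {
        urls: [
          "turn:turn.example.com:3478?transport=udp",
          "turns:turn.example.com:5349?transport=tcp",
        ],
        username: turnUser,
        credential: turnPassword,
      },
    ],
  };
}

const iceConfig = buildIceConfig("demo-user", "demo-secret");
```

In a browser, an object like `iceConfig` would typically be passed as `new RTCPeerConnection(iceConfig)`. Finding the other peer in the first place is the job of your own signaling channel, not of STUN or TURN.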

How Do I Optimise Bandwidth?

Let’s say you decided to go with an SFU. You might want to introduce simulcast as well. In a perfect scenario where bandwidth is not an issue, you’d have all participants on high-quality 720p video resolution, which requires 1.5Mbps per participant. In the four-party example below, each participant in the session sends out one 1.5Mbps stream and receives three streams of 1.5Mbps, so the media server needs to receive 6Mbps (4 x 1.5Mbps) and send out 18Mbps (12 x 1.5Mbps).


Resolution                    720p (1.5Mbps)
Individual User (outgoing)    1.5Mbps (1 stream)
Individual User (incoming)    4.5Mbps (3 streams)
SFU (outgoing)                18Mbps (12 streams)
SFU (incoming)                6Mbps (4 streams)
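The figures in this table can be reproduced with a few lines of arithmetic. The sketch below assumes every participant publishes one stream at the same bitrate and subscribes to everyone else; the `sfuLoadWithoutSimulcast` name and the `SfuLoad` fields are made up for this example, and values are in kbps to keep the arithmetic in whole numbers.

```typescript
interface SfuLoad {
  userOutgoingKbps: number;
  userIncomingKbps: number;
  sfuIncomingKbps: number;
  sfuOutgoingKbps: number;
  sfuOutgoingStreams: number;
}

// Assumes one stream per participant, all at the same bitrate, no simulcast.
function sfuLoadWithoutSimulcast(participants: number, bitrateKbps: number): SfuLoad {
  const others = participants - 1;
  // The SFU forwards each participant's stream to every other participant.
  const sfuOutgoingStreams = participants * others;
  return {
    userOutgoingKbps: bitrateKbps,
    userIncomingKbps: others * bitrateKbps,
    sfuIncomingKbps: participants * bitrateKbps,
    sfuOutgoingKbps: sfuOutgoingStreams * bitrateKbps,
    sfuOutgoingStreams: sfuOutgoingStreams,
  };
}

// Four participants at 1500 kbps (720p): SFU receives 6000 kbps and sends 18000 kbps.
const fourPartyCall = sfuLoadWithoutSimulcast(4, 1500);
```

Because the outgoing stream count grows as n x (n - 1), the server's outbound bandwidth is the first number to watch as calls get larger.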


Realistically, sustained high bandwidth for each participant is unlikely, even more so as the number of participants increases. When participants have different amounts of bandwidth and simulcast is not used, the SFU receives feedback from each participant about their network connection and streams at the lowest bitrate to ensure all participants in the session can view the content. The obvious problem with this is that one participant’s poor connection can lower quality for all participants.

Simulcast-Optimized Video Quality

Simulcast, on the other hand, is a mechanism by which a device sends several encodings of the same video at the same time, each at a different bitrate.



With simulcast, a user/client will encode their video into multiple different bitrates. These video streams are then received by the SFU, and the SFU can pick which stream to send to which participant, based on subscribers’ available bandwidth. This is extremely useful in a broadcast scenario. Each participant can consume whichever bitrate is best suited for them.
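A minimal sketch of both halves of this mechanism follows. The layer objects use the field names of the browser's RTCRtpEncodingParameters (`rid`, `maxBitrate`, `scaleResolutionDownBy`), but the three-layer ladder, its bitrates, and the `pickLayer` selection rule are assumptions for illustration; a real SFU will have its own selection logic.

```typescript
// Field names mirror the browser's RTCRtpEncodingParameters.
interface SimulcastLayer {
  rid: string;                   // identifier for the layer
  maxBitrate: number;            // bits per second
  scaleResolutionDownBy: number; // 1 means full resolution
}

// Assumed ladder for a 720p camera; the bitrates are illustrative.
const layers: SimulcastLayer[] = [
  { rid: "q", maxBitrate: 150000, scaleResolutionDownBy: 4 },
  { rid: "h", maxBitrate: 500000, scaleResolutionDownBy: 2 },
  { rid: "f", maxBitrate: 1500000, scaleResolutionDownBy: 1 },
];

// SFU side (hypothetical rule): forward the highest layer that fits the
// subscriber's estimated bandwidth, falling back to the lowest layer.
function pickLayer(available: SimulcastLayer[], subscriberBps: number): SimulcastLayer {
  const sorted = available.slice().sort((a, b) => a.maxBitrate - b.maxBitrate);
  const lowest = sorted[0];
  if (lowest === undefined) {
    throw new Error("publisher offers no simulcast layers");
  }
  let chosen = lowest;
  for (const layer of sorted) {
    if (layer.maxBitrate <= subscriberBps) {
      chosen = layer;
    }
  }
  return chosen;
}
```

In a browser, an array like `layers` is what you would pass as `sendEncodings` when adding a send transceiver. With the rule above, a subscriber estimated at 600kbps would receive the "h" layer while one at 2Mbps would receive "f", so a slow subscriber no longer drags everyone else down.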

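Using the above example, with simulcast the media server would need just 8.4Mbps outgoing, down from 18Mbps, and 8.8Mbps incoming (assuming that, for each participant, two of the three incoming streams are shown in a smaller window and thus only need 300kbps each). Incoming traffic at the server rises from 6Mbps because every participant now uploads several bitrates, but the total still drops from 24Mbps to 17.2Mbps. As you can see, a simulcast environment improves the effectiveness of bandwidth consumption and the quality of group calls: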

Resolution                    Assuming 720p is the highest (150kbps to 1.5Mbps)
Individual User (outgoing)    2.2Mbps (video stream with multiple bitrates)
Individual User (incoming)    2.1Mbps (1 stream at 1.5Mbps and 2 streams at 0.3Mbps each)
SFU (outgoing)                8.4Mbps (12 streams)
SFU (incoming)                8.8Mbps (4 streams)
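The simulcast figures can be checked the same way. This sketch assumes every participant uploads the same total simulcast bitrate and receives the same mix of layers; `simulcastLoad` and its parameter names are made up for the example, and values are in kbps.

```typescript
interface SimulcastLoad {
  userIncomingKbps: number;
  sfuIncomingKbps: number;
  sfuOutgoingKbps: number;
  sfuOutgoingStreams: number;
}

// uplinkKbps: total of all simulcast layers one participant uploads.
// subscribedKbps: bitrate of each stream one participant receives.
function simulcastLoad(
  participants: number,
  uplinkKbps: number,
  subscribedKbps: number[]
): SimulcastLoad {
  const userIncomingKbps = subscribedKbps.reduce((sum, kbps) => sum + kbps, 0);
  return {
    userIncomingKbps: userIncomingKbps,
    sfuIncomingKbps: participants * uplinkKbps,
    sfuOutgoingKbps: participants * userIncomingKbps,
    sfuOutgoingStreams: participants * subscribedKbps.length,
  };
}

// Four participants, 2200 kbps uplink each, one large and two small windows.
const simulcastCall = simulcastLoad(4, 2200, [1500, 300, 300]);
```

The server's outgoing traffic falls because most forwarded streams are the 300kbps layer, while its incoming traffic rises because each participant uploads 2.2Mbps instead of 1.5Mbps.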


Bonus Tip: For large-scale calls, you don’t have to “squeeze” everyone onto the window. Instead, you can just display the last few active participants on the screen. We call this Active Talker. Together with simulcast, you get a more realistic and efficient user experience. Note that Active Talker is not a built-in WebRTC feature, so it has to be implemented on top of it.
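One simple way to approximate this behaviour is to keep a short, most-recent-first list of speakers. The sketch below is a hypothetical approach and not a description of how any particular platform implements Active Talker; it assumes something else (for example, audio level detection on the server) reports who is speaking.

```typescript
// Returns a new most-recent-first list containing at most maxVisible speakers.
function updateActiveTalkers(
  recent: string[],
  speakerId: string,
  maxVisible: number
): string[] {
  // Move the current speaker to the front and drop any older entry for them.
  const withoutSpeaker = recent.filter((id) => id !== speakerId);
  return [speakerId].concat(withoutSpeaker).slice(0, maxVisible);
}

// Example: show at most three participants on screen.
let onScreen: string[] = [];
for (const speaker of ["alice", "bob", "carol", "alice", "dave"]) {
  onScreen = updateActiveTalkers(onScreen, speaker, 3);
}
```

After that sequence of speakers, `onScreen` holds dave, alice, and carol, in that order; bob has dropped off because he spoke least recently.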

How Do I Connect To VoIP/PSTN?

While web browsers and native apps tend to be the primary focus of modern communication, we cannot ignore users with a VoIP-based endpoint or mobile or landline connection over PSTN. Therefore, it’s imperative for users to be able to dial into an active WebRTC-based session from a phone or have their phone ring when invited to join.

SIP Signaling for WebRTC Apps

To do this, you need a gateway or switch that can speak the protocol used by VoIP phones everywhere: SIP. Although WebRTC uses the same underlying protocols that VoIP uses, including Real-time Transport Protocol (RTP), RTP Control Protocol (RTCP), Secure Real-time Transport Protocol (SRTP), and Session Description Protocol (SDP), it has no native support for SIP-based signaling. Instead, choosing how to signal for call establishment is left to the developer. As a result, connecting WebRTC to PSTN/VoIP endpoints requires a WebRTC/SIP gateway. So you have two choices: either develop one in-house or employ an existing gateway. Developing in-house requires enough understanding of both WebRTC and SIP protocols to make the two work together cohesively.
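To give a feel for the signaling side of such a gateway, here is a deliberately simplified sketch that wraps an SDP offer in a SIP INVITE request. The `buildSipInvite` helper, the `InviteParams` type, and all hostnames and identifiers are hypothetical. A real gateway also has to rewrite the SDP, bridge media encryption, and handle responses and retransmissions, none of which is shown.

```typescript
interface InviteParams {
  targetUri: string;   // SIP URI of the callee, e.g. a phone number at a provider
  fromUri: string;     // SIP URI representing the WebRTC caller
  gatewayHost: string; // host name of the gateway sending the request
  callId: string;
  branch: string;      // should start with the RFC 3261 magic cookie "z9hG4bK"
  fromTag: string;
  sdpOffer: string;    // assumed to be ASCII
}

function buildSipInvite(p: InviteParams): string {
  const lines = [
    "INVITE " + p.targetUri + " SIP/2.0",
    "Via: SIP/2.0/TCP " + p.gatewayHost + ";branch=" + p.branch,
    "Max-Forwards: 70",
    "From: <" + p.fromUri + ">;tag=" + p.fromTag,
    "To: <" + p.targetUri + ">",
    "Call-ID: " + p.callId,
    "CSeq: 1 INVITE",
    "Contact: <sip:gateway@" + p.gatewayHost + ";transport=tcp>",
    "Content-Type: application/sdp",
    // Content-Length counts bytes; it equals the string length only for ASCII bodies.
    "Content-Length: " + p.sdpOffer.length,
  ];
  // SIP uses CRLF line endings and a blank line between headers and body.
  return lines.join("\r\n") + "\r\n\r\n" + p.sdpOffer;
}

const sampleSdp = "v=0\r\no=- 0 0 IN IP4 192.0.2.10\r\ns=-\r\nt=0 0\r\n";
const invite = buildSipInvite({
  targetUri: "sip:+15550100@pstn.example.com",
  fromUri: "sip:webuser@gateway.example.com",
  gatewayHost: "gateway.example.com",
  callId: "demo-call-1@gateway.example.com",
  branch: "z9hG4bKdemo1",
  fromTag: "demo-tag",
  sdpOffer: sampleSdp,
});
```

The point of the sketch is that the session description travels inside SIP largely unchanged, while everything around it (the request line, dialog identifiers, routing headers) is what the gateway adds on WebRTC's behalf.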

Bonus Tip: To connect to a PSTN line you may also need to deal with legal issues. For example, you cannot legally mix VoIP and PSTN traffic for VoIP calls originating in the Indian subcontinent.

How Can I Add Recording As Part Of My Workflow?

Both client-side and server-side recording techniques help you work around the challenges of recording with mesh networks. JavaScript coders tend to favor client-side recording, but it comes with limitations. For instance, videos are recorded and stored locally for later use, so you have no visibility into how much storage is available on the user’s device or how much a recording will need. And because WebRTC media is sent over UDP, recorded video quality may be suboptimal if there is packet loss on the transport channel.
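A back-of-the-envelope estimate helps with the storage question. The sketch below assumes the recording is written at roughly a constant bitrate and ignores container overhead and audio; the function names are illustrative.

```typescript
// Approximate size of a recording at a constant bitrate.
function estimateRecordingBytes(bitrateKbps: number, durationSeconds: number): number {
  const bytesPerSecond = (bitrateKbps * 1000) / 8;
  return bytesPerSecond * durationSeconds;
}

// True when the estimated recording fits in the free space with a safety margin.
function fitsInStorage(requiredBytes: number, freeBytes: number, marginRatio: number): boolean {
  return requiredBytes * (1 + marginRatio) <= freeBytes;
}

// One hour of 720p video at 1500 kbps is 675,000,000 bytes, about 675 MB.
const oneHourAt720p = estimateRecordingBytes(1500, 60 * 60);
const fitsInOneGigabyte = fitsInStorage(oneHourAt720p, 1000000000, 0.2);
```

In a browser, the free-space figure could come from the Storage API's `navigator.storage.estimate()`, although what it reports is a quota estimate for the page's origin rather than the raw free space on the disk.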

Server-side Recording

With server-side recording, media isn’t sent from browser to browser. Instead, it is sent through media servers. When the media is ready to be transmitted, the WebRTC session is initiated with the server acting as a session broker. The media is routed to the receiving end via the server and, at the same time, the decoded media is sent for recording and post-processing. Post-processing includes combining the media inputs from all participants into a single media file, changing the format for playback, and compressing the file.

Bonus Tip: Recording media is just one step. You need to also think about archiving, processing metadata, and how, when, and where the playback takes place. Don’t forget to incorporate access security, too.

Is My Media Secure?

WebRTC is inherently safer in several ways. Because it does not use plug-ins, it eliminates one possible vector of attack through malicious plug-ins or malware. In addition, browser patches are deployed quickly and regularly. WebRTC also mandates that all media be encrypted in transit, and browsers only allow camera and microphone capture on pages served from a secure origin such as HTTPS.

Added Security

WebRTC uses Datagram Transport Layer Security (DTLS) to negotiate encryption keys and only permits sending RTP streams encrypted as SRTP. However, WebRTC does not define the signaling layer, so you need to encrypt it yourself, for example with TLS, and you need to secure your media servers as well, since media is typically decrypted there before it is forwarded. For additional security, you might want to consider hosting your media servers on-premises. This is a great option for organizations seeking maximum control and privacy of their data.
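Because the signaling channel is yours to build, a small guard at startup can catch an unencrypted endpoint before it ships. This is a minimal sketch; the `isEncryptedSignalingUrl` name and the example URLs are assumptions, and it only checks the URL scheme, not the server's certificate or configuration.

```typescript
// Accepts only TLS-protected schemes for the signaling endpoint.
function isEncryptedSignalingUrl(url: string): boolean {
  const lower = url.toLowerCase();
  return lower.indexOf("wss://") === 0 || lower.indexOf("https://") === 0;
}

// Throws early instead of silently signaling in clear text.
function requireEncryptedSignaling(url: string): string {
  if (!isEncryptedSignalingUrl(url)) {
    throw new Error("Signaling endpoint must use wss:// or https://: " + url);
  }
  return url;
}

const signalingUrl = requireEncryptedSignaling("wss://signal.example.com/rooms");
```

A check like this is cheap insurance: session descriptions exchanged over signaling include the DTLS certificate fingerprints that peers use to authenticate each other, so tampering with signaling can undermine the media encryption itself.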

Bonus Tip:  Make sure you’re aware of relevant regulations and compliance requirements when developing your apps. For example, healthcare organizations require Health Insurance Portability and Accountability Act (HIPAA) compliance for patient personal data privacy.

DIY or CPaaS?

While source code for WebRTC is free and easily available to developers, creating your own application can be challenging. There are many considerations, such as sourcing media and signaling servers, integration, and security. There is no QoS guarantee in the WebRTC stack, and a TURN server is required to relay media when users are on a restricted network.

Currently, there are not many WebRTC experts available worldwide, making a CPaaS provider a more viable alternative. One such provider is EnableX. Built on a carrier-grade platform, EnableX offers developers all the necessary toolkits to develop engaging communication experiences, from one-to-one chats to large-scale broadcasts and webinars, without needing to build backend infrastructure and interfaces.

A feature-rich, scalable and secure platform, EnableX offers recording, Active Talker, translation, pre/post-call analysis capabilities, and more. The platform is continuously optimized for the highest video quality.

EnableX is a great option for those who are looking to reduce their time to market and their upfront development costs while maintaining a large degree of control over solutions design and development. To find out more, visit

Pankaj Gupta

Pankaj, CEO of vCloudx, has been the driving force behind the company’s strategic focus since its inception in 2017. He is a serial IT & Telecom entrepreneur with more than 20 years of proven experience in building successful businesses in the APAC market. He led his last venture, ConferIndia, for more than 10 years, building it into one of the most successful collaboration services companies in India, before it was acquired by Arkadin, an NTT Communications company.
