{"id":2111,"date":"2023-01-17T11:38:15","date_gmt":"2023-01-17T06:08:15","guid":{"rendered":"https:\/\/www.enablex.io\/insights\/?p=2111"},"modified":"2025-07-02T19:47:43","modified_gmt":"2025-07-02T14:17:43","slug":"scalable-webrtc-speech-to-text-system","status":"publish","type":"post","link":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/","title":{"rendered":"Building a scalable WebRTC-based Speech to Text system"},"content":{"rendered":"<p><span data-contrast=\"auto\">WebRTC has been around for over a decade and promises excellent scalability for most use cases. You\u2019d often have come across Video CPaaS vendors who\u2019re democratising and actively boosting the spread of WebRTC for usage by app developers, so the latter can use it for various use cases without necessarily understanding the core tech in deep detail.&nbsp;&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">Like many technologies, WebRTC is inherently complex. To make a production grade WebRTC application, one requires a complex intermix of engineering spanning across network engineering, video engineering, VOIP or similar RTC custom protocols, UI\/UX etc.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">WebRTC\u2019s deep integration with <a href=\"https:\/\/www.enablex.io\/cpaas\/\" target=\"_blank\" rel=\"noopener\">CPaaS<\/a> vendors brings a win-win model for application developers. 
This happens through economies of scale, continuous and rapid enhancement of the WebRTC-based video platform, and compatibility and support for ever-changing browser stacks and mobile development frameworks.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">EnableX has been at the forefront of offering all these essentials through an exhaustive set of DIY Video APIs, while also providing a no-code, yet highly customisable, off-the-shelf UI-based video platform.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">While building this, we set out to offer speech to text functionality for our <a href=\"https:\/\/www.enablex.io\/insights\/the-most-comprehensive-guide-on-webrtc\/\" target=\"_blank\" rel=\"noopener\">WebRTC<\/a>-based CPaaS platform \u2013 EnableX. This is one of the features our customers request most, as it opens new possibilities for post-session AI analysis.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">Familiar as it may seem, you would have experienced this in different forms in existing enterprise video communication applications such as Microsoft Teams or Zoom. 
However, with EnableX, you can fetch the converted text in real time and embed it within your business application.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">For example, imagine an app that takes the speech to text output and uses NLP engines to further assess grammar construction, speech style, word complexity, vocabulary, etc.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">Another possibility: a video session with a large group where participants join from across the world and do not share a common language. 
Real-time speech conversion in each participant\u2019s spoken language can help overcome the language barrier.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">So, how do you create a WebRTC-based speech to text system?<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-contrast=\"auto\">Your first attempt could be to use the Web Audio API to capture the browser audio at every endpoint participating in the video session and then send it to a cloud-based speech to text service.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">Significant problems with the above approach:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<ol>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"13\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><strong><span data-contrast=\"auto\">Doubles the endpoint bandwidth <\/span><\/strong><span data-contrast=\"auto\">\u2013 Audio data is sent both to the cloud-based speech to text service and duplicated audio packets are also sent into the video 
session.<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"13\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><strong><span data-contrast=\"auto\">Non-scalable<\/span><\/strong><span data-contrast=\"auto\"> \u2013 The cost is high for a large group (say 100+) where everyone sends their audio to the speech to text platform and the converted text is shared with everyone else. It can also choke client uplink\/downlink bandwidth (each client continuously receives speech to text data from 100+ people), so it is not a technically feasible solution.<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"13\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><strong><span data-contrast=\"auto\">Device support<\/span><\/strong><span data-contrast=\"auto\"> \u2013 For mobile apps built on native or hybrid frameworks, you have to work out how to replicate the speech to text functionality and develop independent solutions.<\/span><\/li>\n<\/ol>\n<p><span data-contrast=\"auto\">The next attempt is to move it to the server side. Let the server do the heavy lifting of continuously converting speech to text. 
Of course, there are many challenges to overcome on the server side \u2013 WebRTC audio is transported over SRTP streams and encoded with Opus, the cloud-based speech to text service must be accessible from and interoperable with the server stack, and the system must scale both horizontally and vertically.<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">But beating the scalability problem is still difficult. Let us look at it in more detail.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-contrast=\"auto\">The average human speaking rate is about 150 words per minute (wpm). If the mic activity of 50 people is captured and continuously converted, that is 150 * 50 = 7,500 wpm. At an average of about 6 characters per word, this totals 45,000 characters per minute. However, speech to text results are continuously refined during the conversion process, so the actual data rate is much higher.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-contrast=\"auto\">A practical way to solve this is to convert speech to text only for the top \u201cN\u201d active talkers and share the converted text with each user. 
The active talker functionality in interactive video platforms lets you select and send the most recent top talkers.&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><a href=\"https:\/\/www.enablex.io\/insights\/active-talker-giving-an-edge-to-your-video-conference\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\"><span data-ccp-charstyle=\"Hyperlink\">https:\/\/www.enablex.io\/insights\/active-talker-giving-an-edge-to-your-video-conference\/<\/span><\/span><\/a><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><span data-contrast=\"auto\">Given the unique capabilities of EnableX, such as server-side active talker, a horizontally scalable video platform, and its interoperable capabilities for Voice and Voice API working side by side using the highly scalable EnableX Gateway, we\u2019ve built a highly scalable, on-demand speech to text capability that can be initiated with a single API call.<\/span><\/p>\n<p><span data-contrast=\"auto\">Here\u2019s how it comes alive:<\/span><\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-2113 size-full\" src=\"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2023\/01\/EnableX-WebRTC-Speech-to-Text-Model.jpg\" alt=\"WebRTC - Speech to Text\" width=\"673\" height=\"316\" srcset=\"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2023\/01\/EnableX-WebRTC-Speech-to-Text-Model.jpg 673w, https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2023\/01\/EnableX-WebRTC-Speech-to-Text-Model-300x141.jpg 300w\" sizes=\"(max-width: 673px) 100vw, 673px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong><span data-contrast=\"auto\">Using EnableX\u2019s speech to text feature, you get the following key advantages:<\/span><\/strong><span 
data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<ol>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"14\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><strong><span data-contrast=\"auto\">Horizontally scalable &#8211; <\/span><\/strong><span data-contrast=\"auto\">You can run any number of concurrent rooms and they all can be transcribed.<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"14\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><strong><span data-contrast=\"auto\">Always ON or can be enabled on demand \u2013<\/span><\/strong><span data-contrast=\"auto\"> Rooms can be configured to be always transcribed or transcription can be enabled by room participants on-demand<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"14\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Speaker identification<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"14\" 
data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Support for 100+ languages<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"14\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Using EnableX data store feature, transcription for the room can be either saved to be fetched later or can be stored at customer\u2019s end<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"14\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559684&quot;:-1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\"><strong>Speech to text converter joins as a silent participant into the room<\/strong> \u2013 This can be useful for audit purpose<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335551550&quot;:6,&quot;335551620&quot;:6,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/li>\n<\/ol>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559685&quot;:720,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><strong><span data-contrast=\"auto\">To get access to EnableX Speech 
to text API, please refer to the links below:<\/span><\/strong><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><strong><span data-contrast=\"auto\">Web &#8211; JavaScript SDK<\/span><\/strong><\/p>\n<p><a href=\"https:\/\/www.enablex.io\/developer\/video-api\/client-api\/web-toolkit\/live-transcription\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\"><span data-ccp-charstyle=\"Hyperlink\">Live Transcription: Web SDK \u2013 Video API \u2013 EnableX Developer Centre<\/span><\/span><\/a><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><strong><span data-contrast=\"auto\">Android \u2013<\/span><\/strong><\/p>\n<p><a href=\"https:\/\/enablex22.vcloudx.com\/developer\/video-api\/client-api\/ios-toolkit\/live-transcription\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\"><span data-ccp-charstyle=\"Hyperlink\">Live Transcription: iOS SDK \u2013 Video API \u2013 EnableX Developer Centre (vcloudx.com)<\/span><\/span><\/a><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p><strong><span data-contrast=\"auto\">iOS<\/span><\/strong><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<strong><span data-contrast=\"auto\">\u2013<\/span><\/strong><\/span><\/p>\n<p><a href=\"https:\/\/enablex22.vcloudx.com\/developer\/video-api\/client-api\/ios-toolkit\/live-transcription\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\"><span data-ccp-charstyle=\"Hyperlink\">Live Transcription: iOS SDK \u2013 Video API \u2013 EnableX Developer Centre (vcloudx.com)<\/span><\/span><\/a><span 
data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:276}\">&nbsp;<\/span><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>WebRTC has been around for over a decade and promises excellent scalability for most use cases. You\u2019d often have come across Video CPaaS vendors who\u2019re democratising and actively boosting the spread of WebRTC for usage by app developers, so the latter can use it for various use cases without necessarily understanding the core tech in &#8230;<\/p>\n","protected":false},"author":25,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[23],"tags":[47],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Building Speech to Text System in WebRTC Calling: a guide<\/title>\n<meta name=\"description\" content=\"Learn how to build a scalable WebRTC-based speech to text system. Explore the technologies and best practices for accurate and efficient transcription.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building Speech to Text System in WebRTC Calling: a guide\" \/>\n<meta property=\"og:description\" content=\"Learn how to build a scalable WebRTC-based speech to text system. 
Explore the technologies and best practices for accurate and efficient transcription.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/\" \/>\n<meta property=\"og:site_name\" content=\"Insights about video API, SMS API; WhatsApp for Business API\" \/>\n<meta property=\"article:published_time\" content=\"2023-01-17T06:08:15+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-02T14:17:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2023\/01\/EnableX-WebRTC-Speech-to-Text-Model.jpg\" \/>\n<meta name=\"author\" content=\"Jason Wills\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@enablexio\" \/>\n<meta name=\"twitter:site\" content=\"@enablexio\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jason Wills\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Building Speech to Text System in WebRTC Calling: a guide","description":"Learn how to build a scalable WebRTC-based speech to text system. Explore the technologies and best practices for accurate and efficient transcription.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/","og_locale":"en_US","og_type":"article","og_title":"Building Speech to Text System in WebRTC Calling: a guide","og_description":"Learn how to build a scalable WebRTC-based speech to text system. 
Explore the technologies and best practices for accurate and efficient transcription.","og_url":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/","og_site_name":"Insights about video API, SMS API; WhatsApp for Business API","article_published_time":"2023-01-17T06:08:15+00:00","article_modified_time":"2025-07-02T14:17:43+00:00","og_image":[{"url":"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2023\/01\/EnableX-WebRTC-Speech-to-Text-Model.jpg"}],"author":"Jason Wills","twitter_card":"summary_large_image","twitter_creator":"@enablexio","twitter_site":"@enablexio","twitter_misc":{"Written by":"Jason Wills","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/#article","isPartOf":{"@id":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/"},"author":{"name":"Jason Wills","@id":"https:\/\/www.enablex.io\/insights\/#\/schema\/person\/422d2b153c3c96827da141c6446d11a3"},"headline":"Building a scalable WebRTC-based Speech to Text system","datePublished":"2023-01-17T06:08:15+00:00","dateModified":"2025-07-02T14:17:43+00:00","mainEntityOfPage":{"@id":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/"},"wordCount":1004,"publisher":{"@id":"https:\/\/www.enablex.io\/insights\/#organization"},"keywords":["webrtc"],"articleSection":["TechTalks"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/","url":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/","name":"Building Speech to Text System in WebRTC Calling: a guide","isPartOf":{"@id":"https:\/\/www.enablex.io\/insights\/#website"},"datePublished":"2023-01-17T06:08:15+00:00","dateModified":"2025-07-02T14:17:43+00:00","description":"Learn how to build a scalable WebRTC-based speech to text system. 
Explore the technologies and best practices for accurate and efficient transcription.","breadcrumb":{"@id":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.enablex.io\/insights\/scalable-webrtc-speech-to-text-system\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.enablex.io\/insights\/"},{"@type":"ListItem","position":2,"name":"Building a scalable WebRTC-based Speech to Text system"}]},{"@type":"WebSite","@id":"https:\/\/www.enablex.io\/insights\/#website","url":"https:\/\/www.enablex.io\/insights\/","name":"Enablex","description":"","publisher":{"@id":"https:\/\/www.enablex.io\/insights\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.enablex.io\/insights\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.enablex.io\/insights\/#organization","name":"Enablex","url":"https:\/\/www.enablex.io\/insights\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.enablex.io\/insights\/#\/schema\/logo\/image\/","url":"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2023\/05\/EnableX-Logo-01.png","contentUrl":"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2023\/05\/EnableX-Logo-01.png","width":17382,"height":3567,"caption":"Enablex"},"image":{"@id":"https:\/\/www.enablex.io\/insights\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/twitter.com\/enablexio","https:\/\/www.linkedin.com\/company\/vcloudx"]},{"@type":"Person","@id":"https:\/\/www.enablex.io\/insights\/#\/schema\/person\/422d2b153c3c96827da141c6446d11a3","name":"Jason 
Wills","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.enablex.io\/insights\/#\/schema\/person\/image\/","url":"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2025\/05\/envato-labs-ai-f14f6981-d7f8-4c3e-9234-00323c7d5ca0-96x96.jpg","contentUrl":"https:\/\/www.enablex.io\/insights\/wp-content\/uploads\/2025\/05\/envato-labs-ai-f14f6981-d7f8-4c3e-9234-00323c7d5ca0-96x96.jpg","caption":"Jason Wills"},"description":"Jason works behind the scenes at EnableX, helping to turn complex tech into practical tools that developers and businesses can actually use. With several years of experience in product development and platform architecture, he focuses on making communication technologies simpler, smarter and easier to build with. Whether he's writing step-by-step guides, product tips or explaining how our APIs work, Jason keeps things clear and useful.","sameAs":["https:\/\/www.enablex.io\/","https:\/\/www.linkedin.com\/company\/vcloudx\/"],"url":"https:\/\/www.enablex.io\/insights\/author\/jason-wills\/"}]}},"_links":{"self":[{"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/posts\/2111"}],"collection":[{"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/users\/25"}],"replies":[{"embeddable":true,"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/comments?post=2111"}],"version-history":[{"count":0,"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/posts\/2111\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/media?parent=2111"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/categories?post=2111"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.enablex.io\/insights\/wp-json\/wp\/v2\/tags?post=21
11"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}