National Park Foundation: March on Washington - A Technical Overview

Robert Bader


Cross-platform audio recording and moderation using WebRTC, Web Audio, and the Web Speech API

 

The National Park Service and National Park Foundation present “March On Washington”, a multi-device site developed by UNIT9 for Organic.

This is an HTML5 site celebrating the 50th anniversary of Martin Luther King’s “I Have a Dream” speech. The site lets users explore audio recordings, videos and photos of the event that took place in 1963, and brings the audience in to participate with a gallery of people’s own readings of the speech, which anyone can contribute to.

From a technology point of view the most interesting part of the site was the user-generated audio, which we could record and upload seamlessly with no special plugins. It was also interesting to apply Google’s speech recognition (via the Web Speech API) to the user contributions to aid with moderation when an entry is submitted to the server.

In addition, we cover two other areas: how we synchronised the scrolling transcript with the speech audio, with some tips on using Recorder.js along the way, and how we structured the back-end to handle server-side sound processing on Google App Engine and Google Compute Engine.

Please also note that some additional downloadable example code is on its way, so check back for an update.

Quick Tour

 

The “March On Washington” site is made up of three main sections: a play mode that montages the original historical documents to the original speech, an explore section where you can look at other photographic material from the time, and a user-generated content section.

The main page of the site shows archive photography while we hear Martin Luther King’s speech and watch the words flow below the image in sync with his voice.


For many items we provide a detailed information sheet, along with related video and audio recordings.


The user gallery shows people’s profile pictures, and you can hear their readings of sections of the speech.

WebRTC and Web Speech API Communication

 

When you decide to record your own version of the speech, we ask that you first log in with your Google+ or Facebook account, then pick a chapter and press record. You then speak your own version, which is captured via WebRTC, sent to the server, filtered, and tied to your profile picture. We mix the track with background noise and grain so that it sits in the same sonic style as the original speech.

Getting Recorded Audio

 

There are many tutorials for the getUserMedia feature, but we could find none that combine it with recording an audio track and speech recognition. Here is a quick breakdown of what we did.

From: /prototypes/audioWebRTC_Recognition/js/app.js
      
window.URL = window.URL || window.webkitURL;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia || navigator.msGetUserMedia;

startRecording: function () {
      if ( navigator.getUserMedia ) {
            // ask for microphone access; the success callback receives a MediaStream
            navigator.getUserMedia({audio: true}, this.onGetMediaSuccess, this.onGetMediaFail);
      } else {
            console.log('navigator.getUserMedia not present');
      }
},
onGetMediaSuccess: function ( s ) {
      var self = this.App;
      // wrap the microphone MediaStream in a Web Audio node and hand it to Recorder.js
      self.mediaStreamSource = self.audioContext.createMediaStreamSource( s );
      self.recorder = new Recorder(self.mediaStreamSource);
      self.recorder.record();
},
onGetMediaFail: function () {
      // code flow for when the user rejects microphone access
}

Once the user is done recording, we capture the audio as a binary WAV file and let the user listen back and check it before we send it to the back-end. Here, exportWAV generates a Blob object containing the recording in WAV format:

      
var self = this;
this.recorder.stop();
this.recorder.exportWAV(function(s) {
      // s is a Blob containing the WAV data; expose it as a playable <audio> element for preview
      var audio = document.createElement( 'audio' );
      audio.src = window.URL.createObjectURL(s);
      audio.id = 'record' + self.recordCount;
      document.body.appendChild( audio );
});
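
The Blob produced by exportWAV is what eventually gets uploaded. As a rough sketch of that step (the endpoint name and form fields here are illustrative assumptions, not the production code), it could be posted with FormData and XMLHttpRequest:

// Minimal upload sketch: endpoint and field names are hypothetical
var self = this;
this.recorder.exportWAV(function ( blob ) {
      var form = new FormData();
      form.append('audio', blob, 'recording.wav');   // the WAV Blob from Recorder.js
      form.append('chapter', self.currentChapter);   // hypothetical chapter identifier

      var xhr = new XMLHttpRequest();
      xhr.open('POST', '/api/upload');               // hypothetical endpoint
      xhr.onload = function () {
            if (xhr.status === 200) {
                  console.log('upload complete');
            } else {
                  console.log('upload failed: ' + xhr.status);
            }
      };
      xhr.send(form);
});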

Speech Recognition Front-End

 

To process the recording with speech recognition, we first check whether the browser implements the Web Speech API. The next step is to set up the basic configuration and the events triggered by user actions.

From: /prototypes/audioWebRTC_Recognition/js/app.js
      
startRecognition: function () {
      if (!('webkitSpeechRecognition' in window)) {
            console.log( 'recognition not available on your browser' );
      } else {
            var self = this;
            this.recognition = new webkitSpeechRecognition();
            this.recognition.continuous = true;       // keep listening until stop() is called
            this.recognition.interimResults = true;   // deliver partial results while the user speaks
            this.recognition.lang = "en";
            this.recognition.onstart = function() {
                  self.onRecognitionStart();
            };
            this.recognition.onresult = function( e ) {
                  self.onRecognitionResult( e );
            };
            this.recognition.onerror = function( e ) {
                  self.onRecognitionError( e );
            };
            this.recognition.onend = function() {
                  self.onRecognitionEnd();
            };
            this.recognition.start();
      }
},
stopRecognition: function () {
      this.recognition.stop();
},

This handler receives the recognition results and logs each final transcript as a string:

      
onRecognitionResult: function ( event ) {
      for (var i = event.resultIndex; i < event.results.length; ++i) {
            if (event.results[i].isFinal) {
                  console.log( event.results[i][0].transcript);
            }
      }
},

Together with RecorderJS

 

We used Recorder.js to export the output of the WebRTC API to an audio Blob. This was a quick and convenient process that also takes advantage of Web Workers, so the processing happens in the background. Another convenience of Recorder.js is that it encodes WAV according to the standard, so the file can later be processed with standard tools on the server, which is what we do.

From: js/vendor/recorder.js
      
this.exportWAV = function(cb, type) {
      currCallback = cb || config.callback;
      type = type || config.type || 'audio/wav';
      if (!currCallback) throw new Error('Callback not set');
      worker.postMessage({
            command: 'exportWAV',
            type: type
      });
}
From: js/vendor/recorderWorker.js
      
function exportWAV(type){
      var bufferL = mergeBuffers(recBuffersL, recLength);
      var bufferR = mergeBuffers(recBuffersR, recLength);
      var interleaved = interleave(bufferL, bufferR);
      var dataview = encodeWAV(interleaved);
      var audioBlob = new Blob([dataview.buffer], {type: type});
      this.postMessage(audioBlob);
}

Combining user recording with the Web Speech API lets us check whether the user’s speech matches the phrasing of the original MLK speech. This gives us a quick moderation path for entries that are faithful readings of the original, and even where differences are found, the approach helps moderators see what kind of differences a user’s entry contains.
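
As an illustration of such a pre-check (the production moderation logic is more involved, and the names and threshold below are assumptions), the final transcripts could be accumulated and scored against the expected chapter text with a simple word-match ratio:

// Hypothetical moderation pre-check: compare recognised words with the expected chapter text
function transcriptMatchRatio(recognisedText, expectedText) {
      var normalise = function (s) {
            return s.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean);
      };
      var recognised = normalise(recognisedText);
      var expected = normalise(expectedText);
      var matches = 0;
      // count how many expected words also appear in the recognised transcript
      for (var i = 0; i < expected.length; i++) {
            if (recognised.indexOf(expected[i]) !== -1) matches++;
      }
      return expected.length ? matches / expected.length : 0;
}

// Example usage: flag entries below an arbitrary threshold for manual review
var ratio = transcriptMatchRatio(userTranscript, chapterText);   // both strings are assumed inputs
var needsManualReview = ratio < 0.8;                             // threshold chosen purely for illustration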

Synchronising Speech to Scrolling Text

 

The “I Have a Dream” speech transcript needed to scroll in time with Martin Luther King’s words as the audio played. We approached this by first adding metadata to the transcript that captures the timing of each piece of copy he speaks. We created a sequence of tags with attributes describing the speech timing, like this:

<span timestamp="0">I have a dream that one day</span>
<span timestamp="5"> this nation will rise up and live out the true meaning of its creed: “</span>
<span timestamp="10">We hold these truths to be self-evident: that all men are created equal.</span>
<span timestamp="18" class="break"></span>

The timestamp attribute captures the sections in which MLK is speaking, with increasing values, and we can even mark where he pauses with a ‘break’ element during which the text stops scrolling.

We read this timed transcript and process it so we can look up the right bit of text for any point in the audio. Here you can see how we extract the timestamps (and layout offsets) from the span tags:

From: js/models/Slideshow.js
      
recountElements : function ( ) {
      this.spanElements = this.$transcriptWrapper.find('span');
      for(var i = 1; i < this.spanElements.length; i++) {
            this.offsets[i] = parseFloat($(this.spanElements[i]).offset().left);
            this.widths[i] = $(this.spanElements[i]).width();
            this.timeStamps[i] = parseFloat($(this.spanElements[i]).attr("timestamp"));
      }
},

Every requestAnimationFrame we update our timers and call runTranscript, which works out which elements are visible on screen and then moves them to the right position.

From: js/models/Slideshow.js
      
runTranscript : function ( ) {
      this.currentSpanIndex = this.countCurrentIndex();
      var time = this.getCurrentTime();
      this.moveToPosition(time);
},
countCurrentIndex : function ( ) {
      var currentSpanIndex = null;
      for (var i = 0; i < this.timeStamps.length; ++i) {
            if (this.getCurrentTime() < this.timeStamps[i] * 1000) {
                  currentSpanIndex = i - 1;
                  break;
            }
      }
      return currentSpanIndex;
},
moveToPosition : function ( value ) {
      var offset = this.currentTimeToOffset(value);
      this.leftOffset = offset;
      this.setTranscriptPosition(this.leftOffset);
},
setTranscriptPosition : function (pos) {
      this.elementToMoveStyle.left = (pos + this.transcriptConstOffset) + 'px';
},
currentTimeToOffset : function ( value ) {
      // interpolate between the pixel offsets of the current span and the next one,
      // based on how far the audio has progressed between their two timestamps
      var actualTimeStamp = this.timeStamps[this.currentSpanIndex] * 1000;
      var nextTimeStamp = this.timeStamps[this.currentSpanIndex + 1] * 1000;
      var progress = (value - actualTimeStamp) / (nextTimeStamp - actualTimeStamp);
      if (progress >= 1) this.currentSpanIndex = this.countCurrentIndex();
      var offset = -this.offsets[this.currentSpanIndex] - (this.offsets[this.currentSpanIndex + 1] - this.offsets[this.currentSpanIndex]) * progress;
      return offset;
},
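
The requestAnimationFrame loop mentioned above is not part of this snippet; a minimal driver could look like the sketch below, assuming "slideshow" is an instance of the Slideshow model shown above (the loop itself is illustrative, not the production code):

// Illustrative driver loop for the transcript sync
function update() {
      slideshow.runTranscript();              // recompute the active span and reposition the transcript
      window.requestAnimationFrame(update);   // schedule the next frame
}
window.requestAnimationFrame(update);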

This system for keeping the consecutive lines of the speech in sync was extended to apply when the user contributes their own reading as well. We could do this because we ask the user to read each line of text independently and to mark when they are done, so we can annotate a user-generated entry with the same format we use for the main speech.

Transcript Synchronisation with UGC

 

The basic idea is that we split the whole speech into 10 chapters, each with its own duration (start and end timestamps). As the audio for each chapter is recorded, we compute its progress and map it to the progress of the mainSpeech audio. The audio progress is then mapped to the transcript, so that the user sees the right bit of the speech below their recording.

The current progress (this.progress) is calculated in the following snippet, at a temporal resolution of 0.1% of each chapter.

From: /js/views/UgcContentView.js
getCurrentProgress : function ( ) {
      if(this.currChapterStartTime) {
            var currChapterProgress = (((new Date()).getTime() - this.currChapterStartTime)*0.001) / (this.currAudioChapterAfterPlay.stop - this.currAudioChapterAfterPlay.start);
            if(currChapterProgress > 1)
                  currChapterProgress = 1;
            var singleChapterSizeInPixels = this.width / this.nrOfChapters;
            this.progress = ((this.currAudioChapterAfterPlay.index * singleChapterSizeInPixels) + currChapterProgress * singleChapterSizeInPixels) / this.width;
      }
      else
            this.progress = -app.scrollController.xScrollValue / this.width;
      return this.progress;
},

We compute the chapter index and its progress:

      
getChapterInfoFromProgress : function ( ) {
      var progress = this.progress * 10;
      progress = Math.min(Math.max(progress, 0), 9.99); // cut to range 0-9.99
      var chapterIndex = parseInt(progress);
      var chapterProgress = progress - chapterIndex;
      return {index: chapterIndex, progress: chapterProgress};
},
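
For example, an overall progress of 0.37 becomes 3.7 here, which maps to chapter index 3 (the fourth chapter) with a within-chapter progress of 0.7.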

To compute the current mainSpeech audio progress:

      
getCurrentProgressForMainSpeech : function ( ) {
      // we can't take chapter info from getChapterInfoFromScrollProgress() because it stops when
      // window can't be scrolled any further, and that doesn't mean that the last chapter finished playing
      var chapterInfo = this.getChapterInfoFromProgress(); // don't use getChapterInfoFrom'Scroll'Progress()
      var currMainSpeechTime = 0;
      for (var i = 0; i < chapterInfo.index; i++)
            currMainSpeechTime += app.audioChapters.chapters[i][1] - app.audioChapters.chapters[i][0];
      currMainSpeechTime += (app.audioChapters.chapters[chapterInfo.index][1] - app.audioChapters.chapters[chapterInfo.index][0]) * chapterInfo.progress;
      return currMainSpeechTime;
},

This function updates the transcript’s screen position; it depends on the progress calculated from either the UGC section or the explore content section:

From: js/models/Slideshow.js
      
moveTranscriptExploreAndUgc : function ( progress ) {
      this.isNewAudioProgress = true; // audio progress on resume should be changed
      // to match progress in ugc & explore
      var offset = 0;
      var pr = 0;
      if(app.controller.getActiveView() == 'ugcContent') {
            //offset = this.timeToOffset2( (progress * 1000) );
            offset = this.timeToOffset( progress );
            this.progressBar.draw(Math.round(this.offsetToProgress(offset)));
      } else {
            offset = this.progressToOffset( progress );
            this.progressBar.draw(Math.round(progress));
      }
      this.moveToOffsetPosition( offset );
      this.currentTime = this.offsetToTime(this.leftOffset);
      this.activeSlide = this.progressToSlide();
},

Sound Controller: Using SoundManager2

 

Sound management is full of pitfalls, from hardware-specific issues to synchronisation problems, so we decided to use a third-party library, SoundManager2 (http://www.schillmania.com/projects/soundmanager2/), which worked very well for us, not least because of its cross-browser support. A few particular issues arose that we want to point out:

  • On mobile and tablet devices you need to start audio from a user interaction, and you can only play one audio track at a time. Ideally you should architect an audio-based application around this limitation from the start (see the unlock sketch after this list).
  • You need to handle buffering carefully alongside audio playback or you will get delays, so it is worth creating your own play controller to ensure a consistent process.
  • Resuming playback will not work if the buffer of the current audio file is no longer available, for example because you played another file in the meantime. So we store the position and resume playing from that specific location.
  • Playback on iOS required a slightly different play controller and a user click event to start the sound.
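
The first point, starting audio from a user gesture, is usually handled by “unlocking” playback on the first touch or click. Below is a minimal sketch of that pattern using a plain HTML5 Audio element; SoundManager2 wraps this differently, so treat it as an illustration of the idea rather than our SoundController code (the silence.mp3 file name is hypothetical):

// Illustrative "unlock audio on first interaction" pattern (not the production code)
var unlocked = false;
function unlockAudio() {
      if (unlocked) return;
      unlocked = true;
      var silent = document.createElement('audio');
      silent.src = 'silence.mp3';          // hypothetical short, silent file
      silent.play();                       // allowed here because we are inside a user gesture
      silent.pause();
      document.removeEventListener('touchstart', unlockAudio);
      document.removeEventListener('click', unlockAudio);
}
document.addEventListener('touchstart', unlockAudio);
document.addEventListener('click', unlockAudio);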

The way we decided to handle ‘PLAY’ is to repeat the following steps at regular intervals: resume playback, check the buffering progress, and pause the audio again. Once a positive value appears in the buffering progress, the sound is ready to play and you can set a position on it.

From: js/utils/SoundController.js
      
this.sounds[name].play();
this.onBuffered( function(){ alert('READY TO PLAY AND TO SET POSITION'); } );
// onBuffered polls every 100ms: briefly resume, read how much has buffered, then pause again
this.onBuffered = function (callback) {
      var self = this;
      var isReady = false;
      this.sounds[name].resume();
      var bufferedBytes = parseFloat(this.sounds[name].getBufferedBytes());
      this.sounds[name].pause();
      isReady = bufferedBytes > 0;
      if(isReady)
            callback();
      else
            setTimeout(function() { self.onBuffered(callback); }, 100);
}

The way we handled ‘PLAY’ on iOS is to play, pause and then load the sound, to force the stream to buffer. This means we don’t need to perform a resume as we did on Android-based devices. On iOS, seeking to a position only works once the whole stream has loaded.

      
this.sounds[name].play();
this.sounds[name].pause();
this.sounds[name].load();
this.onBuffered( function(){ alert('READY TO PLAY AND TO SET POSITION'); } );
this.onBuffered = function (callback) {
      var self = this;
      var isReady = false;
      var bufferedBytes = parseFloat(this.sounds[name].getBufferedBytes());
      isReady = bufferedBytes >= 1 ? true : false;
      if(ComplexDetection.isIphone()) // iPhone needs duration property
            isReady &= this.sounds[name].soundObject.duration > 0;
      if(isReady)
            callback();
      else
            setTimeout(function() { self.onBuffered(callback); }, 100);
};

The Dream Speech Back-end


While the front end displays a simple interface to choose a snippet of the speech and select a microphone to record, there was quite a bit of technology involved in processing this data on the server.

App Engine Challenges: Saved by Compute Engine

 

We ran into some new challenges on the back-end. App Engine, while excellent for delivering scalable applications, has some limitations: one of them is that it can run only pure Python code, no exceptions (well, except Go, Java or PHP, but you can only choose one runtime for the entire codebase). So unless we could find pure-Python MP3 and Ogg Vorbis encoders, we would be stuck serving our users .wav files, which are easily up to 10 times as large!

We also wanted to do some sound-effects processing, and while the effects we wanted were quite simple (compression, band-pass filters, slight reverb), there were no pure-Python solutions available.

So we decided that we needed a Linux box with ffmpeg on board, and we chose Google’s excellent Compute Engine service to host our transcoding application. As it turned out, one mid-sized instance could chew through hundreds of tasks in just a few minutes, but just in case we implemented load balancing and quick-to-update instance management to make it even more robust.

Tasklets

 

On the App Engine side, we took advantage of ndb’s tasklets feature to implement non-blocking I/O for both urlfetch and database operations. Combined with cron jobs, this allowed us to monitor the load and health of our instances:

      
@ndb.tasklet
def ping(self):
    # get async context
    ctx = ndb.get_context()

    # mark the last time we've tried contacting the instance
    self.last_pinged = datetime.datetime.now()

    # prepare payload for the network request - implement this yourself!
    auth = get_auth()
    payload = get_random_payload()

    # send it!
    try:
        # the "yield" keyword will hand over the control back to the
        # tasklet dispatcher - this allows other tasklets to run
        r = yield ctx.urlfetch(
            self.ping_url, method="POST", headers={"Authorization": auth},
            payload=urllib.urlencode({"payload": payload}),
            deadline=30,
        )
    except urlfetch.DeadlineExceededError:
        # if the operation fails, we will still put the entity in the
        # DB, marking the last pinged time.
        yield self.put_async()
        raise

    # If we didn't get a 200 response, we should scream about that!
    if r.status_code != 200:
        yield self.put_async()
        raise RuntimeError(r)

    # mark the last pong reply, and round-trip time
    self.last_ponged = self.last_seen = datetime.datetime.now()
    self.rtt = (self.last_ponged - self.last_pinged).microseconds

    # check that the "secret" payload got back to us!
    data = json.loads(r.content)
    assert data["payload"] == payload

    # finally, store the returned statistics
    self.load = data["load"]
    self.uptime = data["uptime"]
    yield self.put_async()

The important part is how this fits together with the rest of the app:

      
@app.get("/api/cronjobs/send-pings")
def cron_send_pings():
    futures = []
    for tx in models.Transcoder.query().fetch():
        try:
            futures.append(tx.ping())
        except Exception:
            captureException()
            continue
    ndb.Future.wait_all(futures)
    return "Yo!", 200

Pure Data Visual Programming Server Side?

 

With Compute Engine’s power at our disposal, we chose to employ Pure Data for our DSP needs. PD is a visual programming “language”, but we managed to plug it into our processing pipeline using its fairly recent batch mode feature.

      
from shutil import copy
from subprocess import Popen, PIPE
from utils import working_directory

with working_directory(config["workdir"]):
    copy(fname, in_fname)
    p = Popen([
        "pd-extended",
        "-nosound",
        "-batch",
        "./batch.pd",
    ], stdin=PIPE,
       stdout=PIPE,
       stderr=PIPE
    )
    p.stdin.close()
    out, err = p.stdout.read(), p.stderr.read()
    # log output and error...
    ret = p.wait()
    copy(out_fname, fname)


Back-end Tools

 

It’s worth quickly mentioning the many other technologies and packages we use to glue the pieces together. Debian, nginx, Flask, beanstalkd, daemontools, Sentry and many others have all helped make this possible. Some are tried-and-tested elements of our Backend Ninja tool-belt, others were used experimentally for the first time. We could dedicate a whole article to each of them, and we would welcome any feedback on which would be of most interest.

Conclusion

 

The speech by Martin Luther King will grow ever more connected to the contributions of many other people: their own voices, documentation, videos and more. The technology story of this project is that the user should barely notice anything technical happening and stay focused on the speech. Letting the user record their voice in the browser, without a plug-in and with standard authentication, is novel, especially because the experience flows so naturally.

References:

 

jQuery.com

http://handlebarsjs.com
http://backbonejs.org
http://requirejs.org
www.greensock.com/tweenlite
http://sass-lang.com

http://www.schillmania.com/projects/soundmanager2
http://www.goodboydigital.com/pixijs/docs
Google Maps API
