Robert Bader
National Park Foundation: March on Washington - A Technical Overview
Cross-platform audio recording and moderation using WebRTC, Web Audio, and the Web Speech API
The National Park Service and National Park Foundation present “March On Washington”, a multi-device site developed by UNIT9 for Organic.
This is an HTML5 site celebrating the 50th anniversary of Martin Luther King’s “I Have a Dream” speech. The site lets users explore audio recordings, videos and photos of the 1963 event, and invites the audience to participate through a gallery of people’s own readings of the speech, which anyone can contribute to.
From a technology point of view, the most interesting part of the site was the user-generated audio, which we could record and upload seamlessly with no special plugins. It was also interesting to apply Google’s speech recognition to the user contributions to aid moderation when an entry is submitted to the server.
In addition we cover two other areas: how we synchronised the scrolling transcript with the speech audio (along with some tips on using Recorder.js), and how we structured the back end to handle server-side sound processing on Google App Engine and Google Compute Engine.
Please also note that some additional downloadable example code is on its way, so check back for an update.
Quick Tour
The “March On Washington” site is made up of three main sections: a play mode that montages the original historical documents to the original speech, an explore section where you can browse other photographic material from the time, and a user-generated content section.
The main page of the site shows archive photography while Martin Luther King’s speech plays and the words of the transcript scroll below the image in time with his voice.
Many items have a detailed information sheet with related video and audio recordings.
The user gallery shows each contributor’s profile picture, and you can listen to their reading of a section of the speech.
WebRTC and Web Speech API communication
When you decide to record your own version of the speech, we ask you first to log in with your Google+ or Facebook account, then pick a chapter and press record. You then speak your own version, which is captured in the browser, uploaded to the server, filtered, and tied to your user profile picture. We mix the track with background noise and grain so that it sits in the same sonic style as the original speech.
Getting Recorded Audio
There are many tutorials for the getUserMedia feature, but we could not find any that combine it with recording an audio track and running speech recognition. Here is a quick breakdown of what we did.
From: /prototypes/audioWebRTC_Recognition/js/app.js
window.URL = window.URL || window.webkitURL;
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia || navigator.msGetUserMedia;

startRecording: function () {
    if ( navigator.getUserMedia ) {
        navigator.getUserMedia({audio: true}, this.onGetMediaSuccess, this.onGetMediaFail);
    } else {
        console.log('navigator.getUserMedia not present');
    }
},

onGetMediaSuccess: function ( s ) {
    // the callback runs outside the app's scope, so grab the app instance
    var self = this.App;
    // wrap the microphone stream in a Web Audio node and hand it to Recorder.js
    self.mediaStreamSource = self.audioContext.createMediaStreamSource( s );
    self.recorder = new Recorder(self.mediaStreamSource);
    self.recorder.record();
},

onGetMediaFail: function () {
    // code flow for a rejected microphone request
}
Once the user is done recording, we capture the audio as a binary file and create an HTML5 audio element so the user can listen back and check it before we send it to the back end. Here exportWAV generates a Blob object containing the recording in WAV format:
var self = this;
this.recorder.stop();
// exportWAV hands us the recording as a WAV Blob via the callback
this.recorder.exportWAV(function(s) {
    // create an <audio> element pointing at the Blob so the user can review it
    var audio = document.createElement( 'audio' );
    audio.src = window.URL.createObjectURL(s);
    audio.id = 'record' + self.recordCount;
    document.body.appendChild( audio );
});
} // end of the enclosing method (not shown above)
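Sending the Blob to the back end is then a standard file upload. Here is a minimal sketch of how that could look; the uploadRecording helper, the /api/recordings endpoint and the field names are hypothetical, not part of the production code.

function uploadRecording( blob, chapterIndex ) {
    // package the WAV Blob as multipart form data
    var formData = new FormData();
    formData.append( 'audio', blob, 'recording.wav' );
    formData.append( 'chapter', chapterIndex );
    // POST it to the (hypothetical) recordings endpoint
    var xhr = new XMLHttpRequest();
    xhr.open( 'POST', '/api/recordings', true );
    xhr.onload = function () {
        console.log( xhr.status === 200 ? 'upload complete' : 'upload failed: ' + xhr.status );
    };
    xhr.send( formData );
}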
Speech Recognition Front-End
To process the recorded audio with speech recognition, we first check whether the browser implements the Web Speech API. The next step is to set some basic configuration and wire up the events related to the user's actions.
From: /prototypes/audioWebRTC_Recognition/js/app.js
startRecognition: function () {
    if (!('webkitSpeechRecognition' in window)) {
        console.log( 'recognition not available on your browser' );
    } else {
        var self = this;
        this.recognition = new webkitSpeechRecognition();
        // keep recognising until stop() is called, and report interim results
        this.recognition.continuous = true;
        this.recognition.interimResults = true;
        this.recognition.lang = "en";
        this.recognition.onstart = function() {
            self.onRecognitionStart();
        };
        this.recognition.onresult = function( e ) {
            self.onRecognitionResult( e );
        };
        this.recognition.onerror = function( e ) {
            self.onRecognitionError( e );
        };
        this.recognition.onend = function() {
            self.onRecognitionEnd( );
        };
        this.recognition.start();
    }
},

stopRecognition: function () {
    this.recognition.stop();
},
This handler receives the speech translated to text and logs each final transcript:
onRecognitionResult: function ( event ) {
    for (var i = event.resultIndex; i < event.results.length; ++i) {
        if (event.results[i].isFinal) {
            console.log( event.results[i][0].transcript);
        }
    }
},
Together with Recorder.js
We used Recorder.js to export the output of the WebRTC stream to an audio Blob. This was a quick and convenient process, and it takes advantage of Web Workers so the processing can happen in the background. Another convenience of Recorder.js is that it encodes the WAV according to the standard, so the file can later be processed with standard tools on the server, as in our case.
From: js/vendor/recorder.js
this.exportWAV = function(cb, type) {
    currCallback = cb || config.callback;
    type = type || config.type || 'audio/wav';
    if (!currCallback) throw new Error('Callback not set');
    worker.postMessage({
        command: 'exportWAV',
        type: type
    });
}
From: js/vendor/recorderWorker.js
function exportWAV(type){
    var bufferL = mergeBuffers(recBuffersL, recLength);
    var bufferR = mergeBuffers(recBuffersR, recLength);
    var interleaved = interleave(bufferL, bufferR);
    var dataview = encodeWAV(interleaved);
    var audioBlob = new Blob([dataview.buffer], {type: type});
    this.postMessage(audioBlob);
}
So, combining user recording with the Web Speech API, we can check whether the user's reading matches the phrasing of the original MLK speech. This gives us a quick moderation path for entries that are faithful to the original, and even where differences are found it helps the moderators see what kind of deviation each entry contains.
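As an illustration of such a check, a minimal sketch might normalise both texts and measure how many of the expected words appear in the recognised transcript; the matchScore helper and the 0.8 threshold below are our own assumptions, not taken from the production code.

// Hypothetical moderation helper: compares the recognised transcript with the
// expected chapter text and returns the fraction of expected words found.
function matchScore( recognised, expected ) {
    var normalise = function ( text ) {
        return text.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean);
    };
    var spoken = normalise( recognised );
    var target = normalise( expected );
    var hits = 0;
    for (var i = 0; i < target.length; i++) {
        if (spoken.indexOf( target[i] ) !== -1) hits++;
    }
    return target.length ? hits / target.length : 0;
}

// e.g. flag anything scoring below 0.8 for closer human review
var needsReview = matchScore( recognisedTranscript, chapterText ) < 0.8;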
Synchronising Speech to Scrolling Text
The “I Have a Dream” transcript needed to scroll in time with Martin Luther King’s words as the audio played. We approached this by first adding some metadata to the transcript that captures when each piece of copy is spoken. We created a sequence of tags with attributes for the speech timing, like this:
<span timestamp="0">I have a dream that one day</span>
<span timestamp="5"> this nation will rise up and live out the true meaning of its creed: “</span>
<span timestamp="10">We hold these truths to be self-evident: that all men are created equal.</span>
<span timestamp="18" class="break"></span>
…
The timestamp attribute marks, in increasing order, when each section of speech begins, and a span with the ‘break’ class captures the pauses he took, during which the text stops scrolling.
We read this timed transcript and process it so that we can access the right piece of text for any moment in the audio. Here you can see how we collect the timestamps (along with offsets and widths) from the span tags:
From: js/models/Slideshow.js
recountElements : function ( ) {
    this.spanElements = this.$transcriptWrapper.find('span');
    for(var i = 1; i < this.spanElements.length; i++) {
        this.offsets[i] = parseFloat($(this.spanElements[i]).offset().left);
        this.widths[i] = $(this.spanElements[i]).width();
        this.timeStamps[i] = parseFloat($(this.spanElements[i]).attr("timestamp"));
    }
},
On every requestAnimationFrame we update our timers and call runTranscript, which works out which elements should be visible and then moves them to the right position on screen.
From: js/models/Slideshow.js
runTranscript : function ( ) {
    this.currentSpanIndex = this.countCurrentIndex();
    var time = this.getCurrentTime();
    this.moveToPosition(time);
},

countCurrentIndex : function ( ) {
    var currentSpanIndex = null;
    for (var i = 0; i < this.timeStamps.length; ++i) {
        if (this.getCurrentTime() < this.timeStamps[i] * 1000) {
            currentSpanIndex = i - 1;
            break;
        }
    }
    return currentSpanIndex;
},

moveToPosition : function ( value ) {
    var offset = this.currentTimeToOffset(value);
    this.leftOffset = offset;
    this.setTranscriptPosition(this.leftOffset);
},

setTranscriptPosition : function (pos) {
    this.elementToMoveStyle.left = (pos + this.transcriptConstOffset) + 'px';
},

currentTimeToOffset : function ( value ) {
    var actualTimeStamp = this.timeStamps[this.currentSpanIndex] * 1000;
    var nextTimeStamp = this.timeStamps[this.currentSpanIndex + 1] * 1000;
    var progress = (value - actualTimeStamp) / (nextTimeStamp - actualTimeStamp);
    if (progress >= 1) this.currentSpanIndex = this.countCurrentIndex();
    var offset = -this.offsets[this.currentSpanIndex] - (this.offsets[this.currentSpanIndex + 1] - this.offsets[this.currentSpanIndex]) * progress;
    return offset;
},
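The requestAnimationFrame loop itself is not part of the snippet above. As a minimal sketch of such a driver, assuming a slideshow instance whose getCurrentTime() reads the current audio playback position:

// Hypothetical driver: reposition the transcript on every animation frame so it
// tracks the audio clock exposed by getCurrentTime().
function update() {
    slideshow.runTranscript();
    window.requestAnimationFrame( update );
}
window.requestAnimationFrame( update );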
This system for keeping the consecutive lines of the speech in sync was extended to apply to user contributions as well. We could do this because we ask the user to read each line of text independently and to mark when they are done, so we can annotate a user-generated entry with the same format we use for the main speech.
Transcript Synchronisation with UGC:
The basic idea is that we split the whole speech into 10 chapters, each with its own duration (start and end timestamps). As the audio for each chapter is recorded, we compute its progress and map it to the progress of the main speech audio. That audio progress is then mapped to the transcript, so the user sees the right piece of the speech below their text.
The current progress (this.progress) is calculated in the following snippet, at a temporal resolution of 0.1% of each chapter.
From: /js/views/UgcContentView.js
getCurrentProgress : function ( ) {
    if(this.currChapterStartTime) {
        var currChapterProgress = (((new Date()).getTime() - this.currChapterStartTime)*0.001) / (this.currAudioChapterAfterPlay.stop - this.currAudioChapterAfterPlay.start);
        if(currChapterProgress > 1)
            currChapterProgress = 1;
        var singleChapterSizeInPixels = this.width / this.nrOfChapters;
        this.progress = ((this.currAudioChapterAfterPlay.index * singleChapterSizeInPixels) + currChapterProgress * singleChapterSizeInPixels) / this.width;
    }
    else
        this.progress = -app.scrollController.xScrollValue / this.width;
    return this.progress;
},
We compute the chapter index and its progress:
getChapterInfoFromProgress : function ( ) {
    var progress = this.progress * 10;
    progress = Math.min(Math.max(progress, 0), 9.99); // cut to range 0-9.99
    var chapterIndex = parseInt(progress);
    var chapterProgress = progress - chapterIndex;
    return {index: chapterIndex, progress: chapterProgress};
},
To compute the current mainSpeech audio progress:
getCurrentProgressForMainSpeech : function ( ) {
    // we can't take chapter info from getChapterInfoFromScrollProgress() because it stops when
    // window can't be scrolled any further, and that doesn't mean that the last chapter finished playing
    var chapterInfo = this.getChapterInfoFromProgress(); // don't use getChapterInfoFrom'Scroll'Progress()
    var currMainSpeechTime = 0;
    for (var i = 0; i < chapterInfo.index; i++)
        currMainSpeechTime += app.audioChapters.chapters[i][1] - app.audioChapters.chapters[i][0];
    currMainSpeechTime += (app.audioChapters.chapters[chapterInfo.index][1] - app.audioChapters.chapters[chapterInfo.index][0]) * chapterInfo.progress;
    return currMainSpeechTime;
},
This function updates the transcript's screen position; it depends on the progress calculated from either the UGC or the explore content section:
From: js/models/Slideshow.js
moveTranscriptExploreAndUgc : function ( progress ) {
    this.isNewAudioProgress = true; // audio progress on resume should be changed
                                    // to match progress in ugc & explore
    var offset = 0;
    var pr = 0;
    if(app.controller.getActiveView() == 'ugcContent') {
        //offset = this.timeToOffset2( (progress * 1000) );
        offset = this.timeToOffset( progress );
        this.progressBar.draw(Math.round(this.offsetToProgress(offset)));
    } else {
        offset = this.progressToOffset( progress );
        this.progressBar.draw(Math.round(progress));
    }
    this.moveToOffsetPosition( offset );
    this.currentTime = this.offsetToTime(this.leftOffset);
    this.activeSlide = this.progressToSlide();
},
Sound Controller: Using SoundManager2
Sound management is full of pitfalls, from hardware-specific issues to synchronisation problems, so we decided to use a third-party library, SoundManager2 (http://www.schillmania.com/projects/soundmanager2/), which worked very well for us, not least because of its cross-browser support. There were, though, some particular issues worth pointing out:
- On mobile and tablet devices you need to start the audio on the user's first interaction, and you can only play one audio track at a time. Ideally you should architect an audio-based application around these limitations in advance (see the sketch after this list).
- You need to take care of buffering alongside audio playback or you will get delays, so you might create your own play controller to ensure a consistent process.
- Resuming playback will not work if the buffer of the current audio file is no longer available, for example because you played another file in the meantime, so we store the position and resume playing from that specific location.
- Play on iOS required a slightly different play controller and a user click event to start the sound.
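As a rough sketch of that first-interaction unlock, assuming a sound with the id 'speech' has already been created with soundManager.createSound (the id and the event wiring here are illustrative, not the production code):

// Hypothetical unlock step: on the first touch, play and immediately pause the
// sound so that mobile browsers later allow programmatic playback.
var unlockAudio = function () {
    var sound = soundManager.getSoundById( 'speech' );
    sound.play();
    sound.pause();
    document.removeEventListener( 'touchend', unlockAudio );
};
document.addEventListener( 'touchend', unlockAudio );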
The way we decided to handle ‘play’ is to take the following steps at regular intervals: play/resume, check buffering progress, and pause the audio. When a positive value appears in the buffered bytes, the sound is ready to play and you can set a position on it.
From: js/utils/SoundController.js
this.sounds[name].play();
this.onBuffered( function(){ alert('READY TO PLAY AND TO SET POSITION'); } );

this.onBuffered = function (callback) {
    var self = this;
    var isReady = false;
    this.sounds[name].resume();
    var bufferedBytes = parseFloat(this.sounds[name].getBufferedBytes());
    this.sounds[name].pause();
    isReady = bufferedBytes > 0 ? true : false;
    if(isReady)
        callback();
    else
        setTimeout(function() { self.onBuffered(callback); }, 100);
}
The way we handled ‘play’ on iOS is to load, play, and pause to force buffering of the stream, so we don't need to perform a resume as we did on Android-based devices. On iOS, seeking to a location only works once the whole stream is loaded.
this.sounds[name].play();
this.sounds[name].pause();
this.sounds[name].load();
this.onBuffered( function(){ alert('READY TO PLAY AND TO SET POSITION'); } );

this.onBuffered = function (callback) {
    var self = this;
    var isReady = false;
    var bufferedBytes = parseFloat(this.sounds[name].getBufferedBytes());
    isReady = bufferedBytes >= 1 ? true : false;
    if(ComplexDetection.isIphone()) // iPhone needs duration property
        isReady &= this.sounds[name].soundObject.duration > 0;
    if(isReady)
        callback();
    else
        setTimeout(function() { self.onBuffered(callback); }, 100);
};
The Dream Speech Back-end
While the front end displays a simple interface to choose a snippet of the speech and then select a microphone to record, there was quite a bit of technology involved in handling the processing of this data on the server.
App Engine Challenges: Saved by Compute Engine
We ran into some new challenges on the back end. App Engine, while excellent for delivering scalable applications, has some limitations: one of them is that it can only run pure Python code, with no exceptions (well, except Go, Java or PHP, but you can only choose one for the entire codebase). So unless we could find pure-Python MP3 and Ogg Vorbis encoders, we would be stuck serving our users .wav files, which are easily up to 10 times as large!
We also wanted to do some sound-effects processing, and while the effects we wanted to add were quite simple (compression, band-pass filters, a slight reverb), there were no solutions available in pure Python.
So we decided that we needed a Linux box with ffmpeg on board. We chose Google’s excellent Compute Engine service to host our transcoding application, and as it turned out, one mid-sized instance could chew through hundreds of tasks in just a few minutes. Just in case, we also implemented load balancing and quick-to-update instance management to make the system even more robust.
Tasklets
On the App Engine side, we took advantage of ndb's tasklets feature to implement non-blocking IO for both urlfetch and database operations. Combined with cron jobs, this allowed us to monitor the load and health of our instances:
@ndb.tasklet
def ping(self):
    # get async context
    ctx = ndb.get_context()
    # mark the last time we've tried contacting the instance
    self.last_pinged = datetime.datetime.now()
    # prepare payload for the network request - implement this yourself!
    auth = get_auth()
    payload = get_random_payload()
    # send it!
    try:
        # the "yield" keyword will hand over the control back to the
        # tasklet dispatcher - this allows other tasklets to run
        r = yield ctx.urlfetch(
            self.ping_url, method="POST", headers={"Authorization": auth},
            payload=urllib.urlencode({"payload": payload}),
            deadline=30,
        )
    except urlfetch.DeadlineExceededError:
        # if the operation fails, we will still put the entity in the
        # DB, marking the last pinged time.
        yield self.put_async()
        raise
    # If we didn't get a 200 response, we should scream about that!
    if r.status_code != 200:
        yield self.put_async()
        raise RuntimeError(r)
    # mark the last pong reply, and round-trip time
    self.last_ponged = self.last_seen = datetime.datetime.now()
    self.rtt = (self.last_ponged - self.last_pinged).microseconds
    # check that the "secret" payload got back to us!
    data = json.loads(r.content)
    assert data["payload"] == payload
    # finally, store the returned statistics
    self.load = data["load"]
    self.uptime = data["uptime"]
    yield self.put_async()
The important part is how this fits together with the rest of the app:
@app.get("/api/cronjobs/send-pings")
def cron_send_pings():
futures = []
for tx in models.Transcoder.query().fetch():
try:
futures.append(tx.ping())
except Exception:
captureException()
continue
ndb.Future.wait_all(futures)
return "Yo!", 200
Pure Data Visual Programming Server Side?
While we were there with Compute Engine's power at our disposal, we chose to employ Pure Data for our DSP needs. Pd is a visual programming “language”, but we managed to plug it into our processing pipeline using one of its fairly recent features: batch mode.
from shutil import copy
from subprocess import Popen, PIPE
from utils import working_directory

with working_directory(config["workdir"]):
    copy(fname, in_fname)
    p = Popen(
        ["pd-extended", "-nosound", "-batch", "./batch.pd"],
        stdin=PIPE, stdout=PIPE, stderr=PIPE,
    )
    p.stdin.close()
    out, err = p.stdout.read(), p.stderr.read()
    # log output and error...
    ret = p.wait()
    copy(out_fname, fname)
Back-end tools
It's worth quickly mentioning the many other technologies and packages we used to glue our bits together. Debian, nginx, Flask, beanstalkd, daemontools, Sentry and many others all helped make this possible. Some are tried and tested elements of our Backend Ninja tool belt, others were used experimentally for the first time. We could dedicate a whole article to each of them independently, and we would welcome any specific feedback or interest.
Conclusion
The speech by Martin Luther King grows ever more connected to the contributions of many other people: their own voices, documentation, videos and more. So the technology story of this project is for the user to barely notice that anything technical is happening and to focus on the speech. The way the user can record their voice without a plugin, in the browser and with standard privacy-respecting authentication, is novel, especially because the experience flows naturally.
References:
http://handlebarsjs.com
http://backbonejs.org
http://requirejs.org
http://www.greensock.com/tweenlite
http://sass-lang.com
http://www.schillmania.com/projects/soundmanager2
http://www.goodboydigital.com/pixijs/docs
Google Maps API
Credits
Release Date: 2013-08-26