Sarah Palin's speech sounds remarkably like the output of a computer program. In this post, I make the case by demonstrating a Donald Trump endorsement speech generated right in your browser.
And later, we have a quiz to see if you can tell the difference between Sarah's actual speech and generated text.
All code included!
Donald Trump is campaigning through your town, and on the last day of his visit you decide you want to endorse him for president. You need to give your endorsement speech that evening, and you want it to be as powerful and compelling as the recent endorsement speech by right-wing darling Sarah Palin.
What do you do? The sensible choice is probably to sit down and write the speech yourself, but let's go with the option that is a lot more fun and feeds the topic of this article - programmatically creating Sarah-speak!
This is just a fun post, so let's start with the results. Here you are - your speech is ready (simply click the button to generate another):
If you get a particularly thought-provoking passage, please leave it in a comment below.
If you are unfamiliar with Markov Chains, you may be wondering how this was done. Is this some form of AI?
No, it is not AI - for a Palin speech that would be overkill. It uses a very simple mechanism known as a Markov text generator.
It works by taking a piece of source text – referred to as the "input corpus". It then generates semi-random text in which each word in the new text is chosen as a function of the word that precedes it. This function is simply based on the probabilities of each word pair in the source text.
Suppose we take the following corpus text – something at least as relevant to our times as the Sarah Palin endorsement speech.
How much wood could a woodchuck chuck if a woodchuck could chuck wood?
If the word "woodchuck" was the first randomly selected word, the next word would be chosen at random from the two possibilities: "chuck" and "could", each with a 50% probability. If "chuck" were selected, then the next word would be either "if" or "wood".
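To make that concrete, here is a minimal sketch - just for illustration, separate from the actual generator code shown further down - that builds the next-word lists for the woodchuck corpus:

// Build a map from each word to the list of words that follow it.
// Duplicates in the lists are what carry the word-pair probabilities.
var corpus = "How much wood could a woodchuck chuck " +
             "if a woodchuck could chuck wood?"
var tokens = corpus.split(" ")
var nextWords = {}
for (var i = 0; i < tokens.length - 1; i++)
{
    var w = tokens[i]
    if (!nextWords[w]) nextWords[w] = []
    nextWords[w].push(tokens[i + 1])
}
console.log(nextWords["woodchuck"]) // ["chuck", "could"] - the 50/50 pick
console.log(nextWords["chuck"])     // ["if", "wood?"]

Picking uniformly at random from nextWords["woodchuck"] gives "chuck" and "could" each a 50% chance, exactly as described above.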
This results in text that contains the same words and word pairs as the corpus text, in roughly the same proportions. The process is "stateless": no topic is maintained and no ideas are woven together, so the resulting text is full of non-sequiturs and disjointed run-on sentences. Exactly like Sarah Palin!
In fact, it was while reading bits of her endorsement speech that it struck me just how much her sentence patterns sounded like Markov-generated text. I couldn't resist coding up a quick generator to compare the two.
Let's try this - here are some sentences. Try to guess which are actually from Sarah's speech and which are auto-generated (then click each one to find out):
If you've played with Markov text generators before, you may have noticed that the text generated here reads better than expected. This is because I made a couple of improvements.
First, I generate the text one sentence at a time: I choose a random starter word from the corpus text, then follow the Markov algorithm until it hits a sentence-ender word. These starter and ender words are identified simply by whether they started or ended a sentence in the corpus text. This gives the generated text more of a sentence-like structure, which reads better.
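Here is a rough sketch of how those starter and ender words can be collected. The corpusText variable and the punctuation regex are assumptions for illustration - the full example on GitHub has the real versions:

// A word "ends" a sentence if it ends in terminal punctuation.
function isSentenceEnd(w)
{
    return /[.!?]$/.test(w)
}
var words = corpusText.split(/\s+/)
var starters = words.filter(function(w, i) {
    // the first word, and any word after a sentence ender, starts a sentence
    return i == 0 || isSentenceEnd(words[i - 1])
})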
Additionally, rather than always picking a single following word at a time, I have the generator pick a small random number of following words, which results in short runs of words repeated just as they appeared in the corpus text. This runs the risk of echoing the source text too closely, but by testing and adjusting those parameters I was able to strike a good balance.
The code for this is really simple. First, we split the corpus text into words and make a separate list of starter words. Then, for each sentence, we grab a random starter word and then complete the sentence by repeatedly using the Markov process to obtain the next word. This finishSentence() function is the heart of the process, and is only about a dozen or so lines long:
// Helper functions (lc, arrayLast, rand, isSentenceEnd) and the global
// "words" array are defined in the complete example linked below.

// Pass in a sent array containing the first word at element 0;
// returns the finished sentence as a string.
function finishSentence(sent)
{
    // keep appending short runs of 1-7 words until we hit a sentence ender
    while (!isSentenceEnd(arrayLast(sent)))
        sent = sent.concat(getRandNextWordArray(arrayLast(sent), rand(1, 7)))
    return sent.join(" ")
}

// Return an array of up to wc consecutive corpus words that followed
// a randomly chosen occurrence of the word w.
function getRandNextWordArray(w, wc)
{
    // lc (ListComprehension) is a combination of map and filter.
    // Here we create our NextWordArray: the index of the word following
    // each occurrence of w - any duplicates intentionally included,
    // since they carry the word-pair probabilities.
    var nwa = lc(function(ww, i) {
        if (ww == w)
            return i + 1
    }, words)
    var result = []
    var index = nwa[rand(nwa.length)] // choose a random next-word index
    // continue grabbing words up to wc or until we finish the sentence
    while (wc--)
    {
        var nwo = words[index++]
        result.push(nwo)
        if (isSentenceEnd(nwo))
            return result
    }
    return result
}
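Putting it all together, generating a whole speech is just a loop over finishSentence(), seeding each sentence with a random starter word. Something along these lines - a sketch using the starters list from earlier, not the exact code from the full example:

// Build a speech of sentenceCount sentences, each seeded with a
// random starter word and finished by the Markov process.
function generateSpeech(sentenceCount)
{
    var sentences = []
    while (sentenceCount--)
        sentences.push(finishSentence([starters[rand(starters.length)]]))
    return sentences.join(" ")
}
console.log(generateSpeech(8)) // e.g. an eight-sentence "endorsement"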
See this code on Github for a complete, runnable code example (run it from Node or a browser console).
There are certainly additional techniques that would improve the "quality" of this text generator even further.
One possible improvement is to try to carry a "thread" or "topic" from one sentence to the next. This would require starting in the middle of a sentence and building outward in both directions: a forward Markov pass to finish the sentence, and a reverse Markov pass to begin it. Topic words could be randomly chosen, picked using some English cues (e.g. words that follow "the"), or fed to the generator manually.
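I haven't implemented this, but the reverse half is easy to sketch - a hypothetical getRandPrevWordArray() that mirrors getRandNextWordArray() above, collecting the words that precede a topic word instead of those that follow it:

// Hypothetical sketch of the "reverse Markov" step: grow a sentence
// backward from a chosen topic word toward a sentence start.
function getRandPrevWordArray(w, wc)
{
    // indices of the word immediately before each occurrence of w
    var pwa = lc(function(ww, i) {
        if (ww == w && i > 0)
            return i - 1
    }, words)
    var result = []
    var index = pwa[rand(pwa.length)] // choose a random previous-word index
    // walk backward up to wc words, or until we reach a sentence start
    while (wc--)
    {
        var pwo = words[index--]
        result.unshift(pwo)
        if (index < 0 || isSentenceEnd(words[index]))
            return result
    }
    return result
}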
Of course, any Markov text generator is constrained by the quality of its corpus text (which in the case of Sarah Palin, one could argue, sets a fairly low bar).
If any readers have ideas that could improve the results, please leave a comment below.