Paging Dr. Dankenstein

Inspired by Yan Zhu's Icowid, a bot that tweets generated sentences based on cryptocurrency ICOs and Erowid "trip reports" (drug trip stories), I recently decided to make my own Markov Chain Twitter bot. I ended up creating Karl Jobs, which generates a tweet from combined Karl Marx and Steve Jobs corpora every four hours.

P.S. All the code in this post is available on GitHub as a generator called Dankenstein.

For those of you who don't know, a Markov Chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In our context, this basically means that each next word in a generated sentence is chosen based on the probability of it following the previous word; for instance, given the word "dank", the next word chosen would probably be "meme", if that sequence of words was prevalent in the training data.
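
To make that concrete, here is a tiny hand-rolled sketch (not the markovify library we'll use later) that counts which words follow which in a toy corpus and then samples a sentence word by word from those counts:

import random
from collections import defaultdict

corpus = "the dank meme is the dank joke and the dank meme wins"

# Count which words follow each word in the corpus.
followers = defaultdict(list)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    followers[current_word].append(next_word)

# Starting from "dank", repeatedly pick a next word in proportion to how
# often it followed the current word in the corpus.
word = "dank"
generated = [word]
for _ in range(5):
    options = followers[word]
    if not options:  # dead end: the corpus's last word
        break
    word = random.choice(options)
    generated.append(word)

print(" ".join(generated))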

In this post I'll be explaining some of what I did when creating my bot, Karl Jobs.

Karl Jobs

If you want to follow along, you should check out the code from Dankenstein's GitHub repo.

Mash my bits up

Most of the following code examples are in Python or Bash.

If you want to follow along and try to code something yourself, you'll need:

  • A (new) Twitter account
    • The email you sign up with cannot be associated with another Twitter account
    • Pro tip: Gmail doesn't differentiate between "karl.jobs@gmail.com" and "karljobs@gmail.com" - if you own one, you'll get mail sent to the other
    • You'll also need to register your phone number with the account (under "Settings") to be allowed to create an application
  • A Twitter application (register at dev.twitter.com) with authentication keys for the account (read more)

When I started out, I wanted to create a bot that was a mashup of distinct sources. I tried to come up with a few worthy candidates:

candidates = [
    'King James Bible',
    'ICOs',
    'Erowid',
    'Cannibal Corpse Lyrics',
    'Deepak Chopra',
    'Elon Musk',
    'Mark Zuckerberg Testimony',
    'Richard Stallman',
    'The Communist Manifesto',
    'Frankenstein',
    'The Satanic Bible',
    'Steve Jobs',
    'The Brothers Karamazov',
    'Donald Trump',
    'SICP',
    'Alex Jones',
    'Philip K Dick Novels',
    'Moby Dick',
    'Richard Dawkins',
    'TNGs Data',
    'Obama',
    'President Bartlet (Westwing)',
    'H.P. Lovecraft',
    'George W. Bush',
    'TNGs Picard'
]

Then I generated a list of all possible pairs of potential candidates:

"""Get all possible pair combinations of candidates"""
import itertools
import pprint

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(list(itertools.combinations(candidates, 2)))

Some of the more interesting resulting pairs included:

('King James Bible', 'Richard Stallman'),
('King James Bible', 'Donald Trump'),
('ICOs', 'Moby Dick'),
('ICOs', 'H.P. Lovecraft'),
('Erowid', 'The Communist Manifesto'),
('Erowid', 'Alex Jones'),
('Erowid', 'TNGs Data'),
('Cannibal Corpse Lyrics', 'Deepak Chopra'),
('Cannibal Corpse Lyrics', 'Obama'),
('Deepak Chopra', 'Philip K Dick Novels'),
('Elon Musk', 'The Satanic Bible'),
('Mark Zuckerberg Testimony', 'TNGs Data'),
('Mark Zuckerberg Testimony', 'H.P. Lovecraft'),
('Richard Stallman', 'Moby Dick'),
('Richard Stallman', 'TNGs Data'),
('The Communist Manifesto', 'Steve Jobs'),
('Frankenstein', 'Donald Trump'),
('Frankenstein', 'Richard Dawkins'),
('The Satanic Bible', 'Obama'),
('The Satanic Bible', 'SICP'),
('Steve Jobs', 'Alex Jones'),
('The Brothers Karamazov', 'Philip K Dick Novels'),
('The Brothers Karamazov', 'H.P. Lovecraft'),
('Donald Trump', 'Moby Dick'),
('SICP', 'H.P. Lovecraft'),
('Alex Jones', 'Philip K Dick Novels'),
('George W. Bush', 'TNGs Picard')

I couldn't decide between the pairs, so I prepared datasets for all the candidates...

Generating datasets and training a first model

I won't cover the work of finding, making and collecting the datasets in detail; to prepare the datasets for the candidates, you'll need to run the bash scripts in the corpus directory of Dankenstein. You can generate all of them at once by running make corpora from the project root.

Some of these scripts have a few Python dependencies; you'll need to run pip install tweepy darklyrics wikiquotes to be able to scrape Twitter, DarkLyrics and Wikiquote. You will also need poppler for the ICO and SICP corpora (or other PDF wrangling), which can be installed on macOS via Homebrew by running brew install poppler, or on Ubuntu by running sudo apt-get install -y poppler-utils. For other distros or platforms, please consult the Poppler website.
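
For reference, the setup described above boils down to a few shell commands (the apt-get line is the Ubuntu alternative to Homebrew):

pip install tweepy darklyrics wikiquotes   # scraping Twitter, DarkLyrics and Wikiquote
brew install poppler                       # PDF extraction for the ICO and SICP corpora (macOS)
# sudo apt-get install -y poppler-utils    # ...or the same on Ubuntu
make corpora                               # build all corpora, run from the Dankenstein project root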

Additionally, you need OAuth API keys for Twitter in order to scrape tweets properly, as well as to post tweets from our Markov Chain model in the final step. Once you've registered an application, enter your credentials in twitterCredentials.sh - or, if you're following along and coding it yourself, you can set your consumer key, consumer secret, access token and access token secret as environment variables:

export CONSUMER_KEY="consumer_key"
export CONSUMER_SECRET="consumer_secret"
export ACCESS_KEY="access_token"
export ACCESS_SECRET="access_token_secret"
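
If you want to sanity-check those keys before wiring them into the scrapers and the bot, a quick tweepy snippet like this (reading the same environment variables) should print the screen name of the account they belong to; it's just an optional check, not part of Dankenstein:

import os
import tweepy

# Authenticate with the credentials exported above
auth = tweepy.OAuthHandler(os.environ['CONSUMER_KEY'], os.environ['CONSUMER_SECRET'])
auth.set_access_token(os.environ['ACCESS_KEY'], os.environ['ACCESS_SECRET'])
api = tweepy.API(auth)

# verify_credentials() returns the authenticated user if the keys are valid
print(api.verify_credentials().screen_name)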

For our Markov model, we'll need to install the markovify package by running pip install markovify.

Now let's train a model!

import os
import sys

import markovify

current_dir = os.path.dirname(os.path.abspath(__file__))

if len(sys.argv) <= 2:
    sys.exit('Need two corpora!')
else:
    corpus1 = sys.argv[1]
    corpus2 = sys.argv[2]

# Get raw text as strings.
with open(os.path.dirname(current_dir) + "/corpus/" + corpus1 + ".txt") as f:
    text_a = f.read()
with open(os.path.dirname(current_dir) + "/corpus/" + corpus2 + ".txt") as f:
    text_b = f.read()

# Build a model for each corpus.
model_a = markovify.Text(text_a)
model_b = markovify.Text(text_b)

# Combine the models with equal weights.
model_combo = markovify.combine([ model_a, model_b ], [ 1, 1 ])

# Print three randomly-generated sentences of no more than 140 characters
for i in range(3):
    print(model_combo.make_short_sentence(140))
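
Assuming you save the script above as, say, firstModel.py (the file name and corpus identifiers here are just examples; the script expects <name>.txt files in the corpus directory one level up from where it lives, as in Dankenstein's layout), you can point it at any two corpora:

python firstModel.py communistManifesto steveJobs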

After generating a few sentences for most of the combinations, I narrowed my search down to the following candidate combinations:

('King James Bible', 'Richard Stallman'),
('King James Bible', 'Donald Trump'),
('ICOs', 'H.P. Lovecraft'),
('ICOs', 'Alex Jones'),
('Erowid', 'Alex Jones'),
('Erowid', 'Mark Zuckerberg Testimony'),
('Cannibal Corpse Lyrics', 'Deepak Chopra'),
('Cannibal Corpse Lyrics', 'Elon Musk'),
('Mark Zuckerberg Testimony', 'TNGs Data'),
('Richard Stallman', 'TNGs Data'),
('The Communist Manifesto', 'Steve Jobs')

Now was the time for further experimentation and fine-tuning.

A better class

First off, I wanted to make a Markov model class that obeys sentence structure better than a naive model.

For this class, you'll have to install a dependency for our Markov model class, nltk, and download some extras for it: pip install nltk && python -m nltk.downloader all

import markovify
import nltk
import re

#nltk.download('averaged_perceptron_tagger')

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [w for w in words if len(w) > 0]
        words = ["::".join(tag) for tag in nltk.pos_tag(words)]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence
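
As a quick sanity check that POSifiedText drops in wherever markovify.Text does, you can build a model from a single corpus and generate a sentence (the corpus path below is just an example):

# Continuing in the same file as the class above
with open("corpus/steveJobs.txt") as f:  # example path; use any corpus you've generated
    model = POSifiedText(f.read(), state_size=2)

print(model.make_short_sentence(140))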

Then we'll update our example script from earlier like so:

# file: trainModel.py

import os
import subprocess
import sys
import markovify
from POSifiedText import *

current_dir = os.path.dirname(os.path.abspath(__file__))

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])

"""
Script that trains a Markov Model

args: corpus1 corpus2 [(scale1 scale2) stateSize overlapTotal overlapRatio tries sentences modelComplexity]
"""

if len(sys.argv) <= 2:
    sys.exit('Need two corpora!')

else:
    corpus1 = sys.argv[1]
    corpus2 = sys.argv[2]

ratio1 = float(sys.argv[3]) if len(sys.argv)>4 else 1.0
ratio2 = float(sys.argv[4]) if len(sys.argv)>4 else 1.0

state_size = int(sys.argv[5]) if len(sys.argv)>5 else 2
overlap_total = int(sys.argv[6]) if len(sys.argv)>6 else 15
overlap_ratio = int(sys.argv[7]) if len(sys.argv)>7 else 70
tries = int(sys.argv[8]) if len(sys.argv)>8 else 10
sentences = int(sys.argv[9]) if len(sys.argv)>9 else 5
model_type = sys.argv[10] if len(sys.argv)>10 and sys.argv[10] in ('naive', 'expert') else 'naive'

for candidate in [corpus1, corpus2]:
    if not os.path.isfile(os.path.dirname(current_dir)+"/corpus/"+candidate+".txt"):
        try:
            subprocess.call(os.path.dirname(current_dir) + "/corpus/"+candidate+".sh", shell=True)
        except Exception as e:
            print "Corpora scripts not set as executable!"

# Get corpora sizes so the model ratios start from an equal-size basis
corpus1_size=file_len(os.path.dirname(current_dir)+"/corpus/"+corpus1+".txt")
corpus2_size=file_len(os.path.dirname(current_dir)+"/corpus/"+corpus2+".txt")

if corpus1_size >= corpus2_size:
    ratio1base = 1.0
    ratio2base = float(corpus1_size)/float(corpus2_size)
else:
    ratio2base = 1.0
    ratio1base = float(corpus2_size)/float(corpus1_size)

# Get raw text as strings and build models
with open(os.path.dirname(current_dir) + "/corpus/"+corpus1+".txt") as f:
    text_a = f.read()
with open(os.path.dirname(current_dir) + "/corpus/"+corpus2+".txt") as f:
    text_b = f.read()

if model_type == 'naive':
    # Naive models
    model_a = markovify.Text(text_a, state_size)
    model_b = markovify.Text(text_b, state_size)
elif model_type == 'expert':
    # Custom models
    model_a = POSifiedText(text_a, state_size=state_size)
    model_b = POSifiedText(text_b, state_size=state_size)

# Combine the models
model_combo = markovify.combine([ model_a, model_b ], [ ratio1base*ratio1, ratio2base*ratio2 ])

# Print randomly-generated sentences of no more than 140 characters
for i in range(sentences):
    print(model_combo.make_short_sentence(140, max_overlap_total=overlap_total, max_overlap_ratio=overlap_ratio, tries=tries))

This new script allows us to weight the two models we combine relative to each other. It also lets us set constraints on originality via the maximum overlap with the training texts, both as a number of sequential words and as a ratio of the resulting sentence's length. We can also override the default state size (2), i.e. the number of preceding words each new word depends on. Finally, we can choose which type of model to use (the first, naive class, or the second, more advanced class).

You can now run the script like this: python trainModel.py <corpus1> <corpus2> [<scale1> <scale2> <stateSize> <overlapTotal> <overlapRatio> <tries> <sentences> <modelComplexity>] (where the arguments in square brackets are optional).

Alternatively, you can run make model ARGS="<corpus1> <corpus2> [<scale1> <scale2> <stateSize> <overlapTotal> <overlapRatio> <tries> <sentences> <modelComplexity>]" from Dankenstein's project root.
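
For example, training a POS-aware ('expert') model on two corpora with equal weights and the default constraints could look like this (the corpus identifiers are illustrative; use whatever names your corpus scripts produce):

python trainModel.py communistManifesto steveJobs 1 1 2 15 70 10 5 expert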

In case of Yuge corpora

Note: I have not needed to perform the following step during my own work.

By default, the markovify.Text class loads, and retains, your textual corpus, so that it can compare generated sentences with the original (and only emit novel sentences).

With very large corpora, however, loading the entire text at once (and retaining it) can be memory-intensive. You can solve the problem by doing something like this:

import os

import markovify

# Tell Markovify not to retain the original
with open("path/to/my/huge/corpus.txt") as f:
    text_model = markovify.Text(f, retain_original=False)

print(text_model.make_sentence())

# Read in the corpus line-by-line or file-by-file and combine them into one model at each step
combined_model = None
for (dirpath, _, filenames) in os.walk("path/to/my/huge/corpus"):
    for filename in filenames:
        with open(os.path.join(dirpath, filename)) as f:
            model = markovify.Text(f, retain_original=False)
            if combined_model:
                combined_model = markovify.combine(models=[combined_model, model])
            else:
                combined_model = model

print(combined_model.make_sentence())

Output


In my attempts, I pretty soon discovered that a hybrid of Steve Jobs and The Communist Manifesto yielded some pretty great results.

Redistribution is subject to the crazy ones.
I would have to study physics to understand the laws of the proletariat.
And yet death is the most rudimentary of directions and you asked how to drive a car.
I got more thrill out of them than I have out of any practical application in my body.

There were also some particularly uncharacteristic ones,

We also know first hand that Flash is the greatest invention of Life.

and some unsettling ones.

Somehow it lives on, but sometimes I think they are afraid how we would taste.

Also, some strange ones...

So at 30 I was born.
Our belief was that I had never graduated from high school.
So I decided to put me up for adoption.

I therefore decided to save the model. This essentially just means replacing the last code block of our training script (from the comment about printing randomly generated sentences onwards) with the following, which serializes the model and some associated variables:

# Pickle what we need
import cPickle as pickle

pickle.dump(
    {
        'model': model_combo,
        'overlap_total': overlap_total,
        'overlap_ratio': overlap_ratio,
        'tries': tries,
        'sentences': sentences
    },
    open(os.path.dirname(current_dir) + '/twitter-bot/model.p', "wb")
)

Tweeting

Finally, I needed a script to handle the posting to Twitter:

import cPickle as pickle
import os
import sys
import tweepy
from POSifiedText import *

current_dir = os.path.dirname(os.path.abspath(__file__))

# Load model and variables
data = pickle.load( open( current_dir+'/model.p', "rb" ) )

# Generate sentence
sentence = data['model'].make_short_sentence(140, max_overlap_total=data['overlap_total'], max_overlap_ratio=data['overlap_ratio'], tries=data['tries'])  # or 280, with Twitter's newer limit

# Authenticate with Twitter and post tweet
auth = tweepy.OAuthHandler(os.environ['CONSUMER_KEY'], os.environ['CONSUMER_SECRET'])
auth.set_access_token(os.environ['ACCESS_KEY'], os.environ['ACCESS_SECRET'])

api = tweepy.API(auth)
api.update_status(sentence)

Deploying the bot

To deploy the bot, you can set up a cron job on your own server, or you can use Heroku's free tier or Google Cloud's App Engine Standard free tier, for instance.
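
If you go the cron route, a crontab entry along these lines would post a new tweet every four hours (the paths are hypothetical placeholders for wherever your checkout and tweeting script live):

# m  h    dom mon dow  command
0    */4  *   *   *    cd /home/you/dankenstein && python twitter-bot/tweet.py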

In this demonstration, I'll use Heroku.

To follow along, your root project directory must be a Git repo, so you'll need to create one if you haven't already. If you've checked out Dankenstein, you should run rm .git/config from the project root before creating your own.

Make sure you've generated a model (saved as model.p, using Dankenstein or the example code) and added it to your Git repo.
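
For a Dankenstein checkout, that boils down to something like the following (the model path matches the pickling step earlier; adjust it if yours differs):

rm .git/config                      # detach the checkout from the Dankenstein remote
git init                            # recreate a default local Git config
git add twitter-bot/model.p         # the model serialized by the training script
git commit -m "Add generated model"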

You'll need to register a free Heroku account. Choose "Python" as your Primary Development Language. You'll also need to download and install the Heroku Toolbelt.

Then run the following from your shell:

heroku login # authenticate using your login info
heroku create <your-app-name> # create an application
heroku config:add CONSUMER_KEY=consumer_key CONSUMER_SECRET=consumer_secret ACCESS_KEY=access_key ACCESS_SECRET=access_secret
git push heroku master # push to the dyno
heroku ps:exec # ssh to the dyno; will trigger a restart the first time
heroku ps # check status

You'll also need to make a Procfile, which defines the job for your worker dyno. If you've checked out Dankenstein, the accompanying Procfile only contains the following:

worker: bash dankenstein/bot.sh && sleep 14400

This will make the worker dyno generate a tweet based on your model, and sleep for four hours before doing the same all over again.

Make your adjustments to the repo and push the master branch to heroku again:

git push heroku master

Congratulations! You have just made your very own Markov Chain Twitter bot!

"BUT I DON'T WANT TO PROGRAM"

If you don't want to implement a bot yourself, the generator I produced during my work on my bot, Karl Jobs, is available on GitHub as Dankenstein.

Dankenstein will let you

  • Recreate the corpora I have used as examples in this post
  • List available corpora and possible candidate combinations
  • Generate a bot
  • Output sentences
  • Tweet

Feel free to create an issue if you find a bug, or a PR if you implement support for a new dataset!
