Back to Binary Thoughts

Storytelling Through Code

"The one where code came in a cryptex." | How code should be written with the care, diligence, and editing of a good story.

  • Eric Pesch
  • Eric Pesch
  • 18 Oct 2020
  • 11 min read

book_apple_unsplash.jpg

Once upon a time, there was this piece of Python code... Image: by @lianamikah via Unsplash, book pictured is "A Baxter Family Christmas" by Karen Kingsbury

A story has 3 simple parts:

  • A beginning
  • A middle
  • And an end

There’s a lot that can happen in those 3 beats, but that’s the basic gist of it. Last post, we covered the basics of technical communication and broke down a lot of ways that framing and breaking down a question can lead to telling a better story about what you are trying to communicate. Just like how we need to communicate who, what, when, where, and why we do something, it is essential we do the same with our technical work as well.

In this post, I aim to cover how important storytelling is when it comes to doing something I generally do every day of my professional career - writing code. We will break it down in a few ways:

  1. What should code tell me
  2. By Example: Python
  3. By Example: SQL

1. What Should Code Tell Me - A Fairy Tale Approach

To re-iterate, a story has 3 parts: a beginning, middle, and end. When we open a piece of code - whether written by us or by others - we have similar story beats.

  • What will this do when I run it
  • What will it take to run it
  • Why are you here in the first place

From a general code convention standpoint, I can accomplish this in a few different ways. To name a few:

  • Variable names
  • Class definitions
  • Documentation
  • Comments

Each of these methods is different depending on the language you are working in and you may have more choices based on the syntactic sugar that language supports. Additionally, some languages have structural framing which you would need to fit like Objects, Classes, etc., that guide how you approach telling that piece of code’s story.

The point of the examples below is to demonstrate some practices I use or see every day, but you will need to discover the best ways to tell a story in the medium you are working in. Just like a story for screen is built differently than a story in companies, storytelling in Python is going to be different than storytelling in SQL.

Side-Note - Some Fairytales

I like seeing things concretely, so I was curious if there was some small literature that I could try and adapt with code. I landed upon fairy tales - our childhood dreams (and sometimes horrors) that are universally known by most children in some form or another. Understandably, not all of these exist everywhere in the same fashion across the world, so I wanted to recap a few I know so we are all on the same page.

Jack and Jill

Jack and Jill went up the hill

To fetch a pail of water.

Jack fell down and broke his crown,

And Jill came tumbling after.

Hickory, Dickory, Dock

Hickory, dickory, dock.

The mouse ran up the clock.

The clock struck one,

The mouse ran down,

Hickory, dickory, dock.

2. By Example: Python - Jack and Jill

Let’s take “Jack and Jill” and put that into a Python script which tells us a random result of success or failure of the dynamic duo fetching a pail of water. Either character could fetch the pail as well. We’ll even throw in another character for fun - James. The result can look like the following:

$ python3 jack_jill.py
Jack fetched a pail of water.
Jill felt overjoyed!
James felt overjoyed!

or:

$ python3 jack_jill_bad.py
Jill fell down and broke their crown.
Jack came tumbling after!
James came tumbling after!

The first pass at that program might look like:

import random

def x(input):
    xs = []
    random.shuffle(input)
    p = input.pop()
    z = random.choice([True, False])
    if z:
        xs.append(f"{p} fetched a pail of water.")
        xs.extend([f"{n} felt overjoyed!" for n in input])
    else:
        xs.append(f"{p} fell down and broke their crown.")
        xs.extend([f"{n} came tumbling after!" for n in input])
    return "\n".join(xs)

print(x(["Jack", "Jill", "James"]))

Arguably, the program accomplishes what we set out to do - randomly select a character, and randomly print whether they succeeded or failed. Looking more closely though, this fails the litmus of legibility in multiple ways:

  • Why is there a random boolean choice?
  • Variable names are concise to the point of being illegible and easily mis-used (e.g. xs, p, z, n, x())
  • There is no documentation explaining what this does fully.

Let’s go back, take another stab at it, and see if we can make it legible:

# Jack and Jill in Python
# Purpose:          Print a sample story, using the fairytale of Jack and Jill as a framework
# Author:           Eric Pesch (ep3998)
# Date Modified:    2020-10-18

from typing import List
from dataclasses import dataclass
import random

Added frontmatter to the top of the code explaining purpose and history

@dataclass(frozen=True)
class Person:
    """Defines a character in our story
    """
    name: str

def fetch_water(person: Person) -> bool:
    """Randomly determine if water was fetched

    Args:
        person (Person): Character

    Returns:
        bool: If successful fetching water
    """
    return random.choice([True, False])

Broke some of the functionality into sub-modules and classes

def story(characters: List[Person]) -> str:
    """Fairytale

    Args:
        characters (List[Person]): List of lead characters

    Returns:
        str: The result of the story
    """
    results = []

    # Choose a random character to fetch water
    random.shuffle(characters)
    char_fetch_water = characters.pop()
    has_water = fetch_water(char_fetch_water)

    # Print story output, depending on success/failure result
    if has_water:
        results.append(f"{char_fetch_water.name} fetched a pail of water.")
        results.extend([f"{char.name} felt overjoyed!" for char in characters])
    else:
        results.append(f"{char_fetch_water.name} fell down and broke their crown.")
        results.extend([f"{char.name} came tumbling after!" for char in characters])


    return "\n".join(results)

Cleaned up and added documentation for the main story method

# List of characters in our story
story_chars = [
    Person("Jack"),
    Person("Jill"),
    Person("James")
]

# Print story output
print(story(story_chars))

Added more in-line documentation explaining inputs.

Full Python code all together: jack_jill.py

OK - This is a bit overboard on the side of documentation, but reading it from top to bottom gives both experienced and novice programmers clear indicators of purpose. From our original list:

  • What will this do when I run it - In-line comments in the code; functional modules/units of work separate from each-other
  • What will it take to run it - Typing hints throughout for input/outputs
  • Why are you here in the first place - Frontmatter at the top of the code

While this is using Python-specific syntax to enable this kind of story, similar common principles could be shared across all kinds of code. Again, reiterating our previous list:

  • Variable names - Clear and legible now
  • Class definitions - Modular functions instead of a mish-mash of logic
  • Documentation - Lots of it now
  • Comments - In-line guides

The above example is trivially complex, but think of the every day scenarios which fall into a trap of undocumented logic. To name some examples I have run into:

  • ETL systems which contained arbitrary black-box transformations no one wrote down (root cause: why is this multiplication here - oh it’s doing a conversion rate exchange to US Dollars)
  • Visual Basic Scripts with Regex functions with inclusion/exclusion groupings with random characters (root cause: spam filters for Email Subject lines from known malware groups)
  • Byte shifting in Java with arbitrary hard-coded shifts (root cause: unsigned byte array to signed integers using endian values)

All of these examples caused hours of code dives, hallways meetings, and note scrubbing to explain the code when an extra few lines of documentation would’ve saved lives time.

3. By Example: SQL - Hickory, Dickory, Dock

What about something a bit more limited in nature like SQL? Let’s say we had a log of times the mouse ran up and down a clock in “Hickory, Dickory, Dock” and wanted to run some quick statistics on its trips.

Schemas and data for fun:

-- Schemas and sample data for those following at home
CREATE TABLE log_mouse (
    name               VARCHAR(255),  -- Character name
    is_ran_down        BOOLEAN,       -- If ran up or down clock
    hour               INT,           -- Time on clock

    -- Log creation/modified dates
    creation_time      DATETIME DEFAULT CURRENT_TIMESTAMP,
    modification_time  DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
  );

INSERT INTO log_mouse(name, is_ran_down, hour, creation_time)
VALUES
  ('mouse', TRUE, 1, '2020-01-01'),
  ('mouse', TRUE, 2, '2020-01-01'),
  ('mouse', TRUE, 3, '2020-01-01'),
  ('mouse', TRUE, 4, '2020-01-01'),
  ('mouse', TRUE, 5, '2020-01-01'),
  ('mouse', TRUE, 1, '2020-01-02'),
  ('mouse', FALSE, 2, '2020-01-02'),
  ('mouse', TRUE, 3, '2020-01-02'),
  ('mouse', FALSE, 4, '2020-01-02'),
  ('mouse', TRUE, 5, '2020-01-02'),
  ('mouse', FALSE, 1, '2020-01-03'),
  ('mouse', FALSE, 2, '2020-01-03'),
  ('mouse', FALSE, 3, '2020-01-03'),
  ('mouse', FALSE, 4, '2020-01-03'),
  ('mouse', FALSE, 5, '2020-01-03');

Sample result:

name | hour | cumulative_cnt_ran_down
-------------------------------------
mouse|     5|                       8

Using the same pattern as the SQL section, let’s take a look at a first draft and then re-write it.

Draft:

SELECT *
FROM (
  SELECT name, hour, SUM(cnt_ran_down) OVER (order by name, hour) AS cumulative_cnt_ran_down
  FROM
    (
     SELECT name, hour, SUM(is_ran_down) AS cnt_ran_down
    FROM
      (
        SELECT * FROM log_mouse
        WHERE creation_time BETWEEN DATE('2020-01-01') AND DATE('2020-01-02')
      ) a
    GROUP BY 1,2
    ) b
) c
WHERE hour = 5;

Better draft:

WITH cte_log AS (
  SELECT
    name,
    is_ran_down,
    hour,
    creation_time
  FROM
    log_mouse
  WHERE
    creation_time BETWEEN DATE('2020-01-01') AND DATE('2020-01-02')
),
cte_agg AS (
  SELECT
    name,
    hour,
    SUM(is_ran_down) AS cnt_ran_down
  FROM
    cte_log
  GROUP BY
    name,
    hour
),
cte_cumulative AS (
  SELECT
    name,
    hour,
    SUM(cnt_ran_down) OVER (order by name, hour) AS cumulative_cnt_ran_down
  FROM
    cte_agg
)
SELECT *
FROM cte_cumulative
WHERE hour = 5;

Without even a line of in-line comments, the second draft easily demonstrates the intention of the whole query:

  • cte_log - Grab the log data
  • cte_agg - Aggregate the times the mouse ran down the clock
  • cte_cumulative - Group by the hour, and calculate the cumulative sum of runs
  • Final SELECT - Grab the final cumulative number from hour 5

Trivially this can be done more efficiently - we could completely remove the cumulative part - but if you imagine harder questions like Rolling Averages across dates, Outlier patterns via quartiles, JOIN-ing multiple tables together - everything gets complicated very, very fast. Consider the first SQL block for this arbitrarily easy question:

  • Did you notice we filtered by a specific time range (Jan 1 to 2nd)?
  • Did you notice what columns we GROUP‘d by, hidden behind column indices `1, 2)
  • Which sub-block did the GROUP BY apply to, and would you have even noticed the final WHERE clause?

From a SQL specific perspective, just re-arranging the code into modular common table expressions (CTE) makes everything cleaner and easier to debug. For example, if the log table returned no results due to the date filtering, I would have to do something like this with the original query to start debugging it:

-- SELECT *
-- FROM (
  -- SELECT name, hour, SUM(cnt_ran_down) OVER (order by name, hour) AS cumulative_cnt_ran_down
  -- FROM
    -- (
    --  SELECT name, hour, SUM(is_ran_down) AS cnt_ran_down
    -- FROM
      -- (
        SELECT * FROM log_mouse
        WHERE creation_time BETWEEN DATE('2020-01-01') AND DATE('2020-01-02')
--       ) a
--     GROUP BY 1,2
--     ) b
-- ) c
-- WHERE hour = 5;

Say it was in the GROUP BY, or the cumulative date window was made wrong - imagine hunting through SQL which can be hundreds of lines long and debugging nested sub-queries? That’s a thing of nightmares!

Going back to our list:

  • What will this do when I run it - Sequential step-by-step modules
  • What will it take to run it - N/A mostly, though debugging will be easier
  • Why are you here in the first place - Named CTEs, named GROUP BYs, etc.

Caveat: There can be performance implications with changing from sub-queries to CTEs depending on the internals of the database’s query optimizer, but in general the tone of this section is “figure out ways of making code legible, even in constrained environments.”

Last Bits

Hopefully the direction of what I covered is clear - legible code saves time, money, and lives. Every engineer no matter their discipline has probably faced this problem in their day-to-day jobs. What does this button do, what does this gauge tell me, could these arbitrary lights and signals have been designed in a way to indicate warnings better. Take a little time upfront to think about the future reader. Whether that’s you, the intern, the person backfilling for you, or the person on-call - everyone benefits with a little forethought in explanation.

Always remember: a story has a beginning, middle, and end. Think about and tell your story no matter the medium.


Appendix

Link: The After-Post

Code Caveat: There’s a whole other discussion I skipped when it comes to code style, but let’s just leave that monster aside for now. Every language has quirks, patterns, camelCase vs. snake-case discussions… Just decide on a style as a team and be consistent. If you need technical helpers for that, use tools like linters and auto-formatters (this links to VS Code extensions but covers the gamut of general formatters for each language as well like beautify and prettier.)