Thoughts after the ASCII music video

Introduction

Recently, I made a music video for one of my songs. I made the video using Python 3 and Asciimatics.

I'm interested in lo-fi videos, and this was a first experiment. While my experiment existed solely within the bounds of a terminal window, there were a number of lessons learned through the process.

These lessons will help me, as well as others who are interested in performing similar lo-fi experiments.

Use real time (not frames)

Asciimatics uses a single integer "frame" counter.

Regardless of how fast the screen is updating, it's easier to deal with simple seconds and fractional seconds. Approximate timing is okay, especially if we can readily use timing data gathered from other sources.

For instance, there's timing data in closed captions. It's also easy to create timing data in something like Audacity. All of these use either human time or seconds and fractional seconds.

Leverage your other tools by dropping the notion of frames. If you _really_ need to be frame-precise, consider using a frame-separator in other timestamps, maybe using an '@' as a separator between the seconds and frames.
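As a rough sketch of what that could look like, here is one way to turn human-style timestamps into plain seconds. The function name `parse_time`, the `fps` default of 20, and the exact meaning of the '@' suffix are all my assumptions for illustration, not part of Asciimatics or any other library.

```python
def parse_time(stamp, fps=20):
    """Convert a timestamp to seconds.

    Accepts plain numbers (already seconds) or strings such as '7.5',
    '5:05', or '1:02:03'. An optional '@N' suffix adds N frames at the
    given fps (hypothetical syntax for frame-precise cases).
    """
    if isinstance(stamp, (int, float)):
        return float(stamp)
    clock, _, frames = stamp.partition('@')
    seconds = 0.0
    # Each ':' shifts the running total up by a factor of 60.
    for part in clock.split(':'):
        seconds = seconds * 60 + float(part)
    if frames:
        seconds += int(frames) / fps
    return seconds
```

With this, `parse_time('5:05')` gives 305.0 and `parse_time('0:02@10')` gives 2.5 at 20 frames per second, so timing data copied from captions or Audacity labels can be used more or less as-is.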

Idealized timeline

A show can be thought of as a series of variable-length scenes, strung together.

In my music video, I had a start screen before the music started.

You need to be flexible with intro and outro content, while also fully supporting binding the video to the audio's location. In most cases, you may be able to get away with having a "starting time" for a scene which is simply subtracted from all actions in the scene.

If I have three scenes, A, B, and C, and I know roughly when each one starts, it could be something as simple as:

scene_a = Scene(0)
scene_a.at('0:05', do_it)
# ... add 10 minutes of content (originally only 5)
scene_a.at('5:05', part_of_scene_a)
scene_b = Scene('5:00')
scene_b.at('5:05', part_of_scene_b)
# ... add 15 minutes of content
scene_c = Scene('25:00')
scene_c.at('25:05', do_it)
show = [ scene_a, scene_c, scene_b ]

Since each scene can know its own starting point, it can keep the scene timing consistent, even if the order of the scenes themselves changes or the earlier scenes change length.

Do you want to add a commercial? You shouldn't have to dick with timings. Just create the scene and stick it in the show:

commercial = Scene('3:00')
show = [ scene_a, commercial, scene_c, scene_b ]

That sort of scene movement only works when music is bound to scenes, and not the whole show, of course.
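Here is a minimal sketch of how the bookkeeping might work. It assumes times are already plain seconds and that every scene knows its own `length`. Both the `length` parameter and the `flatten` helper are hypothetical names I'm using for illustration.

```python
class Scene:
    """A scene whose events are written against its own nominal start time."""

    def __init__(self, start=0.0, length=0.0):
        self.start = start      # nominal start time, in seconds
        self.length = length    # how long the scene actually plays
        self.events = []        # (nominal_time, action) pairs

    def at(self, when, action):
        self.events.append((when, action))


def flatten(show):
    """Return (play_time, action) pairs for the scenes in show order."""
    timeline = []
    play_offset = 0.0
    for scene in show:
        for when, action in sorted(scene.events, key=lambda e: e[0]):
            # Subtract the scene's own start, then add where it really plays.
            timeline.append((play_offset + (when - scene.start), action))
        play_offset += scene.length
    return timeline
```

Because each event is stored relative to its scene's start, reordering the list or dropping a commercial in the middle only changes `play_offset`, never the timestamps written inside a scene.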

Going further

For an audio book or other long-form audio stream, you should be able to grab the audio file and split it up into separate scenes.

Markers for splits are, as mentioned, easy enough to gather in Audacity. But while you can split things up in Audacity, it shouldn't be required.

If you need to use Audacity anywhere to find the split-points, it's not really a time-saver. If, however, you can gather split-points within the application, things get more interesting.

Splitting audio at scene-breaks allows you to use scene-breaks as explicit restart points when iterating on a scene. It's faster and easier to only allow jumping forward and backward at scene changes, as you know the screen will start from black.

This means on the back-end, we'll need to track the time of the scene change for the audio to support this, anyway. If we have the data, we should support splicing a new scene into that location.
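One possible shape for that back-end data is a sorted list of split points, in seconds, into the long audio file. The helper names below (`add_split`, `segments`, `restart_point`) are made up for this sketch.

```python
import bisect


def add_split(split_points, when):
    """Record a scene break gathered inside the application, keeping order."""
    if when not in split_points:
        bisect.insort(split_points, when)
    return split_points


def segments(split_points, total_length):
    """Turn sorted split points into (start, end) audio ranges, one per scene."""
    edges = [0.0] + list(split_points) + [total_length]
    return list(zip(edges, edges[1:]))


def restart_point(split_points, now):
    """Find the most recent scene break at or before the given audio time."""
    index = bisect.bisect_right(split_points, now)
    return split_points[index - 1] if index else 0.0
```

Jumping backward while iterating on a scene then becomes a lookup with `restart_point`, and splicing in a new scene is just another call to `add_split`.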

Features and timeline

mvp:

Timeline using real time units. The line between "scene" and "show" can be blurry or not exist.

mvp+1:

Shows made of series of stitched-together scenes. Scenes described with starting times that may not map to their play-time.

mvp+2:

Scenes and timelines integrate with long audio tracks and arbitrary starting points within those tracks.

Synchronize with the audio

My first experiment used PyGame to run the song. This back-end is designed for background music in games.

You need to be able to query the audio to see where it is. If the audio isn't where it is expected to be, you need to hold everything up until it catches up.

PyGame doesn't support this. It's more of a fire-and-forget service.

Idealized timeline

At a minimum, you need to delay the start of a scene until the audio starts moving.

The disadvantage of audio running in a separate thread (as is normally done) is that it may not be at the same place as the animation thread. The play speed shouldn't have glitches, but the start times can be a bit wobbly.

At the very least, you need to support pausing until the audio is ready. Having all sense of time come from the audio goes one step further, as it makes the primary timekeeper the audio system.
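A sketch of the "pause until the audio is ready" part might look like this. It assumes the audio back-end can give us a `get_position` callable that returns seconds played so far. That callable is hypothetical, and whether a given audio library offers something equivalent (and how accurate it is) needs checking.

```python
import time


def wait_for_audio(get_position, timeout=2.0, poll=0.005):
    """Block until the audio reports movement, then return its position.

    get_position is a hypothetical callable returning seconds played so far
    (zero or negative while the audio has not started).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        position = get_position()
        if position > 0:
            return position
        time.sleep(poll)
    raise TimeoutError('audio never started')
```

The returned position can seed the scene clock, so the animation starts from wherever the audio actually is rather than from zero.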

Some systems (such as PyGame) have a distinction between a sound effect that is loaded entirely into memory and a streamed background music file.

Even if you're technically dealing with background music, getting the timing right may require loading more of the file into memory more of the time. Accept that you may need a whole song in memory, and that you can only reasonably change this during scene breaks.

Going further

You should be able to do your final rendering non-real-time, so you'll always be properly synchronized.

Non-real-time is the ideal for keeping audio and video synchronized. It allows you to bite off small pieces of audio at a time and know that everything will line up.
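To make the "small pieces of audio" idea concrete, here is a sketch that works out which audio samples belong to each rendered frame. The function name `audio_slices` is mine, and it assumes an integer frame rate and sample rate.

```python
def audio_slices(frame_count, fps, sample_rate):
    """Yield (frame, first_sample, end_sample) for each rendered frame.

    Integer arithmetic keeps the slices contiguous, so rounding never
    accumulates into drift between audio and video.
    """
    for frame in range(frame_count):
        first = frame * sample_rate // fps
        end = (frame + 1) * sample_rate // fps
        yield frame, first, end
```

At 24 frames per second and 44100 samples per second, a frame covers 1837.5 samples on average, so the slices alternate between 1837 and 1838 samples and still land exactly on sample 44100 after one second.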

Features and timeline

mvp:

Preload entire audio file into memory and avoid streaming from disk.

Try to change scenes or cameras at background song boundaries. Keep these as isolated units, knowing they'll get stitched together during editing.

mvp+1:

Start of video is delayed until the audio starts playing.

If each scene is independent, a scene may have a pause to start. This can be corrected in post, as needed, but should keep audio and video consistent.

It might be useful to have the video's sense of time come from the core background track, but I'm not sure that's 100% needed without further testing.

mvp+2:

Non-realtime rendering ensures that audio and video are always synchronized.

This is by far the gold standard. Ideally, you can do this at faster than real time.

ASCII as a visual form

I used big Figlet ASCII Art fonts for my test video.

Monitors are bigger and higher resolution than ever before, right?

But this is really what you need. Huge text, even in text mode.

Some of the viewers will watch it full-screen, sure, but a significant population will half-distractedly watch a thumbnail instead.

If it's a silent film, folks will need to go larger to read what is going on. So, in a silent film context a text-based Roguelike user experience may still work? Further experiment is required.

Still, consider going for an older aesthetic and angling for 40x25 (or thereabouts) instead of something more modern.

Idealized visual form

I'm still thinking about old school RPGs.

Fixed camera at best. Top-down maps. A few fixed expressions in close-up. Maybe a giant close-up like you find in visual novel games.

And a dedicated section for dialog to appear.

Maybe menu-style alternative dialog. Of course this would be just a fake, but it would be easy flavor.

It would be mostly tile-based with a few larger graphics now and then.

It would probably be less than 40 tiles wide. The Roguelike people have to make a lot of compromises about visible map size versus map quality, so if you're curious about the how and why, you can always look there.

Going further

Honestly, I'd really like to have something like The Sims where instead of semiautonomous entities, you just had actors you could control and play and rewind their time.

There is MakeHuman which provides an open-source method to generate and render humans. It has a lot of output formats.

I wouldn't mind using the entire virtual worlds of The Sims, though. If we could use assets from The Sims (on par with, say, other open-source games that require commercial assets), we could use the third-party assets as well. There are a considerable number of those, some with decent licenses.

There are other 3D games we might be able to leverage, but few are designed for normal, ordinary-world stuff like The Sims.

Garry's Mod might technically work, but modifying maps is a fair bit more complicated, and it uses a commercial engine... Then there's the mod community that mostly just steals stuff from commercial games and is full of fascists... Not very appealing.

Features and timeline

mvp:

Modeled after an RPG, or a text-based Roguelike. A dedicated place for the dialog. The right versus left, main character versus whoever is being talked to. It's an easy UX to write that's flexible for many types of stories.

mvp+1:

It's possible to experiment with 3D without actually having a 3D game. The portraits can be animated 3D models, there can be cut-scenes. These, too, are standard components of games.

mvp+2:

This would be a 3D video, so more like a silent cartoon. Instead of the interface having a dedicated place for dialog, it would be handled more like standard closed captions.

Phase One

Timeline using real time units. The line between "scene" and "show" can be blurry or not exist.

Preload entire audio file into memory and avoid streaming from disk.

Try to change scenes or cameras at background song boundaries. Keep these as isolated units, knowing they'll get stitched together during editing.

Modeled after an RPG, or a text-based Roguelike. A dedicated place for the dialog. The right versus left, main character versus whoever is being talked to. It's an easy UX to write that's flexible for many types of stories.

Visual Idea

Here's an idea for a Roguelike visual (since they map to documents more easily):

+----------------------------------------+
|                 ",,,,.........."       |
|                 ",,,,.,,""""".."       |
|                 "####'##"   "..*""*    |
|                  #...AB#    *....."    |
|                  #.....#    """*..*"""*|
|                #####D######    "".....0|
|                #..........#     ""*"""*|
|                #...>......#            |
|                ############            |
|                                        |
+----------------------------------------+
|Betty can:                              |
|  signal to Ada to leave, ASAP.         |
|> ask for garlic (nicely).             <|
|  mock the blood on his necktie.        |
|                                        |
+----------+-----------------------------+
| Ada      |Dracula: Good evening!       |
|>Betty   <|Ada: We're here to fix your  |
|          | computers.                  |
|          |Dracula: The basement is     |
|          | over here!                  |
+----------+-----------------------------+


+----------------------------------------+
|                 """"""""""""""""       |
|                 ",,,,.........."       |
|                 ",,,,.,,""""".."       |
|                 "####+##"   "..*""*    |
|                             *....."    |
|                             """*..*"""*|
|                                ""...AB0|
|                                 ""*"""*|
|                                        |
|                                        |
|                                        |
|                                        |
|                                        |
+You see:--+Near Old House---------------+
|0: to car |The house appears ancient    |
|          | with fine, hand-crafted     |
+----------+ details now falling to ruin.|
|>Ada     <|Betty: We're lost, Ada.      |
| Betty    | Admit it!                   |
|          |Ada: We're not lost! We're...|
|          | ... Alright, Betty. We're   |
|          | lost.                       |
+----------+-----------------------------+

Source Idea

Here's a potential source snippet leading up to the above:

# Sketch only: Actor, Thing, Scene, and ambient_creep are placeholders for
# an API that doesn't exist yet.
def build_welcome_scene():
    ada = Actor('Ada', player=True)
    betty = Actor('Betty', player=True)
    dracula = Actor('Dracula')
    passage = Thing('to car')
    welcome_scene = Scene('0:00', map='dracula_floor_1', audio=ambient_creep,
                          title='Near Old House',
                          place={'A': ada, 'B': betty, 'D': dracula, '0': passage})
    betty.follow(ada)
    # Strings are scene timestamps; bare numbers are relative times in seconds.
    betty.say('0:01', "We're lost, Ada. Admit it!")
    ada.say("We're not lost. We're...")
    ada.say(1, "... Alright, Betty. We're lost.")
    # '+' is the closed door on this scene's map.
    ada.move_to('0:05', welcome_scene.map.find('+'), proximity=3)
    betty.choice("Dare Ada to lie about why we're here.",
                 "Say: We're computer technicians.",
                 "Say: We're here to suck his blood!",
                 "Say: We're pest control.",
                 pick=0, delay=0.5)
    betty.emote('smiles and looks at Ada.')
    ada.emote(0.2, 'squirms. "You have an idea.'
                   " It's a bad one. That's your bad idea face.\"")
    betty.say("We should say we're here to suck his blood.")
    ada.say("What? No.")
    ada.say(0.2, "There's no reason he'd let us in if we said that.")
    betty.emote(0.2, "nods. \"You're right. We should do something else.\"")
    betty.say(0.1, "I know. I dare you to say we're computer technicians.")
    ada.say('What?')
    ada.say(0.5, "You're mean. You know that, right?")
    welcome_scene.wait(0.2)
    return welcome_scene

Reflection

It's interesting that nothing about my example actually needs the background track to be sample-precise with the visual. How important is that, really? Maybe this is something that's only really needed for the lyric tracks and when there's explicit synchronized timing.

(For sample-precise timing to music, you might think of having a dedicated MIDI track for the action triggers. However, that's different than my above example.)

Even the "real timeline" thing is a bit fuzzy. Scenes start with a real time that's used as an offset for timestamps mentioned in the scene, yes. But what I actually use in the example are mostly relative times in seconds.

The given example has what could be a looping ambient track for the background. I think of it going silent and a knocking sound as part of the transition to the scene with the door open, but... I can also see long ambient tracks that fit multiple scenes.

This means we'd need an advisory_start which would start audio within the file if you're jumping into it, but let it flow naturally if you're starting at a previous scene. Ideally, this could be part of the next bit...

Not all scenes will have fixed starting state. Sometimes state will depend upon previous state. We still need to jump to arbitrary scenes to aid in development. We can manage this by caching scene state at the end of scenes when this is needed.

We could either always overwrite, or create a new file separate from the working file and make the developer manually overwrite. I favor always-overwrite, but user-overwrite would be more like traditional film. (I want fast and easy. Post-processing audio, as for traditional film, is neither of these things.)
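As a sketch of the always-overwrite approach, the state at the end of each scene could be dumped to a small file named after the scene. The function names, the one-JSON-file-per-scene layout, and the shape of the state dictionary are all assumptions for illustration.

```python
import json
from pathlib import Path


def save_state(cache_dir, scene_name, state):
    """Write the state at the end of a scene, always overwriting."""
    path = Path(cache_dir) / f'{scene_name}.json'
    path.write_text(json.dumps(state, sort_keys=True))
    return path


def load_state(cache_dir, previous_scene, default=None):
    """Load the cached end state of the previous scene, if there is one."""
    path = Path(cache_dir) / f'{previous_scene}.json'
    if path.exists():
        return json.loads(path.read_text())
    return {} if default is None else default
```

Jumping straight to a scene during development would then start with `load_state` for the scene before it, falling back to an empty state when nothing has been cached yet.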

Roguelike games can easily have a dedicated region for text. My example above was narrow, but I think that for a Roguelike, aiming for 80 or more columns by 24 or more rows is reasonable. Probably with three panes instead of whatever I was thinking above: one for map, one for dialog and feedback, and another for equipment or stats or even inventory.

A design modeled after a GUI RPG allows us to have portraits, but turns back-and-forth dialog into what is effectively a cut-scene. There's nothing wrong with that, but it's different work than the main stuff.

Graphic RPGs will have smaller maps than the text-based games. Any graphic RPG game that uses a "minimap" of some sort does so because the primary view is pretty but doesn't convey enough information about where you are in relationship to your objectives. You see this less with third-person turn-based games than with first-person live-action games, but this is totally fine for our particular use-case. Huge, pretty tiles and a light-weight sketch of the neighborhood in a corner for flavor.

If a show were to mostly have back-and-forth dialog, it should probably aim to feel more like a visual novel game and not an RPG. This would be lots of dialog with big portraits and usually some relationship-based questions.

