Day 10: The importance of jobs for networking

Yesterday I implemented about half of a recursive descent parser for RSS feeds. Today, again in a bit of a shorter session than usual, I implemented the other half.

Well, I say I implemented a parser for RSS feeds, but it might be more correct to say I implemented a parser for a single RSS feed. Throughout development, I have been using a single RSS feed for all tests: that for the excellent Never Post! podcast. Currently my config.kdl contains this:

feed "https://feed.neverpo.st/"

After today’s work, I can run seance sync, and it will print out the following:

Never Post
        Mailbag #11: Someone To Tell Us How To Be
        Don’t Do It Yourself
        The Year I Learned to Pay Attention
        When We Fight We Win
        Practical Magic: A Witch’s Guide to Etsy Witches
        A.I. and New American Fascism
        ...

This is the <title> of the feed, followed by the <title> of each <item> within it, in the authored order. I also parse the URLs, but have omitted them above for brevity.

The code powering this is not particularly interesting, just a straightforward continuation of what I shared previously, but with the todo!s replaced with actual implementations. What remains as an open question, though, is how I ultimately want to structure Séance’s network architecture. Not necessarily so that I can build it immediately, but so that I know where I am headed, and can therefore make decisions in roughly the correct direction.
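(For anyone who hasn't read the earlier posts, the printing step boils down to something like the sketch below. Feed and Item are placeholder names of my own, and the real types may well be shaped differently.)

// Hypothetical types, for illustration only.
struct Item {
    title: String,
    url: String,
}

struct Feed {
    title: String,
    items: Vec<Item>,
}

// Print the feed title, then each item's title indented beneath it,
// in the order the items appear in the feed.
fn print_feed(feed: &Feed) {
    println!("{}", feed.title);
    for item in &feed.items {
        println!("\t{}", item.title);
    }
}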

Job applications

Previously I claimed that the ideal for a consumer network service is to be an interruptible asynchronous job system. What I meant by this is that it is important that the system can be interrupted at any point, or have any external operation fail, and still be able to make forward progress over the long term. If your laptop goes offline, it notices and pauses work until the network is available again. Importantly, it does not lose whatever data may already have been downloaded. The same applies if you stop the program midway through synchronizing.

Rather than simply doing a fixed thing and giving up on failure, the system should have a model of both the current state and the desired state, and work to turn one into the other. Because this state needs to persist between executions of the program, it cannot be stored in system memory, and instead must live in some sort of external database. Previously I selected SQLite as my embedded database, and I think it would also serve well here.
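As a sketch of what that could look like with SQLite, using the rusqlite crate purely for illustration: the table and column names below are invented, but the idea is to record how far each download has progressed, along with a validator, so that the next run can resume rather than restart.

use rusqlite::{params, Connection, Result};

// Illustrative only: record how much of a file we have downloaded, plus the
// ETag we saw, so a later run can resume from `bytes_done` or notice that the
// file changed out from under us.
fn record_progress(conn: &Connection, url: &str, etag: &str, bytes_done: i64) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS downloads (
            url        TEXT PRIMARY KEY,
            etag       TEXT,             -- validator for spotting out-of-band changes
            bytes_done INTEGER NOT NULL  -- how much of the file we already have
        )",
    )?;
    conn.execute(
        "INSERT INTO downloads (url, etag, bytes_done) VALUES (?1, ?2, ?3)
         ON CONFLICT(url) DO UPDATE SET etag = ?2, bytes_done = ?3",
        params![url, etag, bytes_done],
    )?;
    Ok(())
}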

Another intriguing option, however, is Flawless, an “execution engine for durable computation”, which allows you to write Rust code and have each fallible or external operation recorded in a persisted log. This seems in some ways like the perfect choice, but I think it falters slightly when we look at the details. It is important that we notice when the content of a URL has changed, which may happen out of band between runs. For example, we may have loaded two out of five chunks of a podcast using a Range request when the underlying file served at that URL changes (signaled, we hope, by a 412 Precondition Failed response). We would therefore need to discard our previous chunks and start again. It isn’t clear to me how or if Flawless can handle this scenario automatically. This issue is compounded by an unfortunate lack of API docs.
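To make that scenario concrete, the resume-or-restart decision might look roughly like the sketch below. I’m assuming reqwest’s blocking client here purely for illustration, along with the If-Match / 412 dance described above; none of this is settled.

use reqwest::blocking::Client;
use reqwest::StatusCode;

// What to do with the bytes we already have on disk.
enum ChunkOutcome {
    Append(Vec<u8>), // the range was honoured; append these bytes
    Restart,         // the file changed (or Range was ignored); start over
}

// Sketch: ask for the remainder of the file, but only if it still matches the
// ETag we stored when we fetched the earlier chunks.
fn fetch_next_chunk(client: &Client, url: &str, offset: u64, etag: &str) -> reqwest::Result<ChunkOutcome> {
    let resp = client
        .get(url)
        .header("Range", format!("bytes={offset}-"))
        // If the file has changed since we stored this ETag, we hope for a
        // 412 Precondition Failed rather than mismatched bytes.
        .header("If-Match", etag)
        .send()?;

    match resp.status() {
        StatusCode::PRECONDITION_FAILED => Ok(ChunkOutcome::Restart),
        StatusCode::PARTIAL_CONTENT => Ok(ChunkOutcome::Append(resp.bytes()?.to_vec())),
        // Some servers ignore Range and return 200 with the whole file;
        // simplest to treat that as a restart too.
        _ => Ok(ChunkOutcome::Restart),
    }
}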

Though I’m not able to find the original source at the moment, there is a recurring meme in programming communities that all programs are fundamentally either compilers or databases. Séance’s network system should be a compiler, I think, or perhaps more accurately a build system.

Build systems like Make, Ninja, or redo (for the connoisseurs) operate on a directed acyclic graph which maps inputs to outputs. In the same way, we have a set of inputs (podcast feeds) and a set of desired outputs (the downloaded podcasts), and take a number of actions (HTTP requests, etc.) to get from one to the other. The state of this graph should be stored in a database, updated each time we perform one of the pending steps. When Séance starts syncing, we will check the inputs (the feeds) and use them to derive the action graph, comparing it to the one we already have stored. Any feeds that have not changed may pick up where a previous run left off, e.g. downloading additional chunks of files that may be missing. This nicely generalizes to other operations we may need to perform, such as transcoding files or updating their metadata.
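To make that a little more concrete, here is a very rough sketch of what a graph-driven sync loop could look like. Everything here, the types, the names, and the persistence hand-waving in the comments, is made up for illustration.

use std::path::PathBuf;

// The kinds of work a sync might need to do; illustrative names only.
enum Action {
    FetchFeed { url: String },
    DownloadEpisode { url: String, dest: PathBuf },
    TranscodeFile { src: PathBuf, dest: PathBuf },
}

struct Node {
    action: Action,
    // Indices of nodes this one depends on; an episode download
    // depends on its feed fetch, a transcode on its download.
    deps: Vec<usize>,
    done: bool,
}

// Run every node whose dependencies are already satisfied, recording `done`
// after each step (in the real thing, persisted to the database) so an
// interrupted run can pick up exactly where it left off.
fn run_ready(graph: &mut Vec<Node>) {
    loop {
        let ready: Vec<usize> = graph
            .iter()
            .enumerate()
            .filter(|(_, n)| !n.done && n.deps.iter().all(|&d| graph[d].done))
            .map(|(i, _)| i)
            .collect();
        if ready.is_empty() {
            break;
        }
        for i in ready {
            perform(&graph[i].action); // do the HTTP request, transcode, etc.
            graph[i].done = true;      // and record the progress durably here
        }
    }
}

fn perform(_action: &Action) {
    // Placeholder: the real implementation would issue requests, write files, etc.
}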