Day 10: The importance of jobs for networking
Yesterday I implemented about half of a recursive descent parser for RSS feeds. Today, again in a bit of a shorter session than usual, I implemented the other half.
Well, I say I implemented a parser for RSS feeds, but it might be more correct to say
I implemented a parser for a single RSS feed. Throughout development, I have been using
a single RSS feed for all tests: that for the excellent
Never Post! podcast. Currently my
config.kdl contains this:
feed "https://feed.neverpo.st/"
After today’s work, I can run seance sync, and it will print out the following:
Never Post
Mailbag #11: Someone To Tell Us How To Be
Don’t Do It Yourself
The Year I Learned to Pay Attention
When We Fight We Win
Practical Magic: A Witch’s Guide to Etsy Witches
A.I. and New American Fascism
...
That is, the <title> of the feed, followed by the <title> of each <item> within it,
in the order in which they were authored.
I also parse the URLs, but have omitted them above for brevity.
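For a sense of the shapes involved, here is a minimal sketch of the kinds of types such a parser might produce and how sync could print them. The Feed and Item structs and the listing below are hypothetical stand-ins for illustration, not the actual code from the previous post.

```rust
// Hypothetical output types for the parser; not Séance's real definitions.
struct Item {
    title: String,
    url: String, // parsed but not printed, matching the listing above
}

struct Feed {
    title: String,
    items: Vec<Item>,
}

// The `sync` output above is then roughly:
fn print_feed(feed: &Feed) {
    println!("{}", feed.title);
    for item in &feed.items {
        println!("{}", item.title);
    }
}

fn main() {
    let feed = Feed {
        title: "Never Post".into(),
        items: vec![Item {
            title: "Mailbag #11: Someone To Tell Us How To Be".into(),
            url: "https://example.com/episode.mp3".into(), // placeholder URL
        }],
    };
    print_feed(&feed);
}
```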
The code powering this is not particularly interesting, just a straightforward
continuation of what I shared previously,
but with the todo!s replaced with actual implementations. What remains as an open question, though,
is how I ultimately want to structure Séance’s network architecture. Not necessarily so that I can
build it immediately, but so that I know where I am headed, and can therefore make decisions in roughly
the correct direction.
Job applications
Previously I claimed that the ideal for a consumer network service is to be an interruptible asynchronous job system. What I meant by this is that it is important that the system can be interrupted at any point, or have any external operation fail, and still be able to make forward progress over the long term. If your laptop goes offline, the system notices and pauses work until the network is available again. Importantly, it does not lose whatever data it has already downloaded. The same applies if you stop the program midway through synchronizing.
Rather than simply doing a fixed thing and giving up on failure, the system should have a model of both the current state and the desired state, and work to turn one into the other. Because this state needs to persist between executions of the program, it cannot be stored only in memory, and instead must live in some sort of external database. Previously I selected SQLite as my embedded database, and I think it would also serve well here.
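As a rough illustration of what that persisted state might look like in SQLite, here is a minimal sketch using the rusqlite crate. The table name and columns are my own invention for the example, not Séance's actual schema.

```rust
use rusqlite::{params, Connection, Result};

fn open_state_db(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    // Hypothetical schema: one row per pending or completed download.
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS downloads (
            url            TEXT PRIMARY KEY,
            etag           TEXT,
            bytes_received INTEGER NOT NULL DEFAULT 0,
            total_bytes    INTEGER,
            completed      INTEGER NOT NULL DEFAULT 0
        );",
    )?;
    Ok(conn)
}

// Record progress after each chunk, so an interrupted run can resume
// from wherever it left off instead of starting over.
fn record_progress(conn: &Connection, url: &str, bytes_received: i64) -> Result<()> {
    conn.execute(
        "UPDATE downloads SET bytes_received = ?1 WHERE url = ?2",
        params![bytes_received, url],
    )?;
    Ok(())
}
```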
Another intriguing option, however, is Flawless,
an “execution engine for durable computation”, which allows you to write Rust code and have each fallible
or external operation recorded in a persisted log. This seems in some ways like the perfect choice, but
I think it falters slightly when we look at the details. It is important that we notice when the content
of a URL has changed, which may happen out of band between runs. For example, we may have loaded two out
of five chunks of a podcast using a
Range request when the underlying
file served at that URL changes (signaled, we hope, by a
412 Precondition Failed response).
We would therefore need to discard our previous chunks and start again. It isn’t clear to me how or if
Flawless can handle this scenario automatically. This issue is compounded by an unfortunate lack of API docs.
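To make the scenario concrete, here is a rough sketch of how a resumable, validated range request might look with the reqwest crate: an ETag remembered from a previous run is sent in an If-Match header so the server can tell us when the file has changed. The function and its error handling are illustrative, not Séance's actual download code.

```rust
use reqwest::blocking::Client;
use reqwest::header::{IF_MATCH, RANGE};
use reqwest::StatusCode;

/// Outcome of trying to resume a partially-downloaded file.
enum ChunkResult {
    /// The next chunk, to be appended to what we already have.
    Chunk(Vec<u8>),
    /// The file changed out from under us; previous chunks must be discarded.
    Restart,
}

fn fetch_next_chunk(
    client: &Client,
    url: &str,
    etag: &str,
    bytes_so_far: u64,
    chunk_size: u64,
) -> reqwest::Result<ChunkResult> {
    let end = bytes_so_far + chunk_size - 1;
    let response = client
        .get(url)
        .header(RANGE, format!("bytes={bytes_so_far}-{end}"))
        // If the stored ETag no longer matches, the server should answer
        // with 412 Precondition Failed instead of partial content.
        .header(IF_MATCH, etag)
        .send()?;

    if response.status() == StatusCode::PRECONDITION_FAILED {
        return Ok(ChunkResult::Restart);
    }

    Ok(ChunkResult::Chunk(response.bytes()?.to_vec()))
}
```

On a Restart, the partial data recorded in the state database would be dropped and the download begun again against the new version of the file.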
Though I’m not able to find the original source at the moment, there is a recurring meme in programming communities that all programs are fundamentally either compilers or databases. Séance’s network system should be a compiler, I think, or perhaps more accurately a build system.
Build systems like Make, Ninja, or redo (for the connoisseurs) operate on a directed acyclic graph which maps inputs to outputs. In the same way, we have a set of inputs (podcast feeds) and a set of desired outputs (the downloaded podcasts), and take a number of actions (HTTP requests, etc.) to get from one to the other. The state of this graph should be stored in a database, updated each time we perform one of the pending steps.
When Séance starts syncing, we will check the inputs (the feeds) and use them to derive the action graph, comparing it to the one we already have stored. Any feeds that have not changed may pick up where a previous run left off, e.g. downloading additional chunks of files that are still missing. This nicely generalizes to other operations we may need to perform, such as transcoding files or updating their metadata.
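For a feel of what that derived action graph might look like, here is a small sketch. The action kinds and the way dependencies are stored are my own guesses at a plausible shape, not a design Séance has committed to.

```rust
use std::collections::HashMap;

/// The kinds of work a sync might need to perform.
/// These variants are illustrative, not an exhaustive or final list.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Action {
    FetchFeed { url: String },
    DownloadChunk { url: String, index: u32 },
    WriteMetadata { path: String },
}

/// A directed acyclic graph of actions: each action lists the actions
/// that must complete before it can run.
#[derive(Default)]
struct ActionGraph {
    dependencies: HashMap<Action, Vec<Action>>,
    completed: Vec<Action>,
}

impl ActionGraph {
    /// Actions whose dependencies are all satisfied and which have not
    /// themselves been completed yet.
    fn ready(&self) -> Vec<&Action> {
        self.dependencies
            .iter()
            .filter(|(action, deps)| {
                !self.completed.contains(action)
                    && deps.iter().all(|d| self.completed.contains(d))
            })
            .map(|(action, _)| action)
            .collect()
    }
}
```

On each sync, a graph derived from the current feeds would be diffed against the one persisted in the database, and only the actions that are still missing or have been invalidated would be run.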