Day 12: Simple plan

I’ve been reflecting on the design that I discussed in the previous entry, and I’ve come to the conclusion that the general build-system model isn’t the right fit for this project. The best designs account for the small details, and in this case, they don’t match up.

To review, I want Séance to be resilient to network failures because I see it as a form of accessibility, and accessibility is non-negotiable. We should be able to interrupt the program or have arbitrary operations fail, and have it continue to make forward progress as efficiently as possible. Similarly to a build system, we have inputs, outputs, and potentially some partially done work in the middle. Modelling the system abstractly in this way works well when the dependency rule set is a dynamic input itself, and build rules can form complex, multi-input and multi-output DAGs (directed acyclic graphs). However, some pen-and-paper planning made it obvious this isn’t the case with podcast syncing, at least not as I currently imagine it.

While the podcast synchronization “build graph” can be represented as a DAG, it is more specifically a fixed-depth tree. The process always follows a set structure (root > podcast > episode-task*), with no branches ever joining together. We first build up a set of episodes that we need to download, and then for each episode we run through a fixed, linear sequence of steps until it is complete (e.g. download, add metadata, copy to output). Because we re-fetch the podcast feeds on every sync (from our perspective, at least; the HTTP client handles caching), the only place where we may have previous work to reuse is in the per-episode sequence of steps.
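To make that concrete, here is roughly how I picture the tree if it were written down as data (all of these names are placeholders, not real Séance types):

// A sketch of the fixed-depth "build graph": the root fans out to podcasts,
// each podcast fans out to episodes, and each episode runs the same linear
// sequence of stages. Illustrative names only.
struct SyncRoot {
    podcasts: Vec<PodcastSync>,
}

struct PodcastSync {
    feed_url: String,
    episodes: Vec<EpisodeSync>,
}

struct EpisodeSync {
    guid: String,
}

// The per-episode pipeline is fixed; branches never join back together.
enum Stage {
    Download,
    AddMetadata,
    CopyToOutput,
}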

Every stage within the processing of an episode depends only on the metadata synchronized from the feed and potentially the output of the previous stage. This means the only real requirement is that the output path of each stage is deterministic. With that in place, we can use a simpler strategy: each time a synchronization starts, it runs through the same motions as if from scratch, but each stage first checks whether its output file already exists. If so, we skip the work and move on.
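As a sketch, that check barely needs any machinery at all; something like this hypothetical helper would cover it:

use std::path::Path;

/// Run a stage only if its output file is missing. This is the whole
/// "incremental" strategy: deterministic output paths plus an existence check.
/// (Hypothetical helper, not part of Séance yet.)
fn run_stage_if_needed(
    output: &Path,
    stage: impl FnOnce() -> std::io::Result<()>,
) -> std::io::Result<()> {
    if output.exists() {
        // A previous run already produced this stage's output; skip the work.
        return Ok(());
    }
    stage()
}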

This approach is also more flexible in convenient ways. The most likely interrupted work we will run into is the program exiting in the middle of a file download. If we stream downloads to disk, this leaves the download stage’s output partially created. A build system that checks only for existence would either incorrectly reuse this data (if we used the output path as the working path) or disregard it entirely (because the output path does not exist). If instead each step checks for its own output’s existence, it can notice that the file partially exists and do only the remaining work (in this case, issuing an HTTP range request for the remaining data). The presence of MP3 metadata could similarly be checked in place.
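For the download stage, I imagine the resume path looking roughly like this sketch (assuming reqwest-style APIs on the middleware client, and glossing over checking that the server actually answered 206 Partial Content):

// Sketch of resuming a partial download with an HTTP range request.
// Real code would verify the response status and stream to disk instead
// of buffering the remainder in memory.
use std::io::Write;

use miette::IntoDiagnostic;

async fn resume_download(
    client: &reqwest_middleware::ClientWithMiddleware,
    url: &str,
    path: &std::path::Path,
) -> miette::Result<()> {
    // How many bytes do we already have on disk?
    let existing = std::fs::metadata(path).map(|m| m.len()).unwrap_or(0);

    let response = client
        .get(url)
        // Ask only for the bytes we are missing.
        .header(reqwest::header::RANGE, format!("bytes={existing}-"))
        .send()
        .await
        .into_diagnostic()?;

    let rest = response.bytes().await.into_diagnostic()?;

    // Append the remaining bytes to the partial file.
    let mut file = std::fs::OpenOptions::new()
        .create(true)
        .append(true)
        .open(path)
        .into_diagnostic()?;
    file.write_all(&rest).into_diagnostic()?;

    Ok(())
}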

To ensure that we always generate the same working path for each episode, I’m currently using the <guid> element from the RSS feed. Unfortunately, it isn’t used universally, so I will eventually need a more sophisticated fingerprint, maybe falling back to the <link> element or the enclosure URL.
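Something along these lines is what I have in mind; hashing also conveniently solves the directory-traversal worry in the code below, since the resulting name is always plain hex (sha2 here is just a stand-in for any hash that is stable across runs):

// Sketch of a stable, filesystem-safe fingerprint for an episode.
// Prefers <guid>, then <link>, then the enclosure URL. Uses the sha2
// crate as a stand-in; any hash that is stable across runs would work.
use sha2::{Digest, Sha256};

fn episode_fingerprint(guid: Option<&str>, link: Option<&str>, enclosure_url: &str) -> String {
    let source = guid.or(link).unwrap_or(enclosure_url);
    let digest = Sha256::digest(source.as_bytes());
    // Hex-encode so the result is safe to use as a file name.
    digest.iter().map(|b| format!("{b:02x}")).collect()
}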

I’ve sketched out an early version of this today:

async fn sync_item(client: Arc<ClientWithMiddleware>, item: &Item) -> miette::Result<()> {
    let project_dirs = dirs::project_dirs()?;
    let data_dir = project_dirs.data_local_dir();
    std::fs::create_dir_all(data_dir).into_diagnostic()?;

    // XXX: UNSAFE! We should hash the guid or similar to prevent directory traversal.
    let download_name = format!("{}.download", item.guid.as_ref().expect("guids required for now"));
    let download_path = data_dir.join(download_name);

    if !std::fs::exists(&download_path).into_diagnostic()? {
        let body_bytes = client.get(item.enclosure.url.as_str())
                .send()
                .await
                .into_diagnostic()?
                // TODO: Stream to disk instead of buffering here.
                .bytes()
                .await
                .into_diagnostic()?;

        // Write on a blocking thread. The first ? handles the join error,
        // the second handles the actual I/O error from std::fs::write.
        let download_path2 = download_path.clone();
        task::spawn_blocking(move || {
            std::fs::write(download_path2, body_bytes)
        }).await.into_diagnostic()?.into_diagnostic()?;
    } else {
        // TODO: Ensure full thing is downloaded, otherwise do range download.
        // For now, assume it is complete.
    }

    let Some(user_dirs) = UserDirs::new() else {
        bail!("can't figure out where to store podcasts");
    };

    let Some(audio_dir) = user_dirs.audio_dir() else {
        bail!("can't find audio dir");
    };

    let podcast_dir = audio_dir.join("Podcasts");
    std::fs::create_dir_all(&podcast_dir).into_diagnostic()?;

    // XXX: Also check title safety here for directory traversal.
    let output_path = podcast_dir.join(format!("{}.mp3", &item.title));

    // TODO: Set mp3 metadata

    // Copy the finished file into the music library, again on a blocking thread.
    task::spawn_blocking(move || {
        std::fs::copy(download_path, output_path)
    }).await.into_diagnostic()?.into_diagnostic()?;

    Ok(())
}

When running the above, the program pauses for a while before exiting, having created the target MP3 file in my “Music/Podcasts” directory. The code is a mess, of course, but I think it proves out the general design. From here, I can polish the finer details, most importantly by removing the direct use of I/O in favour of a more testable abstraction. One possible approach would be to continue using the plan-execute pattern: each episode synchronization checks which outputs already exist while forming a plan, and a later system executes the difference. This breaks the logic into two cleanly separated pieces, as previously discussed.
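Roughly, I imagine that split looking something like the following (all names hypothetical, and the real planner would need more than an existence check, per the range-request discussion above):

// Sketch of the plan-execute split: planning inspects the filesystem and
// produces a list of actions; a separate executor performs them. The planner
// can then be tested against a fake Filesystem with no real I/O.
use std::path::{Path, PathBuf};

enum Action {
    Download { url: String, to: PathBuf },
    WriteMetadata { file: PathBuf },
    CopyToLibrary { from: PathBuf, to: PathBuf },
}

trait Filesystem {
    fn exists(&self, path: &Path) -> bool;
}

fn plan_episode(fs: &dyn Filesystem, url: &str, work: PathBuf, out: PathBuf) -> Vec<Action> {
    let mut actions = Vec::new();
    if !fs.exists(&work) {
        actions.push(Action::Download { url: url.to_owned(), to: work.clone() });
    }
    // Metadata is written into the working file in place.
    actions.push(Action::WriteMetadata { file: work.clone() });
    if !fs.exists(&out) {
        actions.push(Action::CopyToLibrary { from: work, to: out });
    }
    actions
}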