Day 9: Channel surfing
Yesterday I ended up working longer than I probably should have on my adventure, which led to a sleepier following day than I would have liked. Because of this, I set myself what I thought was a manageable task for today, but ended up veering into a similar pattern. However! We must remember the creed of December Adventures:
The December Adventure is low key. The goal is to write a little bit of code every day in December.
And so, I stopped early today, without a fully implemented chunk of work, so that this whole project remains sustainable.
Let me walk you through what I got done, and afterwards we will gaze briefly into the complexity abyss that underlies an ideal implementation of this project.
RSS feed parsing
For my purposes so far, I am modelling RSS feeds as a list of items, each with a title and enclosure, and some shared metadata relating to the “channel”. This yields a data model like this:
struct Channel {
    title: String,
    items: Vec<Item>
}

struct Item {
    title: String,
    enclosure: Enclosure
}

struct Enclosure {
    url: Url,
    content_type: String,
    length: u64
}
My original goal for today was to translate the XML event feed we built at the end of yesterday’s session into the above data model. I have a path now to how I can do this, but the actual code isn’t quite complete.
In short, I treat this as a recursive descent parser. The idea behind these parsers is that when walking the stream of events, you typically find yourself jumping between a number of hierarchical states. Each state consumes from a shared list of tokens (or in our case Events), has its own associated working data, and produces some final result when it completes (assuming the parse succeeds). These attributes (hierarchy, unique associated data, and a final structured result) make multiple nested parsing functions an elegant implementation. Here’s what I implemented to do this:
fn parse_channel(content: &str) -> Option<Channel> {
    let mut reader = Reader::from_str(content);
    let mut buf: Vec<u8> = Vec::new();
    let mut title: Option<String> = None;
    let mut items: Vec<Item> = Vec::new();
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(tag)) if tag.name().0 == b"title" => {
                todo!("parse channel title");
            }
            Ok(Event::Start(tag)) if tag.name().0 == b"item" => {
                match parse_item(&mut reader, &mut buf) {
                    Some(item) => {
                        items.push(item);
                    },
                    None => {
                        return None;
                    }
                }
            }
            Ok(Event::Eof) => break,
            _ => ()
        }
        buf.clear(); // quick-xml recommends clearing the buffer between events
    }
    title.map(|title| Channel { title, items })
}
fn parse_item(reader: &mut Reader<&[u8]>, buf: &mut Vec<u8>) -> Option<Item> {
    let mut title: Option<String> = None;
    let mut enclosure: Option<Enclosure> = None;
    loop {
        match reader.read_event_into(buf) {
            Ok(Event::Empty(tag)) if tag.name().0 == b"enclosure" => {
                todo!("parse enclosure");
            }
            Ok(Event::Start(tag)) if tag.name().0 == b"title" => {
                todo!("parse title");
            }
            // A closing </item> tag ends this item.
            Ok(Event::End(tag)) if tag.name().0 == b"item" => break,
            // Hitting end-of-file before </item> means the feed is truncated.
            Ok(Event::Eof) => return None,
            _ => ()
        }
        buf.clear();
    }
    title.zip(enclosure).map(|(title, enclosure)| Item { title, enclosure })
}
Here we have a high-level parse_channel function, which looks for both its own title and any nested items. To parse an item, we call into parse_item. Note that within parse_item we also have a “title” field that we’re looking for, but because our state is represented by being in another function, we know that the title should be associated with an item, not with the channel as a whole.
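As a sketch of what the enclosure todo!() might grow into: quick-xml exposes a tag’s attributes as an iterator of key/value pairs, so building the Enclosure is mostly a matter of collecting the three attributes the RSS spec requires. To keep the sketch self-contained I’ve decoupled it from quick-xml (it takes plain byte-string pairs, and uses String for the URL rather than a Url type); the helper’s name and shape are my invention, not settled code.

```rust
struct Enclosure {
    url: String, // stand-in for the Url type, to keep the sketch dependency-free
    content_type: String,
    length: u64,
}

// Hypothetical helper: build an Enclosure from already-decoded attribute
// key/value pairs, shaped like what quick-xml's attribute iterator yields.
fn enclosure_from_attrs<'a>(
    attrs: impl Iterator<Item = (&'a [u8], &'a str)>,
) -> Option<Enclosure> {
    let (mut url, mut content_type, mut length) = (None, None, None);
    for (key, value) in attrs {
        match key {
            b"url" => url = Some(value.to_string()),
            b"type" => content_type = Some(value.to_string()),
            // A malformed length makes the whole enclosure invalid.
            b"length" => length = Some(value.parse::<u64>().ok()?),
            _ => (), // ignore unknown attributes
        }
    }
    // All three attributes are required; missing any yields None.
    Some(Enclosure {
        url: url?,
        content_type: content_type?,
        length: length?,
    })
}
```

Inside parse_item, the Event::Empty arm could feed something like this from the tag’s attributes after unescaping their values.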
The code above is incomplete, of course. The todo!() calls make that clear, but a less obvious shortcoming is that the parsing functions return an Option rather than a Result. Results would be more appropriate here, because we’d like to be able to generate an error message explaining why a feed is invalid, not just that it is.
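To make that concrete, here is a minimal sketch of what such an error type might look like (the variant names are my invention, not settled design):

```rust
use std::fmt;

// Each variant records one way a feed can fail to parse.
#[derive(Debug, PartialEq)]
enum FeedError {
    MissingChannelTitle,
    MissingItemTitle,
    MissingEnclosure,
    Xml(String), // wraps the underlying XML error's message
}

impl fmt::Display for FeedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FeedError::MissingChannelTitle => write!(f, "channel has no <title>"),
            FeedError::MissingItemTitle => write!(f, "item has no <title>"),
            FeedError::MissingEnclosure => write!(f, "item has no <enclosure>"),
            FeedError::Xml(msg) => write!(f, "malformed XML: {msg}"),
        }
    }
}
```

The parsers would then return Result<Channel, FeedError>, and the final title.map(...) would become something like title.ok_or(FeedError::MissingChannelTitle).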
The complexity hiding under our bed: networking
It can be tempting to think of HTTP requests as just a special kind of function call. Sure, maybe they are really slow, and you have to turn all the arguments into strings first, but that’s mostly it, right? This is an idea that can go unchallenged for a long time if you primarily develop network services that run within data centres, where perfect high-speed links are the norm and a reverse proxy like nginx barricades you from the lurking horrors of the broader internet.
This, of course, is just a comforting fantasy. Network connections go down, drop packets, and may pause indefinitely. HTTP requests can time out and need to be retried, and on slow networks may need to be performed in parts rather than all at once. My favourite article on this is Engineering for Slow Internet, written by an IT worker stationed at a research station in Antarctica. But this is a much broader need: Dan Luu writes in How web bloat impacts users with slow connections about similar experiences using dialup internet in the rural US, as do Rek and Devine of Hundred Rabbits about downloading XCode at a cafe in French Polynesia. A proper consumer network service is therefore an interruptible asynchronous job system. Heavy use of timeouts, retries, and range requests is essential.
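None of this download machinery exists in Séance yet, but the individual building blocks are small. As a sketch (the function names and policy are mine, not settled design): resuming a partial download means sending a Range header for the bytes you don’t yet have, and retrying means waiting progressively longer between attempts rather than hammering a flaky link.

```rust
use std::time::Duration;

// Header value asking the server for everything from `already_have` onward,
// using the "bytes" range unit from RFC 9110 (e.g. "bytes=1024-").
fn resume_range_header(already_have: u64) -> String {
    format!("bytes={already_have}-")
}

// Capped exponential backoff between retries: 1s, 2s, 4s, ... up to 60s.
fn retry_delay(attempt: u32) -> Duration {
    let secs = 1u64 << attempt.min(6); // 2^attempt, exponent capped to avoid overflow
    Duration::from_secs(secs.min(60))
}
```

A download loop would then pair these with whatever HTTP client the project ends up using: on failure, sleep for retry_delay(n) and re-request with resume_range_header(bytes_on_disk), noting that the server must advertise range support for the resume to work.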
I would like Séance to support this before I consider it “done”, though I worry that tackling it all at once would make the task feel hopeless. Still, this could have profound consequences for the ultimate architecture of the system, and so tomorrow I plan to sketch out a high-level design for it.