Taming Nebula and YouTube subscriptions with babashka and org-mode

10 minutes

What is the problem?

A while ago, some of my favourite YouTube creators joined forces in a new venture called Nebula. It is a paywalled video-sharing website with no ads and a limited roster of content creators.

For a while I've been looking for a way to support their work, and Nebula seems like a nice way to do so without watching ads.

The user experience with Nebula is not entirely bad, but it's just bad enough to be occasionally infuriating. It seems to be optimized for discovery of the available content creators rather than "Just show me the new stuff from people I follow", which is what I'm interested in.

The functionality, similar to YouTube's subscription box, certainly exists under the My Shows tab, but it lacks a couple of quality-of-life features such as grouping videos by date of upload, or marking already watched videos.

The subscriptions-like page also takes annoyingly long to load which is the main reason I hate using it.

Can we solve it with software?

In my daily workflow I use an inbox.org file into which I add ideas and new tasks as they pop up. Whenever I have a bit of downtime or I'm between tasks, I first check if there is something in inbox.org and sort it out before picking the next thing to work on. (This is all according to the GTD strategy.)

My idea is to have something similar with Nebula subscriptions: a cronjob running in the background, periodically scraping the Nebula site for videos I haven't watched yet and feeding them into a subscriptions_inbox.org file, which I would go through after work to look for stuff to watch. It doesn't matter if loading the data from Nebula takes a long time; it would be the robot's time being wasted, not mine. I would only have to open my org file to see what's new, which is basically instantaneous.

Although Nebula doesn't have a documented backend API, a quick inspection with devtools reveals that it uses Zype to organize and deliver the content.

Zype has a publicly documented API, and the necessary access key and user_id can be recovered using devtools.

The solution needs to do three things:

  1. pull data from Zype and create a list of recently uploaded videos
  2. filter out videos which I've already watched, or decided not to watch
  3. append records to the inbox file

Babashka

Babashka is a build of Clojure on GraalVM, maintained by the singular Michiel Borkent. With its fast startup and a bundle of included libraries, it is optimized for small-to-medium scripts for which one would usually reach for a shell scripting language.

I envisioned my little subscriptions utility to be something of that nature, so I reached for babashka.

Getting the data

We'd like to have a function that looks like this

(defn get-recent-videos [config]
    ...)

From our previous adventures with the devtools, it's obvious that we'll need more than one request to do it.

  1. First we need to get a list of creators we're following
  2. Then we'll need to load data about their channels, from which we can extract playlist_ids
  3. Finally we'll load pages of videos from those playlists until we reach records old enough that we can be reasonably sure we've caught them on a previous pass of the script (let's say three days)

With babashka's built-in curl wrapper, the HTTP requests are fairly straightforward:

(require '[babashka.curl :as curl]
         '[cheshire.core :as json])


(defn get-following [{:keys [zype-api-host api-key user-id]}]
  (let [response (curl/get (str zype-api-host "/zobjects")
                           {:query-params {"zobject_type" "following"
                                           "user" user-id
                                           "per_page" "100"
                                           "api_key" api-key}})]
    (-> response :body (json/parse-string true) :response)))


(defn get-channel [channel-id {:keys [zype-api-host api-key]}]
  (let [response (curl/get (str zype-api-host "/zobjects")
                           {:query-params {"zobject_type" "channel"
                                           "id" channel-id
                                           "api_key" api-key}})]
    (-> response :body (json/parse-string true) :response first)))


(defn get-videos [playlist-id {:keys [zype-api-host api-key]}]
  (let [response (curl/get (str zype-api-host "/videos")
                           {:query-params {"playlist_id.inclusive" playlist-id
                                           "sort" "published_at"
                                           "order" "desc"
                                           "api_key" api-key
                                           "per_page" "100"}})]
    (-> response :body (json/parse-string true) :response)))

Nebula's catalog is not all that massive, and if there is a limit on API requests, I've seen no indication that I'm close to reaching it. So I've chosen to pull all videos for each channel and only filter out the non-recent ones at the very end.
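If a channel ever outgrew a single 100-item page, one could page through the endpoint instead. Here's a hypothetical, fetch-agnostic pagination helper; it assumes the API accepts a page number, which I haven't verified against Zype's docs:

```clojure
(defn paginate
  "Lazily call (fetch-page n) for n = 1, 2, ... and concatenate the
  results, stopping at the first empty page."
  [fetch-page]
  (->> (iterate inc 1)
       (map fetch-page)
       (take-while seq)
       (apply concat)))

;; With a stubbed fetch-page standing in for a real HTTP call:
(paginate (fn [n] (get {1 [:a :b] 2 [:c]} n [])))
;; => (:a :b :c)
```

Because the helper takes the fetching function as an argument, it works unchanged whether the pages come from Zype, YouTube, or a test stub.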

Clojure is great at data manipulation, so putting all our responses together is a breeze:

(defn recent?
  ([datetime] (recent? datetime 24))
  ([datetime hours]
   (.isAfter datetime
             (.minusHours (java.time.ZonedDateTime/now) hours))))


(defn get-recent-videos [{:keys [recent-video-interval-hours] :as config}]
  (let [following (get-following config)
        channels (->> following
                      (mapv :channel)
                      (map #(get-channel % config)))
        videos (mapcat
                (fn [{playlist-id :playlist_id, title :title}]
                  (->> (get-videos playlist-id config)
                       (map #(assoc % :channel-title title))))
                channels)]
    (->> videos
         (map (fn [{:keys [title published_at friendly_title channel-title]}]
                {:published-at (java.time.ZonedDateTime/parse published_at)
                 :title title
                 :link (str (:video-link-prefix config) friendly_title)
                 :id friendly_title
                 :creator channel-title}))
         (filter #(recent? (:published-at %) recent-video-interval-hours)))))

There is a fair bit of stuff going on here:

  • Java interop is used to turn datetime data from ISO-formatted strings into datetime objects.
  • A helper function recent? encompasses our datetime comparison, implemented using the Java interop mentioned previously.
  • Channel titles are assoced into video records - this is because in the final output I want to see both the title of the video and the name of its creator.
  • The video data is reduced to a small map, and a link to the video is crafted.
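To make the datetime handling concrete, here it is in isolation (recent? is repeated from above so the snippet is self-contained):

```clojure
(defn recent?
  ([datetime] (recent? datetime 24))
  ([datetime hours]
   (.isAfter datetime
             (.minusHours (java.time.ZonedDateTime/now) hours))))

;; Parse an ISO-formatted string into a ZonedDateTime via Java interop:
(java.time.ZonedDateTime/parse "2020-11-20T10:15:30Z")

(recent? (java.time.ZonedDateTime/now) 1)                    ;; => true
(recent? (.minusHours (java.time.ZonedDateTime/now) 48) 24)  ;; => false
```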

config is a map of all constants used in various stages of our program. This is what it looks like:

{:zype-api-host "https://api.zype.com"
 :user-id "<your user id>"
 :api-key "<secret>"
 :video-link-prefix "https://watchnebula.com/videos/"
 :recent-video-interval-hours 72}

Keeping track of already watched videos

It is desirable that the records loaded by our periodic script partially overlap records from previous runs, to reduce the chance of missing something in case the script fails, the API is down, and so on.

On the other hand, we don't want to be constantly adding duplicate lines to our final inbox file, and we don't want records to reappear once we've purposely removed them.

For these reasons we'll keep a "database" of video ids which we've already visited, and check them before adding new rows to the inbox file.

The data throughput is tiny, so a simple edn file will be sufficient for our database.

This is what it could look like:

;; database.edn
{:visited-ids #{}}

And a procedure for registering newly visited videos:

(require '[clojure.pprint :refer [pprint]]
         '[clojure.set :as set]
         '[clojure.edn :as edn])

(defn add-visited-ids! [filename video-data]
  (let [video-ids (set (map :id video-data))
        file-content (edn/read-string (slurp filename))]
    (spit
     filename
     (with-out-str
       (pprint (update file-content :visited-ids set/union video-ids))))))

It is not the most efficient thing ever, but it will be just fine for the low volumes we expect.

We'll take advantage of set arithmetic to avoid duplicates. Pretty formatting using pprint won't hurt us and will make manual inspection of the database more pleasant if it's ever necessary.

I'm a big fan of clojure's with-out-str, which captures the stdout output of the forms in its body and returns it as a string.
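Both tricks in isolation:

```clojure
(require '[clojure.set :as set]
         '[clojure.pprint :refer [pprint]])

;; set/union keeps ids unique across runs:
(set/union #{"a" "b"} #{"b" "c"})
;; => #{"a" "b" "c"}

;; with-out-str captures pprint's stdout as a string:
(with-out-str (pprint {:visited-ids #{"a"}}))
;; => "{:visited-ids #{\"a\"}}\n"
```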

Building the inbox file

Adding to an org inbox file is even easier than keeping track of visited videos: we just have to format the data and append it to the file:

(defn video-data->notification-str [{:keys [creator title link]}]
  (format "* %s: =%s=\n%s\n\n" creator title link))


(defn append-notifications! [filename video-data]
  (spit filename
        (apply str (map video-data->notification-str video-data))
        :append true))

video-data->notification-str creates a string like this

* Real Science: =Why Horseshoe Crab Blood Is So Valuable=
https://watchnebula.com/videos/real-science-why-horseshoe-crab-blood-is-so-valuable

The relevant info is all there along with a link straight to the video.

If we wanted to include some other info as well, we'd just have to extract it from the raw data in get-recent-videos, pass it along in video-data and consume it in video-data->notification-str.
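For example, a hypothetical variant that also prints the publish date (the -v2 name and the extra line are my own illustration, not from the repo):

```clojure
;; Hypothetical variant: assumes :published-at was passed along in video-data.
(defn video-data->notification-str-v2
  [{:keys [creator title link published-at]}]
  (format "* %s: =%s=\n%s\nPublished: %s\n\n" creator title link published-at))

(video-data->notification-str-v2
 {:creator "Real Science"
  :title "Why Horseshoe Crab Blood Is So Valuable"
  :link "https://watchnebula.com/videos/real-science-why-horseshoe-crab-blood-is-so-valuable"
  :published-at "2020-11-20"})
```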

Putting it all together

To recap, we want to:

  1. pull the video data
  2. check which videos we've already visited
  3. generate notifications for unvisited videos
  4. update database of visited videos

Here's the code:

(let [recent-videos (get-recent-videos config)
      already-visited-ids (-> db-filename slurp edn/read-string (get :visited-ids #{}))
      recent-ids (set (map :id recent-videos))
      to-notify (set/difference recent-ids already-visited-ids)
      notify-data (->> recent-videos
                       (filter #(to-notify (:id %)))
                       (sort-by :published-at))]
  (append-notifications! inbox-filename notify-data)
  (add-visited-ids! db-filename notify-data))

That's it! Just wrap it in a cronjob and let it run.
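For instance, a crontab entry running the script every three hours could look like this (paths and filenames are illustrative, not from the repo):

```shell
# m h dom mon dow  command
0 */3 * * * bb /home/me/scripts/subscriptions.clj >> /tmp/subscriptions.log 2>&1
```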

Also include YouTube subscriptions, how hard can it be?

Now that I've changed the way I interact with Nebula subscriptions, I figured I'd do the same for YouTube subscriptions and avoid manual checking of that website as well. I'd have one universal subscription box directly in emacs.

Turns out all we have to do to integrate YouTube into our existing tool is to implement another get-recent-videos method. The approach is very similar, so I won't go into details here; if you're interested, the code is available in this repo.
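That said, here is a rough sketch of what the YouTube side could look like, using the YouTube Data API v3 search endpoint. The repo's actual implementation may differ; a complete version would also enumerate your subscriptions (e.g. via the /subscriptions endpoint) rather than take a single channel-id, and would apply the recent? filter as before:

```clojure
(require '[babashka.curl :as curl]
         '[cheshire.core :as json])

(defn item->video-data
  "Turn one search-result item into our small video map."
  [video-link-prefix {{video-id :videoId} :id
                      {:keys [title publishedAt channelTitle]} :snippet}]
  {:published-at (java.time.ZonedDateTime/parse publishedAt)
   :title title
   :link (str video-link-prefix video-id)
   :id video-id
   :creator channelTitle})

;; Sketch: latest uploads for one channel, newest first.
;; Recency filtering is omitted here for brevity.
(defn get-recent-videos [{:keys [api-url api-key channel-id video-link-prefix]}]
  (let [response (curl/get (str api-url "/search")
                           {:query-params {"part" "snippet"
                                           "channelId" channel-id
                                           "order" "date"
                                           "type" "video"
                                           "maxResults" "50"
                                           "key" api-key}})]
    (->> (-> response :body (json/parse-string true) :items)
         (map #(item->video-data video-link-prefix %)))))
```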

We make the following changes in our main file:

  1. add a call to our youtube-based get-recent-videos method
  2. include the service name (youtube/nebula) in the visited video ids (this is practically unnecessary because the ids from both services are formed completely differently, so there's no chance of conflict, but it doesn't hurt)

The composite id helper and the updated main flow look like this:
(defn video-data->id [video-data]
  [(:service video-data) (:id video-data)])


(defn -main [config-filename]
  (let [{{inbox-filename :inbox-file, db-filename :database-file} :main, youtube-config :youtube, nebula-config :nebula}
        (-> config-filename slurp edn/read-string)

        recent-videos (concat
                       (->> (youtube/get-recent-videos youtube-config)
                            (map #(assoc % :service :youtube)))
                       (->> (nebula/get-recent-videos nebula-config)
                            (map #(assoc % :service :nebula))))
        already-visited-ids (-> db-filename slurp edn/read-string (get :visited-ids #{}))
        recent-ids (set (map video-data->id recent-videos))
        to-notify (set/difference recent-ids already-visited-ids)
        notify-data (->> recent-videos
                         (filter #(to-notify (video-data->id %)))
                         (sort-by :published-at))]
    (append-notifications! inbox-filename notify-data)
    (add-visited-ids! db-filename notify-data)))

Notice that I wrapped the whole thing in a -main function which receives a path to our config file as a parameter. I've also extended the config to include separate sections for each component of our tool.

{:main
 {:inbox-file "path/to/subscriptions_inbox.org"
  :database-file "path/to/database.edn"}

 :youtube
 {:api-url "https://www.googleapis.com/youtube/v3"
  :video-link-prefix "https://youtube.com/watch?v="
  :api-key "<secret>"
  :channel-id "<your channel id>"
  :recent-video-interval-hours 72}

 :nebula
 {:zype-api-host "https://api.zype.com"
  :user-id "<your user id>"
  :api-key "<secret>"
  :video-link-prefix "https://watchnebula.com/videos/"
  :recent-video-interval-hours 72}}

What did we learn?

Clojure excels at data manipulation. The focus on core data structures and the simple interface of our data source allowed us to add another service to our solution with minimal changes to existing code.

The core library contains several practical tools, such as spit/slurp, with-out-str, and pprint, which alleviate the hassle associated with manipulating text files on disk.

Beyond the already wealthy Clojure core libraries, there is the vast landscape of Java functionality available through interop.

Babashka plays into Clojure's strengths, extends them, and lets us employ them in our dirty little scripts. Babashka's fast startup time didn't get the opportunity to shine in this project because the runtime is dominated by HTTP requests.

To improve performance, the first step would be to replace babashka.curl with a library that allows async HTTP requests. As mentioned at the start, however, performance isn't really relevant here: the script runs from a cronjob every couple of hours, so a few seconds here or there won't matter.

The code discussed in this post can be found here.