Recently, I’ve worked on two projects (Power Hungry, Hot or Not) which (among other things) collect write-once data over time, and graph the results. The projects collect very different data, but this task was painful enough in postgres that I ended up switching to a temporal database for the second go, and it made the data collection & querying much easier. What follows is a brief discussion of the problems I faced with postgres, and how moving to RRD solved them.
RRD (Round Robin Database) is an open-source temporal database system which stores time-series. It is used in many open source projects, such as Munin, Nagios, & Ganglia. This is intended as an overview of some of the benefits of RRD for projects such as the type I often work on (periodic data logging).
Data was being collected at a higher resolution than I needed
The Problem: For Power Hungry, I get sensor readings about every 2 seconds, but that is way more data than I need for the most detailed graph (1 day’s data). I mitigated this problem in code by aggregating the readings in memory, then writing to the database every 5 minutes.
The Solution: RRD databases have this functionality built-in. You give the database an interval, and then send it updates as often as you like. It will aggregate any updates it gets during an interval, and write out an aggregated record as needed. It records not only the average, but also the min, max, and last values, so you have a little more flexibility when querying.
No sense of missed readings
The Problem: For Power Hungry, occasionally I am not collecting data (the sensors are unplugged, there is a bug, I’m doing an update, etc…). When this happens, nothing gets written to the database until the process starts up again. A small gap may be acceptable, but a larger gap should be marked as an absence of data. Postgres has no idea that data is missing (there are no empty records), so this has to be dealt with in the application, and it’s ugly.
The Solution: RRD Databases deal with missed intervals automatically. If an interval goes by without any updates to the database, it is marked as a NaN. This works great for graphing, because it’s just a matter of dealing with that magic value (the graphing library I use handles it out of the box).
Recent data is much more interesting
The Problem: For Power Hungry, I have a graph for the past day and the past week. The data I store in postgres is at the appropriate resolution for the day graph (300 records per day), so I am able to just pull records from the database and map them into points on the graph. For the past week, though, I have about 2,000 records, which is way too many for a nice-looking (or performant) graph, so I have to aggregate groups of readings together, which is slow and wasteful of resources. Currently, I do this in memory, but it could be done similarly by periodically polling the main dataset and writing aggregated points to a sister table (at the desired lower resolution).
The Solution: RRD databases have this feature built-in. When creating a database, you tell it the various resolutions you want to store, and it automatically maintains parallel datasets. For my particular project, I want about 300 points for a graph, so I set my database up with 4 archives, at 300 points per day/week/month/year. As I send updates to the database, it aggregates and maintains the 4 parallel sets of data, as the prescribed resolutions. When I query for the records from a given span of time, it automatically gives me the data at the best resolution it can.
The Problem: For Power Hungry, (as I said above) I only care about data at the default resolution (300 records per day), and at a lesser resolution (300 points per week). I might end up wanting to add month/year/whatever, but whatever the span I’m interested, data will fall off the end. In postgres, I’d have to have some sort of background worker to delete data after it falls out of scope.
The Solution: An RRD database is a set of circular buffers (where each buffer is your data at a different, prescribed resolution). When you initially define the buffer (aka archive), you tell it how much data to keep at that resolution, and when data falls off the end of that window, it is automatically overwritten. My “day” archive, for instance, stores the past 300 records (at 5 minute intervals). My largest archive (300 records per year) actually stores 5 years (it’s capped at 1500 records).
Efficiency on disk
The Problem: For Power Hungry, because I’m adding the temporality to the data manually (by having a timestamp column), I’m storing an 8-byte timestamp on every row, which isn’t necessarily uncommon, but doesn’t leverage the periodic natural of the data.
The Solution: Because RRD databases store data at fixed, known intervals, they can much more efficiently store the time of each record (by storing a start time, and calculating the offset for each record).
RRD is only one member of the family of temporal databases, but it’s a pretty common one. There are others, including an extension for postgres. Many temporal databases (including the postgres extension) focus on a wider purpose: separating the record of “what happened at point A” from “what did the database think happened at point A, when it was at point B”. Such “bitemporal” features are unneeded for write-once data, which is what my hardware data-logging projects collect.
For more information on RRD, check out the docs.
For more discussion of this post, see the Hacker News comments.