Why should we learn about OpenTSDB? Is it a good case study for learning how to design an HBase table? For me, the answer is definitely yes. Many good optimizations have already been applied in this open source project. So this post only covers how OpenTSDB designs its HBase tables; it does not focus on how to use OpenTSDB or how to set it up to monitor servers. Maybe I will write about that part in the future.
So first, let’s go over some basic concepts in OpenTSDB.
What is OpenTSDB?
It is a distributed, scalable time series database built for modern monitoring needs. It can collect, store and serve billions of data points with no loss of precision, and it can be used with Tcollector. There are two key points here: one is time series, the other is billions of data points. So the timestamp is an important part of OpenTSDB, and it has to deal with a very large number of data points. (That’s the main reason to learn from OpenTSDB’s design: we are also facing big data, and time is also a significant field in our data.)
Even though OpenTSDB is an open source project, it is used by many big companies, including Yahoo, eBay, Pinterest, and so on.
- data points: (time, value)
- metrics: proc.loadavg.cpu
- tags: hosts=haimeili, ip=127.0.0.1
- metric + tags = time series
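To make the last bullet concrete, here is a minimal Python sketch of the idea that a metric plus its tags identifies one time series. The names `series_key` and `add_point` are made up for illustration and are not part of OpenTSDB.

```python
from collections import defaultdict

def series_key(metric, tags):
    # Sort the tags so the same metric + tags always gives the same key,
    # no matter what order the tags were written in.
    return (metric, tuple(sorted(tags.items())))

# Each time series is a list of (time, value) data points.
series = defaultdict(list)

def add_point(metric, tags, time, value):
    series[series_key(metric, tags)].append((time, value))

add_point("proc.loadavg.cpu", {"hosts": "haimeili", "ip": "127.0.0.1"}, 1297574486, 0.5)
add_point("proc.loadavg.cpu", {"ip": "127.0.0.1", "hosts": "haimeili"}, 1297574487, 0.7)
add_point("proc.loadavg.cpu", {"hosts": "other", "ip": "127.0.0.1"}, 1297574486, 0.1)
```

The first two points land in the same series because their metric and tags match; the third one has a different tag value, so it starts a new series.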
OpenTSDB uses two tables to store data: one is tsdb, the other is tsdb-uid. OpenTSDB 2.0 adds two more tables, named tsdb-meta and tsdb-tree.
The tsdb-uid table maps a uid to a name and a name to a uid. There are only three kinds of qualifiers: metric, tagk and tagv. Remember that the mapping goes both ways: one direction is from uid to name, the other is from name to uid.
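The two-way mapping can be sketched in a few lines of Python. This is only an in-memory toy, not how OpenTSDB talks to HBase: the class `UidTable` and its methods are hypothetical, and the 3-byte uid width is an assumption of the sketch.

```python
class UidTable:
    """Toy stand-in for one kind of uid (metric, tagk or tagv)."""

    UID_WIDTH = 3  # bytes per uid; assumed for this sketch

    def __init__(self):
        self.name_to_uid = {}  # forward direction: name -> uid
        self.uid_to_name = {}  # reverse direction: uid -> name
        self.next_id = 1

    def get_or_create(self, name):
        uid = self.name_to_uid.get(name)
        if uid is None:
            # Assign the next counter value as a fixed-width big-endian uid
            # and record it in both directions.
            uid = self.next_id.to_bytes(self.UID_WIDTH, "big")
            self.next_id += 1
            self.name_to_uid[name] = uid
            self.uid_to_name[uid] = name
        return uid

    def name_of(self, uid):
        return self.uid_to_name[uid]


metrics = UidTable()
uid = metrics.get_or_create("proc.loadavg.cpu")
```

A long name like `proc.loadavg.cpu` is stored once, and every row after that only carries the short uid.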
tsdb is the main table, which stores the data points. Its rowkey is a concatenation of uids and time.
- Rowkey format: <metric uid><timestamp><tagk1><tagv1><tagk2><tagv2>….
- The timestamp is normalized to 1-hour boundaries
- All data points for an hour are stored in one row
- There are two qualifier formats: one is 2 bytes, the other is 4 bytes.
- The 2-byte format looks like this: <12 bits><4 bits>. The first 12 bits store the minute and second information, that is, the offset in seconds from the hour in the rowkey. The last 4 bits are flags: the first bit tells whether the value is an integer or a floating point number, and the remaining three bits give the length of the value, from 1 to 8 bytes, stored as the length minus 1. E.g. “000” means a 1-byte value, “001” means a 2-byte value, etc.
- The 4-byte format looks like this: <4 bits><22 bits><2 bits><4 bits>. The first 4 bits are “1111”, which marks the qualifier as the 4-byte format. The 22 bits hold the offset from the hour in milliseconds. The next 2 bits are reserved, and the last 4 bits are the same flags as above.
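The rowkey and qualifier layouts above can be sketched in Python as follows. This is an illustration under my own assumptions, not OpenTSDB's actual code: the function names are made up, tag pairs are sorted just so that the same series always gives the same key, the length bits are assumed to hold the value length minus 1, and the 4-byte marker is assumed to be 1111.

```python
import struct
from typing import List, Tuple

FLAG_BITS = 4     # the low 4 bits of a qualifier are flags
FLAG_FLOAT = 0x8  # first flag bit: 1 = floating point, 0 = integer
MS_MARKER = 0xF   # assumed marker in the top 4 bits of the 4-byte format

def row_key(metric_uid: bytes, timestamp: int,
            tag_uids: List[Tuple[bytes, bytes]]) -> bytes:
    base_time = timestamp - (timestamp % 3600)  # cut down to the hour
    key = metric_uid + struct.pack(">I", base_time)
    for tagk_uid, tagv_uid in sorted(tag_uids):  # sorting is an assumption
        key += tagk_uid + tagv_uid
    return key

def _flags(value_length: int, is_float: bool) -> int:
    if not 1 <= value_length <= 8:
        raise ValueError("value length must be between 1 and 8 bytes")
    return (FLAG_FLOAT if is_float else 0) | (value_length - 1)

def qualifier(timestamp: int, value_length: int, is_float: bool) -> bytes:
    # 2-byte format: <12 bits seconds offset><4 bits flags>
    offset = timestamp % 3600
    return struct.pack(">H", (offset << FLAG_BITS) | _flags(value_length, is_float))

def qualifier_ms(timestamp_ms: int, value_length: int, is_float: bool) -> bytes:
    # 4-byte format: <4 bits marker><22 bits ms offset><2 bits reserved><4 bits flags>
    offset_ms = timestamp_ms % 3_600_000
    word = (MS_MARKER << 28) | (offset_ms << 6) | _flags(value_length, is_float)
    return struct.pack(">I", word)
```

With 3-byte uids and one tag pair, a rowkey is only 3 + 4 + 3 + 3 = 13 bytes, and each data point in that hour adds just a 2-byte or 4-byte qualifier plus the value.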
Here is one example:
1297574486 = 2011-02-13 13:21:26 (UTC+8)
MWeP = 01001101 01010111 01100101 01010000 = 1297573200 = 2011-02-13 13:00:00 (only the hour is kept; the minutes and seconds are cut off and stored in the qualifier)
Pk = 01010000 01101011: the first 12 bits, 010100000110, are 1286 (1286 seconds = 21 mins 26 seconds); the last 4 bits, 1011, are the flags (floating point, 4-byte value)
1297573200 + 1286 = 1297574486
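The same arithmetic can be checked with a short Python snippet. It is just a sketch that decodes the bytes from the example with the standard `struct` module; the variable names are my own.

```python
import struct

# The 4 timestamp bytes from the rowkey, read as a big-endian unsigned int.
(base_time,) = struct.unpack(">I", b"MWeP")

# The 2 qualifier bytes from the example: 01010000 01101011.
qualifier = bytes([0b01010000, 0b01101011])
(raw,) = struct.unpack(">H", qualifier)

offset = raw >> 4                 # first 12 bits: seconds since the hour
flags = raw & 0xF                 # last 4 bits: flags
is_float = bool(flags & 0x8)      # first flag bit: integer or floating point
value_length = (flags & 0x7) + 1  # last 3 flag bits: length minus 1

timestamp = base_time + offset
print(base_time, offset, timestamp)
```

The rowkey gives the hour, the qualifier gives the seconds inside that hour, and adding them restores the original timestamp.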
When you design a table for big data, consider concatenating fields in the rowkey to save space. If you have time-based data, think about where to store the timestamp, and whether you want to store the data per second or per minute. Also, if your data is poorly formatted, too long, or is a list of values, you might map it to a uid to save space.