Category Archives: Database

MySQL(3)-Joins

  • left join = A
    select <select_list> from tableA A left join tableB B on A.key=B.key
  • inner join = (common part between A and B)
    select <select_list> from tableA A inner join tableB B on A.key=B.key
  • right join = B
    select <select_list> from tableA A right join tableB B on A.key=B.key
  • A – (common part between A and B)
    select <select_list> from tableA A left join tableB B on A.key=B.key 
    where B.key is NULL
  • B – (common part between A and B)
    select <select_list> from tableA A right join tableB B on A.key=B.key
    where A.key is NULL
  • A + B
    select <select_list> from tableA A full outer join tableB B on 
    A.key=B.key
  • A + B – (common part between A and B)
    select <select_list> from tableA A full outer join tableB B on 
    A.key=B.key where A.key is NULL or B.key is NULL
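The seven result sets above can be sketched with two tiny key sets in Python. This is a hedged illustration of which rows each join keeps (keys only, not real SQL):

```python
# Two toy "tables", keyed on the join column only.
A = {1: "a1", 2: "a2", 3: "a3"}
B = {2: "b2", 3: "b3", 4: "b4"}

left_join  = set(A)                          # every A row, matched or not
inner_join = {k for k in A if k in B}        # rows present in both A and B
right_join = set(B)                          # every B row
a_only     = {k for k in A if k not in B}    # left join ... where B.key is NULL
b_only     = {k for k in B if k not in A}    # right join ... where A.key is NULL
full_outer = set(A) | set(B)                 # A + B
symmetric  = a_only | b_only                 # full outer ... where either key is NULL
```

Running this with the sets above gives inner_join = {2, 3}, a_only = {1}, b_only = {4}, matching the Venn-diagram reading of each query.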

Slick (3) – connection pool

Before going into the details of connection pooling in Slick, we need to be clear about two things. First, Slick does not provide a connection pool implementation of its own. Second, JDBC does not provide an asynchronous database driver for the JVM. So we need a connection pool and a properly sized thread pool in our Play Framework application to improve performance.

1. Add connection pool to Slick:

Every Database object contains an AsyncExecutor that manages the thread pool for asynchronous execution of database I/O actions. Its size is the main parameter to tune for the best performance of the Database object. It should be set to the value we would use for the size of the connection pool in a traditional, blocking application. Slick's default connection pool is HikariCP, a “zero-overhead” production-quality connection pool.

import play.api.db.DB
import play.api.Play.current
import slick.driver.MySQLDriver.simple._
object Contexts {
  def withSession[Res](f: Session => Res) = {
    Database.forDataSource(DB.getDataSource(),
      AsyncExecutor(<executor_name>, <num_threads>, <queue_size>)) withSession f
  }
}

2. connection pool configuration

There are some important default configuration values we need to know. Of course, we can override them for optimization.

# HikariCP connection pool configuration
play.db.hikaricp.connectionTimeout=30 seconds
play.db.hikaricp.maxLifetime=30 minutes
play.db.hikaricp.maximumPoolSize=10

3. thread pool vs connection pool

There is a dependency between these two pool settings in their maximum attributes. Assume your database can support a maximum of 50 connections; then it is straightforward to configure the connection pool's maximum as 50. But how does this setting impact the thread pool setting? Suppose we configure the application server's thread pool with a maximum of 100. The application server will then allow 100 requests to be processed simultaneously, but the database has only 50 connections. So only 50 threads will get a connection while the other 50 fight for one; connections will be switched between threads frequently, which slows down overall application performance.

Just adding database connections will not improve performance, and can compromise availability because of the need to reestablish all the connections during failover.

Let us assume instead that we set the thread pool maximum to the same value as the connection pool maximum, 50. The application server will allow 50 threads to be processed simultaneously; the remaining threads will be in the wait state. All 50 threads get a database connection immediately, so they are processed quickly.

The above example assumed each thread uses one database connection at a time (possibly multiple connections, but sequentially); if each thread in your application uses two database connections in parallel, configure the thread pool maximum as half of the connection pool maximum.

Setting the thread pool size too large can cause performance problems: with too many concurrent threads, task-switching overhead becomes a serious bottleneck. Any threads beyond the number of available connections can be wasteful given contention for the connections, while fewer threads will not fully consume the available connections.
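The sizing rule above can be sketched in Python. This is an illustrative model, not real database code: a semaphore stands in for the connection pool, and the thread pool is sized so that no thread ever blocks waiting for a connection.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 50       # what the database can actually sustain
CONNECTIONS_PER_TASK = 1   # each task holds one connection at a time

# Rule of thumb from the text: threads = connections / connections-per-task.
pool_size = MAX_CONNECTIONS // CONNECTIONS_PER_TASK

connection_pool = threading.BoundedSemaphore(MAX_CONNECTIONS)
executor = ThreadPoolExecutor(max_workers=pool_size)

def handle_request(req_id):
    # Acquire a "connection"; with threads <= connections this never
    # blocks, so there is no connection switching between threads.
    with connection_pool:
        return req_id
```

With `pool_size` raised above `MAX_CONNECTIONS`, the extra threads would queue on the semaphore instead, which is exactly the contention scenario described above.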

4. number of processor core vs the size of a thread pool

  • More threads mean more memory usage. Each thread requires a thread stack. For recent HotSpot JVMs, the minimum thread stack size is 64KB, and the default can be as much as 1MB. That can be significant. In addition, any thread that is alive is likely to own or share objects in the heap whether or not it is currently runnable. Therefore it is reasonable to expect that more threads mean a larger memory working set.
  • A JVM cannot have more threads actually running than there are cores (or hyperthreaded cores) on the execution hardware.
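The two constraints above (cores and per-thread stack memory) can be turned into rough upper bounds. This is a hedged rule-of-thumb sketch, not a formula from any JVM documentation:

```python
import os

# Upper bound 1: for CPU-bound work, more threads than cores mostly
# adds task-switching overhead.
cores = os.cpu_count() or 1
cpu_bound_pool = cores

# Upper bound 2: each thread costs stack memory (64 KB minimum, up to
# ~1 MB default per the text), so a memory budget also caps the pool.
def max_threads_for_budget(budget_mb, stack_mb=1):
    return budget_mb // stack_mb
```

For example, a 512 MB stack budget with 1 MB stacks caps the pool at 512 threads, regardless of how many connections are available.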

5. Involve connection pool to database querying

def delete(userID: Int): Boolean = Contexts.withSession { implicit session =>
  val users = TableQuery[Users]
  val filterInfo = users.filter(_.id === userID)

  if (filterInfo.exists.run) {
    filterInfo.delete
    true
  } else false
}

6. Reuse session within one logic

We know that the more sessions we use, the faster the connection pool is used up. So within one logical operation, we suggest passing the session around to accomplish a single purpose. In order to reuse a session, we pass it as an implicit parameter.

def save(traitInfo: TraitInfo)(implicit session: Session): Int = {
  val traits = TableQuery[Traits]
  // insert and return the auto-generated id
  (traits returning traits.map(_.id)) += traitInfo
}

7. Additional important things to know

Some Concepts on Slick we need to know/understand:

  1. For every database session (withSession blocks), Slick is opening a new connection to the database server.
  2. Session Handling:
    1. The Database object’s withSession method creates a Session, passes it to a given function and closes it afterward. If we use a connection pool, closing the Session returns the connection to the pool.
    2. Only the method actually executing the query in the database requires a Session. The executing methods are made available via implicit conversions.
    3. It is not recommended to handle the lifetime of a Session manually.
  3. All methods that execute a query take an implicit Session value. Of course, you can also pass a session explicitly if you prefer.

Basic Knowledge on connection pool:

  1. Connection pools are common in network applications and they’re often tricky to implement. For example, it is often desirable to have timeouts on acquisition from the pool, since various clients have different latency requirements.
  2. Pools are simple in principle: we maintain a queue of connections and we satisfy waiters as they come in. In practice this typically involves keeping two queues: one of waiters (when there are no free connections), and one of connections (when there are no waiters).
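The principle above can be sketched in a few lines of Python. A blocking queue already maintains the waiter queue internally, and a `timeout` bounds acquisition latency as described; this is a toy illustration, not a production pool:

```python
import queue

class ConnectionPool:
    """Toy pool: a queue of free connections; waiters block on get()."""

    def __init__(self, connections, timeout=None):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self._timeout = timeout

    def acquire(self):
        # Blocks if no connection is free; raises queue.Empty on timeout,
        # giving callers the bounded acquisition latency mentioned above.
        return self._free.get(timeout=self._timeout)

    def release(self, conn):
        self._free.put(conn)
```

A caller that acquires when the pool is empty simply waits in the queue's internal waiter list until some other thread calls `release`, or until the timeout fires.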

8. Todo In Future:

There are some points we can optimize.

  1. Slick uses prepared statements wherever possible but it does not cache them on its own. We should therefore enable prepared statement caching in the connection pool’s configuration and select a sufficiently large pool size.
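Note that HikariCP itself does not cache prepared statements; with MySQL the caching is done by the JDBC driver and is enabled through data source properties. A hedged sketch of what such a configuration might look like (the property names are MySQL Connector/J settings; the exact Play config path may differ by version):

```
# enable MySQL Connector/J statement caching through the pool
play.db.hikaricp.dataSource.cachePrepStmts=true
play.db.hikaricp.dataSource.prepStmtCacheSize=250
play.db.hikaricp.dataSource.prepStmtCacheSqlLimit=2048
```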

MySQL(2)- performance monitor

Currently, I use MySQL as the database to store data. The whole architecture is built on the Play Framework. But I have met problems such as MySQL connection timeouts, so MySQL performance becomes seriously important. Here I list some commands to monitor MySQL performance. I don’t like installing additional tools for this; in fact, mysqladmin can already do a lot.

1. MysqlAdmin

  • check MySQL server status
    mysqladmin -u root -p ping
  • check MySQL version
    mysqladmin -u root -p version
  • find out the current status of the MySQL server
    mysqladmin -u root -p status
    # Output:
    Uptime: 269161  Threads: 11  Questions: 36607  Slow queries: 0  Opens: 96  Flush tables: 1  Open tables: 70  Queries per second avg: 0.136
  • check the extended status of all MySQL server variables and values
    mysqladmin -u root -p extended-status
  • see all MySQL server variables and values
    mysqladmin -u root -p variables
  • check all the running process of MySQL server
    mysqladmin -u root -p processlist
  • some useful MySQL flush commands
    mysqladmin -u root -p flush-hosts 
    mysqladmin -u root -p flush-tables 
    mysqladmin -u root -p flush-threads 
    mysqladmin -u root -p flush-logs 
    mysqladmin -u root -p flush-privileges 
    mysqladmin -u root -p flush-status
  • kill sleeping MySQL client process
    mysqladmin -u root -p processlist
    mysqladmin -u root -p kill #id,#id

2. Within MySQL

  • find slow query

    • show configuration/variables first
      show variables like "%slow%";

      “slow_query_log” and “slow_query_log_file” are important.

       show variables like 'long_query_time';


    • set long_query_time and turn on slow_query_log
      set long_query_time=1;
      set global slow_query_log='ON';
      
    • view log file
      cat /var/lib/mysql/<file_name>.log | more
    • show slow queries
      show global status like "%slow%";


  • check read/write collision or table lock check

    • table status
      show status like 'Table%';

      Table_locks_immediate shows the number of table-level lock requests that could be granted immediately; Table_locks_waited shows the number that had to wait.
      If Table_locks_immediate/Table_locks_waited > 5000, you’d better adopt the InnoDB engine, because InnoDB uses row-level locks while MyISAM uses table-level locks; for highly concurrent write applications, InnoDB is the better choice.
      If Table_locks_waited is too high, there is serious table-level lock contention.

  • check key lock which is related to index buffer usage status

    • show key buffer size
      show variables like "key_buffer_size";
      


    • show hit ratio
      show global status like "key_blocks_u%";
      

      Note: If “Key_blocks_unused” is too high, buffer size should be decreased.

    • modify key buffer size
      set global key_buffer_size=4194304;
  • find full table scan statement

    select * from sys.statements_with_full_table_scans;


  • others
    show [session|global] status like ...

    The difference between session and global is that session applies only to the current connection, while global covers the whole running time of the server.

    # show all mysql operating info
    show status;
    # show insert times
    show status like "com_insert%";
    # show delete times
    show status like "com_delete%";
    # show query times
    show status like "com_select%";
    # show mysql service running time
    show status like "uptime";
    # show mysql connection attempts
    show status like "connections";

3. MySQL Health Check

http://www.fromdual.com/mysql-database-health-check is an online website which helps you to check MySQL status.


4. Related links worth reading

HBase (1): OpenTSDB Table Design

Why do we need to learn OpenTSDB? Is it a good study case for learning how to design an HBase table? For me, the answer is definitely yes. Many good optimizations are already applied to OpenTSDB, this open source project. So this post will only cover how OpenTSDB designs its HBase tables, not how to use OpenTSDB or how to deploy it to monitor servers. Maybe in the future I will write about that part.

So first, let’s go over some basic concepts in OpenTSDB.

What is OpenTSDB?

It is a distributed, scalable time-series database built for modern monitoring needs. It can collect, store and serve billions of data points with no loss of precision, and can be used with tcollector. There are two key points here: one is time series, the other is billions of data points. So the timestamp is an important element in OpenTSDB, and there are many data points OpenTSDB needs to handle. (That’s the main reason we need to learn from OpenTSDB’s design: we also face big data, and time is also a significant field in our data.)

Even though OpenTSDB is an open source project, it is also used by many big companies, including Yahoo, eBay, Pinterest, and so on.

Some Concepts

  • data points: (time, value)
  • metrics: proc.loadavg.cpu
  • tags: hosts=haimeili, ip=127.0.0.1
  • metric + tags = time series

There are two tables OpenTSDB uses to store data: one is tsdb, the other is tsdb-uid. There are now also two additional tables, named tsdb-meta and tsdb-tree (new in OpenTSDB 2.0).

tsdb-uid

This table maps uids to names and names to uids. There are only three kinds of qualifiers: metrics, tagk and tagv. Remember that the mapping goes two ways: one from uid to name, the other from name to uid. Here is an example:

(screenshot: example tsdb-uid rows)

tsdb

tsdb is the main table that stores the data points. Its row key is a concatenation of uids and time.

  • The row key format is: <metric uid><timestamp><tagk1><tagv1><tagk2><tagv2>…
  • Timestamps are normalized on 1-hour boundaries
  • All data points for an hour are stored in one row
  • There are two qualifier formats, one of 2 bytes and one of 4 bytes. The 2-byte format looks like this: <12 bits><4 bits>. The first 12 bits store the offset in seconds within the hour. The last 4 bits are a flag: the first bit tells whether the value is an integer or a float, and the remaining 3 bits encode the length of the value minus one, e.g. “000” means a 1-byte value, “001” a 2-byte value, “011” a 4-byte value, and “111” an 8-byte value. The 4-byte format looks like this: <4 bits><22 bits><2 bits><4 bits>. The first 4 bits are the prefix “1111” marking millisecond precision, the 22 bits store the offset in milliseconds, the 2 bits are reserved, and the last 4 bits are the same flag as above.

Here is one example:

1297574486 = 2011-02-13 13:21:26
MWeP = 01001101 01010111 01100101 01010000 = 1297573200 = 2011-02-13 13:00:00 (the row key keeps only the hour; the minutes and seconds go into the qualifier)
Pk = 01010000 01101011; the top 12 bits = 1286 (1286 seconds = 21 min 26 s), the low 4 bits are the flags
1297573200 + 1286 = 1297574486
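The arithmetic in the example above is easy to check in Python. A small sketch of the hour alignment and a 2-byte qualifier, assuming the length-minus-one flag encoding (flags `0b0011` here stands for a hypothetical 4-byte integer value):

```python
ts = 1297574486                # 2011-02-13 13:21:26

base = ts - ts % 3600          # normalize to the 1-hour boundary in the row key
offset = ts - base             # seconds within the hour, kept in the qualifier

# 2-byte qualifier: <12-bit seconds offset><4-bit flags>.
flags = 0b0011                 # integer bit 0, length bits 011 (= 4-byte value)
qualifier = (offset << 4) | flags
```

This reproduces the example: `base` is 1297573200 (13:00:00) and `offset` is 1286 seconds (21 min 26 s).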

(screenshot: example tsdb row)

Summary

When you design a table for big data, consider using concatenation to save space. If you have time-based data, think about where to store the timestamp, and whether you want to store data per second or per minute. Also, if your data is not in a good format, or is too long, or you have a list of values, you might need to map the data to a uid to save space.

MySQL(1)- Index Optimization

  • Table Schema

This is the table definition. There are five fields.

(screenshot: table schema)

Here is the command I use to update data:

MonthlyMovie.objects.filter(MovieID=movieid, Time=itime).update(Count=clicktime)

Now my requirement is to add a large amount of data to this table, almost 700,000 rows.

At first, I wanted to optimize each field. For example, I tried to use smaller field types: int is better than char. Meanwhile, I used constants to define each field’s length. But this kind of optimization does not have a great influence on efficiency. Then I tried to create indexes on the table; different indexes on different fields give quite different results.

  • Experiments

I do some experiments on it.

solution 1: idx1: MovieID, Time, Count; idx2: MovieID, Time, Rating
time: 846.411153
solution 2: idx1: Time, MovieID, Count; idx2: Time, MovieID, Rating
time: 855.032292
solution 3: idx1: Count; idx2: Rating; idx3: MovieID, Time
time: 840.703046
solution 4: idx1: Count; idx2: MovieID, Time
time: 826.489348
  • Conclusions:

    • focus on query
    • focus on sort
    • don’t create unused index
    • understand composite index
  • Tips:

    • Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search on, you should almost always index them.
    • The rule also applies to a partial search like “last_name LIKE ‘A%'”. When searching from the beginning of the string, MySQL is able to utilize the index on that column.
    • Some searches can’t use the regular indexes. For instance, when searching for a word, as in “WHERE post_content LIKE ‘%apple%'”, you will not see a benefit from a normal index.
    • If your application contains many JOIN queries, make sure that the columns you join on are indexed on both tables.