Monthly Archives: September 2015

JVM Memory Management (1)

Last week we dealt with a lot of performance-related issues, but we didn’t dig in to find out why the problems happened. So I decided to write a post explaining JVM memory management. Many blogs already cover it and it is not a new topic, but I will write it down from my own understanding and show the commands I use to verify it.

1. Check Memory Usage

Before we look at what garbage collection is, what the young generation is, and so on, let’s first look at our application’s memory usage. You can use htop to see the memory usage of every process. For Java/Scala applications, you have more choices.

# get java application pid (either command lists the running JVMs)
>> jcmd
>> jps -l
# force Garbage Collection from the shell
>> jcmd <PID> GC.run
# check which classes use the most memory (-histo:live triggers a full GC first)
>> jmap -histo:live <PID> | head
>> jmap -histo:live <PID> | head -n 20
# check real-time memory usage, sampled every 1000 ms
>> jstat -gc <PID> 1000ms
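
The same kind of numbers can also be read from inside the application. Below is a minimal sketch using the standard java.lang.Runtime API; the object name HeapSnapshot and the case class Heap are my own names for illustration.

```scala
object HeapSnapshot {
  // Heap figures in KB: used, currently committed, and the -Xmx ceiling.
  final case class Heap(usedKb: Long, committedKb: Long, maxKb: Long)

  // Read the current heap numbers from the running JVM.
  def read(): Heap = {
    val rt = Runtime.getRuntime
    val committed = rt.totalMemory()
    val used = committed - rt.freeMemory()
    Heap(used / 1024, committed / 1024, rt.maxMemory() / 1024)
  }

  def main(args: Array[String]): Unit = {
    val h = read()
    println(s"used=${h.usedKb} KB, committed=${h.committedKb} KB, max=${h.maxKb} KB")
  }
}
```

This only sees the heap of the JVM it runs in, so it complements jstat rather than replacing it.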

2. Understand jstat output

Here is the jstat output of my application:

S0C      S1C      S0U  S1U      EC         EU       OC         OU       MC       MU       CCSC    CCSU    YGC  YGCT   FGC  FGCT   GCT
61952.0  62976.0  0.0  32790.1  1968128.0  29322.0  2097152.0  24306.3  60888.0  60114.9  7808.0  7676.4  33   0.545  13   1.057  1.601

Each column means:

  • S0C/S1C: the current capacity of the Survivor0 and Survivor1 areas in KB.
  • S0U/S1U: the current usage of the Survivor0 and Survivor1 areas in KB. Notice that one of the survivor areas is always empty. (See “Young Generation” for the reason.)
  • EC/EU: the current capacity and usage of Eden space in KB. EU keeps growing; when Eden fills up (EU reaches EC), a Minor GC runs and EU drops.
  • OC/OU: the current capacity and usage of the Old generation in KB.
  • MC/MU: the current capacity and usage of Metaspace in KB. CCSC/CCSU are the same for the compressed class space. (On Java 7 and earlier, jstat shows PC/PU for Perm Gen instead.)
  • YGC/YGCT: YGC is the number of GC events in the young generation. YGCT is the accumulated time (in seconds) of those events. Both increase in the same row where EU drops because of a Minor GC. (See “Young Generation” for the reason.)
  • FGC/FGCT: FGC is the number of Full GC events. FGCT is the accumulated time of Full GC operations. Notice that a Full GC is much more expensive per event: here 13 Full GCs took 1.057 s while 33 young collections took 0.545 s.
  • GCT: total accumulated time of GC operations. It is the sum of YGCT and FGCT (here 0.545 + 1.057, which matches 1.601 up to rounding).
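
To check these relations on real output, a jstat row can be zipped with its header. This is a small sketch, assuming the header and the value line have the same number of whitespace-separated columns; JstatRow, parse and usedPct are hypothetical names of mine.

```scala
object JstatRow {
  // Zip the header line with the value line into a column -> value map.
  def parse(header: String, values: String): Map[String, Double] = {
    val names = header.trim.split("\\s+")
    val nums = values.trim.split("\\s+").map(_.toDouble)
    require(names.length == nums.length, "header and value counts differ")
    names.zip(nums).toMap
  }

  // Percentage of a space that is in use, for example usedPct(row, "EU", "EC").
  def usedPct(row: Map[String, Double], used: String, capacity: String): Double =
    100.0 * row(used) / row(capacity)

  def main(args: Array[String]): Unit = {
    val header = "S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT"
    val values = "61952.0 62976.0 0.0 32790.1 1968128.0 29322.0 2097152.0 24306.3 " +
      "60888.0 60114.9 7808.0 7676.4 33 0.545 13 1.057 1.601"
    val row = parse(header, values)
    val edenPct = usedPct(row, "EU", "EC")
    val gcSum = row("YGCT") + row("FGCT")
    println(f"Eden used: $edenPct%.2f%%")
    println(f"YGCT + FGCT = $gcSum%.3f, GCT = ${row("GCT")}%.3f")
  }
}
```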

3. How to set JVM parameters

Here we explain why S0C, S1C, EC and OC have the values shown above. Several VM switches set these sizes:

  • -Xms: For setting the initial heap size when JVM starts
  • -Xmx: For setting the maximum heap size
  • -Xmn: For setting the size of the Young Generation, rest of the space goes for Old Generation
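  • -XX:PermSize: For setting the initial size of the Permanent Generation memory (Java 7 and earlier; Java 8 replaced Perm Gen with Metaspace, see -XX:MetaspaceSize)
  • -XX:MaxPermSize: For setting the maximum size of Perm Gen (on Java 8+, -XX:MaxMetaspaceSize plays this role for Metaspace)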
  • -XX:SurvivorRatio: For providing a ratio of Eden space and Survivor Space, for example, if Young Generation size is 10m and VM switch is -XX:SurvivorRatio=2 then 5m will be reserved for Eden Space and 2.5m each for both Survivor spaces. The default value is 8
  • -XX:NewRatio: For providing a ratio of old/new generation sizes. The default value is 2
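
The two ratios are easy to mix up, so here is the arithmetic as a small sketch. It only reproduces the definitions above; a real JVM also aligns and may adaptively resize these spaces, so actual jstat values can differ. GenerationSizes and its methods are names I made up for illustration.

```scala
object GenerationSizes {
  final case class Young(edenKb: Long, survivorKb: Long)

  // -XX:SurvivorRatio=r means Eden is r times one survivor space,
  // so the young generation is split into r + 2 equal parts.
  def splitYoung(youngKb: Long, survivorRatio: Int): Young = {
    val survivor = youngKb / (survivorRatio + 2)
    Young(youngKb - 2 * survivor, survivor)
  }

  // -XX:NewRatio=n means old = n * young, so young is heap / (n + 1).
  def youngFromHeap(heapKb: Long, newRatio: Int): Long = heapKb / (newRatio + 1)

  def main(args: Array[String]): Unit = {
    // The example from the list: 10m young generation with SurvivorRatio=2
    println(splitYoung(10 * 1024, 2))
    println(youngFromHeap(3 * 1024, 2))
  }
}
```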

4. JVM Memory Usage

Most memory is used by the heap. Outside the heap, memory is also consumed by Metaspace and by the thread stacks.

(1) Java Heap

The heap is where your class instantiations or objects are stored. Instance variables are stored in Objects. When discussing Java memory and optimization we most often discuss the heap because we have the most control over it and it is where Garbage Collection and GC optimizations take place. Heap size is controlled by the -Xms and -Xmx JVM flags.

(2) Java Stack

Each thread has its own call stack. The stack stores primitive local variables and object references along with the call stack (method invocations) itself. The stack is cleaned up as stack frames are popped when methods return, so no GC is performed here. The -Xss JVM option controls how much memory gets allocated for each thread’s stack.

(3) Metaspace

Metaspace stores the class definitions of your objects. -XX:MetaspaceSize sets the initial size at which a GC is first triggered to unload classes, and -XX:MaxMetaspaceSize caps how large Metaspace can grow.

(4) Additional JVM memory

In addition to the above, some memory is consumed by the JVM itself. This holds the C libraries for the JVM and the C memory allocation overhead needed to run the memory pools above. This type of memory can be affected by tuning glibc’s memory allocation behavior.
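
To see how the JVM itself splits these areas, you can list its memory pools. A minimal sketch with the standard java.lang.management API follows; pool names depend on the JVM version and the collector in use, thread stacks are not reported as pools, and MemoryPools/Pool are my own names.

```scala
import java.lang.management.{ManagementFactory, MemoryType}

object MemoryPools {
  final case class Pool(name: String, isHeap: Boolean, usedKb: Long)

  // List every memory pool the JVM reports, tagged as heap or non-heap.
  def list(): Seq[Pool] = {
    val beans = ManagementFactory.getMemoryPoolMXBeans
    (0 until beans.size).map { i =>
      val b = beans.get(i)
      // getUsage may return null for an invalid pool, so guard it
      val usedKb = Option(b.getUsage).map(_.getUsed / 1024).getOrElse(0L)
      Pool(b.getName, b.getType == MemoryType.HEAP, usedKb)
    }
  }

  def main(args: Array[String]): Unit =
    list().foreach { p =>
      val kind = if (p.isHeap) "heap" else "non-heap"
      println(s"${p.name} ($kind): ${p.usedKb} KB")
    }
}
```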

5. JVM Memory Model

By now we know the status of our application, but we still don’t know what Eden is, what Survivor is, and so on. Here we talk about how the JVM organizes memory, so that in the end we understand better how to optimize it. While reading this part, I suggest going back to part 2 and part 3 to map each concept to the real output.

The JVM organizes memory into five areas:

  • Eden
  • S0
  • S1
  • Old Memory
  • Perm (replaced by Metaspace since Java 8)

Eden + S0 + S1 === Young Gen (-Xmn)

Eden + S0 + S1 + Old Memory === JVM Heap (-Xms  -Xmx)

JVM Heap memory is physically divided into two parts: Young Generation and Old Generation.

(1) Young Generation

Young generation is the place where all new objects are created. When the young generation fills up, garbage collection is performed. This garbage collection is called Minor GC. Young Generation is divided into three parts: Eden Memory and two Survivor Memory spaces.

  • Most newly created objects are located in the Eden Memory space. All new allocation happens in Eden, and it only costs a pointer bump.
  • When Eden fills up, a Minor GC runs: a stop-the-world copy collection that moves all surviving objects into one of the survivor spaces. Only live objects are copied, so dead objects cost nothing to collect.
  • Minor GC also checks the objects in the occupied survivor space and moves the live ones to the other survivor space. So at any time, one of the survivor spaces is empty.
  • Objects that survive many GC cycles are moved to the old generation memory space. Usually this is done with an age threshold (-XX:MaxTenuringThreshold): a young generation object becomes eligible for promotion to the Old generation once it reaches that age.

Since Young Generation keeps short-lived objects, Minor GC is usually very fast, and its pause is short enough that the application is barely affected.
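
The copying and aging described above can be written as a toy model. This is only a simulation of the idea on plain lists, not how the JVM is implemented; Obj, Heap, minorGc and the threshold parameter are hypothetical names, and the empty survivor space is left implicit.

```scala
object MinorGcModel {
  final case class Obj(id: Int, age: Int, live: Boolean)
  // survivor is the currently occupied survivor space; the other one is empty
  final case class Heap(eden: List[Obj], survivor: List[Obj], old: List[Obj])

  // One simulated Minor GC: live objects from Eden and the occupied survivor
  // space are copied with age + 1, objects reaching the threshold are promoted,
  // and dead objects are simply never copied.
  def minorGc(heap: Heap, threshold: Int): Heap = {
    val copied = (heap.eden ++ heap.survivor).filter(_.live).map(o => o.copy(age = o.age + 1))
    val (promoted, kept) = copied.partition(_.age >= threshold)
    Heap(eden = Nil, survivor = kept, old = heap.old ++ promoted)
  }

  def main(args: Array[String]): Unit = {
    val before = Heap(
      eden = List(Obj(1, 0, live = true), Obj(2, 0, live = false)),
      survivor = List(Obj(3, 1, live = true)),
      old = Nil)
    println(minorGc(before, threshold = 2))
  }
}
```

After the call Eden is empty, object 1 sits in the survivor space with age 1, object 3 has been promoted, and object 2 is gone without any explicit work.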

(2) Old Generation

Old Generation memory contains the objects that are long-lived and have survived many rounds of Minor GC. Usually garbage collection is performed in Old Generation memory when it is full. Old Generation garbage collection is called Major GC and usually takes longer.

Major GC takes longer because it has to check all the live objects. Major GC should be minimized because it can make your application unresponsive for the duration of the collection.

throughput collectors: -XX:+UseSerialGC -XX:+UseParallelGC -XX:+UseParallelOldGC

low-pause collectors: -XX:+UseConcMarkSweepGC -XX:+UseG1GC

6. Garbage Collection

Every garbage collection includes “Stop the world” phases, during which all application threads are stopped until the phase completes. The low-pause collectors (CMS and G1) do part of their work concurrently with the application to keep these pauses short.

One of the best features of the Java programming language is automatic garbage collection. There are several JVM switches to choose the garbage collection strategy for the application (I will not explain each): Serial GC (-XX:+UseSerialGC), Parallel GC (-XX:+UseParallelGC), Parallel Old GC (-XX:+UseParallelOldGC), Concurrent Mark Sweep (CMS) Collector (-XX:+UseConcMarkSweepGC) and G1 Garbage Collector (-XX:+UseG1GC).
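
To find out which collectors a running JVM actually uses, and how often they have run, the standard GarbageCollectorMXBean can be queried. This is a minimal sketch; the names printed depend on the selected strategy, and GcStats/Collector are my own names.

```scala
import java.lang.management.ManagementFactory

object GcStats {
  // count and timeMs are cumulative since JVM start; -1 means "undefined" for that collector
  final case class Collector(name: String, count: Long, timeMs: Long)

  def read(): Seq[Collector] = {
    val beans = ManagementFactory.getGarbageCollectorMXBeans
    (0 until beans.size).map { i =>
      val b = beans.get(i)
      Collector(b.getName, b.getCollectionCount, b.getCollectionTime)
    }
  }

  def main(args: Array[String]): Unit =
    read().foreach(c => println(s"${c.name}: ${c.count} collections, ${c.timeMs} ms"))
}
```

These counters correspond roughly to the YGC/YGCT and FGC/FGCT columns of jstat, split per collector.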

7. How to optimize JVM parameters

After all this, it may look like the JVM’s automatic garbage collection means we don’t need to do anything. In fact, there is still some tuning we can do.

(1) java.lang.OutOfMemoryError: PermGen

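increase the Perm Gen memory space using -XX:PermSize and -XX:MaxPermSize (on Java 8+, where Perm Gen no longer exists, the equivalent error mentions Metaspace and the switch is -XX:MaxMetaspaceSize)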

(2) a lot of Full GC operations

increase Old generation Memory space.

(3) java.lang.StackOverflowError

increase stack size by -Xss

(4) Good Practices

  • set the minimum (-Xms) and maximum (-Xmx) heap sizes to the same value
  • the -Xmn value should be lower than the -Xmx value.
  • the old generation is the value of -Xmx minus -Xmn. Generally, you don’t want Eden to be too big, or it will take the GC long to look through it for space that can be reclaimed.
  • keep the Eden size between one fourth and one third of the maximum heap size. The old generation must be larger than the new generation.
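
As a sketch of these practices, the flags can be derived from one heap size. I assume here that the one-fourth to one-third rule is applied through -Xmn (which sizes the whole young generation, Eden plus survivors); HeapFlags and its parameters are hypothetical names.

```scala
object HeapFlags {
  // Equal -Xms/-Xmx, and a young generation between 1/4 and 1/3 of the heap.
  def flags(maxHeapMb: Int, youngFraction: Double): Seq[String] = {
    require(youngFraction >= 0.25 && youngFraction <= 1.0 / 3,
      "young fraction should be between 1/4 and 1/3")
    val youngMb = (maxHeapMb * youngFraction).toInt
    Seq(s"-Xms${maxHeapMb}m", s"-Xmx${maxHeapMb}m", s"-Xmn${youngMb}m")
  }

  def main(args: Array[String]): Unit =
    println(flags(4096, 0.25).mkString(" "))
}
```

Treat the result as a starting point to verify with jstat, not as a final setting.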

To summarize, there is no universal solution that fixes everything. When we meet a problem, we need to use the tools to find the root cause, dig into it, and then fix it.

Image Extraction From URL by Scala

There are few methods or posts about how to extract the main image from a URL. Unfortunately, I recently needed this feature in a company project. I searched in many places without finding much. Some people provide Chrome plugins or code to obtain all images in a URL (this is quite simple: you only need to parse the page’s HTML and find every img tag), but this is not what we want. I want one main image that reflects the whole content of the page behind the URL. We live in the internet era and we don’t lack information; in fact, we are already lost in too many messages. If a few images can replace a lot of text, we can read less and move faster. (Of course, if you like a slow life, just enjoy it.) That is why we want to show an image before the user clicks the URL.

Idea:

I have explained why we need this feature. The next step is to explain how we achieve it. We follow these steps:

  1. obtain all img tags by parsing the URL’s HTML
  2. filter out well-known bad images, such as logos, brand marks, icons, etc. (Nobody wants an icon to represent an article’s content.)
  3. filter out images whose size does not qualify: too long, too wide, too small, etc. (A main image needs a reasonable size to hold its place in the page.)
  4. fetch the real size of the remaining images and repeat step 3. (Sometimes the img tag has no width/height attributes, so we need to read the real data from the image link.)
  5. sort the remaining images by their real area. (We assume the larger the image, the more likely it is the main image.)
  6. one more filter: I know the URL’s main topic, so the img’s alt text/description can also be used as a measure when sorting.

To be honest, this does not find the main image 100% of the time, even with multiple filters and sorting. You also need to tune the parameters (for example the size thresholds) to get good results.
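
Steps 2, 3, 5 and 6 can be sketched on plain data, without any HTML parsing. This is a simplified variation for illustration: Candidate, acceptable and pick are hypothetical names, the blocked list is shortened, and the image sizes are assumed to be known already.

```scala
object MainImagePicker {
  final case class Candidate(url: String, width: Int, height: Int, alt: String)

  // Shortened list of URL fragments that usually mean "not a content image"
  private val blocked = Seq("logo", "icon", "badge", "pixel", "spinner")

  // Steps 2 and 3: drop known bad images, tiny images and extreme aspect ratios.
  def acceptable(c: Candidate, minSide: Int = 128, maxRatio: Int = 3): Boolean = {
    val u = c.url.toLowerCase
    !blocked.exists(b => u.contains(b)) &&
      c.width > minSide && c.height > minSide &&
      c.width <= maxRatio * c.height && c.height <= maxRatio * c.width
  }

  // Steps 5 and 6: prefer images whose alt text mentions the topic, then the largest area.
  def pick(cs: Seq[Candidate], topic: String): Option[String] = {
    val ok = cs.filter(c => acceptable(c))
    val onTopic = ok.filter(c => topic.nonEmpty && c.alt.toLowerCase.contains(topic.toLowerCase))
    val pool = if (onTopic.nonEmpty) onTopic else ok
    if (pool.isEmpty) None else Some(pool.maxBy(c => c.width.toLong * c.height).url)
  }

  def main(args: Array[String]): Unit = {
    val cs = Seq(
      Candidate("http://example.com/logo.png", 900, 900, ""),
      Candidate("http://example.com/cat.jpg", 600, 400, "A cat"),
      Candidate("http://example.com/dog.jpg", 800, 600, "A dog"))
    println(pick(cs, "cat"))
  }
}
```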

Code:

package controllers

import java.io.IOException
import java.net.URL
import javax.imageio.ImageIO
import scala.collection.mutable.HashMap

import org.jsoup.nodes.{Document, Element}
import org.jsoup.Jsoup
import collection.JavaConversions._

class ImageResolverService {
  // URL fragments that usually indicate logos, icons, trackers or ads
  private val blockedFragments = Seq(
    "logo", "icon", "loading", ".gif", "badge", "1x1", "doubleclick", "pixel", "gravatar.com",
    "widget", "spinner", "feeds.feedburner.com", "/ads/", "http://mcclatchy.112.2o7.net/",
    "http://ientry.rotator.hadj1.adjuggler.net/", "g+.jpg")

  // true when the URL is non-empty and contains none of the blocked fragments
  def checkUrl(url: String): Boolean = {
    val tempUrl = url.toLowerCase
    url != "" && !blockedFragments.exists(fragment => tempUrl.contains(fragment))
  }

  // Try the plain src first, then the common lazy-loading attributes
  def getUrl(s: Element): String = {
    Seq("src", "data-src", "data-lazy-src", "data-original")
      .map(name => s.attr(name))
      .find(_ != "")
      .getOrElse("")
  }

  def fixUrl(url: String, domain: String): String = {
    var returnInfo = url
    if (!url.toLowerCase.startsWith("http")) {
      if (url.startsWith("/")) {
        if (url.startsWith("//")) {
          returnInfo = domain.split("//")(0) + url
        } else {
          returnInfo = domain + url
        }
      } else {
        returnInfo = domain + "/" + url
      }
      if (url.startsWith("../")) {
        returnInfo = domain + "/" + url.replace("../", "")
      }
    }
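    // Strings are immutable, so the result must be assigned (the original call discarded it)
    returnInfo = returnInfo.replace(" ", "%20")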
    returnInfo
  }

  def getSrcFromDoc(doc: Document, domain: String, item: String): String = {
    var srcMap = new HashMap[String, Int]
    val elementImages = doc.select("img").iterator().toList
    var src = ""
    elementImages.foreach{s =>
      var imageElement = getUrl(s)
      if (checkUrl(imageElement)) {
        imageElement = fixUrl(imageElement, domain)
        // Parse a width/height attribute such as "300", "300px" or "300.5";
        // anything missing or unparsable (e.g. "100%") falls back to 1, meaning "unknown"
        def parseSize(attr: String): Int =
          try attr.toLowerCase.replace("px", "").trim.toFloat.toInt
          catch { case _: NumberFormatException => 1 }
        var w = parseSize(s.attr("width"))
        var h = parseSize(s.attr("height"))
        if ( w == 1 || h == 1 || (w > 128 && h > 128)) {
          try {
            imageElement = imageElement.replaceAll("""(?m)\s+$""", "")
            val imageUrl = new URL(imageElement)
            val image = ImageIO.read(imageUrl)
            if (image != null) {
              w = image.getWidth
              h = image.getHeight
            } else {
              w =  1
              h = 1
            }
            // Compare with multiplication: integer division truncates (399/100 == 3)
            if (w <= 3 * h && h <= 3 * w && w > 128 && h > 128) {
              if (s.attr("alt").toLowerCase.contains(item.toLowerCase)) {
                src = imageElement
              }
              srcMap += (imageElement -> w * h)
            }
          } catch {
            case e: IOException =>
          }
        }
      }
    }
    val srcMapSorted = srcMap.toList.sortBy{-_._2}

    if (srcMapSorted.nonEmpty && src == "") {
      src = srcMapSorted.head._1
    }
    src
  }

  def extract(url: String, item: String): String = {
    var src = ""
    if (url.startsWith("http")) {
      val domain = url.split("//")(0) + "//" + url.split("//")(1).split("/")(0)
      try {
        var res = Jsoup.connect(url).
        timeout(60000).ignoreHttpErrors(true).ignoreContentType(true).followRedirects(true).execute()
        if (res.statusCode() == 307) {
          val sNewUrl = res.header("Location")
          if (sNewUrl != null && sNewUrl.length() > 7)
            res = Jsoup.connect(sNewUrl).timeout(60000).execute()
        }
        val doc = res.parse()

        src = getSrcFromDoc(doc, domain, item)
      } catch {
        case e: IOException =>
      }
    }
    src
  }
}
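
As an alternative to the hand-written fixUrl, relative image links can be resolved with the standard java.net.URI class. This is only a sketch of a different technique, not the method used in the class above; UrlResolver and resolve are hypothetical names, and it resolves against the full page URL instead of the domain.

```scala
import java.net.URI

object UrlResolver {
  // Resolve an img src (absolute, "/path", "//host/path", "../path" or "file")
  // against the URL of the page it was found on.
  // Note: URI.resolve throws IllegalArgumentException for syntactically invalid input.
  def resolve(pageUrl: String, src: String): String =
    new URI(pageUrl).resolve(src.trim.replace(" ", "%20")).toString

  def main(args: Array[String]): Unit = {
    val page = "http://example.com/a/b/page.html"
    Seq("/img/x.png", "//cdn.example.com/x.png", "../img/x.png", "my pic.jpg")
      .foreach(src => println(resolve(page, src)))
  }
}
```

One caveat I am aware of: the page URL should have a path (at least a trailing slash), otherwise relative names are not joined correctly.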

Play Framework (2) - Security (especially for web application)

Here we talk about how to make our web application safer. Even though front-end code can be inspected by any developer, we can still use the back-end to protect each request. I will describe how attackers might hack your system and then give a basic, simple solution. Two things in advance. First, the solution targets Play Framework, but I think the idea applies to other web frameworks too. Second, Play has many versions and I can’t try them all; here we use 2.3.x, so if you use another version, please consider how to implement the solution in your version. I will state the problem first, then explain the idea of my solution, and finally give code snippets.

1. Problem

We all know the front-end uses routes to talk to the back-end, to get data or to post data for storage. Once an attacker discovers a route, it is quite easy to send requests to the back-end without logging in or even visiting the website. Many public tools can do this, such as Postman.

2. Solution

The solution is easy to think of: authentication. Authentication is already widely used, and there are many mature libraries that provide ready-made classes/objects, like social-auth, play2-auth, etc.

But they are rather complex. If you want something simpler, you only need to make sure each “Action” in your code is safe. As we all know, every route maps to an Action on the back-end, no matter whether it is GET or POST. If every Action is safe, our routes are safe. So we define our own Action that double-checks whether each request carries the authenticated info (here, in the session). If not, we treat the request as unauthenticated and it cannot be completed. For a browser, this means redirecting to the login page to force the user to log in or sign up. Only when the user logs in or signs up do we mark the user as authenticated. After that, other requests from this user are allowed.

Here is the logic:

  1. the user logs in or signs up, and the back-end marks the user as authenticated
  2. the user sends other requests, and the back-end treats them as authenticated
  3. if the user has not logged in/signed up or has already logged out, the next request is treated as unauthenticated and 401 is returned.
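
Before looking at the Play code, the three rules can be modelled on a plain Map. This is a framework-free sketch of the session lifecycle only; SessionModel and its methods are hypothetical names, and the key matches the one used in the snippet of this post.

```scala
object SessionModel {
  type Session = Map[String, String]
  val AuthKey = "WWW-Authenticate"

  // Rule 1: login/signup marks the session as authenticated
  def login(session: Session, user: String): Session = session + (AuthKey -> user)

  // Logout starts over with an empty session
  def logout(session: Session): Session = Map.empty

  // Rules 2 and 3: 200 for an authenticated session, 401 otherwise
  def statusFor(session: Session): Int = if (session.contains(AuthKey)) 200 else 401

  def main(args: Array[String]): Unit = {
    val anonymous: Session = Map.empty
    val loggedIn = login(anonymous, "user")
    println(Seq(statusFor(anonymous), statusFor(loggedIn), statusFor(logout(loggedIn))))
  }
}
```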

3. Code snippet

Define a UserAction:

package controllers

import play.api.http.Status
import play.api.mvc._

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object UserAction extends ActionBuilder[Request] {
  def invokeBlock[A](request: Request[A], block: (Request[A]) => Future[Result]) = {
    if (request.session.get("WWW-Authenticate").isDefined)
      block(request)
    else
      Future { Results.Status(Status.UNAUTHORIZED) }
  }
}

Mark the session as authenticated on login/signup:

Ok(resultInfo).withSession("WWW-Authenticate" -> "user")

Logout:

Ok.withNewSession

Other routes using the UserAction defined above:

/**
 * Delete one user; the block only runs when the session is authenticated
 */
def delete = UserAction { request =>
  // do your things, then return a Result
  Ok
}

4. Additional Info

(1) CSRF(cross-site-request-forgery)

Problem: Authentication alone is not enough. If an attacker signs up with false info, he/she can still use this authenticated user to do dangerous things. And in a CSRF attack, a malicious page tricks a logged-in user’s browser into sending requests that carry the user’s session.

Solution: Of course, we can assign different levels of authentication. For example, admin operations are only allowed for a user who has admin rights.

But there is another easy way to handle forged requests: Play’s CSRF filter.

To allow simple protection for non browser requests, such as requests made through AJAX, Play also supports the following:

  • If an X-Requested-With header is present, Play will consider the request safe. X-Requested-With is added to requests by many popular Javascript libraries, such as jQuery.
  • If a Csrf-Token header with value nocheck is present, or with a valid CSRF token, Play will consider the request safe.

(2) ActionBuilder

Action is just an implementation of ActionBuilder[Request]; we can extend ActionBuilder and use it in place of Action, as in the snippet above.

ActionBuilder requires that we implement invokeBlock, which takes two parameters: the first is the incoming request, and the second is the function body, which takes a Request[A] and returns a Future[Result].

block(request) means request processing continues as expected. (Don’t read the word “block” as the request getting blocked; it executes the code block, i.e. the function body, that was passed in.)
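
To make the role of the block parameter concrete without depending on Play, here is a stand-in with the same shape as invokeBlock. Req, Res and MiniAction are simplified types I made up; they are not Play’s Request and Result.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object MiniAction {
  final case class Req(session: Map[String, String], body: String)
  final case class Res(status: Int, body: String)

  // Same shape as invokeBlock: run the block only for authenticated requests,
  // otherwise answer 401 without ever calling it.
  def invokeBlock(request: Req, block: Req => Future[Res]): Future[Res] =
    if (request.session.contains("WWW-Authenticate")) block(request)
    else Future.successful(Res(401, "unauthorized"))

  // A "delete" handler wrapped by invokeBlock; Await is only used to get a value in this demo.
  def delete(request: Req): Res =
    Await.result(
      invokeBlock(request, r => Future.successful(Res(200, "deleted " + r.body))),
      1.second)

  def main(args: Array[String]): Unit = {
    println(delete(Req(Map("WWW-Authenticate" -> "user"), "42")))
    println(delete(Req(Map.empty, "42")))
  }
}
```

The handler body is just a function value; the wrapper decides whether to call it.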

5. Recommended Links

(1) How to use Play’s headers and filter to protect against CSRF: https://www.playframework.com/documentation/2.3.x/ScalaCsrf 

(2) Action Composition in Play Framework: https://www.playframework.com/documentation/2.3.x/ScalaActionsComposition and http://iankent.uk/2014/02/10/action-composition-in-play-framework/ (the second link is a little old, but really useful and complete. It also provides a more complete solution than mine. I strongly suggest reading it.)