Performance Control

There are several ways to keep a server's performance under control. In earlier posts I covered different ways to inspect your server's performance and tune it. This post focuses on checking performance automatically.

As we know, the more transactions arrive, the more pressure the server is under, so we need an automatic way to check the server's status. Here are two approaches: crontab and a shell loop.

crontab

  • Several important commands to know:
    # edit the current user's crontab
    crontab -e
    # list the current user's crontab entries
    crontab -l
    # search the system log for cron activity (Debian/Ubuntu path; other systems may log elsewhere)
    sudo grep cron /var/log/syslog

loop

  • Step 1: write a script, for example autocheck.sh, containing this line:
    nohup sh -c 'while true; do <your_script>.sh >> <your_log>.txt; sleep 1800; done' &

    The 1800 is in seconds, so the loop waits 30 minutes between runs.
    Note that <your_script>.sh is the script that does the actual work, not autocheck.sh.
    This line uses four constructs:

    • nohup
      It keeps the job running when the process receives SIGHUP, for example when you close the terminal.
    • &
      It returns control to you immediately and lets the command run in the background.
    • while … do … done
      It repeats the commands between do and done; with true as the condition, it repeats until the process is killed.
    • sleep 1800
      It pauses for 1800 seconds before the next iteration.
  • Step 2: run the script.
    sh autocheck.sh
  • Step 3: check your log; it contains whatever your script printed.
    vi <your_log>.txt
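
If the thing you want to check lives inside a JVM application, the same "run, wait, run again" loop can be sketched in Scala. This is only an illustration of the idea above, not a replacement for the shell script; PeriodicCheck and its parameters are names made up for this example, and the scheduler comes from java.util.concurrent.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

object PeriodicCheck {
  /** Runs `check` immediately, then again `delay` units after each run finishes
    * (the same shape as `while true; do ...; sleep N; done`). */
  def start(delay: Long, unit: TimeUnit)(check: () => Unit): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      def run(): Unit =
        try check()
        catch {
          // An uncaught exception would cancel all later runs, so log and continue
          case e: Exception => System.err.println(s"check failed: $e")
        }
    }
    scheduler.scheduleWithFixedDelay(task, 0L, delay, unit)
    scheduler
  }
}
```

A call such as PeriodicCheck.start(1800L, TimeUnit.SECONDS)(() => println("status ok")) would mirror the 30-minute interval; call shutdown() on the returned scheduler to stop it.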

Text Extraction From URL by Scala

In this post, we talk about how to extract text from a URL. Note that special pages (e.g. Facebook posts and comments) are not covered here; I will write up a solution for extracting Facebook posts in another post.

A URL can serve many content types, such as text/*, application/xml, application/xhtml+xml, application/pdf and images. This post supports only the types listed here.

The code comes in three parts: Case 1 handles text/*, application/xml and application/xhtml+xml; Case 2 handles application/pdf; Case 3 handles images such as png and jpg.

Case 1:

import org.jsoup.Jsoup
val doc = Jsoup.connect(<your_url>).get()
getTextByDoc(doc)

Jsoup gives us a doc, but it contains many elements we don't need, such as the footer and header. We only want the meaningful content, so we apply some filters. We can't filter out everything, because we can't know in general which parts are useful and which are not; the best we can do is remove the commonly known useless parts.

import org.jsoup.nodes.Document
private def getTextByDoc(doc: Document): String = {
  // Remove whole elements that rarely carry the main content
  doc.head().remove()
  doc.getElementsByTag("header").remove()
  doc.getElementsByTag("footer").remove()
  doc.getElementsByTag("form").remove()
  doc.getElementsByTag("table").remove()
  doc.getElementsByTag("meta").remove()
  doc.getElementsByTag("img").remove()
  doc.getElementsByTag("a").remove()
  doc.getElementsByTag("br").remove()

  // Remove elements with common boilerplate class names
  doc.getElementsByClass("tags").remove()
  doc.getElementsByClass("copyright").remove()
  doc.getElementsByClass("widget").remove()

  // Remove elements whose class attribute contains these substrings
  doc.select("div[class*=foot]").remove()
  doc.select("div[class*=tag]").remove()
  doc.select("div[class*=Loading]").remove()
  doc.select("div[class*=Widget]").remove()
  doc.select("div[class*=Head]").remove()
  doc.select("div[class*=menu]").remove()
  doc.select("p[class*=link]").remove()

  val paragraphs = doc.select("p")
  val divs = doc.select("div")

  // Note: the text of a <p> nested inside a <div> appears in both parts
  paragraphs.text() + " " + divs.text()
}

Case 2:

A PDF URL is a little more complex. First we check that the content type is “application/pdf”; then we download the PDF to a local temporary file, extract the text from that file, and finally delete the temporary file.

import java.io.File
import java.net.URL
import scala.sys.process._
import scala.util.Random
val url = new URL(<your_url>)
val conn = url.openConnection()
val contentType = conn.getContentType // may be null if the server sends no type

contentType match {
  case "application/pdf" =>
    // Download to a randomly named temporary file in the working directory
    val fileName = Random.alphanumeric.take(5).mkString + ".pdf"
    (url #> new File(fileName)).!!
    val texts = getTextFromPDF(None, None, fileName)
    new File(fileName).delete()
    texts
  case _ => None
}

The following function extracts text from a local PDF file. Because I don't know the start and end pages, I pass None for both; by default it extracts every page.

import java.io.File
import scala.util.control.NonFatal
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 1.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper
private def getTextFromPDF(startPage: Option[Int], endPage: Option[Int], fileName: String): Option[String] = {
  try {
    val pdf = PDDocument.load(new File(fileName))
    try {
      val stripper = new PDFTextStripper()
      // Restrict the page range only when bounds are given; otherwise all pages are read
      startPage.foreach(p => stripper.setStartPage(p))
      endPage.foreach(p => stripper.setEndPage(p))
      Some(stripper.getText(pdf))
    } finally {
      pdf.close() // release the file even if extraction fails
    }
  } catch {
    case NonFatal(_) => None
  }
}

Case 3:

Images require another technique, OCR (optical character recognition), which can read the text inside an image. So we need to add a Java OCR library to the project.

Step1:

Add one line to build.sbt to declare the dependency.

libraryDependencies += "com.asprise.ocr" % "java-ocr-api" % "[15,)"

Step2:

Import the library:

import com.asprise.ocr.Ocr

Step3:

Here is the code snippet that does the recognition. Note that <your_file> is a File. If you only have a file name or path, convert it with new File(<file_name>).

try {
  // Image
  Ocr.setUp()
  val ocr = new Ocr
  ocr.startEngine("eng", Ocr.SPEED_FASTEST)
  val files = List(<your_file>)
  val outputString = ocr.recognize(files.toArray, Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT)
  ocr.stopEngine()
  Some(outputString)
} catch {
  case e: Exception => None // todo: to support multiple file types
}

Play Framework(12)-template engine

Concept:

Templates are compiled as standard Scala functions, following a simple naming convention. If you create a views/Application/index.scala.html template file, it generates a views.html.Application.index class that has an apply() method.

Special Character:

Scala templates use @ as the single special character. Every time this character is encountered, it indicates the beginning of a dynamic statement. If you want to insert a multi-token statement, mark it explicitly with parentheses or curly braces. Because @ is a special character, you'll sometimes need to escape it by writing @@.

In a for loop, make sure that { is on the same line as for, to indicate that the expression continues onto the next line.

Tips:

A template is like a function, so it needs parameters, which must be declared at the top of the template file.

You can write server-side block comments in templates using @* *@.

Dynamic content parts are escaped according to the rules of the template type (e.g. HTML or XML). If you want to output a raw content fragment, wrap it in the template content type, for example @Html(...) in an HTML template.

 

MySQL(3)-Joins

  • left join = A
    select <select_list> from tableA A left join tableB B on A.key=B.key
  • inner join = (common part between A and B)
    select <select_list> from tableA A inner join tableB B on A.key=B.key
  • right join = B
    select <select_list> from tableA A right join tableB B on A.key=B.key
  • A – (common part between A and B)
    select <select_list> from tableA A left join tableB B on A.key=B.key 
    where B.key is NULL
  • B – (common part between A and B)
    select <select_list> from tableA A right join tableB B on A.key=B.key
    where A.key is NULL
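  • A + B (MySQL has no full outer join, so combine a left join with the B-only rows)
    select <select_list> from tableA A left join tableB B on A.key=B.key
    union all
    select <select_list> from tableA A right join tableB B on A.key=B.key
    where A.key is NULL
  • A + B – (common part between A and B)
    select <select_list> from tableA A left join tableB B on A.key=B.key
    where B.key is NULL
    union all
    select <select_list> from tableA A right join tableB B on A.key=B.key
    where A.key is NULL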
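
To make the seven regions above concrete, here is a small Scala sketch that models each table as an in-memory map and computes which keys each query would return. The sample data and the JoinRegions name are invented for illustration, and it assumes keys are unique in each table; a real join on duplicate keys produces more rows.

```scala
object JoinRegions {
  // Hypothetical sample data: key -> row value
  val tableA: Map[Int, String] = Map(1 -> "a1", 2 -> "a2", 3 -> "a3")
  val tableB: Map[Int, String] = Map(2 -> "b2", 3 -> "b3", 4 -> "b4")

  // One output row: key, value from A, value from B (None plays the role of NULL)
  type Row = (Int, Option[String], Option[String])

  private def rows(keys: Set[Int]): List[Row] =
    keys.toList.sorted.map(k => (k, tableA.get(k), tableB.get(k)))

  private val a = tableA.keySet
  private val b = tableB.keySet

  val leftJoin: List[Row]  = rows(a)                          // all of A
  val innerJoin: List[Row] = rows(a.intersect(b))             // common part
  val rightJoin: List[Row] = rows(b)                          // all of B
  val onlyA: List[Row]     = rows(a.diff(b))                  // A minus common part
  val onlyB: List[Row]     = rows(b.diff(a))                  // B minus common part
  val fullOuter: List[Row] = rows(a ++ b)                     // A + B
  val exclusive: List[Row] = rows(a.diff(b) ++ b.diff(a))     // A + B minus common part
}
```

Rows in leftJoin that have no match in B carry None in the third position, which is what the "where B.key is NULL" filter selects.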

Scala (20) – Execution Context

Execution Context:

  • An ExecutionContext is similar to an Executor: it is free to execute computations in a new thread, in a pooled thread or in the current thread (although executing the computation in the current thread is discouraged)

The Global Execution Context:

  • ExecutionContext.global is an ExecutionContext backed by a ForkJoinPool. It should be sufficient for most situations but requires some care.
    A ForkJoinPool manages a limited number of threads (the maximum number of threads is called the parallelism level). The number of concurrently blocking computations can exceed the parallelism level only if each blocking call is wrapped inside a blocking { ... } block. Otherwise, there is a risk that the thread pool in the global execution context is starved and no computation can proceed.
  • By default, ExecutionContext.global sets the parallelism level of its underlying fork-join pool to the number of available processors (Runtime.availableProcessors). This configuration can be overridden by setting the following VM attributes (JVM system properties): scala.concurrent.context.minThreads, scala.concurrent.context.numThreads, scala.concurrent.context.maxThreads.
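
As a minimal sketch of the point about blocking calls, the snippet below runs several slow lookups on the global execution context and marks each one with scala.concurrent.blocking. The slowLookup function is a made-up stand-in for real blocking work such as a JDBC query or file I/O.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockingDemo {
  // Hypothetical stand-in for a blocking call (database query, file read, ...)
  def slowLookup(id: Int): Int = {
    Thread.sleep(100)
    id * 2
  }

  def lookupAll(ids: Seq[Int]): Seq[Int] = {
    val futures = ids.map { id =>
      Future {
        // blocking { } tells the global pool that this task may block,
        // so the pool may start extra threads instead of being starved
        blocking { slowLookup(id) }
      }
    }
    // Await is used here only to get a plain result out of the demo
    Await.result(Future.sequence(futures), 10.seconds)
  }
}
```

Without the blocking wrapper the code still works, but the number of lookups running at once is capped by the parallelism level.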

Thread Pool:

  • If each incoming request results in a multitude of requests to another tier of systems, the thread pools in those systems must be managed so that they are balanced according to the ratio of requests in each tier; mismanagement of one thread pool bleeds into another.

Scala (19) – Futures

Futures

  • They hold the promise for the result of a computation that is not yet complete. They are a simple container, a placeholder. A computation could fail of course, and this must also be encoded. A Future can be in exactly one of 3 states:
    • pending
    • failed
    • completed
  • With flatMap we can define a Future that is the result of two futures sequenced, the second future computed based on the result of the first one.
  • Future defines many useful methods:
    • Use Future.value() and Future.exception() to create pre-satisfied futures
    • Future.collect(), Future.join() and Future.select() provide combinators that turn many futures into one (i.e. the gather part of a scatter-gather operation)
    • Note that these names belong to Twitter's com.twitter.util.Future; the closest scala.concurrent.Future counterparts are Future.successful(), Future.failed(), Future.sequence() and Future.firstCompletedOf()
  • By default, futures and promises are non-blocking, making use of callbacks instead of typical blocking operations. Scala provides combinators such as flatMap, foreach and filter to compose futures in a non-blocking way.
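
The points above can be sketched with the standard library's scala.concurrent.Future. The fetchUserId and fetchScore functions are invented placeholders for real asynchronous calls; the rest uses only standard Future methods.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object FutureDemo {
  // Hypothetical asynchronous lookups
  def fetchUserId(name: String): Future[Int] = Future(name.length)
  def fetchScore(userId: Int): Future[Int] = Future(userId * 10)

  // flatMap sequences two futures: the second is computed from the result of the first
  def scoreFor(name: String): Future[Int] =
    fetchUserId(name).flatMap(fetchScore)

  // Pre-satisfied futures: already completed, one with a value and one with a failure
  val ready: Future[Int] = Future.successful(42)
  val broken: Future[Int] = Future.failed(new IllegalStateException("boom"))

  // Gather step: turn many futures into one future of the combined result
  def total(names: List[String]): Future[Int] =
    Future.sequence(names.map(scoreFor)).map(_.sum)
}
```

None of these calls block the calling thread; a caller would normally attach further combinators or callbacks rather than wait for the result.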

Akka (7) -Configuration

There are several things we can configure in Akka:

  • log level and logger backend
  • enable remote
  • message serializers
  • definition of routers
  • tuning of dispatchers

Two important concepts we need to understand when we do configuration:

  • Throughput
    It defines the number of messages that are processed in a batch before the thread is returned to the pool.
  • Parallelism factor
    The parallelism factor is used to determine the thread pool size with the following formula: ceil(available processors * factor). The resulting size is then bounded by the parallelism-min and parallelism-max values.
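
The formula is easy to check by hand. The sketch below reproduces the arithmetic only; poolSize is a helper written for this post, not an Akka API, and the numbers in the usage note are example values.

```scala
object DispatcherSizing {
  /** ceil(processors * factor), then clamped to the range [min, max]. */
  def poolSize(processors: Int, factor: Double, min: Int, max: Int): Int = {
    val scaled = math.ceil(processors * factor).toInt
    math.min(math.max(scaled, min), max)
  }
}
```

For example, with 8 processors, a factor of 3.0, parallelism-min of 8 and parallelism-max of 64, the pool size is 24; with only 2 processors the raw value 6 is raised to the minimum of 8. To use the current machine, pass Runtime.getRuntime.availableProcessors as the first argument.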