Category Archives: Scala

Text Extraction From URL by Scala

In this post, we talk about how to extract text from a URL. Please note that we do not cover special pages (e.g. Facebook posts, Facebook comments, etc.) here; in another post, I will describe a solution for extracting Facebook posts.

A URL can serve many content types, such as text/*, application/xml, application/xhtml+xml, application/pdf, image/*, etc. In this post, we only support the types listed here.

The code is split into three parts: Case 1 handles text/*, application/xml, and application/xhtml+xml; Case 2 handles application/pdf; Case 3 handles images such as png and jpg.

Case 1:

import org.jsoup.Jsoup
val doc = Jsoup.connect(<your_url>).get()
getTextByDoc(doc)

Jsoup gives us the doc, but it contains many useless elements such as the footer, header, etc. We only want the meaningful content, so we apply some filters. Note that we cannot filter everything, because we cannot always tell which parts are useful; the best we can do is remove the commonly known useless parts.

import org.jsoup.nodes.Document
private def getTextByDoc(doc: Document): String = {
  doc.head().remove()
  doc.getElementsByTag("header").remove()
  doc.getElementsByTag("footer").remove()
  doc.getElementsByTag("form").remove()
  doc.getElementsByTag("table").remove()
  doc.getElementsByTag("meta").remove()
  doc.getElementsByTag("img").remove()
  doc.getElementsByTag("a").remove()
  doc.getElementsByTag("br").remove()

  doc.getElementsByClass("tags").remove()
  doc.getElementsByClass("copyright").remove()
  doc.getElementsByClass("widget").remove()

  doc.select("div[class*=foot").remove()
  doc.select("div[class*=tag").remove()
  doc.select("div[class*=Loading").remove()
  doc.select("div[class*=Widget").remove()
  doc.select("div[class*=Head").remove()
  doc.select("div[class*=menu").remove()
  doc.select("p[class*=link").remove()

  val paragraphs = doc.select("p")
  val divs = doc.select("div")

  paragraphs.text() + divs.text()
}

Case 2:

A PDF URL is a little more complex. First we check the content type to make sure it is “application/pdf”, then we download it to a local temporary file and extract plain text from that file. Finally, we delete the temporary file.

import java.io.File
import java.net.URL
import scala.sys.process._   // provides the url #> file download syntax
import scala.util.Random
val url = new URL(<your_url>)
val conn = url.openConnection()
val contentType = conn.getContentType

contentType match {
  case "application/pdf" =>
    val fileName = Random.alphanumeric.take(5).mkString + ".pdf"
    url #> new File(fileName) !!
    val texts = getTextFromPDF(None, None, fileName)
    val of = new File(fileName)
    of.delete()
    texts
  case _ => None
}

Here is how to extract text from the local PDF file. Because I don't know its start page and end page in advance, both are optional; by default it fetches all pages.

import java.io.File
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripper

private def getTextFromPDF(startPage: Option[Int], endPage: Option[Int], fileName: String): Option[String] = {
  var pdf: PDDocument = null
  try {
    pdf = PDDocument.load(new File(fileName))
    val stripper = new PDFTextStripper()
    startPage.foreach(stripper.setStartPage)   // when no range is given, all pages are extracted
    endPage.foreach(stripper.setEndPage)
    Some(stripper.getText(pdf))
  } catch {
    case e: Throwable => None
  } finally {
    if (pdf != null) pdf.close()   // release the document handle
  }
}

Case 3:

For images we need OCR (optical character recognition) to read the image's content, so we add the java-ocr-api library to the project.

Step 1:

Add the dependency in build.sbt:

libraryDependencies += "com.asprise.ocr" % "java-ocr-api" % "[15,)"

Step 2:

Import the library:

import com.asprise.ocr.Ocr

Step 3:

Here is the code snippet showing how to use it. Please note: <your_file> must be a File. If you only have a file name/path, convert it with new File(<file_name>).

try {
  // Image
  Ocr.setUp()
  val ocr = new Ocr
  ocr.startEngine("eng", Ocr.SPEED_FASTEST)
  val files = List(<your_file>)
  val outputString = ocr.recognize(files.toArray, Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT)
  ocr.stopEngine()
  Some(outputString)
} catch {
  case e: Exception => None // todo: to support multiple file types
}
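
To tie the three cases together, here is a minimal dispatcher sketch that branches on the content type. It assumes the getTextByDoc and getTextFromPDF helpers shown above; recognizeImage is a hypothetical wrapper around the Case 3 OCR snippet.

import java.io.File
import java.net.URL
import scala.sys.process._
import scala.util.Random
import org.jsoup.Jsoup

// A sketch only: getTextByDoc / getTextFromPDF come from Case 1 and Case 2 above,
// recognizeImage is a hypothetical wrapper around the Case 3 OCR code.
def extractText(urlString: String): Option[String] = {
  val url = new URL(urlString)
  Option(url.openConnection().getContentType) match {
    case Some(t) if t.startsWith("text/") || t.startsWith("application/xml") ||
                    t.startsWith("application/xhtml+xml") =>
      Some(getTextByDoc(Jsoup.connect(urlString).get()))   // Case 1
    case Some("application/pdf") =>
      val fileName = Random.alphanumeric.take(5).mkString + ".pdf"
      (url #> new File(fileName)).!!                        // download, as in Case 2
      val texts = getTextFromPDF(None, None, fileName)
      new File(fileName).delete()
      texts
    case Some(t) if t.startsWith("image/") =>
      recognizeImage(url)                                   // Case 3 (OCR)
    case _ => None
  }
}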

Slick (1) – fix more than 22 columns case

Slick is a modern database query and access library for Scala; its purpose is to query databases. This post explains how to map a table with more than 22 columns. First, we show the well-known solution for tables with fewer than 22 columns. Then we explain why that solution does not work beyond 22 columns. Finally, we give a modified solution for the more-than-22-columns case.

Solution – less than 22 columns case

Here is a screenshot of this simple table's schema.

[Screenshot: schema of the friends table]

Here is the solution by Slick to access this table.

case class Friend(
 id: Int,
 name: String,
 firstName: String,
 lastName: String,
 createTime: String)

class Friends(tag: Tag) extends Table[Friend](tag, "friends") {
 def id = column[Int]("id", O.AutoInc)
 def name = column[String]("name")
 def firstName = column[String]("first_name")
 def lastName = column[String]("last_name")
 def createTime = column[String]("create_time")

 def * = (id, name, firstName, lastName, createTime) <> (Friend.tupled, Friend.unapply _)
}

Reason

The * method defines the default projection, and the <> operator maps the result row, which arrives as a Scala tuple, to our custom type; the tuple's type must match the projection that is defined. The root cause of the problem is tupled: the Scala language limits tuples to 22 elements, and tuples are how table rows are represented here, so you cannot use tupled and unapply when there are more than 22 columns.
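
To make the limit concrete, here is a quick sketch of what the compiler rejects (WideRow is a hypothetical 23-field row class, not from the original post):

// Scala 2.x stops at Tuple22, so a projection over 23 columns cannot be a single tuple:
//   def * = (c1, c2, /* ... */, c23) <> (WideRow.tupled, WideRow.unapply)
//   error: too many elements for tuple: 23, allowed: 22
// WideRow.tupled / WideRow.unapply are also not generated for case classes with more than 22 fields.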

Solution – more than 22 columns case

Here is the solution. The important part is to build a custom type that satisfies the * projection: we package groups of columns into case classes, so that tupled can be applied again within each group. I use three case classes grouped by meaning; you can use your own logic to decide which columns to package together.

This is the table schema; different colors mark where the three groups are. Note that these case classes do not change the real physical columns.

[Screenshot: schema of the users table, with the three column groups highlighted in different colors]

Here is the real code; pay attention to how the columns are grouped inside the * projection.

case class User(
                 id: Int,
                 var basicInfo: BasicInfo,
                 var oauth1Info: Oauth1Info,
                 var oauth2Info: Oauth2Info)

class Users(tag: Tag) extends Table[User](tag, "users") {
  def id = column[Int]("id", O.AutoInc)
  def name = column[String]("name")
  def firstName = column[String]("first_name")
  def lastName = column[String]("last_name")
  def email = column[String]("email")
  def avatarUrl = column[String]("avatar_url")

  def timeZone = column[Int]("time_zone")
  def token = column[String]("token")
  def showStatus = column[String]("show_status")
  def showTutorial = column[String]("show_tutorial")
  def createTime = column[String]("create_time")
  def updateTime = column[String]("update_time")

  def provider = column[String]("provider")
  def password = column[String]("password")
  def teamIDs = column[String]("team_ids")

  def oauth1Id = column[String]("oauth1_id")
  def oauth1Token = column[String]("oauth1_token")
  def oauth1Secret = column[String]("oauth1_secret")

  def oauth2Id = column[String]("oauth2_id")
  def oauth2AccessToken = column[String]("oauth2_access_token")
  def oauth2Scope = column[String]("oauth2_scope")
  def oauth2ExpiresIn = column[String]("oauth2_expires_in")
  def oauth2LongLivedToken = column[String]("oauth2_long_lived_token")

  def * = (id,
    (name, firstName, lastName, email, avatarUrl, timeZone, token, showStatus, showTutorial,
      createTime, updateTime, provider, password, teamIDs),
    (oauth1Id, oauth1Token, oauth1Secret),
    (oauth2Id, oauth2AccessToken, oauth2Scope, oauth2ExpiresIn, oauth2LongLivedToken)).shaped <> (
    {case (id, basicInfo, oauth1Info, oauth2Info) =>
     User(id,
       BasicInfo.tupled.apply(basicInfo),
       Oauth1Info.tupled.apply(oauth1Info),
       Oauth2Info.tupled.apply(oauth2Info))},
    { u: User =>
      def f1(p: BasicInfo) = BasicInfo.unapply(p).get
      def f2(p: Oauth1Info) = Oauth1Info.unapply(p).get
      def f3(p: Oauth2Info) = Oauth2Info.unapply(p).get
      Some((u.id, f1(u.basicInfo), f2(u.oauth1Info), f3(u.oauth2Info)))}
    )
}

case class BasicInfo(
                      name: String,
                      firstName: String,
                      lastName: String,
                      email: String,
                      avatarUrl: String,
                      timeZone: Int,
                      token: String,
                      showStatus: String,
                      showTutorial: String,
                      createTime: String,
                      updateTime: String,
                      provider: String,
                      password: String,
                      teamIDs: String)

case class Oauth1Info(
                       oauth1Id: String,
                       oauth1Token: String,
                       oauth1Secret: String)

case class Oauth2Info(
                       oauth2Id: String,
                       oauth2AccessToken: String,
                       oauth2Scope: String,
                       oauth2ExpiresIn: String,
                       oauth2LongLivedToken: String)

Another important point: when you use filter to query or update with Slick, the extra case classes do not get in the way. Only when you save or fetch a full row do you need to unpack each case class. Here are some examples: the first one filters, the second one fetches, and a save sketch follows after them.

def isExist(token: String): Boolean = {
  implicit val db = Database.forDataSource(DB.getDataSource()).createSession()
  val users = TableQuery[Users]
  var findFlag = false
  val filterInfo = users.filter(_.token === token)
  if (filterInfo.exists.run) findFlag = true
  db.close()
  findFlag
}

The second one is to fetch.

def search(name: String): Option[JsArray] = {
  implicit val db = Database.forDataSource(DB.getDataSource()).createSession()
  val users = TableQuery[Users]
  val filterInfo = users.filter{ u => u.name.toLowerCase.like(name)}
  val result =
    if (filterInfo.exists.run) {
      var resultInfo = Json.arr()
      filterInfo.foreach{ f =>
        val tempInfo = Json.obj(
          "id" -> f.id,
          "avatar_url" -> f.basicInfo.avatarUrl)
        resultInfo ++= Json.arr(tempInfo)
      }
      Some(resultInfo)
    } else None
  db.close()
  result
}
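
The text above also mentions save: a full row insert simply takes the nested case classes as constructed values. A minimal sketch, using the same Slick 2.x style session handling as the two examples above (the session setup is an assumption that mirrors them):

def save(user: User): Unit = {
  implicit val db = Database.forDataSource(DB.getDataSource()).createSession()
  val users = TableQuery[Users]
  users += user   // the <> mapping above unpacks BasicInfo / Oauth1Info / Oauth2Info into columns
  db.close()
}

A caller constructs the nested case classes explicitly, e.g. save(User(0, BasicInfo(...), Oauth1Info(...), Oauth2Info(...))).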

References

Here are some useful links that greatly helped me understand and fix this problem. I recommend reading them if you have time.

Scala Slick method I can not understand so far

Custom mapping to nested case class structure in Slick (more than 22 columns)

GitHub-Example

Play Framework (1) – WebSocket

Play communicates asynchronously between front-end and back-end, but sometimes we still need to call a third-party RESTful API, and that API may take a long time to return data. In that case it is wasteful to block and keep polling for a result. A better way is to use the third-party callback together with a WebSocket.

So we have three sides: the third-party API, our back-end, and our front-end.

The front-end opens a WebSocket to the back-end, the back-end requests data from the third-party API, and when the API returns, the back-end pushes the data back to the front-end over the WebSocket.

  1. Build a WebSocket between front-end and back-end.
  2. The front-end sends data to the back-end.
  3. The back-end calls the third-party API.
  4. The third-party API returns data to the back-end.
  5. The back-end returns the data to the front-end over the WebSocket.

So there are two routes: one for the WebSocket and one for the callback. The bridge between them is a stored actorRef. You also need to ask the callback to echo a special token back so you can find the right actor.

  • JavaScript code to open the WebSocket connection between front-end and back-end.
  • var myWebSocket = new WebSocket("ws://<your_domain_name>/<your_ws_route>/");
    myWebSocket.onopen = function() { myWebSocket.send(<your_data>); };
    myWebSocket.onmessage = function(event) { alert(event.data); /* render data here */ };
    myWebSocket.onclose = function() { console.log("ws connection closed."); };
  • Nginx configuration for the connection. Please note: if you don't configure the “proxy_read_timeout” directive, the WebSocket connection is closed automatically after the default 60 seconds, so I normally set it to 1d.

  • location /daoApp/websocket/create/ {
      proxy_pass  http://<your_domain_name>;
      proxy_http_version 1.1;
      proxy_set_header Host $host;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
      proxy_read_timeout 1d;
    }
  • Play routes: one to accept the WebSocket request and one to accept the callback request.
  • POST    /<your_callback_route>/ controllers.WebSocketApp.apiCallback
    GET     /<your_ws_route>/       controllers.WebSocketApp.create
  • Next, one global map stores the actorRef for each connection between front-end and back-end. For each WebSocket request, I store a unique key together with the actorRef and trigger the call to the third-party API. For each callback request, I look up the matching actorRef and send the data back to the front-end.
  • object WebSocketApp extends Controller {
      // remember to build a map to store your actorRef
      // here we use ConcurrentHashMap so the map can be safely shared across threads
      var channels = new ConcurrentHashMap[String, ChannelMsg]().asScala
      def apiCallback = Action {
        request =>
          val requestBody = request.body.asFormUrlEncoded
          requestBody match {
            case Some(s) =>
              if (s.contains("parentID")) {
                val parentID = s("parentID").head
                val channelInfo = channels.get(parentID)
                channelInfo match {
                  case Some(info) =>
                    val yourParsedData = <deal_with_data_from_api>
                    val actorInfo = <actor_ref_stored_in_info> // the actorRef kept inside the ChannelMsg
                    actorInfo match {
                      case Some(actor) =>
                        actor ! <your_parse_data>
                        channels.remove(parentID)
                      case None =>
                    }
                  case None =>
                }
              }
            case None =>
          }
          Ok
      }
    
      def create = WebSocket.acceptWithActor[String, String] { request => out =>
        Props(new DAOWSCreateActor(out, request))
      }
    }
    class DAOWSCreateActor(out: ActorRef, request: RequestHeader) extends Actor {
      override def receive = {
        case topicName: String =>
          val queryTokenInfo = <your_send_request_to_third_party>
          // remember to store your actorRef inside the channelMsg; it is used later to look the actor up
          val channelMsg = <construct_your_channel_msg>
          // remember to pass your key to the third-party callback so the matching actor can be found and the data sent back to the front-end
          WebSocketApp.channels.put(<your_key>, channelMsg)
        case _ =>
      }
    }
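
ChannelMsg itself is not shown above; conceptually it only needs to carry the WebSocket actor's ActorRef plus whatever context is useful when the callback arrives. A minimal sketch (the field names are assumptions, not from the original code):

import akka.actor.ActorRef

// Minimal sketch of the value stored in WebSocketApp.channels: the WebSocket actor
// plus any context to hand back when the third-party callback arrives.
case class ChannelMsg(actorInfo: Option[ActorRef], topicName: String)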

Scala (18) – Concurrency with Futures in Scala

Even though we would like everything to run in parallel with each task fully independent of the others, it is not always that simple. Things have an order; what we can do is let independent work run at the same time and let dependent work run in order.

A Future gives us a simple way to handle this. Each Future is an independent task that starts running concurrently, but within a single Future the steps still run in order.

Here are two examples: one with a single task, the other with multiple tasks.

One Future

import scala.concurrent.{ Future, Await }
import scala.concurrent.ExecutionContext.Implicits.global  // execution context needed to run Futures
import scala.concurrent.duration._
val response1 = Future { create1(item) }
val returnInfo = Await.result(response1, Duration.Inf)

Multiple Futures

val response1 = Future { create1(item) }
val response2 = Future { create2(item) }
val response3 = Future { create3(item) }

val finalInfoTemp = for { 
  t <- response1
  g <- response2
  r <- response3
} yield (t, g, r)

val (t, g, r) = Await.result(finalInfoTemp, Duration.Inf)
val finalInfo = t ++ g ++ r

Note: Await.result blocks the thread, with a mandatory timeout duration. Blocking is discouraged when working with Futures and concurrency, to avoid potential deadlocks and to improve performance; block only if you really have to (it is useful for testing, for example). Instead, use callbacks or combinators to remain in the Future domain:

val finalInfoTemp: Future[Int] = for {
  a <- aFuture
  b <- bFuture
} yield a - b
finalInfoTemp.onSuccess {
  case x if x > 100 => // handle the (successful) result here without blocking
}
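
onSuccess only covers the successful case (and is deprecated in newer Scala versions); onComplete lets you handle failure in the same non-blocking style:

import scala.util.{Success, Failure}

finalInfoTemp.onComplete {
  case Success(x) => println(s"difference: $x")                      // stay in the Future domain
  case Failure(e) => println(s"computation failed: ${e.getMessage}") // no Await needed
}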

 

XML Parse by Scala

Recently I needed to build some social media collectors to crawl data from multiple sources, such as Twitter, Google News, RSS, etc. Twitter already has a nice Java library for getting data, named twitter4j, but for the other sources we need to parse the data ourselves. Most of them return XML, including Google News and RSS, so the remaining task is knowing how to parse XML. (Note: even though most feeds are XML, the detailed structure differs; some use a "channel" -> "item" structure while others use "entry" elements.)

  1. Load the data; here sourceURL is the XML/RSS link. For example, RSS/XML EXAMPLE

     import scala.xml.XML
     val xml = XML.load(sourceURL)

  2. Extract a tag

     val forecast = xml \ "channel" \ "item" \ "forecast"
     // if you don't know its full path, you can use this one instead
     val forecast = xml \\ "forecast"

  3. Extract an attribute under one tag (a small self-contained sketch follows this list)

     val url = xml \ "channel" \ "item" \ "forecast" \ "@url"
     // if you know the url attribute is unique, you can shorten it to
     val url = xml \\ "@url"
     // convert NodeSeq to String
     val urlString = (forecast \ "@url").text
     // obtain the label of the first matched node
     val forecastLabel = forecast.head.label
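
Putting the calls together, here is a small self-contained sketch; it parses an inline XML literal instead of a live RSS feed, and the element and attribute names are only illustrative:

import scala.xml.XML

val xml = XML.loadString(
  """<rss>
    |  <channel>
    |    <item>
    |      <forecast url="http://example.com/day1" day="Mon">Sunny</forecast>
    |      <forecast url="http://example.com/day2" day="Tue">Rain</forecast>
    |    </item>
    |  </channel>
    |</rss>""".stripMargin)

val forecasts = xml \ "channel" \ "item" \ "forecast"    // NodeSeq with two nodes
forecasts.foreach { f =>
  // print each node's url attribute and its text content
  println((f \ "@url").text + " -> " + f.text)
}
// http://example.com/day1 -> Sunny
// http://example.com/day2 -> Rain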

Read More:

http://alvinalexander.com/scala/xml-parsing-xpath-extract-xml-tag-attributes

http://alvinalexander.com/scala/how-to-extract-data-from-xml-nodes-in-scala

JVM Memory Management (1)

Last week we dealt with a lot of performance-related issues, but we didn't dig in and check why the problems happened. So I decided to write a post explaining JVM memory management. I know many blogs already talk about it; it is not a new topic, but I will write it down from my own understanding and show the commands I use to verify this knowledge.

1. Check Memory Usage

Before we try to understand what garbage collection, the young generation, etc. are, let's first look at our application's memory usage. You can use htop to see every process's memory usage; for a Java/Scala application you have more choices.

# get the java application pid
>> jcmd
>> jps -l
# force a Garbage Collection from the shell
>> jcmd <PID> GC.run
# check which instances cost the most memory
>> jmap -histo:live <PID> | head
>> jmap -histo:live <PID> | head -n 20
# check real-time memory usage status
>> jstat -gc <PID> 1000ms

2. Understand jstat output

Here we list my application jstat output:

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT   
61952.0 62976.0  0.0   32790.1 1968128.0 29322.0  2097152.0   24306.3   60888.0 60114.9 7808.0 7676.4     33    0.545  13      1.057    1.601

We explain each column’s meaning:

  • S0C/S1C: the current size of the Survivor0 and Survivor1 areas in KB
  • S0U/S1U: the current usage of the Survivor0 and Survivor1 areas in KB. Notice that one of the survivor areas is always empty. (See “Young Generation” for the reason.)
  • EC/EU: the current size and usage of the Eden space in KB. Note that EU keeps increasing, and as soon as it reaches EC, a Minor GC runs and EU drops.
  • OC/OU: the current size and current usage of the Old generation in KB.
  • MC/MU: the current size and current usage of Metaspace in KB (on Java 7 and earlier, jstat shows PC/PU for the Perm Gen instead); CCSC/CCSU: the compressed class space size and usage in KB.
  • YGC/YGCT: YGC displays the number of GC events in the young generation. YGCT displays the accumulated time (in seconds) of young generation GC operations. Notice that both increase in the same row where the EU value drops because of a minor GC. (See “Young Generation” for the reason.)
  • FGC/FGCT: FGC displays the number of Full GC events. FGCT displays the accumulated time of Full GC operations. Notice that the Full GC time is high compared to the young generation GC time.
  • GCT: total accumulated time of GC operations. Notice that it is the sum of the YGCT and FGCT values.

3. How to set JVM parameters

Here we explain why the S0C, S1C, EC, and OC values look the way they do above. There are several VM switches that control these values:

  • -Xms: For setting the initial heap size when JVM starts
  • -Xmx: For setting the maximum heap size
  • -Xmn: For setting the size of the Young Generation, rest of the space goes for Old Generation
  • -XX:PermSize: For setting the initial size of the Permanent Generation memory
  • -XX:MaxPermSize: For setting the maximum size of the Perm Gen (on Java 8 and later, Perm Gen is replaced by Metaspace, controlled by -XX:MetaspaceSize / -XX:MaxMetaspaceSize)
  • -XX:SurvivorRatio: For providing a ratio of Eden space and Survivor Space, for example, if Young Generation size is 10m and VM switch is -XX:SurvivorRatio=2 then 5m will be reserved for Eden Space and 2.5m each for both Survivor spaces. The default value is 8
  • -XX:NewRatio: For providing a ratio of old/new generation sizes. The default value is 2
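
For example, a typical combination of these switches on the command line (the values here are only illustrative, not recommendations):

>> java -Xms2g -Xmx2g -Xmn512m -Xss1m -XX:SurvivorRatio=8 -jar <your_app>.jar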

4. JVM Memory Usage

Most memory is used in the heap; outside the heap, memory is also consumed by the Metaspace and the stacks.

(1) Java Heap

The heap is where your class instantiations or objects are stored. Instance variables are stored in Objects. When discussing Java memory and optimization we most often discuss the heap because we have the most control over it and it is where Garbage Collection and GC optimizations take place. Heap size is controlled by the -Xms and -Xmx JVM flags.

(2) Java Stack

Each thread has its own call stack. The stack stores primitive local variables and object references, along with the call stack (method invocations) itself. The stack is cleaned up as stack frames move out of scope, so no GC is performed here. The -Xss JVM option controls how much memory gets allocated for each thread's stack.

(3) Metaspace

Metaspace stores the class definitions of your objects. Its initial size is controlled by -XX:MetaspaceSize, and its upper bound by -XX:MaxMetaspaceSize.

(4) Additional JVM Overhead

In addition to the above values, there is some memory consumed by the JVM itself. This holds the C libraries for the JVM and some C memory allocation overhead that it takes to run the rest of the memory pools above. This type of memory can be affected by Tuning glibc Memory Behavior.

5. JVM Memory Model

So far we know the status of our application, but we still don't know what Eden, Survivor, etc. are. Here we talk about how the JVM organizes memory, and then we will better understand how to optimize it. While reading this part, I suggest going back to parts 2 and 3 to map each concept to the real data output.

There are five JVM memory areas:

  • Eden
  • S0
  • S1
  • Old Memory
  • Perm

Eden + S0 + S1 === Young Gen (-Xmn)

[Diagram: Eden + S0 + S1 make up the Young Generation (-Xmn)]

Eden + S0 + S1 + Old Memory === JVM Heap (-Xms  -Xmx)

[Diagram: the Young Generation plus the Old Generation make up the JVM heap (-Xms, -Xmx)]

JVM heap memory is physically divided into two parts: the Young Generation and the Old Generation.

(1) Young Generation

The young generation is the place where all new objects are created. When the young generation fills up, garbage collection is performed; this collection is called Minor GC. The Young Generation is divided into three parts: Eden memory and two Survivor memory spaces.

  • Most of the newly created objects are located in the Eden Memory space. All new allocation happens in Eden. It only costs a pointer bump.
  • When Eden fills up, a stop-the-world Minor GC is performed and the surviving objects are copied into one of the survivor spaces; dead objects cost nothing to collect.
  • Minor GC also checks the objects already in a survivor space and moves them to the other survivor space, so at any time one of the survivor spaces is always empty.
  • Objects that survive many GC cycles are promoted to the old generation memory space. This is usually controlled by a threshold on the age of young generation objects before they become eligible for promotion.

Since Young Generation keeps short-lived objects, Minor GC is very fast and the application doesn’t get affected by this.

(2) Old Generation

Old Generation memory contains the objects that are long lived and survived after many rounds of Minor GC. Usually garbage collection is performed in Old Generation memory when it is full. Old Generation Garbage Collection is called Major GC and usually takes longer time. 

Major GC takes longer time because it checks all the live objects. Major GC should be minimized because it will make your application unresponsive for the garbage collection duration.

throughput collectors: -XX:+UseSerialGC -XX:+UseParallelGC -XX:+UseParallelOldGC

low-pause collectors: -XX:+UseConcMarkSweepGC -XX:+UseG1GC

6. Garbage Collection

All the Garbage Collections are “Stop the world” events because all application threads are stopped until the operation completes.

One of the best features of the Java programming language is automatic garbage collection. There are several JVM switches to select the garbage collection strategy for the application (I will not explain each one): Serial GC (-XX:+UseSerialGC), Parallel GC (-XX:+UseParallelGC), Parallel Old GC (-XX:+UseParallelOldGC), Concurrent Mark Sweep (CMS) Collector (-XX:+UseConcMarkSweepGC), and the G1 Garbage Collector (-XX:+UseG1GC).

7. How to optimize JVM parameters

After all this, it may look like the JVM's automatic garbage collection means we don't need to do anything. In fact, there is still some tuning we can do.

(1) java.lang.OutOfMemoryError: PermGen

increase the Perm Gen memory space using -XX:PermSize and -XX:MaxPermSize (or the Metaspace equivalents on Java 8+)

(2) a lot of Full GC operations

increase Old generation Memory space.

(3) java.lang.StackOverflowError

increase stack size by -Xss

(4) Good Practices

  • set the minimum -Xms and maximum -Xmx heap sizes to the same value
  • -Xmn value should be lower than the -Xmx value. 
  • the old generation size is the -Xmx value minus the -Xmn value. Generally, you don't want Eden to be too big, or the GC will take a long time to look through it for space that can be reclaimed.
  • keep the Eden size between one fourth and one third the maximum heap size. The old generation must be larger than the new generation. 

To summarize, there is no universal solution that fixes everything. When we meet a problem, we need to use the tools to find the root cause, dig into it, and then fix it.

Image Extraction From URL by Scala

There are few methods or posts about how to extract an image from a URL, and unfortunately I recently needed this feature for a company project. I searched a lot but found little. Some people provide Chrome plugins or code to obtain all images in a URL (which is quite simple: parse the URL's HTML and collect every img tag), but that is not what we want. I want the one main image of the URL, the image most likely to reflect the URL's whole content. We live in the internet era and do not lack information; in fact, we are already drowning in messages. If a few images can stand in for a lot of text, our reading speeds up. (Of course, if you prefer a slow life, just enjoy it.) So the purpose is to show a representative image before the user clicks the URL.

Idea:

I have already written down why we need this feature. The next step is to explain how we achieve it. We follow these steps:

  1. Obtain all img tags by parsing the URL's HTML.
  2. Filter out all known bad images, like logos, brands, icons, etc. (Nobody would use an icon to represent an article's content.)
  3. Filter out all images whose sizes are not qualified: too long, too wide, too small, etc. (A main image that describes the content should be large enough to stand on the page.)
  4. Obtain the remaining images' real sizes and apply step 3 again. (Sometimes the img tag does not carry width/height attributes, and then we need to read the real size from the image link.)
  5. Sort the remaining images by their real image area. (We assume that the larger the image, the more likely it is the main image.)
  6. One more filter: since I know the URL's main topic, the img's alt/description text can also be used as a measure when sorting.

To be honest, even with all this filtering and sorting, the main image is not found 100% of the time. You may also need to tune the parameters to get good accuracy.

Code:

package controllers

import java.io.IOException
import java.net.URL
import javax.imageio.ImageIO
import scala.collection.mutable.HashMap

import org.jsoup.nodes.{Document, Element}
import org.jsoup.Jsoup
import collection.JavaConversions._

class ImageResolverService {
  def checkUrl(url: String): Boolean = {
    var returnInfo = true
    val tempUrl = url.toLowerCase
    if (url == "" || tempUrl.contains("logo") || tempUrl.contains("icon") || tempUrl.contains("loading") ||
      tempUrl.contains(".gif") || tempUrl.contains("badge") || tempUrl.contains("1x1") ||
      tempUrl.contains("doubleclick") || tempUrl.contains("pixel") || tempUrl.contains("gravatar.com") ||
      tempUrl.contains("widget") || tempUrl.contains("spinner") || tempUrl.contains("feeds.feedburner.com") ||
      tempUrl.contains("/ads/") || tempUrl.contains("http://mcclatchy.112.2o7.net/") ||
      tempUrl.contains("http://ientry.rotator.hadj1.adjuggler.net/") || tempUrl.contains("g+.jpg")) {
      returnInfo = false
    }
    returnInfo
  }

  def getUrl(s: Element): String = {
    var returnInfo = ""
    if (s.attr("src") != "") {
      returnInfo = s.attr("src")
    } else {
      if (s.attr("data-src") != "") {
        returnInfo = s.attr("data-src")
      } else {
        if (s.attr("data-lazy-src") != "") {
          returnInfo = s.attr("data-lazy-src")
        } else {
          if (s.attr("data-original") != "") {
            returnInfo = s.attr("data-original")
          }
        }
      }
    }
    returnInfo
  }

  def fixUrl(url: String, domain: String): String = {
    var returnInfo = url
    if (!url.toLowerCase.startsWith("http")) {
      if (url.startsWith("/")) {
        if (url.startsWith("//")) {
          returnInfo = domain.split("//")(0) + url
        } else {
          returnInfo = domain + url
        }
      } else {
        returnInfo = domain + "/" + url
      }
      if (url.startsWith("../")) {
        returnInfo = domain + "/" + url.replace("../", "")
      }
    }
    returnInfo.replace(" ", "%20") // encode spaces so the URL can be opened later
  }

  def getSrcFromDoc(doc: Document, domain: String, item: String): String = {
    var srcMap = new HashMap[String, Int]
    val elementImages = doc.select("img").iterator().toList
    var src = ""
    elementImages.foreach{s =>
      var imageElement = getUrl(s)
      if (checkUrl(imageElement)) {
        imageElement = fixUrl(imageElement, domain)
        var w = 1
        var h = 1
        val widthAttr = s.attr("width")
        val heightAttr = s.attr("height")
        if (widthAttr != "") {
          if (widthAttr.toLowerCase.contains("px")) {
            w = widthAttr.toLowerCase.split("px")(0).toFloat.toInt
          } else {
            try {
              w = widthAttr.toFloat.toInt
            } catch {
              case e:Exception =>
            }
          }
        }
        if (heightAttr != "") {
          if (heightAttr.toLowerCase.contains("px")) {
            h = heightAttr.toLowerCase.split("px")(0).toFloat.toInt
          } else {
            try {
              h = heightAttr.toFloat.toInt
            } catch {
              case e:Exception =>
            }
          }
        }
        if ( w == 1 || h == 1 || (w > 128 && h > 128)) {
          try {
            imageElement = imageElement.replaceAll("""(?m)\s+$""", "")
            val imageUrl = new URL(imageElement)
            val image = ImageIO.read(imageUrl)
            if (image != null) {
              w = image.getWidth
              h = image.getHeight
            } else {
              w =  1
              h = 1
            }
            if (w/h <= 3 && h/w <=3 && w > 128 && h > 128) {
              if (s.attr("alt").toLowerCase.contains(item.toLowerCase)) {
                src = imageElement
              }
              srcMap += (imageElement -> w * h)
            }
          } catch {
            case e: IOException =>
          }
        }
      }
    }
    val srcMapSorted = srcMap.toList.sortBy{-_._2}

    if (srcMapSorted.nonEmpty && src == "") {
      src = srcMapSorted.head._1
    }
    src
  }

  def extract(url: String, item: String): String = {
    var src = ""
    if (url.startsWith("http")) {
      val domain = url.split("//")(0) + "//" + url.split("//")(1).split("/")(0)
      try {
        var res = Jsoup.connect(url).
        timeout(60000).ignoreHttpErrors(true).ignoreContentType(true).followRedirects(true).execute()
        if (res.statusCode() == 307) {
          val sNewUrl = res.header("Location")
          if (sNewUrl != null && sNewUrl.length() > 7)
            res = Jsoup.connect(sNewUrl).timeout(60000).execute()
        }
        val doc = res.parse()

        src = getSrcFromDoc(doc, domain, item)
      } catch {
        case e: IOException =>
      }
    }
    src
  }
}
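
Usage is a single call; a minimal sketch (the URL and topic below are only illustrative):

val resolver = new ImageResolverService
// returns the chosen main image URL, or "" when nothing qualifies
val mainImage = resolver.extract("http://example.com/some-article", "scala")
if (mainImage.nonEmpty) println(s"main image: $mainImage")
else println("no suitable image found")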