Image Extraction From a URL in Scala

Few articles or posts explain how to extract an image from a URL, and I recently needed exactly this feature for a project at my company. I searched in many places but found little. Some people offer Chrome plugins or code that collects every image on a page (this is quite simple: parse the page's HTML and find all the img tags), but that is not what we want. I want the one main image of the page, the image that best reflects the content behind the URL. We live in the internet era and do not lack information; in fact, we are already lost in too many messages. If a few images can stand in for a lot of text, we can get through it all faster. (Of course, if you prefer a slower pace of life, just enjoy it.) That is why we want to show an image before the user clicks the URL.

Idea:

I have explained why we need this feature. The next step is to explain how we achieve it. The logic follows these steps:

  1. Obtain all img tags by parsing the URL's HTML.
  2. Filter out known bad images such as logos, brand marks, and icons. (Nobody wants an icon to represent an article's content.)
  3. Filter out images whose declared size does not qualify: too tall, too wide, too small, and so on. (A main image needs a reasonable size to hold its place on the page.)
  4. Fetch the real size of the remaining images and apply step 3 again. (Sometimes the img tag has no width/height attributes, so we need to read the real data from the image link.)
  5. Sort the remaining images by their real area. (We assume that the larger the image, the more likely it is the main image.)
  6. One more signal: I know the URL's main topic, so the img's alt text (its description) can also be used when ranking.
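
Step 4 needs the real dimensions of an image. Here is a minimal sketch, assuming only the JDK's javax.imageio API, that reads the width and height from the image header instead of decoding the whole picture. The object name ImageDimensions and the helper readDimensions are hypothetical and are not part of the service shown later in this post.

```scala
import java.io.{IOException, InputStream}
import javax.imageio.ImageIO

object ImageDimensions {
  // Hypothetical helper: returns Some((width, height)) when a registered
  // ImageIO reader recognises the stream, otherwise None.
  def readDimensions(in: InputStream): Option[(Int, Int)] = {
    val stream = ImageIO.createImageInputStream(in)
    if (stream == null) {
      None
    } else {
      try {
        val readers = ImageIO.getImageReaders(stream)
        if (!readers.hasNext) {
          None // no reader understands this format
        } else {
          val reader = readers.next()
          try {
            reader.setInput(stream)
            // Index 0 is the first image stored in the file.
            Some((reader.getWidth(0), reader.getHeight(0)))
          } finally {
            reader.dispose()
          }
        }
      } catch {
        case _: IOException => None // truncated or unreadable data
      } finally {
        stream.close()
      }
    }
  }
}
```

In a crawler the input stream would typically come from something like new URL(src).openStream(), with the caller responsible for closing it; that wiring is left out here.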

To be honest, this does not find the right main image 100% of the time, even though we combine several filtering and sorting methods. You will also need to tune the parameters to get good results.
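
To make the tunable parameters explicit, the size filter (step 3) and the ranking (steps 5 and 6) can be sketched as pure functions. This is only an illustration under assumptions: Candidate, Limits, isQualified and pickMain are hypothetical names, the defaults are modeled on the 128-pixel and 3:1 values used in the code in this post, and the aspect check here is exact rather than based on integer division.

```scala
object MainImageRanking {
  // Hypothetical record: an image URL with its real pixel size and alt text.
  final case class Candidate(src: String, width: Int, height: Int, alt: String)

  // Tunable thresholds (assumed defaults: sides over 128 px, aspect ratio at most 3:1).
  final case class Limits(minSide: Int = 128, maxAspect: Int = 3)

  // True when the image is large enough and neither too wide nor too tall.
  def isQualified(c: Candidate, limits: Limits): Boolean =
    c.width > limits.minSide && c.height > limits.minSide &&
      c.width <= limits.maxAspect * c.height &&
      c.height <= limits.maxAspect * c.width

  // Prefer the first qualified image whose alt text mentions the topic;
  // otherwise fall back to the qualified image with the largest area.
  def pickMain(cs: Seq[Candidate], topic: String, limits: Limits = Limits()): Option[Candidate] = {
    val ok = cs.filter(c => isQualified(c, limits))
    val byTopic =
      if (topic.isEmpty) None
      else ok.find(_.alt.toLowerCase.contains(topic.toLowerCase))
    byTopic.orElse(ok.sortBy(c => -(c.width.toLong * c.height)).headOption)
  }
}
```

Keeping the thresholds in one small case class makes it easier to experiment with different values per site without touching the ranking logic.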

Code:

package controllers

import java.io.IOException
import java.net.URL
import javax.imageio.ImageIO
import scala.collection.mutable.HashMap

import org.jsoup.nodes.{Document, Element}
import org.jsoup.Jsoup
import collection.JavaConversions._

class ImageResolverService {
  // Substrings that mark an image URL as a logo, icon, tracker, ad, or other non-content image.
  private val blockedFragments = List(
    "logo", "icon", "loading", ".gif", "badge", "1x1", "doubleclick", "pixel",
    "gravatar.com", "widget", "spinner", "feeds.feedburner.com", "/ads/",
    "http://mcclatchy.112.2o7.net/", "http://ientry.rotator.hadj1.adjuggler.net/", "g+.jpg")

  // Returns true when the URL is non-empty and contains none of the blocked fragments.
  def checkUrl(url: String): Boolean = {
    val tempUrl = url.toLowerCase
    url.nonEmpty && !blockedFragments.exists(fragment => tempUrl.contains(fragment))
  }

  // Returns the first non-empty image source attribute, covering common lazy-loading attributes.
  def getUrl(s: Element): String =
    List("src", "data-src", "data-lazy-src", "data-original")
      .map(name => s.attr(name))
      .find(_ != "")
      .getOrElse("")

  def fixUrl(url: String, domain: String): String = {
    var returnInfo = url
    if (!url.toLowerCase.startsWith("http")) {
      if (url.startsWith("/")) {
        if (url.startsWith("//")) {
          returnInfo = domain.split("//")(0) + url
        } else {
          returnInfo = domain + url
        }
      } else {
        returnInfo = domain + "/" + url
      }
      if (url.startsWith("../")) {
        returnInfo = domain + "/" + url.replace("../", "")
      }
    }
    // Encode spaces in the resolved URL (the original call discarded its result).
    returnInfo.replace(" ", "%20")
  }

  def getSrcFromDoc(doc: Document, domain: String, item: String): String = {
    val srcMap = new HashMap[String, Int]
    val elementImages = doc.select("img").iterator().toList
    var src = ""
    elementImages.foreach{s =>
      var imageElement = getUrl(s)
      if (checkUrl(imageElement)) {
        imageElement = fixUrl(imageElement, domain)
        // Parse a width/height attribute such as "300", "300.5" or "300px".
        // Returns 1 (meaning "unknown") when the attribute is missing or not a number.
        def parseSize(attr: String): Int =
          try {
            attr.toLowerCase.split("px").headOption.getOrElse("").trim.toFloat.toInt
          } catch {
            case _: NumberFormatException => 1
          }
        var w = parseSize(s.attr("width"))
        var h = parseSize(s.attr("height"))
        if ( w == 1 || h == 1 || (w > 128 && h > 128)) {
          try {
            imageElement = imageElement.replaceAll("""(?m)\s+$""", "")
            val imageUrl = new URL(imageElement)
            val image = ImageIO.read(imageUrl)
            if (image != null) {
              w = image.getWidth
              h = image.getHeight
            } else {
              w =  1
              h = 1
            }
            if (w/h <= 3 && h/w <=3 && w > 128 && h > 128) {
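              // An empty topic would match every alt text, so only compare when a topic is given.
              if (item.nonEmpty && s.attr("alt").toLowerCase.contains(item.toLowerCase)) {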
                src = imageElement
              }
              srcMap += (imageElement -> w * h)
            }
          } catch {
            case e: IOException =>
          }
        }
      }
    }
    val srcMapSorted = srcMap.toList.sortBy{-_._2}

    if (srcMapSorted.nonEmpty && src == "") {
      src = srcMapSorted.head._1
    }
    src
  }

  def extract(url: String, item: String): String = {
    var src = ""
    if (url.startsWith("http")) {
      val domain = url.split("//")(0) + "//" + url.split("//")(1).split("/")(0)
      try {
        var res = Jsoup.connect(url).
        timeout(60000).ignoreHttpErrors(true).ignoreContentType(true).followRedirects(true).execute()
        if (res.statusCode() == 307) {
          val sNewUrl = res.header("Location")
          if (sNewUrl != null && sNewUrl.length() > 7)
            res = Jsoup.connect(sNewUrl).timeout(60000).execute()
        }
        val doc = res.parse()

        src = getSrcFromDoc(doc, domain, item)
      } catch {
        case e: IOException =>
      }
    }
    src
  }
}
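
As an aside on fixUrl: it joins relative paths to the site root by hand. An alternative worth considering is the JDK's java.net.URI, which resolves a reference against the full page URL following the standard rules. The sketch below is an assumption-laden illustration, not a drop-in replacement: UrlResolver and resolve are hypothetical names, and unlike fixUrl it resolves relative paths against the page's own path rather than the domain.

```scala
import java.net.{URI, URISyntaxException}

object UrlResolver {
  // Hypothetical helper: resolves `src` against the page URL.
  // Returns None when either string is not a syntactically valid URI.
  def resolve(pageUrl: String, src: String): Option[String] =
    try {
      // Spaces are not legal in a URI, so encode them before parsing.
      val cleaned = src.trim.replace(" ", "%20")
      Some(new URI(pageUrl).resolve(cleaned).toString)
    } catch {
      case _: URISyntaxException => None       // pageUrl could not be parsed
      case _: IllegalArgumentException => None // src could not be parsed
    }
}
```

This handles "../" segments, protocol-relative links such as "//cdn.example.com/x.png", and already absolute URLs without special cases.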