Monthly Archives: May 2016

Text Extraction From URL by Scala

In this post, we talk about how to extract text from URL. Please note, we not involve special pages (e.g. Facebook posts, Facebook comments, etc) into this talk. But in another post, I will write a solution for Facebook posts extracting.

We know there are many types for a url, like text/*, application/xml, application/xhtml+xml, application/pdf, image, etc. In this post, we only support the types which we list.

There are three parts for the code snippet, one is for  text/*, application/xml, application/xhtml+xml, see Case 1, the other is for application/pdf, see Case 2. The other is for image, like png, jpg, etc. see Case 3.

Case 1:

import org.jsoup.Jsoup
val doc = Jsoup.connect(<your_url>).get()
getTextByDoc(doc)

According to Jsoup, we get doc, but within the doc, there are many useless elements, like footer, header, etc. In fact, we don’t need them, we just want to obtain pure meaningful content. So here we do some filters. Please note, we all know we can’t filter all, because we don’t know which part is useful, which is not. What we can do is to try all our best to remove common known useless parts. 

import org.jsoup.nodes.Document
private def getTextByDoc(doc: Document): String = {
  doc.head().remove()
  doc.getElementsByTag("header").remove()
  doc.getElementsByTag("footer").remove()
  doc.getElementsByTag("form").remove()
  doc.getElementsByTag("table").remove()
  doc.getElementsByTag("meta").remove()
  doc.getElementsByTag("img").remove()
  doc.getElementsByTag("a").remove()
  doc.getElementsByTag("br").remove()

  doc.getElementsByClass("tags").remove()
  doc.getElementsByClass("copyright").remove()
  doc.getElementsByClass("widget").remove()

  doc.select("div[class*=foot").remove()
  doc.select("div[class*=tag").remove()
  doc.select("div[class*=Loading").remove()
  doc.select("div[class*=Widget").remove()
  doc.select("div[class*=Head").remove()
  doc.select("div[class*=menu").remove()
  doc.select("p[class*=link").remove()

  val paragraphs = doc.select("p")
  val divs = doc.select("div")

  paragraphs.text() + divs.text()
}

Case 2:

For pdf url, it is a little complex. First we need to get its content type to make sure it is “application/pdf” and then we create a local temporary file and then to extract local pdf to obtain pure text. Finally, we delete this temporary file.

import java.io.File
import java.net.URL
val url = new URL(<your_url>)
val conn = url.openConnection()
val contentType = conn.getContentType

contentType match {
  case "application/pdf" =>
    val fileName = Random.alphanumeric.take(5).mkString + ".pdf"
    url #> new File(fileName) !!
    val texts = getTextFromPDF(None, None, fileName)
    val of = new File(fileName)
    of.delete()
    texts
  case _ => None
}

Here is to extract text from local pdf file. Here because I don’t know its start page and end page, I just skip it. By default, it will fetch all.

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripper
private def getTextFromPDF(startPage: Option[Int], endPage: Option[Int], fileName: String): Option[String] = {
  try {
    val pdf = PDDocument.load(new File(fileName))
    val stripper = new PDFTextStripper()
    startPage match {
      case Some(startInt) => stripper.setStartPage(startInt)
      case None =>
    }
    endPage match {
      case Some(endInt) => stripper.setEndPage(endInt)
      case None =>
    }
    Some(stripper.getText(pdf))
  } catch {
    case e: Throwable => None
  }
}

Case 3:

For image, it involves into a new technology, named ‘OCR’ which can help to parse image’s content. So we need a java-ocr-api into system.

Step1:

In build.sbt to add one line to add dependence.

libraryDependencies += "com.asprise.ocr" % "java-ocr-api" % "[15,)"

Step2:

To import library:

import com.asprise.ocr.Ocr

Step3:

Here is the code snippet to show how to implement it. Please note: here <your_file> is a File type. If you only have fileName/filePath, you need to use new File(<file_name>) to convert it. 

try {
  // Image
  Ocr.setUp()
  val ocr = new Ocr
  ocr.startEngine("eng", Ocr.SPEED_FASTEST)
  val files = List(<your_file>)
  val outputString = ocr.recognize(files.toArray, Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT)
  ocr.stopEngine()
  Some(outputString)
} catch {
  case e: Exception => None // todo: to support multiple file types
}

Play Framework(12)-template engine

Concept:

Templates are complied as standard Scala functions, following a simple naming convention. If you create a views/Application/index.scala.html template file, it will generate a views.html.Application.index class that has an apply() method.

Special Character :

Scala template uses @ as the single special character. Every time this character is encountered, it indicates the beginning of a dynamic statement. If you want to insert a multi-token statement, explicitly mark it using brackets/curly brackets. Because @ is a special character, you’ll sometimes need to escape it by using @@.

Make sure that { is on the same line with for to indicate that expression continues to next line.

Tips:

A template is like a function, so it needs parameters, which must be declared at the top of the template file.

You can write server side block comments in templates using @@.

Dynamic content parts are escaped according to the template type’s (e.g. HTML or XML) rules. If you want to output a raw content fragment, wrap it in the template content type.

 

MySQL(3)-Joins

  • left join = A
    select <select_list> from tableA A left join tableB B on A.key=B.key
  • inner join = (common part between A and B)
    select <select_list> from tableA A inner join tableB B on A.key=B.key
  • right join = B
    select <select_list> from tableA A right join tableB B on A.key=B.key
  • A – (common part between A and B)
    select <select_list> from tableA A left join tableB B on A.key=B.key 
    where B.key is NULL
  • B – (common part between A and B)
    select <select_list> from tableA A right join tableB B on A.key=B.key
    where A.key is NULL
  • A + B
    select <select_list> from tableA A full outer join tableB B on 
    A.key=B.key
  • A + B – (common part between A and B)
    select <select_list> from tableA A full outer join tableB B on 
    A.key=B.key where A.key is NULL or B.key is NULL