Text Extraction From a URL With Scala

In this post, we look at how to extract text from a URL. Note that special pages (e.g. Facebook posts, Facebook comments, etc.) are not covered here; I will write up a solution for extracting Facebook posts in another post.

A URL can serve many content types, such as text/*, application/xml, application/xhtml+xml, application/pdf, images, and so on. This post supports only the types listed here.

The code comes in three parts: Case 1 handles text/*, application/xml and application/xhtml+xml; Case 2 handles application/pdf; Case 3 handles images such as png and jpg.
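To make the split into three cases concrete, here is a minimal sketch of how a content type string could be mapped to one of them. The names Source, HtmlLike, Pdf, Image, Unsupported and ContentTypes.classify are hypothetical and are not used elsewhere in this post; the sketch also assumes that any parameters after a semicolon (for example a charset) should be ignored.

```scala
import java.util.Locale

// Hypothetical labels for the three cases covered in this post
sealed trait Source
case object HtmlLike extends Source    // Case 1: text/*, application/xml, application/xhtml+xml
case object Pdf extends Source         // Case 2: application/pdf
case object Image extends Source       // Case 3: image/*
case object Unsupported extends Source // everything else

object ContentTypes {
  def classify(contentType: String): Source = {
    // Option(...) guards against null, which URLConnection.getContentType may return.
    // Everything from the first ';' on (e.g. "; charset=UTF-8") is dropped.
    val mime = Option(contentType)
      .map(_.takeWhile(_ != ';').trim.toLowerCase(Locale.ROOT))
      .getOrElse("")
    mime match {
      case m if m.startsWith("text/")                  => HtmlLike
      case "application/xml" | "application/xhtml+xml" => HtmlLike
      case "application/pdf"                           => Pdf
      case m if m.startsWith("image/")                 => Image
      case _                                           => Unsupported
    }
  }
}
```

With this in place, the result of ContentTypes.classify(conn.getContentType) could decide which of the three cases to run.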

Case 1:

import org.jsoup.Jsoup
val doc = Jsoup.connect(<your_url>).get()
getTextByDoc(doc)

Jsoup gives us a doc, but the doc contains many elements that carry no useful content, such as the footer and header. We don’t need them; we only want the meaningful text. So we apply some filters. Note that we can’t filter everything, because we can’t know in general which parts are useful and which are not. What we can do is try our best to remove the commonly known useless parts.

import org.jsoup.nodes.Document
private def getTextByDoc(doc: Document): String = {
  doc.head().remove()
  doc.getElementsByTag("header").remove()
  doc.getElementsByTag("footer").remove()
  doc.getElementsByTag("form").remove()
  doc.getElementsByTag("table").remove()
  doc.getElementsByTag("meta").remove()
  doc.getElementsByTag("img").remove()
  doc.getElementsByTag("a").remove()
  doc.getElementsByTag("br").remove()

  doc.getElementsByClass("tags").remove()
  doc.getElementsByClass("copyright").remove()
  doc.getElementsByClass("widget").remove()

  // Attribute selectors need a closing bracket
  doc.select("div[class*=foot]").remove()
  doc.select("div[class*=tag]").remove()
  doc.select("div[class*=Loading]").remove()
  doc.select("div[class*=Widget]").remove()
  doc.select("div[class*=Head]").remove()
  doc.select("div[class*=menu]").remove()
  doc.select("p[class*=link]").remove()

  val paragraphs = doc.select("p")
  val divs = doc.select("div")

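  // Separate the two parts with a space so their words do not run together.
  // Note: text of a <p> nested inside a <div> appears in both parts.
  (paragraphs.text() + " " + divs.text()).trim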
}
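To see the filtering idea on something small, here is a minimal sketch that applies a few of the same removals to an inline HTML string instead of a live URL. The sample page and the FilterDemo object are made up for illustration; only Jsoup.parse, select, getElementsByTag, remove and text are Jsoup calls.

```scala
import org.jsoup.Jsoup

object FilterDemo {
  // Hypothetical sample page, used only for illustration
  val html: String =
    """<html><head><title>Demo</title></head>
      |<body>
      |  <header>Site header</header>
      |  <div class="menu-bar">Home | About</div>
      |  <p>Main paragraph.</p>
      |  <footer>Copyright notice</footer>
      |</body></html>""".stripMargin

  def extract(): String = {
    val doc = Jsoup.parse(html)
    doc.head().remove()                     // drops <title> and other head content
    doc.getElementsByTag("header").remove() // drops "Site header"
    doc.getElementsByTag("footer").remove() // drops "Copyright notice"
    doc.select("div[class*=menu]").remove() // drops the div whose class contains "menu"
    doc.select("p").text()                  // only the paragraph text remains
  }
}
```

Here FilterDemo.extract() returns "Main paragraph."; on real pages the result depends heavily on how the site names its tags and classes.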

Case 2:

A PDF URL is a little more complex. First we check that the content type is “application/pdf”, then we download the PDF to a local temporary file and extract the plain text from it. Finally, we delete the temporary file.

import java.io.File
import java.net.URL
import scala.sys.process._ // provides `url #> file`
import scala.util.Random
val url = new URL(<your_url>)
val conn = url.openConnection()
val contentType = conn.getContentType

contentType match {
  case "application/pdf" =>
    // Download to a randomly named temporary file in the working directory
    val fileName = Random.alphanumeric.take(5).mkString + ".pdf"
    val file = new File(fileName)
    try {
      (url #> file).!!
      getTextFromPDF(None, None, fileName)
    } finally {
      file.delete() // always clean up the temporary file
    }
  case _ => None
}
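As an alternative to a random file name in the working directory, the download-use-delete steps can be sketched with the JDK's own temporary file support. This is only a sketch under assumptions: TempDownload and withDownloadedFile are hypothetical names, and the helper reads the URL with openStream rather than scala.sys.process.

```scala
import java.io.File
import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

object TempDownload {
  // Downloads `url` to a temporary file, passes the file to `use`,
  // and deletes the file afterwards even if `use` throws.
  def withDownloadedFile[A](url: URL, suffix: String)(use: File => A): A = {
    val tmp = Files.createTempFile("extract-", suffix)
    try {
      val in = url.openStream()
      try Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
      use(tmp.toFile)
    } finally {
      Files.deleteIfExists(tmp)
    }
  }
}
```

For the PDF case it could be called as TempDownload.withDownloadedFile(url, ".pdf")(f => getTextFromPDF(None, None, f.getPath)).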

Next we extract the text from the local PDF file. Because I don’t know the start page and end page, I leave them unset; by default, all pages are extracted.

import java.io.File
import scala.util.control.NonFatal
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 1.x location; in PDFBox 2.x the class lives in org.apache.pdfbox.text
import org.apache.pdfbox.util.PDFTextStripper
private def getTextFromPDF(startPage: Option[Int], endPage: Option[Int], fileName: String): Option[String] = {
  try {
    val pdf = PDDocument.load(new File(fileName))
    try {
      val stripper = new PDFTextStripper()
      // Page numbers are 1-based; with no bounds set, all pages are extracted
      startPage.foreach(p => stripper.setStartPage(p))
      endPage.foreach(p => stripper.setEndPage(p))
      Some(stripper.getText(pdf))
    } finally {
      pdf.close() // release the document even if extraction fails
    }
  } catch {
    case NonFatal(_) => None
  }
}

Case 3:

Images require a different technology, OCR (optical character recognition), which reads the text in an image. So we need to add java-ocr-api to the project.

Step 1:

Add one line to build.sbt to declare the dependency.

libraryDependencies += "com.asprise.ocr" % "java-ocr-api" % "[15,)"

Step 2:

Import the library:

import com.asprise.ocr.Ocr

Step 3:

Here is a code snippet showing how to use it. Note that <your_file> is a File. If you only have a file name or path, wrap it with new File(<file_name>).

try {
  // Image
  Ocr.setUp()
  val ocr = new Ocr
  ocr.startEngine("eng", Ocr.SPEED_FASTEST)
  val files = List(<your_file>)
  val outputString = ocr.recognize(files.toArray, Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT)
  ocr.stopEngine()
  Some(outputString)
} catch {
  case e: Exception => None // todo: to support multiple file types
}