Project Oxford – Image Text Detection

Microsoft has a new set of services, called Project Oxford, which use machine learning to extract information from images and speech. Each service has a public-facing REST API, allowing any program capable of communicating over HTTP to use it, and for smaller amounts of data, Project Oxford is free to use.

Microsoft classifies these services into three categories: vision, speech, and language. The vision section contains several different capabilities, including facial recognition, facial emotion detection, video stabilization, and optical character recognition (OCR). Its OCR API allows a system to send an image URL and returns the text it detected. Since the service is system agnostic, it's possible to create an HTML page (with a little help from a service like Imgur) which can upload an image and extract the text without using any server-side resources. (The completed example can be found here.)

How to Create It

Uploading the Image

The first step is uploading the image to a publicly accessible store. Imgur works well for this because it's easy to use and free. It requires an account and registering the "application", but this is relatively quick and painless. The page to do so is found here, and with the API key, all that is left is to retrieve the image from the input and upload the file.

Uploading the File

HTML5 has the file input type. To upload the file when the user selects it, add an onchange event to the input.
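The markup might look like the sketch below (the id and handler name are placeholders):

```html
<!-- minimal sketch: the id and handler name are invented for this example -->
<input type="file" id="imageInput" onchange="uploadImage()" />
```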

Now the function uploadImage fires whenever the user selects a new file.

With the help of a FileReader, the webpage can load the image into memory to send to the server. FileReader has several methods for reading files, such as readAsText and readAsBinaryString, but the one that allows the image to upload correctly is readAsDataURL, since it produces the image in the required base64-encoded format.

Reading the file is an asynchronous operation and requires a callback to run when loading finishes. The FileReader object has an onload property whose handler receives an event parameter, but the parameter isn't necessary in this context: the reader's result is available through closure, so omitting it is fine.
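A minimal sketch of the handler, assuming an input with the hypothetical id "imageInput" and a sendToImgur function that takes the data URL to the next step:

```javascript
// Sketch: read the selected file as a data URL.
// "imageInput" and sendToImgur are placeholder names, not part of any API.
function uploadImage() {
    var file = document.getElementById("imageInput").files[0];
    if (!file) {
        return;
    }

    var reader = new FileReader();
    // onload's handler takes an event parameter, but it's omitted here:
    // reader.result is available through closure.
    reader.onload = function () {
        sendToImgur(reader.result);
    };
    reader.readAsDataURL(file);
}
```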

Sending the File to Imgur

Whether using jQuery, another framework, or plain JavaScript, the upload process is straightforward. This example uses jQuery.

The Imgur API exposes the https://api.imgur.com/3/image URL; to upload an image, use POST and add the HTTP header Authorization with the client ID: Client-ID application_client_id. The trickiest part concerns the file contents from the FileReader. readAsDataURL returns more data than necessary, and the contents look like `data:image/png;base64,iVBORw0KGgoAAAANSUhEUg…`

The comma and everything before it is extraneous and will cause an error, so it's necessary to remove it. Calling split(',') on the string and taking the second item in the resulting array is the easiest approach.
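A sketch of the upload, assuming jQuery and a registered client ID (application_client_id is a placeholder). The prefix-stripping helper is pure, so it's easy to verify on its own:

```javascript
// Strip everything up to and including the first comma from a data URL,
// leaving only the base64 payload Imgur expects.
function stripDataUrlPrefix(dataUrl) {
    return dataUrl.split(",")[1];
}

// Sketch: POST the base64 image data to Imgur.
// application_client_id is a placeholder for the registered client ID.
function sendToImgur(dataUrl) {
    $.ajax({
        url: "https://api.imgur.com/3/image",
        type: "POST",
        headers: { Authorization: "Client-ID application_client_id" },
        data: { image: stripDataUrlPrefix(dataUrl) },
        success: function (response) {
            // response.data.link and response.data.deletehash are the
            // pieces used in the following steps.
            console.log(response.data.link);
        }
    });
}
```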

Posting the image to Imgur returns several pieces of data about the object (height, width, size, etc.), but the key pieces are link, the URL of the uploaded image, and deletehash, which is used to remove the image once finished. (The API documentation says the id can be used in some cases, but since this is an anonymous post, the deletehash is necessary.)
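An abridged sketch of the response shape (the values here are invented, and real responses contain more fields):

```javascript
// Abridged sketch of an Imgur upload response; values are invented.
var imgurResponse = {
    success: true,
    status: 200,
    data: {
        id: "abc123",
        link: "https://i.imgur.com/abc123.png",   // URL of the image
        deletehash: "XYZdeletehash",              // needed to remove it later
        width: 640,
        height: 480,
        size: 12345
    }
};
```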

Data OCR

Project Oxford requires a login to grant the necessary API key. After signing up and logging in, the service portal lists a page with access to the various API keys (there is one for each service). The one necessary for OCR looks like:

(Screenshot: the key dashboard showing the API key for OCR.)

To retrieve the text information from the image, POST to https://api.projectoxford.ai/vision/v1/ocr. The request requires the HTTP header Ocp-Apim-Subscription-Key, with the value being the key retrieved from the key dashboard. The request body contains the image URL, and there are two optional query-string parameters the request can have: language and detectOrientation.
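A sketch of building the request settings for jQuery's $.ajax. The function name is invented; the URL, header, and parameters are the ones described above:

```javascript
// Sketch: build the jQuery ajax settings for an OCR request.
// buildOcrRequest is a made-up helper name; subscriptionKey comes from
// the Project Oxford key dashboard, and language is optional.
function buildOcrRequest(imageUrl, subscriptionKey, language) {
    var url = "https://api.projectoxford.ai/vision/v1/ocr?detectOrientation=true";
    if (language) {
        url += "&language=" + encodeURIComponent(language);
    }
    return {
        url: url,
        type: "POST",
        headers: {
            "Ocp-Apim-Subscription-Key": subscriptionKey,
            "Content-Type": "application/json"
        },
        data: JSON.stringify({ url: imageUrl })
    };
}

// Usage: $.ajax(buildOcrRequest(link, key, "en")).done(handleOcrResult);
```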

language

The language parameter specifies which language to use when performing the OCR. If it's not provided, the service attempts a determination based on the image and returns what it thinks is correct. This can be helpful if the language is unknown and the goal of the OCR process is to extract the text and then translate it. (Unfortunately, Microsoft's Translator API does not allow Cross-Origin Resource Sharing when retrieving the authentication token, which rules out building a translation step without any server-side processing.) Omitting the parameter leaves the service to guess, which can produce incorrect results.

The parameter is a BCP-47 language code, which follows ISO 639-1 (a lowercase two-character code representing each language, e.g. en for English, es for Spanish). At this time the API only supports about 20 languages, but Microsoft has said it is working to expand the list. The API documentation lists the supported languages under the language parameter.

detectOrientation

This is a boolean parameter indicating whether the service should attempt to recognize the text's rotation and orient the results to be parallel with the top of the image's bounding box. Sometimes this helps with how the returned text is grouped.

Response

The service returns an object looking like this:
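An abridged example of the shape (the field names are the ones described below; the values here are invented):

```json
{
  "language": "en",
  "textAngle": -45.0,
  "orientation": "Right",
  "regions": [
    {
      "boundingBox": "21,16,304,451",
      "lines": [
        {
          "boundingBox": "28,16,288,41",
          "words": [
            { "boundingBox": "28,16,118,41", "text": "Hello" }
          ]
        }
      ]
    }
  ]
}
```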

language

The language property is the service's determination of the language in the image; if there are multiple languages in the text, it still returns only one. The service makes its best guess and won't necessarily pick the first language it encounters.

TextAngle

TextAngle is the tilt the service detects in the text, and orientation is the direction the top of the text is facing. If the text is facing 45 degrees to the right, the text angle would be "-45.000" and the orientation would be "Right".

Regions

The text found in the image is not lumped together when returned; there are separate entries for each section of text found. In an example image where Hola and Hello appear in two different areas, two regions are returned in the array. Each region has a boundingBox property, a list of comma-separated coordinates locating the text region on the page. Each region also has a lines property, an array of objects each with its own boundingBox, and each line has a words property, an object array separating each word and also containing a boundingBox property.
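Walking the regions/lines/words hierarchy to recover plain text can be sketched as follows (the function name is invented):

```javascript
// Sketch: flatten the OCR result's regions/lines/words hierarchy
// into one plain-text string per region.
function extractRegionText(ocrResult) {
    return ocrResult.regions.map(function (region) {
        return region.lines.map(function (line) {
            return line.words.map(function (word) {
                return word.text;
            }).join(" ");   // words on a line are space-separated
        }).join("\n");      // lines in a region are newline-separated
    });
}
```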

Removing the Image

Removing the image is similar to the initial POST action. This time the action is DELETE, and the URL has the deletehash appended to the end.
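A sketch of building that request (the helper name is invented; the client ID is the same one used for the upload):

```javascript
// Sketch: build the jQuery ajax settings to remove the uploaded image.
// buildImgurDelete is a made-up helper name.
function buildImgurDelete(deletehash, clientId) {
    return {
        url: "https://api.imgur.com/3/image/" + deletehash,
        type: "DELETE",
        headers: { Authorization: "Client-ID " + clientId }
    };
}

// Usage: $.ajax(buildImgurDelete(deletehash, "application_client_id"));
```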

The JavaScript code can be found here.