Using AWS Textract / Azure Computer Vision for Insurance OCR

The Enrapt tech blog hasn’t been updated in a while, since customer obligations have left me little time to sit down and write. Since the start of COVID-19, being home for long hours and days has given me a bit more time to reflect and write about some of the inquiries we get regarding the cloud.

This post focuses on using cloud services such as AWS Textract and Azure Computer Vision for OCR (Optical Character Recognition). In Japan, COVID-19 has greatly accelerated the move to paperless processes; however, many forms are still paper-based. These forms are typically scanned with a high-speed scanner and then passed through an OCR tool for various levels of form recognition.

The most common types of OCR requests we see are the following:

1. Form Recognition

Typically, each form carries an identifier that indicates what type of document it is. On most paper forms, this is text printed at the bottom of the page.
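As a rough illustration of how such an identifier could be used once the OCR text is available, here is a minimal sketch. The form IDs, the `FORM_TYPES` table, and the `identify_form` helper are all hypothetical; real identifiers depend on the insurer's own forms.

```python
import re
from typing import Dict, Optional, Sequence

# Hypothetical identifiers; real forms use whatever code the insurer prints.
FORM_TYPES: Dict[str, str] = {
    "PET-CLAIM-01": "pet insurance claim",
    "PET-APPLY-02": "pet insurance application",
}


def identify_form(
    ocr_lines: Sequence[str],
    form_types: Dict[str, str] = FORM_TYPES,
    footer_lines: int = 3,
) -> Optional[str]:
    """Look for a known form ID in the last few OCR lines of a page."""
    for line in reversed(list(ocr_lines)[-footer_lines:]):
        # OCR often inserts stray spaces, so compare on a compacted string.
        compact = re.sub(r"\s+", "", line).upper()
        for form_id, form_type in form_types.items():
            if form_id in compact:
                return form_type
    return None
```

Only the bottom lines are searched because that is where the identifier usually sits; a form that returns `None` would be routed to manual sorting.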

2. Handwriting Recognition

There are numerous types of handwriting recognition required on Japanese forms:

  • Date field recognition
  • Kanji/Kana field recognition
  • Telephone field recognition
  • AlphaNumeric field recognition
  • Circled Text field recognition

3. Pre-printed Text Recognition

Pre-printed text such as names and policy numbers is printed on the forms to simplify entry for customers and to help OCR systems recognize the text.

4. Hanko Recognition

Instead of signing, people in Japan typically carry a hanko, a small block that is used for personal identification in place of a signature. Several types of hanko are used in practice:

  • Corporate Hanko
  • Personal Hanko
  • Dated Hanko

Below is an example of a paper form for pet insurance at Anicom, with the different types of OCR needs highlighted.

Azure Computer Vision Evaluation

Scan Results with Azure Computer Vision (Recognize Printed Text v3.1)

Note: Recognize printed text v3.1 is the only API that currently supports Japanese.

All handwritten fields were ignored since Azure Recognize Printed Text does not support handwriting.
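For readers who want to reproduce this kind of scan, the sketch below shows one way to call the service and flatten the result. It assumes the operation referred to here is the `/vision/v3.1/ocr` REST endpoint with `language=ja`; the `call_ocr` and `ocr_lines` helper names are my own, and the endpoint and key come from your own Azure resource.

```python
import json
import urllib.request
from typing import Any, Dict, List


def call_ocr(endpoint: str, key: str, image_bytes: bytes) -> Dict[str, Any]:
    """POST a scanned image to the v3.1 OCR operation (assumed endpoint path)."""
    url = f"{endpoint.rstrip('/')}/vision/v3.1/ocr?language=ja&detectOrientation=true"
    request = urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def ocr_lines(result: Dict[str, Any]) -> List[str]:
    """Flatten the regions -> lines -> words structure into text lines."""
    lines = []
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            # Japanese has no spaces between words, so join without a separator.
            lines.append("".join(word["text"] for word in line.get("words", [])))
    return lines
```

Joining words without a separator suits Japanese text; for forms that mix in English, joining with a space may read better.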

Scan Results with Azure Computer Vision (Read v3.2)

Note: Read API v3.2 supports handwriting but does not support Japanese.

  • Date Field Results: 20 2) 4 A

Problems occurred because the handwritten field could not be restricted to numbers or to text, so “21” was converted to “2)”.

  • Telephone Field Results: 080 1234 $679

Since the telephone field could not be specified as numeric, “5” was converted to “$”.

  • Alphanumeric Field Results: 234567890

For some reason, the leading “1” of the handwritten field was not recognized.

  • Circled Text, Hanko, Kana, and Kanji are not supported. 

Azure Overall

The current version of Azure Computer Vision has no support for handwritten Japanese; however, it does work for handwritten numbers and letters. Without the ability to specify which fields may contain numbers or characters, the error rate for handwritten text is very high.
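Since the field type cannot be given to the service, one workaround is to apply it afterwards. The sketch below is a minimal example of that idea, not something the OCR service offers: for a field known to be numeric, it swaps look-alike characters back to digits. The “$” and “)” entries come from the results above; the remaining entries in `CONFUSIONS` and the `coerce_numeric` name are illustrative assumptions.

```python
from typing import Dict, Optional

# "$" -> "5" and ")" -> "1" were observed above; the rest are assumed look-alikes.
CONFUSIONS: Dict[str, str] = {"$": "5", ")": "1", "|": "1", "O": "0", "l": "1"}


def coerce_numeric(raw: str, confusions: Dict[str, str] = CONFUSIONS) -> Optional[str]:
    """Map look-alike characters to digits for a field known to be numeric.

    Returns the digit string, or None if non-digit characters remain.
    """
    mapped = "".join(confusions.get(ch, ch) for ch in raw if not ch.isspace())
    if mapped and mapped.isascii() and mapped.isdigit():
        return mapped
    return None


print(coerce_numeric("080 1234 $679"))  # telephone result from the Read API
```

This only makes sense for fields where letters and symbols can never be valid input. A `None` result means the value should go to manual review rather than be guessed.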

Amazon Textract Evaluation

Scan Results with Amazon Textract

Textract does not support Japanese text recognition, so the only evaluation point was its ability to detect handwritten numbers and letters.
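A scan like this can be run with the `detect_document_text` operation in boto3. The following is a minimal sketch rather than the exact script used for this evaluation; the helper names and the `min_confidence` threshold are my own, and AWS credentials and region are assumed to be configured already.

```python
from typing import Any, Dict, List, Tuple


def extract_lines(
    response: Dict[str, Any], min_confidence: float = 0.0
) -> List[Tuple[str, float]]:
    """Return (text, confidence) for each LINE block in a Textract response."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def detect_lines(image_bytes: bytes, min_confidence: float = 0.0) -> List[Tuple[str, float]]:
    """Send one scanned page to Textract and return its detected lines."""
    import boto3  # imported here so extract_lines can be used without boto3

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return extract_lines(response, min_confidence)
```

Keeping the confidence score next to each line makes it easy to send low-confidence fields to a human for checking.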

  • Date Field Results: 20 21 # 4 A 12

The numbers were read correctly; however, the Japanese characters were converted into unrelated text such as “#” or “A”.
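Because the digits themselves come through, the date can still be recovered in post-processing. The sketch below is one heuristic, assuming the stray characters stand where the Japanese date markers were and that the field is laid out as year, month, day; `parse_garbled_date` is a hypothetical helper name.

```python
import re
from datetime import date
from typing import Optional


def parse_garbled_date(raw: str) -> Optional[date]:
    """Treat any non-digit, non-space run as a separator between date parts."""
    segments = re.split(r"[^\d\s]+", raw)
    # Remove the spaces OCR inserted inside each part, e.g. "20 21" -> "2021".
    groups = ["".join(segment.split()) for segment in segments]
    groups = [group for group in groups if group]
    if len(groups) != 3 or len(groups[0]) != 4:
        return None
    try:
        year, month, day = (int(group) for group in groups)
        return date(year, month, day)
    except ValueError:
        return None


print(parse_garbled_date("20 21 # 4 A 12"))
```

Anything that does not split into a four-digit year, a month, and a day returns `None`, which is the case for the Azure result “20 2) 4 A”.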

  • Telephone Field Results: 080 – 1234 – 5679

Recognition of the handwritten telephone number was very accurate. Even the dashes in the telephone number were recognized.
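The recognized value still contains spaces and typographic dashes, so a small clean-up step helps before storing it. This sketch assumes an 11-digit mobile number in the 3-4-4 layout seen in the sample; landline numbers would need a different rule, and `normalize_mobile_number` is a hypothetical helper name.

```python
import re
from typing import Optional

# ASCII hyphen plus dash-like characters OCR may return in its place.
_DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2212\u30fc\uff0d"


def normalize_mobile_number(raw: str) -> Optional[str]:
    """Strip spaces and dashes, then reformat 11 digits as 3-4-4."""
    cleaned = "".join(ch for ch in raw if not ch.isspace() and ch not in _DASHES)
    if not re.fullmatch(r"0[0-9]{10}", cleaned):
        return None
    return f"{cleaned[:3]}-{cleaned[3:7]}-{cleaned[7:]}"
```

A value that still contains a symbol after clean-up, such as the “$” in the Azure result, fails the check and returns `None` instead of being silently accepted.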

  • Alphanumeric Field Results: |23456789

Since the alphanumeric field could not be specified as numeric, “1” was incorrectly converted to “|”.

  • Circled Text, Hanko, Kana, and Kanji are not supported. 

Amazon Overall

For handwritten text, the current version of Amazon Textract appears to be superior to Azure Computer Vision. However, because it supports neither printed nor handwritten Japanese, its use is still quite limited.

Conclusions

When using cloud-based OCR solutions for Japanese forms, the limitations of each service must be taken into account for the specific OCR use case. For printed Japanese text, Azure Computer Vision works quite well. For handwritten Japanese text, neither Azure nor Amazon currently has an offering. For handwritten numbers and English characters, Amazon seems to provide the best recognition, based on my handwriting.

Note: Using other handwriting samples may produce different results.
