Using AWS Textract / Azure Computer Vision for Insurance OCR

The Enrapt tech blog hasn’t been updated for a while because customer obligations have left me little time to sit down and write. Since the start of COVID-19, long hours and days at home have given me a bit more time to reflect and write about some of the cloud-related inquiries we receive.

This post focuses on using cloud services such as AWS Textract and Azure Computer Vision for OCR (Optical Character Recognition). In Japan, COVID-19 has greatly accelerated the move to paperless processes, but many forms are still paper-based. Paper forms are typically scanned with a high-speed scanner and then passed through an OCR tool for various levels of form recognition.

The most common types of OCR requests we see are the following:

1. Form Recognition

Typically, each form carries an identifier that shows what type of document it is. On most paper forms, this is text printed at the bottom of the page.
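As a rough illustration of this idea, the sketch below picks a form type out of OCR output by looking for an identifier near the bottom of the page. The identifier format (`FORM-` followed by three digits), the `FORM_TYPES` table, and the `(text, y)` representation of a recognized line are all hypothetical; real forms use their own codes, and each OCR service returns positions in its own format.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical identifiers and form types, for illustration only.
FORM_TYPES = {
    "FORM-001": "claim request",
    "FORM-002": "address change",
}

# Hypothetical identifier pattern: "FORM-" followed by three digits.
FORM_ID_PATTERN = re.compile(r"FORM-\d{3}")


def identify_form(
    lines: List[Tuple[str, float]],
    page_height: float,
    bottom_fraction: float = 0.2,
) -> Optional[str]:
    """Return the form type named by an identifier in the bottom part of the page.

    `lines` holds (text, y) pairs, where y is the distance of the line from
    the top of the page in the same units as `page_height`.
    """
    cutoff = page_height * (1.0 - bottom_fraction)
    for text, y in lines:
        if y < cutoff:
            continue  # ignore anything above the bottom strip
        match = FORM_ID_PATTERN.search(text)
        if match and match.group(0) in FORM_TYPES:
            return FORM_TYPES[match.group(0)]
    return None
```

Restricting the search to the bottom strip avoids picking up an identifier-like string that happens to appear in the body of the form.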

2. Handwriting Recognition

There are numerous types of handwriting recognition required on Japanese forms:

Date field recognition
Kanji/Kana field recognition
Telephone field recognition
AlphaNumeric field recognition
Circled Text field recognition

3. Pre-printed Text Recognition

Pre-printed text such as names and policy numbers is printed on the forms to simplify entry for customers and to help OCR systems recognize the text.

4. Hanko Recognition

Instead of signing, Japanese people typically use a hanko, a personal seal that serves as identification in place of a signature. There are several types of hanko used in practice:

Corporate Hanko
Personal Hanko
Dated Hanko

Below is an example of a paper form for pet insurance at Anicom, with the different types of OCR needs highlighted.

Azure Computer Vision Evaluation

Scan Results with Azure Computer Vision (Recognize Printed Text v3.1)

Note: Recognize Printed Text v3.1 is the only API that currently supports Japanese.

All handwritten fields were ignored since Azure Recognize Printed Text does not support handwriting.
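For readers who want to reproduce this kind of scan, here is a minimal sketch of a call to the v3.1 OCR endpoint with the language set to Japanese, plus a helper that flattens the response into lines. It assumes the REST path `/vision/v3.1/ocr` and the `regions` / `lines` / `words` response layout; the function names are mine, and joining words without spaces is a choice made for Japanese text rather than anything the API prescribes.

```python
import json
import urllib.request
from typing import List


def build_ocr_request(endpoint: str, key: str, image_bytes: bytes) -> urllib.request.Request:
    """Build a POST request for the Recognize Printed Text (OCR) v3.1 endpoint."""
    url = endpoint.rstrip("/") + "/vision/v3.1/ocr?language=ja&detectOrientation=true"
    return urllib.request.Request(
        url,
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


def ocr_lines(result: dict) -> List[str]:
    """Flatten the regions/lines/words structure into one string per line."""
    lines = []
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            # Japanese has no spaces between words, so join them directly.
            lines.append("".join(word["text"] for word in line.get("words", [])))
    return lines


def recognize_printed_text(endpoint: str, key: str, image_bytes: bytes) -> List[str]:
    """Send the image to the service and return the recognized lines."""
    request = build_ocr_request(endpoint, key, image_bytes)
    with urllib.request.urlopen(request) as response:
        return ocr_lines(json.load(response))
```

The endpoint and key come from your own Computer Vision resource; `recognize_printed_text` is the only function here that touches the network.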

Scan Results with Azure Computer Vision (Read v3.2)

Note: Read API v3.2 supports handwriting but does not support Japanese.
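Unlike the printed-text API, Read is asynchronous: the image is submitted, the service returns an operation URL, and the client polls that URL until the status is `succeeded` or `failed`. The sketch below shows only the polling and parsing steps. It assumes the `status` and `analyzeResult.readResults[].lines[].text` layout of the v3.2 response, and `fetch_status` is a hypothetical callable that you would implement as a GET against the operation URL.

```python
import time
from typing import Callable, List


def wait_for_read_result(
    fetch_status: Callable[[], dict],
    interval: float = 1.0,
    max_polls: int = 30,
) -> dict:
    """Poll until the Read operation finishes, then return the final body."""
    for _ in range(max_polls):
        body = fetch_status()
        if body.get("status") in ("succeeded", "failed"):
            return body
        time.sleep(interval)  # still "notStarted" or "running"
    raise TimeoutError("Read operation did not finish in time")


def read_lines(body: dict) -> List[str]:
    """Collect the text of every recognized line across all pages."""
    pages = body.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]
```

Passing the fetch step in as a callable keeps the polling logic independent of the HTTP client you use.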

Date Field Results: 20 2) 4 A

Because the handwritten field could not be specified as numeric, “21” was converted to “2)”.

Telephone Field Results: 080 1234 $679

Since the telephone field could not be specified as numeric, “5” was converted to “$”.

Alphanumeric Field Results: 234567890

For some reason, the leading “1” in the handwritten field was not recognized.

Circled Text, Hanko, Kana, and Kanji are not supported.

Azure Overall

The current version of Azure Computer Vision has no support for handwritten Japanese, but it does work for handwritten numbers and letters. Without the ability to specify which fields can contain numbers or characters, the error rate for handwritten text is very high.
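One way to work around the missing field-type hint is to apply it yourself after recognition. The sketch below is a client-side post-processing step, not an Azure feature: for a field you know to be numeric, it replaces look-alike characters with the digits they most likely stand for. The mapping contains only the two confusions seen in the results above (“)” for “1” and “$” for “5”); a real mapping would have to be built from a much larger sample.

```python
# Look-alike substitutions observed in this evaluation only.
NUMERIC_FIXES = str.maketrans({")": "1", "$": "5"})


def fix_numeric_field(raw: str) -> str:
    """Replace known look-alike characters with digits; leave the rest untouched."""
    return raw.translate(NUMERIC_FIXES)


def is_clean_numeric(text: str) -> bool:
    """True if the field contains at least one digit and only digits and spaces."""
    stripped = text.replace(" ", "")
    return stripped.isdigit()
```

For example, `fix_numeric_field("080 1234 $679")` gives `"080 1234 5679"`. A field that still fails `is_clean_numeric` after the fix is a good candidate for manual review.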

Amazon Textract Evaluation

Scan Results with Amazon Textract

Textract does not support Japanese text recognition so the only evaluation point was the ability to detect handwritten numbers and letters.
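A minimal sketch of this kind of check with boto3 is shown below. It assumes the `detect_document_text` call and the `Blocks` list it returns, where `WORD` blocks carry `Text`, `Confidence`, and a `TextType` of `HANDWRITING` or `PRINTED`. The function names and the confidence threshold are my own choices.

```python
from typing import List


def handwritten_words(response: dict, min_confidence: float = 0.0) -> List[str]:
    """Return the text of WORD blocks that Textract marked as handwriting."""
    words = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        if block.get("TextType") != "HANDWRITING":
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        words.append(block["Text"])
    return words


def detect_text(image_bytes: bytes, region: str) -> dict:
    """Send one image to Textract and return the raw response."""
    import boto3  # imported here so the parser above works without boto3

    client = boto3.client("textract", region_name=region)
    return client.detect_document_text(Document={"Bytes": image_bytes})
```

Filtering on `TextType` makes it easy to separate the handwritten entries from the pre-printed text on the same form.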

Date Field Results: 20 21 # 4 A 12

The numbers were read correctly; however, the Japanese characters were converted into unrelated text such as “#” or “A”.

Telephone Field Results: 080 – 1234 – 5679

Recognition of the handwritten telephone number was very accurate. Even the dashes in the telephone number were recognized.
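The recognized dashes come back as en dashes surrounded by spaces, so a downstream system will usually want a normalized form. The sketch below treats anything that is not a digit as a separator and rejoins the digit groups with plain hyphens. The length check (10 or 11 digits with a leading 0) is my assumption about typical Japanese phone numbers, not something the OCR service enforces.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Return the number as hyphen-separated digit groups, or None if implausible."""
    groups = re.findall(r"\d+", raw)  # any non-digit run acts as a separator
    digits = "".join(groups)
    # Assumed plausibility check: leading 0 and 10 or 11 digits in total.
    if not digits.startswith("0") or len(digits) not in (10, 11):
        return None
    return "-".join(groups)
```

With the Textract result above, `normalize_phone("080 – 1234 – 5679")` returns `"080-1234-5679"`.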

Alphanumeric Field Results: |23456789

Since the alphanumeric field could not be specified as numeric, “1” was incorrectly converted to “|”.

Circled Text, Hanko, Kana, and Kanji are not supported.

Amazon Overall

In terms of handwritten text, the current version of Amazon Textract appears to be superior to Azure Computer Vision. However, because it supports neither printed nor handwritten Japanese, its use is still quite limited.

Conclusions

When using cloud-based OCR solutions for Japanese forms, the limitations of each cloud solution must be factored into the specific OCR use case. For printed Japanese text, Azure Computer Vision works quite well. For handwritten Japanese text, neither Azure nor Amazon currently has an offering. For handwritten numbers and English characters, Amazon seems to provide the best recognition, based on my handwriting.

Note: Using other handwriting samples may produce different results.
