MOMO v2

CSE455 Winter 2023 · Yu Hao Wong · Harmony Shirk

Note: This pipeline can occasionally fail under certain image conditions. We discuss some of these issues here. To best demonstrate this project, try this demo image that was used during testing.

Problem Statement

Splitting bills shouldn't be so hard. Whether you're grocery shopping, eating out, or at a recreational event with a large party, itemizing a receipt is a chore. MOMO aims to make this process as seamless as possible. Simply take a photo of your receipt: MOMO automatically parses each item, lets you select who pays for what, and requests the funds. Finally, an easy interface lets you review past group transactions that are still pending. Through this, we aim to bring additional convenience to the community.
From MOMO v1

Previous Work

Concept from MOMO v1

(DubHacks 2022, YH Wong, R Liao, RJ Mao, M Lee)

Video


How it works

The image-processing pipeline uses the OpenCV, Tesseract, and PIL libraries to run OCR on your uploaded image.

1

Ingest the image

The original image is uploaded to the backend, a Dockerized Flask server hosted on AWS Lightsail. We use a 1x Small instance with 1GB of RAM and a shared vCPU, so performance is not optimal.

2

Detect Regions of Interest

  • Preprocessing: Downscale and grayscale the image, then apply a Gaussian blur to reduce noise.
  • Detecting Regions: Apply dilation to highlight white regions in the image. We assume the receipt is the largest white region in the image.
  • Image Convolutions / Blurring · Morphological Dilation

3

Highlight Region in Image

  • Edge Detection: Run OpenCV's Canny edge detection. Edges can break up where contrast is low, so apply erosion and dilation to roughly reconnect them.
  • Contour Detection: Use OpenCV to detect contours from the edges, then use contour approximation to find the bounding box of the receipt. This is our region of interest in the image!
  • Morphological Dilation / Erosion · Canny Edge Detection · Contour Approximation

4

Perspective Correction to Fit Region

Given the contours describing our region, the original image, and the resizing ratio, map the 4 corners of our region back onto the original image. Apply a perspective warp to this region to flatten it.

Perspective Transformations / Warping

5

Image Binarization

Apply preprocessing to create a black-and-white image for better text detection. We use adaptive thresholding to account for differences in brightness and contrast across the region.

Binarization · Adaptive Thresholding

6

Run OCR Model To Extract Information

Use PyTesseract to detect text on our image, and parse relevant information with RegEx.

Return this data as the API response.
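A sketch of the extraction step. `image_to_string` is the standard PyTesseract call; the line regex mirrors the $xx.xx format the pipeline supports, and the helper names are our own illustrative ones:

```python
# Extraction sketch: OCR the binarized image, then pull out
# "item name ... price" lines with a regex limited to the
# $xx.xx dollar format. Helper names are illustrative.
import re

PRICE_LINE = re.compile(r"^(?P<item>.+?)\s+\$?(?P<price>\d+\.\d{2})\s*$")

def parse_items(text):
    items = []
    for line in text.splitlines():
        m = PRICE_LINE.match(line.strip())
        if m:
            items.append((m.group("item"), float(m.group("price"))))
    return items

def extract(binarized):
    import pytesseract  # requires the Tesseract binary to be installed
    text = pytesseract.image_to_string(binarized)
    return parse_items(text)
```

Keeping the regex parsing separate from the OCR call makes it easy to test without Tesseract installed, and to swap the recognizer later.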


Discussion

What problems did you encounter?

  • API Performance: The pipeline is hosted on a 1x Small AWS Lightsail container instance, which may not have sufficient computing power to complete all steps quickly or to handle concurrent users at scale. Uploading large image files to the API is also a bottleneck: the baseline test image takes approx. 25 seconds end to end, while the API reports only 0.42s to run the actual pipeline on the image.
  • Character Recognition Accuracy: The baseline Tesseract model is not optimized for receipts: text-box detection is reasonably accurate, but character recognition is sometimes not, leading to jumbled item names.
  • Region Detection Accuracy: If the contrast between the receipt and the background is not sufficiently high, the pipeline may fail to detect the contours. This is a common issue with most OCR systems, and this hand-rolled implementation is especially vulnerable to it.
  • RegEx Matching Accuracy: The current RegEx used to parse detected text is set to recognize dollar values in the $xx.xx format. Anything beyond that is not supported at this time.
Are there next steps you would take if you kept working on the project?

  • Improve system design of the application for better scaling and performance
  • Implement dynamic image downscaling to better handle large resolution files without loss of accuracy
  • Train the Tesseract model on the CORD dataset for better OCR accuracy
How does your approach differ from others? Was that beneficial?

The initial concept was based on my DubHacks 2022 submission, hence the v2. We replaced the original third-party OCR API with our own, allowing us to tweak preprocessing steps to better suit receipts. We can now also inspect the intermediate images produced by the pipeline. While detection accuracy has not quite caught up, our design keeps the OCR stage modular: we could feasibly drop in a PyTorch or Google Vision model without much modification.

Yu Hao Wong © 2023