Exploring Serverless Applications for Machine Learned OCR

Feb 28, 2018 | Fabio Rehm

As most people in tech know, “serverless” is the new buzzword and lots of blog posts/books/utilities/tweets/etc. are being written about it every day. This post outlines my initial experience with serverless computing, the issues I had along the way, and some considerations in case you, like me, are thinking about giving it a try for the first time.

Background

Few people outside of Brazil may have heard of the Serenata de Amor project, but it's making a lot of noise on this side of the globe. Operação Serenata de Amor is an artificial intelligence and data science project that aims to inform the general public about government corruption and spending (more info can be found on the project's website and in this nice podcast episode from The Changelog).

Serenata de Amor is also an open source project that I contribute to whenever I have some free time. Recently, I've been thinking about how we can leverage serverless computing to help the cause. Before getting into the details of why I think serverless computing makes sense for the project and how I approached the problem, it might help to know a bit more about the Serenata project itself.

So far, the main focus of the Serenata team (and where it currently excels) is analyzing meal reimbursements from congresspeople. Those reimbursements are funded by the CEAP ("Cota para Exercício da Atividade Parlamentar" or "Quota for Exercising Parliamentary Activity" in English) provided by the Chamber of Deputies.

The infographic below offers a bit more context on the project:

(As of this writing, R$ 1 ≈ US$ 0.31. Image source: Serenata's website.)

One interesting aspect of the project is its need to "read" the scanned reimbursement receipt PDFs provided by deputies in search of things like alcoholic beverages and the exact timestamps of when a meal was purchased. That data can be used to flag a congressperson's reimbursements as suspicious, since they should not be reimbursed for alcohol. Another use case is cross-referencing the timestamps found in a receipt with other datasets we have about parliamentary activity: if a congressperson was attending a session and a meal receipt from around the same time comes from another city, it probably means they bought food for other people (also not allowed).

Why use a Serverless Architecture?

Mostly to reduce the operational overhead and cost of managing dedicated servers to get the job done.

In order to extract text from those reimbursement receipts, we need to apply OCR to them. While PDFs generated by Word let readers search within the text with Ctrl+F, the PDFs provided by congresspeople to justify their reimbursements are basically "scanned pieces of paper." In other words, the receipt files are actually images inside a PDF! While we could put together self-hosted infrastructure for this using OSS tools like Tesseract, big players like Google and Microsoft provide OCR as a service at a reasonable price. By using those, we don't have to worry about managing and monitoring our own boxes or fine-tuning Tesseract configs, saving both time and money.

Another reason for going serverless is that Serenata de Amor is crowdfunded, which means a small team and budget to keep its infrastructure going. It also means the project doesn't have anyone working on it on a daily basis.

In terms of cost, some initial estimates I made showed that it'd cost less than US$100 in total to have a serverless API handle OCR for the 20k+ receipts submitted by deputies each month (mostly the Google Cloud Vision API bill); at the Vision API's list price of roughly US$1.50 per 1,000 images, 20,000 receipts comes out to around US$30 a month. Imagine how much we'd pay to run a set of beefy boxes for the OCR work, boxes that would sit idle for long stretches of time. Also consider that, at the end of the day, those third-party services are probably going to do a better job than we would.

How did it go?

Instead of dealing with the nitty-gritty of configuring and deploying a "Serverless API," I decided to explore this new world by leveraging tools that automate processes and reduce the boilerplate required to get up and running.

From what I've heard, the most popular tools seem to be the Serverless Framework and Claudia.js.

The Serverless Framework "is a CLI tool that allows users to build & deploy auto-scaling, pay-per-execution, event-driven functions." It supports seven different cloud providers (as of today).

Claudia.js is a simpler tool, focused on deploying Node.js projects to AWS Lambda.
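
For a sense of what that looks like, here's a minimal sketch of a Claudia.js endpoint using its companion claudia-api-builder package; the route is purely illustrative, not the project's actual API:

```js
// app.js: a minimal Claudia.js endpoint (illustrative route, not the real API).
// Deploy with: claudia create --region us-east-1 --api-module app
const ApiBuilder = require('claudia-api-builder');
const api = new ApiBuilder();

// Claudia maps this route to an API Gateway endpoint backed by a single Lambda
api.get('/hello', () => 'OCR service is up');

module.exports = api;
```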

First Attempt

While serverless was new to me, the "OCR with the Google Cloud Vision API" part was not. I had already been able to OCR nearly 200k receipts, so my first approach to the problem was to convert the Python code I had previously written into a serverless function. Google provides support for that, so I chose to stick with the Serverless CLI and use Google Cloud Functions in order to reduce the network latency between the function and the Vision API.

The original plan was to have a client provide the reimbursement ID and let my new API handle all the rest. It would load the reimbursements CSV (containing all of the reimbursements submitted), locate the reimbursement, use the associated data to build a URL, and then download and OCR the receipt.

Reusing that code to read the CSV and look up reimbursements meant pulling in Python's pandas and NumPy under the hood, both of which rely on native extensions in order to work. That's when things started getting hairy.

These days my main dev environment is a MacBook, while the cloud function runs on Linux. In order to cross-compile the extensions for deployment, I used a plugin that compiles the code in a Docker container resembling the function's runtime environment. The plugin seemed to work, but the problem I faced after that was that my function's resulting package was too big to be deployed, due to the native extensions and the CSV.

In hindsight, I'm thankful that it didn't work well. I realized that it was a lot for a function to handle, since functions are supposed to be lightweight and short-lived.

Second Attempt

With pandas out of the equation, I decided to switch to a simpler approach using Node.js. I still attempted a deployment to a Google Cloud Function to reduce network latency and used the Serverless CLI for the heavy lifting.

The idea was to move the reimbursement lookup needed for downloading receipts into the client, so the function wouldn't have to read a large dataset just to grab the handful of values required to build the download URL. In other words, instead of hitting an endpoint like /ocr/<reimbursement_id>, clients would hit /ocr/<applicant_id>/<year>/<reimbursement_id>.
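
To make that concrete, here's a rough sketch of the client-side URL building; the URL pattern is illustrative of the Chamber of Deputies' receipt URLs, not an exact spec:

```js
// Sketch: the client builds the receipt URL itself, so the function never has
// to load the full reimbursements CSV. The URL pattern below is illustrative.
function receiptUrl(applicantId, year, documentId) {
  return `http://www.camara.gov.br/cota-parlamentar/documentos/publ/${applicantId}/${year}/${documentId}.pdf`;
}

// The function then only needs the three values encoded in its route:
// GET /ocr/<applicant_id>/<year>/<reimbursement_id>
```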

This went well until I had to convert a PDF into a PNG image to be sent over to the Cloud Vision API.

I originally used poppler-utils to handle the conversion, but no serverless function has that tool in its default stack. I tried pdfjs instead, but that didn't work either.

After some digging, I found out that ImageMagick can handle PDF-to-PNG conversion, and it's available in the Google Cloud Functions stack. I got this part of the process working locally, but it failed in "production." It turns out that for ImageMagick to handle the PDF-to-PNG conversion, it needs Ghostscript available, and I later found out that Ghostscript had been removed from the Cloud Functions stack.
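
For context, the conversion itself is a one-liner once ImageMagick and Ghostscript (its PDF delegate) are both present. A minimal sketch of shelling out to it from Node.js:

```js
const { execFile } = require('child_process');

// Convert the first page of a PDF to PNG with ImageMagick's convert tool.
// -density sets the rasterization DPI; ImageMagick delegates the actual PDF
// decoding to Ghostscript, which is why its absence breaks this call.
function pdfToPng(pdfPath, pngPath, callback) {
  execFile('convert', ['-density', '300', `${pdfPath}[0]`, pngPath], callback);
}
```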

Third Attempt

As you might've guessed, I eventually gave up on Google Cloud Functions and moved on to AWS Lambda. Along with that, I chose to use Claudia.js, since it seemed like a simpler tool, more focused on the Node.js + Lambda combo.

I eventually got things working locally with the official Node.js Cloud Vision API client. But, again, it failed on Lambda. This time I tracked it down to the fact that the official npm package for Cloud Vision uses gRPC under the hood, which requires some native extensions as well.
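
For reference, the call itself is simple; this sketch is roughly what it looked like with the official client (based on the client's API around that time):

```js
// Sketch using the official @google-cloud/vision client; this is the code
// path that pulls in gRPC and its native extensions.
const vision = require('@google-cloud/vision');
const client = new vision.ImageAnnotatorClient();

async function ocr(pngBuffer) {
  const [result] = await client.textDetection({ image: { content: pngBuffer } });
  return result.fullTextAnnotation ? result.fullTextAnnotation.text : '';
}
```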

I tried compiling the extensions with Docker using the lambci/lambda image, but for whatever reason, that didn't work either. After some debugging and digging through AWS's UI, I noticed the function's size had increased quite a lot (20MB+, IIRC) and realized that things could actually be made simpler.

Final Attempt

The Cloud Vision API has a REST interface that can easily be used with simple HTTP clients. While gRPC is probably more efficient, HTTP also gets the job done. So, I adapted my original Python code to JavaScript using node-fetch and called it a day. I still had to bump the function's memory to 1GB to handle big documents (6+ pages) and "massage" the ImageMagick parameters to handle some corner cases, but I eventually managed to get a proof of concept that worked for all the initial tests I made.
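
A minimal sketch of what that looks like, assuming an API key is available in the function's environment (error handling omitted):

```js
const fetch = require('node-fetch');

// Sketch of calling the Cloud Vision REST API directly: no gRPC, no native
// extensions. GOOGLE_API_KEY is an assumed environment variable.
async function ocr(pngBuffer) {
  const res = await fetch(
    `https://vision.googleapis.com/v1/images:annotate?key=${process.env.GOOGLE_API_KEY}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        requests: [{
          image: { content: pngBuffer.toString('base64') },
          features: [{ type: 'TEXT_DETECTION' }],
        }],
      }),
    }
  );
  const json = await res.json();
  const response = json.responses[0];
  return response.fullTextAnnotation ? response.fullTextAnnotation.text : '';
}
```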

The code is available on GitHub, and I already have some ideas for next steps documented in the issue tracker. Here's a quick demo of going from zero to OCR in close to a minute:

Conclusion

The problem of deploying applications that leverage serverless infrastructures seems to be handled nicely by tools like the Serverless Framework CLI and Claudia.js. While they can save us a bunch of time, given the issues outlined above, I believe deployment is going to be the smallest part of your serverless endeavors.

Based on this initial experience, my opinion is that the most important things to watch out for are function size (make sure each function has as few responsibilities as possible) and dependencies (they can become a big PITA to get working in production if they have to be compiled). The FaaS (Function as a Service) experience seems similar to most PaaS environments, like pre-Docker Heroku: back in those days, we didn't have much control over the system packages installed on our dynos, and even today we need to watch our slug sizes.

Finally, my main recommendation for people trying out serverless architectures for the first time is to embrace eXtreme Programming's baby steps and not get too far ahead on "works on my machine."

This article was previously posted on my personal blog.


Be sure to follow @doximity_tech if you'd like to be notified about new blog posts.