Building A Test Automation Pipeline for Data Science

BTW, we're ⚡ hiring Infra, SRE, Web, Mobile, and Data engineers at Doximity (see roles) -- find out more about our technical stack.

Data science and machine learning are increasingly finding novel ways to improve and advance modern technology, and at Doximity, it's no different. Data science1 software is somewhat unique in the software world due to its subtly different problem-space and a different kind of engineering that goes into it. Testing data science software is a unique challenge and a topic with a surprising lack of online documentation.

When we first started to try to support data science software in our test automation stack several years ago, this lack of documentation on the topic was problematic. It meant we had no real roadmap for how we should be testing data science code, what tools we needed to provide for the data team to empower them in their test writing, or what challenges we might face building out a data science testing framework.

Below, we'll elaborate on the sort of document we wish we had when we began building our test automation framework for data science.

Towards a Data Science Test Automation Framework

So what does the source code look like that we will be testing in data science? The most widely used language in data science, and what we use the most at Doximity, is Python. This code makes use of databases for data analysis and transforms. It also often must be run via analytics engines such as Apache Spark containing code that can take hours or even days to execute fully. It’s not usually run as a server or service, but rather as something that is run periodically, often on a schedule, and then is done until it needs to be run again.

What does data science need from a testing perspective? A lot of the conventional uses of software testing don’t really apply to data science. We’re not testing code that powers a website or a mobile application, and we’re not usually testing APIs (though that certainly does still happen). We’re frequently testing code that will run on a schedule or a trigger, code that could have data science algorithms, or specialized code that performs a specific transformation on data. Additionally, the code we test often requires services unique to data science, services that can be difficult or even impossible to properly mock. Finally, we need a framework that is convenient and easy to use by a more broad swath of engineers than typical test automation frameworks need to accommodate. In data science, you have everyone from traditional software engineers to specialist mathematicians and scientists writing code that needs to be tested. It is a challenge, to be sure.

Foundational Technologies

As we have established, Python is used pretty heavily in the data science world. This is advantageous for those of us interested in testing data science code because Python has an abundance of automated testing tools and frameworks: from unittest and nose2, to Pytest and Hypothesis. Really, any of these frameworks can be made to work for data science. However, at Doximity, we went with Pytest due to several advantages it has.

Pytest is not opinionated about how you write and organize your tests. It supports unittest code as well as code written specifically for pytest. It has test discovery, allowing for a variety of testing layouts.
Pytest is easily extensible and has a rich tapestry of third-party plugins. It’s a great foundation upon which to build a custom testing framework from.
Pytest is an active and well-maintained project. As a result, there is little risk of it becoming orphaned or abandoned by those maintaining it.

Beyond Pytest, we have several third-party plugins as well as custom in-house plugins we have included in our framework. These include various linting and “code smell” plugins, code coverage tools, test and build diagnosis extensions, and more.

Now that we have a baseline for a Python testing framework let’s delve a bit deeper into some of those extensions that are important for testing data science code in particular.

Database abstractions

Data science code frequently involves interactions with databases. Thus being able to support code that makes heavy interactions with database software is important. The problem comes down to how much depth you support in testing code that touches databases. Superficially, you may want to mock some form of database functionality. This is a topic that frequently comes up in software testing, and there’s a mountain of documentation online dealing with it.

However, in data science, the code you’re testing often makes large-scale transformations on data in your database. These interactions can make mocking the entirety of the transactions difficult and can make mocking the data itself nearly impossible. Further, you frequently have code that is written using an ORM (such as SQLAlchemy), meaning that the underlying queries aren’t always known. Simple code changes can radically alter the underlying query without your knowledge. So there are extra challenges when trying to support code that interacts with databases in data science. Thus, it’s generally advisable to take a more tiered approach to testing database interaction code in a data science testing framework and provide several features, of increasing complexity, for testing more and more database entrenched code.

Tier 1: SQLite Fixtures
- Perhaps the easiest and most direct way to support database interactions is to provide a fixture that abstracts away your real database connections with a locally created SQLite database. Fixtures are a quick way to get your tests running in a testing or CI environment. You can create and destroy simple on-disk SQLite databases as your tests need them. If you’re using SQLAlchemy, creating a fixture that uses SQLite on the backend is fairly straightforward. The downside, and the real limitation is that SQLite is quite rudimentary compared to other SQL dialects. Meaning that, while this approach will certainly work for many tests, it will not work for testing all of your database interaction code.
Tier 2: A larger SQLite Mocking Manager
- A step up from simple SQLite Fixtures is to create a larger mocked testing manager which attempts to translate SQL dialect quirks from a larger, more featureful database into a smaller, on-disk, database like SQLite. This approach is much more involved than simply swapping the database backend out for SQLite, but it has the extra advantage of working with more database code, provided you’re willing to put effort into mocking those features your code is using. However, this solution doesn’t scale well when you are dealing with queries and transformations that deal with large datasets.
Tier 3: Running an actual database in your test environment
- Often considered “the nuclear option” for when you have test code that utilizes database features that are not easily mockable or recreated with a smaller, on-disk database. If you are running your tests inside some sort of virtualization or abstraction (such as on a VM or in Docker), you can very easily stand up a database server (such as MySQL, MariaDB, or PostgreSQL) for use by your tests. This gives you much more flexibility to support database code in your tests, but it comes at a high cost. For one, you must be rigorous with your clean-up between tests, or else data from previous tests may affect the results of subsequent tests. For another, it can be expensive from a resource perspective. For example, running tests locally on your laptop can put more demand on your hardware and run slower when an entire database is being stood up for your tests. On CI you may have extra duration to your tests as well as potential monetary costs for running a database in your environment.
Tier 4: Connecting to production-like databases
- Finally, if all else fails, it is possible to open up production-like databases to your tests when running in CI. Obviously, you want to ensure that all security concerns are accounted for, and you may want to have mocked data (or, indeed, scrubbed and obfuscated data cloned from real data), but it is possible to connect your code under test to something that, for all intents and purposes, is exactly what the data will look like in production.

Ultimately, offering several options allows your data team to pick and choose which method is right for their code. At Doximity, we offer this sort of à la carte selection of database code testing features in our data science testing framework.

Data Flow Testing and Support

Beyond merely testing code that interacts with databases, there’s also a class of tests you may want to perform on the actual flow of the data. This could be ensuring that code changes do not affect underlying queries in unexpected ways, making sure that data flow through the system is secure, or increasing your confidence in data processing pipeline code.

Freezing Your SQL Queries

Let’s start with the problem that modern ORMs can cause a disconnect between the code that is written and the underlying query that’s being constructed. The ORM abstraction will often hide or obfuscate what the query looks like under the hood, and changes to the code can inadvertently alter the underlying query without the implementer’s knowledge.

A useful feature of your data science test automation framework is something that can freeze those underlying SQL queries, allowing you to instrument tests that check for changes to the queries, or track deltas to the queries over time.

What we came up with internally for this functionality is a Pytest plugin called “SQL Freeze”. This adds a decorator that can be used in tests for taking a snapshot of the queries being generated behind the scenes in SQLAlchemy during test execution. We can then use that snapshot to perform future or additional tests against those same queries when run on code changes.

So what does this SQL Freeze look like? Well, we start by using Python’s “functools” to wrap our test methods and then intercept SQLAlchemy calls to decode the underlying queries. It can use these decoded queries to compare to existing text files or create new text files containing the frozen queries.

def sql_freeze_test_wrapper(user_function, dialect):
    if dialect == 'snowflake':
        sqla_dialect = SnowflakeDialect()
    elif dialect == 'mysql':
        sqla_dialect = MySQLDialect_mysqldb()
    else:
        raise ValueError(f"Unsupported dialect '{dialect}'")

    def wrapper(*args, **kwargs):
        outcome = user_function(*args, **kwargs)
        outcome = dialect_format(outcome, sqla_dialect)
        frozen_sql_file = get_frozen_file_path(user_function)
        try:
            with open(frozen_sql_file, "r", encoding="utf-8") as f:
                frozen_sql = _format(f.read(), sqla_dialect)
        except FileNotFoundError:
            frozen_sql = ""

        if kwargs['sql_freeze_overwrite_option']:
            if outcome != frozen_sql:
                write_new_frozen_file(outcome, frozen_sql_file)
            assert True
            return

        outcome = outcome.strip()
        expected = frozen_sql.strip()
        if outcome != expected:
            diff = difflib.ndiff(outcome.split('\n'), expected.split('\n'))
            print('\n'.join(diff))
            raise Exception(
                "The sql generated in this test does not match the expected sql."
            )

    return wrapper

Usage of this decorator looks something like the following:

@sql_freeze_test
def test_snowflake():
    query = make_snowflake_query()
    return query

Using these frozen SQL queries you can safeguard against ORM updates unexpectedly changing the behavior you are anticipating. These frozen queries can be incorporated into additional tests or checks that can happen locally or in CI, or they could be useful in debugging data pipelines. Knowing the consistency of your queries is vital to ensuring confidence in your pipelines.

Testing data pipeline code that uses Apache Spark

Apache Spark is an analytics engine for data processing and is widely used in data science. It provides a rich interface for developing distributed data processing tasks with implicit data parallelism and fault tolerance. Testing code that uses Apache Spark can be difficult, as it often requires Spark instances running, but these instances can be resource-intensive as well as difficult to stand up, and their management shouldn’t be the responsibility of the testing code or the data engineers writing the tests.

At Doximity, we’ve solved this problem by creating a set of fixtures that generate a shared Spark session across tests. We have also created fixtures used to generate dataframes easily from Python data structures, which help the brevity of Spark testing code.

We have a spark fixture which provides and manages a local Spark session. Using it in a test might look something like this:

def test_stopped_context(spark):
    '''
    Tests that a stopped context raises errors when creating a dataframe
    '''
    spark.stop()

    with pytest.raises(AttributeError):
        spark.createDataFrame([{'user_id': 3}, {'user_id': 6}])

Then, we have a create_dataframe fixture that can be used to create dataframes using inspection of Python data structures. Using this fixture in a test would look like this:

def firstname_filter(dataframe):
    '''
    Filters firstnames that don't start with letters
    '''
    return dataframe.where(dataframe.firstname.regexplike('^[A-Za-z]'))


def test_example_filter(create_dataframe):
    '''
    Tests that the ``firstname_filter`` eliminates firstnames that start with a numeric value
    '''
    dataset = [
        {
            'firstname': 'Steph',
            'lastname': 'Curry'
        },
        {
            'firstname': '0123456',
            'lastname': 'Bot'
        }
    ]

    df = create_dataframe(dataset)
    assert firstname_filter(df).count() == 1

Taint analysis of data flow code

Because data science code frequently handles large amounts of data, it’s important to know that your code handles that data securely. One threat we must deal with is tainted data or data that has the potential of being abused by malicious actors. Code paths that interact with data pipelines which do not ensure those pipelines are sanitized could potentially spell disaster for downstream software that digests the data.

One way we guard against this threat is through a taint analysis tool inside of our testing framework. We have built a custom, in-house Pytest extension that wraps Pyre’s Python Static Analyzer, Pysa, to perform static analysis of code that handles data interactions. This will test the code against potential vulnerabilities in its data flow. We then provide a Pysa Scanner fixture in our plugin that allows data engineers to write tests that selectively perform taint analysis on sections of the code we might be most concerned about.

As an example, a data engineer may have a test in their suite which performs a selective scan on a specific module of their project, while at the same time, excluding certain subdirectories. Their test might look something like this:

class TestPysa():
    def test_selective_scanning(self, pysaScanner):
        # We're manually initializing the scanner object with the directories
        # that should be scanned & excluded, as well as where the models for
        # this run are located.
        pysa = pysaScanner(
            scan_dirs=["datapipeline/db_code"], 
            exclude_dirs=[r".*\/datapipeline\/failing_file.*"],
            taint_models_path="datapipeline/db_code/pysa_test_config"
        )
        pysa_results = pysa.execute() # Execute the security scan, return results instance

        # Assert the command executed properly, and print the stderr if not
        assert pysa_results.execution_successful, pysa_results.stderr 

        # Now assert that there is nothing in the list of security concerns,
        # printing them to console should they indeed exist.
        assert not pysa_results.security_concerns, json.dumps(pysa_results.security_concerns, indent=2)

Conclusion

What we’ve covered here provides a reasonable foundation for building out and extending a test automation framework for data science projects. We hope that this may give you an idea of what challenges face data science test automation and some solutions to those challenges. With this framework, a data analyst, scientist or engineer can increase their confidence that their code is sound as well as secure. This increased confidence in their code can lead to improved development velocity, an assurance of correctness in the operations the code performs, and overall stability in the data pipelines it contains.

Be sure to follow @doximity_tech if you'd like to be notified about new blog posts.

Footnote 1

Even though data science, machine learning, and artificial intelligence are three different domains with overlapping technology, I will mainly be lumping them all under the “data science” umbrella for brevity. If you’d like a much better, more in-depth, explanation about what each of these are, please see Bushra’s wonderful post here.