VI-BIO

Visual Intelligence for Biology

OmixSchema and Durable Core

 

Peter Groves - pgroves@illinois.edu

Matthew Berry (mberry@illinois.edu)

Visual Intelligence for Biology (http://visari.org)

National Center for Supercomputing Applications (http://ncsa.illinois.edu)

 

White Paper, November 2017

 

 

Motivation

The VI-BIO group (Visual Intelligence for Biology) at NCSA is building a durable-core software platform that will support its current and future projects involving the development of data-intensive software applications.  The primary system challenge of this undertaking is to define a supporting architecture that invests heavily in a data modeling layer so that new and modified data types are automatically integrated into all aspects of the system: storage, querying, web access, user interface data modeling, machine learning consumption, and data documentation.

We aim to build a system of enduring value to the community, which entails many factors:

 

  • Our target audience must find it easy to use in pursuing answers to a variety of research questions, and it must keep pace with innovation.
  • It must be reliable and secure.
  • It must scale to meet user demand while remaining affordable to run in both periods of high utilization and periods of low utilization.
  • For the development team, the system must be maintainable and supportable.
  • For the greater development community, its components must be FAIR (Findable, Accessible, Interoperable, and Reusable) [1].

 

All of these present challenges, and our approach is a carefully considered strategy to meet them.

Obsolescence, always a threat to software longevity, is a particular concern here because of the fast pace of innovation in the relatively young fields of microbiome informatics and containerized cloud computing. We mitigate that risk with several strategies already proven during our previous projects built upon the same core software. First, we minimize the software affected by any technology change by composing our system of loosely coupled modules that communicate over standards-based, well-defined interfaces. Second, we minimize the data affected by any technology change with our data model, which is designed around commonalities observed over decades of visual analytics projects and is agnostic to changes in upstream data sources and the analytical algorithms and visualizations employed. Third, we achieve rapid replacement without regression through comprehensive component-level and system-level tests.

 

Approach

 

The central principle of our implementation approach is to build, over several years, a durable core for storing and processing data using best practices from the software industry [2], while allowing for great flexibility in the bioinformatics codes, which can at times progress on timescales measured in months if not weeks.

 

The durable core, in fact, will build substantially on an existing system used by both the Omix phase 1 prototype and the user-facing frontend for the NIH-funded KnowEnG project [1b]. The fundamental requirements of data-intensive applications share many commonalities, which we are able to solve once and reuse on both large and small projects.

 

More generally, the ambition (and risk) of our software platform lies in the biological research it enables, not in the use of bleeding-edge software technologies.

Infrastructure and Architecture

 

To satisfy our requirements for a durable core system, we will pursue an implementation strategy that integrates modern, off-the-shelf components that have:

 

  • A mature implementation
  • An easy setup process
  • Good default performance
  • Public support by a large community
  • Adaptability to our overall development process

 

The technologies will be integrated using our own ‘ops’ application, which is essentially a set of Python scripts that have evolved into a cohesive command-line application. The ops application contains functionality for building artifacts, initializing databases, deploying code either locally or to remote systems, and so on. With very few dependencies installed on the local machine, it can bring up a complete system in a few invocations of the appropriate ops commands. The ops codebase is considered a first-class citizen of the platform and is invested in accordingly. (A sketch of the general shape of such a tool follows.)
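
The following is a minimal sketch of what an ops-style command-line application can look like in Python. The subcommand names (build, init-db, deploy) and their behaviors are illustrative assumptions for this paper, not the actual VI-BIO ops commands.

    # Minimal sketch of an 'ops'-style command-line application.
    # The subcommands and their behavior are illustrative assumptions,
    # not the actual VI-BIO ops commands.
    import argparse
    import subprocess

    def build(args):
        # Build a Docker image for the named component from its source directory.
        subprocess.run(
            ["docker", "build", "-t", f"omix/{args.component}", f"./{args.component}"],
            check=True,
        )

    def init_db(args):
        # Placeholder: apply schema definitions to a fresh PostgreSQL instance.
        print(f"initializing database at {args.host} ...")

    def deploy(args):
        # Placeholder: push manifests for the chosen environment to Kubernetes.
        subprocess.run(["kubectl", "apply", "-f", f"deploy/{args.env}"], check=True)

    def main():
        parser = argparse.ArgumentParser(prog="ops")
        sub = parser.add_subparsers(dest="command", required=True)

        p = sub.add_parser("build", help="build a component's Docker image")
        p.add_argument("component")
        p.set_defaults(func=build)

        p = sub.add_parser("init-db", help="initialize a database instance")
        p.add_argument("--host", default="localhost")
        p.set_defaults(func=init_db)

        p = sub.add_parser("deploy", help="deploy the system to an environment")
        p.add_argument("env", choices=["local", "staging", "demo"])
        p.set_defaults(func=deploy)

        args = parser.parse_args()
        args.func(args)

    if __name__ == "__main__":
        main()

An invocation such as 'ops build webserver' or 'ops deploy staging' would then map directly onto one of the handlers above.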

 

Major Software Technology Components:

 

  • Structured Database: PostgreSQL [3] - A relational database system that is well regarded in the community for both analytics and web workloads, both of which are requirements for the Omix application. It is also one of the most mature databases available, which we gravitate toward because problems with the storage of user data can be some of the hardest problems to recover from.
  • Large File Storage: Amazon S3 [4] - Provides cost-efficient storage of large files such as FASTA files. Also very mature with a proven track record, as problems with user data are (again) difficult to recover from.
  • Web Application Server: Flask [5] - A bare-bones web framework written in Python. It allows us to build our own simple REST API [6] layer using the Flask primitives without introducing an additional programming language (Python is the default language for analytics pipelines and database clients). A minimal endpoint sketch follows this list.
  • Application Packaging: Docker [7] - Provides a straightforward way to package the major application components and their dependencies (webserver, analytics pipelines, etc.) from a source-code definition. The packaged ‘images’ can then be deployed at will in various environments. Furthermore, Docker images can be archived indefinitely to address any future scientific reproducibility issues that may arise.
  • Orchestration: Kubernetes [8] - Manages a cluster of compute hardware and a set of Docker containers provisioned over the hardware. It also manages security credentials that need to be provided to the applications (e.g. database passwords for the analytics pipelines) and resources such as temporary file space and ports. Most importantly, Kubernetes is the mechanism that allows us to deploy all application components, including large analytics pipelines, in a standard way (to a single Kubernetes installation) and then add additional compute resources to that installation as needed.
  • Virtual Machines and Firewall: Amazon EC2 [9] - An industry-standard virtual machine provider that will allow us to scale our compute resources up on demand (and back down, to save money). Critically, it also provides a well-understood firewall around the VMs for server-level security.
  • Web Application Framework: Angular [10] - A framework for building single-page web applications in JavaScript (or the related language, TypeScript). It is sponsored by Google and has a large community that builds extensive off-the-shelf components and widgets. It provides robust functionality for managing the large amounts of data Omix will pull from the API on demand.
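
To illustrate the Flask layer, here is a minimal sketch of a CRUD-style endpoint pair built from Flask primitives. The route names, payload fields, and in-memory storage stand-in are assumptions for the example; in the real system the handlers are generated from an OmixSchema (described below) and backed by PostgreSQL.

    # Minimal sketch of a REST endpoint pair built from Flask primitives.
    # Routes, fields, and the in-memory store are illustrative assumptions.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    SAMPLES = {}  # stand-in for the PostgreSQL-backed storage layer

    @app.route("/api/samples", methods=["POST"])
    def create_sample():
        record = request.get_json()
        sample_id = len(SAMPLES) + 1
        SAMPLES[sample_id] = record
        return jsonify({"id": sample_id}), 201

    @app.route("/api/samples/<int:sample_id>", methods=["GET"])
    def read_sample(sample_id):
        record = SAMPLES.get(sample_id)
        if record is None:
            return jsonify({"error": "not found"}), 404
        return jsonify(record)

    if __name__ == "__main__":
        app.run(port=5000)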

 

 

Data-Centric Software Architecture

 

The main interface between the durable core system and specific bioinformatics pipelines is the structured data passed between them. The structured data are either raw data or results from analytics jobs. The structure of the data is formally declared once, and the subsystems (database storage, the API that drives visualizations, machine learning toolkits, etc.) are then responsible for handling data of that form. To that end, the system will implement an OmixSchema, which allows a systems developer, data scientist, or visualization developer to define a table of data in a standardized way. An OmixSchema is essentially a simple table definition representing columns that are either numerical data, categorical data, or identifiers of rows in other tables (a hypothetical declaration is sketched after the list below). The core system will then generate the following for an instance of an OmixSchema:

 

  • A table in a PostgreSQL database, with various data security considerations resolved.
  • A Python client that supports basic Create, Read, Update, Delete (CRUD) methods directly against the Postgres database table. The client also provides a hook to the SqlAlchemy Python library for doing arbitrary SQL queries against the database. This is the main interface for data science jobs and pipelines to access the data.
  • A RESTful endpoint in the Flask API that supports CRUD operations and simple queries of the data, and enforces data-access policies. This is the main interface for browser-based user interfaces to access the data on behalf of a user.
  • Swagger.io [12] definitions to generate standard API documentation of the endpoint. Note these definitions can also be used by third parties to automatically build API clients in other programming languages.
  • A Python client that supports CRUD operations on entries through the API, for use by remotely running jobs or third parties that wish to consume data from the Flask API in their own data science applications.
  • Methods for data science jobs to populate Pandas [13] data structures for use by NumPy and scikit-learn algorithms. These are used internally by data science jobs and pipelines along with the Postgres clients.
  • An Angular service client for working with data entries inside an in-browser Angular user interface.
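
Because this paper does not fix a concrete declaration syntax, the following is a hypothetical sketch of what defining an OmixSchema and deriving its PostgreSQL table from it might look like. The class names (OmixSchema, Column), the column kinds, the taxa_abundance example table, and the to_create_table_sql helper are all assumptions for illustration; the actual platform API may differ.

    # Hypothetical sketch of declaring an OmixSchema and generating SQL DDL.
    # All names here are illustrative assumptions, not the platform's API.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Column:
        name: str
        kind: str             # "numeric", "categorical", or "foreign_key"
        references: str = ""  # target table for foreign_key columns

    @dataclass
    class OmixSchema:
        table_name: str
        columns: List[Column] = field(default_factory=list)

    # A table of bacterial taxon abundances: one numeric measurement per
    # (sample, taxon) pair, with identifier columns pointing at other tables.
    taxa_abundance = OmixSchema(
        table_name="taxa_abundance",
        columns=[
            Column("sample_id", "foreign_key", references="samples"),
            Column("taxon_id", "foreign_key", references="taxa"),
            Column("abundance", "numeric"),
            Column("rank", "categorical"),  # e.g. phylum, genus, species
        ],
    )

    def to_create_table_sql(schema: OmixSchema) -> str:
        # One of the generated artifacts would be the PostgreSQL DDL.
        type_map = {"numeric": "DOUBLE PRECISION", "categorical": "TEXT"}
        cols = ["id SERIAL PRIMARY KEY"]
        for c in schema.columns:
            if c.kind == "foreign_key":
                cols.append(f"{c.name} INTEGER REFERENCES {c.references}(id)")
            else:
                cols.append(f"{c.name} {type_map[c.kind]}")
        body = ",\n  ".join(cols)
        return f"CREATE TABLE {schema.table_name} (\n  {body}\n);"

    print(to_create_table_sql(taxa_abundance))

From the same declaration, the core system would analogously generate the CRUD clients, the Flask endpoint, the Swagger definitions, and the Angular service listed above.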

 

This upfront focus on the specific data structures to pass between subsystems is our novel adaptation of Domain Driven Design [20] to complex scientific data and pipelines. This will greatly improve several workflows we commonly see in the long path to the development of production-ready machine learning applications:

 

  1. Data science practitioners get the durability and flexibility of an SQL database to store results as they do algorithm development, without necessarily being able to administer an RDBMS themselves. This greatly simplifies debugging of complex pipelines by providing transparency into the intermediate results using basic SQL.
  2. Once a new analytics approach has been verified and accepted during a data science effort, it is already integrated with the production database and results can be accessed via a REST API. This is a great improvement on the typical situation where an analytics approach is developed in Python or R on a researcher’s laptop and must be overhauled to work with a webserver and database in order to be part of a web application.
  3. Visualization development can be accelerated with standard business intelligence tools like Tableau [14] because the results of data science jobs are accessible in a Postgres database. This avoids the pitfall of building out a complete, full-stack visualization before seeing it for the first time with real data.
  4. Collaboration between medical researchers, data scientists, and user interface designers is easier when it is grounded in a specific set of data tables that the medical researcher is handing off to the software practitioners.
  5. As data science pipelines scale up to multiple threads or multiple machines, the transaction management features of the database can be used to solve the concurrency issues that come with parallel computation. This allows the job scheduling framework to remain relatively simple, as multipart jobs can be run in parallel and the job components can manage their computational flow using shared control elements in the Postgres database (see the sketch following this list).
  6. As bioinformatics libraries change or are replaced, we are able to swap them out and merely provide transformations from the new output to the standardized data forms the other components (such as visualizations) are built against. For instance, we will define our own specific data structure as an OmixSchema for bacterial taxa and populate it with the output of a Tornado pipeline. If we wish to use Qiime instead for this task, we can make the change in relative isolation.
  7. OmixSchemas provide a consistent mechanism to relate all analytics results to the job that produced them by tracking identifiers of job runs. This yields deterministic provenance tracking back to the data inputs and job configuration (including the exact code version that was running).
  8. OmixSchemas were designed based on our experience with real-world visualization implementations and data science projects. This gives data that is mapped into OmixSchemas a good default ergonomic fit to the programming tasks we typically undertake. Furthermore, improvements to the OmixSchema platform itself provide a great deal of leverage to accelerate solving relevant programming challenges and are a reliable way to transfer software engineering expertise to researchers with different backgrounds.
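
As a concrete illustration of item 5, the sketch below shows parallel workers coordinating through a shared control table, using PostgreSQL's standard SELECT ... FOR UPDATE SKIP LOCKED idiom. The job_parts table, its columns, and the use of the psycopg2 driver are assumptions for the example, not a prescribed part of the platform.

    # Illustrative sketch: parallel workers claim parts of a multipart job
    # through a shared control table, so the scheduler stays simple.
    # The job_parts table and psycopg2 usage are assumptions for the example.
    import psycopg2

    def claim_and_run_next_part(conn, job_id, run_part):
        """Atomically claim one pending part of a job and run it."""
        with conn:  # psycopg2: commits on success, rolls back on exception
            with conn.cursor() as cur:
                cur.execute(
                    """
                    SELECT id FROM job_parts
                    WHERE job_id = %s AND status = 'pending'
                    ORDER BY id
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                    """,
                    (job_id,),
                )
                row = cur.fetchone()
                if row is None:
                    return False  # nothing left to claim; the job is draining
                part_id = row[0]
                run_part(part_id)  # the actual computation for this part
                cur.execute(
                    "UPDATE job_parts SET status = 'done' WHERE id = %s",
                    (part_id,),
                )
        return True

Because a claimed row stays locked until the transaction commits, two workers can never run the same part, while SKIP LOCKED lets each worker move on to a different pending part rather than blocking.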

 

Figure 1. All components are isolated in Docker containers. The containers are then run by a Kubernetes orchestration system, which provides the networking layer that the components communicate over. Kubernetes itself is run as a cluster over multiple hardware nodes. More Docker containers can be added independently of more hardware resources, and vice versa. This provides great flexibility in scaling up the number of web servers and job workers as needed within Kubernetes while independently increasing the hardware resources when the Kubernetes cluster as a whole reaches capacity.

 

 

Figure 2. High-level overview of how the browser, API server(s), database, and job workers communicate in response to user interactions in the browser.

 

 

Figure 3. A simple three-step workflow where a user uploads data, runs an analytics job over that data, and views the results. The data models for data of type X, data of type Y, and the configuration parameters for a job "X->Y" are defined once as OmixSchemas, and the system handles the basic data transport functions between components in the appropriate formats.

 

 

 

Security Considerations

 

While most (if not all) of the initial data to be imported into the Omix tool will be previously published, we still intend to build in security safeguards from the beginning to work through some of the issues unique to our architecture and the data to be stored.

 

The security layers will be:

 

  • Using an AWS Virtual Private Cloud [15] to isolate all access to the servers and databases except for public access to the web server.
  • Kubernetes internal networking to isolate servers and databases from each other, except where needed to access their dependent services.
  • Kubernetes management of security credentials (AWS credentials, database passwords, etc), so individual machines have access to the bare minimum of secrets to run their service.
  • SSL encryption on all web traffic between servers and the user’s browser.
  • Authentication using JWT security tokens [16] to validate users.
  • The Passlib [17] library to uniquely salt and hash stored passwords using modern cryptography (a minimal sketch combining these two layers follows this list).
  • Data ownership built into all database tables. By default, when a user queries for data from the browser app, they can only access data created by their own user account. More sophisticated data sharing will be added using this basic mechanism when prioritized by users.
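
A minimal sketch of the password-storage and token layers named above, using the cited Passlib [17] for salted hashing and, as an assumption (the specific JWT implementation is not named here), the PyJWT library for token issuance and validation:

    # Minimal sketch of salted password hashing plus JWT issuance/validation.
    # PyJWT and the secret handling shown here are assumptions; in production
    # the secret would be injected by Kubernetes, as described above.
    import datetime

    import jwt  # PyJWT
    from passlib.context import CryptContext

    pwd_context = CryptContext(schemes=["bcrypt"])
    JWT_SECRET = "replace-with-a-kubernetes-managed-secret"

    def hash_password(plaintext: str) -> str:
        # Passlib generates a unique salt and embeds it in the returned hash.
        return pwd_context.hash(plaintext)

    def verify_password(plaintext: str, stored_hash: str) -> bool:
        return pwd_context.verify(plaintext, stored_hash)

    def issue_token(user_id: int) -> str:
        claims = {
            "sub": str(user_id),
            "exp": datetime.datetime.utcnow() + datetime.timedelta(hours=8),
        }
        return jwt.encode(claims, JWT_SECRET, algorithm="HS256")

    def validate_token(token: str) -> str:
        # Raises jwt.InvalidTokenError on expiry or tampering.
        return jwt.decode(token, JWT_SECRET, algorithms=["HS256"])["sub"]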

 

Development Process

 

Complementary to the overall architecture described above is our development process to make consistent, reliable progress toward a full system. Generally, our process is a cycle of stages that can run in parallel when appropriate:

 

Issue Prioritization -> UX Design -> Code Implementation -> Code Review -> Deployment -> User Feedback -> Issue Prioritization

 

Issue Prioritization: We use Atlassian’s JIRA issue tracking software to collect all user stories and bug reports. At least once every few weeks we prioritize and assign the most important issues to address next.

 

UX Design: Handoff of designs to the engineering team is facilitated by the use of industry-standard tools, primarily the Sketch user-interface design tool and the InVision [18] prototyping and style-markup tool. InVision, specifically, provides developers with access to all of the interface mockups, as well as a platform for documentation and discussion of details related to specific components. In addition, it can be used to obtain feedback at early stages of development from potential users and other stakeholders in order to better guide the design process.

 

Code Implementation and Review: Our engineering process is based on git-flow [11] branching and merging (with code reviews). A continuous integration server runs an extensive test suite that exercises all of the architectural components and their integration before any code is merged into the mainline code repository.

 

Deployment: We fully automate the setup of new instances and the deployment of code changes to those instances. This allows researchers with very little system-administration experience to get up and running quickly. Furthermore, iterative updates are made frequently and reliably to all instances, including a staging server that runs the latest code merged into the mainline and demo servers that are periodically updated when the mainline is considered stable enough to support end users.

 

User Support and Feedback: Users will be able to submit help requests, bug reports, and other feedback via a simple form within the application. These submissions will be automatically captured in JIRA [19], an industry-standard ticketing system, for follow-up that may include additional correspondence with the user, modifications to the application, and refinements to the on-screen help. Furthermore, JIRA’s reporting tools will allow us to track metrics such as ticket volume and time to resolution to ensure we remain responsive to user needs over the lifetime of the application.

 

Issue Prioritization (Development Cycle Restart): User support issues that need additional development time are added to the backlog of JIRA issues, which are again prioritized and the cycle repeats.

 

 

Bibliography:

 

[1] Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).

 

[1b] KnowEnG: Big Data To Knowledge Center of Excellence [Internet].  KnowEnG BD2K center [2017]. Available from:  https://knoweng.org/

 

[2] N-Tier / 3-Tier Architectural Style [Internet]. Microsoft MSDN. [2017]. Available from: https://msdn.microsoft.com/en-us/library/ee658117.aspx#NTier3TierStyle

 

[3] PostgreSQL 10 [Internet]. PostgreSQL Global Development Group. [2017]. Available from: https://www.postgresql.org.

 

[4] Amazon S3 [Internet]. Amazon Web Services [2017]. Available from: https://aws.amazon.com/s3/

 

[5] Flask [Internet]. Armin Ronacher. [2017]. Available from: http://flask.pocoo.org/

 

[6] Vazquez, Gonzalo. An Introduction to API’s. 2015 Aug 26 [cited 2017 Nov 9]. Available from: https://restful.io/an-introduction-to-api-s-cee90581ca1b

 

[7] Docker [Internet]. Docker, Inc. [2017]. Available from: https://www.docker.com/

 

[8] Kubernetes [Internet]. The Linux Foundation. [2017]. Available from: https://kubernetes.io/

 

[9] Amazon EC2 [Internet]. Amazon Web Services [2017]. Available from: https://aws.amazon.com/ec2/

 

[10] Angular [Internet]. Google. [2017]. Available from: https://angular.io/

 

[11] GitFlow Workflow [Internet]. Atlassian. [cited 2017 Nov 9]. Available from: https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow

 

[12] Swagger [Internet]. SmartBear Software. [2017]. Available from: https://swagger.io/

 

[13] Python Data Analysis Library (pandas) [Internet]. NumFOCUS [2017]. Available from: http://pandas.pydata.org/

 

[14] Tableau [Internet]. Tableau Software. [2017]. Available from: https://www.tableau.com/

 

[15] Amazon Virtual Private Cloud (VPC) [Internet]. Amazon Web Services [2017]. Available from: https://aws.amazon.com/vpc/

 

[16] Introduction to JSON Web Tokens (JWT) [Internet]. Auth0 [2017]. Available from: https://jwt.io/introduction/

 

[17] Passlib 1.7.1 documentation [Internet]. Assurance Technologies, LLC. [2017]. Available from: http://passlib.readthedocs.io/en/stable/index.html

 

[18] InVision [Internet]. InVision [2017]. Available from: https://www.invisionapp.com/

 

[19] Jira Service Desk [Internet]. Atlassian [2017]. Available from: https://www.atlassian.com/software/jira/service-desk

 

[20] What is Domain Driven Design? [Internet]. Domain Language, Inc. [2016]. Available from: http://dddcommunity.org/learning-ddd/what_is_ddd/

 
VI-BIO - Visual Intelligence for Biology

NCSA  |  UNIVERSITY OF ILLINOIS

CONTACT: cbushell@illinois.edu