VI-BIO
Visual Intelligence for Biology
OmiXschema and Durable Core
Peter Groves (pgroves@illinois.edu)
Matthew Berry (mberry@illinois.edu)
Visual Intelligence for Biology (http://visari.org)
National Center for Supercomputing Applications (http://ncsa.illinois.edu)
White Paper, November 2017
Motivation
The VI-BIO group (Visual Intelligence for Biology) at NCSA is building a durable-core software platform that will support its current and future projects involving the development of data-intensive software applications. The primary system challenge of this undertaking is to define a supporting architecture that invests heavily in a data modeling layer so that new and modified data types are automatically integrated into all aspects of the system: storage, querying, web access, user interface data modeling, machine learning consumption, and data documentation.
We aim to build a system of enduring value to the community, an aim that entails many factors, all of which present challenges; our approach is a carefully considered strategy to meet them.
Obsolescence, always a threat to software longevity, is a particular concern here because of the fast pace of innovation in the relatively young fields of microbiome informatics and containerized cloud computing. We mitigate that risk with several strategies already proven during our previous projects built upon the same core software. First, we minimize the software affected by any technology change by composing our system of loosely coupled modules that communicate over standards-based, well-defined interfaces. Second, we minimize the data affected by any technology change with our data model, which is designed around commonalities observed over decades of visual analytics projects and is agnostic to changes in upstream data sources and the analytical algorithms and visualizations employed. Third, we achieve rapid replacement without regression through comprehensive component-level and system-level tests.
Approach
The central principle of our implementation approach is to build a durable core, over several years, for storing and processing data using best practices from the software industry [2], while allowing great flexibility in the bioinformatics codes, which can at times progress on timeframes measured in months if not weeks.
The durable core, in fact, will build substantially on an existing system used by both the Omix phase 1 prototype and the user-facing frontend for the NIH-funded KnowEnG project [1b]. The fundamental requirements of data-intensive applications have many commonalities that we are able to solve once and reuse on both large and small projects.
More generally, the ambition (and risk) of our software platform will lie in the biological research, not in using bleeding-edge software technologies.
Infrastructure and Architecture
To satisfy our requirements for a durable core system, we will pursue an implementation strategy that integrates modern, off-the-shelf components that have:
The technologies will be integrated using our own ‘ops’ application, which is essentially a set of Python scripts that have evolved into a cohesive command-line application. The ops application contains functionality for building artifacts, initializing databases, deploying code either locally or to remote systems, etc. It can bring up a complete system, with very few dependencies installed on the local machine, in a few invocations of the appropriate ops commands. The ops codebase is considered a first-class citizen of the platform and is invested in accordingly.
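A minimal sketch of what such a command-line dispatcher might look like, using Python's standard argparse module (the subcommand names and their behavior here are hypothetical illustrations, not the actual ops tool):

```python
import argparse

# Hypothetical sketch of an 'ops'-style command-line application: one entry
# point dispatching subcommands for building artifacts, initializing
# databases, and deploying code. (Names and behavior are illustrative only.)

def build(args):
    print(f"building artifacts for target: {args.target}")

def deploy(args):
    print(f"deploying code to: {args.host}")

def init_db(args):
    print("initializing database schema")

def make_parser():
    parser = argparse.ArgumentParser(prog="ops")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("build", help="build deployable artifacts")
    p.add_argument("--target", default="all")
    p.set_defaults(func=build)

    p = sub.add_parser("deploy", help="deploy locally or to a remote system")
    p.add_argument("--host", default="localhost")
    p.set_defaults(func=deploy)

    p = sub.add_parser("init-db", help="initialize databases")
    p.set_defaults(func=init_db)
    return parser

# Simulate an invocation: `ops deploy --host staging.example.org`
args = make_parser().parse_args(["deploy", "--host", "staging.example.org"])
args.func(args)  # prints: deploying code to: staging.example.org
```

Keeping every operational task behind one parser like this is what lets the tool stay cohesive as scripts accumulate.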
Major Software Technology Components:
Data Centric Software Architecture
The main interface between the durable core system and specific bioinformatics pipelines is the structured data that are passed between them. The structured data are either raw data or results from analytics jobs. The structure of the data is formally declared once, and the subsystems (database storage, API to drive visualizations, machine learning toolkits, etc.) are then responsible for handling data of that form. To that end, the system will implement an OmiXschema, which allows a systems developer, data scientist, or visualization developer to define a table of data in a standardized way. An OmiXschema is essentially a simple table definition representing columns that are either numerical data, categorical data, or identifiers of rows in other tables. The core system will then generate the following for an instance of an OmiXschema:
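To make the idea concrete, here is a minimal sketch of such a table declaration and one of the artifacts the core could derive from it. All class and method names below (and the `to_create_table_sql` helper) are hypothetical; the actual OmiXschema format is not specified in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical OmiXschema-style declaration: each column is numerical,
# categorical, or a foreign-key identifier pointing at rows in another table.
@dataclass
class Column:
    name: str
    kind: str                         # "numerical" | "categorical" | "identifier"
    references: Optional[str] = None  # target table, for identifier columns

@dataclass
class OmiXschema:
    table: str
    columns: List[Column]

    def to_create_table_sql(self) -> str:
        """Derive a PostgreSQL CREATE TABLE statement from the declaration."""
        sql_types = {"numerical": "DOUBLE PRECISION", "categorical": "TEXT"}
        defs = []
        for col in self.columns:
            if col.kind == "identifier":
                defs.append(f"{col.name} INTEGER REFERENCES {col.references}(id)")
            else:
                defs.append(f"{col.name} {sql_types[col.kind]}")
        return (f"CREATE TABLE {self.table} (id SERIAL PRIMARY KEY, "
                + ", ".join(defs) + ");")

# Example: an abundance table whose rows point back at a samples table.
abundance = OmiXschema("abundance", [
    Column("sample_id", "identifier", references="samples"),
    Column("taxon", "categorical"),
    Column("count", "numerical"),
])
print(abundance.to_create_table_sql())  # emits the derived DDL statement
```

The same declaration could equally drive an API serializer, a pandas dtype mapping, or generated documentation, which is the "declare once, reuse everywhere" point of the design.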
This upfront focus on the specific data structures to pass between subsystems is our novel adaptation of Domain Driven Design [20] to complex scientific data and pipelines. This will greatly improve several workflows we commonly see in the long path to the development of production-ready machine learning applications:
Figure 1. All components are isolated in Docker containers. The containers are then run by a Kubernetes orchestration system, which provides the networking layer that the components communicate over. Kubernetes itself is run as a cluster over multiple hardware nodes. More Docker containers can be added independently of more hardware resources, and vice versa. This provides great flexibility in scaling up the number of web servers and job workers as needed within Kubernetes, while independently increasing the hardware resources when the Kubernetes cluster as a whole reaches capacity.
Figure 2. High-level overview of how the browser, API server(s), database, and job workers communicate in response to user interactions in the browser.
Figure 3. A simple three-step workflow where a user uploads data, runs an analytics job over that data, and views the results. The data models for data of types X and Y, and the configuration parameters for a job "X->Y", are defined once as OmiXschemas, and the system handles the basic data transport functions between components in the appropriate formats.
Security Considerations
While most (if not all) of the initial data to be imported into the Omix tool will be previously published, we still intend to build in security safeguards from the beginning to work through some of the issues unique to our architecture and the data to be stored.
The security layers will be:
Development Process
Complementary to the overall architecture described above is our development process to make consistent, reliable progress toward a full system. Generally, our process is a cycle of stages that can run in parallel when appropriate:
Issue Prioritization -> UX Design -> Code Implementation -> Code Review -> Deployment -> User Feedback -> Issue Prioritization
Issue Prioritization: We use Atlassian’s JIRA issue tracking software to collect all user stories and bug reports. At least once every few weeks we prioritize and assign the most important issues to address next.
UX Design: Handoff of designs to the engineering team is facilitated by industry-standard tools, primarily the Sketch user-interface design tool and the InVision prototyping and style-markup tool. InVision [18], specifically, provides developers with access to all of the interface mockups as well as a platform for documentation and discussion of details related to specific components. In addition, it can be used to obtain feedback at early stages of development from potential users and other stakeholders in order to better guide the design process.
Code Implementation and Review: Our engineering process is based on git-flow [11] branching and merging, with code reviews. A continuous integration server runs an extensive test suite that exercises all of the architectural components and their integration before any code is merged into the mainline code repository.
Deployment: We fully automate the setup of new instances and the deployment of code changes to those instances. This allows researchers with very little system-administration experience to get up and running quickly. Furthermore, iterative updates are made frequently and reliably to all instances, including a staging server that runs the latest code merged into the mainline and demo servers that are periodically updated when the mainline is considered stable enough to support end users.
User Support and Feedback: Users will be able to submit help requests, bug reports, and other feedback via a simple form within the application. These submissions will be automatically captured in JIRA [19], an industry-standard ticketing system, for follow-up that may include additional correspondence with the user, modifications to the application, and refinements to the on-screen help. Furthermore, JIRA’s reporting tools will allow us to track metrics such as ticket volume and time to resolution to ensure we remain responsive to user needs over the lifetime of the application.
Issue Prioritization (Development Cycle Restart): User support issues that need additional development time are added to the backlog of JIRA issues, which are again prioritized and the cycle repeats.
Bibliography:
[1] Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
[1b] KnowEnG: Big Data To Knowledge Center of Excellence [Internet]. KnowEnG BD2K center [2017]. Available from: https://knoweng.org/
[2] N-Tier / 3-Tier Architectural Style [Internet]. Microsoft MSDN. [2017]. Available from: https://msdn.microsoft.com/en-us/library/ee658117.aspx#NTier3TierStyle
[3] PostgreSQL 10 [Internet]. PostgreSQL Global Development Group. [2017]. Available from: https://www.postgresql.org.
[4] Amazon S3 [Internet]. Amazon Web Services [2017]. Available from: https://aws.amazon.com/s3/
[5] Flask [Internet]. Armin Ronacher. [2017]. Available from: http://flask.pocoo.org/
[6] Vazquez, Gonzalo. An Introduction to API’s. 2015 Aug 26 [cited 2017 Nov 9]. Available from: https://restful.io/an-introduction-to-api-s-cee90581ca1b
[7] Docker [Internet]. Docker, Inc. [2017]. Available from: https://www.docker.com/
[8] Kubernetes [Internet]. The Linux Foundation. [2017]. Available from: https://kubernetes.io/
[9] Amazon EC2 [Internet]. Amazon Web Services [2017]. Available from: https://aws.amazon.com/ec2/
[10] Angular [Internet]. Google. [2017]. Available from: https://angular.io/
[11] GitFlow Workflow [Internet]. Atlassian. [cited 2017 Nov 9] Available from: https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow
[12] Swagger [Internet]. SmartBear Software. [2017]. Available from: https://swagger.io/
[13] Python Data Analysis Library (pandas) [Internet]. NumFOCUS [2017]. Available from: http://pandas.pydata.org/
[14] Tableau [Internet]. Tableau Software. [2017]. Available from: https://www.tableau.com/
[15] Amazon Virtual Private Cloud (VPC) [Internet]. Amazon Web Services [2017]. Available from: https://aws.amazon.com/vpc/
[16] Introduction to JSON Web Tokens (JWT) [Internet]. Auth0 [2017]. Available from: https://jwt.io/introduction/
[17] Passlib 1.7.1 documentation [Internet]. Assurance Technologies, LLC. [2017]. Available from: http://passlib.readthedocs.io/en/stable/index.html
[18] InVision [Internet]. InVision [2017]. Available from: https://www.invisionapp.com/
[19] Jira Service Desk [Internet]. Atlassian [2017]. Available from: https://www.atlassian.com/software/jira/service-desk
[20] What is Domain Driven Design? [Internet]. Domain Language, Inc. [2016]. Available from: http://dddcommunity.org/learning-ddd/what_is_ddd/
CONTACT: cbushell@illinois.edu