Thursday, August 20, 2015

Combating the Life Science Data Avalanche

Big data has become a growing challenge in science: data sets are now so large and complex that traditional data processing applications can’t keep up. This is especially true in the life science industry, where the growth in data volume hasn’t been matched by tools for analyzing and interpreting that data, leading to what many call a “data avalanche.”

Life science researchers are generating more next-generation sequencing data, from more samples at deeper sequencing depth, and that data is growing in complexity as researchers move from targeted panels to whole-exome and whole-genome sequencing. The tools appearing on the market, however, often fail to incorporate system-level interpretation, compounding the problem.

Enter bioinformatics, an interdisciplinary field spanning computer science, statistics, mathematics and engineering that develops methods and software tools for understanding biological data.

To address the growing size of data, there has been an expansion of tools that take advantage of Cloud computing and storage. Many vendors are developing their own Cloud-enabled software platforms capable of hosting a variety of analysis applications. But while life science researchers want to leverage the Cloud, they will only do so for the additional functionality it brings, such as access to large data sets, common annotation sources, data sharing and scalable computing; they won’t accept the hassle of uploading unless there’s a real benefit.

Open source solutions, like the Cloud, bring a downside of their own: “tool overload.” “Tool overload occurs as open source tools rapidly evolve and create downstream problems with version control, validation, security, governance, reporting and data reproducibility,” says Narges Bani Asadi, PhD, Founder and CEO of Bina Technologies.
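
Neither the article nor Bani Asadi names a specific fix, but the version-control half of the problem is straightforward to illustrate. Below is a minimal sketch in Python, with hypothetical tool names and versions, of pinning the open source tools a pipeline depends on and detecting drift before a re-run, so results stay reproducible.

```python
import json
import hashlib

# Hypothetical example: pin the exact versions of the open source tools a
# pipeline depends on, so a re-run months later can detect version drift.
PINNED_TOOLS = {
    "aligner": "0.7.17",        # illustrative tool names and versions
    "variant_caller": "4.1.2",
}

def write_manifest(path, tools):
    """Record the tool versions (plus a checksum of the pin set) for this run."""
    digest = hashlib.sha256(json.dumps(tools, sort_keys=True).encode()).hexdigest()
    with open(path, "w") as fh:
        json.dump({"tools": tools, "digest": digest}, fh, indent=2)

def verify_manifest(path, tools):
    """Fail loudly if the current tool versions differ from the recorded run."""
    with open(path) as fh:
        recorded = json.load(fh)["tools"]
    drifted = {k: (recorded.get(k), v) for k, v in tools.items()
               if recorded.get(k) != v}
    if drifted:
        raise RuntimeError(f"Tool versions changed since last run: {drifted}")

write_manifest("run_manifest.json", PINNED_TOOLS)
verify_manifest("run_manifest.json", PINNED_TOOLS)  # passes; bump a version to see it fail
```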

Meanwhile, as bioinformatics matures, multi-omic data sets are increasingly common, and tools are being developed to integrate this data, “sometimes across extremely different file types and software applications,” says Antoni Wandycz, Director, Bioinformatics Solutions, Software & Informatics Div., Agilent Technologies.
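
How that integration works varies by product, and Agilent’s implementation isn’t described here. Purely as an illustration, the Python sketch below joins a transcriptomics table and a metabolomics table on a shared sample identifier using pandas, with small in-memory tables standing in for the differently formatted source files.

```python
import pandas as pd

# Illustrative stand-ins for two differently formatted source files
# (e.g., a CSV expression matrix and a vendor-specific metabolite export).
expression = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "GENE_A_expr": [5.2, 7.9, 6.1],
})
metabolites = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "metabolite_X_abundance": [0.42, 0.88, 0.51],
})

# Integrate the two omics layers on the shared sample identifier,
# keeping only samples present in both data sets.
merged = expression.merge(metabolites, on="sample_id", how="inner")

# A first cross-omic question: do expression and abundance co-vary?
print(merged)
print("correlation:", merged["GENE_A_expr"].corr(merged["metabolite_X_abundance"]))
```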

There has also been an increase in the use of tools for modeling and simulation. As predictive analytics becomes standard in research, it’s finding its way into regulatory submissions. “This shift toward more in silico experimentation has provided several benefits to the life science industry,” says Tim Moran, Director of Life Science Research Marketing, BIOVIA.

The benefits of bioinformatics
In life science, whether for basic research or applied work, bioinformatics offers major advantages at low cost, including estimating values that can’t be measured directly and revealing patterns invisible to the eye. The fact remains, though, that even the best technology on the market can’t measure everything of interest to the field. “An example of this is the fluctuating concentration of any given metabolite in a live cell,” says Wandycz.

Good bioinformatics models, however, allow researchers to infer those hidden values from what they can measure. “Furthermore, a researcher may not see the pattern in a large, complex model of data sets, whereas bioinformatics algorithms excel at such a task,” says Wandycz. With bioinformatics tools on the market, thousands, and sometimes millions, of virtual experiments can be done faster and cheaper than a single experiment at the lab bench. Yet bioinformatics technology doesn’t replace reality, “and therefore doesn’t replace traditional laboratory experiments, rather it complements them, and is complemented in turn,” says Wandycz.
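
As a toy example of what a “virtual experiment” can mean (not any vendor’s model), the sketch below sweeps hundreds of parameter combinations of the standard Hill dose-response equation in well under a second, screening for hypothetical parameter sets with a wide response window.

```python
import itertools

def hill_response(dose, ec50, n):
    """Standard Hill equation: fractional response at a given dose."""
    return dose**n / (ec50**n + dose**n)

# Sweep a grid of hypothetical candidate parameters: each combination
# is one "virtual experiment".
ec50_values = [x / 10 for x in range(1, 101)]   # 0.1 .. 10.0
hill_slopes = [1.0, 1.5, 2.0, 2.5, 3.0]
doses = [0.1, 1.0, 10.0]

hits = []
for ec50, n in itertools.product(ec50_values, hill_slopes):
    # Keep parameter sets that achieve >90% response at the top dose
    # but <10% at the lowest dose (a wide, therapeutic-like window).
    if hill_response(doses[-1], ec50, n) > 0.9 and hill_response(doses[0], ec50, n) < 0.1:
        hits.append((ec50, n))

print(f"{len(ec50_values) * len(hill_slopes)} virtual experiments, {len(hits)} hits")
```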

Bioinformatics has long been a bottleneck in translational research, sitting between the laboratory scientists who do sequencing and the clinical researchers who do annotation and interpretation. “Empowering bioinformatics with a Genomic Management Solution (GMS) to more efficiently process and manage large volumes of genomic data removes this bottleneck,” says Bani Asadi. “With GMS we also enable researchers with better access to the data through collaborative user interfaces, to more effectively leverage the data across research teams, organizations and consortia.”

By providing collaborative access to genomic data, clinical and translational research can progress toward a deeper understanding of human disease mechanisms, treatment and diagnosis.

Bioinformatics tools also enable modeling and simulation that can reduce the number of experiments that need to be performed, while still meeting the repeatability standards of life science research. “These same capabilities allow researchers to identify undesirable pharmacological and/or biological development issues early in discovery, before progressing to development, essentially failing poor candidates faster and providing higher-quality candidates downstream,” says Moran.
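
In code, “failing poor candidates faster” often reduces to screening predicted properties before any wet-lab work. The sketch below uses Lipinski’s rule of five, a standard drug-likeness heuristic, as an illustrative stand-in; the candidate compounds and their property values are hypothetical.

```python
# Lipinski's rule of five: a widely used drug-likeness filter.
# Candidate names and property values here are hypothetical.
candidates = [
    {"name": "cmpd-001", "mol_weight": 420.0, "logp": 3.1, "h_donors": 2, "h_acceptors": 6},
    {"name": "cmpd-002", "mol_weight": 612.5, "logp": 6.2, "h_donors": 4, "h_acceptors": 11},
    {"name": "cmpd-003", "mol_weight": 350.2, "logp": 1.8, "h_donors": 1, "h_acceptors": 4},
]

def passes_rule_of_five(c):
    """Fail fast on predicted properties, before any bench work begins."""
    return (c["mol_weight"] <= 500
            and c["logp"] <= 5
            and c["h_donors"] <= 5
            and c["h_acceptors"] <= 10)

advanced = [c["name"] for c in candidates if passes_rule_of_five(c)]
failed_fast = [c["name"] for c in candidates if not passes_rule_of_five(c)]
print("advance to development:", advanced)    # cmpd-001, cmpd-003
print("failed fast in silico:", failed_fast)  # cmpd-002
```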

The changing landscape of life science research
The changing landscape of research today is forcing the bioinformatics community to seek a new level of data sharing and collaboration only made possible with new platforms. Data volumes are growing due to larger population studies, deeper sequencing coverage to detect variants with greater sensitivity and new instrumentation systems that run simultaneously and increase the data output of sequencing cores.

Data integration from internal and external sources, along with better data management, is enabling fuller use of available data and the elimination of what vendors call “dark data”: “data that was collected once, then lost, so that it can’t be re-used for decision-making,” says Moran. Through scientific collaboration, federated scientific content, semantic search, virtual modeling and predictive simulation, pharmaceutical researchers can execute drug discovery strategies more quickly and efficiently, delivering new, better-targeted therapeutic solutions.

Dassault Systèmes BIOVIA, in collaboration with the BioIntelligence Consortium, has developed a comprehensive suite of BioPLM (product lifecycle management) solutions. These solutions, according to Moran, provide a scientifically intelligent environment in which researchers can filter and combine complex data, browse and visualize information from multiple data sources in real time, and perform extensive scientific calculations and advanced analytics to improve the efficiency and effectiveness of drug discovery and development.

Other software tools, such as Agilent’s GeneSpring, also handle the complexity of data behind the scenes by importing data in various formats, managing it internally and offering a common interface for multi-omics workflows designed for ease of use.

“Multi-omic integration is more than just bundling various applications into a product suite,” says Wandycz.

In the case of Agilent’s GeneSpring, workflows for genomics, transcriptomics, proteomics and metabolomics have been developed in parallel, using a common platform and a single user interface. As a result, researchers who incorporate a new technology into their laboratory don’t have to learn an entirely new tool, with new terminology and idioms, in order to analyze data. GeneSpring can also merge results, as in a meta- and multi-omic correlation analysis, or as overlays on metabolic pathway maps. Agilent’s OpenLAB platform provides a set of turnkey solutions that facilitate data management and collaboration.

Thermo Fisher Scientific has recently introduced versioned workflows in its Ion Reporter software. “Customers want to be able to control the software version in hosted Cloud software applications,” says Mike Lelivelt, Senior Director of Bioinformatics, Thermo Fisher Scientific. Versioning lets laboratories adopt new, improved analysis methods in a controlled manner, with tools to help compare results between workflow versions.
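
Ion Reporter’s internals aren’t described in detail here, but the value of versioned workflows is easy to sketch: keep each analysis version callable side by side, run both on the same input and diff the output before switching over. Everything below, workflow functions and variant calls alike, is hypothetical.

```python
# Hypothetical versioned workflows: each version stays callable so labs
# can adopt a new analysis method in a controlled, comparable way.
def call_variants_v1(sample):
    return {"chr1:1000A>G", "chr2:5000C>T"}              # illustrative output

def call_variants_v2(sample):
    return {"chr1:1000A>G", "chr2:5000C>T", "chr7:140453136T>A"}

WORKFLOWS = {"1.0": call_variants_v1, "2.0": call_variants_v2}

def compare_workflows(sample, old="1.0", new="2.0"):
    """Run both pinned versions on the same sample and report the differences."""
    before, after = WORKFLOWS[old](sample), WORKFLOWS[new](sample)
    return {
        "concordant": before & after,
        "only_in_" + old: before - after,
        "only_in_" + new: after - before,
    }

print(compare_workflows("sample-42"))
```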

Rapid upgrades needed
Although tools for bioinformatics continue to grow in number and scientific sophistication, there’s still a shortage of advanced tools that move beyond the genome to cells, tissues and the systems of the body, and outward to the environment. “In addition to the lack of tools capable of systemic modeling and analysis, the industry is still lacking tools that capture the decision-making process,” says Moran.

The rationale behind decisions has been linked to the success of certain therapeutic discovery programs. Some tools are emerging, however, such as ScienceCloud, that mimic popular online social networking tools. “These tools provide and maintain support for metadata surrounding the decision-making process,” says Moran.

Future bioinformatics tools must also tackle the issues of sharing data, repeating analysis, accommodating personal preferences and adapting to change. “Today’s large data sets take too long to pass around the Internet,” says Wandycz. “Future tools could take the analysis to the data, running the same computation on the remote machines where the data is stored and shared.”
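
Here is a minimal sketch of that compute-to-data pattern, assuming passwordless SSH access to a hypothetical host (data-host) where both the data set and an analysis script already live; only the small summary crosses the network, never the data.

```python
import subprocess

# Run the analysis where the data lives; only the (small) result returns.
# "data-host", the script path and the data path are all hypothetical.
result = subprocess.run(
    ["ssh", "data-host",
     "python3", "/opt/pipelines/summarize_variants.py", "/data/cohort1.vcf.gz"],
    capture_output=True, text=True, check=True,
)
print("remote summary:", result.stdout)
```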

Furthering this idea, data alone can’t reproduce results, so future tools should also provide ways of sharing the methods and algorithms that were used. “Personal preferences in bioinformatics workflows should be accommodated, by importing and exporting standard formats using open protocols,” says Wandycz. Because research is inherently unpredictable, future software platforms also need to be flexible enough to adapt to new technologies and methods. This can be achieved by adopting plug-in frameworks and APIs that enable an ecosystem of interconnected software applications and allow users to incorporate, or replace, features using modern scripting languages, as in the sketch below.
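
A plug-in framework of the kind Wandycz describes can be reduced to a registry that lets users add, or replace, analysis steps without touching the platform core. All names in this sketch are illustrative.

```python
# Minimal plug-in registry: the platform core only knows the registry,
# so users can add or swap analysis steps from a scripting language.
PLUGINS = {}

def register(name):
    def wrap(func):
        PLUGINS[name] = func   # re-registering an existing name swaps the feature
        return func
    return wrap

@register("normalize")
def rank_normalize(values):              # illustrative built-in step
    ranked = sorted(values)
    return [ranked.index(v) / (len(values) - 1) for v in values]

@register("normalize")                   # a user plug-in replacing the default
def zscore_normalize(values):
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]

print(PLUGINS["normalize"]([4.0, 8.0, 6.0]))  # runs the replacement plug-in
```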

The future
There are many opportunities to coordinate, aggregate and organize all available data for evidence-based clinical decision support that saves and enhances lives. “This future that we all aspire to will only be made possible by new technologies that enhance interoperability, data standardization and compatibility for future data utilities,” says Bani Asadi. Getting there will require more transparent, reproducible bioinformatics solutions; more open-access, sharable bioinformatics resources and data sets; the integration of disparate organized and systematic procedures; new standards for defining metadata and phenotypic information; and standards for defining associations and for leveraging data aggregation, artificial intelligence and interpretation.

Overall, our understanding of the underlying biology is far from complete, and such understanding is often a prerequisite for engineering applications.
