Cell Systems

The GenePattern Notebook Environment

Open Archive | Published: August 16, 2017 | DOI: https://doi.org/10.1016/j.cels.2017.07.003

      Highlights

      • We integrated the GenePattern genomics platform with the Jupyter Notebook environment
      • Notebooks interleave text, graphics, and analyses into complete “research narratives”
      • Users can embed genomic analyses into notebooks without the need to write code
      • GenePattern Notebook is freely available at http://www.genepattern-notebook.org

      Summary

      Interactive analysis notebook environments promise to streamline genomics research through interleaving text, multimedia, and executable code into unified, sharable, reproducible “research narratives.” However, current notebook systems require programming knowledge, limiting their wider adoption by the research community. We have developed the GenePattern Notebook environment (http://www.genepattern-notebook.org), to our knowledge the first system to integrate the dynamic capabilities of notebook systems with an investigator-focused, easy-to-use interface that provides access to hundreds of genomic tools without the need to write code.

      Main Text

      The ongoing explosion of “omics” datasets and the promise of scientific discovery arising from their analysis have given rise to software systems that aim to provide easy access to advanced methods for nonprogramming scientists. These “bioinformatics tool aggregation portals,” e.g., Galaxy (Afgan et al., 2016), GenePattern (Reich et al., 2006), and KNIME (Berthold et al., 2009), also provide for the creation and encapsulation of analytic workflows, transparent access to scalable compute resources, and removal of software installation and implementation concerns from the scientific user.
      Alternatively, analysis notebook environments, inspired by the “literate programming” philosophy (Knuth, 1984), integrate the exposition of a scientific project with the associated code. They aim to create an “executable document” that ideally serves as a complete description of a research project and which could also be run to reproduce the author's results. Examples include SWEAVE (Leisch, 2002), Jupyter Notebook (Ragan-Kelley et al., 2014), Beaker (beakernotebook.com), and Zeppelin (zeppelin.apache.org).
      Each of these two types of system brings significant value to its targeted user base yet has limitations that prevent wider adoption. Notebook environments model their interface around the annotation of sections of code, and therefore assume that the user is fluent in a programming language such as Python or R. Bioinformatics tool aggregation portals successfully remove the requirement for coding expertise but to date have had limited ability to incorporate the variety of rich text and media formats required to represent the full scientific narrative surrounding each analysis step.
      We have developed GenePattern Notebook (Figure 1), an environment that integrates the capabilities of both types of system, allowing users to incorporate encapsulated analysis tools, complete with their user-friendly interface, from a bioinformatics aggregation portal into an interactive analysis notebook. The environment is based on two long-standing software projects: the GenePattern platform for integrative genomics and the Jupyter Notebook environment for interactive computing.
      Figure 1. GenePattern Notebook Environment Components
      The GenePattern Notebook environment consists of (A) an online environment, powered by JupyterHub, where users can create, share, and publish GenePattern Notebooks; (B) a GenePattern server that provides hundreds of pre-packaged genomic and machine-learning analyses, all accessible through (C) a Web browser.
      GenePattern (www.genepattern.org), first released in 2004, consists of a repository of hundreds of bioinformatics analysis and visualization methods (“modules”), as well as utilities for data formatting, preprocessing, and other auxiliary functions that provide important “glue” between analysis steps. The user interface is point and click with no programming required. The public GenePattern server, hosted at www.genepattern.org since 2008, has over 40,000 registered users and runs 2,000–5,000 analysis jobs per week. Additional public servers are available at Indiana University (gp.indiana.edu/gp) and the Garvan Institute (pwbc.garvan.org.au/gp). The software has also been downloaded for local installation by over 17,000 bioinformatics core facilities, research laboratories, and individual scientists.
      The Jupyter Notebook environment (www.jupyter.org) provides a laboratory notebook metaphor in which researchers build a step-by-step scientific narrative out of “cells” that interleave code, formatted text, mathematical formulae, plots, and multimedia. The resulting notebooks can be shared, edited, executed, and published as complete encapsulations of in silico research.
      The GenePattern Notebook functionality takes the Jupyter Notebook interface one step further, adding analysis, login, and rich text input components that present the GenePattern interface to provide code-free analysis and visualization (Figure S1). These new cell types interact seamlessly with existing Jupyter cell types. Within a Python code cell, programming users can easily reference analysis results from a previous GenePattern analysis cell, and in a GenePattern analysis cell, programmers can use Python variables as inputs.
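      The two-way exchange between code cells and analysis cells can be pictured as a shared namespace: an analysis cell looks up its inputs by variable name and publishes its result under a name that later code cells can read. The following is a minimal sketch of that idea in plain Python. The names `run_analysis_cell`, `namespace`, and the toy `row_means` analysis are hypothetical, invented for illustration; they are not part of GenePattern Notebook or Jupyter.

```python
def row_means(matrix):
    """Toy 'analysis module' (hypothetical): mean of each row of a list-of-lists."""
    return [sum(row) / len(row) for row in matrix]


def run_analysis_cell(namespace, analysis, input_names, output_name):
    """Resolve inputs by variable name, run the analysis, store the result by name."""
    missing = [name for name in input_names if name not in namespace]
    if missing:
        raise KeyError(f"undefined notebook variables: {missing}")
    inputs = [namespace[name] for name in input_names]
    namespace[output_name] = analysis(*inputs)
    return namespace[output_name]


# A code cell defines a variable...
namespace = {"expression": [[1.0, 3.0], [2.0, 6.0]]}

# ...an analysis cell consumes it by name and publishes its result...
run_analysis_cell(namespace, row_means, ["expression"], "means")

# ...and a later code cell reads that result like any other variable.
highest = max(namespace["means"])
print(highest)
```

      In this sketch the dictionary plays the role of the notebook's variable space; a real notebook kernel manages that space itself, so the lookup-by-name step is the only part meant to carry over.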
      We integrated GenePattern with Jupyter through the use of Jupyter's ipywidgets package, which provides a framework for the creation of new user interface objects within Jupyter Notebooks, and GenePattern's Web services interface, which exposes all of the functionality of GenePattern (e.g., searching for and obtaining module information or querying for the execution status of an analysis) to programmatic access. This combination is a design pattern that has general applicability to the class of Web service-based tools, and the Jupyter development team is incorporating our approach into the currently evolving design of the Jupyter interfaces for graphical input (Dr. Fernando Perez, personal communication, September 26, 2016).
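      The general pattern described here, a front end that drives a Web service by submitting a job and then querying its execution status, can be sketched in a few lines of Python. The sketch is illustrative only: `InMemoryAnalysisService`, `submit`, `status`, and `run_and_wait` are hypothetical names, and the in-memory class stands in for a remote server so that the example runs without a network connection. It does not reproduce the GenePattern Web services interface or the ipywidgets API.

```python
import itertools
import time


class InMemoryAnalysisService:
    """Hypothetical stand-in for a remote analysis server (not the GenePattern API)."""

    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._polls_until_done = polls_until_done

    def submit(self, module, params):
        """Register a job and return its identifier."""
        job_id = next(self._ids)
        self._jobs[job_id] = {"module": module, "params": dict(params), "polls": 0}
        return job_id

    def status(self, job_id):
        """Report 'running' until the job has been polled enough times, then 'finished'."""
        job = self._jobs[job_id]
        job["polls"] += 1
        return "finished" if job["polls"] >= self._polls_until_done else "running"


def run_and_wait(service, module, params, interval=0.0, max_polls=100):
    """Submit a job, then poll its status until it finishes or the poll budget runs out."""
    job_id = service.submit(module, params)
    for _ in range(max_polls):
        if service.status(job_id) == "finished":
            return job_id
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Module name and parameter values below are placeholders, not real GenePattern modules.
service = InMemoryAnalysisService(polls_until_done=3)
job_id = run_and_wait(service, "ExampleModule", {"input.file": "data.gct"})
print(f"job {job_id} finished")
```

      A user interface element would sit in front of `run_and_wait`, collecting the parameter values from form fields and displaying the job state as it changes; keeping the service behind a small submit/status interface is what lets the same front-end logic be reused for other Web service-based tools.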
      To promote the development and dissemination of GenePattern Notebooks with minimal installation requirements, we have released an online GenePattern Notebook repository and workspace where researchers can collaboratively develop and publish notebook documents. It provides a complete Jupyter environment, connections to several GenePattern servers, and for programmers, the common Python packages used in bioinformatics analysis (numpy, pandas, matplotlib, scikit, etc.). We seeded the repository with notebooks that provide commonly used machine-learning methods: clustering, classification, and prediction, as well as dimension reduction and differential expression analysis.
      Those who wish to run the GenePattern Notebook environment on their own compute resources have two options. (1) Nonprogrammers can install the Kitematic Docker application (kitematic.com) and use it to run the GenePattern Notebook Docker image, available on the standard Docker Hub repository (hub.docker.com). This provides a complete, ready-to-run notebook environment with all dependencies preinstalled. (2) Programmers may install the GenePattern Notebook and its dependencies through the pip or conda package manager interfaces.
      To our knowledge, GenePattern Notebook is the first integration of a bioinformatics tool aggregation portal with an analysis notebook environment. This approach benefits nonprogramming and programming investigators alike. For the nonprogrammer, GenePattern Notebook provides the user-friendly GenePattern genomic analysis capabilities within a publishable notebook format. For the programmer already using the Jupyter environment, it affords easy access to the entire GenePattern library of analysis and visualization modules, which can be supplemented with the investigator's own coded routines.
      The GenePattern Notebook environment, along with an introductory demonstration video, documentation, and tutorials, is available at www.genepattern-notebook.org. The software is freely available under a BSD-style open source license.

      STAR★Methods

      Key Resources Table

      REAGENT or RESOURCE | SOURCE | IDENTIFIER
      Software and Algorithms
      GenePattern | Reich et al., 2006 | www.genepattern.org
      Jupyter Notebook environment | Ragan-Kelley et al., 2014 | www.jupyter.org
      Kitematic Docker application | N/A | kitematic.com
      GenePattern Notebook web site and workspace | This paper | www.genepattern-notebook.org

      Contact for Reagent and Resource Sharing

      Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, mmreich@cloud.ucsd.edu.

      Data and Software Availability

      GenePattern Notebook web site and online repository: http://www.genepattern-notebook.org

      Additional Resources

      GenePattern web site: http://www.genepattern.org
      Jupyter Notebook environment: http://www.jupyter.org
      Kitematic web site: https://kitematic.com

      Author Contributions

      Conceptualization: M.R., T.T., P.T., J.P.M.; Software: T.T., T.L.; Writing – Original Draft: M.R., T.T., T.L., H.T., J.P.M.; Writing – Review & Editing: M.R., J.P.M.; Validation: B.H.; Project Administration: M.R., H.T.; Funding Acquisition: J.P.M.

      Acknowledgments

      This work was funded by NIH grants R01-GM074024 and U24-CA194107. We thank Fernando Perez and Brian Granger for their technical advice.

      References

        Afgan E., Baker D., Van den Beek M., Blankenberg D., Bouvier D., Čech M., Chilton J., Clements D., Coraor N., Eberhard C., Grüning B., et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 2016; 44: W3-W10
        Berthold M.R., Cebron N., Dill F., Gabriel T.R., Kötter T., Meinl T., Ohl P., Thiel K., Wiswedel B. KNIME-the Konstanz information miner: version 2.0 and beyond. ACM SIGKDD Explorations Newsletter. 2009; 11: 26-31
        Knuth D.E. Literate programming. Computer J. 1984; 27: 97-111
        Leisch F. Sweave: dynamic generation of statistical reports using literate data analysis. in: Härdle W., Rönz B. Compstat. Physica, 2002: 575-580
        Ragan-Kelley M., Perez F., Granger B., Kluyver T., Ivanov P., Frederic J., Bussonnier M. The Jupyter/IPython architecture: a unified view of computational research, from interactive exploration to communication and publication. in: AGU Fall Meeting Abstracts. Vol. 1. American Geophysical Union, 2014 (H44D-07)
        Reich M., Liefeld T., Gould J., Lerner J., Tamayo P., Mesirov J.P. GenePattern 2.0. Nat. Genet. 2006; 38: 500-501