ihop-reach


26 August 2019

Final Report | Work Product Submission

by Rohit R Chattopadhyay

REACH Web Application

An application for users to access biological data extracted from biomedical literature.


Broadly, the project can be divided into the following components:

  1. GraphQL API with MongoDB as database
  2. Web Application
  3. Data Generation pipeline

The source code of the project is hosted on GitHub, and the application has been deployed on servers provided by the University of California San Diego. It can be accessed using the following links:


Table of Contents

  1. Introduction
  2. Technology Stack
  3. Application Programming Interface
      a. GraphQL API
      b. MongoDB Database
  4. Web Application Frontend
  5. Web Server
  6. Database Generation Pipeline
      a. MongoDB Import
      b. Analyzing PubMed XML files
      c. The Pipeline
  7. Docker Images
  8. Pull Requests
  9. Work Left
  10. Important Links
  11. Conclusion

1. Introduction

The project was started as part of Google Summer of Code 2019 under the mentorship of Augustin Luna; Rohit Rajat Chattopadhyay was selected as the student developer for the program.
The application gives its users access to the key sentences (evidence) in biomedical literature that describe molecular interactions. The source of these papers is PubMed Central.

2. Technology Stack

The web application and the GraphQL API are built using JavaScript, while Python 3 is used for the data generation pipeline.

In addition, shell scripting is used in the Docker images.

3. Application Programming Interface

Source Code, Pull Request, Docker Image

Our objective was to let users interact with the extracted information stored in our MongoDB database. We chose GraphQL because it allows users to cherry-pick exactly the information they need; this also reduces the load on our servers and improves efficiency.

Another major motive for choosing GraphQL was the poor performance of the REST API when building our static site with GatsbyJS, a consequence of the large size of our database.

a. GraphQL API

GraphQL is a relatively new technology and has had a positive reception from the community. It allows a user to fetch exactly the information they need from a single endpoint, which reduces both the complexity of the user's code and the payload of the responses, improving efficiency.

Since this is a tool for the bioinformatics community, GraphiQL, an in-browser GraphQL IDE, lets our users interact with the API without having to set up an environment on their own machines.

We have used GraphiQL Explorer as our GraphiQL IDE. Its Explorer feature lets users build queries by simply clicking the required fields.

To improve response times, the responses are gzipped using the compression package.
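
To make the cherry-picking concrete, here is a minimal Python sketch of querying the API. The endpoint URL and the field names (articles, pmc_id, evidence) are assumptions for illustration, not the actual schema of the deployed API.

```python
import requests

# Hypothetical endpoint, for illustration only.
GRAPHQL_ENDPOINT = "https://example.org/graphql"

# The client lists exactly the fields it wants; the server returns
# nothing more, which keeps response payloads small. The field names
# here are hypothetical examples, not the real schema.
query = """
{
  articles(identifier: "TP53") {
    pmc_id
    evidence
  }
}
"""

response = requests.post(GRAPHQL_ENDPOINT, json={"query": query})
response.raise_for_status()
print(response.json()["data"])
```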

b. MongoDB Database

Since the raw data are in JSON format and the number of documents is quite high, we needed a scalable NoSQL database. MongoDB was chosen for its extensive community and its well-tested libraries in both Python and JavaScript.

Our database is named iHOP and consists of the following collections: articles, identifier_mapping, and pubmed (described in the pipeline sections below).
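
As a minimal illustration of working with these collections, the pymongo sketch below connects to the iHOP database; the "identifier" field name is a hypothetical example, not the verified schema.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["iHOP"]  # database name from the report

# Count the imported articles, then look up one mapping entry.
# The "identifier" field name is a hypothetical example.
print(db.articles.count_documents({}))
print(db.identifier_mapping.find_one({"identifier": "TP53"}))
```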

4. Web Application Frontend

Source Code, Pull Request, Docker Image

The purpose of the web application is to present the information in a user-friendly way. Each identifier has its own page listing all the evidence (sentences) extracted from biomedical papers in PubMed Central.

The application is developed using GatsbyJS, a ReactJS-based static site generator with a reputation for building blazing-fast websites and apps.

There are three major components as follows:

5. Web Server

Source Code, Pull Request, Docker Image

To serve the static files generated by GatsbyJS, we use an ExpressJS server running on NodeJS. The responses are gzipped using the compression package.
We have implemented Let’s Encrypt SSL certificates using nginx-proxy-companion.

6. Database Generation Pipeline

a. MongoDB Import

Source Code, Pull Request

Scripts written in Python 3 import all the JSON files into the iHOP.articles collection in MongoDB and create a mapping in the iHOP.identifier_mapping collection:
  1. importJSON.py, for importing JSON into iHOP.articles
  2. mapping.py, for creating the iHOP.identifier_mapping mapping
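
Below is a minimal sketch of what an import script along the lines of importJSON.py could look like; the directory name and the one-document-per-file assumption are illustrative, not taken from the actual script.

```python
import json
from pathlib import Path

from pymongo import MongoClient

# Walk a directory of REACH output files and insert each one into
# the iHOP.articles collection. The directory name is a placeholder;
# the real importJSON.py may differ in detail.
client = MongoClient("mongodb://localhost:27017")
articles = client["iHOP"]["articles"]

docs = [json.loads(p.read_text()) for p in Path("reach_output").glob("*.json")]
if docs:
    articles.insert_many(docs)
print(f"Imported {len(docs)} documents")
```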

b. Analyzing PubMed XML files

Source Code, Pull Request

Analysis of the PubMed XML files is broken down into two steps:
  1. extractor.py traverses the XML files downloaded from the PubMed FTP server and generates CSV files containing the required data.
  2. mongoPubmedImport.py traverses the generated CSV files, creates objects, and imports them into the iHOP.pubmed collection in MongoDB.
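
For illustration, here is a minimal sketch of the first step; the file name and the extracted fields (PMID, year, title) are assumptions, and the real extractor.py may collect different data.

```python
import csv
import gzip
import xml.etree.ElementTree as ET

# Parse one PubMed baseline file and write a few fields to CSV.
# The file name and the chosen fields are examples only.
with gzip.open("pubmed19n0001.xml.gz") as fh:
    root = ET.parse(fh).getroot()

with open("pubmed.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["pmid", "year", "title"])
    for article in root.iter("PubmedArticle"):
        writer.writerow([
            article.findtext(".//PMID"),
            article.findtext(".//PubDate/Year"),
            article.findtext(".//ArticleTitle"),
        ])
```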

c. The Pipeline

Source Code, Pull Request, Docker Image

PubMed Central maintains several archives; each archive is extracted and processed using CLULAB/REACH to produce JSON files as output. These JSON files are then imported into our database using the import scripts described above. The following image shows a basic outline of the process.

Pipeline Flowchart

At present we are processing our first archive file.
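
For orientation, the Python sketch below mirrors the control flow in the flowchart; every helper function in it is a hypothetical placeholder rather than code from the repository.

```python
# High-level outline of the pipeline. Every helper below is a
# hypothetical placeholder, not a real function from the repo.

def download_archive(name):
    """Fetch one PubMed Central archive from the FTP server."""
    ...

def run_reach(archive_path):
    """Extract the archive and run CLULAB/REACH on the papers,
    producing a directory of JSON output files."""
    ...

def import_json(json_dir):
    """Load the REACH JSON output into the iHOP database using
    the import scripts."""
    ...

for name in ["example_archive.tar.gz"]:  # placeholder archive name
    json_dir = run_reach(download_archive(name))
    import_json(json_dir)
```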

7. Docker Images

The application makes extensive use of Docker to run its containers on the server. The following Docker images, hosted on Docker Hub, are used to run the application:
  a. rchattopadhyay/reach-api, for GraphQL API and MongoDB
  b. rchattopadhyay/reach-webapp, for building GatsbyJS static site
  c. rchattopadhyay/reach-webapp-server, for serving the GatsbyJS generated static files
  d. rchattopadhyay/pubmed-ftp-processor, for the pipeline that updates the database with the latest articles in PubMed Central.
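
As a hedged illustration of starting one of these images programmatically, the sketch below uses the Docker SDK for Python rather than the shell scripts the deployment itself relies on; the port mapping is a hypothetical example.

```python
import docker  # Docker SDK for Python (docker-py)

client = docker.from_env()

# Start the GraphQL API + MongoDB image in the background.
# The port mapping is a hypothetical example.
client.containers.run(
    "rchattopadhyay/reach-api",
    detach=True,
    ports={"4000/tcp": 4000},
)
```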

8. Pull Requests

The following Pull Requests contain the work done during Google Summer of Code 2019:

  1. PubMed XML data extraction branch
  2. Docker Image for processing PubMed Central archive files
  3. GraphQL API and MongoDB
  4. Server to serve the static pages
  5. Static Site Generator
  6. MongoDB Import Script
  7. Update master branch with documentation
  8. Add publication_date in literature/pubmed_client.py
  9. Add hypothesis and context.species information in index_cards

9. Work Left

Work on the data-generation pipeline remains. The prototype is ready, and we have successfully tested it on the first archive file. Due to the large size of these files, each one takes at least two weeks to process; with eight such files, sequential processing would take at least sixteen weeks.
Once the pipeline is streamlined, our database will be updated regularly with the latest articles, keeping the application up to date. I hope we can process all the files by the end of the year.

11. Conclusion

In these last three months, I have learned how bioinformatics can help improve lives. The project aims to make it easy for researchers to find molecular interactions, and I hope it becomes a favourite tool of researchers in bioinformatics and related fields.

I never expected to learn so much in such a short time. The project has helped me understand how things work at the production level and the standard of code and documentation that it demands. The program taught me the power of Open Source and why it is important to the community. Thanks to the program and my mentor, I am now confident enough to contribute to repositories I would never even have thought of forking.

I would like to thank my parents and brother for their constant support. I would also like to thank my friend Priti Shaw for her constant support, especially during the application and community bonding periods. I am grateful to William Markuske for providing the computational resources for the project.

A flight cannot stay on course without its captain, and my mentor, Augustin Luna, played that role for this project. He was calm and patient, and he helped me whenever I was stuck. He made some major decisions whose importance I now understand, one of them being scrapping the REST API in favour of the GraphQL API. He is the perfect mentor a student could ask for.

The free sharing and teaching of open source is incompatible with the notion of the solitary genius.
~Golan Levin

tags: gsoc - 2019 - coding period - third phase - final evaluation - final report