26 August 2019

Final Report | Work Product Submission

by Rohit R Chattopadhyay

REACH Web Application

An application for users to access biological data extracted from biomedical literature.

Broadly the project can be divided into the following components:

GraphQL API with MongoDB as database
Web Application
Data Generation pipeline

The source code is hosted on GitHub, and the project is deployed on servers provided by the University of California San Diego. It can be accessed at the following links:

GraphQL API: https://reach-api.nrnb-docker.ucsd.edu
Web Application: https://reach.nrnb-docker.ucsd.edu

Introduction
Technology Stack
Application Programming Interface
a. GraphQL API
b. MongoDB Database
Web Application Frontend
Web Server
Database Generation Pipeline
a. MongoDB Import
b. Analyzing PubMed XML files
c. The Pipeline
Docker Images
Pull Requests
Work Left
Important Links
Conclusion

1. Introduction

The project began as part of Google Summer of Code 2019 under the mentorship of Augustin Luna, with Rohit Rajat Chattopadhyay selected as the student developer.
The application gives its users access to the key sentences (evidence) in the biomedical literature that describe molecular interactions. The papers are sourced from PubMed Central.

2. Technology Stack

The web application and the GraphQL API are built with JavaScript; the data generation pipeline is written in Python3.

Web Application
- GatsbyJS, for frontend
- NodeJS and ExpressJS, for serving the static files
Application Programming Interface
- GraphQL, as web service
- MongoDB, as database
Data Generation pipeline
- Python3

Other than these, shell scripting has been used in the Docker images.

3. Application Programming Interface

Source Code, Pull Request, Docker Image

Our objective was to let users query and use the extracted information stored in our MongoDB database. We chose GraphQL because it allows users to cherry-pick exactly the information they need. This also reduces the load on our servers, improving efficiency.

Another major motive for choosing GraphQL was the poor performance of the REST API when building our static site with GatsbyJS, caused by the large size of our database.

a. GraphQL API

GraphQL is a relatively new technology that has been well received by the community. It lets a user request exactly the information they need from a single endpoint, which reduces the complexity of the user’s code and the payload of the responses, thus increasing efficiency.
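As a rough illustration of the single-endpoint idea, the sketch below builds a GraphQL request in Python and posts it to the API. Only the host name comes from this report. The `/graphql` path, the root field `articles`, its `identifier` argument and the example identifier are assumptions made for illustration, so the real schema should be checked in GraphiQL before relying on them.

```python
import json
import urllib.request

# Assumption: the API accepts standard GraphQL-over-HTTP POST requests at this
# path; only the host name comes from the report.
API_URL = "https://reach-api.nrnb-docker.ucsd.edu/graphql"

# Hypothetical query: the root field `articles` and its `identifier` argument
# are illustrative. The nested field names mirror the document structure the
# report describes for the articles collection.
QUERY = """
query Evidence($identifier: String!) {
  articles(identifier: $identifier) {
    extracted_information {
      participant_a { entity_text identifier }
      participant_b { entity_text identifier }
    }
  }
}
"""


def build_payload(identifier: str) -> bytes:
    """Serialize the query and its variables as a JSON request body."""
    body = {"query": QUERY, "variables": {"identifier": identifier}}
    return json.dumps(body).encode("utf-8")


def fetch_evidence(identifier: str) -> dict:
    """POST the query to the single endpoint and decode the JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=build_payload(identifier),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A caller would run something like `fetch_evidence("uniprot:P04637")` and receive only the fields named in the query, which is what keeps the response payload small.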

Because this is a tool for the BioInformatics community, we provide GraphiQL, an in-browser GraphQL IDE, so that our users can interact with our API without much effort spent setting up an environment on their machines.

We have used GraphiQL Explorer as our GraphiQL IDE. Its Explorer feature lets users build queries by simply clicking the required fields.

To improve the response time, the responses are gzipped using the compression package.
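The server does this with the Node.js compression package. The Python sketch below is not that code; it only illustrates why gzipping helps for JSON responses, using a made-up response body whose records are hypothetical.

```python
import gzip
import json

# Hypothetical response body: many records sharing the same keys, which is
# typical of JSON API responses and compresses well.
records = [
    {"entity_text": f"protein-{i}", "identifier": f"example:{i}", "sentence": "A binds B."}
    for i in range(500)
]
raw = json.dumps({"data": {"articles": records}}).encode("utf-8")

compressed = gzip.compress(raw)

# The client reverses the compression before parsing, so no data is lost.
assert gzip.decompress(compressed) == raw

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```

The repeated keys are what make the compressed body much smaller than the raw one; the exact ratio depends on the data.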

b. MongoDB Database

Since the raw data are in JSON format and the number of documents is very large, we needed a scalable database. We therefore wanted a NoSQL database, and MongoDB was chosen for its extensive community and its well-tested libraries for both Python and JavaScript.

Our database is named iHOP and consists of the following collections:

articles, stores all the documents output by CLULAB/REACH after processing files from the PubMed repository. Indexes:
- entityNameA : "extracted_information.participant_a.entity_text" : 1
- entityNameB : "extracted_information.participant_b.entity_text" : 1
- identifier ($text) : "extracted_information.participant_a.identifier" : 1, "extracted_information.participant_b.identifier" : 1
identifier_mapping, stores the mapping between identifier (iden), matched terms (syn) and entity type (typ) among the documents in the articles collection. Indexes:
- identifier : iden : 1
pubmed, stores PubMed paper information extracted from NXML files downloaded from the PubMed FTP. Indexes:
- pmc_index : pmcid : -1
- year : year : -1
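The index list above could be recreated with a short script along the following lines. This is a sketch rather than the project's own setup code. It assumes a pymongo-style database object, for example `MongoClient()["iHOP"]`, whose collections expose `create_index`, and it takes the ascending key form for the identifier index even though the list labels it `$text`.

```python
# Index definitions transcribed from the list above:
# collection -> index name -> list of (field, direction) pairs.
ASCENDING, DESCENDING = 1, -1

INDEXES = {
    "articles": {
        "entityNameA": [("extracted_information.participant_a.entity_text", ASCENDING)],
        "entityNameB": [("extracted_information.participant_b.entity_text", ASCENDING)],
        # The report labels this index "($text)" but lists ascending keys;
        # the ascending form is used here as an assumption.
        "identifier": [
            ("extracted_information.participant_a.identifier", ASCENDING),
            ("extracted_information.participant_b.identifier", ASCENDING),
        ],
    },
    "identifier_mapping": {"identifier": [("iden", ASCENDING)]},
    "pubmed": {
        "pmc_index": [("pmcid", DESCENDING)],
        "year": [("year", DESCENDING)],
    },
}


def ensure_indexes(database) -> list:
    """Create every index on `database` and return the values create_index gave back."""
    created = []
    for collection_name, indexes in INDEXES.items():
        collection = database[collection_name]
        for index_name, keys in indexes.items():
            # With MongoDB, creating an identical existing index is a no-op.
            created.append(collection.create_index(keys, name=index_name))
    return created
```

Keeping the definitions in one dictionary makes it easy to compare them with the list above and to rerun the script after a fresh import.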

4. Web Application Frontend

Source Code, Pull Request, Docker Image

The purpose of the web application is to present the information in a user-friendly way. Each identifier has its own page listing all the evidence (sentences) extracted from the medical papers in PubMed Central.

The application is developed using GatsbyJS, a ReactJS-based static site generator with a reputation for building very fast websites and apps.

There are three major components as follows:

Programmatically create pages
Static pages are generated from the MongoDB database, using the GraphQL API as the interface. The list of unique identifiers is fetched, and a page is created for each identifier from a template using the createPages method here.
Sentence Highlighting
This component highlights the keywords present in the sentences. Code is available here.
Please note that UAZID identifiers are ignored as they do not have a proper mapping.
Search and Typeahead
For search functionality we have used Lunr. We index the matches for entity_name here. The search feature has been implemented here.
The typeahead feature has been implemented by adding a wildcard, and it tolerates one typographical error in the search term. The implementation can be found here.
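To make the typeahead behaviour concrete, here is a small Python sketch of the matching rule described above: a term matches if the query is a prefix of it (the wildcard) or if the two differ by at most one edit. This is not Lunr's algorithm and not the project's code. The function names and sample terms are hypothetical, and an edit here means one insertion, deletion or substitution.

```python
def within_one_edit(a: str, b: str) -> bool:
    """Return True if a and b differ by at most one insertion, deletion or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]  # exactly one extra character in the longer string


def typeahead(query: str, terms: list) -> list:
    """Return the indexed terms that match `query` as a prefix or within one edit."""
    needle = query.lower()
    matches = []
    for term in terms:
        candidate = term.lower()
        if candidate.startswith(needle) or within_one_edit(needle, candidate):
            matches.append(term)
    return matches


terms = ["TP53", "TP63", "BRCA1", "BRCA2"]
print(typeahead("brc", terms))    # prefix match: ['BRCA1', 'BRCA2']
print(typeahead("brcb1", terms))  # one substitution away: ['BRCA1']
```

A linear scan like this is only suitable for illustration; a search library such as Lunr uses an index so that lookups stay fast as the number of terms grows.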

5. Web Server

Source Code, Pull Request, Docker Image

To serve the static files generated by GatsbyJS, we use ExpressJS running on NodeJS. The responses are gzipped using the compression package.
We have set up Let’s Encrypt SSL certificates using nginx-proxy-companion.

6. Database Generation Pipeline

a. MongoDB Import

Source Code, Pull Request

Scripts written in Python3 import all the JSON files into the iHOP.articles collection in MongoDB and create a mapping in the iHOP.identifier_mapping collection.
importJSON.py, for importing JSON to iHOP.articles
mapping.py, for creating the mapping in iHOP.identifier_mapping
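The two steps can be sketched as follows. This is an illustrative outline, not the content of importJSON.py or mapping.py. It assumes each JSON file holds one document, that `extracted_information` is an object with `participant_a` and `participant_b` sub-objects, and that the entity type is stored under `entity_type`; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_documents(directory: str):
    """Yield one parsed document per JSON file found under `directory`."""
    for path in sorted(Path(directory).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def build_mapping(documents) -> list:
    """Collect, per identifier, the entity type and every matched term."""
    mapping = {}
    for document in documents:
        info = document.get("extracted_information", {})
        for role in ("participant_a", "participant_b"):
            participant = info.get(role)
            if not isinstance(participant, dict):
                continue  # skip missing or non-object participants
            identifier = participant.get("identifier")
            if not identifier:
                continue
            entry = mapping.setdefault(
                identifier,
                {"iden": identifier, "typ": participant.get("entity_type"), "syn": set()},
            )
            if participant.get("entity_text"):
                entry["syn"].add(participant["entity_text"])
    # Sets cannot be stored in MongoDB directly, so convert them to sorted lists.
    return [{**entry, "syn": sorted(entry["syn"])} for entry in mapping.values()]


def import_all(directory: str, database) -> None:
    """Insert the documents and their mapping using a pymongo-style database."""
    documents = list(load_documents(directory))
    mapping = build_mapping(documents)
    if documents:
        database["articles"].insert_many(documents)
    if mapping:
        database["identifier_mapping"].insert_many(mapping)
```

Loading every document into memory is fine for a small test set; for the full data set the inserts would need to be batched.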

b. Analyzing PubMed XML files

Source Code, Pull Request

Analysis of PubMed XML files is broken down into two steps:
1. extractor.py traverses the XML files downloaded from the PubMed FTP and generates a CSV containing the data we require.
2. mongoPubmedImport.py traverses the generated CSV files, creates objects and imports them into the iHOP.pubmed collection in MongoDB.
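The first step can be sketched roughly as below. This is not the code of extractor.py. It assumes the NXML files follow the JATS layout used by PubMed Central, and the choice of fields (pmcid, year, title) is a guess based on the indexes of the pubmed collection; the sample article is made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["pmcid", "year", "title"]


def extract_record(nxml_text: str) -> dict:
    """Pull a few metadata fields out of one JATS/NXML article."""
    root = ET.fromstring(nxml_text)
    pmcid = root.findtext(".//article-id[@pub-id-type='pmc']", default="")
    year = root.findtext(".//pub-date/year", default="")
    title_node = root.find(".//article-title")
    # Titles may contain inline markup such as <italic>, so join all text nodes.
    title = "".join(title_node.itertext()).strip() if title_node is not None else ""
    return {"pmcid": pmcid, "year": year, "title": title}


def write_csv(records, handle) -> None:
    """Write the extracted records as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Made-up sample article used only to exercise the functions above.
SAMPLE = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmc">1234567</article-id>
      <title-group>
        <article-title>A <italic>hypothetical</italic> title</article-title>
      </title-group>
      <pub-date pub-type="epub"><year>2019</year></pub-date>
    </article-meta>
  </front>
</article>"""

buffer = io.StringIO()
write_csv([extract_record(SAMPLE)], buffer)
print(buffer.getvalue())
```

The second step would then read each CSV row back with `csv.DictReader` and insert the resulting dictionaries into the iHOP.pubmed collection.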

c. The Pipeline

Source Code, Pull Request, Docker Image

PubMed Central maintains several archives. Each archive is to be extracted and processed using CLULAB/REACH to produce JSON files as output. These JSON files are then imported into our database using the import scripts. The following image shows a basic outline of the process.

Pipeline Flowchart

At present we are processing our first archive file.
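The outline of the process can also be written down as a short Python sketch. It is not the pipeline's actual code. It assumes the archives are gzip-compressed tar files containing .nxml papers, and `run_reach` and `import_json` are hypothetical callables standing in for the CLULAB/REACH run and the import scripts.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, target_dir: str) -> list:
    """Unpack the .nxml members of a tar.gz archive and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".nxml")):
                continue
            # Keep only the file name so a member can never escape target_dir.
            # Files with the same name in different folders would overwrite
            # each other, which a real pipeline would need to handle.
            destination = target / Path(member.name).name
            with archive.extractfile(member) as source:
                destination.write_bytes(source.read())
            extracted.append(destination)
    return extracted


def run_pipeline(archive_path, work_dir, run_reach, import_json):
    """Chain the three stages: extract, process with REACH, import.

    `run_reach` takes the list of paper paths and returns the JSON files it
    produced; `import_json` takes those JSON files and loads them into the
    database. Both are placeholders supplied by the caller.
    """
    papers = extract_archive(archive_path, work_dir)
    json_files = run_reach(papers)
    import_json(json_files)
    return len(papers), len(json_files)
```

Passing the REACH and import stages in as callables keeps the sketch independent of how those tools are actually invoked on the server.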

7. Docker Images

The application relies extensively on Docker to run its containers on the server. The following Docker images, hosted on Docker Hub, are used to run the application:
a. rchattopadhyay/reach-api, for GraphQL API and MongoDB
b. rchattopadhyay/reach-webapp, for building GatsbyJS static site
c. rchattopadhyay/reach-webapp-server, for serving the GatsbyJS generated static files
d. rchattopadhyay/pubmed-ftp-processor, for the pipeline to update the database with the latest articles from PubMed Central.

8. Pull Requests

Following are the Pull Requests containing work done during Google Summer of Code 2019:

9. Work Left

Work on the data-generation pipeline remains. The prototype is ready, and we have successfully tested it on the first archive file. Due to the large size of these files, each one takes at least two weeks to process. We have eight such files, so processing them one after another would take at least sixteen weeks.
Once the pipeline is streamlined, our database will be regularly updated with the latest articles, thus keeping our application up to date. I hope that we can process all the files by the year-end.

10. Important Links

11. Conclusion

Over these last three months, I learned how BioInformatics can help to improve lives. The project aims to make it easy for researchers to find molecular interactions. I hope the tool becomes a favourite of researchers in BioInformatics and related fields.

I never expected to learn so much in such a short time. The project has helped me understand how things work at the production level and the standard of code and documentation it demands. The program taught me the power of Open Source and why it is important to the community. Thanks to the program and my mentor for making me confident enough to contribute to repositories I would never have thought of forking.

I would like to thank my parents and brother for their constant support. I would also like to thank my friend Priti Shaw for constantly supporting me, especially during the application and community bonding period. I am grateful to William Markuske for providing the computational resources for the project.

No flight can head in the right direction without its captain, and my mentor, Augustin Luna, was that captain for the project. He was calm and patient, and helped me whenever I was stuck. He made some major decisions whose importance I now understand, one of them being scrapping the REST API in favour of the GraphQL API. He is the perfect mentor a student can get.

The free sharing and teaching of open source is incompatible with the notion of the solitary genius.
~Golan Levin

tags: gsoc - 2019 - coding period - third phase - final evaluation - final report
