by Rohit R Chattopadhyay
An application for users to access biological data extracted from biomedical literature.
Broadly the project can be divided into the following components:
The source code of the project is hosted on GitHub. The application is deployed on servers provided by the University of California San Diego and can be accessed using the following links:
The project started as part of Google Summer of Code 2019 under the mentorship of Augustin Luna. Rohit Rajat Chattopadhyay was selected as the student developer for the program.
The application gives its users access to the key sentences (evidence) in the biomedical literature that describe molecular interactions. The source of these papers is PubMed Central.
The web application and the GraphQL API are built using JavaScript, and the data generation pipeline is written in Python 3.
In addition, shell scripts are used in the Docker images.
Source Code, Pull Request, Docker Image
Our objective was to let users query and use the extracted information stored in our MongoDB database. We chose GraphQL for this because it lets users cherry-pick exactly the information they need. This also reduces the load on our servers, which improves efficiency.
Another major reason for choosing GraphQL was the poor performance of the REST API when building our static site with GatsbyJS, caused by the large size of our database.
GraphQL is a relatively new technology that has been well received by the community. It lets a user request exactly the information they need from a single endpoint. This reduces the complexity of the user’s code and shrinks the response payload, which makes requests more efficient.
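To make the idea of cherry-picking fields from a single endpoint concrete, here is a minimal Python sketch of a client. The endpoint URL, the articles query, and the field names pmc_id and evidence are assumptions made for illustration; the actual schema of the API may differ.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the real URL of the GraphQL API.
GRAPHQL_URL = "http://localhost:8080/graphql"

# Hypothetical query: `articles`, `pmc_id` and `evidence` are assumed names.
# Only the fields listed here would be returned by the server.
QUERY = """
query ($identifier: String!) {
  articles(identifier: $identifier) {
    pmc_id
    evidence
  }
}
"""


def build_request(identifier, url=GRAPHQL_URL):
    """Build (but do not send) a POST request carrying the GraphQL query."""
    payload = {"query": QUERY, "variables": {"identifier": identifier}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(identifier, url=GRAPHQL_URL):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(identifier, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Adding or removing a field in the query string changes what comes back, without any change on the server side.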
Because the tool is aimed at the bioinformatics community, we provide GraphiQL, an in-browser GraphQL IDE, so that users can interact with our API without much effort spent setting up an environment on their machines.
We have used GraphiQL Explorer as our GraphiQL IDE. Its Explorer feature lets users build queries by simply clicking the required fields.
To improve the response time, the responses are gzipped using the compression package.
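As a rough illustration of why gzipping helps, the following Python sketch compresses a made-up JSON payload with the standard gzip module. It does not use the compression package itself, and the sample records are invented for the example.

```python
import gzip
import json


def gzip_json(payload):
    """Serialize `payload` to JSON and return (raw_bytes, gzipped_bytes)."""
    raw = json.dumps(payload).encode("utf-8")
    return raw, gzip.compress(raw)


# Made-up response: many records sharing the same keys and similar text,
# which is the kind of repetition gzip handles well.
sample = [
    {"identifier": f"example:{i}", "evidence": "Protein A binds protein B."}
    for i in range(200)
]
raw, compressed = gzip_json(sample)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```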
Since the raw data are in JSON format and the number of documents is quite high, we needed a scalable database solution. We therefore wanted a NoSQL database, and chose MongoDB for its extensive community and its well-tested libraries for both Python and JavaScript.
Our database is named iHOP and consists of the following collections:
articles, stores all the documents output by CLULAB/REACH after processing files from the PubMed repository.
Indexes:
extracted_information.participant_a.entity_text : 1
extracted_information.participant_b.entity_text : 1
extracted_information.participant_a.identifier : 1, extracted_information.participant_b.identifier : 1
identifier_mapping, stores the mapping between identifier (iden), matched terms (syn), and entity type (typ) among the documents in the articles collection.
iden : 1
pubmed, stores PubMed paper information extracted from the NXML files downloaded from the PubMed FTP.
pmcid : -1
year : -1
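The indexes listed for the three collections could be created with a short script along these lines. This is a sketch, not the project's actual setup code: it assumes a pymongo-style database object, and it assumes that pmcid and year are two separate indexes because they are listed on separate lines.

```python
# Index specifications copied from the lists above: (collection, keys).
# 1 means ascending and -1 descending, following MongoDB's convention.
INDEXES = [
    ("articles", [("extracted_information.participant_a.entity_text", 1)]),
    ("articles", [("extracted_information.participant_b.entity_text", 1)]),
    ("articles", [
        ("extracted_information.participant_a.identifier", 1),
        ("extracted_information.participant_b.identifier", 1),
    ]),
    ("identifier_mapping", [("iden", 1)]),
    # Assumption: two single-field indexes rather than one compound index.
    ("pubmed", [("pmcid", -1)]),
    ("pubmed", [("year", -1)]),
]


def ensure_indexes(db):
    """Create every index on `db`, a pymongo-style database object.

    Returns the values reported by each create_index call (index names
    when `db` is a real pymongo database).
    """
    created = []
    for collection_name, keys in INDEXES:
        created.append(db[collection_name].create_index(keys))
    return created
```

With pymongo installed, this could be called as ensure_indexes(MongoClient()["iHOP"]), where the connection details are again an assumption.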
Source Code, Pull Request, Docker Image
The purpose of the web application is to present the information in a user-friendly way. Each identifier has its own page, listing all the evidence (sentences) extracted from the medical papers in PubMed Central.
The application is developed using GatsbyJS, a ReactJS-based static site generator. GatsbyJS has a reputation for building very fast websites and apps.
There are three major components as follows:
Programmatically create pages
Static pages are generated from the MongoDB database, using the GraphQL API as the interface. The list of unique identifiers is fetched, and a page is created for each identifier from a template using the createPages method here.
Sentence Highlighting
This component highlights the keywords present in the sentences. The code is available here.
Please note that UAZID identifiers are ignored, as they do not have a proper mapping.
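The general idea of keyword highlighting can be sketched in a few lines of Python. This is not the component's actual code (which is written in JavaScript); the function name highlight and the choice of mark tags are assumptions for illustration, and the UAZID filtering is not covered.

```python
import html
import re


def highlight(sentence, keywords):
    """Wrap each keyword occurrence in <mark> tags, ignoring case."""
    # Longest keywords first, so a longer term wins over a shorter one
    # that happens to be its prefix.
    terms = sorted({k for k in keywords if k}, key=len, reverse=True)
    if not terms:
        return html.escape(sentence)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    parts = []
    last = 0
    for match in pattern.finditer(sentence):
        # Escape the text between matches, then wrap the match itself.
        parts.append(html.escape(sentence[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(sentence[last:]))
    return "".join(parts)


print(highlight("MDM2 binds p53.", ["p53", "MDM2"]))
```

The sketch matches substrings anywhere in the sentence; a production version would probably also respect word boundaries.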
Search and Typeahead
For the search functionality we have used Lunr. We index the Matches for entity_name here. The search feature has been implemented here.
The typeahead feature is implemented by adding a wildcard to the search term, and it tolerates one typographical error in the searched term. The implementation can be found here.
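The behaviour described above (prefix matching through a wildcard, plus tolerance for a single typo) can be illustrated in plain Python. This sketch does not use Lunr and is not how Lunr works internally; it swaps in a simple prefix check and a Levenshtein edit distance, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def typeahead(term, names, max_typos=1):
    """Return names that start with `term` or are within `max_typos` edits."""
    term = term.lower()
    results = []
    for name in names:
        candidate = name.lower()
        if candidate.startswith(term) or edit_distance(term, candidate) <= max_typos:
            results.append(name)
    return results


names = ["TP53", "TP63", "MDM2", "BRCA1"]
print(typeahead("tp", names))     # prefix match
print(typeahead("brcb1", names))  # one typo
```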
Source Code, Pull Request, Docker Image
To serve the static files generated by GatsbyJS, we are using ExpressJS, which runs on NodeJS. The responses are gzipped using the compression package.
We have set up SSL certificates from Let’s Encrypt using nginx-proxy-companion.
Scripts written in Python 3 import all the JSON files into the iHOP.articles collection in MongoDB and create a mapping in the iHOP.identifier_mapping collection.
importJSON.py, for importing JSON into iHOP.articles
mapping.py, for creating the mapping in iHOP.identifier_mapping
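The mapping step can be pictured as grouping matched terms and entity types by identifier. The sketch below is an assumption about how such a mapping could be built, not the contents of mapping.py: the entity_type field name and the choice to store syn and typ as lists are guesses, while iden, syn, typ, entity_text and identifier come from the collection descriptions above.

```python
from collections import defaultdict


def build_mapping(articles):
    """Group matched terms (syn) and entity types (typ) by identifier (iden)."""
    grouped = defaultdict(lambda: {"syn": set(), "typ": set()})
    for article in articles:
        info = article.get("extracted_information") or {}
        for side in ("participant_a", "participant_b"):
            participant = info.get(side) or {}
            identifier = participant.get("identifier")
            if not identifier:
                continue  # nothing to map without an identifier
            if participant.get("entity_text"):
                grouped[identifier]["syn"].add(participant["entity_text"])
            # Assumption: the entity type lives in an `entity_type` field.
            if participant.get("entity_type"):
                grouped[identifier]["typ"].add(participant["entity_type"])
    return [
        {"iden": iden, "syn": sorted(v["syn"]), "typ": sorted(v["typ"])}
        for iden, v in sorted(grouped.items())
    ]
```

The returned list of documents could then be written to iHOP.identifier_mapping with a bulk insert.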
Analysis of the PubMed XML files is broken down into two steps:
1. extractor.py is used to traverse the XML files downloaded from the PubMed FTP and generate a CSV containing our required data.
2. mongoPubmedImport.py traverses the generated CSV files, creates objects, and imports them into the iHOP.pubmed collection in MongoDB.
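The first of the two steps above can be sketched with the Python standard library. This is an illustration only, not the contents of extractor.py: the sample document is made up, and the choice of fields (pmcid, title, year) is an assumption, with pmcid and year taken from the index list of the pubmed collection.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed article in the JATS/NXML layout.
SAMPLE_NXML = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmc">123456</article-id>
      <title-group>
        <article-title>A sample title</article-title>
      </title-group>
      <pub-date><year>2019</year></pub-date>
    </article-meta>
  </front>
</article>"""

FIELDS = ["pmcid", "title", "year"]


def extract_record(xml_text):
    """Pull the assumed fields out of one NXML document."""
    root = ET.fromstring(xml_text)
    meta = root.find("./front/article-meta")
    title_element = meta.find("./title-group/article-title")
    # itertext() also collects text nested in inline markup such as <italic>.
    title = "".join(title_element.itertext()).strip() if title_element is not None else ""
    return {
        "pmcid": meta.findtext("./article-id[@pub-id-type='pmc']"),
        "title": title,
        "year": meta.findtext("./pub-date/year"),
    }


def write_csv(records, handle):
    """Write the extracted records as CSV rows with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()
write_csv([extract_record(SAMPLE_NXML)], buffer)
print(buffer.getvalue())
```

The second step would read these rows back with csv.DictReader and insert them into iHOP.pubmed.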
Source Code, Pull Request, Docker Image
PubMed Central maintains several archives. Each archive is to be extracted and processed using CLULAB/REACH to produce JSON files, which are then imported into our database using the import scripts. The following image shows a basic outline of the process.
At present we are processing our first archive file.
The application makes extensive use of Docker to run its containers on the server. The following Docker images, hosted on Docker Hub, are used to run the application:
a. rchattopadhyay/reach-api, for the GraphQL API and MongoDB
b. rchattopadhyay/reach-webapp, for building the GatsbyJS static site
c. rchattopadhyay/reach-webapp-server, for serving the static files generated by GatsbyJS
d. rchattopadhyay/pubmed-ftp-processor, for the pipeline that updates the database with the latest articles in PubMed Central
The following pull requests contain the work done during Google Summer of Code 2019:
Work on the data-generation pipeline remains. The prototype is ready, and we have successfully tested it on the first archive file. Due to the large size of these files, each one takes at least two weeks to process. We have eight such files, so the process is time-consuming.
Once the pipeline is streamlined, our database will be regularly updated with the latest articles, thus keeping our application up to date. I hope that we can process all the files by the year-end.
In these last three months, I learned how bioinformatics can help to improve lives. The project aims to make it easy for researchers to find molecular interactions. I hope the tool becomes a favourite of researchers in bioinformatics and related fields.
I never expected to learn so much in such a short time. The project has helped me to understand how things work at the production level, and the level of code and documentation it demands. The program taught me the power of open source and why it is important to the community. Thanks to the program and my mentor for making me confident enough to contribute to repositories that I would never have thought of forking.
I would like to thank my parents and brother for their constant support. I would also like to thank my friend, Priti Shaw, for constantly supporting me, especially during the application and community bonding period. I am grateful to William Markuske for providing the computational resources for the project.
No flight can fly in the right direction without its captain, and my mentor, Augustin Luna, played that role for the project. He was calm and patient, and helped me whenever I was stuck. He made some major decisions whose importance I now understand, one of them being scrapping the REST API in favour of the GraphQL API. He is the perfect mentor a student can get.
tags: gsoc - 2019 - coding period - third phase - final evaluation - final report
The free sharing and teaching of open source is incompatible with the notion of the solitary genius.
~Golan Levin