by Rohit R Chattopadhyay
An application for users to access biological data extracted from biomedical literature.
Broadly, the project can be divided into several components, described below.
The project started under Google Summer of Code 2019, mentored by Augustin Luna. Rohit Rajat Chattopadhyay was selected as the student developer under the program.
The application allows its users to access the vital sentences (evidence) present in the biomedical literature describing molecular interactions. The source of these papers is PubMed Central.
In addition, shell scripting has been used in the Docker images.
Our objective was to let users interact with and use the extracted information stored in our MongoDB-based database. We have used GraphQL for this, as it allows users to cherry-pick exactly the information they need. This also reduces the load on our servers, improving efficiency.
Another major motive for choosing GraphQL was the poor performance of the REST API while building our static site with GatsbyJS, due to the large size of our database.
GraphQL is a relatively new technology with a positive reception from the community. It allows a user to get exactly the information they need from a single endpoint, reducing the complexity of the user’s code and the payload of the responses, thus increasing efficiency.
Since this is a tool for the BioInformatics community, GraphiQL, an in-browser GraphQL IDE, allows our users to interact with our API without the effort of setting up an environment on their machines.
We have used GraphiQL Explorer as our GraphiQL IDE. This let us implement the Explorer feature, which lets users build queries by simply clicking the required fields.
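As a sketch of why cherry-picking matters, the following minimal Python example (purely illustrative; the field names loosely mimic our articles collection, and the project's real resolvers are not shown here) selects only the fields a client asks for, the way a GraphQL query does:

```python
# Illustrative sketch, not the project's actual resolver code: a GraphQL-style
# query lets a client pick only the fields it needs from a stored document.
def pick_fields(doc, selection):
    """Return only the requested fields; nested dicts select recursively."""
    result = {}
    for field, sub in selection.items():
        if field not in doc:
            continue
        value = doc[field]
        if sub and isinstance(value, dict):
            result[field] = pick_fields(value, sub)
        else:
            result[field] = value
    return result

# A document shaped loosely like an entry in the articles collection
# (the concrete values are made up for the demonstration).
article = {
    "pmc_id": "PMC123456",
    "evidence": "RAF phosphorylates MEK.",
    "extracted_information": {
        "participant_a": {"entity_text": "RAF", "identifier": "uniprot:P04049"},
        "participant_b": {"entity_text": "MEK", "identifier": "uniprot:Q02750"},
    },
}

# The client asks only for the evidence sentence and the participant names,
# so the response payload stays small.
query = {
    "evidence": None,
    "extracted_information": {
        "participant_a": {"entity_text": None},
        "participant_b": {"entity_text": None},
    },
}
print(pick_fields(article, query))
```

Because the server returns only what was selected, unrequested fields never leave the database layer, which is what keeps the payloads small.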
To improve the response time, the responses are gzipped.
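The benefit of compression is easy to see: JSON responses are highly repetitive text. A quick illustration using Python's standard gzip module (the server itself may use different compression middleware; this only demonstrates the size reduction):

```python
import gzip

# A JSON-like payload repeated many times, standing in for a large API
# response full of similar records.
response_body = ('{"evidence": "RAF phosphorylates MEK.", '
                 '"identifier": "uniprot:P04049"}' * 200).encode("utf-8")

# gzip exploits the repetition, so the compressed body is far smaller.
compressed = gzip.compress(response_body)
print(len(response_body), "->", len(compressed), "bytes")
```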
Our database is named iHOP and consists of the following collections:

- articles: stores all the documents output by CLULAB/REACH after processing files from the PubMed repository. Indexes:
  - "extracted_information.participant_a.entity_text": 1
  - "extracted_information.participant_b.entity_text": 1
  - "extracted_information.participant_a.identifier": 1, "extracted_information.participant_b.identifier": 1
- identifier_mapping: stores the mapping between identifier (iden), matched terms (syn), and entity type (typ) among the documents in articles. Index:
  - "iden": 1
- pubmed: stores PubMed paper information extracted from the NXML files downloaded from the PubMed FTP server. Indexes:
  - "pmcid": -1
  - "year": -1
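For illustration, the indexes listed above could be created with PyMongo roughly as follows. This is a sketch, not the project's actual setup code; only the key/direction pairs are taken from the index list above:

```python
# Index definitions mirroring the ones described for the iHOP database.
# Each entry is a list of (field, direction) pairs; the three-key entry for
# articles includes one compound index.
INDEXES = {
    "articles": [
        [("extracted_information.participant_a.entity_text", 1)],
        [("extracted_information.participant_b.entity_text", 1)],
        [("extracted_information.participant_a.identifier", 1),
         ("extracted_information.participant_b.identifier", 1)],
    ],
    "identifier_mapping": [[("iden", 1)]],
    "pubmed": [[("pmcid", -1)], [("year", -1)]],
}

def create_indexes(db):
    """Create every listed index on the given database handle."""
    for collection, index_list in INDEXES.items():
        for keys in index_list:
            db[collection].create_index(keys)

# Usage (requires a running MongoDB instance and the pymongo package):
# from pymongo import MongoClient
# create_indexes(MongoClient()["iHOP"])
```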
The purpose of the web application is to present the information in a user-friendly way. Each identifier has its own page consisting of all the evidence (sentences) extracted from medical papers present in PubMed Central.
There are three major components as follows:
Programmatically create pages
Static pages are generated from the MongoDB database, with the GraphQL API as the interface. The list of unique identifiers is fetched, and a page is created for each one from a template using the createPages method here.
This component highlights the keywords present in the sentences. The code is available here. Please note that UAZID identifiers are ignored, as they do not have a proper mapping.
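The highlighting step can be sketched as follows. This is an illustrative Python version (the project's actual implementation is in JavaScript); it wraps whole-word, case-insensitive keyword matches in a <mark> tag:

```python
import re

# Sketch of keyword highlighting: wrap each keyword found in an evidence
# sentence in a <mark> tag. \b anchors keep matches to whole words, and
# re.IGNORECASE makes the match case-insensitive.
def highlight(sentence, keywords):
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        sentence = re.sub(pattern,
                          lambda m: "<mark>" + m.group(0) + "</mark>",
                          sentence, flags=re.IGNORECASE)
    return sentence

print(highlight("RAF phosphorylates MEK.", ["RAF", "MEK"]))
```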
Search and Typeahead
For the search functionality we have used Lunr. We index the entity_name here. The search feature has been implemented here.
The typeahead feature has been implemented by adding a wildcard, and it can tolerate one typographical error in the searched term. The implementation can be found here.
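To illustrate the matching behaviour, here is a hedged Python sketch of the idea (the project actually uses Lunr's query syntax in JavaScript): a term matches if it starts with the typed prefix (the wildcard) or is within one edit of the query (the one-typo tolerance):

```python
def within_one_edit(a, b):
    """True if a and b are within Levenshtein distance 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter string
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        edits += 1
        if edits > 1:
            return False
        if len(a) == len(b):
            i += 1
            j += 1   # count the mismatch as a substitution
        else:
            j += 1   # count it as an insertion into the shorter string
    edits += (len(b) - j) + (len(a) - i)  # leftover tail characters
    return edits <= 1

def typeahead(query, terms):
    """Terms that start with the query (wildcard) or are one typo away."""
    query = query.lower()
    return [t for t in terms
            if t.lower().startswith(query) or within_one_edit(query, t.lower())]

print(typeahead("akt", ["AKT1", "AKT2", "EGFR"]))
```

Lunr expresses the same idea declaratively (a trailing wildcard plus an edit distance of 1 on the query term); the sketch above only shows what that tolerance means for matching.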
To serve the static files generated by GatsbyJS, we use ExpressJS on NodeJS. The responses are gzipped.
We have implemented Let’s Encrypt SSL certificates using nginx-proxy-companion.
Scripts are written in Python 3 to import all the JSON files into MongoDB in the iHOP.articles collection and to create a mapping in iHOP.identifier_mapping:
- importJSON.py, for importing the JSON into iHOP.articles
- mapping.py, for creating the mapping in iHOP.identifier_mapping
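Conceptually, the mapping step can be sketched like this. This is an assumption-laden Python illustration, not mapping.py itself: the entity_type field name is a guess, while iden, syn, and typ follow the identifier_mapping collection described above:

```python
# Sketch of the mapping idea: scan the imported articles and build one
# mapping document per identifier, collecting every matched term (syn)
# seen for it along with its entity type (typ).
def build_mapping(articles):
    mapping = {}
    for article in articles:
        info = article.get("extracted_information", {})
        for side in ("participant_a", "participant_b"):
            participant = info.get(side)
            if not participant:
                continue
            iden = participant.get("identifier")
            if not iden:
                continue
            entry = mapping.setdefault(
                iden,
                {"iden": iden, "syn": set(),
                 # "entity_type" is an assumed source field name.
                 "typ": participant.get("entity_type")})
            entry["syn"].add(participant.get("entity_text"))
    return mapping

# Two made-up article documents: the same identifier appears under two
# different surface forms, so both end up in its syn set.
sample_articles = [
    {"extracted_information": {
        "participant_a": {"entity_text": "RAF", "identifier": "uniprot:P04049",
                          "entity_type": "protein"},
        "participant_b": {"entity_text": "MEK", "identifier": "uniprot:Q02750",
                          "entity_type": "protein"}}},
    {"extracted_information": {
        "participant_a": {"entity_text": "Raf-1", "identifier": "uniprot:P04049",
                          "entity_type": "protein"}}},
]
print(build_mapping(sample_articles))
```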
Analysis of the PubMed XML files is broken down into two steps:
- extractor.py traverses the XML files downloaded from the PubMed FTP server and generates a CSV containing our required data.
- mongoPubmedImport.py traverses the generated CSV files, creates objects, and imports them into MongoDB in the iHOP.pubmed collection.
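The first step can be sketched as follows. This is a simplified illustration, not extractor.py itself: real PubMed NXML is far richer, and the tag names in the sample are assumptions:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for one downloaded article file; real NXML has many more
# elements, but the pmcid and year live in similar front-matter tags.
SAMPLE_XML = """
<article>
  <front>
    <article-id pub-id-type="pmcid">PMC123456</article-id>
    <pub-date><year>2019</year></pub-date>
  </front>
</article>
"""

def extract_row(xml_text):
    """Pull (pmcid, year) out of one article's XML."""
    root = ET.fromstring(xml_text)
    pmcid = root.findtext(".//article-id[@pub-id-type='pmcid']")
    year = root.findtext(".//pub-date/year")
    return pmcid, year

def to_csv(rows):
    """Write the extracted rows into CSV text, ready for the import step."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["pmcid", "year"])
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv([extract_row(SAMPLE_XML)]))
```

The second step then only has to read the CSV back and insert one object per row into the pubmed collection.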
PubMed Central maintains several archives; each archive is extracted and processed using CLULAB/REACH to get JSON files as output. These JSON files are then imported into our database using the import scripts. The following image shows a basic outline of the process.
At present we are processing our first archive file.
The application extensively uses Docker to run containers on the server. The following Docker images, hosted on Docker Hub, are used to run the application:
- rchattopadhyay/reach-api, for the GraphQL API and MongoDB
- rchattopadhyay/reach-webapp, for building the GatsbyJS static site
- rchattopadhyay/reach-webapp-server, for serving the GatsbyJS-generated static files
- rchattopadhyay/pubmed-ftp-processor, for the pipeline that updates the database with the latest articles in PubMed Central.
Following are the Pull Requests containing work done during Google Summer of Code 2019:
Work remains on the data-generation pipeline. The prototype is ready, and we have successfully tested it on the first archive file. Due to the large size of these files, each takes at least two weeks to process. We have eight such files, so the process is time-consuming.
Once the pipeline is streamlined, our database will be regularly updated with the latest articles, thus keeping our application up to date. I hope that we can process all the files by the year-end.
In these last three months, I have learned how BioInformatics can help improve lives. The project aims to make it easy for researchers to find molecular interactions, and I hope the tool becomes a favourite of researchers in BioInformatics and related fields.
I never expected to learn so much in such a short time. The project helped me understand how things work at the production level and the quality of code and documentation it demands. The program taught me the power of Open Source and why it is important to the community. Thanks to the program and my mentor for making me confident enough to contribute to repositories I would never have thought of forking.
I would like to thank my parents and brother for their constant support. I would also like to thank my friend, Priti Shaw for constantly supporting me, especially during the application and community bonding period. I am grateful to William Markuske for providing the computational requirements for the project.
A flight cannot stay on course without its captain, and my mentor, Augustin Luna, did the same for this project. He was calm and patient, and helped me whenever I was stuck. He made some major decisions whose importance I now understand, one of them being scrapping the REST API in favour of the GraphQL API. He is the perfect mentor a student could ask for.
tags: gsoc - 2019 - coding period - third phase - final evaluation - final report
The free sharing and teaching of open source is incompatible with the notion of the solitary genius.