Intranet Areál web

Fyziologický ústav AV ČR

Špičková věda pro zdraví

Chemoinformatics resources


We run local instances of several services focused on Metabolomics, Lipidomics, Statistical analysis and Chemometrics. As the hardware resources are limited, these tools are available only within campus LAN.

All services are running in separate Docker containers:

Contents:

  1. Metaboanalyst - statistical, functional and integrative analysis of metabolomics data
  2. ChemRICH - Chemical Similarity Enrichment Analysis for Metabolomics
  3. CRediT (Contributor Roles Taxonomy) generator
  4. MS-DIAL data processing benchmark

Metaboanalyst - statistical, functional and integrative analysis of metabolomics data

MetaboAnalyst in a container.... from https://github.com/xia-lab/MetaboAnalyst_Docker

# modify Docker file
# prefer Java 8 (Oracle flavor)
...

ENV METABOANALYST_VERSION 4.93
ENV METABOANALYST_LINK https://www.dropbox.com/s/9xo4yy3gzqsvyj9/MetaboAnalyst-4.93.war?dl=0
ENV METABOANALYST_FILE_NAME MetaboAnalyst.war
...
# Build the Dockerfile
docker build -t metab_docker .

# Run the Dockerfile in the interactive mode
docker run -ti --rm --name METAB_DOCKER -p 8080:8080 metab_docker

# Execute R script loading libraries etc. inside the container
Rscript /metab4script.R 

# Deploy Java cargo inside the container
java -jar /opt/payara/payara-micro.jar --deploymentDir /opt/payara/deployments

# Look for running MetaboAnalyst at http://localhost:8080/MetaboAnalyst/

ChemRICH - Chemical Similarity Enrichment Analysis for Metabolomics

  • Local version based on latest git version, modified to run in Ubuntu 20.04 LTS (R 4.0+)
  • Corrected output formats, fixed compatibility issues and added SVG export option
  • Official service website: http://chemrich.fiehnlab.ucdavis.edu
# Dockerfile
FROM opencpu/base
MAINTAINER OK
LABEL Description = "ChemRICH 0.1.1 container"
...
# prepare Ubuntu for compilations as needed
...
# setup Java, jdk contains jre, set PATH
RUN apt-get -y install openjdk-11-jdk
# check where is Java
#RUN update-java-alternatives -l
#RUN java -version
ENV JAVA_HOME="/usr/lib/jvm/java-1.11.0-openjdk-amd64"
ENV PATH $JAVA_HOME/bin:$PATH

# configure Java for R
RUN R CMD javareconf
RUN R -e "install.packages('rJava', repos='http://cran.rstudio.com/')"    

# install all R packages via R script, wait few hours
# XLConnect works with Java 8 till Java 11, no more
ADD install_package.R /install_package.R
RUN Rscript install_package.R

# these packages require special attention
# RCurl needs re-installation if ....rcurl.so... error appears
RUN R -e "install.packages('devtools', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('RCurl', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('unix', repos='http://cran.rstudio.com/')"

# run local installation of the package
COPY ChemRICH_0.1.1.tar.gz /ChemRICH_0.1.1.tar.gz 
RUN R -e "install.packages('ChemRICH_0.1.1.tar.gz', repos = NULL)"

# opencpu needs more time for POST and more memory, upload modified configurations (timelimit.post": 900, etc.)
COPY defaults.conf /usr/local/lib/R/site-library/opencpu/config/defaults.conf
COPY server.conf /etc/opencpu/server.conf

# make sure Java can be found in rApache and other daemons not looking in R ldpaths
# otherwise RJava loading error will appear
RUN echo "/usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/" > /etc/ld.so.conf.d/rJava.conf
RUN /sbin/ldconfig

# add R script if needed in the interactive session
ADD run-opencpu-server.R /run-opencpu-server.R

# start the service
CMD service cron start && /usr/lib/rstudio-server/bin/rserver && apachectl -DFOREGROUND

# just in case
# ENTRYPOINT ["bin/bash"]

Build and run the container

docker run -t -p 80:80 -p 8004:8004 opencpu/rstudio

# help: https://hub.docker.com/r/opencpu/rstudio
# help: https://opencpu.github.io/server-manual/opencpu-server.pdf

# browse
# http://localhost:8004/ocpu/library/ChemRICH/www/
# http://localhost:8004/ocpu/info

local installation of ChemRICH


CRediT (Contributor Roles Taxonomy) generator

This web generator helps to summarize contributions of individual authors and prepare the Author contributions paragraph for a scientific journal


MS-DIAL data processing benchmark

The purpose of this test was to select an appropriate hardware for LC-MS data processing using MS-DIAL 4.20.

Input data

  • Metabolomics profiling using LIMeX5D workflow. Only HILICn part was selected.
  • Q Exactive Plus; R17,500; 714 *.abf files, each ~5 MB
  • Time windows of 6 minutes selected for data processing.
  • Custom MSP metabolomics library, 1.5 GB

HW resources

Machine A - old office computer

  • Intel Core i3 4130 (2 CPU/4 threads)
  • DDR3 8 GB
  • SSD 6 Gb/s
  • Windows 10 Pro 64bit

Machine B - Workstation

  • AMD Ryzen 7 7200X (8 CPU/16 threads)
  • DDR4 64 GB
  • NVMe 4x 8.0 GT/s
  • Windows 10 Pro 64bit

Machine C - Virtual machine (OpenStack)#

  • virtual Intel Xeon-2296 (16 vCPUs)
  • 64 GB
  • QEMU HDD
  • Windows Server 2019

Results

Machine MS-DIAL threads Intensity threshold Library loading... Peak detection... Alignment... Total time Peak spots
A 4 20,000 cps 20 min 90 min 260 min 6.2 hrs 23,840
B 4 20,000 cps 20 min 34 min 37 min 1.5 hrs 23,840
B 8 20,000 cps 20 min 27 min 42 min 1.5 hrs 23,840
B 16 20,000 cps 20 min 23 min 40 min 1.4 hrs 23,840
B 16 20,000 cps 1 min* 23 min 36 min 1.0 hrs 23,840
C 16 20,000 cps 23 min 14 min 46 min 1.4 hrs 23,840
C 16 20,000 cps 1 min* 14 min 45 min 1.0 hrs 23,840
C 16 1,000 cps 23 min 30 min 920 min 16.2 hrs 27,026
  • Processor: More threads means faster data processing, mainly in the middle part. New CPUs beat older versions. Number of available threads per physical CPU is not equivalent to the number of VCPUs as it depends on the virtualization.
  • Memory: Peak memory usage was ~ 25 GB during the alignment phase. The final alignmentResult_2020_6_6_22_21_59.EIC.aef file was ~14 GB. At least 1.5 GB RAM per thread was needed. If RAM amount was limited, memory caching to disk prolonged the processing significantly (not shown, run scratched after 6 hrs).
  • Disk: no major difference was observed between SSD, NVMe, and emulated virtual drive. The final part of data merging run at ~35 Mb/s writing.
  • If no Threadripper workstation is available, virtualization could help. Physical machine B and virtual machine C performance was comparable. However, B was faster in parts where a direct hardware access was needed - and C was faster in parallel processing thanks to dedicated 16vCPUs. A general estimation is that 1 vCPU = 1 Physical CPU Core. However, this is not entirely correct, as the vCPU is made up of time slots across all available physical cores, so in general 1vCPU is actually more powerful than a single core, especially if the physical CPUs have 8 cores. Therefore, AMD workstation with 8 CPUs kept pace with 16vCPUs.
  • When peak picking threshold was set to very low level (1,000 cps), only ~3,200 more spots were found for extra 14.8 hours. Most of the extra spots were unreliable and within the noise level.
  • * The initial Library loading takes a constant amount of time. However, once processed and serialized, an *.msp2 binary file is created. This could be re-used in an independent project as an MSP library that is loaded within a minute. Therefore, the minimal time of analysis was 1.0 hr.
  • # Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Conclusion

Dedicated physical workstation with many CPUs (Threadripper or better) is the best option for processing large datasets with large libraries using MS-DIAL software.