Tutorial

ToxiVerse tutorial

This page walks through all three ToxiVerse modules step by step.

ToxiVerse tutorial cover
ToxiVerse modules.

Introduction

Computational toxicology plays a significant role in identifying hazardous compounds to protect human health and the environment in a cost-effective manner. A major challenge in this field is the lack of publicly available and user-friendly computational tools that can be used for chemical risk assessment with a user-provided dataset, especially by users with limited computational expertise.

To address this need, we developed Toxicology Universe (ToxiVerse), a web portal designed to assist toxicologists, pharmaceutical researchers, and chemists in assessing chemical safety. The figure below shows the available options.

The main modules and sub-options available in ToxiVerse
The options in ToxiVerse. They are explained in detail with the step-by-step tutorial below.

ToxiVerse offers the following modules:

  • Bioprofiler: Provides chemical descriptors by profiling PubChem bioactivity results for chemicals of interest, once experimental data gaps are filled using machine-learning models built from key assays.
  • Database: Download and visualize curated toxicological datasets. The integrated database includes over 50,000 chemicals across 50 endpoints, mostly related to toxicity, compiled from various sources.
  • Cheminformatics: Create QSAR models using either user-uploaded datasets or datasets retrieved from PubChem by providing an Assay ID. A variety of molecular descriptors and machine learning algorithms are supported. Predict toxicity for new chemicals using the models developed. Options for chemical curation and space visualization are also provided.
Available options under Database and Cheminformatics modules
The three main modules, plus the available options under the Database and Cheminformatics modules.

Bioprofiler

Users can upload up to 500 chemicals to profile them against PubChem bioactivity data. Bioprofiler returns chemical descriptors derived from the PubChem bioactivity results for the chemicals of interest, after gaps in the experimental data are filled using machine-learning models. Models are built for each of the 35 key assays that has at least 100 active and 100 inactive chemicals.

Bioprofiler upload screen with PubChem Compound ID sample file
Upload a file with PubChem Compound IDs, for example Bioprofile_sample_dataset.txt, and click the start button.

A sample file for Bioprofiler contains one PubChem CID per line:

4091
2548
5417
1254
1493
1981
14403
2141
3061
6049
3467
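Before uploading, an input file of this shape can be checked locally. The following is a minimal sketch, not part of ToxiVerse: the function name `read_cids` is hypothetical, and dropping repeated CIDs is an assumption made here. Only the one-CID-per-line layout and the 500-chemical limit come from this tutorial.

```python
MAX_CHEMICALS = 500  # Bioprofiler upload limit stated in this tutorial


def read_cids(lines):
    """Return unique PubChem CIDs from an iterable of text lines, in input order."""
    cids = []
    seen = set()
    for line_number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # ignore blank lines
        if not text.isdecimal():
            raise ValueError(f"line {line_number}: {text!r} is not a numeric CID")
        cid = int(text)
        if cid not in seen:  # assumption: repeated CIDs are dropped
            seen.add(cid)
            cids.append(cid)
    if len(cids) > MAX_CHEMICALS:
        raise ValueError(f"{len(cids)} CIDs exceed the {MAX_CHEMICALS}-chemical limit")
    return cids


sample = ["4091", "2548", "5417", "", "4091"]
print(read_cids(sample))  # [4091, 2548, 5417]
```

To check a real file, pass an open file handle, for example `read_cids(open("Bioprofile_sample_dataset.txt"))`.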
Bioprofiler results available to download.
Initial Bioprofile matrix with assays as columns and chemicals as rows
Initial Bioprofile exports the bioprofile matrix with chemicals as rows and bioassays as columns. Activity values are encoded as 1 for Active or Probe, -1 for Inactive, and 0 for Inconclusive, Unspecified, or empty.

The values of inconclusive data, shown as 0s in the Initial Bioprofile, are replaced with either 1 or -1 in the Complete Bioprofile.
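The encoding described above can be sketched in a few lines. This is an illustration only: the function names, the `(cid, aid)` dictionary layout, and the placeholder assay labels `AID_A` and `AID_B` are hypothetical, and mapping any unrecognized outcome string to 0 is an assumption.

```python
# Outcome encoding as described in this tutorial.
ENCODING = {
    "active": 1,
    "probe": 1,
    "inactive": -1,
    "inconclusive": 0,
    "unspecified": 0,
}


def encode_outcome(outcome):
    """Map a PubChem activity outcome string to 1, -1, or 0."""
    if outcome is None:
        return 0  # empty cell: no data for this chemical/assay pair
    # Assumption: unrecognized strings are treated like missing data.
    return ENCODING.get(outcome.strip().lower(), 0)


def build_matrix(records, chemicals, assays):
    """Build a matrix with chemicals as rows and assays as columns.

    records maps (cid, aid) pairs to outcome strings (hypothetical layout).
    """
    return [
        [encode_outcome(records.get((cid, aid))) for aid in assays]
        for cid in chemicals
    ]


# Made-up outcomes for two CIDs from the sample file; not real assay results.
records = {
    (4091, "AID_A"): "Active",
    (4091, "AID_B"): "Inconclusive",
    (2548, "AID_A"): "Inactive",
}
print(build_matrix(records, [4091, 2548], ["AID_A", "AID_B"]))  # [[1, 0], [-1, 0]]
```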

Clustered heatmap of the bioprofile matrix
Heatmap: a clustered heatmap of the bioprofile matrix to spot activity patterns across assays and chemicals. It supports up to 2000 assay columns.
Model metrics table for Random Forest models
Model Metrics: detailed performance statistics including accuracy, precision, recall, F1-score, and ROC AUC for Random Forest models on top informative assays.
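For readers who want to reproduce the threshold-based metrics from a set of predictions, the sketch below computes accuracy, precision, recall, and F1-score from true and predicted labels. It is not the ToxiVerse implementation. The function name is hypothetical, the label 1 is treated as the positive (active) class, and ROC AUC is left out because it needs predicted scores rather than class labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1, treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t != 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    tn = sum(1 for t, p in pairs if t != 1 and p != 1)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels using the 1 / -1 encoding of the bioprofile matrix.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, 1, 1]
print(classification_metrics(y_true, y_pred))
```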

The top assays were selected by Mutual Information (MI) score, which measures how informative an assay's outcomes are about the overall activity of the chemicals. First, the bioactivity data collected from PubChem assays for the target chemicals were transformed into a chemical-bioactivity matrix, the Initial Bioprofile. For the MI calculation, active values were encoded as 1 and inactive or inconclusive values as 0. Next, each chemical was assigned a binary overall activity label, defined as active if the chemical was active in at least one assay. Finally, the MI between the activity outcome of each assay and the overall activity label was calculated.

An assay achieves a high MI score when both its actives and inactives closely match the overall activity labels of as many chemicals as possible: the more consistently an assay distinguishes overall-active from overall-inactive chemicals, the higher its MI. From each selected assay, a maximum of 500 active and 500 inactive chemicals were collected to build a QSAR model; an assay was selected only if it had at least 100 chemicals in each class.
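The MI ranking can be illustrated with a small pure-Python sketch. It follows the three steps described above on a binary matrix (1 for active, 0 otherwise), using the standard discrete MI formula with natural logarithms. The function names and the toy matrix are hypothetical, and the actual ToxiVerse implementation may differ in its details.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Discrete mutual information (in nats) between two equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_assays_by_mi(matrix):
    """Return (assay column index, MI) pairs, highest MI first.

    matrix has chemicals as rows and assays as columns, with 1 = active
    and 0 = inactive or inconclusive.
    """
    # Overall label: active if the chemical is active in at least one assay.
    overall = [1 if any(row) else 0 for row in matrix]
    scores = []
    for column_index in range(len(matrix[0])):
        column = [row[column_index] for row in matrix]
        scores.append((column_index, mutual_information(column, overall)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy matrix: 4 chemicals x 3 assays (made-up values).
toy = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
for column_index, score in rank_assays_by_mi(toy):
    print(column_index, round(score, 4))
```

In the toy matrix, the first assay reproduces the overall label exactly and therefore ranks first, while the third assay is inactive for every chemical and carries no information.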

Box plot summarizing Bioprofiler model performance
Metrics Plot: box plot summarizing model performance for quick comparison.
Complete Bioprofile matrix with imputed bioactivity values
Complete Bioprofile returns the complete bioprofile matrix with missing bioactivity values imputed using trained models.

Entries that were 0 in the Initial Bioprofile appear in the Complete Bioprofile as model-predicted values of either 1 or -1.

Database

Toxicity Datasets

The Toxicity Datasets option allows users to download and visualize curated toxicological datasets, including chemical space and endpoint distributions, and provides relevant bioassays from PubChem for the selected endpoints. The dataset contains over 50,000 chemical records across 50 endpoints mainly related to toxicity, collected from various sources.

Toxicity Datasets screen with PCA plot, histogram, relevant bioassays, and CSV export
Select an endpoint to display the PCA plot, histogram, and relevant bioassays. The selected dataset can also be exported to a CSV file.

Bioassays are ranked by active rate for the selected endpoint, calculated as Active / (Active + Inactive). Only assays with Active + Inactive ≥ 500 are included. To stabilize the scores of small-sample assays, Bayesian-adjusted scores are used, and up to the top 500 assays per endpoint are retained. Inconclusive results are shown but excluded from the calculations.
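The ranking logic can be sketched as follows. The active-rate formula, the 500-result threshold, and the top-500 cutoff come from the text above. The Bayesian adjustment shown here is an assumption, since this tutorial does not specify the formula: each rate is shrunk toward the pooled active rate with a hypothetical prior strength of 100 pseudo-observations. All function and parameter names are illustrative.

```python
def active_rate(active, inactive):
    """Raw active rate: Active / (Active + Inactive)."""
    return active / (active + inactive)


def adjusted_rate(active, inactive, prior_rate, prior_strength=100.0):
    """Assumed adjustment: shrink the raw rate toward prior_rate."""
    return (active + prior_rate * prior_strength) / (active + inactive + prior_strength)


def rank_assays(counts, min_tested=500, top_n=500, prior_strength=100.0):
    """Rank assays by adjusted active rate.

    counts maps an assay ID to an (active, inactive) tuple; inconclusive
    results are assumed to have been left out already.
    """
    eligible = {aid: c for aid, c in counts.items() if c[0] + c[1] >= min_tested}
    if not eligible:
        return []
    # Prior: pooled active rate over all eligible assays (an assumption).
    total_active = sum(a for a, _ in eligible.values())
    total_tested = sum(a + i for a, i in eligible.values())
    prior_rate = total_active / total_tested
    scored = [
        (aid, adjusted_rate(a, i, prior_rate, prior_strength))
        for aid, (a, i) in eligible.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Made-up counts with placeholder assay labels.
counts = {"AID_A": (400, 100), "AID_B": (50, 950), "AID_C": (10, 5)}
print(rank_assays(counts))
```

`AID_C` is dropped because it has fewer than 500 active-plus-inactive results.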

Resources

The Resources option provides details of the available datasets, including references.

First rows of the Resources table, including hyperlinks.

Cheminformatics

The Cheminformatics module supports uploading or retrieving datasets, curation, chemical space visualization, QSAR model building, and QSAR prediction.

  • Upload or Retrieve dataset
  • Curator
  • Chemical Space Visualization
  • QSAR Builder
  • QSAR Predictor

Upload or Retrieve dataset

Datasets can be uploaded in Comma-Separated Values (CSV) or Structure-Data File (SDF) format. Sample files are provided. You can upload up to 2000 chemicals or retrieve up to 1000 chemicals. Instead of uploading a dataset, you may also import chemicals with structure-activity information from PubChem by entering the PubChem Assay Identifier (AID).

Upload or Retrieve dataset overview.
Upload dataset form with format, file, CID, activity, SMILES, and dataset type fields
Select a format, upload a file such as sample_dataset.csv, enter the CID, activity, and SMILES column names, choose the dataset type, and click Upload dataset.
Please check the sample files provided. Upload accepts up to 2000 chemicals.
A sample file for Upload Dataset.
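A CSV file can be checked locally before upload with a short script such as the sketch below. The column names `CID`, `Activity`, and `SMILES` are placeholders for whatever names your file uses, the function name is hypothetical, and the example rows contain illustrative activity values rather than real data. Only the 2000-chemical limit comes from this tutorial.

```python
import csv
import io

MAX_UPLOAD = 2000  # upload limit stated in this tutorial


def check_upload(csv_text, cid_col, activity_col, smiles_col):
    """Check that the named columns exist and the row count is within the limit."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in (cid_col, activity_col, smiles_col) if name not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    rows = list(reader)
    if len(rows) > MAX_UPLOAD:
        raise ValueError(f"{len(rows)} rows exceed the {MAX_UPLOAD}-chemical limit")
    # File line numbers (header is line 1) of rows with an empty SMILES cell.
    empty_smiles = [
        line for line, row in enumerate(rows, start=2)
        if not (row[smiles_col] or "").strip()
    ]
    return {"rows": len(rows), "rows_missing_smiles": empty_smiles}


# Illustrative rows only; the activity values are made up.
example = "CID,Activity,SMILES\n702,0,CCO\n241,1,c1ccccc1\n"
print(check_upload(example, "CID", "Activity", "SMILES"))
# {'rows': 2, 'rows_missing_smiles': []}
```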
Uploaded dataset display with download and remove actions
Uploaded files are displayed with options to download or remove the selected dataset.
Retrieve dataset from PubChem by entering an AID
Enter an AID and click to retrieve the dataset from the PubChem data stored in the local database. This option retrieves up to 1000 randomly selected chemicals: up to 500 actives and up to 500 inactives.
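The balanced random selection can be pictured with the sketch below. It is an illustration of the idea, not the ToxiVerse code: the function name, the `(cid, label)` record layout, and the optional seed are assumptions.

```python
import random


def balanced_sample(records, per_class=500, seed=None):
    """Randomly pick up to per_class actives and up to per_class inactives.

    records is a list of (cid, label) tuples with label 1 for active and
    0 for inactive (hypothetical layout).
    """
    rng = random.Random(seed)
    actives = [r for r in records if r[1] == 1]
    inactives = [r for r in records if r[1] != 1]
    picked_actives = rng.sample(actives, min(per_class, len(actives)))
    picked_inactives = rng.sample(inactives, min(per_class, len(inactives)))
    return picked_actives + picked_inactives


# Made-up assay with 600 actives and 700 inactives.
records = [(cid, 1) for cid in range(600)] + [(cid, 0) for cid in range(600, 1300)]
subset = balanced_sample(records, seed=0)
print(len(subset))  # 1000
```

If an assay has fewer than 500 chemicals in a class, this sketch simply keeps all of them.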

Curator

Curator cleans chemical structures and prepares them for later steps such as model generation. It performs the following steps:

  • Check and clean chemical structures.
  • Standardize chemical structure representation, such as updating valencies and removing charges.
  • Strip salts and solvents, and remove mixtures by retaining the parent non-salt/non-solvent component.
  • Merge or remove duplicated structures.
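Two of these steps, stripping counter-ions or solvents and handling duplicates, can be illustrated with a deliberately simplified, toolkit-free sketch. It is not how Curator works internally. It assumes the parent component is the longest dot-separated SMILES fragment, it compares structures as plain strings rather than canonical structures, and it assumes duplicates are merged when their activities agree and removed when they conflict. A real curation workflow would use a cheminformatics toolkit for structure checking, standardization, and canonicalization.

```python
def strip_to_largest_fragment(smiles):
    """Keep the longest dot-separated fragment (crude stand-in for salt stripping)."""
    return max(smiles.split("."), key=len)


def merge_duplicates(records):
    """Merge duplicate structures; drop those with conflicting activities.

    records is a list of (smiles, activity) tuples. Structures are compared
    as strings after fragment stripping, which is an approximation.
    """
    grouped = {}
    for smiles, activity in records:
        parent = strip_to_largest_fragment(smiles)
        grouped.setdefault(parent, set()).add(activity)
    # Assumption: agreeing duplicates are merged, conflicting ones removed.
    return [
        (parent, next(iter(activities)))
        for parent, activities in grouped.items()
        if len(activities) == 1
    ]


# Illustrative records; the activity values are made up.
records = [
    ("CCO", 1),
    ("CCO.Cl", 1),        # same parent after stripping: merged
    ("c1ccccc1", 0),
    ("c1ccccc1", 1),      # conflicting activities: removed
    ("CC(=O)[O-].[Na+]", 0),
]
print(merge_duplicates(records))  # [('CCO', 1), ('CC(=O)[O-]', 0)]
```

The acetate example keeps its charge because this sketch does not perform the standardization step.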
Curator screen showing dataset selection and curation options
Select your dataset and options, then click to run curation. If Create new dataset is chosen instead of Replace dataset, a file named sample_classification_curated will be generated.

Chemical Space Visualization

Principal Component Analysis (PCA) is a dimensionality reduction technique. This option uses it to visualize the chemical space of a dataset.

Chemical Space Visualization screen showing dataset selection and PCA plot generation
Select your dataset and click to generate the PCA plot.

QSAR Builder

QSAR Builder creates QSAR models using a variety of descriptors and algorithms with either a user-uploaded dataset or a dataset obtained by providing a PubChem Assay ID.

QSAR Builder overview.
QSAR Builder steps showing dataset, features, algorithms, activity type, and model metrics
Choose your dataset, choose features, algorithms, and activity type, click to generate model or models, then check the evaluation metrics and access the generated models.
QSAR Builder dataset selections and performance metrics
Dataset selection, multiple QSAR model generation options, and performance metrics of classification models.

If all fingerprints and algorithms are selected, the performance metrics are shown as a line plot, as in the figure above. Use the Classification metrics figure option to download the plot and the Classification metrics file option to download the scores in CSV format.

Tasks

You can monitor the status, including live updates, of Cheminformatics module jobs here.

Tasks page for monitoring Cheminformatics module jobs.
Cheminformatics

QSAR Predictor

QSAR Predictor predicts toxicity for new chemicals using models developed in the QSAR Builder option.

QSAR Predictor overview.
QSAR Predictor model selection and input options
Select your QSAR model, select input and output formats, then either choose a file such as Prediction_dataset.csv and enter the SMILES column name, or paste SMILES and click Predict.
The Predictor accepts up to 100 chemicals. Sample files can be downloaded. Multiple QSAR models can also be selected.
QSAR Predictor results with Prediction column added to the model name
Prediction results are returned with _Prediction added to the model name. The output includes prediction scores for the chemicals.

An output file such as Prediction_dataset_predicted.csv is downloaded with a new prediction column, for example sample_dataset_Binary-ECFP6-RF-Classification_Prediction, containing predicted scores for the chemicals.
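Once downloaded, the prediction columns can be picked out of the output file by their `_Prediction` suffix. The sketch below is a convenience for post-processing, not part of ToxiVerse: the function name is hypothetical, the `SMILES` column name is a placeholder, the scores in the example are made up, and it assumes the prediction cells are numeric.

```python
import csv
import io


def prediction_columns(csv_text, suffix="_Prediction"):
    """Return {column name: list of scores} for columns ending with suffix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = [name for name in (reader.fieldnames or []) if name.endswith(suffix)]
    rows = list(reader)
    # Assumption: every prediction cell holds a number.
    return {name: [float(row[name]) for row in rows] for name in columns}


# Illustrative content only; the scores are made up.
example = (
    "SMILES,sample_dataset_Binary-ECFP6-RF-Classification_Prediction\n"
    "CCO,0.12\n"
    "c1ccccc1,0.87\n"
)
print(prediction_columns(example))
# {'sample_dataset_Binary-ECFP6-RF-Classification_Prediction': [0.12, 0.87]}
```

When several QSAR models are selected, each model would contribute its own `_Prediction` column, and the function returns one entry per column.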

Contact us

Rowan University: 201 Mullica Hill Rd, Robinson Hall, Glassboro, NJ 08028

Tulane University: Hutchinson Memorial Building, School of Medicine, 1415 Tulane Ave, New Orleans, LA 70112

Questions, comments, and general inquiries can be emailed to toxiverse.help@gmail.com.

About us

The Zhu Lab uses cheminformatics algorithms, workflows, and other computational tools to model chemical toxicity, ADME (Absorption, Distribution, Metabolism, and Excretion), and other biological activities. These models support regulatory chemical toxicity assessments and the computer-aided drug discovery (CADD) process.

Zhu Lab and Tulane University School of Medicine logos
Zhu Lab and Tulane University School of Medicine.