Curated Datasets
Dataset Name | Total # of Compounds | Data Type | Dataset Source | Dataset Description |
---|---|---|---|---|
Acute toxicity | 7332 | log10 mol/kg-bw | Zhu et al. | Publicly available rat oral acute toxicity data, with rat oral LD₅₀ values classified into toxicity categories (e.g., high, moderate, low) according to Globally Harmonized System (GHS) thresholds. |
Aquatic toxicity | 675 | -log10 of Conc. (µmol/L) | Klopman et al. | Collected from standardized 96-hour LC₅₀ test data for Pimephales promelas (fathead minnow), sourced from the EPA’s ECOTOX database and additional public toxicology resources. |
BBB (Blood Brain Barrier) | 438 | logBB | Wang et al. | Compounds with experimental logBB values were compiled and curated using ChemAxon and CASE Ultra tools. |
Bioavailability | 1141 | oral bioavailability (%F) | Kim et al.; Moda et al. | Compiled across public and literature sources. Chemical structures were standardized, and %F values were harmonized to resolve discrepancies. |
Carcinogenicity | 342 | Binary, 0=Non-Carcinogen; 1=Carcinogen | Chung et al. | 342 unique organic compounds from the EPA’s IRIS database, labeled as carcinogenic or noncarcinogenic based on oral slope factor (OSF), a quantitative measure for oral cancer risk. |
Cosmetics | 4129 | Activity_Cosmetics (All chemicals fall under the cosmetic category) | Chung et al. | Collected from the COSMOS Cosmetics Inventory knowledge base. |
DART (Developmental and Reproductive Toxicity) | 1999 | Oral Developmental, Inhalation Maternal, ToxRefDB Maternal, Binary Labels (1: Safe; 0: Teratogen) | Ciallella et al.; Aljarf et al. | Collected from the U.S. EPA’s in vivo prenatal developmental toxicity studies in rats and rabbits, covering oral or inhalation exposure. The Embryotox dataset was collected from FDA drug labeling data and literature annotations for known teratogens. Drugs with strong evidence of teratogenicity were classified as positives. Non-teratogenic drugs were chosen from non-reproductive risk categories to avoid mislabeling. |
Drugbank | 8055 | Activity_Drugbank (All chemicals fall under the drug category) | Chung et al. | Collected from DrugBank database. |
Endocrine disruption | 2103 | Agonist, Antagonist, Binding, and Uterotrophic class | Ciallella et al. | The data was collected from the Collaborative Estrogen Receptor Activity Prediction Project (CERAPP), which provides in vitro data on estrogen receptor agonism, antagonism, and binding, and from the Estrogenic Activity Database (EADB), which contains in vivo rodent uterotrophic assay data. |
Hepatotoxicity | 5177 | Several classification endpoints for hepatotoxicity are provided at standard or dose-based thresholds. Activity is the main endpoint. Endpoints labeled H-* represent Human data and PC-* represent Pre-Clinical rat data. The abbreviations used include: CC – clinical chemistry, HC – hepatocellular injury, HT – hepatotoxicity, HB – hepatobiliary injury, and MF – morphological findings. | Mulliner et al. | Compiled from multiple public toxicology databases, including the U.S. FDA’s Liver Toxicity Knowledge Base (LTKB), EMEA, LiverTox, and published scientific literature, with a focus on liver toxicity endpoints in both humans and animals. |
High Production Volume chemicals | 1672 | Activity_HPV (All chemicals fall under the HPV category) | Chung et al. | Collected from the U.S. EPA HPV Challenge Program's chemical database. |
Httk_ADME_Parameters | 1449 | Human CLint (intrinsic clearance) in µL/min/10^6 cells, Human Fu (fraction unbound), Rat CLint (intrinsic clearance) in µL/min/10^6 cells, and Rat Fu (fraction unbound) | High-Throughput Toxicokinetics | The HTTK dataset, developed by the U.S. EPA, contains high-throughput toxicokinetic data and models covering pharmaceuticals and environmental chemicals. It includes in vitro measurements like plasma protein binding and hepatic clearance rates, as well as species-specific physiological data such as tissue volumes and blood flow rates. |
Natural products | 2479 | Activity_NP (All chemicals fall under the natural product category) | Chung et al. | Curated from the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP). |
Pesticides | 1009 | Activity_Pesticides (All chemicals fall under the pesticide category) | Chung et al. | Collected from literature and public databases including U.S. EPA CompTox Chemistry Dashboard. |
Transporter-mediated toxicity | 10,875 | Proteins: P-glycoprotein (P-gp; MDR1), Breast cancer resistance protein (BCRP), Multidrug resistance–associated proteins (MRP1, MRP2), Bile salt export pump (BSEP). Activity values: Substrate and Inhibitor calls; Evidence of inhibition thresholds. | Daood et al. (Manuscript Submitted); Sedykh et al.; Zhao et al. | Consolidates curated transporter data across P-gp, BCRP, MRP1, MRP2, and BSEP. Records were standardized and manually reviewed for cited literature sources, noting details such as cell lines, substrates, substrate concentrations, and positive controls where available. Sources include large-scale ChEMBL bioactivity extractions, PubChem, and Metrabase data. High-confidence entries from the Intestinal Transporter Database are included for MDR1. Substrate/inhibitor calls are harmonized, and inhibition evidence follows study-specific thresholds (10 µM for P-gp/BCRP; 100 µM for BSEP). |
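
The inhibition thresholds quoted in the Transporter-mediated toxicity row (10 µM for P-gp/BCRP; 100 µM for BSEP) can be illustrated with a minimal Python sketch. Only the threshold values come from the table. The function name `inhibitor_call`, the use of an IC50 value as input, and the "at or below the threshold counts as an inhibitor" rule are assumptions made for illustration and may not match how the curated calls were actually derived. No threshold is listed for MRP1 or MRP2, so the sketch returns `None` for them rather than guessing.

```python
from typing import Optional

# Thresholds in µM, taken from the dataset description above.
# MRP1 and MRP2 are deliberately absent: the table gives no value for them.
INHIBITION_THRESHOLD_UM = {
    "P-gp": 10.0,
    "BCRP": 10.0,
    "BSEP": 100.0,
}


def inhibitor_call(transporter: str, ic50_um: float) -> Optional[bool]:
    """Return a hypothetical inhibitor call for one measurement.

    Assumption: a compound counts as an inhibitor when its IC50 (µM)
    is at or below the transporter's threshold. Returns None when no
    threshold is listed for the transporter.
    """
    threshold = INHIBITION_THRESHOLD_UM.get(transporter)
    if threshold is None:
        return None
    return ic50_um <= threshold


if __name__ == "__main__":
    for name, ic50 in [("P-gp", 5.0), ("BCRP", 50.0), ("BSEP", 50.0), ("MRP1", 1.0)]:
        print(name, ic50, inhibitor_call(name, ic50))
```

Keeping the thresholds in a single dictionary makes the transporter-specific cutoffs explicit, so the same measured value (for example 50 µM) can yield different calls for BCRP and BSEP.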