Curated Datasets
Dataset Name | Total # of Compounds | Data Type | Dataset Source | Dataset Description |
BBB (Blood-Brain Barrier) | 438 | logBB | Wang et al. | Compounds with experimental logBB values were compiled and curated using ChemAxon and CASE Ultra tools. |
BCRP (Breast Cancer Resistance Protein) | 395 | µM (evidence of inhibition at 10 µM) | Sedykh et al.; Zhao et al. | The BCRP dataset was curated for experimental consistency and structural quality, and filtered to include only reliable binary classification labels for substrates and inhibitors. |
Bioavailability | 1159 | oral bioavailability (%F) | Kim et al. | Compiled across public and literature sources. Chemical structures were standardized, and %F values were harmonized to resolve discrepancies. |
BSEP (Bile Salt Export Pump) | 725 | µM (evidence of inhibition at 100 µM) | Zhao et al. | Collected from publicly available experimental data. Structures were curated and standardized for consistency, and the dataset includes binary labels. |
Cancer (Human Oral Carcinogenicity) | 342 | Binary, 0=Non-Carcinogen; 1=Carcinogen | Chung et al. | 342 unique organic compounds from the EPA’s IRIS database, labeled as carcinogenic or noncarcinogenic based on oral slope factor (OSF), a quantitative measure for oral cancer risk. |
Cosmetics | 4129 | --- | Chung et al. | Collected from the COSMOS Cosmetics Inventory knowledge base. |
DART (Developmental and Reproductive Toxicity) | 1452 | Oral Developmental, Inhalation Maternal, ToxRefDB Maternal | Ciallella et al. | Collected from U.S. EPA’s in vivo prenatal developmental toxicity studies in rats and rabbits based on oral or inhalation studies. |
Drugbank | 8055 | --- | Chung et al. | Collected from the DrugBank database. |
Embryotox | 766 | Binary Labels (1: Safe; 0: Teratogen) | Aljarf et al. | Collected from FDA drug labeling data and literature annotations for known teratogens. Drugs with strong evidence of teratogenicity were classified as positives. Non-teratogenic drugs were chosen from non-reproductive risk categories to avoid mislabeling. |
Estrogen | 2144 | Agonist, Antagonist, Binding, and Uterotrophic class | Ciallella et al. | Collected from the Tox21 screening program using high-throughput in vitro assays that assess estrogen receptor (ER) activation and inhibition. |
FM (Fathead Minnow) | 675 | -log10 of Conc. (µmol/L) | Klopman et al. | Collected from standardized 96-hour LC₅₀ test data for Pimephales promelas (fathead minnow), sourced from the EPA’s ECOTOX database and additional public toxicology resources. |
Hepatotoxicity | 7502 | Several classification endpoints for hepatotoxicity at standard or dose-based thresholds | Mulliner et al. | Compiled from multiple public toxicology databases, including the U.S. FDA’s Liver Toxicity Knowledge Base (LTKB), EMEA, LiverTox, and published scientific literature, with a focus on liver toxicity endpoints in both humans and animals. |
High Production Volume | 1672 | --- | Chung et al. | Collected from the U.S. EPA HPV Challenge Program's chemical database. |
Httk_ADME_Parameters | 1610 | --- | High-Throughput Toxicokinetics | The HTTK dataset, developed by the U.S. EPA, contains high-throughput toxicokinetic data and models covering pharmaceuticals and environmental chemicals. It includes in vitro measurements like plasma protein binding and hepatic clearance rates, as well as species-specific physiological data such as tissue volumes and blood flow rates. |
LD50 (Rat oral) | 7332 | log10 mol/kg-bw | Zhu et al. | Compiled from publicly available rat oral acute toxicity data, with LD₅₀ values classified into toxicity categories (e.g., high, moderate, low) according to Globally Harmonized System (GHS) thresholds. |
MDR1 (Multidrug Resistance 1 transporter) | 1585 | µM (evidence of inhibition at 10 µM) | Sedykh et al. | Collected from the Intestinal Transporter Database using high-confidence experimental data sourced from literature and public databases. |
Natural Products | 2479 | --- | Chung et al. | Curated from the Traditional Chinese Medicine Systems Pharmacology Database and Analysis Platform (TCMSP). |
Pesticides | 1009 | --- | Chung et al. | Collected from literature and public databases, including the U.S. EPA CompTox Chemistry Dashboard. |
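
Several Data Type entries above are log-transformed values (for example, the FM and LD50 rows). The following is a minimal Python sketch of how such values could be computed from raw measurements. It is not the dataset authors' code. The function names and the molecular-weight input are hypothetical. The category cut-offs are the standard GHS acute oral toxicity thresholds in mg/kg-bw; the exact binning used for the LD50 dataset is not stated in the table and is an assumption here.

```python
import math


def neg_log10_conc(conc_umol_per_l: float) -> float:
    """Return -log10 of a concentration given in µmol/L (FM row convention)."""
    if conc_umol_per_l <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(conc_umol_per_l)


def ld50_log10_mol_per_kg(ld50_mg_per_kg: float, mol_weight_g_per_mol: float) -> float:
    """Return log10 of an LD50 expressed in mol/kg-bw.

    Converts mg/kg-bw to g/kg-bw, then divides by molecular weight (g/mol).
    """
    if ld50_mg_per_kg <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("inputs must be positive")
    mol_per_kg = (ld50_mg_per_kg / 1000.0) / mol_weight_g_per_mol
    return math.log10(mol_per_kg)


def ghs_oral_category(ld50_mg_per_kg: float):
    """Return the GHS acute oral toxicity category (1-5), or None above 5000 mg/kg.

    Assumption: standard GHS upper bounds of 5, 50, 300, 2000 and 5000 mg/kg-bw.
    """
    if ld50_mg_per_kg <= 0:
        raise ValueError("LD50 must be positive")
    upper_bounds = [(5, 1), (50, 2), (300, 3), (2000, 4), (5000, 5)]
    for bound, category in upper_bounds:
        if ld50_mg_per_kg <= bound:
            return category
    return None  # outside the GHS acute oral categories


if __name__ == "__main__":
    # Hypothetical compound: LD50 of 250 mg/kg-bw, molecular weight 180.16 g/mol.
    print(ld50_log10_mol_per_kg(250.0, 180.16))
    print(ghs_oral_category(250.0))
```

Lower category numbers indicate higher acute toxicity, so a mapping to labels such as high, moderate, and low would be a further choice layered on top of these categories.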