Recent advancements in machine learning show promise for accurately predicting the solubility of salicylic acid, a compound widely used in pharmaceuticals, in various solvents. Researchers utilized a robust dataset consisting of 217 observations along with 15 input features, including pressure, temperature, and solvents such as ethanol, water, and methanol, to explore how these factors influence solubility levels.
The study, conducted by a team of researchers at King Saud University, applies three machine learning techniques, Convolutional Neural Networks (CNNs), Polynomial Regression (PR), and Kernel Ridge Regression (KRR), to predict solubility at temperatures between 243.15 and 323.15 K and pressures between 90 and 101.32 kPa. The dataset covers 13 solvent options, providing a comprehensive basis for the solubility analysis.
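To give a feel for one of the model families named above, here is a minimal Kernel Ridge Regression sketch in plain Python. It is not the study's implementation: the article does not state which kernel or hyperparameters were used, so the RBF kernel, the `gamma` and `lam` values, and the tiny synthetic dataset are all assumptions chosen for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def krr_fit(X, y, gamma=1.0, lam=1e-3):
    """Return dual coefficients alpha = (K + lam * I)^-1 y."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear(K, y)

def krr_predict(X_train, alpha, x_new, gamma=1.0):
    """Predict as a kernel-weighted sum over the training points."""
    return sum(a * rbf_kernel(xi, x_new, gamma) for a, xi in zip(alpha, X_train))

# Synthetic data (NOT from the study): one scaled feature, target y = x^2.
X_train = [[0.0], [0.25], [0.5], [0.75], [1.0]]
y_train = [0.0, 0.0625, 0.25, 0.5625, 1.0]

alpha = krr_fit(X_train, y_train, gamma=10.0, lam=1e-3)
print(krr_predict(X_train, alpha, [0.6], gamma=10.0))  # prediction at an unseen point
```

The regularization term `lam` trades training-set fit against smoothness; in practice both `lam` and `gamma` would be tuned, for example by cross-validation.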
To enhance the quality of their analysis, the researchers developed a pre-processing phase that involved normalizing the data using a Min–Max Scaler. Following this step, they applied the k-Nearest Neighbors Outlier Detection (KNNOD) technique to remove any outliers from the dataset. This meticulous process laid the foundation for applying the machine learning models.
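The two pre-processing steps can be sketched roughly as follows. This is a simplified illustration, not the study's code: the neighbour count `k`, the `contamination` fraction, and the sample rows are assumptions, and the outlier score used here (distance to the k-th nearest neighbour) is one common form of kNN-based outlier detection.

```python
import math

def min_max_scale(rows):
    """Rescale each column of `rows` to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def knn_outlier_scores(rows, k=2):
    """Score each row by its distance to its k-th nearest neighbour."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(dists[k - 1])
    return scores

def drop_outliers(rows, k=2, contamination=0.1):
    """Remove the `contamination` fraction of rows with the highest scores."""
    scores = knn_outlier_scores(rows, k)
    n_drop = int(len(rows) * contamination)
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_drop])
    return [row for i, row in enumerate(rows) if i not in flagged]

# Made-up rows for illustration: temperature (K), pressure (kPa).
raw = [
    [243.15, 90.0], [253.15, 92.0], [263.15, 94.0], [273.15, 96.0],
    [283.15, 98.0], [293.15, 99.0], [303.15, 100.0], [313.15, 101.0],
    [323.15, 101.32], [250.0, 101.32],
]
scaled = min_max_scale(raw)
cleaned = drop_outliers(scaled, k=2, contamination=0.1)
print(len(raw), "->", len(cleaned))
```

Scaling before computing distances matters here: without it, the feature with the largest numeric range would dominate the neighbour distances.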
The effectiveness of each model was evaluated based on metrics such as R² scores, Mean Squared Error (MSE), and Mean Absolute Error (MAE). Results revealed a striking accuracy from the CNN model, which achieved an R² score of 0.989, MSE of 4.161203E−05, and MAE of 3.760119E−03. In contrast, the KRR model showed an R² score of 0.913873, highlighting the impressive capabilities of CNNs in this scenario.
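For readers unfamiliar with these metrics, the sketch below computes them from their standard definitions in plain Python. The sample values are invented for illustration and have nothing to do with the study's data.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Invented values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

print(mse(y_true, y_pred), mae(y_true, y_pred), r2_score(y_true, y_pred))
```

An R² close to 1 means the model explains nearly all of the variance in the observed values, while MSE and MAE report the size of the errors in the units of the target (squared, in the case of MSE).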
Notably, the predictive ability of these models paves the way for advancements in the pharmaceutical industry, especially in optimizing drug crystallization processes. This is essential since the crystallization of active pharmaceutical ingredients (APIs) must be carefully controlled during production to ensure medication efficacy.
The researchers indicated that their study not only highlights the robust nature of machine learning methodologies but also suggests a shift away from traditional thermodynamic models for estimating solubility data. The authors wrote, “The results of this study underline the robustness of preprocessing methods, model selection, and hyper-parameter tuning for the attainment of accurate predictions, making useful contributions to the area of solubility prediction by salicylic acid in various solvent environments.”
This transition towards machine learning techniques offers scientists a powerful new tool for analyzing the complexities of drug solubility. As these algorithms learn from the varying characteristics of the dataset, they can effectively predict solubility even at unobserved data points, substantially enhancing the drug development process.
Examining the significance of each variable in the predictive models, the study identified water content as the most influential factor affecting salicylic acid solubility. The findings emphasize that increasing water content can reduce solubility because water acts as an anti-solvent. On the other hand, the presence of solvents like polyethylene glycol (PEG 300) enhances salicylic acid's solubility, underscoring the importance of solubilizers in pharmaceutical formulations.
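The article does not say how the study ranked its input variables. One common model-agnostic way to estimate variable influence is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below illustrates that idea only; `toy_model` is a hypothetical stand-in for a fitted model, and the data are synthetic.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical model: output falls as feature 0 (water fraction) rises;
    feature 1 is ignored entirely."""
    return 1.0 - 0.8 * row[0]

# Synthetic inputs: feature 0 = water fraction, feature 1 = an irrelevant variable.
X = [[i / 7, (i * 3 % 8) / 7] for i in range(8)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
print(scores)  # feature 0 should score higher than feature 1
```

A feature the model relies on heavily produces a large error increase when shuffled, while a feature the model ignores produces none.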
The study’s findings were published on March 24, 2025, in Scientific Reports, highlighting the potential of machine learning to revolutionize solubility predictions in chemical and pharmaceutical engineering fields. The researchers’ rigorous approach showcases how sophisticated models can lead to breakthroughs in drug formulation and delivery.
Overall, this research positions machine learning not merely as a supplementary analytical tool but potentially as a cornerstone of future solubility models, paving the way for deeper insights into pharmaceutical solutions.