Applying AI techniques from drug discovery to LLMs to reduce hallucinations

December 5, 2024

Revolutionary GitHub projects: Automated drug discovery with AI

The integration of artificial intelligence (AI) into drug discovery is revolutionizing the pharmaceutical industry. Open source projects on GitHub play a crucial role in this. Below we present some of the most innovative projects that are driving automated drug discovery using AI.

DeepChem: Open Platform for Deep Learning in Chemistry

DeepChem is a leading open source library that makes deep learning accessible for chemical applications. It provides tools for:

Through its user-friendly interface, DeepChem enables researchers to implement complex AI models without in-depth programming knowledge. This accelerates the discovery of new drugs and promotes innovation in the industry.

MoleculeNet: Benchmarking for AI in Chemistry

MoleculeNet is a comprehensive benchmarking system specifically designed for machine learning in chemical research. It offers:

By providing consistent benchmarks, MoleculeNet facilitates the comparison of different AI models and thus promotes progress in drug discovery.

ATOM Modeling PipeLine (AMPL): Accelerated drug discovery

The ATOM Modeling PipeLine is a project of the ATOM consortium that aims to accelerate drug development using machine learning. AMPL offers:

With AMPL, researchers can efficiently build complex models and thus shorten the time from discovery to market launch of new drugs.

Chemprop: Molecular property prediction with deep learning

Chemprop uses graph neural networks to predict molecular properties. Its features include:

Chemprop has achieved outstanding results in several competitions and is a valuable tool for AI-assisted chemistry.

DeepPurpose: Universal Toolkit for Drug Discovery

DeepPurpose is a comprehensive deep learning toolkit for drug discovery. It offers:

Through its versatility, DeepPurpose enables researchers to quickly and efficiently identify new therapeutic candidates.

OpenChem: Specialized deep learning framework for chemical applications

OpenChem is a deep learning framework tailored to chemistry. It is characterized by:

OpenChem promotes the development of new methods in chemical AI and helps accelerate research.

The open source community on GitHub is pushing the boundaries of automated drug discovery with these projects. Combining AI and chemistry opens up new opportunities to develop therapeutic solutions more efficiently and precisely. These innovations have the potential to change the future of medicine for the long term.

Application of AI research models from drug research to the distillation of AI models

The AI models and methods used in drug discovery offer innovative approaches that can be transferred to the distillation of AI models. Although the two fields appear different at first glance, they share common techniques and challenges that make such a transfer worthwhile.

Benefits of the application

Applying research models from drug discovery to AI model distillation makes sense because:

How it can be applied

1. Graph Neural Networks (GNNs) for structural understanding

In drug research, Graph Neural Networks are used to analyze molecular structures. These techniques can be used in model distillation to understand the structure of large models and extract essential features for the smaller model.

2. Transfer Learning and Feature Extraction

The models from projects such as DeepChem or Chemprop use transfer learning to learn from existing data sets. Similarly, in distillation, a large pre-trained model can serve as a starting point from which essential features are transferred to the smaller model.
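The transfer from a large teacher model to a smaller student is often expressed as a loss on softened output distributions. The following is a minimal sketch of that idea, not code from DeepChem or Chemprop: the logits, the temperature of 2.0, and the function names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a three-class output
teacher = [4.0, 1.0, 0.5]
student_close = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
student_far = [0.5, 1.0, 4.0]     # prefers a different class

print("close student:", distillation_loss(teacher, student_close))
print("far student:  ", distillation_loss(teacher, student_far))
```

A student whose distribution matches the teacher's incurs a loss near zero, so minimizing this term pushes the student toward the teacher's behavior, including its relative confidence in the non-top classes.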

3. Multi-task learning for versatile models

Multi-task learning, as benchmarked in MoleculeNet, trains models that can handle multiple tasks simultaneously. This method can be used in distillation to create compact models that still perform versatile functions.

4. Optimization techniques from drug discovery

Optimization approaches from drug discovery, such as fine-tuning hyperparameters or using evolutionary algorithms, can be applied to make distilled models more efficient.
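As a rough illustration of an evolutionary search over distillation hyperparameters, the sketch below evolves a learning rate and a softmax temperature. The function validation_loss is a hypothetical stand-in for actually training and evaluating a student model, and its optimum, the value ranges, and the mutation sizes are all assumptions chosen for the example.

```python
import random


def validation_loss(learning_rate, temperature):
    """Hypothetical stand-in for training and evaluating a distilled model."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (temperature - 2.0) ** 2


def mutate(candidate, rng):
    """Perturb both hyperparameters with Gaussian noise and clip to the valid range."""
    lr, temp = candidate
    lr = min(max(lr + rng.gauss(0, 0.005), 1e-5), 0.1)
    temp = min(max(temp + rng.gauss(0, 0.5), 0.5), 10.0)
    return (lr, temp)


def evolve(generations=30, population_size=8, seed=0):
    """Keep the better half of each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [
        (rng.uniform(1e-5, 0.1), rng.uniform(0.5, 10.0))
        for _ in range(population_size)
    ]
    history = []  # best loss seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda c: validation_loss(*c))
        history.append(validation_loss(*population[0]))
        parents = population[: population_size // 2]
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    best = min(population, key=lambda c: validation_loss(*c))
    return best, history


best, history = evolve()
print("best (learning_rate, temperature):", best)
print("loss in first vs. last generation:", history[0], history[-1])
```

Because the parents survive unchanged into the next generation, the best loss can never get worse from one generation to the next.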

5. Data augmentation and generation

Generating synthetic data is key in projects like DeepPurpose. Similar techniques can be used to improve the training process of the student model in distillation, especially when limited data is available.

Practical implementation steps

The integration of methods from automated drug discovery into the distillation of AI models opens up new ways to increase efficiency and reduce complexity. By transferring proven techniques, powerful, compact models can be developed that meet the requirements of modern AI applications. This interdisciplinary approach promotes innovation and accelerates progress in both research fields.

Extension: Application of AI techniques from drug discovery to LLMs to reduce hallucinations

Advances in artificial intelligence have revolutionized both drug discovery and the development of Large Language Models (LLMs). An interesting question is whether the techniques from automated drug discovery can help to increase the prediction accuracy of LLMs and reduce hallucinations. In the following, we explore this possibility and analyze whether such an application makes sense and whether these techniques are already used in LLMs.

Connection between AI techniques in chemistry and LLMs

1. Graph Neural Networks (GNNs) and structural analysis

In drug research, Graph Neural Networks are used to understand and predict the complex structures of molecules. GNNs model data as graphs, which is natural in chemistry since molecules are made up of atoms (nodes) and bonds (edges).
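To make the graph view concrete, here is a small sketch that encodes the heavy atoms of ethanol as nodes and its bonds as edges, then runs one round of sum-based message passing. Using the atomic number as the only node feature and a plain sum as the aggregation are simplifying assumptions; real GNNs learn these transformations.

```python
# Heavy atoms of ethanol (CH3-CH2-OH); hydrogens are omitted for brevity.
atoms = ["C", "C", "O"]        # nodes
bonds = [(0, 1), (1, 2)]       # undirected edges between node indices
atomic_number = {"C": 6, "O": 8}


def build_adjacency(num_nodes, edges):
    """Return an adjacency list for an undirected graph."""
    neighbors = {i: [] for i in range(num_nodes)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def message_passing_step(features, neighbors):
    """One round of sum aggregation: each node adds its neighbors' features to its own."""
    return [
        features[i] + sum(features[j] for j in neighbors[i])
        for i in range(len(features))
    ]


neighbors = build_adjacency(len(atoms), bonds)
features = [float(atomic_number[a]) for a in atoms]
updated = message_passing_step(features, neighbors)
print(updated)  # [12.0, 20.0, 14.0]
```

After one step, each atom's feature already reflects its direct chemical neighborhood; stacking more steps spreads information further across the molecule.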

Application to LLMs:

2. Fuzziness and uncertainty estimation

In drug discovery, uncertainty estimation is crucial to assess the reliability of predictions.
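One simple way to quantify this kind of uncertainty is to run several stochastic predictions and measure the entropy of their averaged distribution. The sketch below assumes three hypothetical prediction passes over three classes; the numbers are invented for illustration.

```python
import math


def mean_distribution(distributions):
    """Average several probability distributions defined over the same classes."""
    n = len(distributions)
    num_classes = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(num_classes)]


def entropy(distribution):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


# Three hypothetical stochastic passes that agree with each other
agreeing = [[0.90, 0.05, 0.05], [0.88, 0.07, 0.05], [0.92, 0.04, 0.04]]
# Three hypothetical passes that each favor a different class
disagreeing = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]

print("agreeing passes:   ", entropy(mean_distribution(agreeing)))
print("disagreeing passes:", entropy(mean_distribution(disagreeing)))
```

When the passes disagree, the averaged distribution flattens and its entropy approaches the maximum of log(3) for three classes, which signals that the prediction should not be trusted.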

Application to LLMs:

3. Multi-task learning and transfer learning

Multi-task datasets like those in MoleculeNet are used to train models that predict multiple properties simultaneously.
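In multi-task settings, not every sample necessarily has a label for every task. A common way to handle this, shown here as a sketch under that assumption, is to skip missing labels when computing the loss. The data and the function name masked_multitask_mse are hypothetical.

```python
def masked_multitask_mse(predictions, targets):
    """Mean squared error per task, skipping entries whose target is None.

    Returns the average over tasks and the list of per-task losses.
    """
    num_tasks = len(predictions[0])
    per_task = []
    for t in range(num_tasks):
        errors = [
            (pred[t] - tgt[t]) ** 2
            for pred, tgt in zip(predictions, targets)
            if tgt[t] is not None
        ]
        # A task without any labels contributes zero loss.
        per_task.append(sum(errors) / len(errors) if errors else 0.0)
    return sum(per_task) / num_tasks, per_task


# Three samples, two tasks; the first sample has no label for task 1.
predictions = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
targets = [[1.0, None], [4.0, 1.0], [3.0, 1.0]]

total, per_task = masked_multitask_mse(predictions, targets)
print("per-task loss:", per_task)
print("combined loss:", total)
```

Masking lets a single shared model learn from every available label instead of discarding samples that are only partially annotated.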

Application to LLMs:

4. Data augmentation and synthetic data generation

In chemistry, synthetic data is used to improve models, especially when real-world data is limited.

Application to LLMs:

Does the application make sense?

Transferring techniques from AI-assisted drug discovery to LLMs theoretically makes sense, as both fields use complex data structures and machine learning. Some reasons are:

Challenges

Are these techniques already used in LLMs?

Many of the techniques mentioned are already integrated into LLMs in some form:

Potential innovative approaches

Despite the existing techniques, there is potential for new approaches:

Applying techniques from automated drug discovery to LLMs offers exciting opportunities to improve prediction accuracy and reduce hallucinations. While some methods are already used in LLMs, there is room for further innovation through an interdisciplinary approach. The challenges lie mainly in the different data types and scalability. Nevertheless, collaboration between these two fields could lead to significant advances in AI research.

Short thought experiment: Does it make sense?

Chemistry and natural language are different at first glance, but both are systems with complex rules and structures. The techniques for modeling and prediction in chemistry could therefore provide valuable input for natural language processing. It is important to be open to interdisciplinary approaches, as innovation often arises at the interfaces of different disciplines.

Integrating AI techniques from drug discovery into the development of LLMs could be a promising way to further increase the performance of these models. By learning from each other, both fields can benefit and together open up new horizons in AI research.

Implementation to reduce hallucinations in LLMs using Hugging Face

Below, we show how to add uncertainty estimation to a pre-trained language model using Hugging Face and Python, so that unreliable outputs can be flagged as potential hallucinations. We use techniques inspired by methods used in automated drug discovery, in particular uncertainty estimation by Monte Carlo dropout.

Requirements

You can install the required libraries using the following command:

pip install transformers torch datasets

Code implementation

from collections import Counter

import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def enable_dropout(model):
    """Enables dropout layers in the model during evaluation."""
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()


def generate_with_uncertainty(model, tokenizer, prompt, num_samples=5, max_length=50):
    """Generates several samples with Monte Carlo dropout and estimates the uncertainty."""
    model.eval()
    enable_dropout(model)  # keep dropout active although the model is in eval mode
    inputs = tokenizer(prompt, return_tensors='pt')

    # Multiple stochastic predictions for uncertainty estimation
    outputs = []
    with torch.no_grad():
        for _ in range(num_samples):
            output = model.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=max_length,
                do_sample=True,
                top_k=50,
                top_p=0.95,
                pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
            )
            outputs.append(output)

    # Decode the generated sequences
    sequences = [tokenizer.decode(output[0], skip_special_tokens=True) for output in outputs]

    # Uncertainty: entropy of the next-token distribution, averaged over all
    # token positions of a sequence and then over all sampled sequences
    entropies = []
    with torch.no_grad():
        for output in outputs:
            logits = model(output)['logits']
            prob = F.softmax(logits, dim=-1).cpu().numpy()
            token_entropy = -np.sum(prob * np.log(prob + 1e-8), axis=-1)
            entropies.append(token_entropy.mean())
    uncertainty = float(np.mean(entropies))

    # Select the most frequently occurring sequence
    most_common_sequence = Counter(sequences).most_common(1)[0][0]

    return {
        'generated_text': most_common_sequence,
        'uncertainty': uncertainty,
    }


# Example usage
prompt = "The impact of artificial intelligence on medicine is"

result = generate_with_uncertainty(model, tokenizer, prompt)
print("Generated text:")
print(result['generated_text'])
print("\nEstimated uncertainty:", result['uncertainty'])
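The uncertainty value only helps against hallucinations if something acts on it. One possible use, sketched here as an assumption rather than as part of the implementation above, is to withhold answers whose uncertainty exceeds a threshold. The function answer_or_abstain, the threshold of 3.0, and the example dictionaries are hypothetical; a real threshold would have to be calibrated on held-out data.

```python
def answer_or_abstain(result, max_uncertainty=3.0):
    """Return the generated text only if its uncertainty is at or below the threshold."""
    if result["uncertainty"] <= max_uncertainty:
        return result["generated_text"]
    return "I am not confident enough to answer this."


# Hypothetical results in the same format as the dictionary returned above
confident = {"generated_text": "Paris is the capital of France.", "uncertainty": 1.2}
unsure = {"generated_text": "The capital of Atlantis is Poseidonia.", "uncertainty": 5.7}

print(answer_or_abstain(confident))
print(answer_or_abstain(unsure))
```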

Code explanation

Using GitHub repositories

For extended functionality and advanced methods, the following GitHub repositories may be useful:

Extension options

Conclusion

By applying uncertainty estimates and techniques from automated drug discovery, we can increase the reliability of language models and reduce unwanted hallucinations. The provided implementation serves as a starting point and can be further developed to meet specific requirements.

Note: The implementation shown above is a simplified example. In a production environment, other aspects such as efficiency, scalability and ethical considerations should be taken into account.

Author: Thomas Poschadel
