Research & Development Diary
I want to apply the DPO method to an RNN. I will build a structure composed of three elements that interact with each other:
- an RNN that has learned only the rules of the game (simply from sequences of random games)
- a network that learns via DPO to discriminate between a good move and a bad one
- a network that learns to learn, which adapts the parameters of network 1 to the current situation/state.
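A rough PyTorch sketch of how these three components might fit together is below. All class names, layer sizes, and the way the modules exchange information are hypothetical placeholders; the actual design is still open.

```python
import torch
import torch.nn as nn


class RulesRNN(nn.Module):
    """Component 1: an RNN (GRU) that has only learned the rules of the game,
    trained on sequences of random games."""

    def __init__(self, input_size, hidden_size, num_moves):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_moves)

    def forward(self, seq):                      # seq: (batch, time, input_size)
        _, h = self.gru(seq)                     # final hidden state summarizes the game so far
        state_repr = h.squeeze(0)
        return self.head(state_repr), state_repr  # move logits + state representation


class PreferenceNet(nn.Module):
    """Component 2: learns via DPO to tell a good move from a bad one."""

    def __init__(self, hidden_size, num_moves):
        super().__init__()
        self.score = nn.Linear(hidden_size + num_moves, 1)

    def forward(self, state_repr, move_onehot):
        return self.score(torch.cat([state_repr, move_onehot], dim=-1))


class MetaNet(nn.Module):
    """Component 3: learns to learn, proposing adjustments to the parameters
    of component 1 conditioned on the current situation/state."""

    def __init__(self, hidden_size, num_adapted_params):
        super().__init__()
        self.adapter = nn.Linear(hidden_size, num_adapted_params)

    def forward(self, state_repr):
        return self.adapter(state_repr)          # parameter deltas for RulesRNN
```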
February 16th, 2025
Progress: Successfully experimented with GRUs on the “adding problem” to understand RNN learning dynamics. Demonstrated that normalizing input/output to [0,1] significantly improves performance. Showed the model (with a hidden size of one) can learn to sum a variable number of masked inputs, even with variable sequence lengths (up to +/- 80% variation). This highlights the surprising capability of even tiny RNNs.
Technical Notes:
- GRUs were used for all experiments.
- Input/output normalization to [0,1] is essential for learning.
- The model architecture is extremely small: a single-layer GRU with hidden size 1, followed by a linear layer.
- The model successfully learned with a variable number of masked elements and variable sequence lengths.
- Sequence length has little to no effect on the ability to learn.
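A minimal reproduction of this setup is sketched below. The model shape (single-layer GRU, hidden size 1, linear head) and the [0,1] normalization match the notes; the specific hyperparameters (base sequence length, number of marked elements, learning rate, training steps) are assumptions, not the values used in the experiment.

```python
import torch
import torch.nn as nn

MAX_MARKED = 5   # assumed upper bound on marked elements, keeps the target in [0, 1]


class TinyAdder(nn.Module):
    """Single-layer GRU with hidden size 1 plus a linear head, as in the notes."""

    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=1, batch_first=True)
        self.out = nn.Linear(1, 1)

    def forward(self, x):                        # x: (batch, seq_len, 2) = (value, flag)
        _, h = self.gru(x)                       # final hidden state, shape (1, batch, 1)
        return self.out(h.squeeze(0))            # predicted (normalized) sum


def make_batch(batch_size=64, base_len=50):
    """Adding-problem batch with variable sequence length (+/- 80%)
    and a variable number of marked elements."""
    seq_len = int(base_len * (1 + 0.8 * (2 * torch.rand(1).item() - 1)))
    n_marked = torch.randint(1, MAX_MARKED + 1, (1,)).item()
    values = torch.rand(batch_size, seq_len, 1)               # inputs already in [0, 1]
    flags = torch.zeros(batch_size, seq_len, 1)
    for b in range(batch_size):
        flags[b, torch.randperm(seq_len)[:n_marked]] = 1.0
    x = torch.cat([values, flags], dim=-1)
    y = (values * flags).sum(dim=1) / MAX_MARKED              # target normalized to [0, 1]
    return x, y


model = TinyAdder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    x, y = make_batch()
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```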
Next Steps:
- Begin investigating the incorporation of the DPO loss (a generic sketch follows this list).
- Define the structure of the meta-learning network. Consider the size constraint (should the meta-network be smaller than the main network?).
- Start thinking about how to design the “flag” and “value” signals for the sum.
- Explore whether an Energy-Based Model (EBM) is suitable for the meta-learning component (though stability could be a concern).
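For reference on the first item, the standard DPO objective reduces to a logistic loss over log-probability margins between the trainable policy and a frozen reference model. This is the generic formulation, not yet adapted to the networks above:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss on a batch of (chosen, rejected) pairs.

    Each argument is the log-probability of the preferred ("chosen") or
    dispreferred ("rejected") action under the trainable policy or the frozen
    reference model; beta controls how far the policy may drift from the reference.
    """
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```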
February 27th, 2025
Progress:
Successfully trained a GRU-based neural network to predict winning moves in the game Connect Four. Demonstrated that the network achieves 67% accuracy in identifying the winning column when presented with a game state near completion. The initial training used a dataset of 200,000 randomly generated game states, highlighting the potential for relatively small networks to learn meaningful patterns in game scenarios. Discovered that the network learns to identify an existing win, but not to construct a game strategy. A minimal training sketch follows the technical notes below.
Technical Notes:
- GRUs were used for the initial experiments.
- Dataset consists of 200,000 randomly generated Connect Four games, where a “win state” is identified.
- Training utilized 80% of the dataset for 1000 epochs.
- Network accuracy on predicting winning moves reached 67%.
- Realized that current training focuses on identifying existing wins, not developing a game-playing strategy.
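The sketch below shows one way this training loop could be set up. Only the dataset size, the 80/20 split, the 1000 epochs, and the 7-column output come from the notes; the board encoding (6x7 grid fed row by row), the hidden size, batch size, and learning rate are assumptions, and the tensors are placeholders standing in for the randomly generated games.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split


class WinColumnNet(nn.Module):
    """GRU that reads the board one row per time step and scores the 7 columns."""

    def __init__(self, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(input_size=7, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 7)    # logits over the 7 columns

    def forward(self, board):                    # board: (batch, 6, 7), cells in {-1, 0, +1}
        _, h = self.gru(board)
        return self.head(h.squeeze(0))


# Placeholders for the 200,000 generated game states and their winning columns.
boards = torch.zeros(200_000, 6, 7)
winning_cols = torch.zeros(200_000, dtype=torch.long)
train_set, test_set = random_split(TensorDataset(boards, winning_cols),
                                   [160_000, 40_000])        # 80/20 split

model = WinColumnNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(train_set, batch_size=256, shuffle=True)
for epoch in range(1000):
    for x, y in loader:
        loss = loss_fn(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
```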
Next Steps:
- Begin fine-tuning the network using Direct Preference Optimization (DPO); a sketch of how preference pairs might be constructed follows this list.
- Investigate methods for generating more strategic training data to enable the network to construct complete games.
- Explore different network architectures and training strategies to improve overall game-playing performance.
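One way the DPO step could start is with the pairing rule "the identified winning column is preferred over every other column in the same position". This rule is an assumption for illustration, not something established yet; `move_logprobs` assumes the `WinColumnNet` model sketched above.

```python
import torch.nn.functional as F


def move_logprobs(model, board):
    """Log-probabilities over the 7 columns for a single (6, 7) board state."""
    return F.log_softmax(model(board.unsqueeze(0)), dim=-1).squeeze(0)


def preference_pairs(winning_col, num_cols=7):
    """Hypothetical pairing rule: the winning column is 'chosen',
    every other column in the same position is 'rejected'."""
    return [(winning_col, c) for c in range(num_cols) if c != winning_col]
```

These pairs, scored by both the trainable network and a frozen copy of it, would feed directly into the `dpo_loss` sketch from the February 16th entry.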
First Offline Demo: Play Against the AI
Welcome to the first offline demo of our Connect 4 AI project! In this demo, you can play a game of Connect 4 against an AI opponent. The AI has been trained using a GRU-based neural network to predict winning moves and identify game patterns. While the AI is still in its early stages, it demonstrates the potential of machine learning in game strategy development.
To play, simply click on the column where you want to drop your piece. The game will alternate turns between you and the AI. Enjoy the game and stay tuned for more updates as we continue to improve the AI’s capabilities!
Note: This demo is now connected to an AI model using ONNX Runtime. The AI has been trained using a GRU-based neural network to play Connect 4. You play as red and the AI plays as yellow. The AI will automatically make its move after you play. Have fun!
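For anyone curious how the demo queries the model through ONNX Runtime, a Python equivalent of the inference call might look like the sketch below. The file name, tensor layout, and board encoding are assumptions about the export, not the demo's actual code.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical export name and board encoding: a (6, 7) grid with
# 0 = empty, +1 = red (human), -1 = yellow (AI).
session = ort.InferenceSession("connect4_gru.onnx")
input_name = session.get_inputs()[0].name


def ai_move(board: np.ndarray) -> int:
    """Return the column the AI drops its piece into for the given board."""
    x = board.astype(np.float32)[np.newaxis, ...]          # add batch dimension
    logits = session.run(None, {input_name: x})[0]         # first (assumed only) output
    legal = [c for c in range(7) if board[0, c] == 0]      # columns whose top cell is free
    return max(legal, key=lambda c: logits[0, c])
```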