Project Context and Purpose
In the summer of 2018, I attended the Wolfram Summer Camp, an intensive coding program built around the Wolfram Language, which is developed by Wolfram Research, the company behind Wolfram|Alpha. Within the two-week program, I had to conceive and build a final project. Intrigued by Natural Language Processing (NLP), I chose to investigate whether Recurrent Neural Networks (RNNs) could estimate the historical era of a text. The premise rests on the evolutionary nature of language: certain words fall out of use while others are adopted, providing clues to a text's period of origin. Although I had no prior experience in NLP, the hands-on mentorship at the camp let me quickly get to grips with both RNNs and the Wolfram Language.
For a detailed explanation of the code, see the Wolfram Community blog post I wrote at the end of the project. Because my Wolfram membership has expired, I no longer have access to the original project demo or the code folder, but portions of the code can still be reviewed in the blog post.
Project Overview
This project uses a hybrid RNN that operates at both the character and word levels to date historical texts. The hybrid approach delivered a clear accuracy improvement over a word-level-only model. It dates 19th-century texts effectively, but struggles with older documents because the dataset contains far fewer older texts.
Data Collection and Pre-processing
Data is a critical ingredient of any machine learning project. Here, the training data consisted of public domain books obtained from Open Library. The Wolfram Language's ServiceConnect function allowed seamless interaction with the Open Library API, and books from a variety of genres (religion, drama, adventure, and more) were downloaded to build a diverse dataset.
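Since the original code folder is gone, the following is only a minimal sketch of what the collection step can look like. ServiceConnect, ServiceExecute, and the built-in "OpenLibrary" service connection are real Wolfram Language features, but the request name "BookSearch", its parameters, and the field names are my assumptions:

```wolfram
(* Connect to the built-in Open Library service. *)
openLibrary = ServiceConnect["OpenLibrary"];

(* Fetch public domain books for one genre. "BookSearch" and the
   "Subject" parameter are assumed request names, not verified against
   the original code. *)
fetchGenre[subject_String] :=
  Normal @ ServiceExecute[openLibrary, "BookSearch", {"Subject" -> subject}]

(* Pool several genres into one diverse dataset. *)
books = Join @@ (fetchGenre /@ {"religion", "drama", "adventure"});

(* Supervised examples: text -> publication year. "Text" and
   "FirstPublishYear" are illustrative field names. *)
trainingData = #["Text"] -> #["FirstPublishYear"] & /@ books;
```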
RNN Architecture
The project uses a dual RNN setup:
- Word-level RNN: Built on pre-trained GloVe word embeddings, this branch captures the semantics of words as vectors.
- Character-level RNN: This branch supplements the word-level RNN by reading the text character by character, picking up nuances, such as spelling and punctuation conventions, that word-level models overlook.
The architecture is defined with the Wolfram Language's NetGraph function and trained with NetTrain.
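The original network definition is no longer available, but a two-branch NetGraph of this kind can be sketched as below. The GloVe model name is a real entry in the Wolfram Neural Net Repository; the layer choices, sizes, and port names are my assumptions rather than the original hyper-parameters:

```wolfram
(* Pre-trained GloVe embeddings; this NetModel includes a tokens encoder,
   so it maps a raw string to a sequence of 100-dimensional vectors. *)
glove = NetModel["GloVe 100-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data"];

hybrid = NetGraph[
  <|
   (* Word-level branch: GloVe vectors summarized by a recurrent layer. *)
   "wordRNN" -> NetChain[{glove, GatedRecurrentLayer[128], SequenceLastLayer[]}],
   (* Character-level branch: the "Characters" encoder turns the string
      into a sequence of one-hot vectors for its own recurrent layer. *)
   "charRNN" -> NetChain[
     {GatedRecurrentLayer[64], SequenceLastLayer[]},
     "Input" -> NetEncoder[{"Characters"}]],
   (* Concatenate both sequence summaries and regress a year estimate. *)
   "merge" -> CatenateLayer[],
   "year" -> LinearLayer[{}]  (* scalar output *)
  |>,
  {NetPort["Words"] -> "wordRNN",
   NetPort["Characters"] -> "charRNN",
   {"wordRNN", "charRNN"} -> "merge" -> "year"}
]
```

In use, the same raw string is fed to both input ports, so the word branch and the character branch each read the full text in their own representation.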
Training and Testing
The network was trained on the collected data using a GPU for speed, with periodic checkpoints so the model could be recovered after an interruption. The trained model was then tested on well-known titles such as "On the Origin of Species" and "Alice in Wonderland"; the results showed an average error of around 25 years.
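TargetDevice and TrainingProgressCheckpointing are real NetTrain options; the rest of this sketch reuses the hypothetical hybrid network and trainingData from the sections above:

```wolfram
(* Reshape the illustrative text -> year pairs into the
   association-of-lists form NetTrain expects for a two-input net. *)
texts = Keys[trainingData];
years = N[Values[trainingData]];  (* in practice the years may be rescaled *)

trained = NetTrain[
  hybrid,
  <|"Words" -> texts, "Characters" -> texts, "Output" -> years|>,
  TargetDevice -> "GPU",  (* train on the GPU for speed *)
  TrainingProgressCheckpointing ->
    {"Directory", "checkpoints", "Interval" -> Quantity[10, "Minutes"]}
];

(* Estimate the publication year of an unseen excerpt. *)
excerpt = "It was the best of times, it was the worst of times...";
trained[<|"Words" -> excerpt, "Characters" -> excerpt|>]
```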
Challenges and Future Improvements
The project faces some limitations, including:
- Data Distribution: The collected dataset is skewed towards more recent texts, which reduces accuracy on older documents.
- Hyper-parameter Tuning: The two-week timeframe left little room for tuning; further optimization of the network should yield higher accuracy.
