Category: Programming

  • Kaggle Titanic Random Forest Model: Summary and Comments


    This month, I finished the final round of hyperparameter tuning and optimization on the Kaggle Titanic Random Forest Model. I built this model for Kaggle’s “Titanic - Machine Learning from Disaster” challenge, which is an introductory-level machine learning prediction competition. Participants train a model on 891 labeled passengers and use it to predict whether each of the 418 passengers in the test set survived the Titanic shipwreck. This was my first attempt at crafting a new AI model from (close to) scratch, given a fully-featured dataset. Although I could probably achieve higher Kaggle scores by trying out different architectures, I didn’t want to spend the entire year on one challenge, so I left the model at a 0.788 public score.

    In this post, I explore some of the code and the inner workings behind the Kaggle Random Forest model. I also dive into a few of the challenges I encountered along the way. If you’d like to work off the model framework I’ve put on GitHub, feel free to do that, but I’d strongly recommend against it because I haven’t yet taken the time to make the code fully understandable and customizable. (Also, a serious Kaggle competitor will likely strive for a better model architecture that achieves a higher public score, so please find your own setup.)

    Model Setup: From Logistic Regression to Random Forest

    Initially, I started off with a simple logistic regression model loosely based on a Kaggle starter tutorial. For the most part, the tutorial model didn’t involve any careful feature engineering, and relied primarily on the passenger name, class, sex, age, and fare (several of the most important features in the dataset). It used elementary techniques to learn patterns between these features and survival, and this base model only achieved an accuracy of about 0.74 in the notebook.

    Immediately realizing that I needed to do more with the model, I switched to a Random Forest architecture, as it seemed to be highly recommended for binary classification problems like this one. The important part about Random Forest is that it builds many decision trees, each trained on a random sample of the rows and choosing its splits from a random subset of the features, and then reports how much each feature contributed to those splits. So I was able to feed it a wide variety of features and have it surface the most important connections: the prominent features ended up being Sex, Pclass, and Fare/Age.
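
    To make the feature-ranking idea concrete, here is a minimal sketch using scikit-learn. It is not the project’s code: the data is synthetic, the 0/1 encoding of Sex, the Noise column, and the max_depth value are all assumptions chosen for illustration, and feature_importances_ is scikit-learn’s impurity-based measure, which is only one way to rank features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),     # assumed encoding: 0 = male, 1 = female
    "Pclass": rng.integers(1, 4, n),  # 1, 2, or 3
    "Noise": rng.normal(size=n),      # deliberately uninformative column
})

# Synthetic label driven almost entirely by Sex, so Sex should rank first.
survival_prob = np.where(X["Sex"] == 1, 0.95, 0.05)
y = (rng.random(n) < survival_prob).astype(int)

# max_depth=3 is an arbitrary choice to keep the trees from fitting the noise.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one value per column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

    On the real data the ranking depends on how the columns are encoded and on the tree settings, so it is worth treating the output as a hint rather than a verdict.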

    The percentage of women who survived the Titanic shipwreck (in the sample data) was significantly higher than the percentage of men.

    As shown in the image above, I was able to discover that a significantly higher percentage of women survived than men (74% vs 18%). Historically, this was due to the “women and children first” directive on the ship. I used this basic pattern to train the original model, and the more advanced Random Forest model uncovered this connection as well. I prioritized the Sex feature (in addition to Pclass and Fare/Age) when feeding the features to the model.
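
    A survival rate per group like the one quoted above can be computed with a pandas groupby. The sketch below uses a handful of made-up rows rather than the real train.csv, so its numbers will not match the 74% and 18% figures; only the column names Sex and Survived come from the Kaggle data.

```python
import pandas as pd

# Made-up rows for illustration only; the real training file is much larger.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so the mean within each group is that group's survival rate.
survival_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```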

    Advanced Feature Engineering: the Cabin Data

    When the Random Forest model based only on Sex, Pclass, and Fare/Age didn’t perform very well, I realized I was going to have to do a lot more feature engineering. After all, the data contains several features, ranging from the number of parents and children aboard the Titanic, to the passenger’s Ticket number, and the cabin number where the passenger stayed.

    Since the location of the passenger aboard the vessel seemed like an important data point, I constructed the next round of feature engineering around the Cabin column (and created a few new helpful features from the existing data, like IsAlone or FamilySize). I learned quite a bit about proper feature engineering practices in the process; with the help of Copilot and some Google searches, I was able to construct new engineered data for the model to train on.
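
    As a rough illustration of this kind of feature engineering (not the actual notebook code), the sketch below derives FamilySize, IsAlone, and two Cabin-based columns. The definitions are assumptions based on common Titanic conventions: FamilySize counts SibSp plus Parch plus the passenger, and Deck, HasCabin, and the "U" placeholder for unknown cabins are hypothetical names.

```python
import pandas as pd

# Made-up rows using the Kaggle column names SibSp, Parch, and Cabin.
toy = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
    "Cabin": ["C85", None, "E46", None],
})

# Assumed definition: siblings/spouses + parents/children + the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# Hypothetical Cabin features: the deck letter is the first character of the
# cabin string, and "U" marks passengers whose cabin is unknown.
toy["Deck"] = toy["Cabin"].str[0].fillna("U")
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

print(toy)
```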

    I also built a function to make observations on the correlation between people with no cabin data (we don’t know where they stayed) and passenger class (1st, 2nd, or 3rd class). I wrote these observations at the top of the rather long code cell shown below, and this info actually ended up being helpful in situations where the cabin data was missing.
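
    A function along these lines could look something like the sketch below. The name missing_cabin_rate_by_class is hypothetical and the rows are made up; it simply reports, for each passenger class, the share of passengers with no Cabin value.

```python
import pandas as pd

# Made-up rows for illustration; Pclass and Cabin are the Kaggle column names.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Cabin": ["C85", "B42", None, None, "F33", None, None, None, "G6", None],
})

def missing_cabin_rate_by_class(df: pd.DataFrame) -> pd.Series:
    """Return the share of passengers in each class with no Cabin value."""
    # isna() gives True/False per row; the mean of booleans is a proportion.
    return df["Cabin"].isna().groupby(df["Pclass"]).mean()

rates = missing_cabin_rate_by_class(toy)
print(rates)
```

    If one class turns out to be missing cabin data far more often than the others, then “no cabin recorded” carries some information on its own, which is one reason to keep it as a feature instead of dropping those rows.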

    A sample of the very long code cell used to engineer the Cabin feature data

    Hyperparameter Tuning (with RandomizedSearchCV)

    After retraining the model with the engineered Cabin data, I was getting much-improved accuracy in the notebook (about 0.82 vs 0.79), but the Kaggle score wasn’t changing. I figured I was going to have to do some more serious hyperparameter tuning to boost the score, which usually only reacts to large-scale prediction changes.

    At first, I attempted to tune most of the important parameters (n_estimators, max_depth, max_features) by hand, doing Google searches to figure out the best values for a Random Forest model. But this quickly became tedious, as I had to retrain the model every time I made a minor change to see if it did anything. Instead, I switched to RandomizedSearchCV, an automated method of searching for a model’s best parameters provided by the Python library scikit-learn (sklearn).

    Using RandomizedSearchCV was relatively simple (as shown in the code cell below). All I had to do was set up a parameter grid, where I told RandomizedSearchCV which hyperparameters I wanted it to optimize and which values to try. Rather than testing every combination, it samples a fixed number of them and scores each one with cross-validation. I asked the algorithm to fit on the training data, and then output the best parameters using a simple print() statement. From there, I was able to go back and drop the tuned parameters into the final model. The parameter-discovery process did take a little while, so I set n_jobs = -1 to run the search on all available CPU cores.
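
    For readers who haven’t used it, the workflow described above might be sketched as follows. This is a hedged example, not the notebook’s actual cell: the data comes from make_classification instead of the Titanic files, and the candidate values, n_iter, and cv settings are assumptions picked to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary-classification data standing in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values for the hyperparameters named in the post (values assumed).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 8, None],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random combinations to try
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
    n_jobs=-1,           # use all available CPU cores
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))

# Drop the best parameters into a fresh model for final training.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)
```

    Note that best_score_ is a cross-validated estimate on the training data, so it will not necessarily match the Kaggle public score.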

    After the hyperparameter tuning, the model achieved 0.84 validation accuracy in the notebook, and a Kaggle public score of 0.788, which is in the mid-to-upper-tier for these kinds of Random Forest models. I haven’t been able to move above this score since then; however, if I manage to do so, I will update this blog post with those details.

    The model’s final accuracy, precision, and recall report after the final stage of hyperparameter tuning.
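
    A report of this kind can be produced with scikit-learn’s metrics helpers. The labels and predictions below are made up for illustration (they are not the model’s actual output), and the class names passed to target_names are my own wording.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up validation labels and predictions (1 = survived, 0 = did not).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

# Per-class precision, recall, and F1 in a single text table.
print(classification_report(
    y_true, y_pred, target_names=["Did not survive", "Survived"]
))
```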

    To view this entire Titanic project (and the code files) on GitHub, click here. However, the code is primarily intended for reference, not for drop-in usage in a brand-new project.

  • AI & Machine Learning Project Update: A Fresh Start


    Today I returned to the AI & Machine Learning project repository I’d worked on intermittently over the past few months. It’s a GitLab environment with some Jupyter Notebooks and AI projects created on an excursion in summer 2025. I’d managed to make some good progress towards a finished QuickDraw Webcam project last year, but after that, things sort of fell apart. Now, returning to it many months later after system updates and code changes, I found that many of the Jupyter Notebooks (or Stupyter Notebooks, if you will) no longer worked.

    There were issues like missing packages and unresolved imports, likely arising from the three conflicting Python environments I’d foolishly installed. In general, things were a mess. I first tried switching kernels, enabling the older but sometimes more trustworthy Python 3.11.4. When that didn’t work, and the problems only got worse, I decided to break out the Google Gemini AI, asking for assistance with reinstalling pip (which had mysteriously disappeared from the system) and getting up to speed on the issues I was experiencing.

    Unfortunately, nothing worked. I ended up having to go through and create a new Python virtual environment in the ai-projects directory, in hopes that it would clear things up a little bit. However, I was dreadfully wrong. The problems again only got worse, and Python threw numerous errors in the terminal upon attempting to run the malfunctioning code cells in the Jupyter Notebooks.

    So I decided to start from scratch. I went to GitHub, logged in, and dug up an old, empty Machine Learning repository I’d created many months ago. Since it was empty, I figured it would be the perfect candidate for a new project. I copied and pasted one of the malfunctioning Jupyter Notebooks from the other environment, and this time decided to set up a new Python virtual environment running version 3.12.2. This was a fresh repository without three different kernels or a pile of conflicting packages, so the virtual environment cleared things up immediately. A Python virtual environment keeps its own set of installed packages, separate from the system install on your PC, so I was able to manually install all the required AI-related packages without causing conflicts or dependency issues anywhere else.
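
    When several environments are fighting each other, it helps to confirm which interpreter a notebook kernel is really using. The small check below is a sketch that relies only on the standard library; the helper name in_virtual_environment is hypothetical, and the prefix comparison applies to environments created with Python’s built-in venv module.

```python
import sys

def in_virtual_environment() -> bool:
    """Return True when this interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the Python install it was created from.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Python version:", sys.version.split()[0])
print("Inside a virtual environment:", in_virtual_environment())
```

    Running this in a notebook cell shows at a glance whether the kernel points at the new environment or at one of the old installs.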

    And that’s where we are now. I intend to continue working on a product review sentiment analysis model, in which an AI is apparently supposed to predict if a review is positive or negative based on certain keywords. After that, I’ll probably use the repository to explore more advanced AI concepts, possibly with the assistance of an online course. It’s simply supposed to be a general space where I can experiment with machine learning, and so far everything seems to be working. (We’ll see how long that lasts.)

    Site Updates & Other News

    I’ve been trying to get some more pages created on this website for the past few weeks. I actually did manage to make some progress on the Short Stories & Poetry page, but WordPress has been acting stupid and I haven’t yet settled on a good design. I’m thinking about simply creating a series of dropdowns containing short story content, but that seems prone to issues and not a very good setup. So, you might find yourself looking at a completely separate blog on that page, or a gallery-style grid of clickable images and media. We’ll just have to see.

    I also started on a wireframe of the concept NoteMaster software, which is supposed to be a high-quality and low-priced alternative to other music notation software. The wireframe is coming along in Figma, and it’s going pretty well so far. The trickiest part will be the design of the scoring interface, and this will really test the features in the Figma free plan. (When the wireframe is complete, you can expect to see some prototype design photos on here and the Rustler website.)

    With the New York trip over, and no more serious events on the calendar until late March, things are going along pretty smoothly. There are a couple of piano recitals scheduled for mid-April, but other than that, musical activities have slowed down significantly. I’m working on a couple of essay-writing contests and the NowBeat Commission and Concert (form now submitted). Flames of Rebellion: The Reckoning of The Past (book 3.5) is now in review, and the book cover is pretty much done. Progress on book 4 has stalled, but that’s normal for the next book in this series.

    Stay tuned for more information on the AI projects. Hopefully, the virtual environment remains working.