Tag: ai & machine learning

  • Summer Update: AI Models, Flames of Rebellion, Chamber Music

    With the academic year’s classes concluded, summer has officially started. Back in late April, when things were still extremely busy, I set myself a goal: push harder on the fourth and final book in the Flames of Rebellion series once classes ended. I had just published Book 3.5 on Amazon and was feeling motivated to wrap up the series; Book 4 already had about 9,000 words of content in it from the previews I had written.

    Now, as of late May, I’ve mostly achieved that goal. Over the past few days, I’ve added over 3,000 words to Book 4, The Conquest of Piece, bringing it past 12,000 words. Things have gotten significantly more interesting than the first few chapters, and I’m using a rather chaotic mind map with sticky notes to keep track of this book’s plot. In a couple of chapters, the plan is to create a serious plot twist: an evacuation of the Tranquility’s Ozridia base due to a bomb threat (or possibly an actual bomb). I’d also like to fully flesh out Jonathan and Lily’s romance, as this is the last book. I’ve already added some romantic cues in the first few chapters; I intend to make Jonathan’s upcoming birthday party the culmination of their relationship.

    On the subject of artificial intelligence and machine learning, I’ve moved on to a new Kaggle challenge: house price prediction. After failing horribly to get the sentiment analysis model to work, I stopped trying to extract answers from Claude and pivoted to something else. This challenge involves building a model to predict the sale prices of houses in Ames, Iowa, based on various attributes, like number of bedrooms, pool quality, fence presence, square footage, etc. The dataset is quite large; there are seventy-nine features available, all with varying correlations to the central SalePrice target variable. So far, I’ve analyzed the data using some Seaborn scatterplots, generated correlation matrices to see which features to encode, and gotten started on cleaning the data (which has involved deleting outliers and imputing NA values). I can already tell that this challenge, while still labeled “introductory”, is slightly more intensive in terms of data preparation and analysis than the Kaggle Titanic challenge.
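
    The analysis and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual notebook: the toy values are made up, the 4,000 sq ft outlier cutoff is an assumed threshold, and treating a missing PoolQC as "no pool" is an assumption about how NA should be read in that column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle training data (the real file has 79 features).
# Column names follow the competition's naming; the values are made up.
df = pd.DataFrame({
    "GrLivArea": [1500, 1200, 1800, 1000, 2200, 5600],
    "BedroomAbvGr": [3, 2, 3, 2, 4, 3],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, 300.0],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan, "Gd"],
    "SalePrice": [200000, 170000, 230000, 140000, 260000, 180000],
})

# 1. Correlation of every numeric feature with the SalePrice target.
corr_with_target = (
    df.select_dtypes(include="number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

# 2. Delete outliers: here, one very large house sold at a low price.
#    The 4000 cutoff is an assumption for illustration.
cleaned = df[df["GrLivArea"] < 4000].copy()

# 3. Impute NA values: median for a numeric column, an explicit
#    "None" category where NA is assumed to mean "no pool".
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].median())
cleaned["PoolQC"] = cleaned["PoolQC"].fillna("None")

print(corr_with_target)
print(cleaned)
```

    A Seaborn scatterplot such as sns.scatterplot(data=df, x="GrLivArea", y="SalePrice") is the usual way to spot outliers like the one dropped in step 2 before deciding on a cutoff.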

    I’m also loosely working on the virtual card deck project, where I’m attempting to create a virtual cards app that people can customize to fit their specific needs. Right now, I have a modal where you can input the number of decks, the name of each deck, the background art, and the titles/descriptions for the cards in the deck. Most of the data is persisted, and the modal seems to behave as expected. However, I’ve been running into trouble getting all the data to save to localStorage, not just the number of decks. I’ve asked Microsoft Copilot lots of questions, but no real results have come yet. This project is one of the most complex undertakings for me yet when it comes to web design, HTML, and JavaScript, so I’m not expecting it to work perfectly for many more months.

    Tomorrow, I’m starting a chamber music perspectives (CMP) camp, which will run for about a week and a half and take place from 1:00 PM to 5:00 PM. This camp involves not only a series of small ensemble performances (piano trio and string quartet size), but also some composition masterclasses and the opportunity to compose a piece of your own for the ensemble to play. This will be the “Final Project”, as it’s been dubbed; it looks like this camp will be very fast-paced and packed with activities. This final project needs to be started from scratch on day one and completed by the tenth day, giving us less than two weeks to compose a fleshed-out, playable, and refined 3-5 minute piece. For context, it usually takes me about two months to compose a high-quality 5-minute multi-instrument piece; however, I only work about 45 minutes every other day. At this camp, we’ll likely be spending at least an hour and a half every day on this final project.

    Some other miscellaneous endeavors from the past couple of weeks include an AI radio show, which I just finished today. This is the third such show I’ve completed now (well, fourth, if you count Why You Should Be Afraid of Physics Class, a 20-minute-long drama), and I’m using Fish Audio to generate all the voices. These shows generally run for 25-28 minutes, and this latest episode features Sal Khan as the guest host. You’re probably wondering: how did I possibly get Sal Khan to appear on a low-level AI-produced radio show? I didn’t: the voices are all AI, and I do the voice generation and the editing myself. Sal Khan is a cloned voice available on Fish Audio, and I’ve cloned a few others for use in these episodes. It’s quite an interesting process, actually. I’ll be using DaVinci Resolve’s Fairlight studio instead of the clunky Audacity to edit this show together. Hopefully it won’t be too much of a shock to use. (The video editor portion of DaVinci Resolve is actually quite easy to learn; I’ve put together quite a few videos with it now.)

    Well, tomorrow’s going to be quite a busy day, with the start of the Chamber Music Perspectives (CMP) camp. I’ll try to work on the AI models this weekend if possible, in addition to the usual (shortened) BeamNG Roleplay sessions. Stay tuned for more updates.

  • Kaggle Titanic Random Forest Model: Summary and Comments

    This month, I finished the final round of hyperparameter tuning and optimization on the Kaggle Titanic Random Forest model. I built this model for Kaggle’s Titanic - Machine Learning from Disaster challenge, which is an introductory-level machine learning prediction competition. Participants are supposed to build a model that predicts whether or not a given passenger survived the Titanic shipwreck, learning from a training set of 891 labeled passengers. This was my first attempt at crafting a new AI model from (close to) scratch, given a fully-featured dataset. Although I probably could achieve higher Kaggle scores by trying out different architectures, I didn’t want to spend the entire year on one challenge, and left the model at a 0.788 public score.

    In this post, I explore some of the code and the inner workings behind the Kaggle Random Forest model. I also dive into a few of the challenges I encountered along the way. If you’d like to work off the model framework I’ve put on GitHub, feel free, but I’d strongly recommend against it because I haven’t yet taken the time to make the code fully understandable and customizable. (Also, a serious Kaggle competitor will likely strive for a better model architecture that achieves a higher public score, so please find your own setup.)

    Model Setup: From Logistic Regression to Random Forest

    Initially, I started off with a simple logistic regression model loosely based on a Kaggle starter tutorial. For the most part, the tutorial model didn’t involve any careful feature engineering, and primarily relied on the passenger name, class, sex, age, and fare (several of the most important features in the dataset). It used elementary techniques to learn patterns between these features and survival, and this base model only achieved an accuracy of about 0.74 in the notebook.

    Immediately realizing that I needed to do more with the model, I switched to a Random Forest architecture, as it seemed to be highly recommended for binary classification problems like this one. A Random Forest trains many decision trees, each on a random sample of the rows and features, and combines their votes into a single prediction. A useful side effect is that the trained model reports how much each feature contributed to its splits, so it effectively works out which features matter most. So I was able to feed it a wide variety of features and have it discover the most important connections: the prominent features ended up being Sex, Pclass, and Fare/Age.
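
    To illustrate how a Random Forest surfaces the important features, here is a minimal sketch using scikit-learn’s feature_importances_ attribute. The data is synthetic (one informative column, one pure-noise column), and the names and settings are hypothetical, not taken from the Titanic notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: "informative" drives the label, "noise" does not.
informative = rng.integers(0, 2, size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])

# The label follows the informative feature, with about 10% of rows flipped.
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - informative, informative)

# Shallow trees keep the forest from memorizing the flipped labels.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one value per input column.
importances = dict(zip(["informative", "noise"], forest.feature_importances_))
print(importances)
```

    The importances are relative scores, so they are best read as a ranking rather than as exact measurements of how much each feature matters.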

    The percentage of women who survived the Titanic shipwreck (in the sample data) was significantly higher than the percentage of men.

    As shown in the image above, I was able to discover that a significantly higher percentage of women survived than men (74% vs 18%). Historically, this was due to the “women and children first” directive on the ship. I used this basic pattern to train the original model, and the more advanced Random Forest model uncovered this connection as well. I prioritized the Sex feature (in addition to Pclass and Fare/Age) when feeding the features to the model.
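
    A survival-rate comparison like this one is typically a one-line pandas groupby. The sketch below uses a tiny made-up sample with the same Sex and Survived column names as the Kaggle data, so its numbers are illustrative and do not reproduce the 74% and 18% figures.

```python
import pandas as pd

# Tiny made-up sample; only the column names match the Kaggle train.csv.
passengers = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Survived is 0/1, so the mean per group is the survival rate.
survival_rate = passengers.groupby("Sex")["Survived"].mean()
print(survival_rate)
```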

    Advanced Feature Engineering: the Cabin Data

    When the Random Forest model based only on Sex, Pclass, and Fare/Age didn’t perform very well, I realized I was going to have to do a lot more feature engineering. After all, the data contains several features, ranging from the number of parents and children aboard the Titanic, to the passenger’s Ticket number, and the cabin number where the passenger stayed.

    Since the location of the passenger aboard the vessel seemed like an important data point, I constructed the next round of feature engineering around the Cabin column (and created a few new helpful features from the existing data, like IsAlone or FamilySize). I learned quite a bit about proper feature engineering practices in the process; with the help of Copilot and some Google searches, I was able to construct new engineered data for the model to train on.
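
    As a rough sketch of what features like FamilySize and IsAlone can look like, the snippet below derives them from the dataset’s SibSp (siblings/spouses aboard) and Parch (parents/children aboard) columns. The exact definitions are an assumption based on the common convention, not necessarily the ones used in this project.

```python
import pandas as pd

# Made-up rows using the Kaggle column names SibSp and Parch.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Assumed definition: relatives aboard plus the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Assumed definition: 1 if the passenger travelled with no relatives.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```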

    I also built a function to make observations on the correlation between people with no cabin data (we don’t know where they stayed) and passenger class (1st, 2nd, or 3rd class). I wrote these observations at the top of the rather long code cell shown below, and this info actually ended up being helpful in situations where the cabin data was missing.

    A sample of the very long code cell used to engineer the Cabin feature data
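
    The original cell is far longer than this, but the two ideas described above (turning Cabin into usable features, and checking how missing cabin data lines up with passenger class) can be sketched in a few lines. The HasCabin and Deck names, and the "U" placeholder for unknown decks, are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows using the Kaggle column names Pclass and Cabin.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Cabin": ["C85", "B28", np.nan, np.nan, "G6", np.nan],
})

# Flag whether any cabin was recorded for the passenger.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# The first letter of the cabin number is the deck; "U" marks unknown.
df["Deck"] = df["Cabin"].str[0].fillna("U")

# Share of passengers with no cabin data, per passenger class.
missing_by_class = df["Cabin"].isna().groupby(df["Pclass"]).mean()

print(df)
print(missing_by_class)
```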

    Hyperparameter Tuning (with RandomizedSearchCV)

    After retraining the model with the engineered Cabin data, I was getting much-improved accuracy in the notebook (about 0.82 vs 0.79), but the Kaggle score wasn’t changing. I figured I was going to have to do some more serious hyperparameter tuning to boost the score, which usually only reacts to large-scale prediction changes.

    At first, I attempted to tune most of the important parameters (n_estimators, max_depth, max_features) by hand, doing Google searches to figure out the best values for a Random Forest model. But this quickly became tedious, as I had to retrain the model every time I made a minor change to see if it did anything. Instead, I switched to RandomizedSearchCV, a tool provided by the Python library scikit-learn (sklearn) that samples random combinations of hyperparameters and scores each one with cross-validation.

    Using RandomizedSearchCV was relatively simple (as shown in the code cell below). All I had to do was set up a parameter grid, where I told RandomizedSearchCV which hyperparameters I wanted it to optimize. Then, I asked the algorithm to fit on the model’s training data, and then output the best parameters using a simple print() statement. From there, I was able to go back and drop the tuned parameters into the final model. The parameter-discovery process did take a little while, so I had to set n_jobs = -1 to ensure the algorithm ran on all CPU cores.
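
    For anyone who wants something runnable, here is a minimal sketch of that workflow. It is not the exact notebook cell: the data comes from make_classification rather than the Titanic files, and the grid values, n_iter, and cv settings are assumptions chosen to keep the search quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameter grid: the hyperparameters to optimize and their candidate values.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 8, None],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of random combinations to try
    cv=3,              # 3-fold cross-validation per combination
    scoring="accuracy",
    n_jobs=-1,         # run on all CPU cores
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)

# Drop the tuned parameters into the final model and check held-out accuracy.
final_model = RandomForestClassifier(random_state=42, **search.best_params_)
final_model.fit(X_train, y_train)
val_accuracy = final_model.score(X_val, y_val)
print(val_accuracy)
```

    Unlike an exhaustive grid search, this only tries n_iter combinations, which is why it finishes in reasonable time even when the grid is large.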

    After the hyperparameter tuning, the model achieved 0.84 validation accuracy in the notebook, and a Kaggle public score of 0.788, which is in the mid-to-upper-tier for these kinds of Random Forest models. I haven’t been able to move above this score since then; however, if I manage to do so, I will update this blog post with those details.

    The model’s final accuracy, precision, and recall report after the final stage of hyperparameter tuning.
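
    A report like this can be produced with scikit-learn’s metrics module. The sketch below uses made-up labels and predictions purely to show the calls; it does not reproduce the model’s actual numbers.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Made-up validation labels and predictions (1 = survived, 0 = did not).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # of predicted survivors, how many survived
recall = recall_score(y_true, y_pred)        # of actual survivors, how many were found

# Per-class precision, recall, and F1 in one text table.
print(classification_report(y_true, y_pred))
```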

    To view this entire Titanic project (and the code files) on GitHub, click here. However, the code is primarily intended for reference, not for drop-in usage in a brand-new project.

  • AI & Machine Learning Project Update: A Fresh Start

    Today I returned to the AI & Machine Learning project repository I’d worked on intermittently over the past few months. It’s a GitLab environment with some Jupyter Notebooks and AI projects created on an excursion in summer 2025. I’d managed to make some good progress towards a finished QuickDraw Webcam project last year, but after that, things sort of fell apart. Now, returning to it many months later after system updates and code changes, I found that many of the Jupyter Notebooks (or Stupyter Notebooks, if you will) no longer worked.

    There were issues like missing packages and unresolved imports, likely arising from the three conflicting Python environments I’d foolishly installed. In general, things were a mess. I first tried switching kernels, enabling the older but sometimes more trustworthy Python 3.11.4. When that didn’t work–and the problems only got worse–I decided to break out the Google Gemini AI, asking for assistance on installing pip (which had mysteriously uninstalled itself from the OS) and getting up to speed on the issues I was experiencing.

    Unfortunately, nothing worked. I ended up having to go through and create a new Python virtual environment in the ai-projects directory, in hopes that it would clear things up a little bit. However, I was dreadfully wrong. The problems again only got worse, and Python threw numerous errors in the terminal upon attempting to run the malfunctioning code cells in the Jupyter Notebooks.

    So I decided to start from scratch. I went to GitHub, logged in, and dug up an old, empty Machine Learning repository I’d created many months ago. Since it was empty, I figured it would be the perfect candidate for a new project. I copied and pasted one of the malfunctioning Jupyter Notebooks from the other environment, and this time decided to set up a new Python virtual environment running version 3.12.2. This was a fresh repository without three different kernels or multiple packages installed, so the virtual environment cleared things up immediately. A Python virtual environment is separate from the system install on your PC, so I was able to manually install all required AI-related packages without causing corruption or dependency conflicts anywhere else.
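
    Virtual environments are usually created from the terminal, but the same thing can be done with Python’s built-in venv module, which makes the idea easy to demonstrate. This is a minimal sketch: the helper names are hypothetical, and the demo builds a throwaway environment in a temporary folder without pip to keep it fast.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_env(target: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment at `target` and return its config file path."""
    venv.create(target, with_pip=with_pip)
    return target / "pyvenv.cfg"


# Demo: a throwaway environment. For a real project, point `target`
# at something like <repo>/.venv and leave with_pip=True.
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / ".venv", with_pip=False)
    created = cfg.exists()

print("created:", created, "| running inside a venv:", in_virtualenv())
```

    Packages installed with the environment’s own pip land inside that folder, which is why they cannot interfere with the system install.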

    And that’s where we are now. I intend to continue working on a product review sentiment analysis model, in which an AI is apparently supposed to predict if a review is positive or negative based on certain keywords. After that, I’ll probably use the repository to explore more advanced AI concepts, possibly with the assistance of an online course. It’s simply supposed to be a general space where I can experiment with machine learning, and so far everything seems to be working. (We’ll see how long that lasts).
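
    One common way to build a keyword-based sentiment model is a bag-of-words pipeline. The sketch below is an assumption about how such a model could look, not the project’s actual code, and the handful of reviews are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up reviews: 1 = positive, 0 = negative.
reviews = [
    "great product, works perfectly",
    "love it, excellent quality",
    "amazing value and great support",
    "terrible quality, broke quickly",
    "awful experience, do not buy",
    "poor design and terrible support",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer turns each review into word counts (the "keywords");
# LogisticRegression learns a positive or negative weight per word.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, labels)

prediction = model.predict(["great quality, love it"])
print(prediction)
```

    A real model would need far more than six reviews, plus a held-out set to check that it generalizes.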

    Site Updates & Other News

    I’ve been trying to get some more pages created on this website for the past few weeks. I actually did manage to make some progress on the Short Stories & Poetry page, but WordPress has been acting stupid and I haven’t yet settled on a good design. I’m thinking about simply creating a series of dropdowns containing short story content, but that seems prone to issues and not a very good setup. So, you might find yourself looking at a completely separate blog on that page, or a gallery-style grid of clickable images and media. We’ll just have to see.

    I also started on a wireframe of the concept NoteMaster software, which is supposed to be a high-quality and low-priced alternative to other music notation software. The wireframe is coming along in Figma, and it’s going pretty well so far. The trickiest part will be the design of the scoring interface, and this will really test the features in the Figma free plan. (When the wireframe is complete, you can expect to see some prototype design photos on here and the Rustler website.)

    With the New York trip over, and no more serious events on the calendar until late March, things are going along pretty smoothly. There are a couple of piano recitals scheduled for mid-April, but other than that, musical activities have slowed down significantly. I’m working on a couple of essay-writing contests and the NowBeat Commission and Concert (form now submitted). Flames of Rebellion: The Reckoning of The Past (book 3.5) is now in review, and the book cover is pretty much done. Progress on book 4 has stalled, but that’s normal for the next book in this series.

    Stay tuned for more information on the AI projects. Hopefully, the virtual environment remains working.