Detecting The Anomalous Activity Of A Ship Engine

Anomaly Detection in Maritime Engine Data

Data Source: Real-world dataset submitted by Devabrat Mohakul
Rows: 19,535 | Features: 6 engine parameters
Tech Used: Python, Principal Component Analysis (PCA), One-Class Support Vector Machines (SVM), Isolation Forest, IQR

Colab Notebook: Here

🧩 The Problem?

Imagine hundreds of cargo ships cruising the oceans, carrying everything from sneakers to server racks. Now imagine the chaos if just one engine fails mid-route. Delays. Downtime. Dollars lost.

That’s the risk the business wanted to eliminate.

🔍 The Challenge: Find the Outliers That Sink Ships

They needed a data-driven way to spot irregularities in engine behavior before breakdowns happened. The tricky part? These anomalies were subtle—just 1–5% of the data—but catching them could be the difference between smooth sailing and major disruption.

The goal? Detect these hidden anomalies so ships can be flagged for maintenance early—minimizing downtime and protecting revenue.

🛠️ The Tools: Methods for Flagging Anomalous Activity

📊 Interquartile Range (IQR)


The classic method of outlier detection. I used the 1.5×IQR rule to flag anomalies across the six engine parameters:

  • Engine RPM

  • Fuel Pressure

  • Coolant Pressure

  • Lubrication Oil Pressure

  • Coolant Temperatures

  • Lubrication Oil Temperature

Boxplots of each feature, highlighting outliers detected using the 1.5×IQR rule.

None of these features were normally distributed, as confirmed by Q-Q plots and the Shapiro-Wilk test.
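For reference, here is a minimal sketch of that normality check. The filename and column handling are assumptions for illustration, not the exact code from the notebook:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import shapiro

df = pd.read_csv("engine_data.csv")  # hypothetical filename

for col in df.columns:
    # Shapiro-Wilk: p < 0.05 suggests the feature is not normally distributed
    # (the test is only reliable up to ~5,000 samples, so we subsample)
    stat, p = shapiro(df[col].sample(5000, random_state=42))
    print(f"{col}: W={stat:.3f}, p={p:.3g}")

    # Q-Q plot against a fitted normal distribution
    sm.qqplot(df[col], line="s")
    plt.title(col)
    plt.show()
```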

This method found 422 anomalies (2.2% of the data), right in the expected 1–5% range. Bar charts made it easy to see which features were spiking, as seen below; Fuel Pressure and Lubrication Oil Temperature were the biggest culprits.
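A minimal sketch of how the 1.5×IQR rule can be applied per feature and the per-feature outlier counts tallied. The DataFrame and filename are placeholders, and the exact counts depend on the real data:

```python
import pandas as pd

df = pd.read_csv("engine_data.csv")  # hypothetical filename, as above

outlier_mask = pd.DataFrame(index=df.index)
for col in df.columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier_mask[col] = (df[col] < lower) | (df[col] > upper)

# Rows flagged on at least one feature
flagged = outlier_mask.any(axis=1)
print(f"Anomalous rows: {flagged.sum()} ({flagged.mean():.1%} of the data)")

# Per-feature outlier counts, e.g. to feed a bar chart
print(outlier_mask.sum().sort_values(ascending=False))
```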

Bar chart showing the number of outliers of each feature.

This simplicity makes it highly interpretable, and the use of boxplots makes it especially useful for communicating findings to stakeholders with no technical background.

However, IQR has limitations when it comes to handling more complex data. It operates on a univariate basis, meaning it evaluates each feature independently, which can overlook important interactions between variables. Additionally, it isn’t suited for detecting nonlinear relationships or anomalies that may emerge over time, which can be crucial in systems like maritime engines. Given the non-normal distribution of the data, the method may also struggle with over-flagging or under-flagging anomalies, leading to less reliable results.

🧠 One-Class SVM + PCA


I started by cleaning up and scaling the data so that all features were on the same playing field. Then, I used PCA (Principal Component Analysis)—think of it like compressing a high-res image to a simpler version—to shrink the data down to just two key dimensions for easier analysis.

From there, I used One-Class SVM, a machine learning model that learns what “normal” looks like and flags anything that doesn’t fit the pattern.

I fine-tuned the model's settings (called gamma and nu) to strike a balance between being too sensitive and too lenient about what counts as anomalous.
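Roughly, the pipeline looks like this. This is a minimal sketch: the filename is hypothetical, and the gamma and nu values are illustrative rather than the tuned values from the notebook:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

df = pd.read_csv("engine_data.csv")  # hypothetical filename, as above

# Put all features on the same scale so no single parameter dominates
X_scaled = StandardScaler().fit_transform(df)

# Compress the six features down to two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.0%}")

# Learn the "normal" region; nu approximates the expected anomaly fraction,
# gamma controls how tightly the boundary hugs the data
ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05)
labels = ocsvm.fit_predict(X_pca)            # -1 = anomaly, +1 = normal
print(f"Points flagged as anomalies: {(labels == -1).sum()}")
```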

The result is shown below: a plot with a clear red decision boundary, where anything outside the circle was tagged as an anomaly.

One-Class SVM showing anomalies (marked x) outside of the decision boundary (red circular line).

Whilst the visuals look pretty, unfortunately the two retained components explained only about 36% of the variance in the original data. That means a lot of important nuance was lost in the simplification, rather like trying to understand a 3D object from its 2D shadow.

The result? Great visuals, but they come at the cost of accuracy. The model is working with an incomplete view of the data, which limits how well it can detect anomalies. Additionally, because the principal components are abstract combinations of the original features, we cannot tell which features led to a data point being flagged as an anomaly.

🌲 Isolation Forest

Imagine you’re trying to spot someone acting weird in a crowd. The fewer questions it takes to single them out, the more likely they’re up to something. That’s how Isolation Forest works — it splits the data again and again, and if a point gets isolated quickly, it’s likely an anomaly.

🔧 What I did:

  • Set contamination to 0.05 → I told the model to expect around 5% of the data to be anomalies.

  • Used 100 estimators (trees) to build a strong enough forest for better accuracy.

  • Applied PCA again to reduce the features down to two dimensions so I could visualise the outliers.

Scatterplot of anomalies (red) identified by using Isolation Forest.
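A minimal sketch of that setup, using the parameter values from the list above. The filename and scaling step are assumptions carried over from the earlier sketches:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

df = pd.read_csv("engine_data.csv")  # hypothetical filename, as above
X_scaled = StandardScaler().fit_transform(df)

# 100 trees, expecting roughly 5% of the readings to be anomalous
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = iso.fit_predict(X_scaled)       # -1 = anomaly, +1 = normal
scores = iso.score_samples(X_scaled)     # lower score = more isolated = more anomalous

# Project to two dimensions purely for plotting the flagged points
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=(labels == -1), cmap="coolwarm", s=5)
plt.title("Isolation Forest anomalies highlighted")
plt.show()
```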

Since there was no label indicating what “bad”/anomalous looked like in this dataset, Isolation Forest helps us understand what that might look like, especially since it gives each point an anomaly score that lets us rank how unusual it is.
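For example, a short follow-up to the sketch above (reusing the hypothetical df and scores names) shows how those scores could be used to surface the most suspicious readings for engineers to review first:

```python
# Sort by anomaly score so the most isolated (most suspicious) readings come first
ranked = df.assign(anomaly_score=scores).sort_values("anomaly_score")
print(ranked.head(10))
```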

However, the output visuals aren’t as intuitive as the One-Class SVM plot above. Additionally, since PCA was used again for plotting, we can’t see which features contributed to a data point being flagged as an anomaly.

Conclusion: It’s Best To Balance Simplicity with Precision

Whilst the One-Class SVM visualisation was the easiest to interpret and clearly shows a decision boundary, the PCA projection it relies on explains only a small share of the variance in this dataset.

Isolation Forest's ability to isolate anomalies through random partitions makes it better suited for detecting complex and subtle irregularities in engine performance data. While its visualizations are less intuitive compared to those generated using PCA with One-Class SVM and IQR, the method's accuracy in flagging anomalies outweighs this limitation. Moreover, its anomaly scoring system allows for further analysis, enabling engineers to investigate and confirm flagged anomalies effectively.

By adopting the Isolation Forest approach, the business can more accurately identify ships requiring maintenance, thereby reducing unplanned downtime and minimizing revenue loss. This method not only aligns with the company’s goal of mitigating risk but also provides a scalable and reliable solution for ongoing anomaly detection in engine performance monitoring.