
Project Rosetta

TensorFlow · React · NASA Space Apps 2025
View on GitHub →

Finding Exoplanets in Light Curves

When a planet transits its host star, it blocks a tiny fraction of the starlight. Kepler and TESS have collected millions of light curves, each a time series of stellar brightness. Hidden in those curves are transit signals: periodic dips that indicate an orbiting planet. The problem is that most light curves contain no planet, the dips are tiny (often less than 1% of the star's brightness), and there is a lot of noise from stellar variability, instrument systematics, and cosmic rays. Project Rosetta is a multi-model pipeline for detecting these transit signals, built during NASA Space Apps 2025.

The Class Imbalance Problem

This is the first thing that hits you when you start working with real Kepler data. The ratio of confirmed exoplanet candidates to non-candidates is roughly 1:100. If your model just predicts "no planet" for everything, it gets 99% accuracy and is completely useless. Standard training on imbalanced data produces exactly this: a model that has learned to always say no.
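The failure mode is easy to demonstrate with a toy calculation. The 1:100 ratio is from the data above; the exact counts here are invented for illustration:

```python
import numpy as np

# Hypothetical 1:100 imbalanced label set: 100 planets, 9,900 non-planets.
y_true = np.array([1] * 100 + [0] * 9900)

# A "model" that always predicts the majority class ("no planet").
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()  # fraction of correct labels
recall = y_pred[y_true == 1].mean()   # fraction of real planets found

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# accuracy=0.99, recall=0.00
```

Accuracy alone tells you nothing here; recall on the positive class is the metric that exposes the degenerate model.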

We used SMOTE (Synthetic Minority Over-sampling Technique) to address this. SMOTE generates synthetic positive examples by interpolating between existing positive samples in feature space. This is not just duplicating the minority class, which would lead to overfitting. It creates new synthetic examples that lie along the line segments connecting nearest neighbors in the positive class. After SMOTE, the training set is balanced, and the models can actually learn what a transit signal looks like instead of learning to always predict the majority class.
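In practice you would reach for a library implementation (imbalanced-learn's SMOTE is the usual choice), but the core interpolation step fits in a few lines of NumPy. `smote_sketch` and its parameters are illustrative names, not our actual code:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE illustration: synthesize n_new points by
    interpolating between minority-class samples and their nearest
    minority-class neighbors in feature space."""
    rng = rng or np.random.default_rng(0)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # Each sample's k nearest minority neighbors.
    nn = np.argsort(d, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # random minority sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]   # one of its neighbors
        lam = rng.random()                                # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Tiny demo: 4 minority samples in 2-D, synthesize 6 more.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_sketch(X_min, n_new=6, k=2)
print(X_syn.shape)  # (6, 2)
```

Every synthetic point is a convex combination of two real minority samples, so it stays inside the region the minority class already occupies.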

[Figure: Transit light curve signal — flux vs. time (orbital phase), showing a transit dip of ~0.5-1% depth]

Multi-Model Pipeline

We did not commit to a single model architecture. Instead, Rosetta runs three models in parallel and aggregates their predictions. The first is a Random Forest operating on hand-crafted features: transit depth, duration, signal-to-noise ratio, periodicity score, secondary eclipse depth. These features encode domain knowledge from the astrophysics literature. Random Forest is interpretable and surprisingly competitive on tabular features.
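Two of those features, transit depth and signal-to-noise ratio, can be sketched directly from a light curve. The function name, the synthetic data, and the exact formulas here are illustrative, not our production feature extractor:

```python
import numpy as np

def transit_features(flux, in_transit):
    """Illustrative hand-crafted features: transit depth and SNR.
    `in_transit` is a boolean mask marking the in-transit samples."""
    baseline = np.median(flux[~in_transit])         # out-of-transit flux level
    depth = baseline - np.median(flux[in_transit])  # fractional dip
    noise = np.std(flux[~in_transit])               # out-of-transit scatter
    snr = depth / noise * np.sqrt(in_transit.sum()) # depth vs. noise, scaled by samples
    return {"depth": depth, "snr": snr}

# Synthetic light curve: unit flux, 1% transit dip, small Gaussian noise.
rng = np.random.default_rng(42)
flux = 1.0 + rng.normal(0, 0.001, 1000)
in_transit = np.zeros(1000, dtype=bool)
in_transit[450:550] = True
flux[in_transit] -= 0.01   # the 1% dip

f = transit_features(flux, in_transit)
print(f"depth={f['depth']:.3f}")  # close to the injected 0.010
```

Features like these feed the Random Forest as a flat tabular row, which is exactly the regime where tree ensembles do well.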

The second is a Gaussian Naive Bayes classifier. This is the fast, lightweight baseline. It assumes feature independence, which is wrong in practice, but it still provides a useful complementary signal because its failure modes are different from the other two models. When RF and GNB agree on a prediction, confidence is high.
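The agreement heuristic can be sketched as a few lines of plain Python. This is a toy aggregation rule for illustration, not our exact ensemble logic; the function name and the simple averaging are assumptions:

```python
def agreement_confidence(p_rf, p_gnb, p_cnn):
    """Toy aggregation sketch: average the three model probabilities,
    and flag the prediction as high-confidence when RF and GNB land
    on the same side of the 0.5 decision boundary."""
    ensemble = (p_rf + p_gnb + p_cnn) / 3
    high_conf = (p_rf > 0.5) == (p_gnb > 0.5)
    return ensemble, high_conf

score, confident = agreement_confidence(0.9, 0.8, 0.95)
print(round(score, 3), confident)  # 0.883 True
```

The point of pairing models with different inductive biases is exactly this: when they agree despite disagreeing assumptions, the prediction is more trustworthy.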

The third is a 1D CNN that operates directly on the raw light curve time series. No feature engineering. The CNN learns its own representations from the raw flux values. We used three convolutional blocks with batch normalization and max pooling, followed by dense layers. This is where the accuracy champion lives: the CNN hit 99.8% accuracy on the held-out test set. It captures subtle temporal patterns in the transit shape that hand-crafted features miss.
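The architecture described above can be sketched in Keras. The filter counts, kernel size, input length, and dense width here are illustrative placeholders, not the tuned values:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(n_points):
    """Sketch of the 1D CNN: three conv blocks (Conv1D + batch norm
    + max pooling), then dense layers ending in a sigmoid."""
    model = tf.keras.Sequential([layers.Input(shape=(n_points, 1))])
    for filters in (16, 32, 64):  # three convolutional blocks
        model.add(layers.Conv1D(filters, kernel_size=5,
                                padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))  # planet / no planet
    return model

model = build_cnn(2048)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 1)
```

The sigmoid output is a per-curve planet probability, which is what the ensemble and the dashboard consume downstream.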

[Figure: Model pipeline — a light curve feeds the RF, GNB, and CNN (99.8%) models; their outputs are ensembled, passed through the explainability engine, and served to a React + Chart.js dashboard via a Flask API]

Explainability Engine

Black box predictions are not useful in science. If the model says "this light curve contains a planet," an astronomer needs to know why. We built a custom explainability engine that generates per-prediction feature attribution. For the Random Forest, this is straightforward: feature importance from the tree ensemble. For the CNN, we use gradient-weighted class activation mapping (Grad-CAM) adapted for 1D time series. The output is a heatmap over the input light curve showing which temporal regions contributed most to the prediction. If the model is working correctly, the hot regions should align with the transit dips.
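The Random Forest half really is a one-liner in scikit-learn. The toy dataset and feature names below are invented for illustration; the label depends only on the first feature, so that feature should dominate the attribution:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy tabular data: 3 features, signal carried entirely by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0.5).astype(int)  # label depends only on "depth"

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Per-feature attribution straight from the trained ensemble.
for name, w in zip(["depth", "duration", "snr"], rf.feature_importances_):
    print(f"{name}: {w:.2f}")
```

For the CNN, the analogous attribution requires gradients through the network rather than tree statistics, which is why the 1D Grad-CAM adaptation was needed.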

This is not just a nice-to-have. During development, the explainability engine helped us catch a bug where the CNN was keying on an instrument artifact that correlated with transit labels in the training set. The heatmap showed activation at the wrong time positions, which would have been invisible from accuracy metrics alone.

Full Stack Dashboard

The frontend is a React application with Chart.js for interactive light curve visualization. Users can upload a light curve, see the raw time series, get predictions from all three models with confidence scores, and view the explainability heatmap overlaid on the original data. The backend is a Flask API that handles model inference, SMOTE-augmented retraining on new data, and the explainability computations.

We built this during NASA Space Apps 2025. The constraint of a hackathon forced us to be disciplined about scope. No unnecessary features, no over-engineering. Three models, one ensemble, one explainability method, one clean dashboard. It works, it is interpretable, and the CNN's 99.8% accuracy is competitive with published results from teams that spent a lot longer on it.