This ARIMA (AutoRegressive Integrated Moving Average) forecasting implementation in C is crafted to be modular, mathematically sound, microcontroller-friendly, and adaptable, breaking the complex ARIMA process into distinct, mathematically grounded steps. Each component reflects deliberate design choices aimed at balancing accuracy, computational efficiency, and resilience to real-world data challenges (e.g., outliers in `sampleData`). Below, I'll walk through how the system works, why we made these choices, and the trade-offs involved, weaving in the improvements we iterated on together.
Our journey started with a goal: build an ARIMA model that's not just theoretically sound but practical for noisy, non-ideal time series like the one with jumps from 10 to 40 in `sampleData`. We opted for a blend of statistical methods (Huber loss), proven numerical techniques (QR decomposition), and data-driven automation (ADF tests, ACF/PACF order selection), all while keeping the code self-contained in C.
We begin with data preprocessing, a critical step to tame the wild swings in our input series. The `adjustOutliers` function uses the Median Absolute Deviation (MAD) to detect and cap extreme values. For a series ( y_t ), we compute the median ( m ), then calculate the MAD as:

\[ \text{MAD} = \text{median}\left( |y_t - m| \right) \]

This is scaled by 0.6745 (the normal consistency constant) and a factor of 2.5 to set a threshold:

\[ \text{threshold} = 2.5 \cdot \frac{\text{MAD}}{0.6745} \]
Values beyond ( m \pm \text{threshold} ) are clipped to these bounds. Why MAD over standard deviation? It's robust: less influenced by outliers like the 40.229511 spike we're targeting. Initially, we encountered an issue: `qsort` used `strcmp` (for strings) instead of a double comparison. We corrected this with:
```c
int double_cmp(const void *a, const void *b) {
    double da = *(const double*)a, db = *(const double*)b;
    return (da > db) - (da < db);
}
```
This fix ensures accurate sorting, making the outlier adjustment mathematically valid. The upside is a cleaner series for ARIMA estimation; the downside is a fixed ( k = 2.5 ), which might be too strict or lenient depending on the data’s tail behavior.
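To make this concrete, here is a minimal sketch of MAD-based capping in the spirit of `adjustOutliers`; the helper names, the in-place mutation, and the unchecked allocations are illustrative assumptions, not the exact implementation.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for qsort over doubles (the fix described above). */
static int double_cmp(const void *a, const void *b) {
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Median of a copy of the data; leaves the caller's array untouched. */
static double median_of(const double *x, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, double_cmp);
    double m = (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}

/* Cap values beyond median +/- 2.5 * MAD / 0.6745, mutating in place. */
static void adjust_outliers_sketch(double *y, size_t n) {
    double m = median_of(y, n);

    double *dev = malloc(n * sizeof *dev);
    for (size_t i = 0; i < n; i++) dev[i] = fabs(y[i] - m);
    double mad = median_of(dev, n);
    free(dev);

    double threshold = 2.5 * mad / 0.6745;   /* 0.6745: normal consistency constant */
    for (size_t i = 0; i < n; i++) {
        if (y[i] > m + threshold) y[i] = m + threshold;
        if (y[i] < m - threshold) y[i] = m - threshold;
    }
}
```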
Next, we tackle stationarity, the backbone of ARIMA, ensuring a stable mean and variance crucial for reliable modeling. The `differenceSeries` function applies ( d )-th order differencing to remove trends:

\[ y'_t = y_t - y_{t-1} \]

Each iteration reduces the series length by 1. For our specification in `main` with ( d = 1 ), the result is the first-differenced series.
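A minimal sketch of the differencing step, assuming an in-place routine that returns the shortened length; the name and calling convention are illustrative, not the actual `differenceSeries` signature.

```c
#include <stddef.h>

/* Apply d rounds of first differencing in place.
 * Returns the new (shorter) length: n - d. */
static size_t difference_series_sketch(double *y, size_t n, int d) {
    for (int k = 0; k < d && n > 1; k++) {
        for (size_t t = 0; t + 1 < n; t++)
            y[t] = y[t + 1] - y[t];       /* y'_t = y_{t+1} - y_t */
        n--;                               /* each pass loses one observation */
    }
    return n;
}
```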
To automate the choice of ( d ), we use `ensureStationary`, which relies on the Augmented Dickey-Fuller (ADF) test:
- ADF Test (`ADFTestExtendedAutoLag`): Fits the regression

  \[ \Delta y_t = \alpha + \beta y_{t-1} + \sum_{i=1}^{p} \gamma_i \Delta y_{t-i} + \epsilon_t \]

  and selects the number of lags ( p ) by minimizing the Akaike Information Criterion (AIC), up to a maximum lag that grows with the series length ( n ). If the p-value fails to reject the unit-root null, the series is differenced again, up to a maximum ( d ) (default 2, or user-specified).
- P-Value: Interpolated linearly between critical values (-3.43, -2.86, -2.57) for the 1%, 5%, and 10% significance levels.
Why this method? ARIMA assumes stationarity, and ADF automation adapts ( d ) to the data, removing guesswork. The iterative differencing aligns with ARIMA’s “I” component—simple and effective. However, ADF’s limited power against near-unit roots and our basic p-value interpolation are trade-offs; more detailed statistical tables or a KPSS test could improve accuracy.
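The automation around ( d ) can be pictured as a short loop. The sketch below assumes a hypothetical `adf_pvalue()` helper standing in for `ADFTestExtendedAutoLag` and the conventional 5% cutoff; the real interface may differ.

```c
#include <stddef.h>

/* Stand-ins for the ADF test and the differencing routine sketched earlier. */
extern double adf_pvalue(const double *y, size_t n);
extern size_t difference_series_sketch(double *y, size_t n, int d);

/* Keep differencing until the ADF test rejects a unit root,
 * or until the maximum allowed d is reached. */
static int ensure_stationary_sketch(double *y, size_t *n, int max_d) {
    int d = 0;
    while (d < max_d && adf_pvalue(y, *n) > 0.05) {
        *n = difference_series_sketch(y, *n, 1);
        d++;
    }
    return d;   /* the order of differencing actually applied */
}
```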
To reverse differencing, `integrateSeries` reconstructs the original scale:

\[ \hat{y}_{t+1} = y_t + \Delta\hat{y}_{t+1} \]

For subsequent steps:

\[ \hat{y}_{t+h} = \hat{y}_{t+h-1} + \Delta\hat{y}_{t+h} \]

Starting with the last observed value ( y_t ), it sums the differenced forecasts cumulatively. This works perfectly for ( d = 1 ), our focus, but for ( d > 1 ) it would need prior values (e.g., ( y_{t-1} )), a simplification we chose for practicality.
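Undoing one round of differencing is just a cumulative sum anchored at the last observation. A minimal sketch for ( d = 1 ), with an illustrative signature:

```c
#include <stddef.h>

/* Convert h differenced forecasts into level forecasts for d = 1.
 * last_value is the final observation of the original (undifferenced) series. */
static void integrate_series_sketch(const double *diff_forecast, double *level_forecast,
                                    size_t h, double last_value) {
    double acc = last_value;
    for (size_t i = 0; i < h; i++) {
        acc += diff_forecast[i];       /* y_{t+h} = y_{t+h-1} + delta y_{t+h} */
        level_forecast[i] = acc;
    }
}
```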
With a stationary series, we estimate AR and MA parameters. For AR, `estimateARWithCMLE` models:

\[ y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t \]
We chose Conditional Maximum Likelihood Estimation (CMLE) with Newton-Raphson:
- Initialization: Yule-Walker via `yuleWalker`, solving

  \[ R\,\phi = r, \]

  where ( R_{ij} = r_{|i-j|} ) (Toeplitz ACF matrix) and ( r = [r_1, ..., r_p] ). It's fast and stable, a great starting point.
- Loss: Huber loss for robustness:

  \[ L(e_t) = \begin{cases} \tfrac{1}{2} e_t^2, & |e_t| \leq \delta, \\ \delta\,(|e_t| - \tfrac{\delta}{2}), & |e_t| > \delta, \end{cases} \]

  with ( \delta \approx 1.345 ) (95% efficiency under normality).
- Optimization: Updates the parameter vector

  \[ \phi^{(k+1)} = \phi^{(k)} - \lambda\, H^{-1} g, \]

  using a line search over the step size ( \lambda ), where

  \[ g_i = -\sum_t e_t\, y_{t-i}, \qquad H_{ij} \approx \sum_t y_{t-i}\, y_{t-j}. \]
Why Huber? Gaussian MLE struggles with outliers like 40.229511; Huber’s linear tail reduces their impact, enhancing robustness—a key improvement we prioritized. The gradient uses raw errors rather than Huber’s ( \psi )-function—a simplification that might bias ( \phi ) slightly, but Newton-Raphson’s line search ensures convergence. The Hessian approximation assumes Gaussian curvature, a practical choice over exact computation.
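To show the objective actually being minimized, here is a sketch of the Huber loss and the simplified raw-error gradient for an AR(( p )) fit; the function names and signatures are assumptions, and the constant 1.345 follows the description above.

```c
#include <math.h>
#include <stddef.h>

#define HUBER_DELTA 1.345   /* 95% efficiency under normality */

/* Huber loss for a single residual: quadratic near zero, linear in the tails. */
static double huber(double e) {
    double a = fabs(e);
    return (a <= HUBER_DELTA) ? 0.5 * e * e
                              : HUBER_DELTA * (a - 0.5 * HUBER_DELTA);
}

/* Huber objective and the simplified raw-error gradient for AR(p):
 *   e_t = y_t - sum_i phi_i * y_{t-i},   grad_i = -sum_t e_t * y_{t-i}.
 * Returns the summed loss; writes the gradient into grad[0..p-1]. */
static double ar_huber_loss_and_grad(const double *y, size_t n,
                                     const double *phi, int p, double *grad) {
    double loss = 0.0;
    for (int i = 0; i < p; i++) grad[i] = 0.0;

    for (size_t t = (size_t)p; t < n; t++) {
        double e = y[t];
        for (int i = 0; i < p; i++) e -= phi[i] * y[t - 1 - i];
        loss += huber(e);
        for (int i = 0; i < p; i++) grad[i] -= e * y[t - 1 - i];  /* raw-error gradient */
    }
    return loss;
}
```

The raw-error gradient is exactly the simplification discussed above: it ignores the Huber ( \psi )-function in the derivative, trading a small potential bias for a simpler update.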
For MA, `estimateMAWithMLE` models:

\[ y_t = \epsilon_t + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} \]
- Initialization: initial ( \theta_j ) values are taken from the ACF, damped for stability.
- Errors: Computed recursively:

  \[ \epsilon_t = y_t - \sum_{j=1}^{q} \theta_j \epsilon_{t-j}, \]

  assuming ( \epsilon_{t<q} = 0 ).
- Loss and Optimization: The same Huber-based Newton-Raphson as AR, with gradients taken with respect to ( \theta ).
This aligns with AR for consistency, leveraging Huber’s robustness. The initial error assumption is typical for conditional MLE but adds early bias—exact likelihood (e.g., Kalman filtering) was bypassed for simplicity.
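The conditional error recursion can be sketched as follows, with pre-sample errors treated as zero; the names and buffer layout are illustrative.

```c
#include <stddef.h>

/* Recover residuals for an MA(q) model conditionally:
 *   eps_t = y_t - sum_j theta_j * eps_{t-j},  with pre-sample errors set to zero. */
static void ma_residuals_sketch(const double *y, size_t n,
                                const double *theta, int q, double *eps) {
    for (size_t t = 0; t < n; t++) {
        double e = y[t];
        for (int j = 1; j <= q; j++) {
            if (t >= (size_t)j)           /* earlier errors treated as zero */
                e -= theta[j - 1] * eps[t - j];
        }
        eps[t] = e;
    }
}
```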
Both use `checkRoots` to enforce stationarity (AR) and invertibility (MA), computing eigenvalues of the companion matrix

\[ C = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} \]

(with ( \theta ) in place of ( \phi ) for the MA side)
via QR iteration. Roots ( |\text{root}| \leq 1.001 ) are scaled by 0.95. QR replaced our initial power iteration for accuracy—a critical upgrade.
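A sketch of the root check: build the companion matrix, obtain its spectral radius (here via a hypothetical `spectral_radius()` helper standing in for the QR iteration), and shrink the coefficients by 0.95 until the polynomial is safely stationary or invertible. The 1.001 guard band and the exact scaling rule used by `checkRoots` are simplified here.

```c
#include <string.h>

/* Hypothetical helper: largest eigenvalue modulus of an m x m row-major matrix,
 * standing in for the QR iteration used by checkRoots. */
extern double spectral_radius(const double *A, int m);

/* Shrink AR or MA coefficients c[0..m-1] until the companion matrix has all
 * eigenvalues strictly inside the unit circle (stationarity/invertibility). */
static void enforce_roots_sketch(double *c, int m) {
    double A[16 * 16];                     /* sketch assumes m <= 16 */
    if (m <= 0 || m > 16) return;

    for (;;) {
        memset(A, 0, sizeof A);
        for (int j = 0; j < m; j++) A[j] = c[j];             /* first row: coefficients */
        for (int i = 1; i < m; i++) A[i * m + (i - 1)] = 1;  /* sub-diagonal of ones */
        if (spectral_radius(A, m) < 1.0) break;
        for (int j = 0; j < m; j++) c[j] *= 0.95;            /* shrink toward stability */
    }
}
```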
`forecastARIMA` orchestrates the process:

- Preprocessing: Applies `adjustOutliers` (fixed with `double_cmp`).
- Stationarity: Sets ( d ) via `ensureStationary`, or auto-selects ( p, d, q ) with `selectOrdersWithFeedback` (ACF/PACF) and ADF if ( -1 ) is passed.
- Estimation: Runs `estimateARWithCMLE` and `estimateMAWithMLE`.
- Forecasting: Recursively (sketched in code below):

  \[ f_{t+h} = \sum_{j=1}^{p} \phi_j y_{t+h-j} + \sum_{k=1}^{q} \theta_k \epsilon_{t+h-k}, \]

  where ( y_{t+h-j} = f_{t+h-j} ) if ( h-j > 0 ), and ( \epsilon_{t+h-k} = 0 ) if ( h-k > 0 ).
- Variance:

  \[ \text{Var}(f_{t+h}) = \sigma^2 \sum_{j=0}^{h-1} \psi_j^2, \]

  with ( \psi_0 = 1 ), ( \psi_j = \phi_j ) or ( \theta_j ), and ( \sigma^2 ) estimated from the in-sample residuals.
- Integration: `integrateSeries` for ( d > 0 ).
This mirrors standard ARIMA forecasting, enhanced by Huber and QR robustness. Auto-selection uses ACF (( q )) and PACF (( p )) with a ( \frac{2}{\sqrt{n}} ) threshold—simple but effective, though lacking BIC/AIC refinement. Variance is a basic psi-weight sum, underestimating long-term uncertainty without full MA(( \infty )) expansion.
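Putting the forecasting and variance steps into code, here is a minimal sketch of the recursion and the psi-weight sum; the buffer layout, signature, and the crude psi approximation are assumptions rather than the `forecastARIMA` interface.

```c
#include <stddef.h>

/* h-step recursive forecasts for an ARMA(p, q) on the (differenced) series y.
 * eps holds in-sample residuals; future shocks are treated as zero.
 * Writes point forecasts to f[0..h-1] and their variances to var[0..h-1]. */
static void forecast_sketch(const double *y, const double *eps, size_t n,
                            const double *phi, int p, const double *theta, int q,
                            double sigma2, size_t h, double *f, double *var) {
    for (size_t s = 0; s < h; s++) {
        double v = 0.0;

        /* AR part: use earlier forecasts when available, else observed values. */
        for (int j = 1; j <= p; j++) {
            long idx = (long)s - j;              /* position relative to forecast origin */
            v += phi[j - 1] * (idx >= 0 ? f[idx] : y[(long)n + idx]);
        }
        /* MA part: only in-sample residuals contribute; future errors are zero. */
        for (int k = 1; k <= q; k++) {
            long idx = (long)s - k;
            if (idx < 0) v += theta[k - 1] * eps[(long)n + idx];
        }
        f[s] = v;

        /* Crude psi-weight variance: psi_0 = 1, later psi_j approximated from
         * the AR and MA coefficients rather than a full MA(infinity) expansion. */
        double sum = 1.0;
        for (size_t j = 1; j <= s; j++) {
            double psi = 0.0;
            if (j <= (size_t)p) psi += phi[j - 1];
            if (j <= (size_t)q) psi += theta[j - 1];
            sum += psi * psi;
        }
        var[s] = sigma2 * sum;
    }
}
```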
We check residuals with their sample autocorrelations and a portmanteau ( Q ) statistic, which give insight into residual whiteness. Reporting the raw statistics without thresholds gives flexibility: users can compare the autocorrelations to ( \pm \frac{2}{\sqrt{n}} ) and ( Q ) to chi-squared critical values.
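For the residual diagnostics, here is a sketch of the sample autocorrelation and a Ljung-Box-style ( Q ) statistic that can be compared against ( \pm \frac{2}{\sqrt{n}} ) and chi-squared critical values; whether the implementation uses exactly this form of ( Q ) is an assumption.

```c
#include <stddef.h>

/* Sample autocorrelation of residuals e at lag k. */
static double residual_acf(const double *e, size_t n, size_t k) {
    double mean = 0.0, num = 0.0, den = 0.0;
    for (size_t t = 0; t < n; t++) mean += e[t];
    mean /= (double)n;
    for (size_t t = k; t < n; t++) num += (e[t] - mean) * (e[t - k] - mean);
    for (size_t t = 0; t < n; t++) den += (e[t] - mean) * (e[t] - mean);
    return num / den;
}

/* Ljung-Box style statistic over the first m lags:
 *   Q = n (n + 2) * sum_{k=1..m} r_k^2 / (n - k)
 * Compare against a chi-squared distribution with m - p - q degrees of freedom. */
static double ljung_box_q(const double *e, size_t n, size_t m) {
    double q = 0.0;
    for (size_t k = 1; k <= m; k++) {
        double r = residual_acf(e, n, k);
        q += r * r / (double)(n - k);
    }
    return (double)n * ((double)n + 2.0) * q;
}
```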
Our approach prioritizes robustness (Huber, QR), automation (ADF, order selection), and simplicity (no external libraries). Key enhancements:

- Fixed `adjustOutliers` with `double_cmp`.
- Upgraded `checkRoots` from power iteration to QR.
- Chose Huber over Gaussian for outlier resilience.
Upsides:
- Excels with noisy data (e.g., `sampleData`).
- Self-contained, efficient for small ( p, q ).
- Adaptive and robust.
Downsides:
- Gradient approximations may bias estimates.
- Limited ( d > 1 ) integration support.
- Simplified variance underestimates long-term uncertainty.
This ARIMA(2,1,4) shines for noisy series, striking a practical balance between theory and application.