ARIMA Implementation – Design Decisions and Rationale

This ARIMA (AutoRegressive Integrated Moving Average) forecasting implementation in C is crafted to be modular, mathematically sound, microcontroller-friendly, and adaptable, breaking the complex ARIMA process into distinct, well-grounded steps. Each component reflects deliberate design choices aimed at balancing accuracy, computational efficiency, and resilience to real-world data challenges (e.g., outliers in sampleData). Below, I’ll walk through how the system works, why we made these choices, and the trade-offs involved, weaving in the improvements we iterated on together.

Our journey started with a goal: build an ARIMA model that’s not just theoretically sound but practical for noisy, non-ideal time series like the one with jumps from 10 to 40 in sampleData. We opted for a blend of robust statistical methods (Huber loss), proven numerical techniques (QR decomposition), and data-driven automation (ADF tests, ACF/PACF order selection), all while keeping the code self-contained in C.


Data Preprocessing

We begin with data preprocessing, a critical step to tame the wild swings in our input series. The adjustOutliers function uses the Median Absolute Deviation (MAD) to detect and cap extreme values. For a series ( y_t ), we compute the median ( m ), then calculate the MAD as:

( \text{MAD} = \text{median}_t\left( \left| y_t - m \right| \right) )

This is scaled by 0.6745 (the normal consistency constant) and a factor of 2.5 to set a threshold:

( \text{threshold} = 2.5 \cdot \frac{\text{MAD}}{0.6745} )

Values beyond ( m \pm \text{threshold} ) are clipped to these bounds. Why MAD over standard deviation? It’s robust—less influenced by outliers like the 40.229511 spike we’re targeting. Initially, we encountered an issue: qsort used strcmp (for strings) instead of a double comparison. We corrected this with:

int double_cmp(const void *a, const void *b) {
    double da = *(const double*)a, db = *(const double*)b;
    return (da > db) - (da < db);
}

This fix ensures accurate sorting, making the outlier adjustment mathematically valid. The upside is a cleaner series for ARIMA estimation; the downside is a fixed ( k = 2.5 ), which might be too strict or lenient depending on the data’s tail behavior.
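
To make the mechanics concrete, here is a minimal sketch of the MAD-based capping that reuses double_cmp; the function name, the simple odd-length median, and the memory handling are illustrative assumptions rather than the repository’s exact code:

#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Sketch: cap values beyond median +/- 2.5 * MAD / 0.6745. */
void adjustOutliersSketch(double *y, int n) {
    double *tmp = malloc(n * sizeof(double));
    memcpy(tmp, y, n * sizeof(double));
    qsort(tmp, n, sizeof(double), double_cmp);
    double median = tmp[n / 2];                    /* middle element (odd n assumed) */
    for (int i = 0; i < n; i++) tmp[i] = fabs(y[i] - median);
    qsort(tmp, n, sizeof(double), double_cmp);
    double mad = tmp[n / 2];
    double threshold = 2.5 * mad / 0.6745;
    for (int i = 0; i < n; i++) {                  /* clip to median +/- threshold */
        if (y[i] > median + threshold) y[i] = median + threshold;
        if (y[i] < median - threshold) y[i] = median - threshold;
    }
    free(tmp);
}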


Stationarity

Next, we tackle stationarity, the backbone of ARIMA, ensuring a stable mean and variance crucial for reliable modeling. The differenceSeries function applies ( d )-th order differencing to remove trends:

( \Delta y_t = y_t - y_{t-1}, \qquad \Delta^d y_t = \Delta\left( \Delta^{d-1} y_t \right) )

Each iteration reduces the series length by 1. For our specification in main with ( d = 1 ), the result is:

( y'_t = y_t - y_{t-1} )

To automate the choice of ( d ), we use ensureStationary, which relies on the Augmented Dickey-Fuller (ADF) test:

  • ADF Test (ADFTestExtendedAutoLag): Fits the regression:

    ( \Delta y_t = \alpha + \beta y_{t-1} + \sum_{i=1}^{p} \gamma_i\, \Delta y_{t-i} + \epsilon_t )

    It selects the number of lags ( p ) by minimizing the Akaike Information Criterion (AIC), up to a maximum lag determined by the series length ( n ). If the p-value satisfies:

    ( p\text{-value} > 0.05 )

    the series is differenced again, up to a maximum ( d ) (default 2, or user-specified).

  • P-Value: Interpolates linearly between critical values (-3.43, -2.86, -2.57) for 1%, 5%, and 10% significance levels.

Why this method? ARIMA assumes stationarity, and ADF automation adapts ( d ) to the data, removing guesswork. The iterative differencing aligns with ARIMA’s “I” component—simple and effective. However, ADF’s limited power against near-unit roots and our basic p-value interpolation are trade-offs; more detailed statistical tables or a KPSS test could improve accuracy.
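
Conceptually, the automation reduces to a short loop; the sketch below assumes ADFTestExtendedAutoLag returns a p-value and that differencing is done in place, which may differ from the repository’s exact interfaces:

/* Sketch: difference until the ADF test rejects a unit root or maxD is reached. */
int ensureStationarySketch(double *y, int *n, int maxD) {
    int d = 0;
    while (d < maxD) {
        double pValue = ADFTestExtendedAutoLag(y, *n);   /* assumed signature */
        if (pValue <= 0.05) break;                       /* stationary enough, stop differencing */
        for (int t = 1; t < *n; t++) y[t - 1] = y[t] - y[t - 1];
        (*n)--;                                          /* each pass shortens the series by 1 */
        d++;
    }
    return d;                                            /* chosen order of differencing */
}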

To reverse differencing, integrateSeries reconstructs the original scale:

( \hat{y}_{t+1} = y_t + \Delta\hat{y}_{t+1} )

For subsequent steps:

( \hat{y}_{t+h} = \hat{y}_{t+h-1} + \Delta\hat{y}_{t+h} )

Starting with the last observed value ( y_t ), it sums the differenced forecasts cumulatively. This works perfectly for ( d = 1 ), our focus, but for ( d > 1 ), it’d need prior values (e.g., ( y_{t-1} ))—a simplification we chose for practicality.
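
For ( d = 1 ) this is just a cumulative sum anchored at the last observation; a minimal sketch with assumed names:

/* Sketch: undo first-order differencing.
   diffForecasts[] are forecasts on the differenced scale, lastObserved is y_t. */
void integrateSeriesSketch(const double *diffForecasts, double *forecasts,
                           int h, double lastObserved) {
    double level = lastObserved;
    for (int i = 0; i < h; i++) {
        level += diffForecasts[i];   /* y_hat(t+i+1) = y_hat(t+i) + delta_hat(t+i+1) */
        forecasts[i] = level;
    }
}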


AR/MA Parameter Estimation

With a stationary series, we estimate AR and MA parameters. For AR, estimateARWithCMLE models:

( y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \epsilon_t )

We chose Conditional Maximum Likelihood Estimation (CMLE) with Newton-Raphson:

  • Initialization: Yule-Walker via yuleWalker, solving:

    ( R\,\hat{\phi} = r )

    where ( R_{ij} = r_{|i-j|} ) (Toeplitz ACF matrix) and ( r = [r_1, ..., r_p] ). It’s fast and stable, a great starting point.

  • Loss: Huber loss for robustness:

    ( L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \leq \delta \\ \delta\left( |e| - \frac{1}{2}\delta \right) & \text{if } |e| > \delta \end{cases} )

    with:

    ( \delta = 1.345\,\hat{\sigma} )

    (95% efficiency under normality).

  • Optimization: Updates the parameter vector:

    ( \beta = [\phi_1, \ldots, \phi_p, c]^\top )

    using:

    ( \beta_{\text{new}} = \beta_{\text{old}} - \lambda\, H^{-1} g )

    where:

    • Gradient components are:

      ( \frac{\partial L}{\partial \phi_j} = -\sum_t e_t\, y_{t-j} )

      for the AR coefficients, and:

      ( \frac{\partial L}{\partial c} = -\sum_t e_t )

      for the intercept.

    • Hessian elements are:

      ( H_{jk} = \sum_t y_{t-j}\, y_{t-k} )

      for the AR terms, and:

      ( H_{cc} = \sum_t 1 )

      for the intercept.

    • Line search halves the step if the loss increases.

Why Huber? Gaussian MLE struggles with outliers like 40.229511; Huber’s linear tail reduces their impact, enhancing robustness—a key improvement we prioritized. The gradient uses raw errors rather than Huber’s ( \psi )-function—a simplification that might bias ( \phi ) slightly, but Newton-Raphson’s line search ensures convergence. The Hessian approximation assumes Gaussian curvature, a practical choice over exact computation.
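
For reference, the Huber loss and its ψ (influence) function are only a few lines of C; as noted above, the gradient in this implementation uses raw errors rather than ψ, so the ψ variant is shown purely for comparison (a sketch, not the exact code):

#include <math.h>

/* Huber loss: quadratic near zero, linear in the tails. */
double huberLoss(double e, double delta) {
    double a = fabs(e);
    return (a <= delta) ? 0.5 * e * e : delta * (a - 0.5 * delta);
}

/* Huber psi (derivative of the loss): clips large errors at +/- delta. */
double huberPsi(double e, double delta) {
    if (e >  delta) return  delta;
    if (e < -delta) return -delta;
    return e;
}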

For MA, estimateMAWithMLE models:

( y_t = \mu + \epsilon_t + \sum_{j=1}^{q} \theta_j\, \epsilon_{t-j} )

  • Initialization: initial ( \theta_j ) values are derived from the sample ACF, damped for stability.

  • Errors: Recursively:

    ( \epsilon_t = y_t - \mu - \sum_{j=1}^{q} \theta_j\, \epsilon_{t-j} )

    assuming ( \epsilon_{t<q} = 0 ).

  • Loss and Optimization: Same Huber-based Newton-Raphson as AR, with:

    ( \beta = [\theta_1, \ldots, \theta_q, \mu]^\top )

This aligns with AR for consistency, leveraging Huber’s robustness. The initial error assumption is typical for conditional MLE but adds early bias—exact likelihood (e.g., Kalman filtering) was bypassed for simplicity.
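
The conditional error recursion is equally direct in code; a sketch that treats any residual whose index would fall before the start of the series as zero (names are illustrative):

/* Sketch: conditional MA(q) residuals, eps[t] = y[t] - mu - sum_j theta[j] * eps[t-1-j]. */
void computeMAResidualsSketch(const double *y, int n, double mu,
                              const double *theta, int q, double *eps) {
    for (int t = 0; t < n; t++) {
        double pred = mu;
        for (int j = 0; j < q; j++) {
            if (t - 1 - j >= 0) pred += theta[j] * eps[t - 1 - j];  /* earlier residuals only */
        }
        eps[t] = y[t] - pred;
    }
}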

Both use checkRoots to enforce stationarity (AR) and invertibility (MA), computing eigenvalues of:

( C = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} )

via QR iteration. Roots ( |\text{root}| \leq 1.001 ) are scaled by 0.95. QR replaced our initial power iteration for accuracy—a critical upgrade.
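
Constructing that companion matrix is mechanical; a sketch for the AR case (for the MA/invertibility check the ( \theta_j ) coefficients take the place of ( \phi_j )):

/* Sketch: fill a p x p companion matrix C (row-major) for coefficients phi[0..p-1].
   The first row holds the coefficients; the subdiagonal holds ones. */
void buildCompanionSketch(const double *phi, int p, double *C) {
    for (int i = 0; i < p * p; i++) C[i] = 0.0;
    for (int j = 0; j < p; j++) C[j] = phi[j];             /* first row */
    for (int i = 1; i < p; i++) C[i * p + (i - 1)] = 1.0;  /* subdiagonal of ones */
}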


Forecasting

forecastARIMA orchestrates the process:

  • Preprocessing: Applies adjustOutliers (fixed with double_cmp).

  • Stationarity: Sets ( d ) via ensureStationary, or auto-selects ( p, d, q ) with selectOrdersWithFeedback (ACF/PACF) and ADF if ( -1 ) is passed.

  • Estimation: Runs estimateARWithCMLE and estimateMAWithMLE.

  • Forecasting: Recursively:

    ( f_{t+h} = c + \sum_{j=1}^{p} \phi_j\, y_{t+h-j} + \sum_{k=1}^{q} \theta_k\, \epsilon_{t+h-k} )

    where ( y_{t+h-j} = f_{t+h-j} ) if ( h-j > 0 ), ( \epsilon_{t+h-k} = 0 ) if ( h-k > 0 ).

  • Variance:

    equation

    with ( \psi_j = \phi_j ) or ( \theta_j ), and:

    ( \sigma^2 = \frac{1}{n} \sum_{t} \epsilon_t^2 )

  • Integration: integrateSeries for ( d > 0 ).

This mirrors standard ARIMA forecasting, enhanced by Huber and QR robustness. Auto-selection uses ACF (( q )) and PACF (( p )) with a ( \frac{2}{\sqrt{n}} ) threshold—simple but effective, though lacking BIC/AIC refinement. Variance is a basic psi-weight sum, underestimating long-term uncertainty without full MA(( \infty )) expansion.
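
The forecast recursion reads almost directly off the equation above; a compact sketch on the already-differenced scale (array layout and names are assumptions, not the repository’s exact interface):

/* Sketch: h-step ARMA(p,q) forecasts.
   y[0..n-1]   : stationary series, most recent value at y[n-1]
   eps[0..n-1] : in-sample residuals; future shocks are taken as 0 */
void forecastARMASketch(const double *y, const double *eps, int n,
                        double c, const double *phi, int p,
                        const double *theta, int q,
                        double *f, int h) {
    for (int step = 1; step <= h; step++) {
        double val = c;
        for (int j = 1; j <= p; j++) {
            int idx = step - j;                                   /* offset into the horizon */
            val += phi[j - 1] * (idx > 0 ? f[idx - 1] : y[n + idx - 1]);
        }
        for (int k = 1; k <= q; k++) {
            int idx = step - k;
            if (idx <= 0) val += theta[k - 1] * eps[n + idx - 1]; /* only past shocks contribute */
        }
        f[step - 1] = val;
    }
}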


Diagnostics

We check residuals with:

  • Residual ACF (computeResidualACF):

    ( r_k = \frac{\sum_{t=k+1}^{n} e_t\, e_{t-k}}{\sum_{t=1}^{n} e_t^2} )

    reporting max ( |r_k| ).

  • Ljung-Box (computeLjungBox):

    ( Q = n(n+2) \sum_{k=1}^{m} \frac{r_k^2}{n-k} )

These are correct, offering residual whiteness insights. Raw stats without thresholds give flexibility—users can compare to ( \frac{2}{\sqrt{n}} ) or chi-squared values.
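
Computing Q from the residual autocorrelations takes only a few lines; in this sketch the caller supplies r[1..m] and compares the result to a chi-squared critical value (typically with ( m - p - q ) degrees of freedom):

/* Sketch: Ljung-Box Q statistic from residual autocorrelations r[1..m] (r[0] unused). */
double ljungBoxSketch(const double *r, int n, int m) {
    double q = 0.0;
    for (int k = 1; k <= m; k++) {
        q += (r[k] * r[k]) / (double)(n - k);   /* r_k^2 / (n - k) */
    }
    return (double)n * (n + 2) * q;
}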


Design Decisions and Trade-Offs

Our approach prioritizes robustness (Huber, QR), automation (ADF, order selection), and simplicity (no external libraries). Key enhancements:

  • Fixed adjustOutliers with double_cmp.
  • Upgraded to QR from power iteration in checkRoots.
  • Chose Huber over Gaussian for outlier resilience.

Upsides:

  • Excels with noisy data (e.g., sampleData).
  • Self-contained, efficient for small ( p, q ).
  • Adaptive and robust.

Downsides:

  • Gradient approximations may bias estimates.
  • Limited ( d > 1 ) integration support.
  • Simplified variance underestimates long-term uncertainty.

This ARIMA(2,1,4) configuration shines for noisy series, striking a practical balance between theory and application.