ARIMA Implementation – Design Decisions and Rationale

This ARIMA (AutoRegressive Integrated Moving Average) forecasting implementation in C is crafted to be modular, mathematically sound, microcontroller-friendly, and adaptable, breaking the complex ARIMA process into distinct, well-grounded steps. Each component reflects deliberate design choices aimed at balancing accuracy, computational efficiency, and resilience to real-world data challenges (e.g., outliers in sampleData). Below, I’ll walk through how the system works, why we made these choices, and the trade-offs involved, weaving in the improvements we iterated on together.

Our journey started with a goal: build an ARIMA model that’s not just theoretically sound but practical for noisy, non-ideal time series like the one with jumps from 10 to 40 in sampleData. We opted for a blend of robust statistical methods (Huber loss), proven numerical techniques (QR decomposition), and data-driven automation (ADF tests, ACF/PACF order selection), all while keeping the code self-contained in C.


Data Preprocessing

We begin with data preprocessing, a critical step to tame the wild swings in our input series. The adjustOutliers function uses the Median Absolute Deviation (MAD) to detect and cap extreme values. For a series ( y_t ), we compute the median ( m ), then calculate the MAD as:

( \text{MAD} = \text{median}_t\left( \left| y_t - m \right| \right) )

This is scaled by 0.6745 (the normal consistency constant) and a factor of 2.5 to set a threshold:

( \text{threshold} = 2.5 \cdot \frac{\text{MAD}}{0.6745} )

Values beyond ( m \pm \text{threshold} ) are clipped to these bounds. Why MAD over standard deviation? It’s robust—less influenced by outliers like the 40.229511 spike we’re targeting. Initially, we encountered an issue: qsort used strcmp (for strings) instead of a double comparison. We corrected this with:

int double_cmp(const void *a, const void *b) {
    double da = *(const double*)a, db = *(const double*)b;
    return (da > db) - (da < db);
}

This fix ensures accurate sorting, making the outlier adjustment mathematically valid. The upside is a cleaner series for ARIMA estimation; the downside is a fixed ( k = 2.5 ), which might be too strict or lenient depending on the data’s tail behavior.
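
To make the mechanics concrete, here is a minimal sketch of the MAD-based capping that reuses double_cmp; the function name, the simple odd-length median, and the memory handling are illustrative assumptions rather than the repository’s exact code:

#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Sketch: cap values beyond median +/- 2.5 * MAD / 0.6745. */
void adjustOutliersSketch(double *y, int n) {
    double *tmp = malloc(n * sizeof(double));
    memcpy(tmp, y, n * sizeof(double));
    qsort(tmp, n, sizeof(double), double_cmp);
    double median = tmp[n / 2];                    /* middle element (odd n assumed) */
    for (int i = 0; i < n; i++) tmp[i] = fabs(y[i] - median);
    qsort(tmp, n, sizeof(double), double_cmp);
    double mad = tmp[n / 2];
    double threshold = 2.5 * mad / 0.6745;
    for (int i = 0; i < n; i++) {                  /* clip to median +/- threshold */
        if (y[i] > median + threshold) y[i] = median + threshold;
        if (y[i] < median - threshold) y[i] = median - threshold;
    }
    free(tmp);
}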


Stationarity

Next, we tackle stationarity, the backbone of ARIMA, ensuring a stable mean and variance crucial for reliable modeling. The differenceSeries function applies ( d )-th order differencing to remove trends:

( \Delta y_t = y_t - y_{t-1}, \qquad \Delta^d y_t = \Delta\left( \Delta^{d-1} y_t \right) )

Each iteration reduces the series length by 1. For our specification in main with ( d = 1 ), the result is:

( y'_t = y_t - y_{t-1} )

To automate the choice of ( d ), we use ensureStationary, which relies on the Augmented Dickey-Fuller (ADF) test:

  • ADF Test (ADFTestExtendedAutoLag): Fits the regression:

    ( \Delta y_t = \alpha + \beta y_{t-1} + \sum_{i=1}^{p} \gamma_i\, \Delta y_{t-i} + \epsilon_t )

    It selects the number of lags ( p ) by minimizing the Akaike Information Criterion (AIC), up to a maximum lag determined by the series length ( n ). If the p-value satisfies:

    ( p\text{-value} > 0.05 )

    the series is differenced again, up to a maximum ( d ) (default 2, or user-specified).

  • P-Value: Interpolates linearly between critical values (-3.43, -2.86, -2.57) for 1%, 5%, and 10% significance levels.

Why this method? ARIMA assumes stationarity, and ADF automation adapts ( d ) to the data, removing guesswork. The iterative differencing aligns with ARIMA’s “I” component—simple and effective. However, ADF’s limited power against near-unit roots and our basic p-value interpolation are trade-offs; more detailed statistical tables or a KPSS test could improve accuracy.
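
Conceptually, the automation reduces to a short loop; the sketch below assumes ADFTestExtendedAutoLag returns a p-value and that differencing is done in place, which may differ from the repository’s exact interfaces:

/* Sketch: difference until the ADF test rejects a unit root or maxD is reached. */
int ensureStationarySketch(double *y, int *n, int maxD) {
    int d = 0;
    while (d < maxD) {
        double pValue = ADFTestExtendedAutoLag(y, *n);   /* assumed signature */
        if (pValue <= 0.05) break;                       /* stationary enough, stop differencing */
        for (int t = 1; t < *n; t++) y[t - 1] = y[t] - y[t - 1];
        (*n)--;                                          /* each pass shortens the series by 1 */
        d++;
    }
    return d;                                            /* chosen order of differencing */
}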

To reverse differencing, integrateSeries reconstructs the original scale:

( \hat{y}_{t+1} = y_t + \Delta\hat{y}_{t+1} )

For subsequent steps:

( \hat{y}_{t+h} = \hat{y}_{t+h-1} + \Delta\hat{y}_{t+h} )

Starting with the last observed value ( y_t ), it sums the differenced forecasts cumulatively. This works perfectly for ( d = 1 ), our focus, but for ( d > 1 ), it’d need prior values (e.g., ( y_{t-1} ))—a simplification we chose for practicality.
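
For ( d = 1 ) this is just a cumulative sum anchored at the last observation; a minimal sketch with assumed names:

/* Sketch: undo first-order differencing.
   diffForecasts[] are forecasts on the differenced scale, lastObserved is y_t. */
void integrateSeriesSketch(const double *diffForecasts, double *forecasts,
                           int h, double lastObserved) {
    double level = lastObserved;
    for (int i = 0; i < h; i++) {
        level += diffForecasts[i];   /* y_hat(t+i+1) = y_hat(t+i) + delta_hat(t+i+1) */
        forecasts[i] = level;
    }
}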


AR/MA Parameter Estimation

With a stationary series, we estimate AR and MA parameters. For AR, estimateARWithCMLE models:

( y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \epsilon_t )

We chose Conditional Maximum Likelihood Estimation (CMLE) with Newton-Raphson:

  • Initialization: Yule-Walker via yuleWalker, solving:

    ( R\,\hat{\phi} = r )

    where ( R_{ij} = r_{|i-j|} ) (Toeplitz ACF matrix) and ( r = [r_1, ..., r_p] ). It’s fast and stable, a great starting point.

  • Loss: Huber loss for robustness:

    ( L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \leq \delta \\ \delta\left( |e| - \frac{1}{2}\delta \right) & \text{if } |e| > \delta \end{cases} )

    with:

    ( \delta = 1.345\,\hat{\sigma} )

    (95% efficiency under normality).

  • Optimization: Updates the parameter vector:

    ( \beta = [\phi_1, \ldots, \phi_p, c]^\top )

    using:

    ( \beta_{\text{new}} = \beta_{\text{old}} - \lambda\, H^{-1} g )

    where:

    • Gradient components are:

      ( \frac{\partial L}{\partial \phi_j} = -\sum_t e_t\, y_{t-j} )

      for the AR coefficients, and:

      ( \frac{\partial L}{\partial c} = -\sum_t e_t )

      for the intercept.

    • Hessian elements are:

      ( H_{jk} = \sum_t y_{t-j}\, y_{t-k} )

      for the AR terms, and:

      ( H_{cc} = \sum_t 1 )

      for the intercept.

    • Line search halves the step if the loss increases.

Why Huber? Gaussian MLE struggles with outliers like 40.229511; Huber’s linear tail reduces their impact, enhancing robustness—a key improvement we prioritized. The gradient uses raw errors rather than Huber’s ( \psi )-function—a simplification that might bias ( \phi ) slightly, but Newton-Raphson’s line search ensures convergence. The Hessian approximation assumes Gaussian curvature, a practical choice over exact computation.
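
For reference, the Huber loss and its ψ (influence) function are only a few lines of C; as noted above, the gradient in this implementation uses raw errors rather than ψ, so the ψ variant is shown purely for comparison (a sketch, not the exact code):

#include <math.h>

/* Huber loss: quadratic near zero, linear in the tails. */
double huberLoss(double e, double delta) {
    double a = fabs(e);
    return (a <= delta) ? 0.5 * e * e : delta * (a - 0.5 * delta);
}

/* Huber psi (derivative of the loss): clips large errors at +/- delta. */
double huberPsi(double e, double delta) {
    if (e >  delta) return  delta;
    if (e < -delta) return -delta;
    return e;
}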

For MA, estimateMAWithMLE models:

( y_t = \mu + \epsilon_t + \sum_{j=1}^{q} \theta_j\, \epsilon_{t-j} )

  • Initialization: initial ( \theta_j ) values are derived from the sample ACF, damped for stability.

  • Errors: Recursively:

    ( \epsilon_t = y_t - \mu - \sum_{j=1}^{q} \theta_j\, \epsilon_{t-j} )

    assuming ( \epsilon_{t<q} = 0 ).

  • Loss and Optimization: Same Huber-based Newton-Raphson as AR, with:

    ( \beta = [\theta_1, \ldots, \theta_q, \mu]^\top )

This aligns with AR for consistency, leveraging Huber’s robustness. The initial error assumption is typical for conditional MLE but adds early bias—exact likelihood (e.g., Kalman filtering) was bypassed for simplicity.
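
The conditional error recursion is equally direct in code; a sketch that treats any residual whose index would fall before the start of the series as zero (names are illustrative):

/* Sketch: conditional MA(q) residuals, eps[t] = y[t] - mu - sum_j theta[j] * eps[t-1-j]. */
void computeMAResidualsSketch(const double *y, int n, double mu,
                              const double *theta, int q, double *eps) {
    for (int t = 0; t < n; t++) {
        double pred = mu;
        for (int j = 0; j < q; j++) {
            if (t - 1 - j >= 0) pred += theta[j] * eps[t - 1 - j];  /* earlier residuals only */
        }
        eps[t] = y[t] - pred;
    }
}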

Both use checkRoots to enforce stationarity (AR) and invertibility (MA), computing eigenvalues of:

( C = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} )

via QR iteration. Roots ( |\text{root}| \leq 1.001 ) are scaled by 0.95. QR replaced our initial power iteration for accuracy—a critical upgrade.
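
Constructing that companion matrix is mechanical; a sketch for the AR case (for the MA/invertibility check the ( \theta_j ) coefficients take the place of ( \phi_j )):

/* Sketch: fill a p x p companion matrix C (row-major) for coefficients phi[0..p-1].
   The first row holds the coefficients; the subdiagonal holds ones. */
void buildCompanionSketch(const double *phi, int p, double *C) {
    for (int i = 0; i < p * p; i++) C[i] = 0.0;
    for (int j = 0; j < p; j++) C[j] = phi[j];             /* first row */
    for (int i = 1; i < p; i++) C[i * p + (i - 1)] = 1.0;  /* subdiagonal of ones */
}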


Forecasting

forecastARIMA orchestrates the process:

  • Preprocessing: Applies adjustOutliers (fixed with double_cmp).

  • Stationarity: Sets ( d ) via ensureStationary, or auto-selects ( p, d, q ) with selectOrdersWithFeedback (ACF/PACF) and ADF if ( -1 ) is passed.

  • Estimation: Runs estimateARWithCMLE and estimateMAWithMLE.

  • Forecasting: Recursively:

    ( f_{t+h} = c + \sum_{j=1}^{p} \phi_j\, y_{t+h-j} + \sum_{k=1}^{q} \theta_k\, \epsilon_{t+h-k} )

    where ( y_{t+h-j} = f_{t+h-j} ) if ( h-j > 0 ), ( \epsilon_{t+h-k} = 0 ) if ( h-k > 0 ).

  • Variance:

    equation

    with ( \psi_j = \phi_j ) or ( \theta_j ), and:

    ( \sigma^2 = \frac{1}{n} \sum_{t} \epsilon_t^2 )

  • Integration: integrateSeries for ( d > 0 ).

This mirrors standard ARIMA forecasting, enhanced by Huber and QR robustness. Auto-selection uses ACF (( q )) and PACF (( p )) with a ( \frac{2}{\sqrt{n}} ) threshold—simple but effective, though lacking BIC/AIC refinement. Variance is a basic psi-weight sum, underestimating long-term uncertainty without full MA(( \infty )) expansion.
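
The forecast recursion reads almost directly off the equation above; a compact sketch on the already-differenced scale (array layout and names are assumptions, not the repository’s exact interface):

/* Sketch: h-step ARMA(p,q) forecasts.
   y[0..n-1]   : stationary series, most recent value at y[n-1]
   eps[0..n-1] : in-sample residuals; future shocks are taken as 0 */
void forecastARMASketch(const double *y, const double *eps, int n,
                        double c, const double *phi, int p,
                        const double *theta, int q,
                        double *f, int h) {
    for (int step = 1; step <= h; step++) {
        double val = c;
        for (int j = 1; j <= p; j++) {
            int idx = step - j;                                   /* offset into the horizon */
            val += phi[j - 1] * (idx > 0 ? f[idx - 1] : y[n + idx - 1]);
        }
        for (int k = 1; k <= q; k++) {
            int idx = step - k;
            if (idx <= 0) val += theta[k - 1] * eps[n + idx - 1]; /* only past shocks contribute */
        }
        f[step - 1] = val;
    }
}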


Diagnostics

We check residuals with:

  • Residual ACF (computeResidualACF):

    ( r_k = \frac{\sum_{t=k+1}^{n} e_t\, e_{t-k}}{\sum_{t=1}^{n} e_t^2} )

    reporting max ( |r_k| ).

  • Ljung-Box (computeLjungBox):

    ( Q = n(n+2) \sum_{k=1}^{m} \frac{r_k^2}{n-k} )

These are correct, offering residual whiteness insights. Raw stats without thresholds give flexibility—users can compare to ( \frac{2}{\sqrt{n}} ) or chi-squared values.
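
Computing Q from the residual autocorrelations takes only a few lines; in this sketch the caller supplies r[1..m] and compares the result to a chi-squared critical value (typically with ( m - p - q ) degrees of freedom):

/* Sketch: Ljung-Box Q statistic from residual autocorrelations r[1..m] (r[0] unused). */
double ljungBoxSketch(const double *r, int n, int m) {
    double q = 0.0;
    for (int k = 1; k <= m; k++) {
        q += (r[k] * r[k]) / (double)(n - k);   /* r_k^2 / (n - k) */
    }
    return (double)n * (n + 2) * q;
}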


Design Decisions and Trade-Offs

Our approach prioritizes robustness (Huber, QR), automation (ADF, order selection), and simplicity (no external libraries). Key enhancements:

  • Fixed adjustOutliers with double_cmp.
  • Upgraded to QR from power iteration in checkRoots.
  • Chose Huber over Gaussian for outlier resilience.

Upsides:

  • Excels with noisy data (e.g., sampleData).
  • Self-contained, efficient for small ( p, q ).
  • Adaptive and robust.

Downsides:

  • Gradient approximations may bias estimates.
  • Limited ( d > 1 ) integration support.
  • Simplified variance underestimates long-term uncertainty.

This ARIMA(2,1,4) configuration shines for noisy series, striking a practical balance between theory and application.