Troubleshooting Guide

This guide helps you diagnose and solve common issues you might encounter when working with the SensorAugmentor framework.

Installation Issues

PyTorch Installation Failures

Problem: Unable to install PyTorch or getting errors during installation.

Solution:

Verify you're using the command from the official PyTorch website for your specific OS/CUDA configuration

For CUDA compatibility issues:

# Check CUDA version
nvcc --version
# or
nvidia-smi

Install PyTorch with a specific CUDA version:

# For CUDA 11.7
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

If GPU support isn't necessary, use CPU-only version:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
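
After installing, it can help to confirm which build actually ended up in the environment. The following is a minimal sketch, not part of SensorAugmentor; the helper name `describe_torch_install` is made up for illustration. It reports the PyTorch version, the CUDA version the wheel was built with, and whether a GPU is usable, and it does not fail when PyTorch is missing.

```python
import importlib.util


def describe_torch_install():
    """Summarize the local PyTorch install without failing if it is absent."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch  # imported lazily so the check also runs without PyTorch

    return {
        "installed": True,
        "version": torch.__version__,
        # None for CPU-only builds, otherwise the CUDA version torch was built with
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_install().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` shows a version but `cuda_available` is False, the usual suspects are the driver or a mismatch between the wheel and the installed CUDA runtime.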

Package Version Conflicts

Problem: Dependencies conflict with existing packages in your environment.

Solution:

Create a dedicated virtual environment:

python -m venv sensor_env
source sensor_env/bin/activate  # On Windows: sensor_env\Scripts\activate

Install the pinned versions listed in requirements.txt:
```
pip install -r requirements.txt
```
If conflicts persist, install the packages one at a time, starting with PyTorch, to find the one that causes the conflict.

Import Errors

Problem: ImportError: No module named 'sensor_actuator_network' when trying to use the package.

Solution:

Verify the module is installed:
```
pip list | grep sensor
```
Check your Python path:
```
import sys
print(sys.path)
```
Ensure you're running Python from the environment where the package is installed.
If you are working from a source checkout, install it in development mode from the repository root:
```
pip install -e .
```
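
The checks above can be combined into one diagnostic. This is a sketch using only the standard library; `diagnose_import` is a hypothetical helper name, not something the package provides. It shows which interpreter is running, whether it is inside a virtual environment, and where (if anywhere) the module would be loaded from.

```python
import importlib.util
import sys


def diagnose_import(module_name):
    """Report which interpreter is running and whether it can find a module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        # True when running inside a venv created by `python -m venv`
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in diagnose_import("sensor_actuator_network").items():
        print(f"{key}: {value}")
```

If `interpreter` points outside the environment you installed into, the import error is an environment problem rather than a packaging problem.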

Data Preparation Problems

Incorrect Dimension Errors

Problem: Getting dimension mismatch errors when feeding data to the model.

Solution:

Check the expected dimensions:

print(f"Model expects input shape: [batch_size, {model.sensor_dim}]")
print(f"Your data shape: {data.shape}")

Reshape your data correctly:

# For single samples, add batch dimension
if len(data.shape) == 1:
    data = data.unsqueeze(0)  # Add batch dimension

Transpose data if needed:

# If your data is [features, samples] instead of [samples, features]
data = data.T
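
The three checks above can be folded into one function that suggests a fix from the shape alone. This is an illustrative sketch; `diagnose_shape` is a made-up name, and it assumes the model takes 2-D input of shape [batch_size, sensor_dim] as described above.

```python
def diagnose_shape(shape, sensor_dim):
    """Suggest a fix for a tensor shape, given the expected feature size."""
    shape = tuple(shape)
    if len(shape) == 2 and shape[1] == sensor_dim:
        return "ok"
    if len(shape) == 1 and shape[0] == sensor_dim:
        return "add batch dimension: data.unsqueeze(0)"
    if len(shape) == 2 and shape[0] == sensor_dim:
        return "transpose: data.T"
    return f"mismatch: expected [batch_size, {sensor_dim}], got {list(shape)}"


print(diagnose_shape((32,), 32))
print(diagnose_shape((32, 100), 32))
```

With a real tensor you would call it as `diagnose_shape(data.shape, model.sensor_dim)`. A square input such as [32, 32] is reported as "ok", so the transpose case still needs a manual check there.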

Normalization Issues

Problem: Model performs poorly due to data not being normalized correctly.

Solution:

Verify normalization statistics:

print(f"Mean: {data.mean()}, Std: {data.std()}")

Apply correct normalization:
```
normalized_data = (data - mean) / std
```

Use the provided DataNormalizer class:

from sensor_actuator_network import DataNormalizer

normalizer = DataNormalizer().fit(train_data)
normalized_train = normalizer.normalize(train_data)
normalized_test = normalizer.normalize(test_data)

Make sure to save normalization parameters with the model for inference:

model_info = {
    "model_state_dict": model.state_dict(),
    "normalization": {
        "mean": normalizer.mean,
        "std": normalizer.std
    }
}
torch.save(model_info, "model.pt")
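
For reference, here is a sketch of per-feature normalization written directly in PyTorch. It is not the `DataNormalizer` implementation (the source does not show how that class works); the function names and the `eps` guard are assumptions for illustration. The statistics come from the training set only, and the standard deviation is clamped so that constant features do not produce NaN or Inf.

```python
import torch


def fit_stats(train_data, eps=1e-8):
    """Per-feature mean and std over the batch dimension of the training set."""
    mean = train_data.mean(dim=0)
    # Clamp so constant features do not cause division by zero
    std = train_data.std(dim=0).clamp_min(eps)
    return mean, std


def normalize(data, mean, std):
    """Apply previously fitted statistics to any split (train, test, inference)."""
    return (data - mean) / std


# Second feature is constant on purpose to exercise the eps guard
train = torch.tensor([[1.0, 5.0], [3.0, 5.0]])
mean, std = fit_stats(train)
print(normalize(train, mean, std))
```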

Data Type Issues

Problem: TypeError or unexpected behavior due to incorrect data types.

Solution:

Ensure data is the correct type for PyTorch:

# Convert numpy arrays to PyTorch tensors
if isinstance(data, np.ndarray):
    data = torch.from_numpy(data).float()

# Ensure correct data type
data = data.to(torch.float32)

Check for NaN or Inf values:

if torch.isnan(data).any() or torch.isinf(data).any():
    print("Warning: Data contains NaN or Inf values")
    # Replace them with zeros here; substituting the feature mean is another option
    data = torch.where(torch.isnan(data) | torch.isinf(data), 
                      torch.zeros_like(data), data)

Training Issues

Loss Not Decreasing

Problem: Training loss stays flat or doesn't decrease significantly.

Solution:

Check your learning rate:

# Adam's default learning rate is 1e-3; try a lower value such as 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

Verify data is normalized properly (see Normalization Issues above)

Inspect gradients for vanishing/exploding issues:

# Add this to your training loop after loss.backward() to monitor gradients
for name, param in model.named_parameters():
    if param.grad is not None:  # skip parameters that have no gradient yet
        print(f"{name}: grad_min={param.grad.min()}, grad_max={param.grad.max()}")

Try a different optimization algorithm:

# Try SGD instead of Adam
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

Check if the model has enough capacity:

# Increase model capacity
model = SensorAugmentor(sensor_dim=32, hidden_dim=128, num_resblocks=4)
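
Per-parameter minima and maxima get noisy on larger models. A single total gradient norm per step is often easier to track: values collapsing toward zero suggest vanishing gradients, and values growing by orders of magnitude suggest exploding ones. The sketch below uses a small `nn.Linear` as a stand-in model; `total_grad_norm` is a hypothetical helper, not part of the library.

```python
import torch
from torch import nn


def total_grad_norm(model):
    """L2 norm over all parameter gradients; skips parameters without a gradient."""
    squared = [p.grad.detach().pow(2).sum() for p in model.parameters() if p.grad is not None]
    if not squared:
        return 0.0
    return torch.stack(squared).sum().sqrt().item()


# Stand-in model for illustration; substitute your own
model = nn.Linear(4, 2)
loss = model(torch.ones(3, 4)).sum()
loss.backward()
print(f"total grad norm: {total_grad_norm(model):.4f}")
```

Call it right after `loss.backward()` and log the value alongside the loss.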

Training Divergence

Problem: Loss suddenly spikes or becomes NaN during training.

Solution:

Add gradient clipping:

# Add to your training loop right after loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Lower your learning rate:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

Use learning rate scheduling:

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)
# In your validation loop
scheduler.step(val_loss)

Check for extreme values in your data
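
To catch divergence at the step where it starts, you can watch the loss values themselves. The class below is a sketch under stated assumptions: the name `DivergenceGuard`, the window size, and the spike factor are all illustrative choices, and it assumes losses are positive.

```python
import math
from collections import deque
from statistics import median


class DivergenceGuard:
    """Flag losses that are non-finite or far above the recent median."""

    def __init__(self, window=20, spike_factor=10.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def is_divergent(self, loss_value):
        if not math.isfinite(loss_value):
            return True
        spiked = bool(self.history) and loss_value > self.spike_factor * median(self.history)
        if not spiked:
            self.history.append(loss_value)  # only healthy losses update the baseline
        return spiked


guard = DivergenceGuard()
for value in [1.0, 0.9, 0.8, float("nan"), 50.0, 0.7]:
    print(value, guard.is_divergent(value))
```

In a training loop you would pass `loss.item()` and, when the guard fires, skip the optimizer step and inspect the offending batch.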

Out of Memory (OOM) Errors

Problem: CUDA out of memory errors during training.

Solution:

Reduce batch size:

# Try a smaller batch size
train_loader = DataLoader(dataset, batch_size=16)  # Lower than the value you currently use

Use gradient accumulation for effective larger batch sizes:

accumulation_steps = 4  # Effective batch size = batch_size * accumulation_steps
optimizer.zero_grad()
for i, (x_lq, x_hq, y_cmd) in enumerate(train_loader):
    # Forward pass and loss calculation
    loss = ...
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Move some operations to CPU if necessary:

# Process data preparation on CPU
preprocessed_data = heavy_preprocessing(data.cpu())
# Move back to GPU for model
preprocessed_data = preprocessed_data.to(device)

Use mixed precision training (for Volta GPUs and newer):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for x_lq, x_hq, y_cmd in train_loader:
    x_lq, x_hq, y_cmd = x_lq.to(device), x_hq.to(device), y_cmd.to(device)
    
    # Enables autocasting for this forward pass
    with autocast():
        reconstructed_hq, act_command, encoded_lq, encoded_hq = model(x_lq, x_hq)
        loss = calculate_loss(...)
    
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Model Performance Issues

Poor Reconstruction Quality

Problem: Model doesn't reconstruct high-quality signals well.

Solution:

Adjust loss function weights:

# Increase weight for reconstruction loss
loss = 2.0 * loss_recon + 0.1 * loss_encoding + 0.5 * loss_act

Increase model capacity:

model = SensorAugmentor(
    sensor_dim=32, 
    hidden_dim=128,  # Increase from default 64
    num_resblocks=4  # Increase from default 2
)

Check if there's enough correlation between LQ and HQ signals:

correlation = np.corrcoef(
    dataset.x_lq.numpy().reshape(-1), 
    dataset.x_hq.numpy().reshape(-1)
)[0, 1]
print(f"LQ-HQ correlation: {correlation}")
# If close to 0, your sensors may be too different

Add more layers to the reconstructor:

class EnhancedSensorAugmentor(SensorAugmentor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
        # Replace hq_reconstructor with a deeper network
        self.hq_reconstructor = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.sensor_dim)
        )

Poor Actuator Command Prediction

Problem: Model doesn't predict actuator commands accurately.

Solution:

Adjust loss function weights:

# Increase weight for actuator loss
loss = 1.0 * loss_recon + 0.1 * loss_encoding + 2.0 * loss_act

Ensure actuator commands are properly normalized

Check the relation between sensor data and actuator commands:

# Create a simple model to check if sensor data predicts actuator commands
from sklearn.linear_model import LinearRegression

X = dataset.x_hq.numpy()
y = dataset.y_cmd.numpy()

reg = LinearRegression().fit(X, y)
print(f"Simple model score: {reg.score(X, y)}")
# If score is very low, there might not be enough signal in the data

Enhance the actuator head:

class EnhancedSensorAugmentor(SensorAugmentor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
        # Replace actuator_head with a deeper network
        self.actuator_head = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.output_dim)
        )

Overfitting

Problem: Model performs well on training data but poorly on validation data.

Solution:

Add regularization:

# Add weight decay to optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

Add dropout:

class SensorAugmentorWithDropout(SensorAugmentor):
    # dropout_rate is keyword-only so positional arguments still reach SensorAugmentor
    def __init__(self, *args, dropout_rate=0.2, **kwargs):
        super().__init__(*args, **kwargs)
        self.dropout = nn.Dropout(dropout_rate)
        
    def forward(self, x_lq, x_hq=None):
        # Same as original but with dropout
        encoded_lq = self.encoder(x_lq)
        encoded_lq = self.post_encoding_resblock(encoded_lq)
        encoded_lq = self.dropout(encoded_lq)  # Add dropout
        
        # Rest of forward pass
        # ...

Use early stopping (already implemented in the library)

Increase training data or add data augmentation:

def augment_sensor_data(data, noise_level=0.05):
    """Add small random noise to sensor data for augmentation."""
    noise = noise_level * torch.randn_like(data)
    return data + noise

# In training loop
x_lq_augmented = augment_sensor_data(x_lq)
x_hq_augmented = augment_sensor_data(x_hq)
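
If you write your own training loop, early stopping is straightforward to add by hand. The class below is a standalone sketch and not the library's implementation, whose interface is not shown in this guide; the names `EarlyStopping`, `patience`, and `min_delta` are illustrative.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, val_loss in enumerate([1.0, 0.8, 0.85, 0.9, 0.7]):
    if stopper.step(val_loss):
        print(f"stopping after epoch {epoch}")
        break
```

Save a checkpoint whenever `best` improves, so you can restore the weights from the best epoch rather than the last one.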

Inference Problems

Shape Mismatch During Inference

Problem: Getting shape mismatch errors during inference.

Solution:

Check input shape:

print(f"Expected input shape: [batch_size, {model.sensor_dim}]")
print(f"Your input shape: {input_data.shape}")

Ensure your input is correctly batched:

# For single sample, add batch dimension
if len(input_data.shape) == 1:
    input_data = input_data.unsqueeze(0)

For multi-sample prediction, use proper batching:

batch_size = 32
predictions = []

for i in range(0, len(input_data), batch_size):
    batch = input_data[i:i+batch_size]
    with torch.no_grad():
        batch_predictions = model(batch)
    predictions.append(batch_predictions[0])  # Assuming you want reconstructed_hq

# Concatenate results
all_predictions = torch.cat(predictions, dim=0)

Missing Normalization During Inference

Problem: Poor results because input data isn't normalized correctly.

Solution:

Always normalize inputs using the same statistics as during training:

# Load model with normalization parameters
checkpoint = torch.load("model.pt")
model.load_state_dict(checkpoint["model_state_dict"])
mean = checkpoint["normalization"]["mean"]
std = checkpoint["normalization"]["std"]

# Normalize input
normalized_input = (input_data - mean) / std

# Inference
with torch.no_grad():
    output = model(normalized_input)

Use the provided DataNormalizer class for consistency:

# During training
normalizer = DataNormalizer().fit(train_data)
torch.save({
    "model_state_dict": model.state_dict(),
    "normalizer_mean": normalizer.mean,
    "normalizer_std": normalizer.std
}, "model.pt")

# During inference
checkpoint = torch.load("model.pt")
model.load_state_dict(checkpoint["model_state_dict"])
normalizer = DataNormalizer(
    mean=checkpoint["normalizer_mean"],
    std=checkpoint["normalizer_std"]
)

# Normalize and predict
normalized_input = normalizer.normalize(input_data)
with torch.no_grad():
    output = model(normalized_input)

Slow Inference

Problem: Model inference is too slow for your application.

Solution:

Use batch processing (see the batching example under Shape Mismatch During Inference)

Optimize model for inference:

# Set model to evaluation mode
model.eval()

# Use torch.no_grad() to disable gradient calculation
with torch.no_grad():
    output = model(input_data)

Try model quantization for faster CPU inference:

# Quantize model
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Use quantized model
with torch.no_grad():
    output = quantized_model(input_data)

Export to TorchScript for C++ deployment:

# Export to TorchScript
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

Use GPU if available:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
input_data = input_data.to(device)
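
Before and after applying any of these optimizations, measure the latency so you know whether the change helped. The following is a generic timing sketch using only the standard library; `benchmark` and its defaults are illustrative choices.

```python
import time
from statistics import mean


def benchmark(fn, warmup=3, runs=20):
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return 1000.0 * mean(timings)


# Stand-in workload; replace the lambda with a call to your model
print(f"{benchmark(lambda: sum(range(10_000))):.3f} ms per call")
```

For a model, wrap the call as `lambda: model(input_data)` inside `torch.no_grad()`. On a GPU, call `torch.cuda.synchronize()` inside the timed function, because CUDA kernels run asynchronously and would otherwise appear faster than they are.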

Deployment Challenges

Model Size Issues

Problem: Model is too large for your deployment environment.

Solution:

Quantize the model to reduce size:

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save quantized model
torch.save(quantized_model.state_dict(), "model_quantized.pt")

# Check size reduction
import os
original_size = os.path.getsize("model.pt") / 1024
quantized_size = os.path.getsize("model_quantized.pt") / 1024
print(f"Original size: {original_size:.2f} KB")
print(f"Quantized size: {quantized_size:.2f} KB")
print(f"Reduction: {(1 - quantized_size/original_size)*100:.1f}%")

Prune unnecessary parameters:

# Simple magnitude-based pruning
from torch.nn.utils import prune

# Prune 20% of smallest weights in all linear layers
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.2)

# Make pruning permanent
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, 'weight')

Use a smaller model architecture:

# Reduce model size
smaller_model = SensorAugmentor(
    sensor_dim=32,
    hidden_dim=32,  # Reduced from default 64
    num_resblocks=1  # Reduced from default 2
)
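
To get a rough feel for how `hidden_dim` affects size, you can count the parameters of fully connected layers by hand. The layer stack below is hypothetical and is not the real SensorAugmentor layout; it only illustrates that the count grows roughly with the square of the hidden width.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def size_kb(num_params, bytes_per_param=4):
    """Approximate storage in KB (4 bytes for float32, 1 for int8)."""
    return num_params * bytes_per_param / 1024


# Hypothetical stack: input -> hidden -> hidden -> output, with sensor_dim=32
for hidden_dim in (32, 64, 128):
    n = (
        linear_params(32, hidden_dim)
        + linear_params(hidden_dim, hidden_dim)
        + linear_params(hidden_dim, 32)
    )
    print(f"hidden_dim={hidden_dim}: {n} params, ~{size_kb(n):.1f} KB as float32")
```

For the actual model, `sum(p.numel() for p in model.parameters())` gives the exact parameter count.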

API Integration Issues

Problem: Difficulties integrating the model into a REST API.

Solution:

Use FastAPI for easy integration:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import torch
import numpy as np
from sensor_actuator_network import SensorAugmentor

app = FastAPI()

# Load model (do this outside the request handlers for efficiency).
# This assumes the checkpoint also stores a "config" dict with the constructor arguments.
checkpoint = torch.load("model.pt", map_location=torch.device('cpu'))
model = SensorAugmentor(
    sensor_dim=checkpoint["config"]["sensor_dim"],
    hidden_dim=checkpoint["config"]["hidden_dim"],
    output_dim=checkpoint["config"]["output_dim"]
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Get normalization parameters
mean = checkpoint["normalization"]["mean"]
std = checkpoint["normalization"]["std"]

class SensorData(BaseModel):
    data: List[float]

@app.post("/predict")
def predict(sensor_data: SensorData):
    try:
        # Convert to tensor
        data = torch.tensor([sensor_data.data], dtype=torch.float32)
        
        # Normalize
        normalized_data = (data - mean) / std
        
        # Inference
        with torch.no_grad():
            reconstructed_hq, actuator_command, _, _ = model(normalized_data)
        
        return {
            "reconstructed_hq": reconstructed_hq[0].tolist(),
            "actuator_command": actuator_command[0].tolist()
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
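
The handler above reports every failure as a server error, including malformed requests. Validating the payload first lets you separate client mistakes from real faults. The function below is a sketch; `validate_sensor_vector` is a hypothetical helper, and `sensor_dim` would come from your model configuration.

```python
import math


def validate_sensor_vector(values, sensor_dim):
    """Return values as floats, or raise ValueError describing the problem."""
    if len(values) != sensor_dim:
        raise ValueError(f"expected {sensor_dim} values, got {len(values)}")
    floats = [float(v) for v in values]
    if not all(math.isfinite(v) for v in floats):
        raise ValueError("sensor data contains NaN or Inf")
    return floats


try:
    validate_sensor_vector([0.1, 0.2], sensor_dim=3)
except ValueError as err:
    print(f"rejected: {err}")
```

In the endpoint, call this before building the tensor and turn a `ValueError` into `HTTPException(status_code=422, detail=str(err))`, keeping the 500 response for unexpected errors.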

For high-throughput applications, use a queue:

# requirements.txt
# fastapi
# uvicorn
# redis
# rq

# worker.py
import torch
from sensor_actuator_network import SensorAugmentor
from redis import Redis
from rq import Worker, Queue

# Load model
model = SensorAugmentor(...)
model.load_state_dict(torch.load("model.pt")["model_state_dict"])
model.eval()

def process_prediction(data):
    tensor_data = torch.tensor(data, dtype=torch.float32)
    with torch.no_grad():
        output = model(tensor_data)
    return output[0].tolist()  # Return reconstructed_hq

# Start the worker only when run as a script, not when RQ imports this module
if __name__ == "__main__":
    redis_conn = Redis()
    queue = Queue('sensor_predictions', connection=redis_conn)
    worker = Worker([queue], connection=redis_conn)
    worker.work()

# api.py
from fastapi import FastAPI
from redis import Redis
from rq import Queue

app = FastAPI()
q = Queue('sensor_predictions', connection=Redis())

@app.post("/predict_async")
def predict_async(data: dict):
    job = q.enqueue('worker.process_prediction', data["sensor_data"])
    return {"job_id": job.id}

@app.get("/result/{job_id}")
def get_result(job_id: str):
    job = q.fetch_job(job_id)
    if job is None:  # unknown or expired job ID
        return {"status": "not_found"}
    if job.is_finished:
        return {"result": job.result}
    elif job.is_failed:
        return {"status": "failed", "error": job.exc_info}
    else:
        return {"status": "pending"}

Containerization Issues

Problem: Issues with Docker containerization.

Solution:

Use the official PyTorch Docker image as a base:

FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "api_server.py"]

For deployment size issues, use multi-stage builds:

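# Build stage: use the same Python version as the runtime stage so the
# packages copied from /install match python3.8 (requirements.txt must list torch)
FROM python:3.8-slim AS builder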

WORKDIR /build

COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt

# Runtime stage
FROM python:3.8-slim

WORKDIR /app

COPY --from=builder /install /usr/local/lib/python3.8/site-packages
COPY models/ /app/models/
COPY sensor_actuator_network.py /app/
COPY api_server.py /app/

CMD ["python", "api_server.py"]

Ensure model files are correctly included in the image:

# Make sure models directory exists in the image
RUN mkdir -p /app/models

# Copy model files
COPY models/sensor_model.pt /app/models/

Platform-Specific Issues

CUDA Issues on Windows

Problem: CUDA errors when running on Windows.

Solution:

Ensure matching CUDA toolkit and PyTorch versions:

# Check PyTorch CUDA version
python -c "import torch; print(torch.version.cuda)"

# Check system CUDA version
nvcc --version

Add CUDA DLLs to PATH:

# Add to Windows PATH (adjust v11.1 to the CUDA version you have installed)
set PATH=%PATH%;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin

Try CPU-only version for debugging:

# Force CPU usage
model = model.to('cpu')
data = data.to('cpu')

macOS Deployment

Problem: Issues deploying on macOS.

Solution:

On macOS, use the CPU-only build, because CUDA is not supported

For Apple Silicon (M1/M2), use PyTorch with MPS support:

if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = model.to(device)
data = data.to(device)

Handle Metal Performance Shaders (MPS) specific issues:

# Some operations might not be supported on MPS
# Fall back to CPU for these
try:
    output = model(data.to(device))
except RuntimeError as e:
    if "not implemented for" in str(e) and device.type == "mps":
        print("Operation not supported on MPS, falling back to CPU")
        model = model.to("cpu")
        output = model(data.to("cpu"))
        model = model.to(device)  # Move back to MPS
    else:
        raise

Linux Server Deployment

Problem: Issues deploying on Linux servers.

Solution:

Ensure correct library versions:

# Check CUDA compatibility
ldconfig -p | grep cuda

# Install required libraries for PyTorch
sudo apt-get install -y libopenblas-dev libomp-dev

Set environment variables:

# Set number of OpenMP threads
export OMP_NUM_THREADS=4

# Disable NUMA balancing for better performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing

Handle headless servers (no display):

# Set matplotlib to use a non-GUI backend
import matplotlib
matplotlib.use('Agg')

Getting Help

If you're still experiencing issues after trying the solutions in this guide:

Check the GitHub Issues to see if someone has reported a similar problem
Search the Discussions forum for related topics
Create a detailed issue report including:
- SensorAugmentor version
- Python/PyTorch versions
- OS and hardware information
- Complete error message and stack trace
- Minimal reproducible example
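
Most of the version details in the list above can be collected with a short script and pasted into the report. This sketch uses only the standard library; `environment_report` is a hypothetical helper, and the default package names are just examples.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("torch", "numpy")):
    """Collect version details worth pasting into an issue report."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```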

For urgent issues or commercial support, contact support@sensoraugmentor.ai.

This troubleshooting guide covers common issues you might encounter. For more specific problems, please refer to the documentation or reach out to the community.
