# Troubleshooting Guide

This guide helps you diagnose and solve common issues you might encounter when working with the SensorAugmentor framework.
- [Installation Issues](#installation-issues)
- [Data Preparation Problems](#data-preparation-problems)
- [Training Issues](#training-issues)
- [Model Performance Issues](#model-performance-issues)
- [Inference Problems](#inference-problems)
- [Deployment Challenges](#deployment-challenges)
- [Platform-Specific Issues](#platform-specific-issues)
- [Getting Help](#getting-help)
## Installation Issues

**Problem:** Unable to install PyTorch, or errors occur during installation.

**Solution:**
- Verify you're using the command from the official PyTorch website for your specific OS/CUDA configuration
- For CUDA compatibility issues:
```bash
# Check CUDA version
nvcc --version
# or
nvidia-smi
```
- Install PyTorch with a specific CUDA version:
```bash
# For CUDA 11.7
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
```
- If GPU support isn't necessary, use the CPU-only version:
```bash
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
```
**Problem:** Dependencies conflict with existing packages in your environment.

**Solution:**
- Create a dedicated virtual environment:
```bash
python -m venv sensor_env
source sensor_env/bin/activate  # On Windows: sensor_env\Scripts\activate
```
- Install with specific versions:
```bash
pip install -r requirements.txt
```
- If conflicts persist, try installing packages one by one, starting with PyTorch
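If you go that route, an order along these lines usually works (the package list here is illustrative; install whatever your `requirements.txt` names, heaviest first):

```bash
# Install the heaviest dependency first
pip install torch
# Then the remaining packages one at a time, watching for resolver errors
pip install numpy
pip install scikit-learn
```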
**Problem:** `ImportError: No module named 'sensor_actuator_network'` when trying to use the package.

**Solution:**
- Verify the module is installed:
```bash
pip list | grep sensor
```
- Check your Python path:
```python
import sys
print(sys.path)
```
- Ensure you're running Python from the correct environment (see the quick check after this list)
- Install in development mode:
```bash
pip install -e .
```
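For the environment check mentioned above, a minimal way to confirm which interpreter and environment are actually running:

```python
import sys

print(sys.executable)  # Path of the interpreter currently running
print(sys.prefix)      # Root of the active environment
```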
## Data Preparation Problems

**Problem:** Getting dimension mismatch errors when feeding data to the model.

**Solution:**
- Check the expected dimensions:
print(f"Model expects input shape: [batch_size, {model.sensor_dim}]") print(f"Your data shape: {data.shape}")
- Reshape your data correctly:
```python
# For single samples, add a batch dimension
if len(data.shape) == 1:
    data = data.unsqueeze(0)
```
- Transpose data if needed:
```python
# If your data is [features, samples] instead of [samples, features]
data = data.T
```
**Problem:** Model performs poorly because data is not normalized correctly.

**Solution:**
- Verify normalization statistics:
print(f"Mean: {data.mean()}, Std: {data.std()}")
- Apply correct normalization:
```python
normalized_data = (data - mean) / std
```
- Use the provided `DataNormalizer` class:
```python
from sensor_actuator_network import DataNormalizer

normalizer = DataNormalizer().fit(train_data)
normalized_train = normalizer.normalize(train_data)
normalized_test = normalizer.normalize(test_data)
```
- Make sure to save normalization parameters with the model for inference:
```python
model_info = {
    "model_state_dict": model.state_dict(),
    "normalization": {
        "mean": normalizer.mean,
        "std": normalizer.std
    }
}
torch.save(model_info, "model.pt")
```
**Problem:** `TypeError` or unexpected behavior due to incorrect data types.

**Solution:**
- Ensure data is the correct type for PyTorch:
```python
# Convert numpy arrays to PyTorch tensors
if isinstance(data, np.ndarray):
    data = torch.from_numpy(data).float()

# Ensure correct data type
data = data.to(torch.float32)
```
- Check for NaN or Inf values:
```python
if torch.isnan(data).any() or torch.isinf(data).any():
    print("Warning: Data contains NaN or Inf values")
    # Handle by replacing with zeros or mean values
    data = torch.where(
        torch.isnan(data) | torch.isinf(data),
        torch.zeros_like(data),
        data
    )
```
## Training Issues

**Problem:** Training loss stays flat or doesn't decrease significantly.

**Solution:**
- Check your learning rate:
```python
# Try a lower learning rate, e.g. 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```
- Verify data is normalized properly (see the normalization problem under [Data Preparation Problems](#data-preparation-problems) above)
- Inspect gradients for vanishing/exploding issues:
```python
# Add this to your training loop (after loss.backward()) to monitor gradients
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is not None:
        print(f"{name}: grad_min={param.grad.min()}, grad_max={param.grad.max()}")
```
- Try a different optimization algorithm:
```python
# Try SGD instead of Adam
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```
- Check if the model has enough capacity:
```python
# Increase model capacity
model = SensorAugmentor(sensor_dim=32, hidden_dim=128, num_resblocks=4)
```
**Problem:** Loss suddenly spikes or becomes NaN during training.

**Solution:**
- Add gradient clipping:
```python
# Add to your training loop right after loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
- Lower your learning rate:
```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```
- Use learning rate scheduling:
```python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

# In your validation loop
scheduler.step(val_loss)
```
- Check for extreme values in your data
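A quick way to spot extreme values, as a minimal sketch (the clipping bounds are illustrative, not recommended defaults):

```python
# Inspect the overall range and the tails of the data
print(f"min={data.min().item():.4f}, max={data.max().item():.4f}")
print(f"1st percentile: {torch.quantile(data, 0.01).item():.4f}")
print(f"99th percentile: {torch.quantile(data, 0.99).item():.4f}")

# Optionally clip outliers before training
data = data.clamp(min=-10.0, max=10.0)
```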
**Problem:** CUDA out-of-memory errors during training.

**Solution:**
- Reduce batch size:
```python
# Try a smaller batch size
train_loader = DataLoader(dataset, batch_size=16)  # Reduce from the default
```
- Use gradient accumulation for effective larger batch sizes:
```python
accumulation_steps = 4  # Effective batch size = batch_size * accumulation_steps

optimizer.zero_grad()
for i, (x_lq, x_hq, y_cmd) in enumerate(train_loader):
    # Forward pass and loss calculation
    loss = ...
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
- Move some operations to CPU if necessary:
```python
# Process data preparation on CPU
preprocessed_data = heavy_preprocessing(data.cpu())

# Move back to GPU for the model
preprocessed_data = preprocessed_data.to(device)
```
- Use mixed precision training (for Volta GPUs and newer):
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for x_lq, x_hq, y_cmd in train_loader:
    x_lq, x_hq, y_cmd = x_lq.to(device), x_hq.to(device), y_cmd.to(device)

    # Enables autocasting for this forward pass
    with autocast():
        reconstructed_hq, act_command, encoded_lq, encoded_hq = model(x_lq, x_hq)
        loss = calculate_loss(...)

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
## Model Performance Issues

**Problem:** Model doesn't reconstruct high-quality signals well.

**Solution:**
- Adjust loss function weights:
```python
# Increase the weight of the reconstruction loss
loss = 2.0 * loss_recon + 0.1 * loss_encoding + 0.5 * loss_act
```
- Increase model capacity:
```python
model = SensorAugmentor(
    sensor_dim=32,
    hidden_dim=128,   # Increase from default 64
    num_resblocks=4   # Increase from default 2
)
```
- Check if there's enough correlation between LQ and HQ signals:
```python
correlation = np.corrcoef(
    dataset.x_lq.numpy().reshape(-1),
    dataset.x_hq.numpy().reshape(-1)
)[0, 1]
print(f"LQ-HQ correlation: {correlation}")
# If close to 0, your sensors may be too different
```
- Add more layers to the reconstructor:
```python
class EnhancedSensorAugmentor(SensorAugmentor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Replace hq_reconstructor with a deeper network
        self.hq_reconstructor = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.sensor_dim)
        )
```
**Problem:** Model doesn't predict actuator commands accurately.

**Solution:**
- Adjust loss function weights:
```python
# Increase the weight of the actuator loss
loss = 1.0 * loss_recon + 0.1 * loss_encoding + 2.0 * loss_act
```
- Ensure actuator commands are properly normalized (see the sketch after this list)
- Check the relation between sensor data and actuator commands:
```python
# Fit a simple model to check if sensor data predicts actuator commands
from sklearn.linear_model import LinearRegression

X = dataset.x_hq.numpy()
y = dataset.y_cmd.numpy()
reg = LinearRegression().fit(X, y)
print(f"Simple model score: {reg.score(X, y)}")
# If the score is very low, there might not be enough signal in the data
```
- Enhance the actuator head:
```python
class EnhancedSensorAugmentor(SensorAugmentor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Replace actuator_head with a deeper network
        self.actuator_head = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim),
            nn.ReLU(),
            nn.Linear(self.hidden_dim, self.output_dim)
        )
```
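For the normalization point above, the same statistics trick used for the sensor data applies to the commands. A minimal sketch, where `train_y_cmd` and `model_output` are placeholder names for your training commands and raw model predictions:

```python
# Compute command statistics on the training split only
cmd_mean = train_y_cmd.mean(dim=0)
cmd_std = train_y_cmd.std(dim=0) + 1e-8  # Avoid division by zero

# Normalize targets before training
y_cmd_normalized = (y_cmd - cmd_mean) / cmd_std

# De-normalize predictions at inference time
predicted_cmd = model_output * cmd_std + cmd_mean
```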
**Problem:** Model performs well on training data but poorly on validation data.

**Solution:**
- Add regularization:
```python
# Add weight decay to the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```
- Add dropout:
```python
class SensorAugmentorWithDropout(SensorAugmentor):
    def __init__(self, dropout_rate=0.2, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x_lq, x_hq=None):
        # Same as original but with dropout
        encoded_lq = self.encoder(x_lq)
        encoded_lq = self.post_encoding_resblock(encoded_lq)
        encoded_lq = self.dropout(encoded_lq)  # Add dropout
        # Rest of forward pass
        # ...
```
- Use early stopping (already implemented in the library; a standalone sketch follows this list)
- Increase training data or add data augmentation:
```python
def augment_sensor_data(data, noise_level=0.05):
    """Add small random noise to sensor data for augmentation."""
    noise = noise_level * torch.randn_like(data)
    return data + noise

# In training loop
x_lq_augmented = augment_sensor_data(x_lq)
x_hq_augmented = augment_sensor_data(x_hq)
```
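For the early-stopping bullet above: the library ships its own implementation, but if you are writing a custom training loop, a minimal standalone version might look like this (a sketch, not the library's API; `train_one_epoch` and `evaluate` are hypothetical helpers):

```python
best_val_loss = float("inf")
patience = 10
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)   # Hypothetical helper
    val_loss = evaluate(model, val_loader)  # Hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")  # Keep the best weights
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
```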
## Inference Problems

**Problem:** Getting shape mismatch errors during inference.

**Solution:**
- Check input shape:
print(f"Expected input shape: [batch_size, {model.sensor_dim}]") print(f"Your input shape: {input_data.shape}")
- Ensure your input is correctly batched:
```python
# For a single sample, add a batch dimension
if len(input_data.shape) == 1:
    input_data = input_data.unsqueeze(0)
```
- For multi-sample prediction, use proper batching:
```python
batch_size = 32
predictions = []

for i in range(0, len(input_data), batch_size):
    batch = input_data[i:i+batch_size]
    with torch.no_grad():
        batch_predictions = model(batch)
    predictions.append(batch_predictions[0])  # Assuming you want reconstructed_hq

# Concatenate results
all_predictions = torch.cat(predictions, dim=0)
```
**Problem:** Poor results because input data isn't normalized correctly.

**Solution:**
- Always normalize inputs using the same statistics as during training:
```python
# Load model with normalization parameters
checkpoint = torch.load("model.pt")
model.load_state_dict(checkpoint["model_state_dict"])
mean = checkpoint["normalization"]["mean"]
std = checkpoint["normalization"]["std"]

# Normalize input
normalized_input = (input_data - mean) / std

# Inference
with torch.no_grad():
    output = model(normalized_input)
```
- Use the provided `DataNormalizer` class for consistency:
```python
# During training
normalizer = DataNormalizer().fit(train_data)
torch.save({
    "model_state_dict": model.state_dict(),
    "normalizer_mean": normalizer.mean,
    "normalizer_std": normalizer.std
}, "model.pt")

# During inference
checkpoint = torch.load("model.pt")
model.load_state_dict(checkpoint["model_state_dict"])
normalizer = DataNormalizer(
    mean=checkpoint["normalizer_mean"],
    std=checkpoint["normalizer_std"]
)

# Normalize and predict
normalized_input = normalizer.normalize(input_data)
output = model(normalized_input)
```
**Problem:** Model inference is too slow for your application.

**Solution:**
- Use batch processing (see the batching example under the shape mismatch problem above)
- Optimize model for inference:
```python
# Set model to evaluation mode
model.eval()

# Use torch.no_grad() to disable gradient calculation
with torch.no_grad():
    output = model(input_data)
```
- Try model quantization for faster CPU inference:
```python
# Quantize model
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Use quantized model
with torch.no_grad():
    output = quantized_model(input_data)
```
- Export to TorchScript for C++ deployment:
```python
# Export to TorchScript
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")
```
- Use GPU if available:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
input_data = input_data.to(device)
```
## Deployment Challenges

**Problem:** Model is too large for your deployment environment.

**Solution:**
- Quantize the model to reduce size:
```python
# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save quantized model
torch.save(quantized_model.state_dict(), "model_quantized.pt")

# Check size reduction
import os
original_size = os.path.getsize("model.pt") / 1024
quantized_size = os.path.getsize("model_quantized.pt") / 1024
print(f"Original size: {original_size:.2f} KB")
print(f"Quantized size: {quantized_size:.2f} KB")
print(f"Reduction: {(1 - quantized_size/original_size)*100:.1f}%")
```
- Prune unnecessary parameters:
```python
# Simple magnitude-based pruning
from torch.nn.utils import prune

# Prune 20% of the smallest weights in all linear layers
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.2)

# Make pruning permanent
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, 'weight')
```
- Use a smaller model architecture:
```python
# Reduce model size
smaller_model = SensorAugmentor(
    sensor_dim=32,
    hidden_dim=32,    # Reduced from default 64
    num_resblocks=1   # Reduced from default 2
)
```
**Problem:** Difficulties integrating the model into a REST API.

**Solution:**
- Use FastAPI for easy integration:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import torch
import numpy as np
from sensor_actuator_network import SensorAugmentor

app = FastAPI()

# Load model (do this outside the request handlers for efficiency)
checkpoint = torch.load("model.pt", map_location=torch.device('cpu'))
model = SensorAugmentor(
    sensor_dim=checkpoint["config"]["sensor_dim"],
    hidden_dim=checkpoint["config"]["hidden_dim"],
    output_dim=checkpoint["config"]["output_dim"]
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Get normalization parameters
mean = checkpoint["normalization"]["mean"]
std = checkpoint["normalization"]["std"]

class SensorData(BaseModel):
    data: List[float]

@app.post("/predict")
def predict(sensor_data: SensorData):
    try:
        # Convert to tensor
        data = torch.tensor([sensor_data.data], dtype=torch.float32)

        # Normalize
        normalized_data = (data - mean) / std

        # Inference
        with torch.no_grad():
            reconstructed_hq, actuator_command, _, _ = model(normalized_data)

        return {
            "reconstructed_hq": reconstructed_hq[0].tolist(),
            "actuator_command": actuator_command[0].tolist()
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
- For high-throughput applications, use a queue:
```python
# requirements.txt:
#   fastapi
#   uvicorn
#   redis
#   rq

# worker.py
import torch
from sensor_actuator_network import SensorAugmentor
from redis import Redis
from rq import Worker, Queue, Connection

# Load model
model = SensorAugmentor(...)
model.load_state_dict(torch.load("model.pt")["model_state_dict"])
model.eval()

def process_prediction(data):
    tensor_data = torch.tensor(data, dtype=torch.float32)
    with torch.no_grad():
        output = model(tensor_data)
    return output[0].tolist()  # Return reconstructed_hq

# Start worker
redis_conn = Redis()
with Connection(redis_conn):
    worker = Worker(Queue('sensor_predictions'))
    worker.work()

# api.py
from fastapi import FastAPI
from redis import Redis
from rq import Queue

app = FastAPI()
q = Queue('sensor_predictions', connection=Redis())

@app.post("/predict_async")
def predict_async(data: dict):
    job = q.enqueue('worker.process_prediction', data["sensor_data"])
    return {"job_id": job.id}

@app.get("/result/{job_id}")
def get_result(job_id: str):
    job = q.fetch_job(job_id)
    if job.is_finished:
        return {"result": job.result}
    elif job.is_failed:
        return {"status": "failed", "error": job.exc_info}
    else:
        return {"status": "pending"}
```
**Problem:** Issues with Docker containerization.

**Solution:**
- Use the official PyTorch Docker image as a base:
```dockerfile
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "api_server.py"]
```
- For deployment size issues, use multi-stage builds:
```dockerfile
# Build stage
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/install -r requirements.txt

# Runtime stage
FROM python:3.8-slim

WORKDIR /app
COPY --from=builder /install /usr/local/lib/python3.8/site-packages
COPY models/ /app/models/
COPY sensor_actuator_network.py /app/
COPY api_server.py /app/

CMD ["python", "api_server.py"]
```
- Ensure model files are correctly included in the image:
```dockerfile
# Make sure the models directory exists in the image
RUN mkdir -p /app/models

# Copy model files
COPY models/sensor_model.pt /app/models/
```
## Platform-Specific Issues

**Problem:** CUDA errors when running on Windows.

**Solution:**
- Ensure matching CUDA toolkit and PyTorch versions:
```bash
# Check PyTorch CUDA version
python -c "import torch; print(torch.version.cuda)"

# Check system CUDA version
nvcc --version
```
- Add CUDA DLLs to PATH:
```bat
:: Add to Windows PATH (adjust the version directory to match your CUDA install)
set PATH=%PATH%;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin
```
- Try CPU-only version for debugging:
```python
# Force CPU usage
model = model.to('cpu')
data = data.to('cpu')
```
**Problem:** Issues deploying on macOS.

**Solution:**
- On macOS, use the CPU-only build, since CUDA is not supported (see the install command after this list)
- For Apple Silicon (M1/M2), use PyTorch with MPS support:
```python
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = model.to(device)
data = data.to(device)
```
- Handle Metal Performance Shaders (MPS) specific issues:
```python
# Some operations might not be supported on MPS
# Fall back to CPU for these
try:
    output = model(data.to(device))
except RuntimeError as e:
    if "not implemented for" in str(e) and device.type == "mps":
        print("Operation not supported on MPS, falling back to CPU")
        model = model.to("cpu")
        output = model(data.to("cpu"))
        model = model.to(device)  # Move back to MPS
    else:
        raise
```
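For the CPU-only install mentioned in the first bullet: the standard macOS wheels on PyPI ship without CUDA support, so a plain install should suffice:

```bash
pip install torch torchvision torchaudio
```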
**Problem:** Issues deploying on Linux servers.

**Solution:**
- Ensure correct library versions:
```bash
# Check CUDA compatibility
ldconfig -p | grep cuda

# Install required libraries for PyTorch
sudo apt-get install -y libopenblas-dev libomp-dev
```
- Set environment variables:
```bash
# Set number of OpenMP threads
export OMP_NUM_THREADS=4

# Disable NUMA balancing for better performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
```
- Handle headless servers (no display):
```python
# Set matplotlib to use a non-GUI backend
import matplotlib
matplotlib.use('Agg')
```
## Getting Help

If you're still experiencing issues after trying the solutions in this guide:
- Check the GitHub Issues to see if someone has reported a similar problem
- Search the Discussions forum for related topics
- Create a detailed issue report including:
  - SensorAugmentor version
  - Python/PyTorch versions
  - OS and hardware information
  - Complete error message and stack trace
  - Minimal reproducible example
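A small script can collect most of this information in one go (the `__version__` attribute on the package is an assumption; substitute however your install exposes its version):

```python
import platform
import sys

import torch

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"OS: {platform.platform()}")
print(f"Machine: {platform.machine()}")

# SensorAugmentor version; the attribute name is an assumption
# import sensor_actuator_network
# print(sensor_actuator_network.__version__)
```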
For urgent issues or commercial support, contact support@sensoraugmentor.ai.
This troubleshooting guide covers common issues you might encounter. For more specific problems, please refer to the documentation or reach out to the community.