Commit 3f524d8b18 by Zhevniak Dmytro, 2025-11-20 00:51:35 +02:00
43 changed files with 185,738 additions and 0 deletions

1
.gitignore Vendored Normal file

@@ -0,0 +1 @@
/.idea/


3261
Batch2_DataExploration.ipynb Normal file

Diff not shown because one or more lines are too long.


129976
Data/winemag-data-130k-v2.csv Normal file

Diff not shown because it is too large.


121
README.md Normal file

@@ -0,0 +1,121 @@
# SPS Sintering Data Smoothing and Regression
This project provides tools for analyzing and predicting the "Relative Piston Travel" in Spark Plasma Sintering (SPS) processes. It includes scripts for data exploration, smoothing, and regression modeling.
## Problem Overview
The original dataset contains "Rel. Piston Trav" values with limited precision, which results in numerous plateaus in the data (consecutive identical values). This lack of precision can hinder the performance of regression models. The smoothing scripts address this issue by adding controlled noise to break the plateaus while preserving the overall trends.
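For example, plateaus can be counted as runs of consecutive identical target values. A minimal sketch (the CSV reading options match the project's scripts; the counts will of course depend on the file):
```python
import pandas as pd
# Count runs of consecutive identical 'Rel. Piston Trav' values in one file
df = pd.read_csv('160508-1021-1000,0min,56kN.csv', sep=';', decimal=',')
target = df['Rel. Piston Trav']
# A new run starts wherever the value changes; runs longer than 1 are plateaus
runs = target.groupby((target != target.shift()).cumsum()).size()
print(f"{(runs > 1).sum()} plateaus, longest: {runs.max()} identical values")
```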
## Files and Scripts
### Original Scripts
- `data-exploration.py`: Performs exploratory data analysis and visualization
- `sintering-regression.py`: Implements three regression approaches for predicting "Rel. Piston Trav"
### New Scripts
- `smooth_data.py`: Creates smoothed versions of the original CSV files
- `verify_smoothing.py`: Verifies the quality of the smoothing and visualizes the improvements
- `use_smoothed_data.py`: Helper script to switch between original and smoothed data in the regression pipeline
## How to Use
### 1. Creating Smoothed Data
Run the `smooth_data.py` script to create smoothed versions of the original CSV files:
```bash
python smooth_data.py
```
This will:
- Load each of the original CSV files
- Analyze the "Rel. Piston Trav" column to determine appropriate smoothing parameters
- Apply a small amount of controlled noise to break plateaus
- Generate visualizations comparing original and smoothed data
- Save new CSV files with "_smoothed" suffix
### 2. Verifying the Smoothing
After creating the smoothed files, run the verification script to ensure the smoothing was effective:
```bash
python verify_smoothing.py
```
This will:
- Compare original and smoothed files
- Analyze the reduction in consecutive repeated values
- Generate various visualizations showing the improvements
- Confirm that the overall data distribution is preserved
### 3. Running Regression with Smoothed Data
You can use the `use_smoothed_data.py` script to switch between original and smoothed data for regression:
```bash
# Switch to smoothed data and run regression with approach 2
python use_smoothed_data.py --smoothed --approach 2
# Switch back to original data
python use_smoothed_data.py --approach 2
```
Alternatively, you can manually edit the file paths in `sintering-regression.py` to use the smoothed files.
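For example, if `sintering-regression.py` keeps its input files in a `file_paths` list (as `data-exploration.py` and `smooth_data.py` do), the manual edit amounts to pointing that list at the smoothed copies:
```python
# Hypothetical edit in sintering-regression.py: use the smoothed CSVs
file_paths = [
    '160508-1021-1000,0min,56kN_smoothed.csv',
    '160508-1022-900,0min,56kN_smoothed.csv',
    '200508-1023-1350,0min,56kN_smoothed.csv',
    '200508-1024-1200,0min,56kN_smoothed.csv',
]
```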
## Regression Approaches
The regression script (`sintering-regression.py`) implements three approaches:
1. **Standard Approach**: Predict "Rel. Piston Trav" based only on current parameter values
2. **Window Approach**: Use current and previous time step data (including previous "Rel. Piston Trav") for prediction
3. **Virtual Experiment**: Similar to the window approach, but using predicted values from previous steps instead of actual measurements
## Smoothing Methods
The `smooth_data.py` script supports three smoothing methods:
1. **Noise** (default): Adds small random noise to break plateaus
2. **Spline**: Uses spline interpolation for smoothing
3. **Rolling**: Uses rolling average for smoothing
The noise method is the default because it preserves the overall structure of the data while effectively breaking plateaus; a minimal sketch of it follows.
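This sketch assumes a fixed noise scale; `smooth_data.py` itself derives the scale from the smallest non-zero step in the data (roughly one tenth of it):
```python
import numpy as np
import pandas as pd
def add_plateau_breaking_noise(series: pd.Series, noise_scale: float = 0.0001,
                               seed: int = 42) -> pd.Series:
    """Add tiny Gaussian noise so consecutive identical values become distinct."""
    rng = np.random.default_rng(seed)
    return series + rng.normal(0.0, noise_scale, size=len(series))
```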
## Data Files
### Original Data
- `160508-1021-1000,0min,56kN.csv`
- `160508-1022-900,0min,56kN.csv`
- `200508-1023-1350,0min,56kN.csv`
- `200508-1024-1200,0min,56kN.csv`
### Smoothed Data (generated)
- `160508-1021-1000,0min,56kN_smoothed.csv`
- `160508-1022-900,0min,56kN_smoothed.csv`
- `200508-1023-1350,0min,56kN_smoothed.csv`
- `200508-1024-1200,0min,56kN_smoothed.csv`
## Expected Results
Using the smoothed data with the regression approaches should yield:
1. Improved model performance (higher R² scores, lower RMSE)
2. More stable predictions in the virtual experiment approach
3. Reduced plateau effects in the predictions
4. Better feature importance insights
## Requirements
The code requires the following Python packages:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- xgboost
- lightgbm
- scipy (for spline smoothing)
You can install these dependencies using:
```bash
pip install -r requirements.txt
```

1906
TestDataExploration1.ipynb Normal file

Diff not shown because one or more lines are too long.

1430
TestDataExploration2.ipynb Normal file

Diff not shown because one or more lines are too long.

1713
TiN-Starck-0/1459.csv Normal file

Diff not shown because it is too large.

1864
TiN-Starck-0/1460.csv Normal file

Diff not shown because it is too large.

1737
TiN-Starck-0/1468.csv Normal file

Diff not shown because it is too large.

1701
TiN-Starck-0/1469.csv Normal file

Diff not shown because it is too large.

2052
TiN-Starck-0/1471.csv Normal file

Diff not shown because it is too large.

2129
TiN-Starck-0/1483.csv Normal file

Diff not shown because it is too large.

2211
TiN-Starck-0/1499.csv Normal file

Diff not shown because it is too large.

2096
TiN-Starck-0/1500.csv Normal file

Diff not shown because it is too large.


265
data-exploration.py Normal file

@@ -0,0 +1,265 @@
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Define file paths
file_paths = [
'160508-1021-1000,0min,56kN.csv',
'160508-1022-900,0min,56kN.csv',
'200508-1023-1350,0min,56kN.csv',
'200508-1024-1200,0min,56kN.csv'
]
def load_and_explore_data(file_paths):
"""
Load all CSV files and perform exploratory data analysis.
Args:
file_paths: List of CSV file paths
"""
all_data = []
print("Loading and exploring data files...")
for i, file_path in enumerate(file_paths):
print(f"\nFile {i + 1}: {file_path}")
# Read the CSV file with proper settings for European number format
try:
df = pd.read_csv(file_path, sep=';', decimal=',', header=0)
# Add a file identifier column
df['file_id'] = i
all_data.append(df)
# Display basic information
print(f" Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(" First few rows:")
print(df.head(3).to_string())
# Check for missing values
missing_values = df.isnull().sum()
if missing_values.sum() > 0:
print("\n Missing values:")
print(missing_values[missing_values > 0])
# Analyze target variable
target_col = 'Rel. Piston Trav'
if target_col in df.columns:
print(f"\n {target_col} statistics:")
print(f" Min: {df[target_col].min()}")
print(f" Max: {df[target_col].max()}")
print(f" Mean: {df[target_col].mean():.4f}")
print(f" Std Dev: {df[target_col].std():.4f}")
print(f" Unique values: {df[target_col].nunique()}")
# Check for precision issues
decimal_places = df[target_col].astype(str).str.split('.').str[1].str.len().max()
print(f" Decimal places: {decimal_places}")
# Quick correlation analysis
if target_col in df.columns:
# Get correlations with target
corr = df.corr()[target_col].sort_values(ascending=False)
print("\n Top 5 correlations with target:")
print(corr.head(6).to_string()) # +1 to include the target itself
print("\n Bottom 5 correlations with target:")
print(corr.tail(5).to_string())
except Exception as e:
print(f"Error loading {file_path}: {e}")
# Combine all data for overall analysis
if all_data:
combined_df = pd.concat(all_data, ignore_index=True)
print("\nCombined dataset:")
print(f" Total rows: {combined_df.shape[0]}, Columns: {combined_df.shape[1]}")
return combined_df
return None
def plot_target_variable(df, target_col='Rel. Piston Trav'):
"""
Create visualizations for the target variable.
Args:
df: DataFrame with all data
target_col: Name of target column
"""
if target_col not in df.columns:
print(f"Target column '{target_col}' not found in data")
return
print(f"\nGenerating plots for {target_col}...")
# Create a copy of the dataframe with only numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=[np.number])
# Set up figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Plot 1: Distribution of target variable
sns.histplot(df[target_col], kde=True, ax=axes[0, 0])
axes[0, 0].set_title(f'Distribution of {target_col}')
axes[0, 0].set_xlabel(target_col)
axes[0, 0].set_ylabel('Frequency')
# Plot 2: Target variable by file
sns.boxplot(x='file_id', y=target_col, data=df, ax=axes[0, 1])
axes[0, 1].set_title(f'{target_col} by File')
axes[0, 1].set_xlabel('File ID')
axes[0, 1].set_ylabel(target_col)
# Plot 3: Target variable over time (for first 1000 points)
sample_size = min(1000, df.shape[0])
axes[1, 0].plot(df['Nr.'].head(sample_size), df[target_col].head(sample_size))
axes[1, 0].set_title(f'{target_col} Over Time (First {sample_size} Points)')
axes[1, 0].set_xlabel('Record Number')
axes[1, 0].set_ylabel(target_col)
# Plot 4: Correlation heatmap (top correlated features)
try:
# Get absolute correlations with target from numeric columns only
corr = numeric_df.corr()[target_col].abs().sort_values(ascending=False)
top_features = corr.head(10).index # Top 10 features
# Create correlation matrix for selected features
corr_matrix = numeric_df[top_features].corr()
# Plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', ax=axes[1, 1])
axes[1, 1].set_title('Correlation Heatmap (Top Features)')
# Prepare for additional plot with top features
top_correlated = corr.head(6).index.tolist()
if target_col in top_correlated:
top_correlated.remove(target_col) # Exclude target itself
top_correlated = top_correlated[:4] # Get top 4
except Exception as e:
print(f"Error calculating correlations: {e}")
axes[1, 1].set_title('Correlation Heatmap (Error occurred)')
top_correlated = []
plt.tight_layout()
plt.show() # Add this line to display the plot
plt.close()
# Additional plot: Scatter plots of top correlated features vs target
if top_correlated:
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for i, feature in enumerate(top_correlated):
if i < 4: # Plot top 4 correlated features
sns.scatterplot(x=feature, y=target_col, data=df.sample(min(1000, df.shape[0])),
alpha=0.5, ax=axes[i])
axes[i].set_title(f'{feature} vs {target_col}')
# Hide unused subplots
for j in range(len(top_correlated), len(axes)):
axes[j].set_visible(False)
plt.tight_layout()
plt.show()
plt.close()
def analyze_feature_distributions(df):
"""
Analyze the distributions of key features.
Args:
df: DataFrame with all data
"""
print("\nAnalyzing feature distributions...")
# Identify numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Remove certain columns we don't need to visualize
cols_to_exclude = ['Nr.', 'file_id', 'Abs. Piston Trav']
feature_cols = [col for col in numeric_cols if col not in cols_to_exclude]
# Select top features based on data exploration
selected_features = [
'MTC1', 'MTC2', 'MTC3', 'Pyrometer', 'SV Temperature',
'SV Power', 'SV Force', 'AV Force', 'AV Speed',
'I RMS', 'U RMS', 'Heating power'
]
# Ensure all selected features exist in the dataframe
selected_features = [f for f in selected_features if f in df.columns]
# Create distribution plots for selected features
n_cols = 3
n_rows = (len(selected_features) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
axes = axes.flatten()
for i, feature in enumerate(selected_features):
if i < len(axes):
sns.histplot(df[feature].dropna(), kde=True, ax=axes[i])
axes[i].set_title(f'Distribution of {feature}')
axes[i].set_xlabel(feature)
# Hide unused subplots
for j in range(len(selected_features), len(axes)):
axes[j].set_visible(False)
plt.tight_layout()
plt.show()
plt.close()
def plot_time_series_by_file(df, target_col='Rel. Piston Trav'):
"""
Plot time series of target variable for each file.
Args:
df: DataFrame with all data
target_col: Name of target column
"""
print("\nPlotting time series by file...")
# Create a figure
plt.figure(figsize=(15, 8))
# Plot for each file ID
for file_id in df['file_id'].unique():
file_data = df[df['file_id'] == file_id]
plt.plot(range(len(file_data)), file_data[target_col],
label=f'File {file_id}', alpha=0.7)
plt.title(f'{target_col} Time Series by File')
plt.xlabel('Time Step')
plt.ylabel(target_col)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
plt.close()
def main():
"""Main execution function"""
print("SPS Sintering Data Exploration")
# Load and explore data
combined_df = load_and_explore_data(file_paths)
if combined_df is not None:
# Generate plots
plot_target_variable(combined_df)
analyze_feature_distributions(combined_df)
plot_time_series_by_file(combined_df)
print("\nData exploration complete.")
if __name__ == "__main__":
main()

9
requirements.txt Normal file

@@ -0,0 +1,9 @@
numpy~=2.0.2
matplotlib~=3.9.4
pandas~=2.2.3
seaborn~=0.13.2
xgboost~=2.1.4
lightgbm~=4.6.0
scikit-learn~=1.6.1
scikit-optimize~=0.10.2
tqdm

154
sintering-readme.md Normal file

@@ -0,0 +1,154 @@
# SPS Sintering Regression Analysis
This code implements regression analysis for Spark Plasma Sintering (SPS) process data. It includes three different approaches to predicting the "Rel. Piston Trav." parameter based on other process parameters.
## Overview
The script provides a comprehensive framework for analyzing SPS sintering data using machine learning regression techniques. It implements three different prediction approaches:
1. **Standard Approach**: Predict the "Rel. Piston Trav." for each row independently based only on the current values of other parameters.
2. **Window Approach**: Predict the "Rel. Piston Trav." of row `n` using both the parameter values at row `n` and the parameter values (including "Rel. Piston Trav.") from row `n-1`.
3. **Virtual Experiment**: Similar to the window approach, but uses the predicted value of "Rel. Piston Trav." from the previous step instead of the actual value, enabling continuous prediction without relying on measured "Rel. Piston Trav." values.
## Requirements
The code requires the following Python packages:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- xgboost
- lightgbm
- scikit-optimize (for Bayesian optimization of hyperparameters; imported as `skopt`)
You can install these dependencies using pip:
```bash
pip install numpy pandas matplotlib seaborn scikit-learn xgboost lightgbm scikit-optimize
```
## Data Files
The code expects the following CSV files:
- 200508-1023-1350,0min,56kN.csv
- 200508-1024-1200,0min,56kN.csv
- 160508-1021-1000,0min,56kN.csv
- 160508-1022-900,0min,56kN.csv
These files should contain SPS sintering data with semicolon-separated values and European number format (comma as decimal separator).
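The scripts in this repository read such files with pandas as follows; the same settings should work for any of the files listed above:
```python
import pandas as pd
# Semicolon-separated columns, comma as the decimal separator
df = pd.read_csv('160508-1021-1000,0min,56kN.csv', sep=';', decimal=',', header=0)
```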
## Configuration
The script includes several configuration options at the top of the file:
```python
# Configuration for regression approaches
APPROACH = 1 # 1: Standard approach, 2: Window approach, 3: Virtual experiment
VALIDATION_FILE_INDEX = 3 # Use the 4th file for validation (0-indexed)
TARGET_COLUMN = 'Rel. Piston Trav'
EXCLUDED_COLUMNS = ['Abs. Piston Trav', 'Nr.', 'Datum', 'Zeit'] # Columns to exclude
# Feature selection (manual control)
# Set to None to use all available features
SELECTED_FEATURES = [
'MTC1', 'MTC2', 'MTC3', 'Pyrometer', 'SV Temperature',
'SV Power', 'SV Force', 'AV Force', 'AV Rel. Pressure',
'I RMS', 'U RMS', 'Heating power'
]
# Model selection (set to True to include in the evaluation)
MODELS_TO_EVALUATE = {
'Linear Regression': True,
'Ridge': True,
'Lasso': True,
'ElasticNet': True,
'Decision Tree': True,
'Random Forest': True,
'Gradient Boosting': True,
'XGBoost': True,
'LightGBM': True,
'SVR': True,
'KNN': True
}
# Hyperparameter tuning settings
TUNING_METHOD = 'bayesian' # 'grid', 'random', 'bayesian'
CV_FOLDS = 5
N_ITER = 20 # Number of iterations for random/bayesian search
```
You can modify these settings to customize the analysis:
- `APPROACH`: Set to 1, 2, or 3 based on which approach you want to use
- `VALIDATION_FILE_INDEX`: Specify which file to use for validation (0-based index)
- `SELECTED_FEATURES`: Specify which features to include in the regression, or set to `None` to use all available features
- `MODELS_TO_EVALUATE`: Enable/disable specific regression models
- `TUNING_METHOD`: Choose the hyperparameter tuning method (grid search, random search, or Bayesian optimization)
## Usage
1. Place the CSV data files in the same directory as the script
2. Configure the settings as described above
3. Run the script:
```bash
python sintering-regression.py
```
## Output
The script will output:
1. Console logs showing the progress and results of the regression analysis
2. Visualization plots:
- Feature importance plots for the best model
- Actual vs. predicted value plots
- Time series prediction plots
- Residual analysis plots
## Approaches in Detail
### 1. Standard Approach
This is the simplest approach where each row is treated independently. The regression model predicts the "Rel. Piston Trav." value based only on the current values of other parameters.
```
Input: [Parameters at time t]
Output: Predicted Rel. Piston Trav. at time t
```
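A minimal sketch of this approach (the feature list follows `SELECTED_FEATURES` above; the model choice is illustrative, whereas the actual script evaluates many models with hyperparameter tuning):
```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
TARGET = 'Rel. Piston Trav'
FEATURES = ['MTC1', 'MTC2', 'MTC3', 'Pyrometer', 'SV Temperature',
            'SV Power', 'SV Force', 'AV Force', 'AV Rel. Pressure',
            'I RMS', 'U RMS', 'Heating power']
def load(path):
    return pd.read_csv(path, sep=';', decimal=',', header=0)
files = ['160508-1021-1000,0min,56kN.csv', '160508-1022-900,0min,56kN.csv',
         '200508-1023-1350,0min,56kN.csv', '200508-1024-1200,0min,56kN.csv']
train_df = pd.concat([load(p) for p in files[:3]], ignore_index=True)  # first three files
val_df = load(files[3])                                                # fourth file for validation
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train_df[FEATURES], train_df[TARGET])
print('Validation R²:', r2_score(val_df[TARGET], model.predict(val_df[FEATURES])))
```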
### 2. Window Approach
In this approach, we use information from the previous time step to help predict the current "Rel. Piston Trav." value. The model uses both the current parameter values and the previous parameter values (including the previous "Rel. Piston Trav.").
```
Input: [Parameters at time t, Parameters at time t-1, Rel. Piston Trav. at time t-1]
Output: Predicted Rel. Piston Trav. at time t
```
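A minimal sketch of how the lagged features can be built with pandas (the `_prev` naming is illustrative; the actual script may organize this differently):
```python
import pandas as pd
TARGET = 'Rel. Piston Trav'
def add_window_features(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Append the t-1 value of every feature and of the target as extra columns."""
    out = df.copy()
    for col in list(feature_cols) + [TARGET]:
        out[f'{col}_prev'] = out[col].shift(1)
    # The first row has no previous time step, so it is dropped
    return out.dropna(subset=[f'{TARGET}_prev']).reset_index(drop=True)
```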
### 3. Virtual Experiment
This approach builds on the Window Approach but enables continuous prediction without requiring real "Rel. Piston Trav." measurements after the initial value. Instead, it uses its own predictions from previous steps:
```
Initial input: [Parameters at time t=1, Parameters at time t=0, Known Rel. Piston Trav. at time t=0]
Output: Predicted Rel. Piston Trav. at time t=1
Next step input: [Parameters at time t=2, Parameters at time t=1, Predicted Rel. Piston Trav. at time t=1]
Output: Predicted Rel. Piston Trav. at time t=2
And so on...
```
This allows for a "virtual experiment" where you only need to provide the machine configuration parameters, and the model can predict how the "Rel. Piston Trav." will evolve throughout the sintering process.
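A minimal sketch of the prediction loop, reusing the `_prev` naming from the sketch above (illustrative; it assumes a model trained on exactly those window features, with columns in the same order):
```python
import numpy as np
import pandas as pd
TARGET = 'Rel. Piston Trav'
def virtual_experiment(model, df: pd.DataFrame, feature_cols: list) -> np.ndarray:
    """Roll the model forward, feeding back its own previous prediction."""
    preds = [df[TARGET].iloc[0]]  # only the initial measured value is used
    for t in range(1, len(df)):
        row = {col: df[col].iloc[t] for col in feature_cols}                      # parameters at t
        row.update({f'{col}_prev': df[col].iloc[t - 1] for col in feature_cols})  # parameters at t-1
        row[f'{TARGET}_prev'] = preds[-1]                                         # previous prediction
        x = pd.DataFrame([row])  # column order must match the training frame
        preds.append(float(model.predict(x)[0]))
    return np.asarray(preds)
```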
## Extending the Code
The code is designed to be modular and extensible:
- To add new regression models, add them to the `models` and `param_grids` dictionaries in the `build_and_evaluate_models` function (see the sketch after this list)
- To add new preprocessing steps, modify the `preprocess_data` function
- To add new evaluation metrics, extend the `evaluate_model` function
- To create additional visualizations, add new plotting functions
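For instance, registering an extra estimator might look roughly like this (the dictionary names follow the first bullet above; `build_and_evaluate_models` itself is not shown here, so treat this as a sketch):
```python
from sklearn.linear_model import HuberRegressor
models = {}       # inside build_and_evaluate_models(), the existing models are registered here
param_grids = {}  # ...and their hyperparameter grids here
models['Huber'] = HuberRegressor()
param_grids['Huber'] = {
    'epsilon': [1.1, 1.35, 1.5, 2.0],
    'alpha': [0.0001, 0.001, 0.01],
}
```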
## Improving Precision
The "Rel. Piston Trav." values in the dataset have limited precision (two decimal places), which produces plateaus of identical consecutive values. The code performs all calculations in float64, so model predictions are not artificially rounded to the input precision; in addition, the smoothing utilities described in README.md can be used to break the plateaus before regression.


1294
sintering-regression.py Normal file

Diff not shown because it is too large.

449
sintering_tuning.py Normal file

@@ -0,0 +1,449 @@
import numpy as np
import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
# Try to import Bayesian optimization libraries
try:
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
BAYESIAN_AVAILABLE = True
except ImportError:
BAYESIAN_AVAILABLE = False
print("Bayesian optimization libraries not available. Will use RandomizedSearchCV instead.")
def ensure_finite(X, default_value=0.0):
"""
Replace any NaN, inf, or extremely large values with a default value.
Args:
X: Input array or matrix
default_value: Value to use for replacement
Returns:
X_clean: Cleaned array with finite values
"""
# Make a copy to avoid modifying the original
X_clean = np.array(X, copy=True)
# Replace inf values
mask_inf = np.isinf(X_clean)
if np.any(mask_inf):
print(f"Warning: Found {np.sum(mask_inf)} infinite values. Replacing with {default_value}.")
X_clean[mask_inf] = default_value
# Replace NaN values
mask_nan = np.isnan(X_clean)
if np.any(mask_nan):
print(f"Warning: Found {np.sum(mask_nan)} NaN values. Replacing with {default_value}.")
X_clean[mask_nan] = default_value
# Check for extremely large values
large_threshold = 1e6 # Adjust as needed
mask_large = np.abs(X_clean) > large_threshold
if np.any(mask_large):
print(f"Warning: Found {np.sum(mask_large)} extremely large values. Replacing with {default_value}.")
X_clean[mask_large] = default_value
return X_clean
def tune_hyperparameters(model_class, param_grid, X_train, y_train, method='grid', cv=5, n_iter=20, model_name=None):
"""
Tune hyperparameters for a model.
Args:
model_class: Scikit-learn model class
param_grid: Dictionary of hyperparameters
X_train: Training feature matrix
y_train: Training target vector
method: Tuning method ('grid', 'random', or 'bayesian')
cv: Number of cross-validation folds
n_iter: Number of iterations for random/bayesian search
model_name: Name of the model for special handling
Returns:
best_model: Tuned model
best_params: Best hyperparameter values
"""
start_time = time.time()
print(f" Tuning hyperparameters for {model_name} using {method} search...")
print(f" X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
if method == 'grid':
search = GridSearchCV(
model_class(), param_grid, cv=cv, scoring='neg_mean_squared_error',
verbose=1, n_jobs=-1
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
elif method == 'random':
search = RandomizedSearchCV(
model_class(), param_grid, n_iter=n_iter, cv=cv,
scoring='neg_mean_squared_error', verbose=1, random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
elif method == 'bayesian':
if BAYESIAN_AVAILABLE:
# Convert param_grid to skopt space format
search_space = {}
for param, values in param_grid.items():
# If parameter values are a list
if isinstance(values, list):
# Check types of values to determine space type
if all(isinstance(v, bool) for v in values) or all(isinstance(v, str) for v in values):
search_space[param] = Categorical(values)
elif all(isinstance(v, int) for v in values):
search_space[param] = Integer(min(values), max(values))
elif all(isinstance(v, float) for v in values):
search_space[param] = Real(min(values), max(values), prior='log-uniform')
else:
# Mixed types or other - use categorical
search_space[param] = Categorical(values)
# If parameter values are already a dictionary or distribution
else:
search_space[param] = values
print(f" Created Bayesian search space: {search_space}")
# Special handling for models that need parameter mapping
model_instance = model_class()
if model_name == 'GPR' and 'kernel' in search_space:
# Create a modified search with a custom kernel mapping
def map_kernel(params):
# Map numeric values to actual kernels
if 'kernel' in params and isinstance(params['kernel'], int):
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel as C
kernel_map = {
1: C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)),
2: C(1.0, (1e-3, 1e3)) * Matern(1.0, (1e-2, 1e2), nu=1.5),
3: C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)) + WhiteKernel(0.1)
}
params['kernel'] = kernel_map.get(params['kernel'], kernel_map[1])
return params
# Use a subset of data for GPR to speed up training
subset_size = min(1000, len(X_train))
idx = np.random.choice(len(X_train), subset_size, replace=False)
X_subset = X_train[idx]
y_subset = y_train[idx]
# Manual Bayesian optimization for GPR
best_score = float('-inf')
best_params = {}
best_model = None # Initialize best_model
for _ in range(n_iter):
# Sample parameters randomly from the space
params = {}
for param, space in search_space.items():
if hasattr(space, 'rvs'): # It's a distribution
params[param] = space.rvs(1)[0]
elif isinstance(space, list): # It's a list of values
params[param] = np.random.choice(space)
# Map parameters for kernels
params = map_kernel(params)
# Create and fit model with these parameters
try:
model = model_class(**params)
model.fit(X_subset, y_subset)
# Score model
score = -mean_squared_error(y_subset, model.predict(X_subset)) # Neg MSE
if score > best_score:
best_score = score
best_params = params
best_model = model
except Exception as e:
print(f" Skipping parameters due to error: {e}")
continue
print(f" Best params: {best_params}")
if best_model is None:
# Fallback if no model was successfully trained
print(" No successful model training, using default parameters")
best_model = model_class()
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
kernel_rbf = C(1.0) * RBF(1.0)
best_model.set_params(kernel=kernel_rbf, alpha=1e-6)
best_model.fit(X_subset, y_subset)
return best_model, best_params
elif model_name == 'MLP' and 'hidden_layer_sizes' in search_space:
# Create a modified search with a custom hidden_layer_sizes mapping
def map_hidden_layers(params):
# Map numeric values to actual tuples for hidden_layer_sizes
if 'hidden_layer_sizes' in params and isinstance(params['hidden_layer_sizes'], (int, float)):
# Map integers to hidden layer configurations
layer_map = {
1: (50,),
2: (100,),
3: (50, 50),
4: (100, 50)
}
params['hidden_layer_sizes'] = layer_map.get(int(params['hidden_layer_sizes']), (50,))
return params
# Manual optimization for MLP
best_score = float('-inf')
best_params = {}
best_model = None # Initialize best_model
for _ in range(n_iter):
# Sample parameters randomly from the space
params = {}
for param, space in search_space.items():
if hasattr(space, 'rvs'): # It's a distribution
params[param] = space.rvs(1)[0]
elif isinstance(space, list): # It's a list of values
params[param] = np.random.choice(space)
# Map parameters for hidden layer sizes
params = map_hidden_layers(params)
# Create and fit model with these parameters
try:
model = model_class(**params)
model.fit(X_train, y_train)
# Score model
score = -mean_squared_error(y_train, model.predict(X_train)) # Neg MSE
if score > best_score:
best_score = score
best_params = params
best_model = model
except Exception as e:
print(f" Skipping parameters due to error: {e}")
continue
print(f" Best params: {best_params}")
if best_model is None:
# Fallback if no model was successfully trained
print(" No successful model training, using default parameters")
best_model = model_class(random_state=42, max_iter=1000)
best_model.fit(X_train, y_train)
return best_model, best_params
else:
# For other models, use standard BayesSearchCV
search = BayesSearchCV(
model_instance, search_space, n_iter=n_iter, cv=cv,
scoring='neg_mean_squared_error', verbose=1, random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
else:
print(" Bayesian optimization not available, falling back to RandomizedSearchCV")
search = RandomizedSearchCV(
model_class(), param_grid, n_iter=n_iter, cv=cv,
scoring='neg_mean_squared_error', verbose=1, random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
best_params = search.best_params_
else:
raise ValueError(f"Unknown tuning method: {method}")
elapsed_time = time.time() - start_time
print(f" Tuning completed in {elapsed_time:.2f} seconds")
print(f" Best params: {best_params}")
return best_model, best_params
def get_param_grids():
"""
Get parameter grids for different models.
Returns:
param_grids: Dictionary of parameter grids for grid/random search
param_ranges: Dictionary of parameter ranges for Bayesian optimization
"""
# Parameter grids for grid/random search
param_grids = {}
# Linear models
param_grids['Linear Regression'] = {'fit_intercept': [True, False]}
param_grids['Ridge'] = {
'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
'fit_intercept': [True, False],
'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}
param_grids['Lasso'] = {
'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0],
'fit_intercept': [True, False],
'max_iter': [1000, 3000, 5000]
}
param_grids['ElasticNet'] = {
'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0],
'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
'fit_intercept': [True, False],
'max_iter': [1000, 3000, 5000]
}
# Tree-based models
param_grids['Decision Tree'] = {
'max_depth': [None, 5, 10, 15, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
param_grids['Random Forest'] = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
param_grids['Gradient Boosting'] = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'max_depth': [3, 5, 7, 9],
'min_samples_split': [2, 5, 10]
}
param_grids['XGBoost'] = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'max_depth': [3, 5, 7, 9],
'subsample': [0.8, 0.9, 1.0],
'colsample_bytree': [0.8, 0.9, 1.0]
}
param_grids['LightGBM'] = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'max_depth': [3, 5, 7, 9],
'num_leaves': [31, 63, 127],
'subsample': [0.8, 0.9, 1.0]
}
# Other models
param_grids['SVR'] = {
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.1, 0.01, 0.001]
}
param_grids['KNN'] = {
'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
'weights': ['uniform', 'distance'],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
# Neural Network model
param_grids['MLP'] = {
'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
'activation': ['relu', 'tanh'],
'solver': ['adam', 'sgd'],
'alpha': [0.0001, 0.001, 0.01],
'learning_rate': ['constant', 'adaptive']
}
# Gaussian Process Regression
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel as C
kernel_rbf = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
kernel_matern = C(1.0, (1e-3, 1e3)) * Matern(1.0, (1e-2, 1e2), nu=1.5)
kernel_rbf_white = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)) + WhiteKernel(0.1)
param_grids['GPR'] = {
'kernel': [kernel_rbf, kernel_matern, kernel_rbf_white],
'alpha': [1e-10, 1e-8, 1e-6],
'normalize_y': [True, False],
'n_restarts_optimizer': [0, 1, 3]
}
# Parameter ranges for Bayesian optimization
param_ranges = {}
# Linear models
param_ranges['Linear Regression'] = {'fit_intercept': [True, False]}
param_ranges['Ridge'] = {
'alpha': (0.001, 100.0, 'log-uniform'),
'fit_intercept': [True, False],
'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}
param_ranges['Lasso'] = {
'alpha': (0.0001, 10.0, 'log-uniform'),
'fit_intercept': [True, False],
'max_iter': (1000, 10000)
}
param_ranges['ElasticNet'] = {
'alpha': (0.0001, 1.0, 'log-uniform'),
'l1_ratio': (0.1, 0.9),
'fit_intercept': [True, False],
'max_iter': (1000, 10000)
}
# Tree-based models
param_ranges['Decision Tree'] = {
'max_depth': (3, 30), # None will be handled specially
'min_samples_split': (2, 20),
'min_samples_leaf': (1, 10)
}
param_ranges['Random Forest'] = {
'n_estimators': (10, 300),
'max_depth': (3, 50), # None will be handled specially
'min_samples_split': (2, 20),
'min_samples_leaf': (1, 10)
}
param_ranges['Gradient Boosting'] = {
'n_estimators': (10, 300),
'learning_rate': (0.001, 0.3, 'log-uniform'),
'max_depth': (2, 15),
'min_samples_split': (2, 20)
}
param_ranges['XGBoost'] = {
'n_estimators': (10, 300),
'learning_rate': (0.001, 0.3, 'log-uniform'),
'max_depth': (2, 15),
'subsample': (0.5, 1.0),
'colsample_bytree': (0.5, 1.0)
}
# Other models
param_ranges['SVR'] = {
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'C': (0.01, 1000.0, 'log-uniform'),
'gamma': ['scale', 'auto'] + [(0.0001, 1.0, 'log-uniform')]
}
param_ranges['KNN'] = {
'n_neighbors': (1, 30),
'weights': ['uniform', 'distance'],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
# Neural Network model
param_ranges['MLP'] = {
'hidden_layer_sizes': [1, 2, 3, 4], # Will map to actual tuples later
'activation': ['relu', 'tanh'],
'solver': ['adam', 'sgd'],
'alpha': (0.00001, 0.1, 'log-uniform'),
'learning_rate': ['constant', 'adaptive']
}
# Gaussian Process Regression
param_ranges['GPR'] = {
'kernel': [1, 2, 3], # Will map to actual kernels later
'alpha': (1e-12, 1e-4, 'log-uniform'),
'normalize_y': [True, False],
'n_restarts_optimizer': (0, 5)
}
return param_grids, param_ranges
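if __name__ == "__main__":
    # Minimal usage sketch (illustrative only): tune a Ridge regressor on
    # synthetic data with a random search, using the grids defined above.
    from sklearn.linear_model import Ridge
    rng = np.random.default_rng(42)
    X_demo = ensure_finite(rng.normal(size=(200, 5)))
    y_demo = X_demo @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
    demo_grids, _ = get_param_grids()
    best_model, best_params = tune_hyperparameters(
        Ridge, demo_grids['Ridge'], X_demo, y_demo,
        method='random', cv=3, n_iter=5, model_name='Ridge'
    )
    print("Tuned Ridge parameters:", best_params)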

317
smooth_data.py Normal file

@@ -0,0 +1,317 @@
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Define file paths
file_paths = [
'160508-1021-1000,0min,56kN.csv',
'160508-1022-900,0min,56kN.csv',
'200508-1023-1350,0min,56kN.csv',
'200508-1024-1200,0min,56kN.csv'
]
# Target column to smooth
TARGET_COLUMN = 'Rel. Piston Trav'
def load_file(file_path):
"""
Load a CSV file with European number format.
Args:
file_path: Path to the CSV file
Returns:
DataFrame with the loaded data
"""
try:
df = pd.read_csv(file_path, sep=';', decimal=',', header=0)
print(f"Loaded {file_path}, shape: {df.shape}")
return df
except Exception as e:
print(f"Error loading {file_path}: {e}")
return None
def analyze_target_column(df, target_col):
"""
Analyze the target column to understand precision issues.
Args:
df: DataFrame containing the data
target_col: Name of the target column
Returns:
Dictionary with analysis results
"""
if target_col not in df.columns:
print(f"Target column '{target_col}' not found in data")
return None
# Extract target column
target_values = df[target_col].values
# Calculate differences between consecutive values
differences = np.diff(target_values)
non_zero_diffs = differences[differences != 0]
# Ensure we have absolute differences for calculations that need positive values
abs_non_zero_diffs = np.abs(non_zero_diffs)
# Count occurrences of repeated values
consecutive_repeats = []
current_count = 1
for i in range(1, len(target_values)):
if abs(target_values[i] - target_values[i-1]) < 1e-10:
current_count += 1
else:
if current_count > 1:
consecutive_repeats.append(current_count)
current_count = 1
# Add the last group if it's a repeat
if current_count > 1:
consecutive_repeats.append(current_count)
# Calculate statistics
results = {
'unique_values': df[target_col].nunique(),
'total_values': len(target_values),
'min_nonzero_diff': np.min(non_zero_diffs) if len(non_zero_diffs) > 0 else 0,
'min_abs_nonzero_diff': np.min(abs_non_zero_diffs) if len(abs_non_zero_diffs) > 0 else 0.0001,
'avg_nonzero_diff': np.mean(non_zero_diffs) if len(non_zero_diffs) > 0 else 0,
'avg_abs_nonzero_diff': np.mean(abs_non_zero_diffs) if len(abs_non_zero_diffs) > 0 else 0.0001,
'median_nonzero_diff': np.median(non_zero_diffs) if len(non_zero_diffs) > 0 else 0,
'zero_diff_count': len(differences) - len(non_zero_diffs),
'zero_diff_percentage': 100 * (len(differences) - len(non_zero_diffs)) / len(differences),
'max_consecutive_repeats': max(consecutive_repeats) if consecutive_repeats else 0,
'avg_consecutive_repeats': np.mean(consecutive_repeats) if consecutive_repeats else 0
}
print(f"\nAnalysis of '{target_col}':")
print(f" Unique values: {results['unique_values']} out of {results['total_values']} total values")
print(f" Minimum non-zero difference: {results['min_nonzero_diff']:.8f}")
print(f" Zero differences: {results['zero_diff_count']} ({results['zero_diff_percentage']:.2f}% of all consecutive pairs)")
print(f" Maximum consecutive repeated values: {results['max_consecutive_repeats']}")
return results
def smooth_target_column(df, target_col, method='noise', params=None):
"""
Smooth the target column to address precision issues.
Args:
df: DataFrame containing the data
target_col: Name of the target column
method: Smoothing method to use ('noise', 'spline', or 'rolling')
params: Parameters for the smoothing method
Returns:
DataFrame with the smoothed target column
"""
# Make a copy to avoid modifying the original
smoothed_df = df.copy()
if target_col not in smoothed_df.columns:
print(f"Target column '{target_col}' not found in data")
return smoothed_df
# Extract target column
target_values = smoothed_df[target_col].values
if method == 'noise':
# Default parameters
if params is None:
params = {'noise_scale': 0.0001}
# Add small noise to break plateaus
noise_scale = params.get('noise_scale', 0.0001)
np.random.seed(42) # For reproducibility
smoothed_values = target_values + np.random.normal(0, noise_scale, len(target_values))
elif method == 'spline':
from scipy.interpolate import UnivariateSpline
# Default parameters
if params is None:
params = {'s': 0.01}
# Use spline interpolation
x = np.arange(len(target_values))
s = params.get('s', 0.01) # Smoothing factor
spline = UnivariateSpline(x, target_values, s=s)
smoothed_values = spline(x)
elif method == 'rolling':
# Default parameters
if params is None:
params = {'window': 3, 'center': True}
# Use rolling average
window = params.get('window', 3)
center = params.get('center', True)
smoothed_series = pd.Series(target_values).rolling(
window=window, center=center, min_periods=1).mean()
smoothed_values = smoothed_series.values
else:
print(f"Unknown smoothing method: {method}")
return smoothed_df
# Update the target column in the DataFrame
smoothed_df[target_col] = smoothed_values
return smoothed_df
def plot_comparison(original_df, smoothed_df, target_col, file_name=None, samples=1000):
"""
Plot comparison between original and smoothed data.
Args:
original_df: DataFrame with original data
smoothed_df: DataFrame with smoothed data
target_col: Name of the target column
file_name: Name of the file (for title)
samples: Number of samples to plot
"""
if target_col not in original_df.columns or target_col not in smoothed_df.columns:
print(f"Target column '{target_col}' not found in data")
return
# Create a figure with multiple subplots
fig, axes = plt.subplots(3, 1, figsize=(15, 12))
# Get data for plotting
original_values = original_df[target_col].values[:samples]
smoothed_values = smoothed_df[target_col].values[:samples]
x = np.arange(len(original_values))
# Plot 1: Overview
axes[0].plot(x, original_values, label='Original', alpha=0.7)
axes[0].plot(x, smoothed_values, label='Smoothed', alpha=0.7)
axes[0].set_title(f"Overview of {target_col}" + (f" ({file_name})" if file_name else ""))
axes[0].set_xlabel('Index')
axes[0].set_ylabel(target_col)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot 2: Zoomed section (first 200 points)
zoom_end = min(200, len(original_values))
axes[1].plot(x[:zoom_end], original_values[:zoom_end], label='Original', alpha=0.7)
axes[1].plot(x[:zoom_end], smoothed_values[:zoom_end], label='Smoothed', alpha=0.7)
axes[1].set_title(f"Zoomed View (First {zoom_end} Points)")
axes[1].set_xlabel('Index')
axes[1].set_ylabel(target_col)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Plot 3: Difference between original and smoothed
diff = smoothed_values - original_values
axes[2].plot(x, diff, label='Smoothed - Original', color='green', alpha=0.7)
axes[2].axhline(y=0, color='r', linestyle='--', alpha=0.5)
axes[2].set_title('Difference (Smoothed - Original)')
axes[2].set_xlabel('Index')
axes[2].set_ylabel('Difference')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def save_smoothed_file(df, original_path, suffix="_smoothed"):
"""
Save the DataFrame to a new CSV file with European number format.
Args:
df: DataFrame to save
original_path: Path to the original CSV file
suffix: Suffix to add to the new filename
Returns:
Path to the saved file
"""
# Create new filename
base, ext = os.path.splitext(original_path)
new_path = f"{base}{suffix}{ext}"
# Save with European number format
df.to_csv(new_path, sep=';', decimal=',', index=False)
print(f"Saved smoothed data to {new_path}")
return new_path
def process_file(file_path, smoothing_method, params=None):
"""
Process a single file: load, analyze, smooth, plot comparison, and save.
Args:
file_path: Path to the CSV file
smoothing_method: Method to use for smoothing
params: Parameters for the smoothing method
Returns:
Path to the saved smoothed file
"""
# Load the file
df = load_file(file_path)
if df is None:
return None
# Analyze the target column
analysis = analyze_target_column(df, TARGET_COLUMN)
if analysis is None:
return None
# Adjust smoothing parameters based on analysis if not provided
if params is None:
if smoothing_method == 'noise':
# Use 1/10 of the minimum non-zero difference
noise_scale = max(0.00001, abs(analysis['min_abs_nonzero_diff']) / 10)
params = {'noise_scale': noise_scale}
print(f"Using noise scale: {noise_scale:.8f}")
elif smoothing_method == 'spline':
# Adjust smoothing factor based on data range
data_range = df[TARGET_COLUMN].max() - df[TARGET_COLUMN].min()
s = 0.0001 * data_range * len(df)
params = {'s': s}
print(f"Using spline smoothing factor: {s:.8f}")
elif smoothing_method == 'rolling':
# Use window size based on average run length of repeated values
window = max(3, int(analysis['avg_consecutive_repeats'] / 2))
params = {'window': window, 'center': True}
print(f"Using rolling window size: {window}")
# Smooth the target column
smoothed_df = smooth_target_column(df, TARGET_COLUMN, smoothing_method, params)
# Plot comparison
plot_comparison(df, smoothed_df, TARGET_COLUMN, os.path.basename(file_path))
# Save the smoothed data
smoothed_path = save_smoothed_file(smoothed_df, file_path)
return smoothed_path
def main():
"""Main execution function"""
print("SPS Data Smoothing Utility")
print("==========================")
# Smoothing parameters
smoothing_method = 'noise' # 'noise', 'spline', or 'rolling'
# Process each file
smoothed_files = []
for file_path in file_paths:
print(f"\nProcessing {file_path}...")
smoothed_path = process_file(file_path, smoothing_method)
if smoothed_path:
smoothed_files.append(smoothed_path)
print("\nProcessing complete!")
print(f"Created {len(smoothed_files)} smoothed files:")
for file_path in smoothed_files:
print(f" {file_path}")
if __name__ == "__main__":
main()