Initial commit
This commit is contained in:
commit 3f524d8b18
@@ -0,0 +1 @@
/.idea/
@@ -0,0 +1,121 @@
# SPS Sintering Data Smoothing and Regression

This project provides tools for analyzing and predicting the "Relative Piston Travel" in Spark Plasma Sintering (SPS) processes. It includes scripts for data exploration, smoothing, and regression modeling.

## Problem Overview

The original dataset contains "Rel. Piston Trav" values with limited precision, which results in numerous plateaus in the data (consecutive identical values). This lack of precision can hinder the performance of regression models. The smoothing scripts address this issue by adding controlled noise to break the plateaus while preserving the overall trends.
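
A quick way to see the plateau problem is to count how often consecutive samples are identical. The sketch below assumes the raw CSV files listed under "Data Files" are present and uses the same loading settings as the scripts in this repository:

```python
import numpy as np
import pandas as pd

# European number format: ';' as field separator, ',' as decimal separator
df = pd.read_csv('160508-1021-1000,0min,56kN.csv', sep=';', decimal=',', header=0)

values = df['Rel. Piston Trav'].values
diffs = np.diff(values)

# A plateau shows up as a zero difference between consecutive samples
zero_fraction = np.mean(diffs == 0)
print(f"{zero_fraction:.1%} of consecutive pairs are identical (plateaus)")
```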

## Files and Scripts

### Original Scripts
- `data-exploration.py`: Performs exploratory data analysis and visualization
- `sintering-regression.py`: Implements three regression approaches for predicting "Rel. Piston Trav"

### New Scripts
- `smooth_data.py`: Creates smoothed versions of the original CSV files
- `verify_smoothing.py`: Verifies the quality of the smoothing and visualizes the improvements
- `use_smoothed_data.py`: Helper script to switch between original and smoothed data in the regression pipeline

## How to Use

### 1. Creating Smoothed Data

Run the `smooth_data.py` script to create smoothed versions of the original CSV files:

```bash
python smooth_data.py
```

This will:
- Load each of the original CSV files
- Analyze the "Rel. Piston Trav" column to determine appropriate smoothing parameters
- Apply a small amount of controlled noise to break plateaus
- Generate visualizations comparing original and smoothed data
- Save new CSV files with a "_smoothed" suffix

### 2. Verifying the Smoothing

After creating the smoothed files, run the verification script to ensure the smoothing was effective:

```bash
python verify_smoothing.py
```

This will:
- Compare original and smoothed files
- Analyze the reduction in consecutive repeated values
- Generate various visualizations showing the improvements
- Confirm that the overall data distribution is preserved

### 3. Running Regression with Smoothed Data

You can use the `use_smoothed_data.py` script to switch between original and smoothed data for regression:

```bash
# Switch to smoothed data and run regression with approach 2
python use_smoothed_data.py --smoothed --approach 2

# Switch back to original data
python use_smoothed_data.py --approach 2
```

Alternatively, you can manually edit the file paths in `sintering-regression.py` to use the smoothed files, for example:
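
A hypothetical edit of that kind, assuming the regression script keeps its inputs in a `file_paths` list like the other scripts in this repository:

```python
# Point the regression pipeline at the generated *_smoothed.csv files
file_paths = [
    '160508-1021-1000,0min,56kN_smoothed.csv',
    '160508-1022-900,0min,56kN_smoothed.csv',
    '200508-1023-1350,0min,56kN_smoothed.csv',
    '200508-1024-1200,0min,56kN_smoothed.csv'
]
```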

## Regression Approaches

The regression script (`sintering-regression.py`) implements three approaches:

1. **Standard Approach**: Predict "Rel. Piston Trav" based only on current parameter values
2. **Window Approach**: Use current and previous time step data (including previous "Rel. Piston Trav") for prediction
3. **Virtual Experiment**: Similar to the window approach, but using predicted values from previous steps instead of actual measurements

## Smoothing Methods

The `smooth_data.py` script supports three smoothing methods:

1. **Noise** (default): Adds small random noise to break plateaus
2. **Spline**: Uses spline interpolation for smoothing
3. **Rolling**: Uses rolling average for smoothing

The noise method is the default as it preserves the overall structure of the data while effectively breaking plateaus.
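
At its core, the noise method adds a tiny zero-mean Gaussian term to every sample, as in this minimal sketch (`smooth_data.py` additionally derives the noise scale from the data):

```python
import numpy as np

def add_plateau_breaking_noise(values, noise_scale=0.0001, seed=42):
    """Return a copy of `values` with small Gaussian noise added to break plateaus."""
    rng = np.random.default_rng(seed)
    return values + rng.normal(0.0, noise_scale, size=len(values))
```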

## Data Files

### Original Data
- `160508-1021-1000,0min,56kN.csv`
- `160508-1022-900,0min,56kN.csv`
- `200508-1023-1350,0min,56kN.csv`
- `200508-1024-1200,0min,56kN.csv`

### Smoothed Data (generated)
- `160508-1021-1000,0min,56kN_smoothed.csv`
- `160508-1022-900,0min,56kN_smoothed.csv`
- `200508-1023-1350,0min,56kN_smoothed.csv`
- `200508-1024-1200,0min,56kN_smoothed.csv`

## Expected Results

Using the smoothed data with the regression approaches should yield:

1. Improved model performance (higher R² scores, lower RMSE)
2. More stable predictions in the virtual experiment approach
3. Reduced plateau effects in the predictions
4. Better feature importance insights

## Requirements

The code requires the following Python packages:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- xgboost
- lightgbm
- scipy (for spline smoothing)

You can install these dependencies using:
```bash
pip install -r requirements.txt
```
@@ -0,0 +1,265 @@
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns

# Define file paths
file_paths = [
    '160508-1021-1000,0min,56kN.csv',
    '160508-1022-900,0min,56kN.csv',
    '200508-1023-1350,0min,56kN.csv',
    '200508-1024-1200,0min,56kN.csv'
]


def load_and_explore_data(file_paths):
    """
    Load all CSV files and perform exploratory data analysis.

    Args:
        file_paths: List of CSV file paths
    """
    all_data = []

    print("Loading and exploring data files...")

    for i, file_path in enumerate(file_paths):
        print(f"\nFile {i + 1}: {file_path}")

        # Read the CSV file with proper settings for European number format
        try:
            df = pd.read_csv(file_path, sep=';', decimal=',', header=0)
            # Add a file identifier column
            df['file_id'] = i
            all_data.append(df)

            # Display basic information
            print(f" Rows: {df.shape[0]}, Columns: {df.shape[1]}")
            print(" First few rows:")
            print(df.head(3).to_string())

            # Check for missing values
            missing_values = df.isnull().sum()
            if missing_values.sum() > 0:
                print("\n Missing values:")
                print(missing_values[missing_values > 0])

            # Analyze target variable
            target_col = 'Rel. Piston Trav'
            if target_col in df.columns:
                print(f"\n {target_col} statistics:")
                print(f" Min: {df[target_col].min()}")
                print(f" Max: {df[target_col].max()}")
                print(f" Mean: {df[target_col].mean():.4f}")
                print(f" Std Dev: {df[target_col].std():.4f}")
                print(f" Unique values: {df[target_col].nunique()}")

                # Check for precision issues
                decimal_places = df[target_col].astype(str).str.split('.').str[1].str.len().max()
                print(f" Decimal places: {decimal_places}")

            # Quick correlation analysis
            if target_col in df.columns:
                # Get correlations with target (numeric columns only)
                corr = df.corr(numeric_only=True)[target_col].sort_values(ascending=False)
                print("\n Top 5 correlations with target:")
                print(corr.head(6).to_string())  # +1 to include the target itself
                print("\n Bottom 5 correlations with target:")
                print(corr.tail(5).to_string())

        except Exception as e:
            print(f"Error loading {file_path}: {e}")

    # Combine all data for overall analysis
    if all_data:
        combined_df = pd.concat(all_data, ignore_index=True)
        print("\nCombined dataset:")
        print(f" Total rows: {combined_df.shape[0]}, Columns: {combined_df.shape[1]}")

        return combined_df

    return None


def plot_target_variable(df, target_col='Rel. Piston Trav'):
    """
    Create visualizations for the target variable.

    Args:
        df: DataFrame with all data
        target_col: Name of target column
    """
    if target_col not in df.columns:
        print(f"Target column '{target_col}' not found in data")
        return

    print(f"\nGenerating plots for {target_col}...")

    # Create a copy of the dataframe with only numeric columns for correlation analysis
    numeric_df = df.select_dtypes(include=[np.number])

    # Set up figure with multiple subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Plot 1: Distribution of target variable
    sns.histplot(df[target_col], kde=True, ax=axes[0, 0])
    axes[0, 0].set_title(f'Distribution of {target_col}')
    axes[0, 0].set_xlabel(target_col)
    axes[0, 0].set_ylabel('Frequency')

    # Plot 2: Target variable by file
    sns.boxplot(x='file_id', y=target_col, data=df, ax=axes[0, 1])
    axes[0, 1].set_title(f'{target_col} by File')
    axes[0, 1].set_xlabel('File ID')
    axes[0, 1].set_ylabel(target_col)

    # Plot 3: Target variable over time (for first 1000 points)
    sample_size = min(1000, df.shape[0])
    axes[1, 0].plot(df['Nr.'].head(sample_size), df[target_col].head(sample_size))
    axes[1, 0].set_title(f'{target_col} Over Time (First {sample_size} Points)')
    axes[1, 0].set_xlabel('Record Number')
    axes[1, 0].set_ylabel(target_col)

    # Plot 4: Correlation heatmap (top correlated features)
    try:
        # Get absolute correlations with target from numeric columns only
        corr = numeric_df.corr()[target_col].abs().sort_values(ascending=False)
        top_features = corr.head(10).index  # Top 10 features

        # Create correlation matrix for selected features
        corr_matrix = numeric_df[top_features].corr()

        # Plot heatmap
        sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', ax=axes[1, 1])
        axes[1, 1].set_title('Correlation Heatmap (Top Features)')

        # Prepare for additional plot with top features
        top_correlated = corr.head(6).index.tolist()
        if target_col in top_correlated:
            top_correlated.remove(target_col)  # Exclude target itself
        top_correlated = top_correlated[:4]  # Get top 4

    except Exception as e:
        print(f"Error calculating correlations: {e}")
        axes[1, 1].set_title('Correlation Heatmap (Error occurred)')
        top_correlated = []

    plt.tight_layout()
    plt.show()  # Display the plot
    plt.close()

    # Additional plot: Scatter plots of top correlated features vs target
    if top_correlated:
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()

        for i, feature in enumerate(top_correlated):
            if i < 4:  # Plot top 4 correlated features
                sns.scatterplot(x=feature, y=target_col, data=df.sample(min(1000, df.shape[0])),
                                alpha=0.5, ax=axes[i])
                axes[i].set_title(f'{feature} vs {target_col}')

        # Hide unused subplots
        for j in range(len(top_correlated), len(axes)):
            axes[j].set_visible(False)

        plt.tight_layout()
        plt.show()
        plt.close()


def analyze_feature_distributions(df):
    """
    Analyze the distributions of key features.

    Args:
        df: DataFrame with all data
    """
    print("\nAnalyzing feature distributions...")

    # Identify numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

    # Remove certain columns we don't need to visualize
    cols_to_exclude = ['Nr.', 'file_id', 'Abs. Piston Trav']
    feature_cols = [col for col in numeric_cols if col not in cols_to_exclude]

    # Select top features based on data exploration
    selected_features = [
        'MTC1', 'MTC2', 'MTC3', 'Pyrometer', 'SV Temperature',
        'SV Power', 'SV Force', 'AV Force', 'AV Speed',
        'I RMS', 'U RMS', 'Heating power'
    ]

    # Ensure all selected features exist in the dataframe
    selected_features = [f for f in selected_features if f in df.columns]

    # Create distribution plots for selected features
    n_cols = 3
    n_rows = (len(selected_features) + n_cols - 1) // n_cols

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
    axes = axes.flatten()

    for i, feature in enumerate(selected_features):
        if i < len(axes):
            sns.histplot(df[feature].dropna(), kde=True, ax=axes[i])
            axes[i].set_title(f'Distribution of {feature}')
            axes[i].set_xlabel(feature)

    # Hide unused subplots
    for j in range(len(selected_features), len(axes)):
        axes[j].set_visible(False)

    plt.tight_layout()
    plt.show()
    plt.close()


def plot_time_series_by_file(df, target_col='Rel. Piston Trav'):
    """
    Plot time series of target variable for each file.

    Args:
        df: DataFrame with all data
        target_col: Name of target column
    """
    print("\nPlotting time series by file...")

    # Create a figure
    plt.figure(figsize=(15, 8))

    # Plot for each file ID
    for file_id in df['file_id'].unique():
        file_data = df[df['file_id'] == file_id]
        plt.plot(range(len(file_data)), file_data[target_col],
                 label=f'File {file_id}', alpha=0.7)

    plt.title(f'{target_col} Time Series by File')
    plt.xlabel('Time Step')
    plt.ylabel(target_col)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    plt.close()


def main():
    """Main execution function"""
    print("SPS Sintering Data Exploration")

    # Load and explore data
    combined_df = load_and_explore_data(file_paths)

    if combined_df is not None:
        # Generate plots
        plot_target_variable(combined_df)
        analyze_feature_distributions(combined_df)
        plot_time_series_by_file(combined_df)

    print("\nData exploration complete.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,9 @@
numpy~=2.0.2
matplotlib~=3.9.4
pandas~=2.2.3
seaborn~=0.13.2
xgboost~=2.1.4
lightgbm~=4.6.0
scikit-learn~=1.6.1
scikit-optimize~=0.10.2
tqdm
@@ -0,0 +1,154 @@
# SPS Sintering Regression Analysis

This code implements regression analysis for Spark Plasma Sintering (SPS) process data. It includes three different approaches to predicting the "Rel. Piston Trav." parameter based on other process parameters.

## Overview

The script provides a comprehensive framework for analyzing SPS sintering data using machine learning regression techniques. It implements three different prediction approaches:

1. **Standard Approach**: Predict the "Rel. Piston Trav." for each row independently based only on the current values of other parameters.
2. **Window Approach**: Predict the "Rel. Piston Trav." of row `n` using both the parameter values at row `n` and the parameter values (including "Rel. Piston Trav.") from row `n-1`.
3. **Virtual Experiment**: Similar to the window approach, but uses the predicted value of "Rel. Piston Trav." from the previous step instead of the actual value, enabling continuous prediction without relying on measured "Rel. Piston Trav." values.

## Requirements

The code requires the following Python packages:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- xgboost
- lightgbm
- scikit-optimize (for Bayesian optimization of hyperparameters)

You can install these dependencies using pip:
```bash
pip install numpy pandas matplotlib seaborn scikit-learn xgboost lightgbm scikit-optimize
```

## Data Files

The code expects the following CSV files:
- 200508102313500min56kN.csv
- 200508102412000min56kN.csv
- 160508102110000min56kN.csv
- 16050810229000min56kN.csv

These files should contain SPS sintering data with semicolon-separated values and European number format (comma as decimal separator).

## Configuration

The script includes several configuration options at the top of the file:

```python
# Configuration for regression approaches
APPROACH = 1  # 1: Standard approach, 2: Window approach, 3: Virtual experiment
VALIDATION_FILE_INDEX = 3  # Use the 4th file for validation (0-indexed)
TARGET_COLUMN = 'Rel. Piston Trav'
EXCLUDED_COLUMNS = ['Abs. Piston Trav', 'Nr.', 'Datum', 'Zeit']  # Columns to exclude

# Feature selection (manual control)
# Set to None to use all available features
SELECTED_FEATURES = [
    'MTC1', 'MTC2', 'MTC3', 'Pyrometer', 'SV Temperature',
    'SV Power', 'SV Force', 'AV Force', 'AV Rel. Pressure',
    'I RMS', 'U RMS', 'Heating power'
]

# Model selection (set to True to include in the evaluation)
MODELS_TO_EVALUATE = {
    'Linear Regression': True,
    'Ridge': True,
    'Lasso': True,
    'ElasticNet': True,
    'Decision Tree': True,
    'Random Forest': True,
    'Gradient Boosting': True,
    'XGBoost': True,
    'LightGBM': True,
    'SVR': True,
    'KNN': True
}

# Hyperparameter tuning settings
TUNING_METHOD = 'bayesian'  # 'grid', 'random', 'bayesian'
CV_FOLDS = 5
N_ITER = 20  # Number of iterations for random/bayesian search
```

You can modify these settings to customize the analysis:

- `APPROACH`: Set to 1, 2, or 3 based on which approach you want to use
- `VALIDATION_FILE_INDEX`: Specify which file to use for validation (0-based index)
- `SELECTED_FEATURES`: Specify which features to include in the regression, or set to `None` to use all available features
- `MODELS_TO_EVALUATE`: Enable/disable specific regression models
- `TUNING_METHOD`: Choose the hyperparameter tuning method (grid search, random search, or Bayesian optimization)

## Usage

1. Place the CSV data files in the same directory as the script
2. Configure the settings as described above
3. Run the script:
```bash
python sintering_regression.py
```

## Output

The script will output:
1. Console logs showing the progress and results of the regression analysis
2. Visualization plots:
   - Feature importance plots for the best model
   - Actual vs. predicted value plots
   - Time series prediction plots
   - Residual analysis plots

## Approaches in Detail

### 1. Standard Approach

This is the simplest approach where each row is treated independently. The regression model predicts the "Rel. Piston Trav." value based only on the current values of other parameters.

```
Input: [Parameters at time t]
Output: Predicted Rel. Piston Trav. at time t
```
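
A minimal sketch of this setup with scikit-learn, assuming a DataFrame `df` that already contains the columns named in `SELECTED_FEATURES` and `TARGET_COLUMN` above (the model here is just an example, not necessarily the one that performs best):

```python
from sklearn.ensemble import RandomForestRegressor

X = df[SELECTED_FEATURES]  # parameters at time t
y = df[TARGET_COLUMN]      # Rel. Piston Trav. at time t

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)
predictions = model.predict(X)
```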

### 2. Window Approach

In this approach, we use information from the previous time step to help predict the current "Rel. Piston Trav." value. The model uses both the current parameter values and the previous parameter values (including the previous "Rel. Piston Trav.").

```
Input: [Parameters at time t, Parameters at time t-1, Rel. Piston Trav. at time t-1]
Output: Predicted Rel. Piston Trav. at time t
```
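
One way to build such lagged features with pandas (a sketch; not necessarily how the regression script constructs them internally):

```python
import pandas as pd

def make_window_features(df, feature_cols, target_col):
    """Combine the values at time t with the values (incl. the target) at time t-1."""
    lagged = df[feature_cols + [target_col]].shift(1).add_suffix('_prev')
    window_df = pd.concat([df[feature_cols], lagged], axis=1).dropna()
    X = window_df
    y = df.loc[window_df.index, target_col]
    return X, y
```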

### 3. Virtual Experiment

This approach builds on the Window Approach but enables continuous prediction without requiring real "Rel. Piston Trav." measurements after the initial value. Instead, it uses its own predictions from previous steps:

```
Initial input: [Parameters at time t=1, Parameters at time t=0, Known Rel. Piston Trav. at time t=0]
Output: Predicted Rel. Piston Trav. at time t=1

Next step input: [Parameters at time t=2, Parameters at time t=1, Predicted Rel. Piston Trav. at time t=1]
Output: Predicted Rel. Piston Trav. at time t=2

And so on...
```

This allows for a "virtual experiment" where you only need to provide the machine configuration parameters, and the model can predict how the "Rel. Piston Trav." will evolve throughout the sintering process.
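
A sketch of that feedback loop, assuming a trained `model` whose inputs are ordered as current parameters, previous parameters, previous target (matching the window layout above):

```python
import numpy as np

def virtual_experiment(model, params, initial_target):
    """Roll the model forward, feeding its own previous prediction back in.

    params: array of shape (n_steps, n_features) with the machine parameters.
    initial_target: known Rel. Piston Trav. at time t=0.
    """
    predictions = [initial_target]
    for t in range(1, len(params)):
        features = np.concatenate([params[t], params[t - 1], [predictions[-1]]])
        predictions.append(float(model.predict(features.reshape(1, -1))[0]))
    return np.array(predictions)
```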

## Extending the Code

The code is designed to be modular and extensible:

- To add new regression models, add them to the `models` and `param_grids` dictionaries in the `build_and_evaluate_models` function (see the sketch below)
- To add new preprocessing steps, modify the `preprocess_data` function
- To add new evaluation metrics, extend the `evaluate_model` function
- To create additional visualizations, add new plotting functions
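
Registering an extra model could look roughly like this (a sketch; the exact layout of the dictionaries inside `build_and_evaluate_models` may differ):

```python
from sklearn.linear_model import HuberRegressor

# Inside build_and_evaluate_models: add an entry to both dictionaries
models['Huber'] = HuberRegressor()
param_grids['Huber'] = {
    'epsilon': [1.1, 1.35, 1.5, 2.0],
    'alpha': [0.0001, 0.001, 0.01]
}
```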

## Improving Precision

The "Rel. Piston Trav." values in the dataset have limited precision (2 decimal places). The code handles this by using float64 precision for all calculations, which ensures that small differences can be represented accurately in the model predictions, even if the original data had limited precision.
@@ -0,0 +1,449 @@
import numpy as np
import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error

# Try to import Bayesian optimization libraries
try:
    from skopt import BayesSearchCV
    from skopt.space import Real, Integer, Categorical
    BAYESIAN_AVAILABLE = True
except ImportError:
    BAYESIAN_AVAILABLE = False
    print("Bayesian optimization libraries not available. Will use RandomizedSearchCV instead.")


def ensure_finite(X, default_value=0.0):
    """
    Replace any NaN, inf, or extremely large values with a default value.

    Args:
        X: Input array or matrix
        default_value: Value to use for replacement

    Returns:
        X_clean: Cleaned array with finite values
    """
    # Make a copy to avoid modifying the original
    X_clean = np.array(X, copy=True)

    # Replace inf values
    mask_inf = np.isinf(X_clean)
    if np.any(mask_inf):
        print(f"Warning: Found {np.sum(mask_inf)} infinite values. Replacing with {default_value}.")
        X_clean[mask_inf] = default_value

    # Replace NaN values
    mask_nan = np.isnan(X_clean)
    if np.any(mask_nan):
        print(f"Warning: Found {np.sum(mask_nan)} NaN values. Replacing with {default_value}.")
        X_clean[mask_nan] = default_value

    # Check for extremely large values
    large_threshold = 1e6  # Adjust as needed
    mask_large = np.abs(X_clean) > large_threshold
    if np.any(mask_large):
        print(f"Warning: Found {np.sum(mask_large)} extremely large values. Replacing with {default_value}.")
        X_clean[mask_large] = default_value

    return X_clean
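

# Example usage (a sketch): sanitize a feature matrix before model fitting.
#
#   X = np.array([[1.0, np.nan], [np.inf, 2.0]])
#   X_clean = ensure_finite(X)  # NaN/inf entries are replaced with 0.0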


def tune_hyperparameters(model_class, param_grid, X_train, y_train, method='grid', cv=5, n_iter=20, model_name=None):
    """
    Tune hyperparameters for a model.

    Args:
        model_class: Scikit-learn model class
        param_grid: Dictionary of hyperparameters
        X_train: Training feature matrix
        y_train: Training target vector
        method: Tuning method ('grid', 'random', or 'bayesian')
        cv: Number of cross-validation folds
        n_iter: Number of iterations for random/bayesian search
        model_name: Name of the model for special handling

    Returns:
        best_model: Tuned model
        best_params: Best hyperparameter values
    """
    start_time = time.time()
    print(f" Tuning hyperparameters for {model_name} using {method} search...")
    print(f" X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")

    if method == 'grid':
        search = GridSearchCV(
            model_class(), param_grid, cv=cv, scoring='neg_mean_squared_error',
            verbose=1, n_jobs=-1
        )
        search.fit(X_train, y_train)
        best_model = search.best_estimator_
        best_params = search.best_params_

    elif method == 'random':
        search = RandomizedSearchCV(
            model_class(), param_grid, n_iter=n_iter, cv=cv,
            scoring='neg_mean_squared_error', verbose=1, random_state=42, n_jobs=-1
        )
        search.fit(X_train, y_train)
        best_model = search.best_estimator_
        best_params = search.best_params_

    elif method == 'bayesian':
        if BAYESIAN_AVAILABLE:
            # Convert param_grid to skopt space format
            search_space = {}
            for param, values in param_grid.items():
                # If parameter values are a list
                if isinstance(values, list):
                    # Check types of values to determine space type
                    if all(isinstance(v, bool) for v in values) or all(isinstance(v, str) for v in values):
                        search_space[param] = Categorical(values)
                    elif all(isinstance(v, int) for v in values):
                        search_space[param] = Integer(min(values), max(values))
                    elif all(isinstance(v, float) for v in values):
                        search_space[param] = Real(min(values), max(values), prior='log-uniform')
                    else:
                        # Mixed types or other - use categorical
                        search_space[param] = Categorical(values)
                # If parameter values are already a dictionary or distribution
                else:
                    search_space[param] = values

            print(f" Created Bayesian search space: {search_space}")

            # Special handling for models that need parameter mapping
            model_instance = model_class()
            if model_name == 'GPR' and 'kernel' in search_space:
                # Create a modified search with a custom kernel mapping
                def map_kernel(params):
                    # Map numeric values to actual kernels
                    if 'kernel' in params and isinstance(params['kernel'], (int, np.integer)):
                        from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel as C
                        kernel_map = {
                            1: C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)),
                            2: C(1.0, (1e-3, 1e3)) * Matern(1.0, (1e-2, 1e2), nu=1.5),
                            3: C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)) + WhiteKernel(0.1)
                        }
                        params['kernel'] = kernel_map.get(params['kernel'], kernel_map[1])
                    return params

                # Use a subset of data for GPR to speed up training
                subset_size = min(1000, len(X_train))
                idx = np.random.choice(len(X_train), subset_size, replace=False)
                X_subset = X_train[idx]
                y_subset = y_train[idx]

                # Manual Bayesian optimization for GPR
                best_score = float('-inf')
                best_params = {}
                best_model = None  # Initialize best_model

                for _ in range(n_iter):
                    # Sample parameters randomly from the space
                    params = {}
                    for param, space in search_space.items():
                        if hasattr(space, 'rvs'):  # It's a distribution
                            params[param] = space.rvs(1)[0]
                        elif isinstance(space, list):  # It's a list of values
                            params[param] = np.random.choice(space)

                    # Map parameters for kernels
                    params = map_kernel(params)

                    # Create and fit model with these parameters
                    try:
                        model = model_class(**params)
                        model.fit(X_subset, y_subset)
                        # Score model
                        score = -mean_squared_error(y_subset, model.predict(X_subset))  # Neg MSE
                        if score > best_score:
                            best_score = score
                            best_params = params
                            best_model = model
                    except Exception as e:
                        print(f" Skipping parameters due to error: {e}")
                        continue

                print(f" Best params: {best_params}")
                if best_model is None:
                    # Fallback if no model was successfully trained
                    print(" No successful model training, using default parameters")
                    best_model = model_class()
                    from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
                    kernel_rbf = C(1.0) * RBF(1.0)
                    best_model.set_params(kernel=kernel_rbf, alpha=1e-6)
                    best_model.fit(X_subset, y_subset)
                return best_model, best_params
            elif model_name == 'MLP' and 'hidden_layer_sizes' in search_space:
                # Create a modified search with a custom hidden_layer_sizes mapping
                def map_hidden_layers(params):
                    # Map numeric values to actual tuples for hidden_layer_sizes
                    if 'hidden_layer_sizes' in params and isinstance(params['hidden_layer_sizes'], (int, float, np.integer)):
                        # Map integers to hidden layer configurations
                        layer_map = {
                            1: (50,),
                            2: (100,),
                            3: (50, 50),
                            4: (100, 50)
                        }
                        params['hidden_layer_sizes'] = layer_map.get(int(params['hidden_layer_sizes']), (50,))
                    return params

                # Manual optimization for MLP
                best_score = float('-inf')
                best_params = {}
                best_model = None  # Initialize best_model

                for _ in range(n_iter):
                    # Sample parameters randomly from the space
                    params = {}
                    for param, space in search_space.items():
                        if hasattr(space, 'rvs'):  # It's a distribution
                            params[param] = space.rvs(1)[0]
                        elif isinstance(space, list):  # It's a list of values
                            params[param] = np.random.choice(space)

                    # Map parameters for hidden layer sizes
                    params = map_hidden_layers(params)

                    # Create and fit model with these parameters
                    try:
                        model = model_class(**params)
                        model.fit(X_train, y_train)
                        # Score model
                        score = -mean_squared_error(y_train, model.predict(X_train))  # Neg MSE
                        if score > best_score:
                            best_score = score
                            best_params = params
                            best_model = model
                    except Exception as e:
                        print(f" Skipping parameters due to error: {e}")
                        continue

                print(f" Best params: {best_params}")
                if best_model is None:
                    # Fallback if no model was successfully trained
                    print(" No successful model training, using default parameters")
                    best_model = model_class(random_state=42, max_iter=1000)
                    best_model.fit(X_train, y_train)
                return best_model, best_params
            else:
                # For other models, use standard BayesSearchCV
                search = BayesSearchCV(
                    model_instance, search_space, n_iter=n_iter, cv=cv,
                    scoring='neg_mean_squared_error', verbose=1, random_state=42, n_jobs=-1
                )
                search.fit(X_train, y_train)
                best_model = search.best_estimator_
                best_params = search.best_params_
        else:
            print(" Bayesian optimization not available, falling back to RandomizedSearchCV")
            search = RandomizedSearchCV(
                model_class(), param_grid, n_iter=n_iter, cv=cv,
                scoring='neg_mean_squared_error', verbose=1, random_state=42, n_jobs=-1
            )
            search.fit(X_train, y_train)
            best_model = search.best_estimator_
            best_params = search.best_params_

    else:
        raise ValueError(f"Unknown tuning method: {method}")

    elapsed_time = time.time() - start_time
    print(f" Tuning completed in {elapsed_time:.2f} seconds")
    print(f" Best params: {best_params}")

    return best_model, best_params


def get_param_grids():
    """
    Get parameter grids for different models.

    Returns:
        param_grids: Dictionary of parameter grids for grid/random search
        param_ranges: Dictionary of parameter ranges for Bayesian optimization
    """
    # Parameter grids for grid/random search
    param_grids = {}

    # Linear models
    param_grids['Linear Regression'] = {'fit_intercept': [True, False]}

    param_grids['Ridge'] = {
        'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
        'fit_intercept': [True, False],
        'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
    }

    param_grids['Lasso'] = {
        'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0],
        'fit_intercept': [True, False],
        'max_iter': [1000, 3000, 5000]
    }

    param_grids['ElasticNet'] = {
        'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0],
        'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
        'fit_intercept': [True, False],
        'max_iter': [1000, 3000, 5000]
    }

    # Tree-based models
    param_grids['Decision Tree'] = {
        'max_depth': [None, 5, 10, 15, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }

    param_grids['Random Forest'] = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }

    param_grids['Gradient Boosting'] = {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'min_samples_split': [2, 5, 10]
    }

    param_grids['XGBoost'] = {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'subsample': [0.8, 0.9, 1.0],
        'colsample_bytree': [0.8, 0.9, 1.0]
    }

    param_grids['LightGBM'] = {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'num_leaves': [31, 63, 127],
        'subsample': [0.8, 0.9, 1.0]
    }

    # Other models
    param_grids['SVR'] = {
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto', 0.1, 0.01, 0.001]
    }

    param_grids['KNN'] = {
        'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    }

    # Neural Network model
    param_grids['MLP'] = {
        'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
        'activation': ['relu', 'tanh'],
        'solver': ['adam', 'sgd'],
        'alpha': [0.0001, 0.001, 0.01],
        'learning_rate': ['constant', 'adaptive']
    }

    # Gaussian Process Regression
    from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel as C
    kernel_rbf = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
    kernel_matern = C(1.0, (1e-3, 1e3)) * Matern(1.0, (1e-2, 1e2), nu=1.5)
    kernel_rbf_white = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)) + WhiteKernel(0.1)

    param_grids['GPR'] = {
        'kernel': [kernel_rbf, kernel_matern, kernel_rbf_white],
        'alpha': [1e-10, 1e-8, 1e-6],
        'normalize_y': [True, False],
        'n_restarts_optimizer': [0, 1, 3]
    }

    # Parameter ranges for Bayesian optimization
    param_ranges = {}

    # Linear models
    param_ranges['Linear Regression'] = {'fit_intercept': [True, False]}

    param_ranges['Ridge'] = {
        'alpha': (0.001, 100.0, 'log-uniform'),
        'fit_intercept': [True, False],
        'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
    }

    param_ranges['Lasso'] = {
        'alpha': (0.0001, 10.0, 'log-uniform'),
        'fit_intercept': [True, False],
        'max_iter': (1000, 10000)
    }

    param_ranges['ElasticNet'] = {
        'alpha': (0.0001, 1.0, 'log-uniform'),
        'l1_ratio': (0.1, 0.9),
        'fit_intercept': [True, False],
        'max_iter': (1000, 10000)
    }

    # Tree-based models
    param_ranges['Decision Tree'] = {
        'max_depth': (3, 30),  # None will be handled specially
        'min_samples_split': (2, 20),
        'min_samples_leaf': (1, 10)
    }

    param_ranges['Random Forest'] = {
        'n_estimators': (10, 300),
        'max_depth': (3, 50),  # None will be handled specially
        'min_samples_split': (2, 20),
        'min_samples_leaf': (1, 10)
    }

    param_ranges['Gradient Boosting'] = {
        'n_estimators': (10, 300),
        'learning_rate': (0.001, 0.3, 'log-uniform'),
        'max_depth': (2, 15),
        'min_samples_split': (2, 20)
    }

    param_ranges['XGBoost'] = {
        'n_estimators': (10, 300),
        'learning_rate': (0.001, 0.3, 'log-uniform'),
        'max_depth': (2, 15),
        'subsample': (0.5, 1.0),
        'colsample_bytree': (0.5, 1.0)
    }

    # Other models
    param_ranges['SVR'] = {
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'C': (0.01, 1000.0, 'log-uniform'),
        'gamma': ['scale', 'auto'] + [(0.0001, 1.0, 'log-uniform')]
    }

    param_ranges['KNN'] = {
        'n_neighbors': (1, 30),
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    }

    # Neural Network model
    param_ranges['MLP'] = {
        'hidden_layer_sizes': [1, 2, 3, 4],  # Will map to actual tuples later
        'activation': ['relu', 'tanh'],
        'solver': ['adam', 'sgd'],
        'alpha': (0.00001, 0.1, 'log-uniform'),
        'learning_rate': ['constant', 'adaptive']
    }

    # Gaussian Process Regression
    param_ranges['GPR'] = {
        'kernel': [1, 2, 3],  # Will map to actual kernels later
        'alpha': (1e-12, 1e-4, 'log-uniform'),
        'normalize_y': [True, False],
        'n_restarts_optimizer': (0, 5)
    }

    return param_grids, param_ranges
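

if __name__ == "__main__":
    # Small self-test (a sketch): tune a Ridge regressor on synthetic data
    # using randomized search, just to exercise the two functions above.
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(42)
    X_demo = rng.normal(size=(200, 5))
    y_demo = X_demo @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

    demo_grids, demo_ranges = get_param_grids()
    demo_model, demo_params = tune_hyperparameters(
        Ridge, demo_grids['Ridge'], X_demo, y_demo,
        method='random', cv=3, n_iter=10, model_name='Ridge'
    )
    print("Demo best parameters:", demo_params)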
@@ -0,0 +1,317 @@
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Define file paths
file_paths = [
    '160508-1021-1000,0min,56kN.csv',
    '160508-1022-900,0min,56kN.csv',
    '200508-1023-1350,0min,56kN.csv',
    '200508-1024-1200,0min,56kN.csv'
]

# Target column to smooth
TARGET_COLUMN = 'Rel. Piston Trav'

def load_file(file_path):
    """
    Load a CSV file with European number format.

    Args:
        file_path: Path to the CSV file

    Returns:
        DataFrame with the loaded data
    """
    try:
        df = pd.read_csv(file_path, sep=';', decimal=',', header=0)
        print(f"Loaded {file_path}, shape: {df.shape}")
        return df
    except Exception as e:
        print(f"Error loading {file_path}: {e}")
        return None

def analyze_target_column(df, target_col):
    """
    Analyze the target column to understand precision issues.

    Args:
        df: DataFrame containing the data
        target_col: Name of the target column

    Returns:
        Dictionary with analysis results
    """
    if target_col not in df.columns:
        print(f"Target column '{target_col}' not found in data")
        return None

    # Extract target column
    target_values = df[target_col].values

    # Calculate differences between consecutive values
    differences = np.diff(target_values)
    non_zero_diffs = differences[differences != 0]

    # Ensure we have absolute differences for calculations that need positive values
    abs_non_zero_diffs = np.abs(non_zero_diffs)

    # Count occurrences of repeated values
    consecutive_repeats = []
    current_count = 1

    for i in range(1, len(target_values)):
        if abs(target_values[i] - target_values[i-1]) < 1e-10:
            current_count += 1
        else:
            if current_count > 1:
                consecutive_repeats.append(current_count)
            current_count = 1

    # Add the last group if it's a repeat
    if current_count > 1:
        consecutive_repeats.append(current_count)

    # Calculate statistics
    results = {
        'unique_values': df[target_col].nunique(),
        'total_values': len(target_values),
        'min_nonzero_diff': np.min(non_zero_diffs) if len(non_zero_diffs) > 0 else 0,
        'min_abs_nonzero_diff': np.min(abs_non_zero_diffs) if len(abs_non_zero_diffs) > 0 else 0.0001,
        'avg_nonzero_diff': np.mean(non_zero_diffs) if len(non_zero_diffs) > 0 else 0,
        'avg_abs_nonzero_diff': np.mean(abs_non_zero_diffs) if len(abs_non_zero_diffs) > 0 else 0.0001,
        'median_nonzero_diff': np.median(non_zero_diffs) if len(non_zero_diffs) > 0 else 0,
        'zero_diff_count': len(differences) - len(non_zero_diffs),
        'zero_diff_percentage': 100 * (len(differences) - len(non_zero_diffs)) / len(differences),
        'max_consecutive_repeats': max(consecutive_repeats) if consecutive_repeats else 0,
        'avg_consecutive_repeats': np.mean(consecutive_repeats) if consecutive_repeats else 0
    }

    print(f"\nAnalysis of '{target_col}':")
    print(f" Unique values: {results['unique_values']} out of {results['total_values']} total values")
    print(f" Minimum non-zero difference: {results['min_nonzero_diff']:.8f}")
    print(f" Zero differences: {results['zero_diff_count']} ({results['zero_diff_percentage']:.2f}% of all consecutive pairs)")
    print(f" Maximum consecutive repeated values: {results['max_consecutive_repeats']}")

    return results

def smooth_target_column(df, target_col, method='noise', params=None):
    """
    Smooth the target column to address precision issues.

    Args:
        df: DataFrame containing the data
        target_col: Name of the target column
        method: Smoothing method to use ('noise', 'spline', or 'rolling')
        params: Parameters for the smoothing method

    Returns:
        DataFrame with the smoothed target column
    """
    # Make a copy to avoid modifying the original
    smoothed_df = df.copy()

    if target_col not in smoothed_df.columns:
        print(f"Target column '{target_col}' not found in data")
        return smoothed_df

    # Extract target column
    target_values = smoothed_df[target_col].values

    if method == 'noise':
        # Default parameters
        if params is None:
            params = {'noise_scale': 0.0001}

        # Add small noise to break plateaus
        noise_scale = params.get('noise_scale', 0.0001)
        np.random.seed(42)  # For reproducibility
        smoothed_values = target_values + np.random.normal(0, noise_scale, len(target_values))

    elif method == 'spline':
        from scipy.interpolate import UnivariateSpline

        # Default parameters
        if params is None:
            params = {'s': 0.01}

        # Use spline interpolation
        x = np.arange(len(target_values))
        s = params.get('s', 0.01)  # Smoothing factor
        spline = UnivariateSpline(x, target_values, s=s)
        smoothed_values = spline(x)

    elif method == 'rolling':
        # Default parameters
        if params is None:
            params = {'window': 3, 'center': True}

        # Use rolling average
        window = params.get('window', 3)
        center = params.get('center', True)
        smoothed_series = pd.Series(target_values).rolling(
            window=window, center=center, min_periods=1).mean()
        smoothed_values = smoothed_series.values

    else:
        print(f"Unknown smoothing method: {method}")
        return smoothed_df

    # Update the target column in the DataFrame
    smoothed_df[target_col] = smoothed_values

    return smoothed_df
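
# Example usage (a sketch):
#
#   df = load_file('160508-1021-1000,0min,56kN.csv')
#   noisy = smooth_target_column(df, TARGET_COLUMN, method='noise',
#                                params={'noise_scale': 0.0001})
#   rolled = smooth_target_column(df, TARGET_COLUMN, method='rolling',
#                                 params={'window': 5, 'center': True})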

def plot_comparison(original_df, smoothed_df, target_col, file_name=None, samples=1000):
    """
    Plot comparison between original and smoothed data.

    Args:
        original_df: DataFrame with original data
        smoothed_df: DataFrame with smoothed data
        target_col: Name of the target column
        file_name: Name of the file (for title)
        samples: Number of samples to plot
    """
    if target_col not in original_df.columns or target_col not in smoothed_df.columns:
        print(f"Target column '{target_col}' not found in data")
        return

    # Create a figure with multiple subplots
    fig, axes = plt.subplots(3, 1, figsize=(15, 12))

    # Get data for plotting
    original_values = original_df[target_col].values[:samples]
    smoothed_values = smoothed_df[target_col].values[:samples]
    x = np.arange(len(original_values))

    # Plot 1: Overview
    axes[0].plot(x, original_values, label='Original', alpha=0.7)
    axes[0].plot(x, smoothed_values, label='Smoothed', alpha=0.7)
    axes[0].set_title(f"Overview of {target_col}" + (f" ({file_name})" if file_name else ""))
    axes[0].set_xlabel('Index')
    axes[0].set_ylabel(target_col)
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Plot 2: Zoomed section (first 200 points)
    zoom_end = min(200, len(original_values))
    axes[1].plot(x[:zoom_end], original_values[:zoom_end], label='Original', alpha=0.7)
    axes[1].plot(x[:zoom_end], smoothed_values[:zoom_end], label='Smoothed', alpha=0.7)
    axes[1].set_title(f"Zoomed View (First {zoom_end} Points)")
    axes[1].set_xlabel('Index')
    axes[1].set_ylabel(target_col)
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    # Plot 3: Difference between original and smoothed
    diff = smoothed_values - original_values
    axes[2].plot(x, diff, label='Smoothed - Original', color='green', alpha=0.7)
    axes[2].axhline(y=0, color='r', linestyle='--', alpha=0.5)
    axes[2].set_title('Difference (Smoothed - Original)')
    axes[2].set_xlabel('Index')
    axes[2].set_ylabel('Difference')
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

def save_smoothed_file(df, original_path, suffix="_smoothed"):
    """
    Save the DataFrame to a new CSV file with European number format.

    Args:
        df: DataFrame to save
        original_path: Path to the original CSV file
        suffix: Suffix to add to the new filename

    Returns:
        Path to the saved file
    """
    # Create new filename
    base, ext = os.path.splitext(original_path)
    new_path = f"{base}{suffix}{ext}"

    # Save with European number format
    df.to_csv(new_path, sep=';', decimal=',', index=False)
    print(f"Saved smoothed data to {new_path}")

    return new_path

def process_file(file_path, smoothing_method, params=None):
    """
    Process a single file: load, analyze, smooth, plot comparison, and save.

    Args:
        file_path: Path to the CSV file
        smoothing_method: Method to use for smoothing
        params: Parameters for the smoothing method

    Returns:
        Path to the saved smoothed file
    """
    # Load the file
    df = load_file(file_path)
    if df is None:
        return None

    # Analyze the target column
    analysis = analyze_target_column(df, TARGET_COLUMN)
    if analysis is None:
        return None

    # Adjust smoothing parameters based on analysis if not provided
    if params is None:
        if smoothing_method == 'noise':
            # Use 1/10 of the minimum non-zero difference
            noise_scale = max(0.00001, abs(analysis['min_abs_nonzero_diff']) / 10)
            params = {'noise_scale': noise_scale}
            print(f"Using noise scale: {noise_scale:.8f}")
        elif smoothing_method == 'spline':
            # Adjust smoothing factor based on data range
            data_range = df[TARGET_COLUMN].max() - df[TARGET_COLUMN].min()
            s = 0.0001 * data_range * len(df)
            params = {'s': s}
            print(f"Using spline smoothing factor: {s:.8f}")
        elif smoothing_method == 'rolling':
            # Use window size based on average run length of repeated values
            window = max(3, int(analysis['avg_consecutive_repeats'] / 2))
            params = {'window': window, 'center': True}
            print(f"Using rolling window size: {window}")

    # Smooth the target column
    smoothed_df = smooth_target_column(df, TARGET_COLUMN, smoothing_method, params)

    # Plot comparison
    plot_comparison(df, smoothed_df, TARGET_COLUMN, os.path.basename(file_path))

    # Save the smoothed data
    smoothed_path = save_smoothed_file(smoothed_df, file_path)

    return smoothed_path

def main():
    """Main execution function"""
    print("SPS Data Smoothing Utility")
    print("==========================")

    # Smoothing parameters
    smoothing_method = 'noise'  # 'noise', 'spline', or 'rolling'

    # Process each file
    smoothed_files = []
    for file_path in file_paths:
        print(f"\nProcessing {file_path}...")
        smoothed_path = process_file(file_path, smoothing_method)
        if smoothed_path:
            smoothed_files.append(smoothed_path)

    print("\nProcessing complete!")
    print(f"Created {len(smoothed_files)} smoothed files:")
    for file_path in smoothed_files:
        print(f" {file_path}")

if __name__ == "__main__":
    main()