May 10, 2026


Anomaly detection using machine learning and adopted digital twin concepts in radio environments


Our methodology begins with creating a dataset using the system parameters referenced in11. Data generation was accomplished with our Python framework, which consists of six dedicated code modules, each handling a specific part of data generation, cleaning, and grouping.

The system model assumes a 40 × 40 m radio environment free of obstacles. Within this area there are 10 transmitters and either zero or one jammer, each operating at the same power level specified in the table below. A key parameter in our study is the RSSI, which reflects the strength of the signals received by the SUs from either the TXs or the jammer. All of these components are connected to a control unit running our framework.
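As an illustration, the RSSI seen by an SU can be sketched with a log-distance path-loss model plus log-normal shadowing. The reference path loss, path-loss exponent, and shadowing deviation below are assumed placeholder values, not the actual entries of Table 1:

```python
import math
import random

def rssi_dbm(tx_power_dbm, distance_m, pl0_db=40.0, n=2.0,
             shadow_sigma_db=4.0, rng=None):
    """RSSI from a log-distance path-loss model (illustrative parameters).

    pl0_db: path loss at the 1 m reference distance (assumed value).
    n: path-loss exponent (assumed 2.0 for an obstacle-free area).
    shadow_sigma_db: std-dev of log-normal shadowing in dB (assumed).
    """
    rng = rng or random.Random(0)
    path_loss = pl0_db + 10.0 * n * math.log10(max(distance_m, 1.0))
    shadowing = rng.gauss(0.0, shadow_sigma_db)
    return tx_power_dbm - path_loss - shadowing

# One transmitter at (5, 5) observed by a sensing unit at (20, 20)
# inside the 40 x 40 m area.
d = math.hypot(20 - 5, 20 - 5)
print(round(rssi_dbm(20.0, d), 1))
```

With the shadowing term disabled, the model reduces to the deterministic log-distance curve, which is convenient for sanity checks.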

Anomalies in the radio environment are identified through examination of localization data, RSSI, and SNR values. This process identifies active transmitters and their effect on the 25 SUs spread across the area. This setup enables signal intelligence analysis for detecting anomalies in the radio environment based on predefined threshold values. To extend anomaly detection into a security-centric framework, anomalies were mapped to specific cyber threats. The mapping proceeds as follows:

  1. Feature Alignment: The extracted features (RSSI, SNR, Path Loss, Localization Error, etc.) were mapped to known attack behaviors.

  2. Threat Classification: Each detected anomaly was categorized under a predefined security attack type.

  3. Severity Assessment: Attacks were assigned severity levels (High/Medium) based on their potential for system disruption.

  4. Countermeasure Mapping: Each attack was linked to a security mitigation technique, ensuring real-time defense adaptation.
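The four mapping steps above can be sketched as a simple lookup. The specific anomaly-to-attack pairings, severity labels, and countermeasure strings below are illustrative placeholders; the authoritative mapping is the one given in Table 2:

```python
# Hypothetical threat map: anomaly class -> (attack type, severity, countermeasure).
# The entries are placeholders for illustration; see Table 2 for the real mapping.
THREAT_MAP = {
    "Signal Drift": ("Jamming Attack", "High", "Adaptive frequency hopping"),
    "Multipath Effect": ("Replay Attack", "Medium", "Timestamped signal monitoring"),
    "Localization Inaccuracy": ("Spoofing Attack", "High", "Position cross-validation"),
}

def classify_threat(anomaly_class):
    """Steps 2-4: return threat class, severity, and countermeasure for an anomaly."""
    return THREAT_MAP.get(anomaly_class, ("Unknown", "Medium", "Manual review"))

print(classify_threat("Signal Drift"))
```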

Dataset generation

We generated a comprehensive anomaly detection dataset of over 20,000 samples using Python code. The dataset is based on parameters and environmental characteristics similar to those in3. In particular, we modeled network conditions within a 40 m × 40 m radio environment. Key parameters were derived from Table 1.

Figure 1 shows parameters that were selected based on their relevance to detecting anomalies in wireless communication environments. The dataset was divided into different anomaly classes: “Normal,” “Signal Drift,” “Multipath Effect,” “Localization Inaccuracy,” “HShadLev,” and “LShadLev.” Each sample was labeled with its respective anomaly class, which helped in training the machine learning models and evaluating the effectiveness of anomaly detection.

Fig. 1: Key parameters collected.

Table 1 Our proposed system parameters.

Anomaly detection and classification

Figure 2 shows the systematic approach used for anomaly detection and classification in the machine learning domain; the detailed workflow, from data preprocessing to validation of the model results, is presented. The first step, data preprocessing, includes cleansing and normalizing the dataset to ensure consistency and minimize noise, which is crucial for accurate model training. After preprocessing, feature extraction identifies the pertinent parameters: RSSI, Path Loss, SNR, and the attack metrics relevant to anomaly detection. These features are instrumental in capturing the patterns associated with both normal and anomalous data.

The model training phase applies machine learning algorithms to learn the relationships between the extracted features and anomalies, using labeled datasets for supervised training. This step establishes the foundation for resilient detection and classification mechanisms. Each algorithm goes through the same four steps: randomized CV search, hyperparameter tuning, permutation importance, and weight extraction.
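A minimal scikit-learn sketch of these per-algorithm steps, shown here for one algorithm on synthetic stand-in data (the real features would be RSSI, SNR, Path Loss, etc.; the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in data with six features playing the role of RSSI, SNR, Path Loss, etc.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Steps 1-2: randomized CV search over a small grid, then refit the best model.
# class_weight="balanced" corresponds to the class-weight handling step.
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    {"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X_tr, y_tr)

# Step 3: permutation importance of each feature on held-out data.
imp = permutation_importance(search.best_estimator_, X_te, y_te,
                             n_repeats=5, random_state=0)
print(search.best_params_, imp.importances_mean.round(3))
```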

Following training, anomaly detection is carried out on both the preprocessed dataset and real-time inputs. The goal of this step is to detect deviations indicative of potential anomalies, including signal drift, jamming, and multipath effects.

Detected anomalies are then categorized into predetermined classes (such as Normal, Signal Drift, and Multipath Effect) during the anomaly classification phase, ensuring that each detected anomaly is accurately categorized according to its characteristics.

Finally, the verification step evaluates the accuracy of the classifications to verify the performance of the machine learning models. Figure 2 represents the impact of the models, estimated according to their accuracy in detecting and classifying anomalies in both the generated and real-time datasets.

Fig. 2: Anomaly detection and classification.

Security layer integration

Table 2 outlines the different types of attacks commonly experienced in wireless communication systems. Each attack is described in detail, highlighting its mechanism and potential risks. The table also includes a Severity Level column, which categorizes attacks as High or Medium depending on the damage they can cause to network and system integrity. The Potential Countermeasure column suggests defensive strategies tailored to each attack type. For example, jamming attacks require adaptive frequency hopping to avoid interference, while MITM attacks can be mitigated with strong encryption and validation techniques. Similarly, replay and eavesdropping attacks are countered through signal monitoring and encryption improvements. This representation fits directly into the anomaly detection framework, particularly the security layer, to offer practical insights. By categorizing attacks and connecting their severities to corresponding countermeasures, the table enables quick reference and improves the overall defense strategy of the system.

Table 2 Attack types, severity levels, and suggested countermeasures.

Model workflow

The workflow in Fig. 3 presents an overview of the methodology for enhancing a machine learning based wireless network anomaly detection system using the classified anomalies dataset. The process consists of the following stages:

  1. Data Gathering: Use the classified anomalies dataset, consisting of 20,000 entries and 18 features, representing various types of cyber-attacks.

  2. Data Preprocessing: Resolve data quality issues by handling missing or unreliable values, removing duplicate entries, converting data types, and eliminating redundant columns to improve data efficiency.

  3. Feature Selection: Resolve class imbalance through random undersampling of the majority class, normalize data using standard scaling so that all features are on the same scale, and select the most relevant features using techniques such as Recursive Feature Elimination (RFE) and correlation analysis.

  4. Data Splitting: Use stratified K-fold cross-validation with five folds to split the data while maintaining the class distribution for precise model validation.

  5. Model Training: Train and test models using different algorithms (RF, LR, SVM, XGBoost, KNN).

For each algorithm, we consistently optimize hyperparameters using randomized search, fine-tune model parameters for better performance, assess feature importance during training, and incorporate class weights to handle class imbalance effectively.

6. Model Evaluation: Assess overall model performance using metrics such as weighted and macro precision, recall, accuracy, and F1 score.
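The evaluation step can be sketched with scikit-learn's metric helpers; the label vectors below are toy stand-ins for real fold predictions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted labels standing in for one validation fold.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

acc = accuracy_score(y_true, y_pred)
# Macro averages treat every class equally; weighted averages respect class sizes.
macro = precision_recall_fscore_support(y_true, y_pred, average="macro",
                                        zero_division=0)[:3]
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted",
                                           zero_division=0)[:3]
print(acc, macro, weighted)
```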

Fig. 3: Overview of model architecture.

In the following subsections, every step is explained in detail.

Data leakage

To guarantee the integrity of the anomaly detection framework, stringent measures were put in place to avoid data leakage during processing. Key operations such as data normalization, feature selection, undersampling, and hyperparameter tuning (using randomized search) were performed exclusively on the training set to ensure that no information from the test set influenced these procedures. This separation maintained the integrity of the model evaluation process and provided unbiased results. Furthermore, we utilized repeated stratified K-fold cross-validation to evaluate the models. This method ensured even distribution of data across all folds, preserving the balance of anomaly classes and keeping the training and testing datasets independent. By following this approach precisely, we reduced the possibility of unintentional data leakage, preserving the accuracy of performance metrics and improving the model's ability to perform well on new, unseen data.
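One standard way to enforce this separation is to keep normalization inside a pipeline, so that cross-validation fits the scaler on each fold's training portion only; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its training portion only -- the test fold never leaks into it.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=500))])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print(len(scores))  # 10 folds: 5 splits x 2 repeats
```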

Dataset columns

  • Sensing Unit (SU): Identifies the distributed sensing unit in charge of gathering data.

  • Source Type: The type of source (jammer or transmitter) contributing to the communication signals.

  • Distance (m): Distance between the source and the SU in meters.

  • Path Loss (dB): Signal power loss as the signal propagates through space.

  • RSSI (dBm): Received Signal Strength Indicator, representing the power of the signal at the receiver.

  • SNR (dB): Signal-to-Noise Ratio, the ratio between signal and noise power.

  • Shadowing: Models signal attenuation caused by environmental shadowing effects.

  • Jammer Present: A binary indicator of a jamming attack.

  • Localization Error: Indicates how inaccurately the source location was determined.

  • No Anomaly Detected: Indicates whether the observation corresponds to normal network behavior.

  • Other Anomalies: Includes other network irregularities not classified under specific attacks.

  • Attack Type: Specifies the type of attack as shown in Table 2.

  • Attack Severity: Attack severity, represented categorically.

  • Potential Countermeasure: Suggested mitigation strategies for each attack type.

  • Description: A brief explanation of the observed anomaly or attack.

  • Anomaly Class: A numeric classification of anomalies ranging from 0 (Normal) to various types of network disruptions (non-zero), as shown in Table 3.

Figure 4 visualizes the correlation between RSSI and SNR using a sieve diagram, where the x-axis divides RSSI into four intervals and the y-axis divides SNR into four ranges. Red regions indicate higher-than-expected occurrences, likely due to stable signal conditions or reduced interference, while blue regions represent lower-than-expected occurrences, suggesting disruptions caused by environmental factors, hardware limitations, or interference. The diagonal pattern of blue regions implies that mid-range RSSI values do not consistently align with moderate SNR; that is, as RSSI improves, SNR does not always increase proportionally. The chi-square statistic (χ² = 60,000, p = 0.000) confirms a strong association, revealing that signal strength and quality deviate from a uniform joint distribution. This pattern provides insights into signal propagation characteristics and potential interference effects within the dataset.

Fig. 4: Sieve diagram showing the relationship between RSSI (dBm) and SNR.

Data generation

Table 3 outlines a classification framework specifically designed to capture the key anomalies and environmental factors that affect the performance of wireless networks. The framework contains six classes, each representing a different network behavior or challenge. The dataset, created using the parameters in Table 1 to resemble real-world conditions, is an essential resource for evaluating machine learning algorithms and anomaly detection models, offering a robust foundation for the development of more resilient and efficient cellular systems.

  1. Class 0 Normal: This class represents the initial state of the network, where no anomalies are detected and the network operates with optimal functionality. It serves as the baseline category against which the anomalous conditions are compared.

  2. Class 1 Signal Drift: Signal drift refers to changes in metrics like RSSI or SNR, often due to natural network fluctuations or user behavior, without external interference. While not immediately alarming, it can indicate issues such as failing hardware or environmental changes.

  3. Class 2 Multipath Effect: The multipath effect occurs when signals scatter along different paths, causing fluctuations in strength and quality due to interference. Class 2 covers these distortions, which can lead to unstable connections and degrade network performance, particularly in areas with complex obstacles.

  4. Class 3 Localization Inaccuracy: Class 3 focuses on errors in establishing the exact location of a source or sensing unit, caused by environmental factors or system limitations. These errors can lead to inaccurate position data, affecting network resource allocation. Detecting them is essential for maintaining system reliability, particularly in mobile environments where selecting the appropriate location matters.

  5. Class 4 HShadLev (High Shadowing Levels): This class refers to situations where environmental obstacles, such as buildings, cause significant signal attenuation. This results in critical degradation of the signal, often leading to weak or nearly unavailable connections.

  6. Class 5 LShadLev (Low Shadowing Levels): This class represents conditions with minimal signal attenuation, where environmental factors have little impact on signal propagation, ensuring more stable communication. The dataset, created to simulate real-world conditions, includes these labeled classes to aid in developing and testing machine learning models for anomaly detection. By modeling factors like shadowing and path loss, it offers a strong foundation for advancing wireless network security and resilience.

Dataset specifications

A dataset of 20,000 samples was created to support robust anomaly detection in wireless networks. It was generated using custom-built scripts, inspired by prior studies, and simulated within a 40 m × 40 m area under realistic environmental conditions according to Table 1.

Data type optimization

To improve efficiency, data types were optimized to lower memory usage, speed up processing, and improve scalability:

  • Integer Columns: Discrete values like Source Index and Anomaly Class are stored as integers.

  • Floating-Point Columns: Features like RSSI and SNR are stored as float32 for adequate precision with reduced memory.

  • Categorical Data: String values with few distinct levels are converted to categorical types to save memory.

  • Binary Columns: Boolean data is stored in binary form to save storage space.

  • String Columns: Unique descriptive entries are kept as strings where compression is unnecessary.

These optimizations decreased memory usage, improved computational performance, and enhanced scalability, supporting efficient anomaly detection and future expansion.
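A minimal pandas sketch of these conversions; the column names follow the dataset description above, while the values are placeholders:

```python
import pandas as pd

# Placeholder rows; column names mirror the dataset description.
df = pd.DataFrame({
    "Source Index": [1, 2, 3],
    "RSSI (dBm)": [-62.5, -71.0, -55.2],
    "Attack Severity": ["High", "Medium", "High"],
    "Jammer Present": [1, 0, 1],
})

df["Source Index"] = df["Source Index"].astype("int16")       # integer column
df["RSSI (dBm)"] = df["RSSI (dBm)"].astype("float32")         # floating-point column
df["Attack Severity"] = df["Attack Severity"].astype("category")  # categorical
df["Jammer Present"] = df["Jammer Present"].astype("bool")    # binary column

print(df.dtypes)
```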

Data pre‑processing

The preprocessing involved the following steps:

  1. Missing data in columns such as Shadowing and Localization Error was imputed using the mean or median value, depending on the distribution of the data.

  2. Records with excessive missing values were discarded to maintain dataset integrity.

  3. Continuous variables such as Distance (m), Path Loss (dB), RSSI (dBm), and SNR (dB) were standardized using z-score normalization to guarantee that all features contributed equally to the machine learning models. This reduced the bias introduced by features with larger magnitudes.

  4. Outliers in numerical columns were detected using the Interquartile Range (IQR) method. Records with values beyond 1.5 times the IQR were flagged for removal.

  5. For anomaly-related attributes, extreme outliers were retained if they represented valid anomalies, as they were crucial for model training.

  6. Columns like Source Type, Attack Type, Attack Severity, and Potential Countermeasure were converted into numerical representations using one-hot encoding or label encoding, depending on the column's characteristics.

  7. The Anomaly Class column, which represents the target variable, was encoded as an integer for classification tasks.

  8. Noise was added to features like RSSI (dBm) and SNR (dB) to simulate real-world variations and improve the robustness of the models.

  9. Rotations and translations were applied to spatial parameters like Distance (m) to mimic different environmental conditions.

  10. The dataset was structured for model evaluation using stratified 5-fold cross-validation, confirming that the class distribution was preserved across all folds.

  11. This approach systematically divided the dataset into five subsets, where each fold served as a validation set once while the remaining four folds were used for training, providing a robust evaluation framework.

  12. Feature importance was assessed using correlation analysis. Redundant features were rejected to streamline the dataset and enhance model performance.

  13. Signal-related columns like RSSI (dBm) and SNR (dB) were smoothed using moving-average filters to decrease noise without compromising the integrity of the data.
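The normalization and outlier-flagging steps above can be sketched directly in NumPy; the RSSI values below are illustrative:

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """True where a value lies beyond k * IQR from the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore(x):
    """Z-score normalization used for the continuous signal features."""
    return (x - x.mean()) / x.std()

# Illustrative RSSI samples with one obvious outlier at -120 dBm.
rssi = np.array([-60.0, -62.0, -61.0, -59.0, -63.0, -120.0])
print(iqr_outlier_mask(rssi))
```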

Figure 5 represents the critical data preprocessing steps implemented to prepare the dataset for machine learning tasks, ensuring that the dataset is clean, consistent, and suitable for anomaly detection, thereby enhancing model accuracy and reliability. Initially, missing values in numerical and categorical columns were handled using imputation techniques, or removal when excessive gaps were identified. Continuous variables, including Distance, Path Loss, RSSI, and SNR, were normalized using z-score normalization to remove biases due to varying feature magnitudes. Outliers were detected using the Interquartile Range technique, with exceptions made for anomaly-specific features that served as valid inputs for model training. Categorical features, including Source Type and Attack Severity, were numerically encoded using one-hot or label encoding, while the target variable Anomaly Class was encoded for classification. Data augmentation techniques, such as introducing noise to RSSI and SNR and performing spatial rotations and translations, were used to imitate real-world variations and environmental conditions. The dataset was prepared for model assessment using stratified five-fold cross-validation, with class distributions retained across all folds. Feature selection was performed using Recursive Feature Elimination (RFE) and correlation analysis to eliminate duplicate features and improve model performance. Lastly, noise-reduction techniques, such as moving-average filters, were applied to signal-based features to improve data quality while maintaining integrity.

Fig. 5: Anomaly detection data preparation.

Missing values and unreliable values

Continuous variables, such as Distance (m) and RSSI (dBm), were filled using the mean or median values to preserve the dataset’s statistical properties. Categorical variables, such as Source Type and Attack Type, were imputed with the most frequent category. Records with a high percentage of missing values across multiple features were discarded to maintain overall dataset quality.

Correlation and dropping redundant features

Redundant categorical variables were evaluated using Cramér’s V statistic to determine their correlation, ensuring that the retained features offered unique insights. This process reduced the dataset’s dimensionality, improved computational efficiency, and enhanced model performance by eliminating noise and redundancy.
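A minimal NumPy sketch of Cramér's V from a contingency table of two categorical features; the tables below are toy examples (a perfectly associated table gives V = 1, an independent one gives V = 0):

```python
import numpy as np

def cramers_v(table):
    """Cramér's V association between two categorical features.

    table: contingency table (rows = levels of one feature,
    columns = levels of the other).
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence, then the chi-square statistic.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

print(cramers_v([[10, 0], [0, 10]]), cramers_v([[5, 5], [5, 5]]))  # 1.0 0.0
```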

Data duplicates

In our dataset, we performed a full analysis to detect and eliminate duplicate rows, considering all feature columns as criteria for duplication. Duplicate records were detected when the values in all columns matched exactly across rows. After identifying these duplicates, they were removed to avoid affecting model training with redundant information.

The process was further refined to account for partial duplicates: cases where most features were identical except for small differences in non-critical fields, such as comments or descriptions. In these cases, domain expertise was applied to decide whether to retain or merge the entries.

By resolving data duplication, we ensured the uniqueness and authenticity of the dataset, which contributes to balanced model training and a more accurate evaluation of anomaly detection algorithms. This step also helped improve computational efficiency by reducing the overall dataset size without reducing its quality.
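Both checks can be sketched in pandas with placeholder rows; the column subset used for partial-duplicate screening below is an assumption for illustration:

```python
import pandas as pd

# Placeholder rows; the first two are exact duplicates across every column.
df = pd.DataFrame({
    "RSSI (dBm)": [-60.0, -60.0, -71.2],
    "SNR (dB)": [12.0, 12.0, 7.5],
    "Attack Type": ["Jamming", "Jamming", "MITM"],
})

# Exact duplicates: every column matches across rows.
deduped = df.drop_duplicates().reset_index(drop=True)

# Partial duplicates: identical on the critical columns only (assumed subset);
# flag both members of each pair for manual review rather than dropping them.
partial = df.duplicated(subset=["RSSI (dBm)", "SNR (dB)"], keep=False)
print(len(deduped), int(partial.sum()))
```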

Data grouping and data sampling

Grouping was important for the feature engineering and model validation steps, as it enabled us to maintain the essential structure of the data while ensuring balanced representation across categories during sampling.

Given the large size of the dataset (20,000 samples), data sampling was applied to create balanced and manageable groups for model training, testing, and validation. Sampling techniques were employed to address the following objectives:

  • Balancing Class Distribution.

  • Reducing Computational Overhead.

  • Addressing Specific Use Cases.

Hyperparameter fine‑tuning

In the field of DT Wireless Networks, effective anomaly detection is essential for maintaining network integrity and performance. To enhance the performance of various machine learning models used for this purpose, thorough hyperparameter tuning was methodically implemented. This process involved adjusting key parameters for each algorithm to optimize accuracy, precision, recall, and overall stability. Techniques such as Grid Search and Randomized Search Cross Validation (CV) were employed, utilizing repeated stratified k fold cross validation to ensure consistent results and reduce the risk of overfitting.

K nearest neighbors (KNN)

KNN is a non parametric algorithm used for classification tasks, determining a sample’s class based on the majority vote of its k nearest neighbors. It is particularly effective for datasets with variable distributions and gradual transitions, such as Class 1 (Signal Drift). The hyperparameter k, representing the number of neighbors, was optimized by testing values from 1 to 30. Weighting schemes, including uniform and distance based, and distance metrics like Euclidean, Manhattan, and Minkowski were evaluated. The predicted value is computed as the average of the target values of the k nearest neighbors in Eq. (1):

$$\:\widehat{y}=\frac{1}{k}{\sum\:}_{i=1}^{k}{y}_{i}\:$$

(1)

where:

\(\:\widehat{y}\): predicted value for the target variable.

\(\:k\): number of nearest neighbors considered.

\(\:{y}_{i}\): value of the target variable for the \(\:{i}^{th}\) nearest neighbor.
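Equation (1) can be sketched directly in plain Python with a Euclidean metric; the sample points are illustrative:

```python
import math

def knn_predict(train, query, k=3):
    """Average of the k nearest neighbors' targets, as in Eq. (1).

    train: list of ((features...), y) pairs; Euclidean distance metric.
    """
    ranked = sorted(train, key=lambda p: math.dist(p[0], query))
    return sum(y for _, y in ranked[:k]) / k

samples = [((0.0,), 1.0), ((1.0,), 1.0), ((2.0,), 3.0), ((10.0,), 9.0)]
print(knn_predict(samples, (0.5,), k=3))  # (1 + 1 + 3) / 3
```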

Random forest (RF)

Random Forest is an ensemble learning algorithm that combines multiple decision trees to produce a strong classification by averaging their outputs. It is a very efficient method for handling high-dimensional datasets and decreasing the chance of overfitting. This approach is well suited for analyzing positional errors in Class 3 (Localization Inaccuracy). Hyperparameters such as the number of trees (n_estimators), tree depth, and the minimum samples per split were fine-tuned, with values ranging from 50 to 500 trees, depths between 5 and 20, and minimum splits from 2 to 10. The model's prediction is given by:

$$\:f\left(x\right)=\frac{1}{N}{\sum\:}_{j=1}^{N}{T}_{j}\left(x\right)$$

(2)

Where:

\(\:f\left(x\right)\): the predicted value for the target variable at data point x.

N: the total number of decision trees in the Random Forest.

\(\:{T}_{j}\left(x\right)\): the prediction of the j-th decision tree for data point x.

XGBoost

XGBoost (Extreme Gradient Boosting)40 is an improved boosting algorithm that builds models step by step. Performance is optimized through gradient descent, which minimizes the prediction error. XGBoost is highly effective for anomaly detection in radio environments because it can handle both linear and nonlinear relationships among features41. Its built-in regularization prevents overfitting and helps the model generalize well to new environments. XGBoost proceeds in iterations; each round adds a tree that reduces the loss by fitting the negative gradient of the loss with respect to the current prediction, which can be written as:

$$\:\widehat{{y}_{i}^{\left(t+1\right)}}=\widehat{{y}_{i}^{\left(t\right)}}+\eta\:{f}_{t}\left({x}_{i}\right)$$

(3)

where:

\(\:\widehat{{y}_{i}^{\left(t\right)}}\): the predicted value for sample i at iteration t.

\(\:{f}_{t}\left({x}_{i}\right)\): the prediction of the weak learner (usually a tree) for data point i.

\(\:\eta\:\): the learning rate, which controls the step size of the update.
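The additive update of Eq. (3) can be sketched for squared loss, where the weak learner fits the residual (the negative gradient); the toy targets below converge to their true values over repeated rounds:

```python
def boost_step(preds, weak_learner, xs, eta=0.1):
    """One additive round of Eq. (3): y_hat(t+1) = y_hat(t) + eta * f_t(x)."""
    return [p + eta * weak_learner(x) for p, x in zip(preds, xs)]

# For squared loss the negative gradient is simply the residual y - y_hat,
# so an ideal weak learner at round t predicts exactly that residual.
ys = [2.0, 4.0]
preds = [0.0, 0.0]
for _ in range(50):
    resid = {x: y - p for x, y, p in zip([0, 1], ys, preds)}
    preds = boost_step(preds, lambda x: resid[x], [0, 1], eta=0.5)

print([round(p, 3) for p in preds])  # converges to [2.0, 4.0]
```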

Logistic regression (LR)

Logistic Regression is a statistical model for binary classification, predicting probabilities using a sigmoid function. It estimates the probability of an event, such as separating normal (Class 0) from anomalous conditions. Key hyperparameters, including the regularization strength (C), the solver (such as saga), and the penalty type (L1, L2), were fine-tuned. The logistic function is expressed as in Eq. (4):

$$\:P\left(y=1|X\right)=\frac{1}{1+{e}^{-\left({{\upbeta\:}}_{0}+{{\upbeta\:}}_{1}{X}_{1}+\dots\:+{{\upbeta\:}}_{n}{X}_{n}\right)}}$$

(4)

\(\:P\left(y=1|X\right)\): probability of y being 1 given the features X.

X: input features.

\(\:{\beta\:}_{0}\): the intercept term (bias).

\(\:{\beta\:}_{1},\dots\:,{\beta\:}_{n}\): the coefficients associated with each feature.

The Logistic Regression equation models the probability of the target variable being 1 as a function of the input features using a sigmoid function (the logistic function). The sigmoid function maps any input value to a value between 0 and 1, which represents the probability.
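Equation (4) in code, with illustrative coefficients; at the decision boundary the linear combination is zero and the sigmoid returns 0.5:

```python
import math

def sigmoid(z):
    """Maps any real input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """P(y = 1 | X) from Eq. (4): sigmoid of the linear combination of features."""
    return sigmoid(beta0 + sum(b * xi for b, xi in zip(betas, x)))

print(round(predict_proba([0.0, 0.0], 0.0, [1.0, 1.0]), 2))  # 0.5 at the boundary
```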

Support vector machine (SVM)

SVM separates classes in a high-dimensional space by maximizing the margin between them. It is effective for binary classification, such as classifying conditions in Class 5 (LShadLev). Hyperparameter tuning involved exploring kernel alternatives (linear, RBF, polynomial), the regularization parameter (C), and the kernel coefficient (γ). The decision function is as in Eq. (5):

$$\:f\left(x\right)={w}^{T}x+b$$

(5)

Where:

\(\:f\left(x\right)\): the decision score.

\(\:{w}^{T}\): the transpose of the weight vector.

x: the input data vector.

b: the bias term, which determines the position of the hyperplane.
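Equation (5) in code, with illustrative weights; the sign of the score gives the predicted class:

```python
def svm_decision(x, w, b):
    """Decision score f(x) = w^T x + b from Eq. (5); sign gives the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

score = svm_decision([1.0, 2.0], [0.5, -0.25], 0.1)
print(score)  # 0.5*1 - 0.25*2 + 0.1 = 0.1 -> positive class
```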

All machine learning models employ the following general tuning strategy:

Randomized Search: initially used to rapidly explore a broad range of hyperparameters.

Grid Search: applied for fine-tuning after narrowing down promising parameter ranges.

Cross-Validation: repeated stratified k-fold with K = 5 was used to guarantee reliable performance metrics and prevent data leakage.

Performance Metrics: accuracy, precision, recall, F1 score, and AUC-ROC were monitored to identify the best hyperparameter combination for each model.
