LAMANet: A Real-Time, Machine Learning-Enhanced Approximate Message Passing Detector for Massive MIMO

Model-driven machine learning for signal detection in the physical layer of mobile communication systems combines well-known detector structures with learned parameters. Recent work has shown high detection performance in massive multiple-input–multiple-output (MIMO) detection; however, thorough complexity analyses and real-time processing hardware are lacking. This work proposes a novel machine learning-enhanced approximate message passing (AMP) algorithm named LAMANet and its hardware implementation. The algorithm solves major challenges of previous proposals, such as the complete loss of performance in untrained detectors and the still high computational complexity compared to traditional massive MIMO detection methods. We provide a comprehensive complexity comparison, simulations of the symbol error rate (SER) performance over realistic channel models, and a field-programmable gate array (FPGA) implementation capable of processing LAMANet in real time. The results show that LAMANet achieves detection performance similar to previous machine learning-enhanced algorithms, while the computational effort is reduced to a level where real-time computation in hardware becomes comparable to traditional detection methods.


I. INTRODUCTION
Machine learning in the context of mobile communication systems has received much attention recently and is speculated to be a significant part of future networks such as 6G [1]. It is investigated in many areas, such as mobile data analysis, network control, network security, traffic control, and physical layer processing [2]. Physical layer processing is of particular interest to us, as it holds the potential to: 1) improve performance over complex or unknown communication channels via a) the optimization or extension of existing algorithms or the introduction of new machine learning-based algorithms [3], [4], or b) the joint optimization of various algorithms by removing classical block boundaries [5]; 2) compensate for hardware impairments [e.g., low-resolution analog-to-digital converters (ADCs)] [6]; and 3) ease the design process by relying on well-established machine learning frameworks. Although there are many parts of the physical layer, such as channel estimation, power control, source and channel coding, and synchronization, that might benefit from machine learning, in this work, we focus on signal detection for massive multiple-input-multiple-output (MIMO) in the base station.
Two distinctive design paradigms of machine learning for signal detection have been developed, namely, data- and model-driven designs. Data-driven designs rely purely on the algorithm's ability to learn from the available training data. The signal detectors are implemented by common machine learning structures, such as fully connected layers of a neural network and convolutional operations. O'Shea and Hoydis [4] presented an autoencoder-based design, in which both the sender and the receiver are implemented as neural networks. This has the advantage that the optimal symbol representation can be learned, depending on the channel. An autoencoder for hybrid beamforming in a massive MIMO downlink scenario is implemented and profiled in [7]. Over-the-air communication of a simple autoencoder design with programmable radio platforms and graphics processing units (GPUs) for training has been shown in [8]. Although the results for data-driven designs are encouraging, they suffer from scalability issues due to the high computational complexity required by these models. This becomes clear when considering the short symbol duration of just ≈66.67 μs in long-term evolution [9] and the even shorter duration in 5G, which can be as low as ≈4.17 μs [10]. Combined with a large number of base station antennas, users, and subcarriers, the complexity of data-driven designs easily becomes infeasible.
Model-driven designs avoid this problem of infeasible complexity by building on the established results in communication science. Existing algorithms are extended by machine learning elements. Most commonly, some parameters of existing algorithms are optimized via machine learning, leading to higher performance. An overview of model-driven techniques can be found in [11].
Message passing algorithms, such as belief propagation (BP) [12] and orthogonal approximate message passing (OAMP) [13], [14], are commonly used in massive MIMO detection. Both classes of algorithms are popular choices for extension with machine learning enhancements in a model-driven design flow. The dampening factors and other parameters can be learned in BP, thereby improving the performance [15], [16]. Our work is based on recent works, namely, OAMPNet [17] and MMNet [18]. These works show high detection performance and moderate computational complexity. Furthermore, MMNet is one of the few detectors verified using a realistic channel model.
We notice a lack of hardware evaluation of machine learning-based detector algorithms in the literature. For massive MIMO detection, the work in [15] presents an application-specific integrated circuit (ASIC) processor based on the simplified message passing detector (sMPD) algorithm. A real-time MIMO orthogonal frequency-division multiplexing (OFDM) detector implemented via an echo state network is presented in [19]. The antenna configuration is, however, only four receive and four transmit antennas. A GPU-based, but non-real-time, autoencoder for a basic communication system is implemented in [8]. Many proposed ML detector algorithms have very high computational complexity. This, and the lack of real-time hardware implementations in the literature so far, motivates the research presented in this work.

A. Contribution
We develop and profile LAMANet, a novel, AMP-based, machine learning-enhanced massive MIMO detector. From an algorithmic point of view, we propose to base the detector on AMP instead of the more complex OAMP algorithm to reduce complexity, propose a new way of incorporating learnable matrices in the AMP algorithm to prevent the loss of performance in untrained detectors, thereby relaxing the online training requirement of MMNet, and remove unnecessary computations in the AMP algorithm, as their functionality is replaced by the learned parameters. After an assessment of computational complexity, we develop hardware architectures for real-time processing of the detector algorithm and the required preprocessing algorithms. We implement and simulate the algorithms for a Xilinx RFSoC field-programmable gate array (FPGA). The presented work is important as it shows one of the first real-time machine learning massive MIMO detectors.

B. Organization and Notation
In Section II, we present our massive MIMO system model and introduce traditional AMP detection algorithms. In Section III, we derive the novel LAMANet algorithm from previous work and improve on computational complexity to enable efficient processing in hardware. Next, in Section IV, we introduce design considerations for a custom hardware accelerator for LAMANet-type detectors and the required preprocessing accelerators. The algorithm and circuit design are evaluated in Section V. The algorithm is evaluated in terms of symbol error rate (SER) detection performance, while the circuit design is evaluated in terms of circuit metrics, such as resource utilization, throughput, and latency. Section VI presents the conclusions.
The most common notation and symbols used throughout the rest of the presented work are listed in the Nomenclature.

II. SYSTEM MODEL AND AMP DETECTION

A. Massive MIMO
A generic massive MIMO system can be described as [20] y = Hx + n (1) with the complex channel matrix H. The base station receives the complex-valued vector y, and the single-antenna user terminals (UTs) transmit simultaneously to form the transmit vector x. The additive noise n is complex Gaussian distributed, n ∼ CN(0, σ²I). To simplify the processing architecture, and since most machine learning frameworks are not capable of processing complex numbers, the channel matrix is decomposed into a real-valued channel matrix by stacking real and imaginary parts, H → [Re(H) −Im(H); Im(H) Re(H)] ∈ R^(M_R×M_T), where M_R and M_T denote twice the number of receive and transmit antennas, respectively. In a similar fashion, y is decomposed to y ∈ R^(M_R), x is decomposed to x ∈ R^(M_T), and n is decomposed to n ∈ R^(M_R). The values of x are defined by the constellation points c ∈ X ⊂ C with |X| = M_C × M_C. We consider the modulation types QPSK, QAM16, QAM64, QAM256, and QAM1024. For all experiments, we assume perfect power control such that the columns of H are normalized to one (i.e., unit ℓ2-norm).
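The real-valued decomposition described above can be sketched in a few lines of NumPy. This is the standard stacking of real and imaginary parts; the extracted text omits the explicit formula, so the helper name `to_real` is ours, but the construction itself is the conventional one.

```python
import numpy as np

def to_real(H, y, x, n):
    """Standard real-valued decomposition of a complex MIMO system.

    The complex model y = Hx + n is mapped to an equivalent real-valued
    model so that real-valued (machine learning) processing can be used.
    """
    H_r = np.block([[H.real, -H.imag],
                    [H.imag,  H.real]])
    y_r = np.concatenate([y.real, y.imag])
    x_r = np.concatenate([x.real, x.imag])
    n_r = np.concatenate([n.real, n.imag])
    return H_r, y_r, x_r, n_r
```

The decomposition preserves the linear relation, i.e., H_r x_r + n_r reproduces the stacked real and imaginary parts of y.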

B. Approximate Message Passing
AMP belongs to the family of iterative thresholding algorithms and was initially conceived for compressed sensing applications [21]. Since then, variations of the algorithm have been applied to various statistical estimation tasks, such as machine learning, image processing, and communications [22]. Based on this, Jeon et al. [13], [23] proposed the LArge MIMO AMP (LAMA) algorithm, which is shown in Algorithm 1.

[Algorithm 1: LAMA Algorithm]
[Algorithm 2: Mean and Variance Functions F and G]
The function F(z^i, τ) produces the mean value of z^i (i.e., the denoised, new estimate x^(i+1)). The function G(z^i, τ) produces the variance of z^i. Both functions can be implemented as seen in Algorithm 2. It is worth noting that the linear function generating z^i can be easily expanded to z^i = x^i + H^T r^i, where r^i = y − Hx^i + v^i is the residual term including the Onsager correction v^i.
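A minimal NumPy sketch of such mean and variance functions is given below. It assumes a uniform prior over the constellation and a softmax over per-symbol Gaussian log-likelihoods; the exact scaling of τ (e.g., the factor of 2 applied in the hardware) follows Algorithm 2, which is not reproduced here, so the plain 1/τ scaling is our assumption.

```python
import numpy as np

def _weights(z, tau, symbols):
    # Softmax over per-symbol Gaussian log-likelihoods -(z - c)^2 / tau,
    # assuming a uniform prior over the constellation points.
    d = -(z[:, None] - symbols[None, :]) ** 2 / tau
    d -= d.max(axis=1, keepdims=True)          # numerical stability
    w = np.exp(d)
    return w / w.sum(axis=1, keepdims=True)

def F(z, tau, symbols):
    # Posterior mean: the denoised estimate of x.
    return _weights(z, tau, symbols) @ symbols

def G(z, tau, symbols):
    # Posterior variance of the same per-element distribution.
    w = _weights(z, tau, symbols)
    mean = w @ symbols
    return w @ symbols ** 2 - mean ** 2
```

For τ → 0, F snaps each entry of z to the nearest constellation point and G approaches zero; for large τ, F approaches the prior mean.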
Another often-used variant of the original AMP algorithm for massive MIMO detection is OAMP [14]. It was introduced to improve the reliability of the classical AMP algorithm for channel matrices that do not follow an independent and identically distributed (i.i.d.) Gaussian distribution. The OAMP algorithm is shown in Algorithm 3. It can be implemented in different versions, allowing a trade-off between performance and computational complexity. Setting W^i to either the transposed channel matrix (H^T) or even to the pseudoinverse of the channel matrix [pinv(H)] allows for reduced computational complexity, since W^i, B, tr_B, and tr_W can be calculated during preprocessing and are the same for all iterations. This approach, however, reduces the performance. On the other hand, computing the minimum mean square error (MMSE) matrix and the optimal W^i matrix for each iteration might be prohibitively complex in many applications. Incorporating a learnable matrix in this algorithm, as seen in line 4, is discussed in Section III.

[Algorithm 3: OAMP Algorithm]

III. MACHINE LEARNING-ENHANCED AMP ALGORITHMS
It is possible to extend the AMP-and OAMP-based detectors with learnable parameters in a model-driven design methodology. These detectors can then be trained on channel models or field measurements to optimize the numerical values of the embedded, learnable parameters and improve the detector performance.
The traditional OAMP algorithm, as shown in Algorithm 3, can be easily extended to include learnable parameters. He et al. [20] proposed to add learnable, scalar parameters to the OAMP algorithm in order to allow for an optimal step size and noise estimation. In the linear function at line 10 of Algorithm 3, the parameter θ_1 is added such that the update becomes z^i = x^i + θ_1 W^i r^i. Furthermore, the parameter θ_2 can be added to τ to scale the noise variance in line 22. In OAMPNet, the matrix W^i is chosen to be the optimal matrix (W_type = opt). In a previous implementation, the pseudoinverse of H is assigned to all W^i (W_type = pinv). This network was called TISTA [24].
The work in [18] introduces other learnable parameters into the OAMP algorithm and presents two versions, namely, MMNetiid and MMNet. The former sets W i = H T and introduces a learnable parameter in the linear part and noise variance estimation part similar to that of OAMPNet. MMNet on the other hand introduces W i itself as a learnable matrix and adds a vector to scale the noise estimate individually for each element of x.
MMNet's performance on realistic channels is impressive; however, the cost of this performance is its requirement of online training. Online training means that before the detector can be used, it has to be trained on the specific realization of the channel in the current coherence interval. The training could be performed based on on-the-fly generated, random input data and the current channel matrix. If the detector was trained on a slightly different channel matrix, it suffers a complete performance loss. In practice, this means an excessively high detection latency, as the training of the detector needs to be completed first.
On the other hand, OAMPNet performs well without online training; however, its computational complexity is challenging for real-time deployment. It requires the MMSE matrix calculation with current noise estimates for each iteration of the algorithm. This includes a full matrix inverse in each iteration (Algorithm 3, line 14).

A. LAMANet
OAMP was introduced to stabilize the AMP algorithm in case the entries of the transformation matrix do not strictly adhere to an i.i.d. Gaussian distribution. This makes OAMP applicable to a wider range of problems, including detection in massive MIMO. TISTA, OAMPNet, MMNetiid, and MMNet are based on that method. However, the reason why MMNet performs particularly well on real-world channels far from the ideal i.i.d. case is that, in the linear function, the noise present in z can be shaped very close to a Gaussian distribution by learning the matrices W^i. In the following results, we show that, based on this reasoning, the usage of the complex OAMP algorithm is not required for machine learning-enhanced massive MIMO detection. Instead, the more traditional and computationally simpler AMP algorithm can be enhanced with learnable parameters. Therefore, we propose to add similar learnable parameters to the LAMA algorithm instead of to the OAMP algorithm. The modifications themselves closely follow the proposal made in MMNet. In this work, we do not consider the simple case of i.i.d. entries in H but the more general case with realistic channel models. We call this first proposal LAMANetBL (for LAMANet BaseLine), as it closely follows MMNet's extensions applied to the LAMA algorithm and since we will use it for comparison. The addition of the learnable matrix Θ^i to the expanded, linear part of Algorithm 1 (line 4) is given by z^i = x^i + Θ^i r^i. Similarly, a learnable noise scaling parameter θ is introduced to Algorithm 1 (line 7), scaling the noise estimate τ^(i+1). In LAMANetBL, the learnable matrix Θ^i has the form of a full M_T × M_R matrix. To reduce the computational complexity, we consider the condensed form of the linear part in Algorithm 1 (line 4) and enhance it with two learnable matrices per iteration: z^i = x^i + y_MF − (Θ^i_1 ⊙ G + Θ^i_2)x^i, with the matched-filter output y_MF = H^T y and the Gram matrix G = H^T H. The learnable matrices are elementwise multiplied and added. We name this approach LAMANetC (for LAMANet Condensed). The number of multiplications in the linear part for LAMANetBL is 2M_R M_T, whereas for LAMANetC, it is 2M_T M_T.
In the large system limit (M_T ≪ M_R), this difference can become quite significant. In the condensed form, it is also possible to define a learnable matrix with the dimensions M_T × M_T and multiply it with G directly; however, this would be more expensive in terms of computational cost with (M_T)³ multiplications. It is important to notice that the above simplification of the linear function can only be applied in AMP-type detectors such as LAMA and LAMANet, not in OAMP-type detectors such as OAMPNet and MMNet. The reason is that, in the latter, the residual r^i has to be generated separately, as it is used in the estimate of the noise variance (Algorithm 3, line 12).
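The condensed linear part can be sketched in NumPy. The elementwise placement of the two learnable matrices (named `T1` and `T2` here) is our reconstruction of the garbled equation; at identity initialization (ones and zeros), the condensed update is algebraically identical to the expanded residual form, which the sketch verifies.

```python
import numpy as np

rng = np.random.default_rng(0)
MR, MT = 16, 4
H = rng.standard_normal((MR, MT))
x = rng.standard_normal(MT)
y = rng.standard_normal(MR)

# Expanded linear part: z = x + H^T (y - H x)
z_expanded = x + H.T @ (y - H @ x)

# Condensed form with precomputed matched-filter output y_MF = H^T y
# and Gram matrix G = H^T H (both available from preprocessing):
y_MF, G = H.T @ y, H.T @ H
z_condensed = x + y_MF - G @ x

# Condensed form with two learnable M_T x M_T matrices, elementwise
# multiplied and added; identity initialization (T1 = ones, T2 = 0)
# recovers the untrained condensed update exactly.
T1, T2 = np.ones((MT, MT)), np.zeros((MT, MT))
z_learn = x + y_MF - (T1 * G + T2) @ x
```

Because G and y_MF come from preprocessing, the per-iteration matrix work involves only M_T × M_T operands instead of M_R × M_T ones.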
To confirm the validity of elementwise operations, we propose and evaluate another design, which defines the linear function as z^i = x^i + (Θ^i_1 ⊙ H^T + Θ^i_2)r^i. We call this approach LAMANetEW (for LAMANet Elementwise). Its computational complexity is higher than that of LAMANet, as an additional M_R × M_T multiplications and additions are required.
The results in Section V-A show that it is not strictly necessary to calculate a new noise estimate (τ^(i+1)) and Onsager term (v^i) in every iteration when online learning on a specific channel is performed. This is because the noise variance can be learned by θ for each layer, and the Onsager term can be learned by the matrices in the linear part. This saves computational cost in the LAMANet-type algorithms, as the corresponding computations in Algorithm 1 (lines 6-8) can be omitted. We refer to this detector type as LAMANet.

[Algorithm 4: LAMANetMMSE Algorithm]

B. LAMANetMMSE
As mentioned above, MMNet, LAMANetBL, LAMANetC, and LAMANet perform poorly when not trained on a specific channel realization. On the other hand, OAMPNet performs decently without online training due to the choice of the optimal matrix in its linear part (Algorithm 3, lines 14 and 17, W_type = opt). However, its computational complexity can be infeasibly high, as it requires a matrix inversion in every iteration of the algorithm. Also, its trained performance is lower than that of MMNet. We propose to initialize LAMANet with the MMSE matrix in the linear part, which avoids both problems. The MMSE matrix is calculated only once per coherence interval, thereby reducing the computational effort to acceptable levels, while the detection performance is also kept at reasonable levels when not trained on a specific channel, relaxing the online training requirement. We name this approach LAMANetMMSE and show it in Algorithm 4. In this algorithm, only the elementwise modification of W^i via the learnable matrices is required.

C. LAMANetOpt
In a similar fashion to LAMANetMMSE, and as proposed in OAMPNet, the matrix W^i can be set to the optimal matrix [20], i.e., Ŵ^i = v²_i H^T(v²_i HH^T + σ²I)^(−1), where v²_i is estimated from the residual r^i, and the final matrix is normalized such that tr(W^i H) = M_T [14]. For the stability of the algorithm, it is important that v²_i does not become zero or negative [14]. For this reason, we set v²_i = max(v²_i, LSB), where LSB is the least significant bit of the fixed-point datatype of v²_i. Since W^i ultimately depends on the residual r^i, it has to be calculated in every iteration of the algorithm. This is undesirable because the matrix operations are expensive (in particular, the matrix inverse). To reduce the complexity, we propose an eigenvalue decomposition of the Gram matrix in the preprocessing, i.e., once per channel coherence interval: G = H^T H = DVD^T, where V ∈ R^(M_T×M_T) is a diagonal matrix with the eigenvalues on its diagonal and D ∈ R^(M_T×M_T) holds in its columns the right eigenvectors. Then, the required inverse can be calculated in each iteration as Ŵ^i = v²_i DM^(−1)_i D^T H^T, with the argument of the inverse given by the diagonal matrix M_i = v²_i V + σ²I, whose inversion reduces to elementwise reciprocals. We call this proposal LAMANetOpt.
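The eigendecomposition trick can be checked numerically. Assuming G = DVD^T with eigenvectors D and eigenvalues on the diagonal of V (as in the text), the per-iteration inverse of v²G + σ²I collapses to elementwise reciprocals of a diagonal; the variable names below are ours. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
MR, MT = 16, 4
H = rng.standard_normal((MR, MT))
G = H.T @ H                      # Gram matrix, symmetric positive semidefinite
sigma2, v2 = 0.1, 0.5

# Preprocessing (once per coherence interval): eigendecomposition of G.
eigvals, D = np.linalg.eigh(G)   # G = D @ diag(eigvals) @ D.T

# Per iteration: the matrix inverse reduces to elementwise reciprocals
# of the diagonal argument M_i = v2 * eigvals + sigma2.
M_inv = 1.0 / (v2 * eigvals + sigma2)
A_inv_fast = D @ (M_inv[:, None] * D.T)

# Reference: direct inversion of the same matrix.
A_inv_direct = np.linalg.inv(v2 * G + sigma2 * np.eye(MT))
```

Since M_i is diagonal, the per-iteration "inverse" costs only M_T reciprocals instead of a full O(M_T³) matrix inversion.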

D. Other Initializations
A low-complexity initialization of the LAMANet detector can be given by setting W to the matched filter H^T, or to diag(G)^(−1)H^T, where the function diag() takes the diagonal of an input matrix. Both schemes do not require any additional computations per coherence interval; however, they reduce the untrained performance. For this reason, we do not evaluate them further.
The initialization of the weights in HyperMIMO is provided by meta-learning [25]. A dedicated neural network learns the weights and provides them to an MMNet-type detector. The neural network produces the weights based on QR-decomposed channel matrices as its input and can accommodate slow changes in the user location. However, initial training on the user positions is required, and the additional neural network and QR decomposition are computationally expensive. We see HyperMIMO as an alternative to the MMSE and Opt initializations.

E. Algorithmic Complexity
The algorithmic complexity is one of the main performance metrics to evaluate for any massive MIMO detector, as it has a major impact on circuit performance metrics such as latency, resource requirements, and throughput. In Table I, the computational complexity in terms of the number of multiplications and additions is listed for the various detectors; computations performed once per coherence interval (preprocessing) are distinguished from computations performed for every channel use, where light gray indicates the latter. The complexity of the denoiser functions F and G is listed in Table II. Required divisions are not listed; however, these can be easily estimated from the algorithmic listings. The complexity for various antenna configurations and modulation types is shown in Fig. 1. For clarity, only the number of multiplications is shown. The number of additions is approximately equal to the number of multiplications.
The advantage of using the AMP-based LAMANet over the OAMP-based OAMPNet or MMNet can be easily seen in Fig. 1(a). OAMP-based designs need a large amount of preprocessing, while the amount of preprocessing in the AMP-based designs is zero (not shown on the log scale). Differences in the computations for each channel use between OAMPNet, MMNet, and LAMANet can be seen to be negligible in Fig. 1(b). The introduction of elementwise extended detectors (LAMANetEW and LAMANet) shows an approximate increase in the number of multiplications by 32%. The simplifications introduced in LAMANet do not significantly contribute to the reduction in computational complexity. The LAMANetMMSE and LAMANetOpt detectors have the same computational complexity as LAMANet plus an additional term for their respective calculations. These calculations require a separate complexity analysis, as they highly depend on the chosen algorithm to perform them. For LAMANetMMSE, this additional computation will be purely in preprocessing, while for LAMANetOpt, preprocessing and the iterative detection part are affected. Hardware architectures for LAMANetMMSE and LAMANetOpt are proposed in Sections IV-B and IV-C and evaluated in Sections V-B2 and V-B3, respectively. For a complexity analysis of traditional massive MIMO detection algorithms, we refer the interested reader to [26].
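The linear-part multiplication counts quoted in Section III-A can be captured in a small helper (the function name is ours); it makes the large-system advantage of the condensed form explicit for the evaluated configuration.

```python
def linear_part_mults(MR, MT, condensed=False):
    """Multiplications in the linear part per iteration.

    Expanded form (LAMANetBL-style): residual (MR*MT) plus
    matched filtering (MT*MR) -> 2*MR*MT.
    Condensed form (LAMANetC): elementwise matrix combination (MT*MT)
    plus matrix-vector product (MT*MT) -> 2*MT*MT.
    """
    return 2 * MT * MT if condensed else 2 * MR * MT
```

For the real-valued configuration used later (M_R = 128, M_T = 32), this gives 8192 versus 2048 multiplications per iteration, a factor of M_R/M_T = 4 reduction.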

IV. ACCELERATOR DESIGN CONSIDERATIONS
In this section, we present design considerations for implementing the LAMANetMMSE and LAMANetOpt detector circuits. We target the circuit for FPGA implementation, as FPGAs provide flexibility for prototyping and are often considered for remote radio unit (RRU) or distributed unit (DU) deployment in the radio access network (RAN). Some of the design considerations revolve around the fact that we choose Xilinx FPGAs and the Xilinx HLS design flow for our design. As dictated by the TensorFlow machine learning environment, the hardware design follows the transformation of the complex system to a real-valued system according to Section II-A.

A. LAMANet
The LAMANet design shows high detection performance while reducing computational complexity significantly. In the design of the accelerator, we follow the algorithmic description of Algorithm 4. The initialization via MMSE matrix in Algorithm 4 is just one possibility. The LAMANet accelerator is agnostic to the type of initialization matrix provided.
We choose to implement the accelerator as a noniterative, deeply pipelined circuit for maximum throughput. As the LAMANet algorithm itself is iterative in nature, multiple instances of the circuit may be deployed, each implementing one iteration, for high throughput. Alternatively, the output data might be fed back into the same instance of the accelerator to implement a low-resource design. The accelerator's high-level design is shown in Fig. 2.
The first step in any new iteration is to store the input vectors x and y and the stream H_stream in internal memory. Ideally, this step would be omitted; however, the Vitis HLS design tool we use requires this step to correctly implement the pipeline via the "HLS DATAFLOW" pragma [27]. All data exchanged between pipeline stages need to follow a single-producer, single-consumer approach (relevant for x). Also, pipeline stages should not be bypassed (relevant for H_stream).
As shown in Fig. 2, we make use of four data formats throughout the design. 1) IF_t: Smallest number of bits, mostly used for inputs to the detector or to save memory space. In our design, this is an 18-bit signed fixed-point number of which 11 are fractional bits. The length of 18 bits is ideal for mapping to Xilinx memory elements and provides good accuracy. 2) IFH_t: Similar to IF_t, but used to stream the H matrix. This format uses all 18 bits as fractional bits, as the absolute value of the entries of H is guaranteed by the optimal power control to be smaller than one. 3) C_t: Typical format for computational results. In our design, this is a 27-bit signed fixed-point number of which 19 are fractional bits. This is ideal for Xilinx Ultrascale+ DSP slices, as their multipliers can handle one input of up to 27 bits. 4) L_t: Format for high-precision calculations with a large integer part. In our design, this is a 32-bit signed fixed-point number of which 12 bits are fractional.
Larger matrices in the design are streamed into the accelerator via the AXI4-Stream protocol. This provides a standard interface without the need for extra buffering of data (such as in local memory). The sources of the streams might be other blocks in a receiver design or memory. We decide to implement a stream width of 32 values, i.e., 576 bits at 18 bits per value. For high-bandwidth memory (HBM) accelerator cards, the bit-width of IF_t might be reduced to 16 bits to access an HBM memory port with 512 bits in parallel per stream (e.g., [28]).
In parallel with storing the various input data, the streams Θ_1,stream, Θ_2,stream, and W_stream are provided to the accelerator. The data of these streams are not stored but directly used to calculate the intermediate variable M. The number of parallel processing elements in this stage is a compile-time parameter and can be adjusted to match the pipeline latency.
Next, the residual is calculated in one pipeline stage, and following that, the noisy tx-vector estimate z is calculated. It can be noted that in the LAMANet detectors, the Onsager term is not considered (Algorithm 4, line 7). The noise estimate τ_inv in the next pipeline stage is formed simply from the learned noise scaling parameter and the reciprocal of the noise estimate N0. We use the reciprocal of the noise estimate N0 as input to the accelerator to reduce the number of required divisions in the system to one per noise estimate. The Arggen pipeline stage also implements the required multiplication of τ_inv by two via an arithmetic left shift (Algorithm 2, line 3). It generates a 2-D array by taking the difference between each entry of z and each possible symbol value according to Algorithm 2 (lines 2-4). After squaring the result, τ_inv is multiplied by each corresponding row. Generating τ_inv instead of τ is not more complex but saves a significant number of divisions in this stage.
In the next step, the calculated 2-D array is passed through the exponential function. Implementing this function directly in hardware would be too costly, so instead, we use a lookup table (LUT) approach. In the Expargshift stage, the values are prepared as inputs to the LUT as follows. First, the maximum of the current row (i.e., along the symbol axis) is found. The difference between the maximum input value of the LUT and this maximum value is calculated. In the next step, this difference is added to each value of the row. This has the effect of shifting the values of the row into the input range of the LUT: the maximum value of the row is placed at the maximum input value of the LUT. In this way, the limited range of the LUT is used most efficiently. This linear shift of the input results in a common scaling of the exponential function's outputs; however, since for this application only the relative weights of the output values matter, the scaling is canceled by the subsequent normalization and is therefore acceptable. In our experiments, we observe good matching with the ideal denoiser function. If an input value of the exponential function is below the minimum input value of the LUT despite the shifting, it is set to the LUT's minimum value. The LUT has 1024 samples, and its input range is from −10.4 to 0. The minimum input corresponds to an output value of one LSB (2^−15). Each value of the LUT is quantized to 18 bits. This leads to a total memory usage of 1 BRAM18 (0.5 BRAM36).
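The row-shifting idea can be illustrated in a few lines of NumPy. The example row values are hypothetical; the LUT input range is the one stated above. Because the softmax normalization is invariant under a per-row constant shift, the shift only affects LUT range utilization, and only the clipping of very small arguments introduces a (tiny) approximation error.

```python
import numpy as np

LUT_MIN, LUT_MAX = -10.4, 0.0    # LUT input range used in the design

def softmax(v):
    e = np.exp(v)
    return e / e.sum()

# One row of exponent arguments (hypothetical values), as produced
# by the Arggen stage for one entry of z across the symbol axis.
row = np.array([-30.0, -12.5, -11.0, -14.2])

# Expargshift: add a constant so the row maximum lands at the LUT's
# maximum input; values still below LUT_MIN are clipped to LUT_MIN.
shifted = np.maximum(row + (LUT_MAX - row.max()), LUT_MIN)
```

After the ExpLUT and Softmax stages, the clipped entries carry negligible weight, so the soft symbol estimate is essentially unchanged.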
In the next stage (ExpLUT and Sum), the output of the exponential LUT is obtained, and a rowwise sum is calculated. The reciprocal of each rowwise sum is calculated via a pipelined divider, which saves divisions in the next step. The Expargshift and ExpLUT pipeline stages are further pipelined with an initiation interval of one clock cycle.
In the Softmax step, each value is multiplied by its respective, reciprocal, rowwise sum. This is the last step before obtaining the new x by multiplying the resulting 2-D array with the symbol alphabet. The symbol alphabet is specified via compile-time defines. The newly calculated x is the final output of the accelerator and is specified in the C t -format for high accuracy. Evaluation results of the designed accelerator can be found in Section V-B1.

B. LAMANetMMSE
As mentioned above, the LAMANet detector supports many different forms of initialization without changes to its structure. The MMSE initialization provides a good compromise between computational complexity and performance. In order to evaluate the feasibility of the LAMANetMMSE detector, we present a possible implementation of a direct MMSE matrix calculation according to Algorithm 4 (line 4). It is worth noting that in regular massive MIMO detectors, there are more efficient ways to perform detection without explicitly calculating the inverse matrix [26]; Neumann series expansion and Cholesky decomposition-based approaches have also been proposed [29], [30]. We base our implementation on the matrix inverse in Xilinx's Vitis accelerated libraries, which uses the Cholesky decomposition [31]. The data type for all signals in the accelerator is single-precision floating point [32]. Floating-point numbers are required for the matrix inverse to ensure algorithmic stability. The accelerator can use double-precision numbers via a compile-time switch; however, in our experiments, we did not see significant performance improvements with it. The accelerator consists of three main stages, as shown in Fig. 3. To satisfy the Vitis HLS tool's pipeline requirements, the H matrix is first stored in local memory and another temporary memory (MemH2.2). Then, the input argument for the matrix inverse (A) is calculated from the noise variance and the channel matrix. The resulting matrix is stored in a parallel-in parallel-out (PIPO) channel for use in the inverse. Since A is positive definite, the Cholesky decomposition can be used to generate a triangular matrix such that A = LU with U = L^T, where A, L, and U have real-valued entries. To generate L, first, the diagonal entries are directly calculated in a pipelined fashion.
Then, with a number of parallel off-diagonal calculation units, all off-diagonal entries are calculated. The number of these processing cores is configurable at compile time. The L matrix is used to generate a columnwise intermediate signal d in a forward-substitution process. From this intermediate signal and L^T, the corresponding column of the inverse is computed in a backward-substitution process. The process is repeated for each column of the inverse matrix A^(−1). Finally, the MMSE matrix is calculated as W = HA^(−1).
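The column-by-column inversion scheme described above can be sketched in NumPy. Here `np.linalg.solve` applied to the triangular factors stands in for the accelerator's dedicated forward- and backward-substitution units, and the problem sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
MR, MT = 16, 4
H = rng.standard_normal((MR, MT))
sigma2 = 0.1

# Argument of the matrix inverse: positive definite by construction.
A = H.T @ H + sigma2 * np.eye(MT)

L = np.linalg.cholesky(A)        # A = L @ L.T, L lower triangular

# Column-by-column inversion, as in the accelerator: for each unit
# vector e_j, forward-substitute L d = e_j, then back-substitute
# L.T a_j = d to obtain the j-th column of A^-1.
A_inv = np.empty((MT, MT))
I = np.eye(MT)
for j in range(MT):
    d = np.linalg.solve(L, I[:, j])        # forward substitution (lower)
    A_inv[:, j] = np.linalg.solve(L.T, d)  # backward substitution (upper)
```

Because each column only depends on its own unit vector, the columns can be processed by parallel units in hardware.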

C. LAMANetOpt
For the option of initializing LAMANet with the optimal matrix, we propose to perform the eigenvalue decomposition of the Gram matrix once per channel realization according to (14). The proposed design first calculates the Gram matrix (G = H^T H) and then the eigenvalues and eigenvectors based on an IP core from Xilinx's Vitis accelerated libraries [31]. The eigenvalue decomposition uses the one-sided Jacobi method, as the Gram matrix is symmetric. Off-diagonal entries of the matrix are eliminated in an iterative process using 2 × 2 Jacobi rotations [33]. A schematic of the circuit is shown in Fig. 4.
The LAMANetOpt detector itself has to be modified according to (11), (13), (14), and (17) to make use of the obtained eigenvalues and eigenvectors. For this purpose, we implement a streaming interface for the eigenvectors consisting of 32 IF_t entries. Another stream with H is provided to calculate Ŵ^i. The eigenvalues are provided via a standard interface. The calculations are implemented in a pipelined fashion. First, the reciprocal of M_i is calculated in two stages. Then, it is multiplied with the eigenvectors in another three stages to calculate Ŵ^i according to (14). However, v²_i is not multiplied in, for reasons of numerical stability in the case of small values of v²_i. This does not influence the final result W^i, as the division of Ŵ^i by tr(Ŵ^i H) cancels the multiplication with v²_i regardless. Finally, Ŵ^i is calculated and provided for further processing to the unchanged rest of the algorithm. The performance results are presented in Section V-B1.

V. PERFORMANCE EVALUATION
The baseband processing for massive MIMO in a 5G-type RAN can be located either directly in the DU, close to the cell antennas, or in a centralized unit (CU) [34]. This choice influences the type of accelerator deployed for processing. In our evaluation, we choose Alpha Data's RFSoC board [35], which is most suitable for deployment in DUs due to its integrated RF chains. The board hosts the Xilinx Zynq UltraScale+ XCZU27DR-2 FPGA, DDR4 memory, and RF infrastructure to support the system-on-chip (SoC) operation. However, the presented algorithm could also easily be deployed in a data center FPGA for usage in the CU.

A. Detection Performance
In this section, we present and compare the detection performance of the proposed neural networks. Most results are obtained by simulation in TensorFlow unless otherwise specified. The TensorFlow framework is based on [36], although significant alterations have been made and new detectors have been implemented. We use Python 3.6.9 and TensorFlow 1.13. Similar to the work in [18], we generate a realistic channel model via QuaDriGa with the parameters shown in Table III. Without loss of generality, we choose a configuration of 64 base station antennas and 16 mobile users for the detection performance measurements (M_R = 128 and M_T = 32 after real-valued decomposition). For the hardware verification given next, we evaluate more configurations. For each detector, we provide results for the QPSK, QAM64, and QAM1024 modulation types. The modulation types QAM16 and QAM256 behave similarly and are omitted for brevity.
1) Untrained Channel Realization: First, we compare the detectors with each other in an untrained state. It is important that a detector performs reasonably well in an untrained state so that the training process on a specific channel can be removed from the critical path (i.e., does not contribute to the latency) of the detector. This is the main advantage of a well-initialized detector. Fig. 5 shows the untrained SER performance of the detectors.
MMNet and LAMANetBL perform poorly in the untrained state, giving SERs in the range of 0.5-0.9. This loss in performance for MMNet is easily understood, as the matrix in the linear part has to exactly match the current channel realization to perform the noise shaping correctly. The classical IO-LAMA [13] algorithm also performs poorly. On the other hand, the approaches that initialize W_i with a sensible guess (such as the MMSE matrix or the OAMP optimal matrix) maintain much better SER performance. The learnable parameters can be initialized to the identity elements of their respective operators in order to avoid any influence in the untrained state. This is why LAMANetMMSE and LAMANetOpt achieve approximately the same performance as the MMSE and OAMP detectors even in the untrained state. Fig. 5 shows that, even when untrained, the performance of the simplified LAMANet detectors (LAMANetMMSE and LAMANetOpt) is not degraded compared to the LAMANet baseline detectors (LAMANetBLMMSE and LAMANetBLOpt).
2) Trained Channel Realization: Each detector is trained for 900 epochs with a batch size of 300 samples at the highest SNR value of the respective range. The loss is formed by comparing the actually transmitted symbol with the prediction of the detector. Based on the loss, the learnable parameters are updated in the backpropagation step of the ADAM optimizer. For testing, the batch size is increased to 10 000 samples (limited by GPU memory), and the results are averaged over ten epochs. The measurement is repeated for each SNR value separately. All the detectors are trained and tested on the same, randomly chosen channel realization for a configuration of M_R = 128, M_T = 32, and various modulation types in Fig. 6. It can be seen that OAMPNet cannot quite reach the performance of the LAMANet and MMNet detectors. Whether an MMSE matrix or the OAMP optimal matrix is used for W_i has no significant impact on the fully trained detector. At higher SNR values, LAMANet and LAMANetBL show a small performance loss compared to MMNet. To investigate this difference further, a cumulative distribution function (cdf) is shown in Fig. 7. The trained detection performance is obtained over 143 distinct channel realizations. Each detector is trained for 9000 epochs, and the performance is evaluated across 9000 channel uses. Only a small performance loss can be observed between MMNet and LAMANetBL, which confirms that an AMP-based design (instead of OAMP) suffices to achieve high performance if the algorithm is enhanced with machine learning elements. The detection performance is almost the same for LAMANetBL and LAMANet. This confirms that the learned parameters can compensate for the missing Onsager and tau-scaling calculations.
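The SER measurement underlying these curves reduces to a nearest-neighbor decision against the constellation. A minimal sketch (QPSK with unit average energy; the data and perturbation are illustrative, not the exact TensorFlow pipeline):

```python
import numpy as np

def ser(x_hat, x_true, constellation):
    """Symbol error rate: slice each soft estimate to the nearest
    constellation point and compare with the transmitted symbols."""
    idx_hat = np.argmin(np.abs(x_hat[:, None] - constellation[None, :]), axis=1)
    idx_true = np.argmin(np.abs(x_true[:, None] - constellation[None, :]), axis=1)
    return np.mean(idx_hat != idx_true)

# QPSK constellation, normalized to unit average symbol energy
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
x = qpsk[np.random.default_rng(2).integers(0, 4, 1000)]
```

Averaging this quantity over batches and channel realizations yields the SER points and the cdf in Figs. 6 and 7.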

B. Circuit Performance
In this section, we analyze and present latency, throughput, and utilization results for the LAMANet detector and its preprocessing circuits. We define circuit latency as the time from when the first input is provided to the circuit until the first result is fully produced. The circuit interval is the time between fully produced outputs once the pipeline is fully filled. Both measures are obtained by register transfer level (RTL) simulation. Utilization and timing results are obtained after out-of-context FPGA place and route (P&R). We follow this strategy for practicability reasons, since we evaluate a large number of different configurations (number of antennas and modulation type). The maximum clock frequency is estimated after P&R of the circuit. The target frequency (which influences the level of optimization during P&R) is set to 300 MHz for all circuits. The actually achieved clock frequency is used for throughput calculations. The presented configurations show the number of antennas, i.e., the complex system dimensions. After real decomposition, M_R and M_T are twice these values.
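Given these definitions, sustained throughput follows directly from the circuit interval and the achieved clock frequency. A small sketch (the 96-bit vector, 143-cycle interval, and 300-MHz clock are illustrative numbers, not measured values from the tables):

```python
def throughput_mbps(bits_per_vector, interval_cycles, f_clk_hz):
    """Sustained throughput once the pipeline is filled: one detected
    symbol vector is produced every `interval_cycles` clock cycles."""
    return bits_per_vector * f_clk_hz / interval_cycles / 1e6

# e.g., 16 complex users with QAM64 (6 bits each) = 96 bits per vector,
# a 143-cycle interval, and a 300-MHz clock (assumed for illustration)
rate = throughput_mbps(96, 143, 300e6)   # ~201 Mbit/s
```

Latency, by contrast, only affects the time to the first result and does not limit the steady-state rate.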
It is important to functionally verify the proposed accelerators. For the LAMANet accelerator, we use the TensorFlow implementation as a golden model. While running the TensorFlow evaluation, we export accelerator input and output data for each configuration and each tested SNR value after one iteration and after ten iterations. For all tests, we follow the design strategy in Vitis HLS, namely, we first verify the functionality in C-simulation, and after high-level synthesis, we verify the correctness of the RTL. The MMSE preprocessor is evaluated by comparing the MMSE matrix calculated in TensorFlow with the hardware-calculated MMSE matrix. The eigenvalue decomposition preprocessor is verified by recreating the Gram matrix from the accelerator's outputs (eigenvalues and eigenvectors) and comparing it with the Gram matrix calculated in the testbench.
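A golden-model comparison of this kind typically tolerates the quantization of the fixed-point hardware. The sketch below is illustrative: the one-LSB tolerance and the 12 fractional bits are assumptions, not values stated in the text.

```python
import numpy as np

def verify_against_golden(hw_out, golden_out, frac_bits=12):
    """Compare fixed-point accelerator output with the floating-point
    golden model, tolerating one LSB of quantization error.
    `frac_bits` is a hypothetical fixed-point format parameter."""
    lsb = 2.0 ** -frac_bits
    return bool(np.max(np.abs(hw_out - golden_out)) <= lsb)

golden = np.array([0.5, -0.25, 0.125])          # golden-model values
hw = np.round(golden * 2**12) / 2**12           # emulated quantized output
```

The same pattern applies to the exported one-iteration and ten-iteration LAMANet data sets.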
1) LAMANet: The detector LAMANet as described in Section IV is profiled for various configurations. The results are shown in Fig. 8, where the complexity of the configurations increases from left to right. The latency and interval are fairly constant across all configurations. The throughput lies between ≈40 and ≈210 Mbit/s, peaking at the largest ratios of M_R to M_T, except for the rightmost configurations, where the increase in interval time dampens the achievable throughput. A classical AMP accelerator for FPGA deployment is proposed in [37]; its latency and throughput are plotted in Fig. 8(a) for comparison. The number of resources used is largely determined by the number of connected user antennas. The choice between BRAM and URAM resources is left to the high-level synthesis tool. Except for the last configuration, the BRAM usage is fairly constant at ≈35 BRAM36 blocks. The detailed results after high-level synthesis for the configuration of M_R = 64, M_T = 16, and QAM64 modulation are shown in Table IV. The modules largely correspond to the pipeline structure in Fig. 2. It can be seen that the interval of many pipelined stages is close to the maximum of 143 clock cycles, which is ideal.
When comparing the proposed accelerator with previous work in Table V, we can see a competitive performance even though the proposed accelerator implements machine learning enhancements. We synthesized the design on a Xilinx Zynq-7000 device for a fair comparison with other work. Compared to the work in [37], which implements a similar AMP algorithm, our design achieves comparable throughput and throughput per used LUT. It uses comparatively more BRAM resources but fewer DSP resources and supports QAM64 modulation. Other detector types such as [38] and [39] tend to outperform the AMP-type detectors; however, these works implement linear detectors, whose SER performance is expected to be lower. We noticed some inefficiencies created by the HLS tool when targeting the Zynq-7000 platform, such as increased LUT and DSP usage. For reference, we provide the implementation results on current FPGA hardware (UltraScale+ RFSoC). The 3GPP technical specification TR 38.913 defines the target for user plane latency depending on the use case. For enhanced mobile broadband (eMBB), this latency is defined as 4 ms in the downlink and 4 ms in the uplink. However, the processing delay for the base station is just one part contributing to the overall delay, and the standards do not provide delay requirements for subcomponents such as detector circuits. In the 5th Generation New Radio (5G-NR) numerology, one slot consists of 14 symbols in the case of regular cyclic prefix length [10]. The target delay for the base station processing is one slot according to [40]. The duration of one slot depends on the subcarrier spacing and ranges from 1 ms for a subcarrier spacing of 15 kHz to 62.5 μs for a subcarrier spacing of 240 kHz. The last row of Table V shows the number of subcarriers that can be processed in real time in one slot of 15-kHz spaced subcarriers.
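The real-time subcarrier budget follows from the slot duration and the circuit interval: each subcarrier requires one detection per OFDM symbol, i.e., 14 per slot. A sketch (the 143-cycle interval at 300 MHz is an assumed example, not a value from Table V):

```python
def subcarriers_per_slot(interval_s, scs_khz=15, symbols_per_slot=14):
    """Subcarriers one detector instance can serve in real time: each
    subcarrier needs `symbols_per_slot` detections per slot, and the
    slot shortens with the subcarrier spacing (1 ms at 15 kHz)."""
    slot_s = 1e-3 * 15 / scs_khz
    return int(slot_s / (symbols_per_slot * interval_s))

interval = 143 / 300e6          # hypothetical circuit interval in seconds
n_15khz = subcarriers_per_slot(interval)                  # 15-kHz numerology
n_240khz = subcarriers_per_slot(interval, scs_khz=240)    # shortest slot
```

At 240-kHz spacing, the 62.5-μs slot shrinks the budget by a factor of 16, which is why the 15-kHz case is the natural reference in Table V.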
2) LAMANetMMSE: The right initialization of the LAMANet detector is important for its untrained detection performance. Initializing LAMANet with an MMSE matrix provides a good tradeoff between computational complexity and untrained detection performance. In Fig. 9(a) and (b), the latency and resource numbers are shown, respectively, for the MMSE matrix generation circuit presented in Section IV-B. The throughput decreases from ≈314k matrices/s in the smallest system configuration (32 × 8) to ≈19k matrices/s in the largest configuration (128 × 32). The resource usage in terms of DSPs and URAM storage is almost constant for all configurations, while BRAM usage increases strongly in the larger configurations. LUT and flip-flop (FF) utilization do not vary significantly, considering the large number of these resources available in modern FPGA devices.
The walking speed of 1 m/s in the SER simulation setting (Table III) results in a coherence time of ≈9.3 ms when approximated via T_c = λ/(2v), where λ is the wavelength at the center frequency and v is the UT's speed [41]. Using this relationship, Table VI shows the number of subcarriers that one instance of the MMSE circuit is capable of processing. For MMSE preprocessing, the number of matrices calculated per coherence interval is sufficiently large to process several thousand subcarriers with one preprocessing circuit, even for larger antenna configurations. Higher user speeds can also be tolerated.
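The coherence-time approximation is a one-liner; the carrier frequency used in the example below is illustrative rather than the value from Table III.

```python
def coherence_time_s(f_c_hz, v_mps):
    """Coherence-time approximation T_c ~ lambda / (2 v), with lambda
    the wavelength at the carrier (center) frequency."""
    wavelength = 3e8 / f_c_hz       # speed of light / carrier frequency
    return wavelength / (2 * v_mps)

# e.g., a hypothetical 3-GHz carrier at walking speed gives 50 ms
t_c = coherence_time_s(3e9, 1.0)
```

Multiplying T_c by the preprocessor's matrices-per-second rate gives the per-coherence-interval budget reported in Table VI; the budget scales inversely with user speed.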
3) LAMANetOpt: To achieve better SER performance in the untrained detector, it is possible to calculate the optimal initialization matrix according to Section IV-C. The throughput and resource usage for the required eigenvalue decomposition are shown in Fig. 10(a) and (b), respectively. It can be seen that resource usage, as well as throughput, is highly dependent on the number of connected users. This is understandable, as the majority of the computational work is spent on the eigenvalue decomposition of the Gram matrix G, which is an M_T × M_T square matrix. For the eigenvalue decomposition accelerator, the number of channel matrices per coherence interval is substantially lower than in the MMSE case. Considering a large number of subcarriers such as 1200, it can be seen that even at the relatively low speed of 1 m/s, only smaller antenna configurations can be processed without the instantiation of multiple preprocessing modules. As the choice of the initialization matrix influences the detector performance only in the untrained state, the hardware cost of optimal preprocessing might be too high for most cases.
The extension of the detector itself is costly, as our results in Table VII show. For the example antenna configuration of 64 × 16 (real-valued matrix dimensions are 128 × 32), we see a more than twofold increase in required DSP resources and a more than tenfold increase in BRAM resources. As we limit our DSP usage per pipeline stage to 32 DSPs, we identify the computation of Ŵ_i from H as the bottleneck, taking M_T² M_R / 32 = 4096 clock cycles. We allow other processes in the pipeline to also take this maximum latency to reduce resource usage. The throughput, however, is lowered to ≈1.6 Mb/s, compared to ≈48.6 Mb/s in LAMANet.
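The bottleneck cycle count is simple arithmetic over the real-valued dimensions and the per-stage DSP budget:

```python
def bottleneck_cycles(Mt, Mr, dsp_limit=32):
    """Cycles for the W_i-from-H pipeline stage when at most `dsp_limit`
    multiply-accumulate operations run in parallel: Mt^2 * Mr / dsp_limit."""
    return Mt * Mt * Mr // dsp_limit

# 64 x 16 complex antennas -> 128 x 32 real dimensions
cycles = bottleneck_cycles(32, 128)   # 4096, matching the text
```

Since every other stage is allowed to stretch to this same interval, the whole pipeline inherits the 4096-cycle rate, which explains the throughput drop relative to plain LAMANet.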

VI. CONCLUSION
Model-driven machine learning in the physical layer of communication systems has shown high SER detection performance in previous works. The work presented here implements an AMP massive MIMO detector that is extended by learnable parameters to realize these performance gains in a real-time hardware detector. We present a novel way of incorporating learnable parameters into the algorithm, allowing the detector to be initialized and to perform at a baseline level even when untrained. Some previous proposals perform very poorly in the untrained state (MMNet) or are prohibitively complex for real-time deployment and achieve lower SER performance (OAMPNet). The presented work shows comparable SER performance to MMNet while using the simpler AMP algorithm. Further algorithmic simplifications reduce the computational complexity and allow for an efficient hardware implementation. Our FPGA design is comparable to previous work in the literature in terms of FPGA resource utilization and throughput, while supporting machine learning enhancements and thereby increasing detection performance significantly after training. Furthermore, we implement the preprocessing required to initialize the detector and show that it is feasible to implement on FPGA hardware. This article demonstrates the feasibility of machine learning-enhanced massive MIMO detection in a real-time-capable FPGA accelerator.
Stefan Brennsteiner (Graduate Student Member, IEEE) received the B.Sc. degree in hardware-software design and the M.Sc. degree in embedded systems design from the University of Upper Austria, Wels, Austria, in 2013 and 2015, respectively. He is currently working toward the Ph.D. degree at the E-Wireless Group, School of Engineering, The University of Edinburgh, Edinburgh, U.K. The vision for his Ph.D. thesis is to find efficient ways of processing machine learning algorithms in the physical layer at scale and in real time.
From 2016 to 2018, he worked as an ASIC Digital Design and Verification Engineer for high-volume consumer products at NXP Semiconductors Austria GmbH & Co KG, Gratkorn, Austria. His research interests lie in novel signal processing schemes for the physical layer in 5G and beyond, digital circuit design, and field-programmable gate array (FPGA) technology.
Tughrul Arslan (Senior Member, IEEE) holds the Chair of Integrated Electronic Systems with the School of Engineering, The University of Edinburgh, Edinburgh, U.K. He is currently a member of the Integrated Micro and Nano Systems (IMNS) Institute and leads the Embedded Mobile and Wireless Sensor Systems (Ewireless) Group, The University of Edinburgh. He is the author of more than 500 refereed articles and an inventor of more than 20 patents. His research interests include developing low-power systems for wearable and portable applications.
Prof. Arslan has been a member of the IEEE CAS Executive Committee on VLSI Systems and Applications since 1999 and is a member of the steering and technical committees for a number of international conferences. He is a Co-Founder of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS) and currently serves as a member of its steering committee.