Prediction of the performance of pre‐packed purification columns through machine learning

Abstract Pre‐packed columns have been increasingly used in process development and biomanufacturing thanks to their ease of use and consistency. Traditionally, packing quality is predicted through rate models, which require extensive calibration efforts through independent experiments to determine relevant mass transfer and kinetic rate constants. Here we propose machine learning as a complementary predictive tool for column performance. A machine learning algorithm, extreme gradient boosting, was applied to a large data set of packing quality (plate height and asymmetry) for pre‐packed columns as a function of quantitative parameters (column length, column diameter, and particle size) and qualitative attributes (backbone and functional mode). The machine learning model offered excellent predictive capabilities for the plate height and the asymmetry (90 and 93%, respectively), with packing quality strongly influenced by backbone (∼70% relative importance) and functional mode (∼15% relative importance), well above all other quantitative column parameters. The results highlight the ability of machine learning to provide reliable predictions of column performance from simple, generic parameters, including strategic qualitative parameters such as backbone and functionality, usually excluded from quantitative considerations. Our results will guide further efforts in column optimization, for example, by focusing on improvements of backbone and functional mode to obtain optimized packings.


INTRODUCTION
Pre-packed chromatography columns are widely employed in process development and biomanufacturing. Their biggest advantage is to remove the burden of costly and time-consuming packing procedures and associated validation protocols, ultimately ensuring a consistent product. Packing quality is commonly quantified through the height equivalent to a theoretical plate (HETP), calculated from the response of the column following a pulse injection of a non-binding tracer, that is, residence time distribution (RTD) experiments. The HETP corresponds to the column length divided by the number of theoretical plates (N), with efficient columns characterized by relatively large N and small HETP values. According to the general rate model, the RTD response of a "well-packed" column is a symmetrical Gaussian peak. To better assess packing quality, RTD experiments are usually run under conditions for which hydrodynamic dispersion is the dominant contribution to mass transfer (negligible intraparticle mass transfer, no adsorption). Under these conditions (reduced velocity of about 1-10), the HETP approaches its minimum value, which theoretically depends only on the properties of the tracer, the velocity of the mobile phase, and the size of the chromatographic particles [5]. However, the general rate model is unable to capture how the HETP is influenced by key factors of practical relevance such as column size (column diameter and length) or ease of packing across different chromatographic resins [6]. For example, Scharl et al. [7] qualitatively discussed the importance of the material backbone on the packing quality of a range of pre-packed columns. Deviations from symmetrical peaks are often observed in practice, with peak fronting or tailing associated with a number of non-idealities such as wall effects, inhomogeneous packing, inhomogeneous distribution of the solute over the bed at the column inlet/distributor and/or at the outlet/collector, and dispersion in the extra-column volumes [8][9][10][11][12].
Such deviations are measured through the asymmetry factor (A_s), an empirical parameter used to quantify the degree of peak skewness and employed to assess packing quality in tandem with the HETP [13].
Mathematical models of column performance and chromatographic processes, including the general rate model, are generally based on first principles. In particular, they include details of mass transfer phenomena and binding kinetics to describe peak profiles and breakthrough curves [14,15]. While the predictive power of these models is often excellent, they require extensive calibration efforts through independent experiments, for example, to determine key model parameters such as mass transfer and kinetic coefficients [16,17]. Flow non-idealities such as wall effects and the distribution/collection of the fluid at the column inlet/outlet also require independent experiments to be accounted for in the models. These additional experiments are specific to the chromatographic system (extra-column volumes) and column (diameter, length) employed, and therefore cannot be extrapolated to different systems or different columns. Finally, such first-principles models by design do not take into account qualitative variables such as resin backbone and functional chemistry.
Machine learning (ML) could represent an alternative modeling approach to analyze and predict column performance. The main advantage of ML is its ability to extract information from large data sets using few or no assumptions, eventually determining generalizable predictive patterns between multiple inputs (including quantitative, qualitative, and categorical parameters) and the output variables [18,19]. A number of algorithms, for example, support vector machines, decision trees, gradient boosting, and deep neural networks, have been developed over the years and have proven able to deal with complex data problems in a practical manner [20,21]. ML has been applied to chromatography systems with many successful applications, for example, in peak observation [22][23][24], retention modeling [25][26][27][28], process optimization [29][30][31], and real-time process monitoring [32,33]. The main challenge associated with the application of ML is the need for very large experimental data sets for the ML algorithm to draw meaningful correlations.
In this work, we consider a large data set of around 25 000 quality assurance experiments of pre-packed columns manufactured and tested under standardized conditions over a period of more than 10 years [7]. We first examine the time series of the data set using correlation and autocorrelation analysis to ensure the data are self-consistent and time-independent. We then employ ML methods to find a correlation between column performance (measured in terms of HETP and asymmetry) and column variables, both qualitative (resin backbone, functionalization chemistry) and quantitative (column length and diameter, particle size). The results are finally discussed in relation to the key variables affecting column performance.

Experimental data set
The data set employed in this work is a subset of that previously employed by Scharl et al. [7], consisting of 24 951 quality control runs of pre-packed small-scale columns over a period of about 10 years. The data contain relevant column parameters (i.e., column length and diameter, particle size, backbone material, functional mode, and date of testing) together with the reduced HETP (h) and asymmetry (A_s). Column diameter and length ranged between 5 and 11.3 mm and 10 and 100 mm, respectively, while particle diameter varied between 15 and 400 μm. 2232 experimental runs (approximately 10%) were removed from the original data set as they lacked one or more column parameter inputs, reducing the data set to a total of 22 359 tests. Columns with the same attributes were manufactured and tested more than once over the 10 years monitored, with some popular types examined hundreds of times (for example, see Table S1). All experiments having the same set of input features were treated as a single entry, with h and A_s averaged over the available runs for that column type. This step was necessary to prevent data leakage in the ML model, that is, the use of the same column type in both the training and testing data sets (see Section 2.3), as well as to prevent overfitting of the most popular column types over those infrequently produced. The standard error for h and A_s was always lower than 10%, indicating that the average h and A_s are representative output indicators of column performance for any given column type. After the averaging process, the data set contained a total of 546 independent entries (num = 546).
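The replicate-averaging step described above can be sketched in a few lines of plain Python (a minimal illustration with hypothetical field names; the actual study's tooling is not specified):

```python
from collections import defaultdict
from statistics import mean

def average_replicates(runs):
    """Collapse repeated tests of the same column type into one entry.

    Each run is a dict with the five input features plus the measured
    h and As. Runs sharing all five features are averaged, so the same
    column type cannot later appear in both the training and testing
    splits (preventing data leakage).
    """
    groups = defaultdict(list)
    for run in runs:
        key = (run["length"], run["diameter"], run["particle_size"],
               run["backbone"], run["functional_mode"])
        groups[key].append(run)

    averaged = []
    for (length, diameter, dp, backbone, mode), members in groups.items():
        averaged.append({
            "length": length, "diameter": diameter, "particle_size": dp,
            "backbone": backbone, "functional_mode": mode,
            "h": mean(r["h"] for r in members),
            "As": mean(r["As"] for r in members),
        })
    return averaged
```

Applied to the full data set, this grouping reduces the 22 359 tests to the 546 unique column types used as model inputs.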
All columns used to generate the data set were packed by slurry packing under vibration following a standardized procedure developed by the packing company (Atoll, now Repligen). The packing quality of the columns was evaluated using a standardized experimental setup and protocol as reported in Scharl et al. [7]. Briefly, the response of the column following an acetone or sodium nitrate injection was measured, and the resulting chromatographic peak was analyzed to extract h and A_s. This simple experiment allowed isolation of the contribution to band broadening associated with hydrodynamic dispersion (which in turn depends on packing quality and extra-column dispersion), as the tracers employed are both non-retained (i.e., zero retention factor), have practically the same diffusion coefficients (1.2 × 10⁻⁵ and 1.3 × 10⁻⁵ cm²/s for acetone [34] and sodium nitrate [35], respectively), and were tested at reduced velocities between 1 and 20, for which the minimum HETP is obtained [14].

Extreme gradient boosting
Extreme gradient boosting (XGBoost) is a scalable ML system for tree boosting [36]. XGBoost is a decision-tree-based ensemble learning method [37] that provides a systematic solution to a given problem by combining the predictive power of several ML models. The base learner in XGBoost is the Classification and Regression Tree (CART) [38], a binary tree that is recursively split on the data features, enabling dynamic growth of the tree. Each input example eventually falls into a leaf node of the tree, where each leaf node carries a specific score; the sum of the scores over all the trees gives the final prediction of a given output, for example, h or A_s. Essential details of the mathematical formulation of the XGBoost model are presented in the following, with additional details in the SI. For a given data set with n examples and m features, \mathcal{D} = \{(\mathbf{x}_i, y_i)\} (|\mathcal{D}| = n, \mathbf{x}_i \in \mathbb{R}^m, y_i \in \mathbb{R}), the tree ensemble model uses K additive functions to predict the output:

\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F}  (1)

where \mathcal{F} = \{f(\mathbf{x}) = w_{q(\mathbf{x})}\} (q: \mathbb{R}^m \to T, w \in \mathbb{R}^T) is the space of the regression trees. Here q represents the structure of each tree, mapping an example to the corresponding leaf index, and T is the number of leaves in the tree. Each f_k corresponds to an independent tree structure q and leaf weights w. More mathematical details can be found in the original XGBoost paper [36].
The regularized objective function minimized by XGBoost, \mathcal{L}, can be written as:

\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k)  (2)

\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2  (3)

Here l is a differentiable convex loss function that measures the difference between the prediction \hat{y}_i and the target y_i. The \Omega term prevents unnecessarily large trees by penalizing the complexity of the model (through the per-leaf penalty \gamma T), in turn avoiding overfitting, while the regularization term \frac{1}{2} \lambda \lVert w \rVert^2 helps smooth the final learned leaf weights. Shrinkage (the learning rate \eta) is an additional measure against over-fitting: at each iteration, the scores of the newly added leaf nodes are multiplied by the reduction weight \eta, which ensures that the influence of any single tree is not too large, leaving more room for the trees generated later to optimize.
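To make the additive structure of Equation (1) and the role of shrinkage concrete, the following toy sketch trains a boosted ensemble of one-split regression stumps. It is a drastically simplified stand-in for XGBoost's regularized CART learner (no \Omega penalty, squared loss only), intended only to illustrate how each tree fits the current residuals and is damped by the learning rate:

```python
def fit_stump(x, residuals):
    """Best single-split regression stump minimizing squared error."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        wl = sum(left) / len(left)    # left-leaf weight (mean residual)
        wr = sum(right) / len(right)  # right-leaf weight
        sse = (sum((r - wl) ** 2 for r in left)
               + sum((r - wr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, wl, wr)
    _, t, wl, wr = best
    return lambda xi: wl if xi <= t else wr

def boost(x, y, rounds=20, eta=0.3):
    """Additive ensemble: each stump fits the current residuals and its
    leaf scores are shrunk by eta before being added to the model."""
    trees, pred = [], [0.0] * len(x)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + eta * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(eta * tree(xi) for tree in trees)
```

With a small learning rate, each tree corrects only a fraction of the remaining error, so many shallow trees cooperate rather than any single tree dominating.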
XGBoost is also used to determine the relative importance of the input features, following the definition of Friedman [39]. For a tree model with J terminal nodes, the relative importance of a given input feature x_\ell is calculated as the sum of the corresponding empirical improvements over the J − 1 non-terminal nodes t for which x_\ell acts as the splitting variable:

\hat{I}_\ell^2 = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}(v(t) = \ell)  (4)

The improvement \hat{i}^2 for a split into the two sub-regions R_L and R_R is determined as:

\hat{i}^2 = \frac{w_L w_R}{w_L + w_R} (\bar{y}_L - \bar{y}_R)^2  (5)

where \bar{y}_L and \bar{y}_R are the response means of the two sub-regions, and w_L and w_R are the corresponding sums of the observation weights. In Python, the contribution of each input feature is then normalized to a percentage.
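A minimal sketch of the split improvement in Equation (5), using weighted response means (all names are illustrative):

```python
def split_improvement(y_left, w_left, y_right, w_right):
    """Friedman's squared-error improvement for one split:

        i^2 = wL * wR / (wL + wR) * (mean_L - mean_R)^2

    where wL, wR are the sums of observation weights in the two
    sub-regions and mean_L, mean_R the weighted response means."""
    mean_l = sum(w * y for w, y in zip(w_left, y_left)) / sum(w_left)
    mean_r = sum(w * y for w, y in zip(w_right, y_right)) / sum(w_right)
    wl, wr = sum(w_left), sum(w_right)
    return wl * wr / (wl + wr) * (mean_l - mean_r) ** 2
```

Summing this quantity over every node that splits on a given feature, and normalizing across features, yields the percentage importances reported in Figure 3c,d.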

Data pre-processing and model implementation
Functional mode and backbone are two categorical features that cannot be handled directly by many ML algorithms. One-hot encoding was applied to convert them into numerical values [40], with each feature normalized between 0 and 1. All other numerical parameters were also normalized between 0 and 1 before input into the ML model, as most ML algorithms perform better or converge faster with features on a relatively similar scale [41]. An XGBoost regression model was created in Python 3.6 combining i) GridSearchCV (10-fold) to select the model's hyper-parameters (for example, learning rate, maximum tree depth, and minimum child weight) [42] and ii) XGBRegressor as the main package to process our data set [43]. The whole data set was then separated randomly into a training set (66.7%) and a testing set (33.3%), with the training set used to train the ML model and the testing set used to inspect the final model accuracy. Mean absolute error (MAE) [44] was used as the evaluation metric during model training. The final prediction precision of the model is reported as the mean absolute percentage error (MAPE) between the prediction results and the testing data set. The overall model prediction capability remained the same when changing the initial seed to randomly generate different training and testing data sets.
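The two preprocessing transformations can be illustrated without any ML libraries (the study itself used standard Python packages; this is only a dependency-free sketch):

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector,
    one column per distinct level (sorted for determinism)."""
    levels = sorted(set(values))
    rows = [[1.0 if v == lvl else 0.0 for lvl in levels] for v in values]
    return rows, levels

def min_max(values):
    """Rescale a numeric feature to [0, 1].

    Assumes the feature is not constant (max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

After encoding, a categorical feature such as backbone contributes one binary column per material, and all quantitative features (column length, diameter, particle size) share the same [0, 1] scale.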

RESULTS AND DISCUSSION
The main goal of this study was the identification of a general relationship between column parameters (column length, column diameter, particle diameter, functional mode, backbone material) and chromatographic performance (reduced HETP, h, and peak asymmetry, A_s) using ML algorithms as an alternative to classical rate models for chromatography. Classical rate models are derived from first principles and thus tend to be the preferred choice for modeling chromatographic separations. However, some of the parameters entering rate models are often either determined through empirical expressions (for example, the Wilson–Geankoplis correlation for the estimation of the mass transfer coefficient [45]) or simply adjusted to best fit experimental results (for example, diffusion or dispersion coefficients [46]). The introduction of a certain degree of empiricism into physical models is necessary to capture important elements that are hard to describe in mathematical terms. For example, the 3D configuration of chromatographic beds deviates from the theoretical close random packing limit [47], with the resulting bed arrangement strongly influenced by attributes linked to the material and column properties (for example, Young's modulus, friction factor, and wall roughness) as well as the packing procedure itself [47,48]. For example, Knox demonstrated that hydrodynamic dispersion in columns packed with smooth non-porous glass beads is smaller than that measured in columns packed with porous glass [49]. Knox explained this result in terms of bed homogeneity and speculated that smooth glass particles are able to form relatively regular packings, while porous glass particles are affected by greater interparticle friction forces, in turn resulting in particle bridging and the formation of pockets where local mixing occurs. These insights were demonstrated experimentally by Patel et al.
[50], who confirmed that the A term in the van Deemter equation is primarily associated with radial heterogeneities in the bed. At the opposite extreme, Malkin et al. showed that submicrometer silica particles tend to pack close to the limit of a face-centered cubic arrangement [51], resulting in reduced plate heights below 1. Khirevich et al. also reported that local microscopic disorder in packings is highly correlated with eddy dispersion, directly affecting column performance [52]. Along the same line, Gritti et al. [53] reported the outstanding performance of columns packed with core-shell particles, partly attributing these results to the propensity of these particles to create homogeneous beds. More recent studies on 3D-printed ordered beds further confirm the advantages of perfectly ordered packings, with simulated reduced plate heights below 0.1 for specific arrangements (for example, octahedral particles in a simple cubic configuration) of non-porous stationary phases under non-retained conditions [54].
The concept of "goodness of packing" proposed by Knox is strongly correlated with the A term of the van Deemter equation [55], with lower A values associated with lower reduced plate heights and hence higher chromatographic efficiency. According to the general rate model for chromatography, the A term can be expressed as [16]:

A = 2\chi d_p  (6)

or in dimensionless terms:

a = A / d_p = 2\chi  (7)

where d_p is the average particle diameter and \chi is the dispersivity of the stationary phase. The dispersivity is a characteristic determined by the hydrodynamics in the column, in turn defined by the type of particles and their packing. For a given column, the dispersivity can be determined by estimating the plate height under conditions suppressing both axial diffusion (i.e., large velocity, negligible B term) and mass transfer and kinetic resistances (i.e., injection of a small, fast-diffusing, non-adsorbing tracer, negligible C term), for which the van Deemter equation reduces to:

h = 2\chi  (8)

While this equation represents a relatively rapid method to assess the hydrodynamic properties of a given column, the lack of correlations for the estimation of the dispersivity coefficient limits the prediction of band broadening due to axial dispersion. In particular, there exists no quantitative method to assess how the dispersivity depends on column properties such as:

- backbone material and functional mode, closely related to the propensity of the particles to generate regular packings;
- column and particle diameters, that is, the column-to-particle diameter ratio, which determines the importance of non-homogeneities close to the column wall relative to the rest of the column volume;
- column length and column diameter, which are associated with bed compressibility [56] and define the relative influence of extra-column dispersion effects, for example, non-uniformity of the velocity profile resulting from non-idealities in the extra-column volumes.
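As a worked illustration of Equations (6)-(8), converting a measured HETP into a reduced plate height and a dispersivity estimate takes only two steps (the unit choices below are illustrative):

```python
def reduced_plate_height(hetp_mm, dp_um):
    """h = HETP / dp, with both lengths converted to micrometers."""
    return (hetp_mm * 1000.0) / dp_um

def dispersivity(h):
    """Under A-term-dominated conditions, Equation (8) gives h = 2*chi,
    so the dispersivity follows directly from the reduced plate height."""
    return h / 2.0
```

For example, an HETP of 0.3 mm measured on a 100 μm resin gives h = 3, within the typical h < 5 acceptance range discussed later, and a dispersivity χ = 1.5.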
Fronting or tailing deviations from the ideal symmetrical peak are often observed in chromatographic practice, negatively impacting separation performance. Such deviations are often quantified through the asymmetry factor, A_s, defined as the ratio between the width of the tailing end and the peak front at 10% of the peak height [57,58]. Large asymmetry factors are associated with heterogeneity of the column packing [59,60], making A_s another excellent descriptor of "goodness of packing". However, the search for a quantitative relationship between asymmetry and column parameters has been elusive so far. In this context, ML is an excellent tool to extract poorly understood links between variables such as the column input parameters and the asymmetry factor.
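As a concrete illustration of this definition, the asymmetry factor can be computed from a sampled peak by locating the two crossings at 10% of the peak height (a simplified sketch assuming a single, baseline-resolved peak):

```python
def asymmetry_factor(times, signal, frac=0.10):
    """As = b / a at `frac` of the peak height, where a is the front
    half-width and b the tail half-width, with linear interpolation
    between sampled points."""
    apex = max(range(len(signal)), key=lambda i: signal[i])
    level = frac * signal[apex]

    def crossing(indices):
        # Walk toward the apex; interpolate between the last sample
        # below `level` and the first sample at or above it.
        prev = None
        for i in indices:
            if signal[i] >= level and prev is not None:
                t = (level - signal[prev]) / (signal[i] - signal[prev])
                return times[prev] + t * (times[i] - times[prev])
            if signal[i] < level:
                prev = i
        return times[indices[0]]  # fallback: peak starts above level

    t_front = crossing(range(apex + 1))
    t_tail = crossing(range(len(signal) - 1, apex - 1, -1))
    return (t_tail - times[apex]) / (times[apex] - t_front)
```

A symmetric peak yields A_s = 1, a tailing peak A_s > 1, and a fronting peak A_s < 1, matching the 0.8-1.6 acceptance window discussed later.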
The data set of pre-packed column performance offers an opportunity to quantitatively analyze the dependence of the dispersivity on a range of qualitative and quantitative column attributes. The two performance parameters, h and A_s, are measured from the experimental response to an injection of a small non-retained tracer (acetone or sodium nitrate). The same experimental and data analysis methods were used to generate the entire data set [7]. Only resins intended to separate proteins or other large biomolecules were tested, ensuring pores much larger than the tracer molecules. Such conditions ensure that only hydrodynamic dispersion is captured in the experiments and that the van Deemter equation can be simplified into Equation (8).
In short, we propose here to employ ML as a powerful alternative to traditional chromatographic models to investigate a correlation between the different column input parameters and the output performance parameters. ML is especially valuable in this context given the complexity of the problem described and the qualitative nature of some of the relevant variables such as column backbone and functional mode. ML is also able to suggest the relative importance of the different inputs with respect to the outputs, thus helping the identification of the key descriptors for the performance parameters.

Time series of reduced plate height and asymmetry
Column performance can change over time due to variations in the manufacturing line, for example, improvements in the packing procedures, changes of suppliers of raw materials, and aging of the production line. Scharl et al. qualitatively observed that the plate height of the pre-packed columns tested was stable over 10 years [7]. However, any interdependence between h and A_s and time needs to be either identified or excluded in quantitative terms to avoid input bias in the ML model. In other words, it is first necessary to determine whether time represents an input variable to the ML model, as well as whether the sampling and testing of the columns changed significantly over time. Autocorrelation and partial autocorrelation analysis were applied to the data set to address these two aims, respectively. In particular, the autocorrelation function (acf) detects similarities of a signal with itself at a different time (time lag) [61]. In this context, the acf helps detect changes in the manufacturing line and in the quality assurance protocols employed over time. The partial autocorrelation function (pacf) instead aims to identify possible confounding variables correlated with both the output and time [62]. In this instance, the pacf aims to identify a correlation between time and the performance parameters, in turn suggesting whether a specific pattern of column types was manufactured over time. Additional details on the acf and pacf are provided in the SI.
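For reference, the sample acf and the approximate 95% white-noise confidence bound (the quantities plotted in Figure 2) can be computed as follows. Libraries such as statsmodels provide the same functionality; this is only a dependency-free sketch:

```python
def acf(series, max_lag):
    """Sample autocorrelation r_k for lags 0..max_lag:
    r_k = sum_t (x_t - mu)(x_{t+k} - mu) / sum_t (x_t - mu)^2."""
    n = len(series)
    mu = sum(series) / n
    c0 = sum((x - mu) ** 2 for x in series)
    out = []
    for k in range(max_lag + 1):
        ck = sum((series[t] - mu) * (series[t + k] - mu)
                 for t in range(n - k))
        out.append(ck / c0)
    return out

def ci95(n):
    """Approximate 95% confidence bound for white noise: +/- 1.96/sqrt(n).
    Coefficients inside this band are consistent with no time pattern."""
    return 1.96 / n ** 0.5
```

A series with no temporal structure yields acf coefficients that stay within ±ci95(n) for all non-zero lags, which is the behavior reported below for both h and A_s.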
The h and A_s time series were first resampled by averaging the data set in day intervals, irrespective of the other column parameters. Other than reducing noise, resampling is customary when autocorrelation analysis is executed over large time periods [58,61]. Figure 1 shows the time series of the two performance parameters, h and A_s.

Figure 1: Time series of (A) reduced plate height, h, and (B) asymmetry, A_s, for pre-packed purification columns manufactured over the 10-year period monitored.

Over the 10 years considered, h varied between about 2.2 and 7.8, with an average of around 4.5. Variability reduced significantly from 2011 onward, with a slight decrease of the plate height in 2012-2013. The asymmetry ranged between about 0.8 and 2, with an average of 1.1. Similar to the plate height, the scatter in the asymmetry over the first 5 years is larger than after 2011. According to Scharl et al. [7], industrial quality assurance tests require h to be smaller than 5, while the acceptable range for A_s is between 0.8 and 1.6. The observed variability is a natural consequence of industrial manufacturing, yet the columns produced were within specifications in terms of both h and A_s. Figure 2 shows the results of the autocorrelation and partial autocorrelation analysis of h and A_s using daily lag times of up to one year. Other lag times were also examined (i.e., weekly and monthly, as well as over 2 and 3 months) with no significant difference. For both h and A_s, almost all of the acf and pacf coefficients lie within the 95% confidence interval. The low acf demonstrates that the data set does not have a specific pattern with time, quantitatively confirming that the manufacturing line was stable over the 10-year period investigated [63]. In addition, the low pacf rules out the existence of confounding variables such as patterns in column sampling and testing over time.
In other words, the pacf analysis confirms that column manufacture was unbiased, excluding the possibility that a certain column type (for example, of a specific size and packed with a specific particle) was manufactured predominantly over other columns over time. Overall, acf and pacf demonstrate that all performance tests were time-independent, making the data set solely dependent on the five input parameters of particle size, column diameter, column length, column backbone, and functional mode.

Influence of column parameters on packing quality
XGBoost was utilized to assess the influence of the column parameters (i.e., the inputs to the ML algorithm: particle size, column length, column diameter, functional mode, and resin backbone) on packing quality (i.e., the ML outputs h and A_s). Other ML algorithms such as artificial neural networks and decision trees were also employed in a preliminary model assessment (refer to the SI for additional information on the ML models). XGBoost consistently provided the highest predictive precision, mainly because its regularization and shrinkage terms (Equations (2) and (3)) are capable of curbing over-fitting, the main cause of poor prediction. Figure 3 summarizes the results obtained with the XGBoost model in predicting the experimental data. In particular, Figure 3a,b compares the predicted h and A_s, respectively, against the observed data of the testing data set. The predictions are in good agreement with the experimental results, with a MAPE of 10% for h and 7% for A_s relative to the observed values, and a few outliers in the 40%-50% range. These acceptable errors confirm that the XGBoost model can be applied to this problem with good prediction accuracy. Figure 3c,d reports the contribution importance (Equations (4) and (5)) of the various input parameters in predicting the model outputs. Interestingly, column backbone emerged as the most important descriptor of packing quality, accounting for 68.4% and 77.0% of the prediction of h and A_s, respectively. Functional mode was the second most significant descriptor for the estimation of packing quality, accounting for about 15% contribution importance, followed in varying order by the other parameters (particle size, column diameter, and column length).

Figure 2: Autocorrelation (acf) and partial autocorrelation (pacf) analysis of reduced plate height, h, and asymmetry, A_s. (A) acf of h; (B) pacf of h; (C) acf of A_s; (D) pacf of A_s. The blue shaded areas correspond to a 95% confidence interval.
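For clarity, the two error metrics quoted above follow their usual definitions (a minimal sketch):

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent of the observed value."""
    return 100.0 / len(y_true) * sum(
        abs((t - p) / t) for t, p in zip(y_true, y_pred))
```

MAE served as the training objective metric, while MAPE, being scale-free, is used to report the final precision across the differently scaled outputs h and A_s.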
Violin plots were employed to further analyze the correlation between input features and column performance (Figure 4). A violin plot is an extension of a box-and-whisker plot (still clearly recognizable inside each "violin"), decorated with a curve whose width reflects the local probability density of the data.

Resin backbone
Resin backbone was the most influential parameter for the prediction of packing quality. The material making up the resin backbone can be inorganic, a synthetic polymer, or a natural polymer. The nature of the material employed determines a number of properties such as the surface roughness of the particles [64], the particle size distribution (linked to the manufacturing method) [65], the occurrence of microstructural defects, and other mechanical properties such as Young's modulus and density [63,64]. All these factors impact column packing, either directly or indirectly, in turn influencing the homogeneity of the resulting chromatographic bed, that is, packing quality. Johnson et al. examined a range of resin materials (agarose, cellulose, ceramic) through X-ray computed tomography (CT) and focused ion beam analysis [66]. They highlighted clear variations in the chemical, physical, and mechanical properties of the different materials. Our analysis with the XGBoost model also confirms that resin characteristics strongly influence chromatographic performance. Figure 4a,b presents violin plots of h and A_s, respectively, over the eight different backbones tested. It is possible to observe that certain backbones perform worse than others as measured by both packing quality parameters h and A_s. For example, polystyrene-divinylbenzene, inorganic support (IS), and dextran show widely distributed data, with an average h above 5 and an average A_s above 1.2. On the other hand, agarose, cellulose, and PVE hydrophilic (PVE) demonstrated consistent results (little data scatter) with average h and A_s well below the arbitrary thresholds of 5 and 1.2, respectively. This analysis clearly demonstrates the importance of backbone selection, for example, during process or method development.

Figure 3: XGBoost prediction results for (A) h and (B) A_s over the testing data set, together with the variable importance contributions for (C) h and (D) A_s. The importance is calculated from the improvement in performance at each attribute split point, weighted by the number of observations the node is responsible for. The importance contributions, named Gain in XGBoost (refer to Equations (4) and (5)), were converted into percentages.
It is worth noting that IS was relatively popular in the first 3 years of our data set, while PVE hydrophilic (PVE) matrices were little used at first, becoming more mainstream after 2011. This change in backbone population over time can partly explain the slight decrease of the absolute value of h, as well as the reduced scatter of ℎ and observed from 2011 onward (Figure 1).

Functional mode
The functional mode was the second most important parameter for the prediction of packing quality. Figure 4c,d shows the relation between h and A_s and the different functional modes. The influence of the functionalization chemistry on column packing is less intuitive than that of the chromatographic backbone. Stickel and Fotopoulos [67] reported a difference in the pressure-flow profiles between sepharose and phenyl sepharose, which was associated with the differing hydrophobic and electrostatic character of the resin beads. Electrostatic and hydrophobic interactions might promote local or temporary bonding of two or more particles into clusters, decreasing the degrees of freedom of the slurry and thus influencing column packing [68]. Also, functionalization procedures can change the mechanical and surface properties of the beads, for example, as a consequence of the different solvents, chemicals, and temperatures employed for ligand immobilization. This in turn influences the packing process [69], ultimately determining packing quality. The possibility of a correlation between column functionality and backbone was tested both qualitatively (mosaic plot in Figure 5) and statistically by employing the chi-squared test. The size of the mosaic tiles in Figure 5 is proportional to the number of chromatographic columns in the data set having a certain combination of backbone and functional mode. The most frequent combinations correspond to columns that are indeed ubiquitous in the downstream processing of biopharmaceuticals. Other backbones find use in specific application domains; for example, dextran is predominantly employed for size exclusion chromatography (SEC), and hydrophobic charge induction chromatography (HCIC) is purely carried out with cellulosic adsorbents. A chi-squared test of independence with 63 degrees of freedom, that is, (8 backbones − 1) × (10 functional modes − 1), and a sample size of 546 tests indeed showed a significant relationship between the two input variables, χ²(63, num = 546) = 693, p < 0.01.
While a correlation between resin material and functionalization is apparent, its influence on the ML model was eliminated by averaging all experimental results measured under the same input conditions (see Section 2.1), an especially important step to prevent the same samples from being present in both the training and testing sets, which would overestimate the accuracy.
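The chi-squared statistic reported above follows the standard computation for a contingency table. In practice, scipy.stats.chi2_contingency performs this calculation (including the p-value); below is a dependency-free sketch of the statistic and its degrees of freedom:

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an
    r x c contingency table given as a list of rows of observed counts.

    chi2 = sum over cells of (observed - expected)^2 / expected,
    where expected = row_total * column_total / grand_total."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof
```

For the 8 × 10 backbone-by-functional-mode table of the study, dof = 63, matching the figure quoted above.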

Column length

The influence of column length on h is presented in Figure 4e. The median h, as well as its propensity to data scatter and relatively large values (h above 10), increases with column length. This observation can be explained by a combination of packing consolidation and wall effects. The former is relevant during column manufacture, that is, when compression forces transfer through the packing via inter-particle friction as well as friction between the particles and the column wall [70]. The uneven stress distribution created between particles in the bulk and at the periphery of the column negatively affects bed consolidation and packing homogeneity. The presence of the wall constrains the resin particles to pack in configurations with higher local porosity in the immediate vicinity of the column wall. The columns investigated in this work were small-scale purification columns (column volumes of about 1-10 mL) with relatively large particle diameters (15-400 μm) and small column diameters (5-11.3 mm). The resulting column-to-particle diameter ratio was generally around 80, down to 20 for some columns. In this context, Maier et al. [71] reported that wall effects on axial dispersion can be observed even for columns with a column-to-particle diameter ratio greater than 100. Reising et al. [72] and Gritti [73] studied the dependence of fluid velocity on radial position, and concluded that the velocity close to the column wall can be up to 2.2 times the bulk velocity, significantly contributing to band broadening and early breakthrough. Flow non-idealities arising from both packing difficulties and wall effects scale with column length, with packing quality and column performance inversely related to it. The contribution of column length to A_s is reported in Figure 4f. No significant difference can be observed across the data, other than a minor decrease in the median asymmetry with column length.
Asymmetry is heavily determined by extra-column band broadening, that is, by all flow non-idealities present in the extra-column volumes such as tubing, fittings, column distributors and collectors, pumps, and valves. This effect becomes more prominent for smaller columns, as described by Kaltenbrunner et al. [74], who reported extra-column volumes accounting for more than 90% of band broadening in small columns.

3.2.4 Column diameter

According to the ML results, the contribution of column diameter to the prediction of h is 5.2%, while it is only 0.9% for A s (Figure 3c), and no clear relationship can be observed between column diameter and the two performance output parameters (Figure 4g,h). All three column diameters considered in this work fall within the same order of magnitude (5, 8, and 11.3 mm), potentially hiding any correlation between column diameter and packing quality. Schweiger et al. [3] analyzed the band broadening arising from the extra-column and in-column contributions of pre-packed columns with different column diameters, and concluded that an increase in column diameter can lead to an increase in peak width caused by flow non-idealities in the flow distributor and collector. Experimental data for wider columns are required to identify and eventually quantify any possible relationship between column diameter and column performance.

3.2.5 Particle diameter

The correlation between particle diameter and h is reported in Figure 4i. According to the reduced form of the van Deemter equation (Equation (8)), the magnitude of h does not depend on particle diameter. Nevertheless, ML results indicate that the relative importance of particle diameter for the prediction of h is 10.7% (Figure 3c). In Figure 4i the median h drops slightly with particle size, possibly resulting from packing difficulties with smaller particles, as also reported by Scharl et al. [7]. No trend between A s and particle size could be observed (Figure 4j).
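The particle-size independence noted above can be seen in the textbook reduced van Deemter form, h = a + b/ν + c·ν, where the reduced velocity ν = u·d_p/D_m absorbs the particle diameter. The coefficient values below are illustrative, not the fitted constants of Equation (8) in this work.

```python
# Reduced van Deemter equation: h depends only on the reduced
# velocity nu, not directly on particle diameter d_p.
# Coefficients a, b, c are illustrative placeholder values.
def reduced_plate_height(nu, a=1.0, b=2.0, c=0.05):
    return a + b / nu + c * nu

# Evaluate over the reduced-velocity range typical of RTD experiments.
for nu in (1, 3, 10):
    print(f"nu = {nu:2d} -> h = {reduced_plate_height(nu):.2f}")
```

Any residual dependence of h on particle size in the data therefore points to packing effects rather than to the van Deemter contributions themselves.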

CONCLUDING REMARKS
Traditional statistical analysis (for example, autocorrelation analysis, chi-square analysis) and ML were applied to a large data set (546 different combinations of column features) of packing quality (reduced plate height, h, and asymmetry, A s) for pre-packed columns manufactured with different column sizes (column length and column diameter) and packed with different resins (backbone, functional mode, and particle diameter) over 10 years. Autocorrelation and partial autocorrelation provided a quantitative framework to analyze column quality over time. The results show that packing quality was not correlated with time, indicating that column manufacture, sampling, and testing were consistent over the 10 years.
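The consistency-over-time check described above can be sketched with a simple sample autocorrelation: if manufacture was stable, a time-ordered series of plate heights should behave like white noise, with no lag exceeding the 95% confidence band of ±1.96/√n. The series below is synthetic stand-in data, not the study's measurements.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of series x at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Stand-in for measured h values in production order (white noise).
rng = np.random.default_rng(0)
h_series = rng.normal(loc=3.0, scale=0.3, size=200)

# 95% confidence band for a white-noise series of this length.
band = 1.96 / np.sqrt(len(h_series))

flagged = [k for k in range(1, 21) if abs(autocorr(h_series, k)) > band]
print("lags exceeding the white-noise band:", flagged)
```

A mostly empty `flagged` list (a few spurious lags are expected at the 5% level) supports the conclusion that packing quality did not drift with time.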
XGBoost proved an excellent ML model for predicting column performance, with MAPE of 10 and 7% on h and A s, respectively. According to the ML tool employed, the column backbone contributed the most to its predictive capability; in other words, the resin material had the most significant impact on column performance. A trend between column length and performance was also observed, with h rising slightly as the length increased, consistent with a larger contribution to band broadening due to wall effects and axial dispersion.
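A minimal sketch of such a modelling pipeline follows, using scikit-learn's GradientBoostingRegressor as a stand-in for the XGBoost model used in the study. All data, feature names, and effect sizes are synthetic and illustrative; categorical features are one-hot encoded, and per-feature importances are recovered by summing over the dummy columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Synthetic data set: quantitative and categorical column features.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "length_mm":   rng.choice([25, 50, 100], n),
    "diameter_mm": rng.choice([5.0, 8.0, 11.3], n),
    "particle_um": rng.choice([15, 50, 90, 400], n),
    "backbone":    rng.choice(["agarose", "PS-DVB", "silica"], n),
    "mode":        rng.choice(["IEX", "HIC", "affinity"], n),
})
# Synthetic target: backbone dominates h, mirroring the study's finding.
backbone_effect = df["backbone"].map({"agarose": 2.5, "PS-DVB": 5.0, "silica": 3.5})
h = backbone_effect + 0.01 * df["length_mm"] + rng.normal(0, 0.2, n)

# One-hot encode categorical features, then split and fit.
X = pd.get_dummies(df, columns=["backbone", "mode"])
X_tr, X_te, y_tr, y_te = train_test_split(X, h, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

mape = mean_absolute_percentage_error(y_te, model.predict(X_te))
print(f"MAPE: {mape:.1%}")

# Relative importance per original feature (summing one-hot columns).
imp = pd.Series(model.feature_importances_, index=X.columns)
imp_grouped = imp.groupby(imp.index.str.split("_").str[0]).sum()
print(imp_grouped.sort_values(ascending=False))
```

On data constructed this way, the grouped importances recover the dominant role of the backbone, analogous to the ~70% relative importance reported in this work.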
Overall, this work demonstrates the capability of ML to evaluate and predict column performance solely from the knowledge of basic column characteristics (column length and diameter, particle size, backbone material, and functional mode). These results could be employed to extrapolate the expected performance of new and existing column types, to help set QA protocols for new and existing manufacturing lines for pre-packed chromatography columns, or as a reference benchmark for columns packed traditionally in lab settings, especially for hard-to-pack columns such as polystyrene-divinylbenzene and ISs. The results presented here can guide further efforts in column optimization, for example, by revealing potential inefficiencies in the packing process and suggesting improvements of backbone and functional modes to obtain easy-to-pack resins prone to forming ordered packing arrangements with high chromatographic performance.
More generally, ML provides a quantitative tool to describe complex problems with multiple input features, including categorical features such as resin backbone and functional mode. ML methods can also be employed in other areas of chromatography, for example, to generate accurate retention models, resolve complex chromatography peaks, and search for column structures with improved performance.

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from Repligen. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from Dr. Tim Schroeder or Dr. Theresa Scharl with the permission of Repligen.