SARS-CoV-2 RNA levels in Scotland's wastewater

Nationwide, wastewater-based monitoring was newly established in Scotland to track the levels of SARS-CoV-2 viral RNA shed into the sewage network, during the COVID-19 pandemic. We present a curated, reference data set produced by this national programme, from May 2020 to February 2022. Viral levels were analysed by RT-qPCR assays of the N1 gene, on RNA extracted from wastewater sampled at 122 locations. Locations were sampled up to four times per week, typically once or twice per week, and in response to local needs. We report sampling site locations with geographical coordinates, the total population in the catchment for each site, and the information necessary for data normalisation, such as the incoming wastewater flow values and ammonia concentration, when these were available. The methodology for viral quantification and data analysis is briefly described, with links to detailed protocols online. These wastewater data are contributing to estimates of disease prevalence and the viral reproduction number (R) in Scotland and in the UK.

setbacks and biases are expected in community testing. For example, a large proportion of RT-PCR tests, which are used to monitor case data, are available only for symptomatic people. Additionally, antigen tests can display a large proportion of false negatives, especially in asymptomatic individuals, so they have mostly served as an adjunct to RT-PCR tests [3]. Community testing relies on the population to volunteer and depends on the ready availability of tests. Even in countries that have resources for mass testing, the population in more remote areas might remain untested.

A complementary way of monitoring viral diseases is to use a wastewater-based epidemiology (WBE) approach. In this approach, certain viruses or drugs released by the population into the sewage network can be quantified and monitored through time [4,5]. It has been shown that individuals infected with SARS-CoV-2 will excrete a significant number of viral particles in stools [6-8]. It was later shown that SARS-CoV-2 can be detected and quantified in wastewater samples using RT-PCR techniques or next generation sequencing and that the amount of virus in sewage correlated well with the COVID-19 case data obtained from community testing [9-12]. This is especially useful because sewage surveillance can monitor the extent of the virus spread in the community, independent of voluntary testing, and it has the potential to detect disease outbreaks early [9,13,14]. Several countries have used a WBE approach to monitor SARS-CoV-2, including more than 2500 sampling locations in total [15].

The dataset presented here includes the quantification of the SARS-CoV-2 nucleocapsid gene N1 in wastewater in Scotland, UK, together with data that are complementary to the analysis of the viral concentration in each area.
The SARS-CoV-2 monitoring programme in Scotland has been conducted by Scottish Water and the Scottish Environment Protection Agency (SEPA) since May 2020. The data are released in near-real time on the SEPA public dashboard [16]. Further statistical analysis and aggregation are performed by Biomathematics and Statistics Scotland (BioSS) and presented in weekly reports from the Scottish Government [17]. However, access to data presented in this way is at risk of deteriorating over time. It is therefore imperative to guarantee long-term preservation of such data and metadata whilst ensuring they adhere to FAIR principles (Findable, Accessible, Interoperable and Reusable) [18].

The dataset and related methodologies presented here are the result of a collective effort in curating and securing future access to the outputs of the SARS-CoV-2 monitoring programme in Scotland, while ensuring they are ready to be re-used.

In this dataset, in addition to the quantification of the N1 gene, we included site locations with geographical coordinates, the total population for each site, and the information necessary for data normalisation, such as the incoming wastewater flow values and ammonia concentration, when these were available. The methodology for viral quantification and data analysis is briefly described here, with links to detailed protocols on the online platform protocols.io.

Figure 1 summarizes the overall workflow of the methodology for SARS-CoV-2 quantification in wastewater, from sample collection to data analysis and data sharing. The detailed methodology, from wastewater viral RNA isolation to SARS-CoV-2 detection using RT-qPCR and data analysis, is described in the protocols published on the online protocol sharing platform protocols.io [19][20][21]. A brief description of the methodology for sample collection and quantification of SARS-CoV-2 in wastewater follows.

Sample collection
The wastewater samples were collected by Scottish Water (SW) and its operators at a total of 122 sites, covering all 14 NHS Scotland Health Board areas (Figure 2).

Samples were collected either from sewage influent, using autosamplers over a period of 24 hours, or, in some cases, as grab samples taken via manholes at other locations of the sewage network.

For samples taken at the sewage works influent, an empty bottle was put into the autosampler, and a composite sample was built up over a period of 24 hours from many small portions of the influent (Day 0). Once this period had elapsed, the sample was collected (Day 1) and transported to the SW facility. For network samples collected from a manhole as a grab sample there was no "Day 0", as the sample was transferred directly to the SW facility.

The collected samples were then split in two parts; one part was analysed for its ammonia content while the other part was transferred to SEPA's laboratories on the next working day for SARS-CoV-2 quantification analysis (Day 2). When the number of samples was too high to be analysed at once, the samples were stored and analysed at a later date. The collection and analysis dates were recorded in the dataset (see Data Records section).

The frequency of sample collection was variable, but typically samples were collected once or twice a week. During outbreaks, for example, sample collection was more frequent in some of the health board areas.

medRxiv preprint doi: https://doi.org/10.1101/2022.06.08.22276093; this version posted June 13, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Generally, the samples were analysed for SARS-CoV-2 RNA concentration on the day they were received at SEPA, and the results were reported on SEPA's public dashboard [16] that same evening or the next morning.

Sample analysis: viral RNA extraction and RT-qPCR
The majority of the samples were processed and analysed at SEPA's laboratories, whereas a few samples in January 2021 were processed and analysed by the Roslin Institute, University of Edinburgh.

Prior to processing, the samples were spiked with a known quantity of Porcine Reproductive and Respiratory Syndrome (PRRS) virus as a sample process quality control (see the Technical Validation section for more details). Samples were clarified to remove particles and filter-concentrated to obtain sufficient viral RNA. RNA was then extracted, and SARS-CoV-2 gene copies (as well as the PRRS control) were quantified using one-step RT-qPCR. More specifically, the amplification of the SARS-CoV-2 nucleocapsid gene N1 was monitored. After obtaining the threshold cycle (Ct) values, the gene copies per litre (gc/L) values could be calculated for each sample. Two or three technical replicates were included for each sample. In general, the final reported value represented the mean of all replicates, but in some cases a specific replicate was deemed an outlier and excluded from the average.

Data analysis and description
Depending on the number of gene copies per litre obtained, samples were categorized as "Negative", "Weak Positive", "Positive Detected, Not Quantifiable" ("Positive DNQ") or "Positive". The cut-off values used to apply this description were determined by a standard dilution series analysis, which allowed the identification of the limit of detection (LoD) and the limit of quantification (LoQ) values. The LoD is the value at which the test has been determined to detect the virus material with certainty, and in this case it is currently set to 1,316 gc/L. The LoQ is the value above which the test has been deemed to measure the virus material with a high degree of accuracy, and in this case it is currently set at 11,368 gc/L. The category thresholds were altered slightly from one point within the reported interval to reflect a change in lab practice. These instances can be identified by the category "SEPA-Low Volume" in the "Analysis_Lab" column of the dataset. In those instances, the analysis volume was reduced, which changed the LoD and LoQ values: the LoD increased to 6,640 gc/L and the LoQ decreased to 9,427 gc/L.

If two or more replicates produced no Ct, the result was reported as "Negative" and the N1 reported value was set to zero. If two out of three replicates or one out of two replicates returned a positive signal, but that signal was calculated as lower than the LoD, the sample was reported as "Weak Positive". If the average value of the two or three replicates was between the LoD and the LoQ (i.e. between 1,316 and 11,368 gc/L), the sample was reported as "Positive (DNQ)" as it could not be quantified in a statistically significant way. Finally, a "Positive" description was given where the average of all replicates was above the LoQ and therefore showed a strong, quantifiable result.

In some rare instances where the analysis could not be done due to technical issues, the
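The categorisation rules described in this section can be sketched in a few lines of Python. This is an illustrative reading of the text, not the programme's own code: the function name `classify` is hypothetical, replicates without a Ct are assumed to be passed as zero (as in the dataset), the LoQ is taken as the 11,368 gc/L given in its definition, and the handling of values exactly equal to a threshold is an assumption because the text does not specify it.

```python
from statistics import mean

# Thresholds quoted in the text (gc/L) for the standard analysis volume.
LOD = 1316
LOQ = 11368


def classify(replicates, lod=LOD, loq=LOQ):
    """Return (category, reported value) for two or three replicate gc/L values.

    A replicate that produced no Ct is represented by 0, as in the dataset.
    Boundary handling (<, <=) is an assumption, not taken from the source.
    """
    if len(replicates) not in (2, 3):
        raise ValueError("expected two or three technical replicates")
    no_ct = sum(1 for value in replicates if value == 0)
    if no_ct >= 2:
        # Two or more replicates without a Ct: reported as zero.
        return "Negative", 0.0
    average = mean(replicates)
    if average < lod:
        return "Weak Positive", average
    if average <= loq:
        return "Positive (DNQ)", average
    return "Positive", average


if __name__ == "__main__":
    print(classify([0, 0, 500]))
    print(classify([5000, 6000, 7000]))
    # "SEPA-Low Volume" samples use the altered thresholds from the text.
    print(classify([7000, 8000], lod=6640, loq=9427))
```

Passing `lod` and `loq` as parameters keeps the same function usable for the "SEPA-Low Volume" samples, where the thresholds differ.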

normalisation protocol is provided on protocols.io [21]. In short, to produce a daily value of million gene copies per person, the raw gene copies per litre value is multiplied by the daily flow total and divided by the population served at each site. The flow for a specific site (waterworks location) is either measured directly or estimated, using a method that varies according to the availability of the supporting data.

Cov2_RNA_monitoring_ww_scotland.csv" is provided below:

• Health_Board: name of the NHS Scotland health board for that particular site and sample.
• Site: name of the site where the sample was collected, which corresponds to the name of the wastewater treatment centre. In some cases, the site name is followed by a dash and a specific sewage location within the main treatment centre network. Ex. "Seafield - Western General".
• Date_collected: date at which the wastewater sample was collected.
• Date_analysed: date at which the sample was analysed. In some instances, cells in this column contain "(empty)" because the exact date of the analysis was not available.
• N1_Repl_1-gc_per_L: the gc/L of replicate 1. Blank fields (null values) mean that a technical issue occurred, and a result was not produced for that sample. The value was set to zero when the replicate did not produce a Ct value.
• N1_Repl_2-gc_per_L: the gc/L of replicate 2. Blank fields (null values) mean that a technical issue occurred, and a result was not produced for that sample. The value was set to zero when the replicate did not produce a Ct value.
• N1_Repl_3-gc_per_L: the gc/L of replicate 3. Blank fields (null values) mean that a technical issue occurred, and a result was not produced for that sample. Blank fields (null values) in this case might also mean that a third replicate was not included for that sample. The value was set to zero when the replicate did not produce a Ct value.
• Calculated_mean: the simple mean of the values obtained for all replicates.
• Standard_Deviation: the standard deviation of the values obtained for all replicates.

• Flow-L_per_day: the measured flow in litres per day for that particular site and sample collection date. As described in the methodology, not all samples have a flow measurement associated.
• Ammonia-mg_per_L: the measured ammonia content in milligrams per litre for that particular site and sample collection date. As described in the methodology, not all samples have an ammonia measurement associated.
• pH_value: the pH of the sample, when available.
• Modelled_flow-L_per_day: the modelled flow in litres per day produced according to the methodology described in the normalisation process. This was particularly used when flow measurements were not available for a specific site and date.
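As an illustration of how the flow and population fields feed into normalisation, the short sketch below applies the "in short" description from the methodology: gene copies per litre multiplied by the daily flow and divided by the population served, expressed in millions. The function and parameter names are hypothetical, the fallback from measured to modelled flow is an assumption based on the column descriptions, and the full protocol on protocols.io may include steps not shown here.

```python
def normalise(gc_per_l, population, flow_l_per_day=None,
              modelled_flow_l_per_day=None):
    """Return million gene copies per person per day, or None if not computable.

    Assumption: the measured flow is preferred and the modelled flow is used
    only when no measurement is available for that site and date.
    """
    if flow_l_per_day is not None:
        flow = flow_l_per_day
    else:
        flow = modelled_flow_l_per_day
    if gc_per_l is None or flow is None or not population:
        return None
    # gc/L * L/day = gc/day; divide by people, then scale to millions.
    return gc_per_l * flow / population / 1e6


if __name__ == "__main__":
    # Made-up values for illustration only, not taken from the dataset.
    print(normalise(20_000, 100_000, flow_l_per_day=50_000_000))
    print(normalise(20_000, 100_000, modelled_flow_l_per_day=40_000_000))
```

Returning `None` rather than zero for missing inputs keeps "not computable" distinct from a genuine zero, mirroring how the dataset separates blank fields from zero values.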

The file "prevalence_timeseries.csv" contains data in the traditional timeseries format, i.e., each row corresponds to one sample and collection site, the first few columns provide information about the site, and the subsequent columns store SARS-CoV-2 virus level measurements for each particular sampling date. For example, the column "2020-05-28" contains gene copies per litre from sample collection on 2020-05-28. NA means that a sample was not collected for that site on that date or that the analysis failed.

The file "norm_prevalence_timeseries.csv" is equivalent to "prevalence_timeseries.csv", but the "date" columns contain the normalised virus levels in million gene copies per person per day.

The file "weekly_prevalence_timeseries.csv" contains weekly averaged data of virus levels in the timeseries format, i.e., each row corresponds to one sample and collection site, the first few columns provide information about the site, and the subsequent columns store averaged SARS-CoV-2 virus levels recorded for a week. For example, 2022-6 contains averaged gene copies per litre from samples collected in the week starting on 2022-02-07 (sixth week of the year, second week of February). NA means that no samples were collected in that week for a site or that all analyses for that week failed.

Finally, the file "weekly_norm_prevalence_timeseries.csv" is equivalent to "weekly_prevalence_timeseries.csv", but the "week" columns contain the averaged normalized virus levels in million gene copies per person per day for that week.

As described in the methodology, it took a maximum of two working days from sample collection to reporting the SARS-CoV-2 quantification results on SEPA's public dashboard [16].
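The wide timeseries layout of the files described above can be reshaped into one record per site and date (or week) with the Python standard library alone. The sketch below is illustrative: the helper names are hypothetical, the inline CSV is made-up sample data, "Site" is assumed to be one of the leading site-information columns, and weekly labels such as "2022-6" are assumed to follow ISO week numbering, which is consistent with the example in the text (week 6 of 2022 starting on 2022-02-07).

```python
import csv
import io
from datetime import date


def week_start(label):
    """Monday of a 'YYYY-W' weekly column label, assuming ISO week numbers."""
    year, week = (int(part) for part in label.split("-"))
    return date.fromisocalendar(year, week, 1)


def wide_to_long(csv_text, id_columns):
    """Return a list of (site info, column label, value) for non-NA cells."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        info = {name: row[name] for name in id_columns}
        for label, cell in row.items():
            if label in id_columns or cell in ("", "NA"):
                continue  # NA: no sample collected, or the analysis failed
            records.append((info, label, float(cell)))
    return records


# Made-up sample in the weekly layout; not real programme data.
SAMPLE = "Site,2022-5,2022-6\nExampleSite,NA,12000.5\n"

if __name__ == "__main__":
    for info, label, value in wide_to_long(SAMPLE, ["Site"]):
        print(info["Site"], week_start(label).isoformat(), value)
```

The same `wide_to_long` helper applies to the daily files, where each column label is already an ISO date and can be parsed with `date.fromisoformat`.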
During the total period of monitoring described in this dataset (May 2020 to February 2022), 9.3% of the analysed samples were "Negative", 7.3% were "Weak Positives", 14% were "Positive DNQs", and 63.4% were "Positive". The analysis failed in 6% of the samples.

This includes the sites with highest population such as Seafield (Edinburgh), Shieldhall (Glasgow) and Nigg (Aberdeen) as well as remote sites such as Kirkwall and Lerwick (Orkney and Shetland; Northern Isles) and Stornoway (Western Isles). The normalised data, which describe the number of million gene copies per person per day, show that the wastewater data correlate well with the COVID-19 disease waves reported by local authorities [28] (Fig. 3). Nationally, an increase in cases was observed during the period of December 2020 to February 2021, followed by a period with fewer cases from February 2021 until mid-June 2021, when cases started to rise again. The same pattern is visible in the wastewater data, in the levels of SARS-CoV-2 viral material (Fig. 3). Moreover, in September 2021 a higher peak in case data correlated with a higher number of viral copies in wastewater data, followed by a slight decrease in cases before another peak was observed from the end of December 2021 until February 2022.

The were included for each sample, as described above. Additionally, serial dilutions of template RNA were used to produce standard curves and therefore establish the experimental efficiency and reliability and allow for the absolute quantification of the target gene.
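To make the role of the standard curve concrete, the sketch below fits the usual qPCR relationship Ct = slope × log10(copies) + intercept to a dilution series, derives the amplification efficiency as 10^(−1/slope) − 1, and inverts the curve to estimate copies from a Ct value. The dilution values are made up for illustration, the function names are hypothetical, and the conversion from copies per reaction to gc/L (which depends on sample and extraction volumes) is not covered; the protocols on protocols.io remain the reference for the actual procedure.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Efficiency as a fraction (1.0 means perfect doubling per cycle)."""
    return 10 ** (-1 / slope) - 1


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate template copies per reaction."""
    return 10 ** ((ct - intercept) / slope)


if __name__ == "__main__":
    # Hypothetical ten-fold dilution series (copies per reaction, Ct).
    dilution_copies = [1e5, 1e4, 1e3, 1e2]
    dilution_ct = [20.0, 23.3, 26.6, 29.9]
    slope, intercept = fit_standard_curve(dilution_copies, dilution_ct)
    print(round(slope, 3), round(intercept, 3))
    print(round(amplification_efficiency(slope), 3))
    print(round(copies_from_ct(26.6, slope, intercept)))
```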

Mathematical models using wastewater data can be used as a guide to estimate daily cases of COVID-19. For example, modeling based on wastewater monitoring data has been used by the Scottish Government to estimate the prevalence of infection and the value of R [17]. Wastewater data therefore provide estimates of these values that are independent of other data types. These estimates are combined with others to provide the best overall estimate of prevalence and R values, at the Scottish and UK levels.

The detailed methodology accompanying this dataset, including normalisation methods, can assist with modelling and predicting surges in SARS-CoV-2 or in similar viral diseases in the future. The methodology for SARS-CoV-2 detection presented here can also be adopted by other institutions interested in WBE monitoring or modified to monitor different viruses. Moreover, longitudinal, geospatial data are costly to obtain, especially taking into consideration the logistics of collecting physical samples, and thus warrant appropriate preservation strategies.

Transparency of data regarding the COVID-19 epidemic is crucial when it comes to government decision making and accountability. For example, availability of such data can potentially help researchers trace back the viral levels in the community and relate them to government actions and the following chain of events that took place. Additionally, given that COVID-19 is a new disease with more than fifty long-term effects already reported [29], and potentially many more, these data have the potential to help researchers and medical professionals to correlate and understand surges in certain medical conditions in areas that were particularly affected by the virus.