Psychometric evaluation of an adapted version of the Perceived Stress Scale for Ecological Momentary Assessment Research

Ecological momentary assessment (EMA) methodologies are commonly used to illuminate the predictors and impacts of experiencing subjective stress in the course of daily life. The validity of inferences from this research is contingent on the availability of measures of perceived momentary stress that can provide valid and reliable momentary stress scores. However, studies of the development and validation of such measures have been lacking. In this study, we use an EMA data collection design to examine the within- and between-person reliability and criterion validity and between-person gender measurement invariance of a brief EMA-adapted version of a widely used trait measure of stress: the Perceived Stress Scale (PSS). Scores showed high internal consistency reliability and significant correlations with a range of criterion validity measures at both the within- and between-person level. Gender measurement invariance up to the scalar level also held for scores. Findings support the use of the EMA-adapted PSS presented in the current study in community-ascertained samples to address research questions relating to the influences on and effects of momentary stress and their gender differences.

Ecological momentary assessment (EMA) is an increasingly popular method for capturing individuals' experiences, including events, cognitions, emotions, and behaviours in the flow of daily life. Reliable and valid measurement of these experiences is essential for making correct inferences from these designs; however, the development and psychometric validation of measures for use in EMA designs is a neglected issue (Murray et al., 2020). In the current study, we provide a psychometric evaluation of a measure of a construct commonly studied in EMA designs, namely perceived or 'subjective' stress. Using data from the n=255 Decades-to-Minutes (D2M) EMA study, we examine the within- and between-person reliability and criterion validity and between-person gender measurement invariance of an EMA-adapted version of a popular measure of perceived stress: the Perceived Stress Scale (Cohen, 1988).
EMA designs are argued to have a number of important advantages over traditional survey-based questionnaire methods, including greater ecological validity, a reduced reliance on retrospective recall, and the possibility of constructing indices of variation and covariation between constructs (e.g., emotional lability or stress reactivity) and examining their variation across individuals (Russell & Gajos, 2020). These advantages have proven particularly valuable within the mental health field, where an additional advantage commonly cited is the ability of EMA to provide a foundation for and inform ecological momentary interventions (e.g., Balaskas et al., 2021). However, the validity of inferences drawn from EMA designs and, by extension, these advantages depend on achieving reliable and valid scores from the measures used in EMA designs. For example, when scores are unreliable, it is unclear whether null effects reflect a true lack of effect or excessive measurement error variance in scores on the constructs involved. Similarly, the interpretation of effects in EMA should be built on an understanding of how the scores for EMA measures are related to measures of other constructs to which they are, theoretically, more or less closely related (i.e., of their criterion, divergent, and nomological network validity) (Cronbach & Meehl, 1955; Mokkink et al., 2010).
However, despite the importance of utilising EMA measures with strong psychometric properties, there has been a lack of attention paid to the development and validation of EMA measures (Dubad et al., 2018;Murray et al., 2020).Many studies either employ or adapt traditional 'trait' measures without examining their psychometric properties in an EMA context.However, it cannot be assumed that trait measures can be applied in EMA contexts without consequence for their psychometric properties.For example, the value of EMA designs depends on the timescales over which constructs vary and a measure that is reliable for capturing individual differences may nonetheless show insufficient within-person variation to provide reliable scores at the within-person level (Bolger & Laurenceau, 2013).If the behaviour/event is rare enough, it may also provide a poor measure of between-person differences in the construct due to a lack of observations during the EMA measurement period.Another common approach is to use bespoke measures with minimal testing of their psychometric properties, meaning that it is difficult to be sure that the measures are capturing the relevant EMA constructs reliably.However, irrespective of the item development approach, it is important to ensure that any EMA measure used to investigate within-and between-person associations and differences provides valid and reliable scores at both the within-and between-person level.
Noting the importance of reliable and valid EMA measure scores, a small number of studies have begun to specifically develop and/or validate EMA measures and discuss their psychometric properties (Borah et al., 2018; Forkmann et al., 2018; Jimenez et al., 2022; Mejía et al., 2014; Murray et al., 2020; Versluis et al., 2021; Wieland et al., 2018). For example, Murray et al. (2020) developed and validated a measure of aggression for use in EMA research. They began by generating a large number of candidate items that were reduced and refined to a candidate measure, based first on expert content validity assessments and a pilot data collection (Borah et al., 2018). They evaluated the within- and between-person reliability (Geldhof et al., 2014), criterion validity, and between-person measurement invariance of a four-item measure of momentary aggression, the 'Aggression-ES-A', in a larger sample, finding that these properties were generally supported. Versluis et al. (2021) modified the Levels of Emotional Awareness Scale (LEAS), typically used to assess trait emotional awareness, for use within an EMA context. Alongside the original LEAS items, the measure was recontextualized to assess state emotional awareness by asking participants to describe how they felt in their current social situation, as well as indicate how they believed another person felt in the situation. Higher use of expressive emotional words in each description was coded as higher emotional awareness. The authors demonstrated within- and between-person reliability in the amended LEAS. They also found that a large proportion of the variance (50%) in emotional awareness was accounted for by state scores, whereas trait scores were found to explain only 2%. Another study by Jimenez et al. (2022) addressed the need for psychometrically robust EMA measures which capture both distress and anhedonia. The brief Dysphoria and Well-Being EMA measures were adapted from the Inventory of Depression and Anxiety Symptoms and validated by conducting principal factor analyses and internal consistency analyses on aggregated cross-sectional datasets (N=8876). The two EMA scales were further evaluated in an EMA design among 279 college students at the within- and between-person levels, with results indicating that both scales showed acceptable to good internal consistency, strong criterion validity, and generally adequate discriminant validity (Jimenez et al., 2022). Besides this, Forkmann et al. (2018) evaluated the psychometric properties of an item set for the assessment of suicidal ideation (i.e., passive and active suicidal ideation) and relevant proximal risk factors (e.g., anxiety, depression). This item set, designed for use in EMA research within suicidology, includes 28 items that were selected from previously developed measures or were newly developed. Based on a sample of psychiatric inpatients with depressive disorders, they found that all items captured moment-to-moment variability and substantial within-person variance, and generally showed satisfactory reliability at the within- and between-person level, as well as criterion validity.
These studies demonstrate the increasing awareness of the importance of rigorously evaluating the psychometric properties of measures used in EMA studies. However, a key construct for which there is a remaining need for further psychometric development/evaluation is perceived momentary stress. Momentary stress measures have been used in EMA studies to address a diversity of research questions relating to the predictors and impacts of stress in the course of daily life (Beute & de Kort, 2018; Do et al., 2021; Dunton et al., 2019; Huckins et al., 2020; Kou et al., 2020; Lazarides et al., 2020; Mennis et al., 2018; Speyer et al., 2021). Kou et al. (2020), for example, used geographically explicit EMA (GEMA) to examine the links between noise exposure and psychological stress. They found that objective measures of noise exposure were associated with higher stress and this effect was mediated by subjective stress. In the current sample, Speyer et al. (2022) investigated the role of moment-to-moment dynamics of perceived stress and negative affect in co-occurring ADHD and internalising symptoms. They found that ADHD traits were associated with increased stress reactivity, with stress carry-over further mediating the association between ADHD symptoms and internalising problems. Dunton et al. (2019) measured momentary stress and found that higher levels of maternal momentary stress were related to lower levels of parenting behaviours that encouraged child physical activity. However, no momentary studies of stress have to date provided a comprehensive evaluation of the psychometric properties of the scores from measures used in EMA stress research.
In traditional survey designs, a popular and well-validated measure of perceived stress is the Perceived Stress Scale (PSS; Cohen, 1988).Both the original 14 item, as well as the abbreviated 10 and 4 item versions of the PSS assess respondents' levels of perceived stress (e.g., '…felt that you were unable to control the important things in your life?') and perceived coping (e.g., '…felt confident about your ability to handle your personal problems?')over the past month.In the context of traditional administration formats, the scale has undergone extensive psychometric evaluation across a range of settings and in numerous languages (for a review see Lee, 2012) showing overall good internal consistency (Cronbach's alpha >.7) and high test-retest reliability (ICC >.7).An emerging consensus supports a two subscales structure measuring perceived stress and perceived coping as distinct but correlated dimensions (Bastianon et al., 2020;J. M. Taylor, 2015).
The importance of examining sex/gender invariance in the PSS has also been recognised. Sex and gender differences in stress processes have been a topic of considerable interest in the stress literature, for example, implicated as a mechanism underlying differences in vulnerabilities to some mental and physical health conditions (Bale & Epperson, 2015). A review by Lee (2012) concluded that there were inconsistencies by gender in reporting stress using the PSS. However, robust evidence on sex/gender differences is critically dependent on establishing that the PSS can capture stress in a comparable fashion across males and females. Otherwise, observed differences in stress may partially or wholly reflect differences in the manner in which different sexes or genders understand and/or respond to (e.g., with different response styles) the items. Gender or sex invariance has been explored in several psychometric studies of the PSS (e.g., Denovan et al., 2019; Lavoie & Douglas, 2012; Liu et al., 2020), generally (albeit with some exceptions) supporting invariance up to the scalar level.
Given the widespread use of the PSS and the existing evidence for its psychometric properties as a means of measuring perceived stress in traditional survey designs, it is a natural instrument to use as a basis for adaptations to EMA designs.We thus here explored the adaptation of the PSS to an EMA version for use in EMA research.The PSS was adapted as part of the D2M EMA study and administered to a sample of n=255 respondents.

Participants
Participants in the current study represented a sub-sample of the Zurich Project on the Social Development from Childhood to Adulthood (z-proso) study sample (Ribeaud et al., 2022).The underlying z-proso sample is from a longitudinal cohort study based in Zurich, Switzerland.The z-proso study began when its participants were entering primary school, aged 7, in 2004.These participants were then followed up in a series of main data collection waves at ages 8,9,10,11,12,13,15, 17 and 20.Z-proso participants were selected at the age 7 baseline using a stratified sampling procedure whereby schools were the sample units and stratification was used to ensure adequate representation of schools from geographical regions varying in socioeconomic background.The overall target sample for z-proso was 1675, with 1571 participants providing data for at least one of its main waves.Previous analyses of non-response and attrition have suggested that aside from a slight underrepresentation of youth from an immigrant background, the sample can be considered to have suffered little non-random participation/attrition (Eisner et al., 2018).
The Decades-to-Minutes (D2M) study (see Murray et al., 2022) took place shortly after the age 20 main wave of z-proso in 2018.During the data collection for this z-proso wave, participants were invited to take part in an additional EMA sub-study.Due to budget constraints, we selected only a proportion of the participants from the subset of participants who agreed to be contacted about D2M (of whom n=255 provided sufficiently complete data to be included in the current study).Monte Carlo power analyses based on similar models to those fit in the present study support this sample size as sufficient to achieve high levels of statistical power (A.L. Murray et al., 2020).This sub-sample has been compared with the main cohort to find that there were more females in the D2M sub-sample (62% of female in D2M compared with 49% in the main sample).D2M participants also have a slightly higher socioeconomic status based on the maximum household ISEI (Ganzeboom et al., 1992) (p<.001), are slightly lower in self-reported aggression based on age 20 aggression questionnaires [t(516.7)=-2.92,p=.004] and are slightly higher on stress [t(440.48)=2.78,p=.006] compared with the main cohort.However, they did not differ significantly from the main cohort on ADHD symptoms [t(434.85)

Ethics
Ethical approval for z-proso and D2M was obtained from the Faculty of Arts and Social Science's Ethics Committee at the University of Zurich.Written informed consent was obtained from participants prior to data collection.

Data collection procedure
After providing informed consent, and with the help of instructions provided by the study team, participants downloaded an EMA application provided by LifeDataCorp LLC on their own smartphones. Over the next 14 days, they received prompt notifications via this application four times a day at quasi-random intervals (randomly within four pre-specified periods of the day). This schedule was chosen to ensure adequate coverage of the day but without using fixed data collection times that could allow participants to anticipate the arrival of prompts and potentially change their behaviour as a result. The data collection times were restricted to be between 10am and 10pm. A 14-day period was selected to ensure that participants provided sufficient numbers of observations to provide good statistical power for a range of relevant models and to allow multiple weekend days to be included.
The EMA protocol was developed in collaboration with and implemented by the Decision Science Laboratory (DeSciL) at ETH Zurich: https://www.descil.ethz.ch/. The tokens of appreciation provided were set proportional to the individual response rates achieved. Participants could earn up to a maximum of 50 CHF (1 CHF is worth approximately 1 USD) if they downloaded the application and achieved a response rate of >70% for both weeks 1 and 2 of the EMA schedule.
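As an illustration of the prompting scheme described above (four prompts per day over 14 days, each drawn at random within one of four periods between 10am and 10pm), the following is a minimal sketch only. The equal three-hour windows, the start date, the random seed, and the function names are assumptions made for illustration; the study states only that four pre-specified periods were used, and this is not the scheduling code of the EMA application.

```python
import random
from datetime import datetime, timedelta


def draw_daily_prompts(day, rng, start_hour=10, end_hour=22, n_windows=4):
    """Draw one random prompt time inside each of n_windows blocks of the day.

    Equal-width windows are an assumption for illustration; the study
    reports four pre-specified periods without giving their boundaries.
    """
    window_minutes = (end_hour - start_hour) * 60 // n_windows
    day_start = datetime(day.year, day.month, day.day, start_hour)
    prompts = []
    for w in range(n_windows):
        # Random minute offset within the w-th window.
        offset = w * window_minutes + rng.randrange(window_minutes)
        prompts.append(day_start + timedelta(minutes=offset))
    return prompts


def draw_schedule(first_day, n_days=14, seed=0):
    """Build a list of daily prompt lists for an n_days EMA period."""
    rng = random.Random(seed)
    return [draw_daily_prompts(first_day + timedelta(days=d), rng)
            for d in range(n_days)]


# Hypothetical start date, used only to make the sketch runnable.
schedule = draw_schedule(datetime(2018, 1, 1))
for day in schedule[:2]:
    print([p.strftime("%H:%M") for p in day])
```

Drawing one prompt per window keeps coverage of the whole day while leaving the exact arrival time unpredictable to the participant.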

PSS-EMA
The EMA version of the Perceived Stress Scale (PSS) was adapted from the full traditional survey version of the PSS (Cohen, 1988). Adaptations included the selection of only four items from the perceived stress subscale (in order to ensure a brief enough item set that could be completed within the context of brief EMA surveys) and the rephrasing of the item stems to refer to the previous 30 minutes. A single stem was used, 'In the last 30 minutes, I felt…', followed by four PSS items: '…that I was unable to control the important things in my life'; '…nervous and 'stressed''; '…I could not cope with all the things I had to do'; '…difficulties were piling up so high that I could not overcome them'. Responses were recorded on a five-point Likert-type scale from very slightly or not at all to extremely. This response scale was used to be consistent with the Positive Affect Negative Affect Schedule Expanded Version (PANAS-X) (Watson & Clark, 1999), which is commonly used in EMA studies, including the present study. Due to the strong need to minimise the burden of EMA surveys, it can be advantageous to use harmonised response scales for different measures wherever possible. The reference frame of the previous 30 minutes was selected to balance the need to capture the most proximal experiences of participants (and minimise reliance on retrospective recall) while sampling a long enough timeframe to capture rarer events (e.g., experiencing a social provocation). It also allowed for comparability across prompts given their quasi-random administration schedule. Specifically, given that the intervals between prompts differed, asking respondents about experiences 'since the last prompt' would have yielded different reference periods across different prompts. The EMA measures administered can be found at: https://osf.io/85ax3/. The intraclass correlation coefficients (ICC) measuring the proportion of variation at the within- versus between-person levels for the four PSS items were: .42, .51, .46, and .52, indicating variation at both levels.
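To make the ICC reported above concrete, the sketch below computes a one-way random-effects ICC for a single item from repeated observations nested within persons. The ANOVA-based estimator is an assumption chosen for transparency; the reported ICCs may have been obtained from the variance components of a multilevel model instead. The function name and the toy data are hypothetical.

```python
from statistics import mean


def icc1(scores_by_person):
    """One-way random-effects ICC from an ANOVA decomposition.

    scores_by_person: dict mapping a person id to that person's list of
    momentary item scores (lists may differ in length).
    Returns the estimated share of variance lying between persons.
    """
    groups = [g for g in scores_by_person.values() if len(g) > 0]
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Effective number of observations per person for unbalanced data.
    k0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (n_groups - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


# Toy data (not study data): three persons, four prompts each.
toy = {"p1": [1, 2, 1, 2], "p2": [3, 4, 3, 4], "p3": [5, 4, 5, 4]}
print(round(icc1(toy), 3))
```

An ICC near .5, as reported for the PSS-EMA items, means that roughly half of the item variance reflects stable differences between people and half reflects fluctuation within people over time.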

Between-person criterion validity measures
Between-person criterion validity measures were selected from those available from the most proximal main survey wave of z-proso (i.e., age 20), based on prior theory or empirical evidence suggesting a plausible connection with momentary stress levels.
Measures included here that were expected to be significantly associated with the PSS-EMA scores were: perceived stress measured using a traditional survey method (Cohen, 1988), anxiety and depression (e.g., Pêgo et al., 2009), suicidal ideation (e.g., Zhang et al., 2012), self-harm (e.g., Madge et al., 2011), bullying victimisation (e.g., González-Cabrera et al., 2017), and intimate partner violence victimisation (Yim & Kofman, 2019).These measures were administered earlier in the same year as the EMA measures; however, given the tendency for these psychological and behavioural constructs to show (varying degrees of) stability in their individual differences over the timescales of the main and EMA study (A.L. Murray, Eisner, & Ribeaud, 2019;Schönfeld et al., 2019;Van Dulmen et al., 2012;Zhu et al., 2022), we anticipated that they would positively predict later EMA-measured stress.
Perceived stress was measured at the age 20 wave of z-proso using a subset of four items from the PSS (Cohen, 1988). The same four items as those used in the EMA-adapted PSS were used. Responses to these items were recorded on a 5-point Likert-type scale from never to very often. Global perceived stress scores were estimated within a measurement model in which a single general perceived stress factor was defined by the four PSS items (ω = .88).
Anxiety was assessed at the age 20 wave of z-proso using the self-reported Social Behavior Questionnaire (SBQ; Tremblay et al., 1991). Four items of the SBQ measured anxiety (e.g., "I was worried") and were rated on a five-point scale ranging from never to very often. Global anxiety scores were estimated within a measurement model with a single latent factor (ω = .76).
Depression was measured at the age 20 wave of z-proso using four items (e.g., "I was sad without knowing why") from the SBQ (Murray et al., 2017; Tremblay et al., 1991). Responses are recorded on a five-point scale from never to very often. Global depression scores were estimated within a measurement model with a single latent factor (ω = .93).
Suicidal ideation was assessed at the age 20 wave of z-proso using an item that asked participants how often they had thought about suicide during the past month, providing ratings on a five-point scale ranging from never to very often.
Self-harm was measured at the age 20 wave of z-proso using a single question that asked how often participants had intentionally self-injured (e.g., cutting an arm, tearing wounds open) during the last month (Steinhoff et al., 2021). Participants reported their self-injury frequency on a five-point scale ranging from never to very often.
Bullying victimisation was measured using the 4-item Zurich Brief Bullying Scale (ZBBS; Murray et al., 2019). The frequency of four forms of victimisation (i.e., physical, verbal, social, and property damage) over the past 12 months was evaluated using a six-point scale: 1=never, 2=1 to 2 times, 3=3 to 10 times, 4=about once a month, 5=about once a week, and 6=(almost) every day. Global bullying victimisation scores were estimated within a measurement model with a single latent factor based on four ZBBS items (ω = .67).
Intimate partner violence (IPV) exposure was measured at the age 20 wave of z-proso using a multi-dimensional scale which asked respondents about their exposure to physical violence (6 items), psychological violence (3 items), sexual violence (4 items), and monitoring (4 items) by a current or former intimate partner (Schuster et al., 2021). The items are adapted from Taylor and Woods (2011) and Zweig et al. (2013). Responses are recorded on a 4-point Likert-type scale from never to over nine times for a reference period of the last 12 months. Scores were estimated within a four-dimensional oblique confirmatory factor analysis (CFA) model, with the four dimensions corresponding to the four subscales of the scale. ω values for physical violence, psychological violence, sexual violence, and monitoring were .55, .37, .66, and .85, respectively. These measures were only available for the subset of participants in a romantic relationship (n=775, covariance coverage with EMA measures = 48%).

Within-person criterion validity measures
We hypothesised that provocations and negative affect would be significantly and positively correlated with stress at both the within-and between-person levels (Jacobs et al., 2007;Joseph et al., 2021).
Provocation was measured using a 4-item scale developed by the study team for use in EMA studies of aggression (Murray et al., 2020). The scale is designed to capture momentary exposure to provocations commonly experienced in daily life with reference to the previous 30 minutes: encountering interference to goal pursuit, interpersonal conflict, angry rumination on a past event, and being insulted. Responses are recorded on a 4-point Likert-type scale from strongly agree to strongly disagree. In this study, provocation items were averaged to provide a composite measure of the level of provocation experienced (as opposed to being used within a latent variable measurement model), as the items are designed for formative rather than reflective measurement (Murray & Booth, 2018).
Negative affect was measured using an abbreviated version of the negative affect scale from the Positive Affect Negative Affect Schedule Expanded Version (PANAS-X; Watson & Clark, 1999). The version administered in D2M used the items 'afraid', 'scared', 'hostile', 'guilty', 'ashamed', 'upset', and 'distressed' and asked respondents to report on their experiences of these affective states in the previous 30 minutes. Responses are recorded on a 5-point Likert-type scale from extremely to very slightly or not at all. Global negative affect scores were estimated within a two-level CFA model with a single latent factor at each level. ω reliability was .81 at the within-person level and .97 at the between-person level.

Statistical Procedure
Full analysis code for the statistical analyses described below can be found at: https://osf.io/5wyjb/.

Missingness treatment
To deal with missingness in the EMA, we did not impose compliance thresholds as this can risk biasing analyses given that compliance in EMA can be related to respondent characteristics (A.L. Murray, Yang, et al., 2022).As such, all observations with some relevant data on the measures were included in each model.For the analyses involving the main survey measures as well as the EMA we likewise used all available observations from the main survey (See Ns in Supplementary Materials Table S2).This was to maximise the information available to estimate the latent factor levels for the criterion validity measures and to help adjust for any non-random selection from the main sample into the D2M sample.
Beyond this, item-wise missing data was dealt with using full information maximum likelihood estimation (FIML), which provided unbiased parameter estimates under an assumption of missing at random.

Within-and between-person factorial validity and reliability
Factorial or 'structural' validity refers to whether the factor structure of a scale conforms to that implied by the design of the measure (e.g., do the items form a single dimension or specified sub-dimensions?) (Mokkink et al., 2010). To evaluate the within- and between-person factorial validity and reliability, we fitted a multi-level (two-level) confirmatory factor model (see https://osf.io/2whax). This model is depicted in Figure 1. In this model, observations are clustered within individuals and the within-person part of the model is based on the correlations within time-points between different PSS items, while the between-person part of the model is based on the correlations of the PSS item individual averages across individuals. As the correlations pertain to associations within rather than across time points, it is not necessary to explicitly model the time lags. However, it is important to note that the results could depend on the chosen time lag and study period. For example, if too short a time lag is chosen, there may be insufficient time for systematic within-person variation in stress to manifest across the different measurement instances and within-person reliability will prove to be poor. That is, in the case of a very short time lag, the covariation between the items that is used to estimate within-person reliability will be limited by the lack of opportunity for the items to vary across time. Similarly, if the study period is very long, the average PSS levels may include some developmental change which could, if not consistent across individuals, attenuate the between-person associations.
A unidimensional model with a single latent stress factor was specified at both levels of a two-level model. ω reliability at the within and between level was calculated from the parameter estimates from this model (the within-person reliability was based on the within-level parameter estimates and the between-person reliability was based on the between-level parameter estimates), following Geldhof et al. (2014). This provides measures of internal consistency reliability that take into account both the multi-level structure of EMA data and the fact that item loadings on the relevant within- and between-person latent variables are unlikely to be equal across items. This latter feature makes ω preferable to Cronbach's alpha, which assumes equality of loadings across items. Within-level ω estimates internal consistency reliability when measuring across instances within people and is important for ensuring unbiased estimates of the associations between perceived stress and other variables within people across time, for example, assessing whether perceived stress is lower when individuals are in greenspace (Mennis et al., 2018). The between-level ω estimates internal consistency reliability when measuring individuals' overall levels of stress. This is important for examining sources of individual differences in perceived stress, such as whether individuals with higher levels of depression or ADHD symptoms tend to experience higher overall levels of perceived stress (Speyer et al., 2022).
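As a worked illustration of how ω can be obtained from the parameter estimates of a unidimensional factor model, the sketch below applies the standard composite reliability formula, ω = (Σλ)²ψ / ((Σλ)²ψ + Σθ), separately to one level at a time. The loadings and residual variances are hypothetical values chosen for illustration, not the estimates from the present study, and the function name is an assumption.

```python
def omega(loadings, residual_variances, factor_variance=1.0):
    """Composite reliability (omega) for a single-factor model.

    loadings: factor loadings of the items on the latent factor.
    residual_variances: item residual (unique) variances.
    factor_variance: variance of the latent factor (1.0 if standardised).
    Apply once with within-level estimates and once with between-level
    estimates to obtain level-specific reliabilities.
    """
    common = (sum(loadings) ** 2) * factor_variance
    return common / (common + sum(residual_variances))


# Hypothetical standardised loadings for four items (not study estimates).
example_loadings = [0.6, 0.7, 0.8, 0.7]
# In a standardised solution each residual variance is 1 - loading**2.
example_residuals = [1 - l ** 2 for l in example_loadings]
example_omega = omega(example_loadings, example_residuals)
print(round(example_omega, 3))
```

Because the formula sums the loadings before squaring, items are allowed to contribute unequally, which is the property that distinguishes ω from Cronbach's alpha.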

Within-and between-person criterion validity
Criterion validity refers to the associations between the scores of the focal measure and the scores from other measures with which it is expected to be associated. These measures could be administered at the same time (thus measuring 'concurrent validity') or at a different time (e.g., 'predictive validity') (Mokkink et al., 2010). To evaluate the within- and between-person criterion validity of the EMA-adapted PSS, we extended the two-level CFA described above to two-level structural equation models (SEM) incorporating the within- and between-person criterion validity measures mentioned above at the relevant levels (for full modelling details see 'Criterion validity output files' at: https://osf.io/5wyjb/). The EMA measures (provocation and negative affect as well as the PSS EMA items) were modelled at both the within- and between-person level. The main survey measures were included only at the between-person level, reflecting the fact that they were measured only once (in these models, the PSS EMA items were modelled at both the within- and between-person levels).
Separate models were fit for each criterion validity measure.For the measures used to assess between-person criterion validity, a single factor latent variable model was used as the measurement model for the relevant constructs with the exception of IPV.For IPV we used an oblique factor model with correlated physical IPV, sexual IPV, psychological IPV and monitoring IPV factors.

Gender measurement invariance
We examined between-person gender measurement invariance in the PSS (e.g., Murray et al., 2021). We used a multi-group CFA approach beginning with a configural model (see: https://osf.io/3brxu) in which only the minimal cross-group constraints necessary for identification were imposed. Here, the latent factor mean and variance for males were fixed to 0 and 1 respectively and the loading and intercept for the first item were fixed equal across groups. We then tested metric invariance by imposing cross-group constraints on loadings (see: https://osf.io/a2g6e). A significant χ² difference test was taken as an indication of a lack of metric invariance. If necessary, modification indices and expected parameter changes were inspected to guide the iterative removal of constraints to attempt to achieve partial metric invariance. If at least partial metric invariance could be achieved, scalar invariance was tested with the addition of cross-group constraints on intercepts (see: https://osf.io/xp5dm). A significant χ² difference test was again taken as an indication of a lack of measurement invariance. In this case, modification indices and expected parameter changes were inspected to guide the iterative removal of constraints in an attempt to achieve partial measurement invariance.
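The nested-model comparisons described above can be sketched as follows. Because χ² values from robust maximum likelihood cannot simply be subtracted, this sketch uses the Satorra-Bentler scaled difference, which combines each model's scaled χ² with its scaling correction factor; that this exact variant was used in the present analyses is an assumption, and the fit values in the example are hypothetical. The helper names are also illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def scaled_chi2_difference(t0, df0, c0, t1, df1, c1):
    """Satorra-Bentler scaled chi-square difference test.

    t0, df0, c0: scaled chi-square, df, and scaling correction factor of the
                 nested (more constrained) model, e.g. the metric model.
    t1, df1, c1: the same quantities for the less constrained model,
                 e.g. the configural model.
    Returns (difference statistic, difference df, p-value).
    """
    ddf = df0 - df1
    cd = (df0 * c0 - df1 * c1) / ddf          # difference-test scaling factor
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, ddf, chi2_sf(trd, ddf)


# Hypothetical fit values (not study results): metric vs configural model.
stat, ddf, p = scaled_chi2_difference(t0=9.8, df0=7, c0=1.30,
                                      t1=5.1, df1=4, c1=1.25)
print(round(stat, 2), ddf, round(p, 3))
```

A p-value below the chosen threshold would indicate that the added equality constraints worsen fit, prompting the inspection of modification indices described above. Note that the scaled difference can occasionally be negative in small samples, in which case it is not interpretable.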

Model estimation
All models were estimated using robust maximum likelihood (MLR) estimation in Mplus version 8.4 (Muthén & Muthén, 2015).MLR is a robust version of FIML that, therefore, provides unbiased parameter estimates under the assumption of 'missing at random', i.e., that missingness is random conditional on the modelled variables (Rubin, 1976).

Results
Descriptive statistics are provided in Table S1 of Supplementary Materials for the EMA measures and Table S2 for the main survey measures. There were approximately 8600 observations (individuals × time points) for the EMA measures. Full model results (output files) for all analyses described below are provided at: https://osf.io/5wyjb/. Some cases showed no variation on some of the EMA measures. These respondents contribute limited information and their presence in a dataset can sometimes cause convergence issues for some models applied to EMA data, such as dynamic structural equation models (Asparouhov & Muthén, 2022); however, because our analyses were limited to two-level CFA/SEM models and no convergence issues arose, these cases were retained. Results for specific analyses are discussed in turn below.

Factorial validity and reliability
A two-level CFA model for the PSS EMA items with a single latent factor at both the within- and between-person levels fit well by conventional criteria (CFI=.993, TLI=.979, RMSEA=.025, SRMR=.011 at the within level and .024 at the between level). The standardised within- and between-person factor loadings for this model are provided in Table 1 and ranged from .58 ('I was unable to control the important things in my life') to .82 ('I could not cope with all the things that I had to do') at the within-person level and .82 ('I was unable to control the important things in my life') to .99 ('difficulties were piling up so high that I could not overcome them') at the between-person level. All were statistically significant at p<.001. ω reliability, reflecting internal consistency reliability estimated from the factor loadings, was .83 at the within-person level and .96 at the between-person level. These high internal consistency values (>.70) suggest acceptable reliability for measuring both individual differences and within-person variations in perceived stress for the PSS scores. Taken together, these results suggested factorial validity for a unidimensional PSS score model as well as internal consistency reliability at both the within- and between-person level.
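To make the ω calculation concrete, the following is a minimal sketch of how an omega coefficient can be obtained from the loadings and residual variances of a single-factor model at one level (within or between). It assumes the standard single-factor formula, with the factor variance fixed to 1. The loadings used are hypothetical round numbers, not the Table 1 estimates, and the function names are illustrative.

```python
def omega(loadings, residual_variances):
    """Omega for a single-factor model: common variance / total variance."""
    common = sum(loadings) ** 2  # variance due to the factor (factor variance = 1)
    return common / (common + sum(residual_variances))


def omega_from_standardised(loadings):
    """With standardised loadings, each residual variance is 1 - loading**2."""
    return omega(loadings, [1 - l ** 2 for l in loadings])


# Hypothetical standardised loadings for a 4-item scale (illustration only).
print(round(omega_from_standardised([0.6, 0.7, 0.8, 0.8]), 3))
```

In a two-level model the same function would be applied separately to the within-person and the between-person estimates, which gives one ω per level.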

Criterion validity
The correlations between the PSS EMA latent variable and the criterion validity measures discussed in the 'Measures' section are provided in Table 2. The PSS latent variable at the within- and between-level was significantly associated in the expected direction with all criterion validity measures at the corresponding (within- or between-) level, except intimate partner violence and self-injury at the between-person level. The PSS EMA latent variable was most strongly associated with negative affect at both the within-person (r=.77) and between-person level (r=.88), likely reflecting their concurrent measurement and conceptual proximity.
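The distinction between within- and between-person correlations can be illustrated with a simplified observed-score analogue: each repeated measure is split into a person mean (between-person part) and deviations from that mean (within-person part), and the two parts are correlated separately. This is only a sketch. The analyses reported here used latent variables in a two-level SEM, which additionally account for measurement error and for sampling error in the person means. The data and function names below are hypothetical.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_levels(person_ids, values):
    """Return person means (between part) and deviations from them (within part)."""
    by_person = defaultdict(list)
    for pid, v in zip(person_ids, values):
        by_person[pid].append(v)
    means = {pid: sum(vs) / len(vs) for pid, vs in by_person.items()}
    deviations = [v - means[pid] for pid, v in zip(person_ids, values)]
    return means, deviations


# Hypothetical data: 3 people x 3 prompts (not from the D2M study).
ids = [1, 1, 1, 2, 2, 2, 3, 3, 3]
stress = [1, 2, 3, 2, 3, 4, 4, 5, 6]
neg_affect = [2, 4, 3, 1, 2, 3, 5, 6, 7]

stress_means, stress_dev = split_levels(ids, stress)
na_means, na_dev = split_levels(ids, neg_affect)

people = sorted(stress_means)
r_between = pearson([stress_means[p] for p in people], [na_means[p] for p in people])
r_within = pearson(stress_dev, na_dev)
print(round(r_within, 3), round(r_between, 3))
```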

Gender measurement invariance
Fit statistics for the series of models (configural, metric, scalar) fit to evaluate gender invariance are provided in Table 3. The configural model fit well by conventional standards, with CFI=.99, TLI=1.0, RMSEA=.025, SRMR within=.009 and SRMR between=.0015. The addition of cross-group equality constraints on the within- and between-person factor loadings did not lead to a significant deterioration in fit [Δχ²(6)=3.993, p=.678], supporting metric invariance across males and females. Similarly, the addition of cross-group equality constraints on the intercepts did not result in a significant deterioration in fit [Δχ²(3)=3.201, p=.362], suggesting that scalar invariance held. Taken together, these results suggested that gender invariance held up to the scalar level for the EMA-adapted PSS.
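The p-values attached to the difference tests above are upper-tail chi-square probabilities for the reported statistic and degrees of freedom. As a quick check, the sketch below computes that tail probability for integer degrees of freedom using only the Python standard library (closed-form sums, not a statistics package). The function name is illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)**i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum
    # with half-integer gamma terms.
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        total += half ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# Statistics as reported for the metric and scalar steps.
print(round(chi2_sf(3.993, 6), 3))  # metric step, reported p=.678
print(round(chi2_sf(3.201, 3), 3))  # scalar step, reported p=.362
```

Both values are well above .05, consistent with the conclusion that neither set of equality constraints significantly worsened fit.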

Supplementary Analyses
Given that one PSS item had a lower loading than the others, we repeated the reliability and criterion validity analyses with this item removed. Results of this analysis are provided in Supplementary Materials and suggested that the omega and criterion validity values remained high without this item. This suggests that the item can be dropped where an even briefer perceived stress measure is desired.

Discussion
The purpose of the current study was to examine the psychometric properties of an EMA measure of perceived stress based on an adaptation of a popular and widely evaluated traditional survey measure. We found evidence for the factorial validity, internal consistency reliability, and criterion validity of an EMA-adapted Perceived Stress Scale (PSS; Cohen, 1988) at the within- and between-person level. Both a 3- and a 4-item version were supported, meaning that a shorter version could be used when the EMA questionnaire space is more limited. The EMA PSS scores were not significantly correlated with either self-injury or IPV victimisation at the between-person level; however, the IPV analyses were likely impacted by the low internal consistency reliability of the IPV measures and the fact that they were available only for the participants who were in a romantic relationship. We also found that gender measurement invariance held for its scores up to the scalar level. Overall, these results support the use of the EMA-adapted PSS evaluated here for testing hypotheses regarding momentary stress in community-ascertained samples. This includes comparisons across males and females to evaluate gender differences in stress levels and processes in daily life.
These findings are important for underpinning future rigorous studies of stress in daily life by offering evidence for a psychometrically robust item set that can be used in future research. Previous studies have demonstrated the types of insights that can be gained from evaluating stress from a momentary experience perspective in EMA designs (e.g., Beute & de Kort, 2018; Jahnel et al., 2019; Omowale et al., 2021; Speyer et al., 2021).
However, there remains considerable work to be done to illuminate its momentary influences and effects, including how cumulative momentary stress relates to the long-term development of mental and physical health issues (e.g., in multi-timeframe studies; A. L. Murray et al., 2022). Its relation to physiological markers of stress in daily life also remains to be fully explored. Preliminary evidence suggests that salivary cortisol levels may be associated with subjective stress only at the within- but not at the between-person level (Lazarides et al., 2020). However, the study from which this conclusion was drawn used a sample of pregnant women and hypothalamic-pituitary-adrenal (HPA) axis functioning is known to be impacted by pregnancy (Duthie & Reynolds, 2013). Answering these remaining research questions will be facilitated by the availability of a momentary measure of stress that produces reliable, valid, and gender measurement invariant scores.
Our study adds to the still relatively limited evidence base on the psychometric functioning of measures used in EMA contexts (Dubad et al., 2018; Murray et al., 2019).
Although reliable and valid measurement is just as crucial in the context of EMA as it is in traditional survey methods, psychometric development and validation have received considerably less attention in the former context. This may reflect an assumption that measures that have been validated in traditional survey contexts can be applied with confidence in EMA contexts. However, it is important that the reliability and validity of scores from EMA-adapted trait measures are evaluated empirically rather than assumed, given that the psychometric functioning of measures is dependent on purpose and context. In this context it is also important to consider both within- and between-person reliability and validity and to ensure that a measure that will be used to measure both within-person changes and between-person differences shows good properties at both levels. Our analyses suggested that the PSS had strong psychometric properties at both levels.
A second possible reason for the relative lack of attention paid to EMA measure psychometric properties may be the remaining need for methodological guidance on best practices for the development and validation of EMA measures. There has been considerable discussion of a range of design and analysis aspects of EMA, including sample sizes, the number and schedule of prompts, incentives, data management, and statistical modelling techniques (Asparouhov & Muthén, 2020; Bolger & Laurenceau, 2013; Carter & Emsley, 2019; McNeish & Hamaker, 2020; Schultzberg & Muthén, 2018; Trull & Ebner-Priemer, 2020). However, issues surrounding the reliability and validity of the survey measures used in EMA have been comparatively neglected. Future research should, therefore, focus on developing accessible but comprehensive consensus guidelines for developing/adapting and validating EMA measures, analogous to guidelines available for traditional survey measures. While many of the concepts and processes carry over from traditional survey measure development (Mokkink et al., 2010), there are important differences that must be taken into consideration. These include the need to examine the psychometric properties (e.g., reliability, criterion validity) at both the within- and between-person level (Geldhof et al., 2014) and additional pressure to keep measures as brief as possible given the intensive nature of the data collection schedule in EMA.

Limitations and Future Directions
It is important to consider the limitations of the current study. First, the EMA and traditional survey measures were not administered concurrently. Specifically, the main survey data collection period began in the autumn of 2018 while the EMA data collection began in winter 2018. This will have attenuated the between-person correlations between the trait measures and the EMA measures relative to concurrent measurement and, in particular, complicated any comparison between within- and between-level indices of criterion validity. It is, however, also difficult to directly compare within- versus between-person indices in general. The development of guidance relating to the comparison of reliability and validity coefficients at the within- versus between-level may be beneficial in future work. For example, we observed in the present study that provocations and negative affect were generally more strongly correlated with the PSS-EMA scores at the between- than the within-person level. However, as well as being dependent on the distribution of variation at the between- and within-person levels, these statistics are difficult to directly compare across levels because they may not be estimated with the same accuracy or precision.
Second, we focused on a set of core psychometric properties but did not examine a number of others, including content and face validity, test-retest reliability, and invariance across dimensions of individual differences beyond gender. This reflects our use of pre-existing data; future studies addressing these gaps would be valuable.
Future studies could also examine a wider set of convergent and divergent validity markers and assess the psychometric properties of not just stress scores themselves, but indices derived from them such as within-person variation (e.g., stress variability) and covariation (e.g., stress reactivity to events) (Speyer et al., 2021). It would also be valuable to examine the relations between EMA-derived PSS scores and physiological markers of momentary and chronic stress, such as via correlations with hair or salivary cortisol (e.g., Short et al., 2016). We also focused on the validation of an existing shortened measure rather than proceeding through an entire cycle of item development and iterative refinement. While our approach provides support for a scale that can be used immediately, it is possible that a more optimised item set could be identified in the future by progressing through an entire measure development and validation pipeline. The availability of only 4 items also meant that we could not realistically test a wide range of alternative factorial models beyond a unidimensional model.
Finally, our sample was also slightly selective relative to the main z-proso study, including with respect to stress, and therefore cannot be considered perfectly representative of the underlying same-aged young adult population. It was also a community-ascertained young adult sample; future studies in clinical samples (e.g., samples of individuals with mental health diagnoses) and in other age groups will be important to evaluate the functioning of the EMA-adapted PSS in other key populations. This is important given that psychometric properties may be population-dependent.

Conclusions
The EMA-adapted PSS presented in the current study showed evidence of within- and between-person reliability and criterion validity, as well as gender measurement invariance. This suggests that the scale can be recommended as a measure of choice in EMA studies seeking to illuminate daily life stress dynamics, predictors, and impacts.

Note.
Within-person correlations refer to the correlations between the within-person stress latent scores based on the EMA-adapted PSS and the within-person scores for the criterion variables within a two-level SEM. The between-person correlations refer to the between-person stress latent scores based on the EMA-adapted PSS and the between-person scores for the criterion variables. EMA variables were specified at both the within- and between-person level and main survey variables were specified only at the between-person level. Model syntax and full results are at: https://osf.io/5wyjb/
=1.40, p=.16], internalising problems [t(430.28)=1.28, p=.20], alcohol use in terms of beer and alcopop-type drinks [t(439.39)=0.77, p=.44] and liquor consumption over the past year [t(433.79)=0.59, p=.55], nor tetrahydrocannabinol use over the past year [t(417.91)=0.18, p=.86].