Extracting statistically significant behaviour from fish tracking data with and without large dataset cleaning

Abstract: Extracting a statistically significant result from video of natural phenomena can be difficult for two reasons: (i) there can be considerable natural variation in the observed behaviour and (ii) computer vision algorithms applied to natural phenomena may not perform correctly on a significant number of samples. This study presents one approach to cleaning a large, noisy visual tracking dataset so that statistically sound results can be extracted from the image data. In particular, analyses of 3.6 million underwater trajectories of a fish species, together with the water temperature at the time of acquisition, are presented. Although there are many false detections and incorrect trajectory assignments, reliable evidence for an increase in fish speed as water temperature increases is demonstrated by a combination of data binning and robust estimation methods. Then, a method for data cleaning which removes outliers arising from false detections and incorrect trajectory assignments using a deep learning-based clustering algorithm is proposed. The corresponding results also show a rise in fish speed as temperature goes up. Several statistical tests applied to both cleaned and uncleaned data confirm that both results are statistically significant and show an increasing trend. However, the latter approach also generates a cleaner dataset suitable for other analyses.


Introduction
There has been increasing interest in the use of computer vision methods for the analysis of natural world phenomena, such as species abundance and variety inventories, farm animal behaviour monitoring, and many specific scientific investigations. Examples of this increased interest are seen in special workshop series [1,2], the LifeCLEF competitions [3], and in special books [4,5].
One advantage of computer vision methods is the ability to acquire and analyse large amounts of data automatically, which can lead to more statistically sound results based on larger datasets. The downside of the computer vision approach is also linked to the large datasets acquired: it may not be possible to ensure that all of the data comes from the phenomenon of interest or is correctly measured, even with the use of mass ground-truthing facilities such as Mechanical Turk.
Typical sources of error include:
• Undetected transient sensor failures, such as compression or communication artefacts.
• Target detection failures, such as missed detections, detection of overlapping individuals as a single target, and false detections due to moving background items, illumination variations, or unrelated or unexpected foreground objects.
• Target tracking failures, such as mis-assignment errors arising when multiple individuals are present, when detection failures occur, or from occlusions between multiple targets.
• Target mis-identification errors, such as the confusion of one individual or species for another, a common occurrence given the unconstrained target pose and environmental conditions, e.g. dust and lighting.
• Measurement errors, such as sizes or positions, which can arise from special cases not considered by the algorithms, e.g. partial occlusions, partial detection failures or unusual poses.
It is possible to develop post-processing filters to detect and remove some of these failures [6], and manual review of the data is certainly possible. However, as we move into the era of 'big data', it is no longer feasible to ensure that datasets are 100% clean. Hence, computer vision needs to develop methods that are robust to a wide variety of sources of data error, not just sensor errors (which are generally handled well by current robust methods).
Herein, a dataset consisting of about 4 million fish trajectories is analysed to explore how fish speed varies with respect to water temperature. Investigating the behaviour of coral reef fish species at different temperatures can help to assess their sensitivity to climate change [7]. There are marine biology studies [8,9] which explored fish activity at different water temperatures. In their analysis, a fish tank model, which allows the temperature of the water to be modified [8], was used, in contrast to our study, which explores data obtained from underwater videos in a natural setting that reflects the seasonal changes in water temperature. Some of our results have been reported previously in [7] from an ecological perspective, where interested readers can find a deeper discussion of our findings and the findings of other studies. In [7], it was shown that one can still obtain statistically sound results, even with a dataset that contains examples of all the errors listed above: by a combination of data binning and robust statistics over a large amount of data, statistically sound inferences can be made from very noisy data (i.e. the trend that fish speed increases with water temperature). Different from [7], this paper focuses on the computer vision and big quantitative data analysis methods. The main contribution of this paper is a method for cleaning noisy tracking data and for estimating a dataset property (speed) robustly with and without cleaning the dataset. The trend found using all the data was also validated by removing outliers (i.e. by cleaning the data) which could represent errors, especially false detections and incorrect trajectory assignments. To detect outliers, we propose an effective outlier detection algorithm based on cluster cardinality. Clusters are obtained by applying a mean-covariance restricted Boltzmann machine (mcRBM), which groups the data such that data points in the same group are more similar to each other than to those in other clusters.
This paper is organised as follows. Section 2 discusses some current considerations about big data, in particular data cleaning and the outlier detection methods that have been used for data cleaning. The dataset used in this work is introduced in Section 3, including the error estimation that was performed using random subset selection. In Section 4, the outlier detection method based on the mcRBM and the data binning method used to analyse both the whole data and the cleaned data are presented. We present the experimental results in Section 5. Finally, we conclude the paper with a discussion in Section 6.

Background
Thanks to improvements in computational and storage resources and data acquisition tools, big data analysis has become a central topic in data science research. Similarly, for the investigation of animal biometrics, recognising different species and understanding animal behaviour, scientists have started to undertake their analyses using big data, which contains not only huge amounts of data but also data with variety, such as different species or behaviour classes. Among several works, Palazzo and Murabito [10] proposed a semi-supervised fine-grained fish recognition algorithm which utilised a dataset of 20 million fish images. In [11], convolutional neural networks were used to classify two different behaviours of Drosophila. In that study [11], (i) standing/walking and (ii) not in physical contact with the substrate were recognised using over 21,600 labelled training samples. To analyse the social behaviour of mice, in [12], a single-target tracking method was presented. Once the mice were tracked continuously for 5 days, the corresponding trajectories were analysed to measure the social behaviour of the mice. In total, 100k samples were collected and used for the evaluation of the proposed system. Interested readers can refer to a recent survey on visual animal biometrics [13], which also mentioned the importance of developing better platforms to enable more efficient algorithms for processing massive data.
Although it is not applied to big data, the research most related to this paper is [14], which presented a method for the automatic detection of erroneous fish trajectories (which exist due to object occlusions, tracker mis-associations and background movements). In that work [14], trajectories were represented using Hidden Markov models (HMMs). Later, multi-dimensional scaling (MDS) was applied to project all trajectories onto a low-dimensional fixed-length vector space. k-means clustering was then applied to the resulting vectors to model the correct trajectories, which were used to detect erroneous trajectories. In detail, to decide whether a new trajectory is erroneous or not, for each cluster (obtained after applying k-means) a corresponding HMM was built. The resulting k HMMs were then used to evaluate the likelihood that the new trajectory belongs to one of the HMMs, and if the maximum likelihood was smaller than a threshold, that trajectory was identified as erroneous. Although this method looks promising, it requires training data to build the HMMs. Additionally, it is not totally parameter free: at least for k-means, the number of clusters should be learnt from the training data. Applying this method to big data is not very feasible, since the needed size of the training data is not clear and there is no guarantee that the selected training data will be representative enough.
The following part of this section focuses on describing big data and addressing the current considerations, mainly about data cleaning. One of the most well-known methods for big data cleaning is applying outlier detection [15]. As the approach presented here is based on outlier detection, we also review outlier detection methods for the purpose of data cleaning.

Big data and current considerations
The definition of big data is domain specific, even though the importance of analysing big data has been generally recognised [6]. Although big data mainly refers to vast volumes of raw data, there are also some other concerns. Chen et al. [16] defined big data as masses of unstructured data that traditional information technology equipment, such as desktop workstations, non-clustered nodes and typical software tools, is not able to gather, store, process, manage and analyse. In [17], big data was defined not only in terms of volume but also in terms of velocity and variety. In [17], velocity refers to producing and processing the data rapidly and on time to satisfy demand, while variety means various modalities, such as image data from a sensor source, text data from social networks, etc. In addition to volume, velocity and variety, in [18,19] veracity (accuracy, trustability of the data), variability (inconsistency in the data), value (usefulness of the data for decision making) and complexity (the degree of interconnectedness) were also added to the features of big data.
Big data analytics covers extracting meaningful patterns from massive raw data for prediction and decision making. Especially in the last decade, various companies have utilised big data analytics to monitor and analyse their business needs and gain a better understanding of their business, which can lead to better customer service, improved products, increased sales, etc. However, the challenges of big data analytics still continue. Key problems are the exponential growth of data and the need for suitable data storage, compression, transmission and data indexing. There are also problems such as the variety of the raw data, highly distributed and various sources, high dimensionality, scalability of existing algorithms, imbalanced data, limited labelled data, lack of efficient information retrieval, noisy data and so forth [20].
As more video cameras enter our lives, such as the cameras embedded in our mobile phones and laptops, and the surveillance cameras in buildings, ATMs, traffic monitoring, etc., the image and video data generated by such devices have become the largest big data source [21]. These large amounts of video and image data have become attractive to the computer vision and machine learning research community, presenting great opportunities and new challenges. Automatic video annotation [22], parallel computing, developing scalable algorithms for scene understanding [23,24], object detection [25], object tracking [26], object recognition [27,28], video data visualisation, image search, and image retrieval and indexing [29] are some applications which become even more challenging with big data.
As mentioned above, one of the big challenges of big data is data cleaning, which was also stated in [30], and this can require efficient data querying. Fan et al. [31] proposed a method for querying big data using a small amount of it. In that study [31], a concept called a bounded envelope retrieves a sufficiently accurate subset which is a good approximation of the full dataset. Similarly, in their previous work on scale independence [32,33], it was shown that the size of a small representative dataset depends more on the query than on the size of the full dataset. In [6], methods to remove false fish detections from a large fish image dataset were presented. It was shown that evaluating big data by sampling smaller subsets is the most convenient way of evaluating the cleaned dataset.
An alternative approach to handling errors in big datasets is not to clean the data, but rather to model the types and effects of the errors on a smaller ground-truth dataset, and then use the model to correct the statistical results [34]. While this does not correct individual errors, it improves the overall results, under the assumption that a given level and type of errors is expected in the full dataset.
Recently, active learning-based data cleaning methods have been proposed as well. Krishnan et al. [35,36] proposed an iterative data cleaning tool which focuses on errors such as missing data and incorrect or inconsistent values in the data. The proposed method [35,36] cleans the data while preserving provable convergence properties.
Another popular approach for data cleaning is utilising outlier detection methods, which are also efficient on very large datasets [15]. An outlier is defined as a datum which deviates significantly from the other data points, where the quantity of outliers is less than the quantity of inliers [37]. In [5], to obtain a clean dataset for the purpose of fish recognition, clustering-based outlier detection (see Section 2.2 for more information) was performed. In that study [5], the fish images were first clustered. Outliers were detected as the images which were not similar to the representative image (cluster centre) of the cluster that they belong to. Once an outlier was detected, it was removed from the data, and the cleaned data were used for the fish recognition task. In the following section, we review recent outlier detection methods which have been utilised for data cleaning.

Outlier detection for data cleaning
Outlier detection methods have been proposed for various applications, such as fraud detection, network intrusion detection, weather forecasting and many other data mining tasks. In this work, we focus on outlier detection based on a deep learning technique, specifically for data cleaning.
In [15], data cleaning methods were addressed for different types of data, for instance both quantitative and categorical. Particularly for big quantitative data (which is the data type in this study as well), outlier detection was presented as the basis of data cleaning. In that study [15], different outlier detection methods were investigated with their advantages and disadvantages. Outlier detection methods were categorised as (i) non-normality assumption-based methods, (ii) data modelling-based methods, (iii) data partitioning (clustering)-based methods, (iv) model-free detection, (v) distance-based methods, (vi) time-series methods, (vii) re-sampling-based methods and (viii) frequency-based methods [15].
Among many different categories, clustering-based, distance-based and density-based outlier detection methods are the most popular approaches for data cleaning. As an example of distance-based outlier detection methods, Kollios et al. [38] proposed a generic method which can be used for data cleaning as well. In that study [38], a data point is defined as an outlier if at most k other points lie within a defined distance. Similarly, Knorr and Ng [39] proposed a metric which defines an outlier threshold as the percentage of data points that lie beyond a given distance from a data point. Breunig et al. [40] defined density as the average distance of a data point to its k nearest neighbours, where data points having low density are defined as outliers. Loureiro et al. [41] presented an approach based on hierarchical clustering to detect errors in foreign trade data. In that study [41], small clusters were defined as the clusters which contain outliers, assuming that outliers are distinct from the majority of the data. In a very recent study [42], outlier detection based on the k-means and k-medoids clustering methods was presented and applied to data cleaning.
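For concreteness, the distance-based definition can be sketched as follows (a DB(p, D)-style rule in the spirit of [39]; the function name and defaults are illustrative). Note the quadratic number of pairwise distances, which is exactly the scalability problem such methods face on big data:

```python
import numpy as np

def distance_outliers(X, D, p=0.95):
    """Flag a point as an outlier if at least a fraction p of the remaining
    points lies farther than distance D from it (DB(p, D)-style rule).
    Requires O(n^2) pairwise distances, hence impractical for big data."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    far_frac = (dists > D).sum(axis=1) / (len(X) - 1)
    return far_frac >= p
```

A tight cluster of points plus one distant point illustrates the behaviour: only the distant point exceeds the p threshold.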
Overall, although the proposed methods are useful, in practice they usually require point-to-point distance (similarity) calculations, which makes them poorly scalable and their application to big data infeasible. Motivated by this, in this study we propose an outlier detection method which is based on the clustering capability of the mean-covariance restricted Boltzmann machine and does not require any similarity (distance) calculations.

Dataset
The dataset used here was based on video data captured as part of the Fish4Knowledge project [43]. The videos were captured from one of four fixed cameras (3.6 mm focal length, 2/3-inch CCD) in uncontrolled open sea conditions in the intake bay of the third nuclear power plant inside Kenting National Park, Taiwan. Simultaneous water temperature readings were stored with each video. The videos were analysed to detect and track fish using a covariance-based tracker [26], while species recognition of individual fish was based on a balance-guaranteed optimised decision tree classifier [27]. We selected the data associated with the damselfish Dascyllus reticulatus, which lives in colonies, commonly feeding on zooplankton near coral heads. The camera used in this study was at a depth of 2 m, and a typical image from the video data is shown in Fig. 1. Owing to the shallow camera and coral depth, the scene is greatly affected by illumination variations, arising from both changes in the sky lighting (sun position, clouds) and, more importantly, refraction of the light by the ocean surface, causing caustics that can be mistaken for fish, or which might cause fish to be undetected.
In total, 12,247 videos (640 × 480 resolution, 10 min each, 24 frames per second), which amounts to 2041.2 h of data, were analysed. 3,649,007 trajectories of D. reticulatus were identified and used in the analysis. To assess the quality of the automatically detected and analysed data, 1000 of the 3.6 million fish trajectories were manually examined, with 100 trajectories chosen randomly from each of the ten temperature intervals (see Section 5 for the definition of the intervals). Manual examination was performed as follows. For each detection of a given trajectory, we examined whether there was any false detection (the detected object is not a fish). This condition included the requirement that the majority of the bounding box should contain the detected fish and, if the detected object was a fish, that it should also be the same fish detected in the previous frames. We also checked whether there was any false recognition (the detected fish is not D. reticulatus). If there was at least one false detection or one false recognition, the corresponding trajectory was classified as an erroneous trajectory; otherwise, it was classified as a correct trajectory. By looking at consecutive frames in the video, we also assessed whether the linked detections were likely to be from the same fish, i.e. consistent with the fish's direction, motion and neighbours. These 1000 trajectories led to 16,504 detections, of which 16,210 were actually fish; 745 trajectories (11,602 detections) of the 1000 trajectories were correctly tracked from frame to frame. All 745 trajectories were D. reticulatus, although we expect that there are some instances of mis-recognition in the full dataset. Consequently, we estimate that 74.5% of the 3.6 million trajectories are valid.
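The estimate above is a sample proportion (745/1000), so its sampling uncertainty can be quantified. As a minimal sketch, a 95% Wilson score interval for the 74.5% figure could be computed as follows (the function is our illustration; the paper itself does not report a confidence interval):

```python
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for an observed proportion k/n."""
    p = k / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

lo, hi = wilson_interval(745, 1000)   # roughly (0.717, 0.771)
```

With n = 1000, the interval is only a few percentage points wide, suggesting the 74.5% validity estimate is reasonably tight.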

Methods
The central question to be answered was whether fish speeds (for this species) increased, decreased or had no trend as a function of water temperature.The temperature was measured directly on a per-video basis.
Each trajectory was represented by a set Traj = {(c_t, r_t)}, t = 1, …, T, such that (c, r) are the image column and row positions of the centres of the detection bounding boxes for a trajectory composed of T detections. Although there were no direct 3D position estimates, the observed fish were largely mature adults, with a typical height of 33 mm (see [7] for the reasoning). These fish swim, on average, horizontally, with the result that there is often foreshortening of the apparent length of the fish due to its pose relative to the camera. On the other hand, the foreshortening of the apparent height of the fish is largely due to the distance of the fish from the camera. Consequently, we estimated the 3D position of the fish from the measured height of the bounding box, from which a set of T scene positions were estimated, and then the fish speed was estimated from the time taken to traverse the total distance travelled. Complete details of the speed estimation method are given in [7].
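The geometry can be sketched under a simple pinhole model. The focal length in pixels, the principal point (taken here as the image origin) and the back-projection details are illustrative assumptions, not the calibration used in [7]:

```python
import numpy as np

def estimate_speed(boxes, f_px=800.0, real_height_mm=33.0, fps=24.0):
    """Sketch: fish speed (mm/s) from a track of detection boxes.
    boxes: (T, 3) array of (column, row, bounding-box height in pixels).
    f_px is a hypothetical focal length in pixels (not from the paper)."""
    col, row, h = boxes[:, 0], boxes[:, 1], boxes[:, 2]
    Z = f_px * real_height_mm / h        # depth from apparent height
    X = col * Z / f_px                   # back-project image coordinates
    Y = row * Z / f_px
    P = np.stack([X, Y, Z], axis=1)      # T scene positions (mm)
    total = np.linalg.norm(np.diff(P, axis=0), axis=1).sum()
    return total / ((len(boxes) - 1) / fps)
```

A stationary track yields zero speed, while any displacement of the box centre (at constant apparent height) yields a positive speed.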
The general principles of the method are straightforward. Unfortunately, there are many factors that contribute to erroneous speed estimates, some of which are identified here:
i. A small percentage of detections (est. 2%, based on the analysis in the previous section, i.e. for the manually examined data) are false positives, which can lead to unrealistic distances when combined with consecutive true positive detections.
ii. A significant percentage of trajectories (est. 25%) are formed by connecting detections incorrectly. This can arise easily, as D. reticulatus tend to aggregate into schools, and thus detections of nearby individuals can be linked into a single trajectory.
iii. Some fish are larger or smaller than the nominal sizes (using the mode of the fish image height distributions for each bin given in [7], the estimated real height of the fish can be 0.1 times the nominal height of the fish on average), which leads to erroneous depth and thus position and speed estimates.
iv. Some fish are not swimming horizontally and thus their bounding box is larger than expected (for the manually examined data, 38.6% of the bounding boxes are not horizontal). This leads to closer position estimates and thus slower speed estimates.
v. Some bounding boxes are larger or smaller than the true fish (for the manually examined data, the mean and standard deviation of the bounding box areas are 2310.58 and 1618.83 square pixels), which again leads to erroneous size, distance and speed estimates.
vi. There are natural variations in behaviour (due to changing circumstances) and capability (due to differences in metabolism; see [5], chapters 2 and 12, for more information).
One might be discouraged by all these influences, and it is quite likely that many of the measurements from individual fish are affected. However, we take the view that one can analyse the data from the general perspective of the 'Law of Large Numbers', whereby individual variations tend to cancel out to expose the underlying trends. Our data does not satisfy the conditions of the Law specifically, but follows the general perspective that enough data cancels fluctuations to reveal the underlying distribution. Fig. 2 shows the histogram of speeds for the 28.1–30.3°C temperature band (for more evidence, i.e. a similar trend, interested readers can refer to the supplementary material of [7]). It is clear that there is a significant peak irrespective of the considerable variation due to the factors identified above (and possibly more).
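To make the robustness argument concrete, the peak (mode) of a speed histogram is barely moved by a heavy tail of erroneously large speeds, unlike the mean. A small sketch on synthetic numbers (our own illustration, not the paper's data):

```python
import numpy as np

def histogram_mode(speeds, n_bins=100):
    """Centre of the tallest histogram bin: a simple robust estimate."""
    counts, edges = np.histogram(speeds, bins=n_bins)
    k = np.argmax(counts)
    return 0.5 * (edges[k] + edges[k + 1])

rng = np.random.default_rng(0)
speeds = np.concatenate([rng.normal(30, 2, 10_000),    # plausible speeds
                         rng.uniform(100, 500, 500)])  # tracking-error tail
# histogram_mode(speeds) stays near 30, while speeds.mean() is pulled well above it
```

This is the sense in which the histogram peak in Fig. 2 remains informative despite the error sources listed above.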
Below, we describe an approach that uses a mean-covariance restricted Boltzmann machine to find clusters which have few samples and are expected to contain fish trajectories that are significantly different from the vast majority of the trajectories, i.e. erroneous trajectories (outliers). We claim that the trend found for fish speed and temperature using the remaining fish trajectories will be the same as the trend found using all fish trajectories. Showing evidence for this claim will demonstrate that it is still possible to obtain statistically reasonable results, even with a dataset that contains much noise, once the dataset is sufficiently big.

Validation of the method
In this section, we present an outlier detection method which utilises clustering based on a mean-covariance restricted Boltzmann machine. Thanks to the effectiveness of the mean-covariance restricted Boltzmann machine, 3.6 million trajectories can be clustered without the need for any similarity calculations between samples. Then, the outliers are detected and removed, and the same pipeline (3D position estimation, speed calculation and data binning) presented in [7] is applied so that the trends can be compared.

Trajectory representation
To cluster the trajectories, the mean-covariance restricted Boltzmann machine requires each trajectory to be represented by a vector of fixed length. Among several trajectory representation methods (such as vector quantisation [44], the discrete Fourier transform [45], Chebyshev polynomial approximation [46], the Haar wavelet transform [46], principal component analysis (PCA) [47], and HMMs [48]), in this study, representation by cubic B-spline control points [46,49] is applied. This is mainly because (i) cubic B-spline control points are able to encode the shape and the spatio-temporal properties of a trajectory, (ii) the control points and weight factors are flexible enough to encode any simple or complicated trajectory, (iii) its better performance at detecting abnormal trajectories (which are, in other words, outliers) has been shown in studies such as [46,49,50], and (iv) it is not based on learning. The disadvantage of this method is its dependence on the chosen number of control points, which might result in ignoring sharp changes in a trajectory if an incorrect number of control points is chosen, but similar disadvantages exist in other popular trajectory representation methods such as HMMs and PCA.
Given that each trajectory is defined by a set of T image column and row positions of the centres of the detection bounding boxes, the cubic B-spline approximation of the trajectory is

S(t) = Σ_{i=1}^{p} C_i B_{i,4}(t)

where p is the number of control points, the order 4 is fixed because the degree of the function is 3 for a cubic spline, and C^r and C^c are the unknown control points for the row and column coordinates, while every control point is weighted by a B-spline basis function B_{i,4}(t). The B-spline basis functions are defined by a knot vector τ (which determines where and how the control points affect the spline curve; see [46] for the values used) through the Cox–de Boor recursion [46,49,50]:

B_{i,1}(t) = 1 if τ_i ≤ t < τ_{i+1}, and 0 otherwise
B_{i,k}(t) = ((t − τ_i)/(τ_{i+k−1} − τ_i)) B_{i,k−1}(t) + ((τ_{i+k} − t)/(τ_{i+k} − τ_{i+1})) B_{i+1,k−1}(t).

The p coefficients which minimise the sum of squared errors between the original trajectory (Traj) and its approximation S are found through the Moore–Penrose pseudoinverse of the T × p basis matrix Φ (with entries Φ_{t,i} = B_{i,4}(t)),

Φ† = (Φ^T Φ)^{−1} Φ^T,

while the control points are given by F_CR = Φ† Traj_CR, where F_CR is the final trajectory representation.
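The fitting procedure can be sketched as follows. We assume a clamped uniform knot vector on [0, 1]; the exact knot values used in [46] may differ:

```python
import numpy as np

def bspline_basis(t, i, k, knots):
    # Cox-de Boor recursion for the i-th B-spline basis function of order k
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    out = 0.0
    d1 = knots[i + k - 1] - knots[i]
    if d1 > 0:
        out += (t - knots[i]) / d1 * bspline_basis(t, i, k - 1, knots)
    d2 = knots[i + k] - knots[i + 1]
    if d2 > 0:
        out += (knots[i + k] - t) / d2 * bspline_basis(t, i + 1, k - 1, knots)
    return out

def design_matrix(T, p=7, k=4):
    # Clamped uniform knot vector: p + k knots on [0, 1]
    knots = np.concatenate([np.zeros(k),
                            np.linspace(0, 1, p - k + 2)[1:-1],
                            np.ones(k)])
    ts = np.linspace(0.0, 1.0, T, endpoint=False)  # stay below the last knot
    return np.array([[bspline_basis(t, i, k, knots) for i in range(p)]
                     for t in ts])

def fit_control_points(traj, p=7, k=4):
    """Least-squares control points F = pinv(Phi) @ traj for a (T, 2) trajectory."""
    Phi = design_matrix(len(traj), p, k)
    return np.linalg.pinv(Phi) @ traj   # (p, 2): the fixed-length representation
```

With p = 7 this yields a 14-dimensional vector per trajectory; a trajectory that is itself a cubic (or simpler) curve is reproduced essentially exactly by the least-squares fit.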
In this work, we set the number of control points to seven, as applied in [46,49,50]. Seven is a reasonable number given that the median and mean values of the trajectory lengths are 12 and 20, respectively.

Mean-covariance restricted Boltzmann machine
Using restricted Boltzmann machines (RBMs) for clustering has some advantages over other clustering methods: (i) Clustering methods like k-means and hierarchical clustering require pairwise similarity (distance) calculations, whereas RBMs do not. (ii) There exist clustering methods, e.g. the Gaussian mixture model (GMM) and the Dirichlet process mixture model (DPM), that also do not need pairwise comparisons. However, for the GMM, the number of mixtures (i.e. the number of clusters) should be known. On the other hand, for the DPM, although it is non-parametric and can learn the number of mixture components without this being specified in advance, the behaviour of the model is sensitive to the choice of the prior base measure. Moreover, it needs to calculate a mean and covariance for each component, and to update the covariance with a Cholesky decomposition, which may lead to high space and time complexity [51]. (iii) As mentioned in [52], other clustering methods (such as the GMM), as opposed to RBMs, need a huge number of clusters to capture all the variations in the input, whereas a reasonably small RBM can capture very complicated distributions, since an RBM is able to discover a rich representation of the input. This is because N hidden units can represent up to 2^N different regions in the input space. With other clustering techniques, one would need O(2^N) parameters (and/or examples) to capture that many regions, unlike RBMs, which require O(N) parameters.
The mcRBM is a type of RBM that is capable of feature learning from real-valued data [53]. Similar to all RBM models, the mcRBM has a bipartite undirected graph structure. It contains two layers of stochastic random variables, which are also called units. The first layer is a visible layer that represents the observed data, namely the visible units (v). The second layer has latent variables that are also referred to as hidden units (h). Differently from standard RBM models, the mcRBM has two sets of hidden units: (i) mean units h^m and (ii) covariance units h^c. The h^m units model the mean of the input elements, while the h^c units represent the pairwise dependencies between the visible units, hence modelling their covariance structure. There are no connections between variables within a layer; thus, the variables of a layer are independent of each other (see Fig. 3 for an illustration of the mcRBM).
The mcRBM is a combination of a Gaussian RBM and a covariance RBM [53]. Its energy function is composed of two terms and is defined as

E(v, h^m, h^c) = E^c(v, h^c) + E^m(v, h^m).

E^c defines a zero-mean Gaussian distribution over the visible variables and is given by

E^c(v, h^c) = −(1/2) Σ_f Σ_k P_{fk} h^c_k (Σ_i R_{if} v_i)² − Σ_k d_k h^c_k

where R is the visible-factor weight matrix, P is the factor-hidden (or 'pooling') matrix and d is the hidden bias vector. E^m is defined as

E^m(v, h^m) = (1/2) Σ_i v_i² − Σ_{ij} W_{ij} v_i h^m_j − Σ_j c_j h^m_j − Σ_i b_i v_i

with W denoting the direct connections from the hidden mean units h^m to the visible units v, b the visible bias and c the hidden mean bias.
By having E^m, differently from the cRBM, the mcRBM can produce conditional distributions over the visible units, given the hidden units, that have non-zero means. The conditional distribution of the hidden covariance units given the visible unit states v is

p(h^c_k = 1 | v) = σ( (1/2) Σ_f P_{fk} (Σ_i R_{if} v_i)² + d_k )

where σ(x) = 1/(1 + exp(−x)) is the logistic function, and the conditional distribution over the mean hidden variables is

p(h^m_j = 1 | v) = σ( Σ_i W_{ij} v_i + c_j ).

The resulting conditional distribution over the visible variables is a Gaussian distribution in terms of the hidden covariance and hidden mean latent states and is defined as

p(v | h^m, h^c) = N( Σ (W h^m + b), Σ )

where Σ is given by

Σ = ( R diag(−P h^c) R^T + I )^{−1}.

In this paper, as recommended in [53] and used in [54] for mice behaviour analysis, the normalised version of the mcRBM is used to avoid feature-based disparity. It is trained over the common RBM parameters θ ∈ {R, P, W, d, b} with stochastic gradient ascent on the log-likelihood,

Δθ ∝ ⟨∂F/∂θ⟩_model − ⟨∂F/∂θ⟩_data,

where F is the free energy and ⟨⋅⟩ denotes expectations under the distribution specified by the subscript. To compute the expectations under the model distribution, we use hybrid Monte Carlo on the mcRBM's free energy, which allows obtaining reconstructions (as suggested in [54]). Finally, by using the mcRBM, clusters are obtained from the different binary configurations of the model's latent variables, which represent different modes of the input data distribution.
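Once trained, the hard cluster assignment can be sketched by thresholding the hidden activation probabilities and reading the resulting binary configuration as a label. The shapes and the non-positive sign convention for P follow common mcRBM formulations; the weights below are random stand-ins for a trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mcrbm_cluster_ids(V, W, c, R, P, d):
    """Cluster label for each row of V (one trajectory vector) given by the
    binary configuration of the mcRBM's hidden units.
    V: (N, nv), W: (nv, nm), c: (nm,), R: (nv, nf), P: (nf, nc), d: (nc,)."""
    hm = (sigmoid(V @ W + c) > 0.5).astype(int)                     # mean units
    hc = (sigmoid(0.5 * ((V @ R) ** 2) @ P + d) > 0.5).astype(int)  # covariance units
    bits = np.concatenate([hm, hc], axis=1)
    # read the concatenated binary code as an integer cluster id
    return bits @ (1 << np.arange(bits.shape[1]))
```

With nm + nc hidden units, this yields at most 2^(nm + nc) distinct labels, which is the sense in which a small mcRBM can represent many clusters.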
The hyperparameters used to train the mcRBM are given in Section 5. All training parameters were randomly initialised to small values, as suggested in [53], matching the example configuration of the code accompanying [53]. The normalisation of the weight matrices (R, P) also matches the code accompanying [53]. Regarding convergence, looking at the energy function, we observed that for the current data the network converged after 1000 training epochs. Differently from [53], we did not use annealing of the learning rates, because we observed that, with annealing, the network stopped learning after only a few epochs.

Outlier detection
An outlier is a datum which lies far from the other data points in a dataset and/or has (very) low likelihood under a modelled/compared distribution. Usually, the outliers are far fewer in number than the other data samples. In this study, we adapted the outlier detection algorithm presented in [37] such that outliers are the data points located in small clusters. Small clusters are defined as the clusters containing few trajectories, identified by a threshold computed as follows. The cardinality of each cluster is calculated and the median cardinality is found. The threshold is then defined as A% of that median value (see Section 5 for the values used for A). If a cluster's cardinality is smaller than the threshold, that cluster is marked as a small cluster. All trajectories belonging to a small cluster are detected as outliers, on the assumption that they are false detections or incorrect trajectories, and are removed from further analysis (3D position estimation, speed calculation and data binning); the remaining dataset is the cleaned dataset.

Fig. 3 Illustration of the mcRBM. Adapted from [54]
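A minimal sketch of this small-cluster rule (variable names are illustrative, not from the authors' code):

```python
import numpy as np

def small_cluster_outliers(labels, A):
    """Mark trajectories as outliers when their cluster is 'small'.

    labels: cluster index per trajectory; A: percentage threshold.
    A cluster is small if its cardinality is below A% of the median
    cluster cardinality; every member of a small cluster is an outlier.
    """
    ids, counts = np.unique(labels, return_counts=True)
    threshold = (A / 100.0) * np.median(counts)
    small_ids = set(ids[counts < threshold].tolist())
    return np.array([lab in small_ids for lab in labels])
```

For example, with clusters of sizes 100, 90 and 3 and A = 10, the threshold is 9 trajectories, so only the three trajectories in the smallest cluster are flagged.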

Data binning
Once 3D position estimation and speed calculation are performed for each individual fish trajectory using the method given in [7], the temperature data and the corresponding speeds are grouped into ten bins. Data binning is used to establish whether fish speeds (for this species) increased, decreased or had no trend as a function of water temperature. Each bin contains a similar number of trajectories. Since far more data were available at some temperatures than at others (see Fig. 4), such equal-frequency binning was preferable to the alternative of dividing the data into equal temperature intervals. Additionally, with more than ten bins, the numbers of trajectories per bin become less balanced than with ten bins, whereas with fewer than ten bins some bins cover a very wide temperature interval compared with the others, which could make the analysis less accurate if the speed fluctuates as temperature rises.
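The equal-frequency binning described above can be sketched with quantile edges (a simplification; ties at the edges can make the bin counts only approximately equal):

```python
import numpy as np

def equal_frequency_bins(temperatures, n_bins=10):
    """Split temperatures into n_bins bins holding roughly equal
    numbers of samples; returns a bin index per sample and the edges."""
    edges = np.quantile(temperatures, np.linspace(0.0, 1.0, n_bins + 1))
    # use only the inner edges, so digitize assigns indices 0 .. n_bins-1
    return np.digitize(temperatures, edges[1:-1]), edges
```

The resulting temperature intervals are unequal in width but balanced in sample count, which is exactly the property exploited in the analysis.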
The temperature intervals and the number of trajectories in each interval (for the whole, not-cleaned data, i.e. without outlier detection) are given in Table 1.

Results
As stated previously, the central question to be answered was whether fish speeds increased, decreased or had no trend as a function of water temperature.As discussed in [7], the swimming speed of D. reticulatus was found to increase as the water temperature increased.What is interesting in this paper are the methods used to ensure that this conclusion was sound.

Results without data cleaning
Data binning as described in the previous section helped to cope with the unequal distribution of data, particularly at the higher temperatures, where the standard deviation of speeds is higher. Table 2 summarises the data in terms of the mean, median, mode (extracted after the histogram of the data is smoothed) and standard deviation of speeds associated with this binning. One can observe that mean, median and mode speeds generally increase with temperature. The trend for speed increase is weak for bins 4-6; however, note that these bins combined cover only about 1°C, whereas bins 1 and 10 cover several degrees individually. Thus, it is not surprising that bins 4-6 have similar mean, median and mode speeds.
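The paper does not give the details of the histogram smoothing used to extract the mode; a plausible sketch, assuming a simple Gaussian kernel over the bin counts, is:

```python
import numpy as np

def smoothed_mode(speeds, n_bins=200, sigma=2.0):
    """Mode of a sample, taken as the peak of its histogram after
    smoothing the bin counts with a small Gaussian kernel."""
    counts, edges = np.histogram(speeds, bins=n_bins)
    half_width = int(4 * sigma)
    xs = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (xs / sigma) ** 2)
    kernel /= kernel.sum()                       # normalise the kernel
    smoothed = np.convolve(counts, kernel, mode="same")
    centres = 0.5 * (edges[:-1] + edges[1:])     # bin centres
    return centres[np.argmax(smoothed)]
```

Smoothing before taking the argmax makes the mode estimate robust to the bin-to-bin noise of a raw histogram.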
In Fig. 4a, the speed data versus temperature are shown. For each bin, the mean (white circle) and the median (white diamond) are marked, taking the middle temperature value of each bin as the horizontal coordinate. Additionally, a box-plot representation of the same data is shown in Fig. 4b (adapted from [7]). In that figure, speeds are truncated at 100 mm/s to present the trend more clearly (the maximum speed is visible in Fig. 4a). The horizontal axis is not uniformly spaced because we chose bins with approximately equal numbers of samples. The boxes are bounded by the upper and lower quartiles of the data, the mid-bar is the median, and the whiskers show the most extreme speeds. As well as the increasing mean, the median also shows an increasing trend with temperature. Since there was a substantial amount of data in each bin, powerful statistical tests could be applied to confirm the hypothesis of an increasing speed trend. The Kruskal-Wallis significance test (which assumes neither normality nor homogeneity of variance, i.e. approximately equal variance across groups) was applied to assess whether the speeds in the different bins were significantly different. The test showed that the mean ranks for the temperature intervals are significantly different from each other, implying that the speeds for the temperature intervals are significantly different (p < 0.05). Tukey-Kramer post hoc analysis was then applied to compare the speeds of each pair of temperature intervals; it showed that the speed distributions differ significantly for every pair of temperature intervals. To answer the central question of this paper, i.e. whether the speed had a trend or was random as a function of water temperature, the Mann-Kendall test (p < 0.05) was applied to the mean, median and mode speeds of the temperature intervals. An increasing trend was found for the mean, median and mode speeds, with p-values of 0.0056, 0.0095 and 0.0056, respectively.
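The Mann-Kendall trend test on the ten per-bin statistics is small enough to sketch directly (a plain implementation with a normal approximation and no tie correction; the exact test variant used by the paper is not specified):

```python
import math
import numpy as np

def mann_kendall(x):
    """Two-sided Mann-Kendall trend test. Positive S with small p
    suggests an increasing trend; negative S a decreasing one."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S statistic: sum of signs over all ordered pairs
    s = sum(np.sign(x[j] - x[i])
            for i in range(n) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)   # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p-value
    return s, p
```

Applied to ten strictly increasing per-bin means, this gives S = 45 and a p-value well below 0.05, i.e. a significant increasing trend.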

Results with data cleaning
The same data binning procedure was applied to the cleaned data, obtained after the proposed outlier detection method was applied. The temperature intervals were kept the same for comparison, and a similar number of trajectories was again obtained for each bin. The mcRBM was trained with different parameter settings: hidden covariance units h^c = 14, hidden mean units h^m ∈ {10, 20}, batch size = 256 and epochs ∈ {900, 1600} (see Section 4.3 for the definition of the parameters). For outlier detection, the A parameter (see Section 4.4 for the definition) was set to 1, 5, 10, 15, 20 and 25. Different parameter settings for the mcRBM produced different numbers of clusters. However, the total number of remaining trajectories (even the number of trajectories per bin) changed very little (and therefore the mean, median and standard deviation changed only in the 0.001 place), regardless of which outlier detection parameter was used. This is perhaps because the mcRBM clusters well when there is a large amount of data (3.6 million trajectories). In detail, we observed that increasing the number of latent variables gradually increased the number of clusters. We also found that beyond a certain number of hidden units there is no benefit in adding more, since the extra units stay in a non-active state during the experiment and do not affect the clustering results in any way. The aforementioned set of latent variables also gave a good balance between computational cost and quality of results. Fig. 5 shows the histogram of speeds for the 28.1-30.3°C temperature band after data cleaning. Comparing Fig. 2 with Fig. 5, the mode of the histograms did not change much, although the mean and the median decreased slightly after data cleaning. Table 3 gives the number of trajectories, mean, median, mode (extracted after the histogram of the data is smoothed) and standard deviation of speeds for each bin of the cleaned data. After removing outliers, in total 2,239,021 trajectories (∼61% of the whole data) remained for analysis (i.e. were not detected as outliers). These results were obtained with 123 clusters (123 unique activations) found by the mcRBM when h^c, h^m, batch size and epochs were set to 14, 10, 256 and 1600, respectively, and A was set to 10. The number of visible-to-hidden covariance factors was kept equal to the number of hidden covariance units in all experiments, as suggested in [54]. Changing A from 25 to 1 caused only 10,276 additional trajectories (0.28% of the whole data) to be detected as outliers. Additionally, the distributions of control points (features; see Section 4 for the definition) in some of the dense and small clusters (not all, due to space limitations) are shown in Fig. 6, corresponding to the results given in Table 3. The behaviour of the different clusters is distinctive (there are different patterns), while each cluster (especially the dense clusters, which are not outliers) has a low standard deviation per control point. This means that the intra-cluster similarity is high while the inter-cluster similarity is low, as expected from a successful clustering algorithm.
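The 'clusters as unique activations' step can be sketched as follows, assuming a matrix of per-sample hidden activation probabilities (names are illustrative, not the authors' code):

```python
import numpy as np

def clusters_from_activations(hidden_probs, threshold=0.5):
    """Assign a cluster label per sample from the unique binary
    configurations of its latent-unit activations: two samples share
    a cluster iff their thresholded activation codes are identical."""
    codes = (np.asarray(hidden_probs) > threshold).astype(np.uint8)
    unique_codes, labels = np.unique(codes, axis=0, return_inverse=True)
    return labels, len(unique_codes)
```

With 14 covariance and 10 mean units there are at most 2^24 possible codes, but only the configurations actually activated by the data appear as clusters (123 in the reported setting).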
The Mann-Kendall test (p < 0.05) was also applied to the mean, median and mode speeds of the cleaned data. An increasing trend was again found for the mean, median and mode speeds, with p-values of 0.0032, 0.0056 and 0.0032, respectively. The same p-values were obtained when different mcRBM parameters and outlier detection parameters were used.
When Tables 2 and 3 are compared, it can be seen that the mean value of each bin changed, which is expected since the proposed method removed (a lot of) outliers. The median values also all decreased after data cleaning, which can be interpreted as the removed outliers having been predominantly high-speed erroneous trajectories.

Evaluation of data cleaning results
As mentioned in many studies and also in Section 2, one way of estimating the errors in a dataset is to evaluate performance on random small subsets of the data. As given in Section 3, such an analysis, performed by manual examination, showed a 25.5% error rate, which is notably close to the proportion of outliers (∼29% of the whole data) detected automatically by the proposed outlier detection algorithm. In detail: (i) 52% of the data identified as erroneous trajectories by manual investigation were also detected as outliers by the proposed algorithm; (ii) 78% of the data identified as correct trajectories by manual investigation were also retained as inliers (not outliers) by the proposed algorithm; (iii) 22% of the correct trajectories identified by manual investigation were removed as outliers by the proposed algorithm; (iv) 48% of the erroneous trajectories identified by manual investigation were retained as correct trajectories by the proposed algorithm. The net result is that the cleaned dataset is now estimated to consist of 82.6% correct trajectories (compared with 74.5% before cleaning).
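The 82.6% figure follows directly from the four percentages above; the arithmetic can be checked in a few lines:

```python
# Manual audit of a random subset: 74.5% correct, 25.5% erroneous.
correct, erroneous = 74.5, 25.5

# The cleaner keeps 78% of the correct trajectories (item (ii))
# and 48% of the erroneous ones (item (iv)).
kept_correct = correct * 0.78        # 58.11% of the whole data
kept_erroneous = erroneous * 0.48    # 12.24% of the whole data
kept_total = kept_correct + kept_erroneous

# Estimated purity of the cleaned dataset, and fraction removed.
purity = 100.0 * kept_correct / kept_total       # ~82.6%
removed = correct * 0.22 + erroneous * 0.52      # ~29.7% flagged as outliers
```

The removed fraction (~29.7%) also matches the ∼29% of the whole data flagged as outliers mentioned above.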

Discussion
The analysis presented in this paper showed that it is possible to obtain statistically sound conclusions from image data, even in the presence of many sources of error that corrupt the data and subsequent results. At a minimum, ∼25% of the data suffered from detection and tracking errors, and almost certainly a large amount of the correctly tracked data was also corrupted to some degree by failing to conform to the model of the ideal horizontally moving fish accurately delimited by its bounding box. Nonetheless, we showed that the conclusions are sound, essentially because the major errors led to outlier values that could be easily eliminated, and many other errors led to over- and under-estimations that tended to balance out, leaving a significant and obvious mode in the estimated speeds. In other words, having on the order of 300 thousand data points per bin makes the mode value obvious. Although we had no ground truth to validate the correctness of the estimated speeds, the speeds were reasonable from a marine ecology perspective. More importantly, even if there were a systematic error in the scaling of the speeds (i.e. if the real speeds were greater than the estimated speeds), this error would affect all of the results equally, and the conclusions about the trend are based on a relationship that is invariant to the scaling of the results.
This paper has presented a specific example of 'big visual data' analysis, but what can readers take from this example? Clearly, the time of having completely clean training and test data has passed, simply because of the volume of data now available. It still makes sense to develop and evaluate algorithms using small clean datasets, but when large datasets are analysed, such forms of precision computer science are no longer feasible. Hence, one has to: (i) develop algorithms that aim for unbiased error distributions (so that the law of large numbers can apply), (ii) aim for measurements and properties that are robust to the many uncontrollable factors that affect the data and (iii) evaluate performance on random small subsets of the data. A fourth approach is to develop outlier detection methods capable of cleaning out much of the erroneous data, although this also risks eliminating unanticipated subclasses of true-positive data whose behaviour does not conform to the data model.
In addition to the discussion above, it should not be overlooked that this work presents reliable evidence for an increase in fish speed as water temperature increases (readers should bear in mind the temperature interval used and the fish species investigated). Robust statistics (i.e. median values, also presented in our earlier work [7]) and the proposed pipeline for data cleaning (trajectory representation, clustering using the mcRBM and outlier detection) are the approaches applied to show that the observed trend is reliable. This scientific conclusion is significant even though the data include erroneous trajectories. The key point of this paper is to show that the cleaning process leads to the same conclusion, but a cleaner dataset has its own intrinsic value, because other analyses, e.g. abundance estimation and size distribution estimation, can now be done better with the cleaner data.
The confirmed fish swimming speed trend is directly related to fish behaviour understanding, which is an important research topic in marine biology. The trend is consistent with the findings of marine biologists (see [7] for more information), which suggests that fully automated computer vision systems such as [43] can be helpful to marine biologists in their analyses.
In addition, the presented work (the trajectory representation, mcRBM and outlier detection components) can be adapted for trajectory analysis beyond fish trajectories, for instance to detect abnormal trajectories (in other words, rare or unusual trajectories, i.e. anomaly detection). Another application could be using the trajectory representation algorithm and the mcRBM for automatic biometric identification (recognition) based on handwriting.

Fig. 4 Speed versus temperature without data cleaning. (a) The mean (white circle) and median (white diamond) are marked, taking the middle temperature value of each bin as the horizontal coordinate. There are far more data at some temperatures than at others. (b) Box plots for each bin. Speeds are truncated at 100 mm/s. Outliers found by robust statistics are shown individually with red plus signs (appearing as thick bars). The boxes are bounded by the upper and lower quartiles of the data, the mid-bar is the median, and the whiskers show the most extreme speeds excluding the outliers. This figure is based on [7]

Fig. 5 Number of trajectories (in thousands) versus speed (in mm/s, shown up to 200 mm/s) after data cleaning was applied, for the 28.1-30.3°C temperature band

Fig. 6 Distributions of control points in a selection of the dense and small clusters, corresponding to the results given in Table 3

Table 2
Summary of estimated speed in each temperature bin when whole data is analysed

Table 3
Summary of estimated speed in each temperature bin when cleaned data is analysed. Columns: bin; number of trajectories; mean speed, mm/s; median speed, mm/s; mode speed, mm/s; standard deviation, mm/s

IET Comput. Vis., 2018, Vol. 12, Iss. 2, pp. 162-170. © The Institution of Engineering and Technology 2017