Big data - Analytics and Statistical Perspective. A Work in Progress

By Dr Joseph Aluya, Posted:2014-03-16 09:31:57



Big data - The Analytic Perspective A work in Progress by Dr. Joseph Aluya and Dr. Ossian Garraway


Big data – Statistics:

Analytically, the objective function has been to maximize insight from big data subject to (a) a fixed budget, (b) monotonic increases in inferential power as the data grew, and (c) estimated inflection points on all explanatory variables given their respective errors. The modern enterprise has enhanced the value of the big data phenomenon by using sophisticated statistical techniques to extract meaningful insights. In statistics, big data's signature has been the application of measurements to massive datasets and the testing of these datasets for exploratory and confirmatory purposes. The statistical interest in big data has been in the granularity of the data rather than its volume. The statistical problem with big data has been that the number of hypotheses to be tested, the error rates, and the rows of data to be regressed upon have been expected to grow along with the exponential growth of big data, because models have been including more variables to enhance insights about dependent variables and their patterns of behavior. The prodigious increases in available data have induced consumers to demand ever finer granularity. This demand has made it harder to deliver statistical results in a timely and meaningful way, given the continuously shortening life span of the data collected. In practice, as statisticians processed more data, the volume of data processed, driven by consumer ambition, has correlated with increases in error rates, which in turn have increased distortions about the future. The growing pursuit of infinite ambition by the modern firm has led to increases in latent inferences or combinations of inferences. In addition, as the algorithms behind the sophisticated statistical techniques became less linear, the risk with respect to the payoff became greater (Jordan, 2013).

In statistics, the big data problem has been characterized by massive sample sizes and high dimensionality, both of which have given rise to more computational methods, new statistical modes of thought, and challenges to the traditional statistical procedures currently in use (Bickel et al., 2009a). Big data has afforded a deeper understanding of heterogeneity and commonality across various sub-populations. This deeper understanding has allowed researchers to explore the hidden structures governing sub-population data that otherwise might have had to be treated as statistical outliers (Bickel, 2008). On the other hand, statisticians have had to contend with the challenges of high dimensionality, which have spawned increases in noise accumulation as well as more instances of spurious correlations and incidental endogeneity. Other statistical challenges have included increases in computational costs and algorithmic instability as a result of the growing volume of data and the diversity of data sources. To design effective statistical procedures for big data analytics, statisticians have been nudged to devise new ways to balance statistical accuracy against computational efficiency, subject to the disturbances of heterogeneity, noise, spurious correlations and incidental endogeneity. Since increases in these disturbances reduced the accuracy and stability of statistical procedures applied to big data, researchers have shown statistical accuracy to be inversely related to the number of dimensions and of selections made in the data. For example, there has been evidence that classical statistical techniques applied to high dimensional data performed no better than a random guess as a result of noise accumulation (Hall, Pittelkow & Ghosh, 2008).
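
The noise accumulation effect described above can be illustrated with a small simulation. This is only a sketch under assumed settings that do not come from the article: two Gaussian classes that differ in 10 coordinates, 10 training points per class, and a nearest-centroid rule standing in for a "classical" classifier. The function name `simulate` and all parameter values are hypothetical.

```python
import random

def simulate(p_total, p_signal=10, n_train=10, n_test=200, shift=1.0, seed=0):
    """Nearest-centroid test accuracy using only the informative
    features versus using all p_total features (illustrative setup)."""
    rng = random.Random(seed)

    def draw(label):
        # Class 1 differs from class 0 only in the first p_signal coordinates.
        return [rng.gauss(shift if (label == 1 and j < p_signal) else 0.0, 1.0)
                for j in range(p_total)]

    train = [(draw(y), y) for y in (0, 1) for _ in range(n_train)]
    test = [(draw(y), y) for y in (0, 1) for _ in range(n_test)]

    def centroid(label):
        rows = [x for x, y in train if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def accuracy(p_used):
        # Classify each test point by the closer class centroid,
        # measuring distance on the first p_used coordinates only.
        correct = 0
        for x, y in test:
            d0 = sum((x[j] - c0[j]) ** 2 for j in range(p_used))
            d1 = sum((x[j] - c1[j]) ** 2 for j in range(p_used))
            correct += int((d1 < d0) == (y == 1))
        return correct / len(test)

    return accuracy(p_signal), accuracy(p_total)

acc_signal_only, acc_all_features = simulate(p_total=1000)
print(f"10 informative features: {acc_signal_only:.2f}")
print(f"all 1000 features:       {acc_all_features:.2f}")
```

In this setup the 990 uninformative coordinates each add estimation noise to the class centroids, so the rule that uses every feature should drift toward chance while the rule restricted to the informative features stays accurate.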

In terms of computational efficiency, big data has forced researchers and practitioners to come up with blended cross-discipline procedures to improve algorithmic scalability, speed and stability when applied to the growing volume of data available to the enterprise. Cross-pollinations among the different fields have included restating statistical problems as deterministic models in optimization theory to improve the efficiency of a solution with respect to time and storage resources (Donoho & Elad, 2003; Efron, Hastie, Johnstone & Tibshirani, 2004). In macroeconomics, where vector autoregression (VAR) was used to capture the linear interdependencies among multiple time series, the conventional model normally included no more than ten variables. However, with the current surge of big data, econometricians have had to make adjustments to evaluate multivariate time series using hundreds of variables. As a result, practitioners have had to employ sparsity assumptions to develop new statistical tools that prevent overfitting and poor prediction performance (Song & Bickel, 2011; Han & Liu, 2013b). In portfolio optimization, estimators have had to devise new procedures to reduce the cumulative error arising from large covariance and inverse covariance matrices when estimating the returns on portfolios' assets (Cochrane, 2001).

In the twenty-first century, statistics applied to big data has changed the enterprise's thinking about queries to include error bars that question the veracity of the data. Computer science as a discipline has been mainly concerned with setting up polynomial vectors to manage resources like time, space and energy subject to constraints. Historically, data has not been viewed as a resource in computer science, but instead as a target to which one would apply its models. For example, the notion of a signal-to-noise ratio, as applied in data analytics, did not figure in computer science. Consequently, the computer science community has devoted little attention to the quality of the data collected and to the pertinent insights extracted from it (Jordan, 2013). These instances have pointed to the catalytic nature of big data as a disruptive technology driving innovation in applied business statistics.

High dimensionality has been rooted in the claim that for a given level of accuracy, the sample count required to estimate any function will increase exponentially in relation to its dimensionality. By implication, the number of required data items in its corresponding dataset will increase exponentially as well. While modern technological innovations have made it possible to process large data sets at relatively low cost, the concomitant challenges of high dimensionality have continued to abound. For instance, knowledge discovery from large datasets has usually relied on: (a) feature extraction to reduce dimensionality; and (b) variable selection to return various subsets of relevant variables from the features that were originally extracted. Because they pose a combinatorial optimization problem over subsets of variables, the traditional variable selection criteria like Mallows's Cp, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) have led to increases in computational time that were exponentially related to increases in dimensionality. To solve the statistical challenges caused by the growing interest in high dimensional data (i.e. big data), scholars have had to employ cross-fertilization by finding solutions steeped in a mélange of disciplines like statistics forged with computational mathematics and/or machine learning (Fan & Li, 2006; Donoho & Elad, 2003; Donoho & Huo, 2001).
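
The exponential cost of criterion-based variable selection can be seen in a brute-force search. The sketch below is an illustration, not a method from the cited papers: it assumes NumPy is available, uses synthetic data, scores every subset of p candidate variables with a Gaussian AIC (up to an additive constant), and counts the model fits. The helper names `aic_for_subset` and `best_subset` are hypothetical.

```python
from itertools import combinations

import numpy as np

def aic_for_subset(X, y, cols):
    """Gaussian AIC, up to an additive constant, of an ordinary
    least-squares fit on the chosen columns plus an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

def best_subset(X, y):
    """Exhaustive search: one model fit per subset, 2**p fits in total."""
    p = X.shape[1]
    best, fits = None, 0
    for size in range(p + 1):
        for cols in combinations(range(p), size):
            fits += 1
            score = aic_for_subset(X, y, cols)
            if best is None or score < best[0]:
                best = (score, cols)
    return best[1], fits

# Synthetic data (an assumption for illustration): only columns 0 and 3 matter.
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

chosen, fits = best_subset(X, y)
print("selected columns:", chosen)
print("models fitted:", fits)  # 2**8 = 256; p = 40 would need about 1.1e12 fits
```

Each added candidate variable doubles the number of fits, which is why exhaustive search becomes infeasible well before the dimensionality typical of big data.
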
With respect to the curse of dimensionality, scholars have argued that increasing the number of dimensions has eroded the meaning of nearest neighbor search: for some distance function d(p,q), the ratio of the variance of the distance between p and q to the mean distance between them will converge to zero as the number of dimensions n approaches infinity. Hence,

$$\lim_{n \to \infty} \frac{\operatorname{Var}[d(p,q)]}{\operatorname{E}[d(p,q)]} = 0$$

Therefore, the distance to the nearest neighbor and the distance to the farthest neighbor would likely converge as the number of dimensions increased.
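
The convergence of nearest and farthest distances can be checked empirically. The following is a minimal sketch under assumptions not taken from the article: points drawn uniformly from the unit hypercube, Euclidean distance, and a hypothetical helper `relative_contrast` that reports (farthest - nearest) / nearest for one random query point.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random
    query point to n_points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

contrasts = {n: relative_contrast(n) for n in (2, 10, 100, 1000)}
for n, value in contrasts.items():
    print(f"n = {n:4d}  relative contrast = {value:.3f}")
```

As the number of dimensions grows, the relative contrast should shrink toward zero, meaning the nearest neighbor is barely closer than the farthest one.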

While the propagation of outliers in data has lessened the likelihood that a data set conformed to a normal distribution, data analysts and statisticians have deemed it necessary to consider multivariate outliers whenever disparate populations accounting for more than a small portion of the data had to be detected. Despite its extreme sensitivity to outliers, analysts have continued to use the sample mean to estimate location parameters in statistical inference and data analysis. However, given factors like dimension, computation time, sample size, contamination fraction, and the distance of the contamination from the main body of data, the median has in most cases been a more robust measure than the mean when outliers formed a significant part of a sample. The reason for such robustness has been its independence from the data's underlying measurement scale and coordinate system, even though in special cases like the spatial median and the coordinate-wise median, the property of affine equivariance might have been lacking (Zuo, 2004). Furthermore, most estimators of multivariate location and shape have failed when the fraction of contamination exceeded $1/(p+1)$, where p represented the dimension of the data. The issue has been exacerbated as the number of dimensions increased: in cases of high dimensionality, even a small fraction of outliers has been known to give erroneous estimates. Identifying outliers has been most difficult when clusters of outliers with a shape similar to the main data have been present. In addition, the ease of recognizing outliers has been inversely related to the density of the cluster (Donoho, 1982; Maronna, 1976; Stahel, 1981; Rocke & Woodruff, 1996). In high-dimensional space, outliers with a covariance matrix resembling that of the main data have been the hardest type to find (Hawkins, 1980; Rocke & Woodruff, 1993).
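
The contrast between the mean and the median under contamination can be shown with a short simulation. This sketch uses assumed values that are not from the article: 3 dimensions, 900 clean points centered at the origin, and 100 contaminating points centered at 20 in every coordinate. It uses the coordinate-wise median, which, as noted above, is not affine equivariant.

```python
import math
import random
import statistics

rng = random.Random(0)
p = 3  # dimension of the data (illustrative)

# 90% of the sample comes from the main population centered at the origin,
# 10% from a contaminating cluster far away.
clean = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(900)]
contaminated = [[rng.gauss(20.0, 1.0) for _ in range(p)] for _ in range(100)]
sample = clean + contaminated

columns = list(zip(*sample))
mean_estimate = [statistics.mean(col) for col in columns]
median_estimate = [statistics.median(col) for col in columns]

# Distance of each location estimate from the true center (the origin).
mean_error = math.sqrt(sum(v * v for v in mean_estimate))
median_error = math.sqrt(sum(v * v for v in median_estimate))

print(f"error of sample mean:            {mean_error:.2f}")
print(f"error of coordinate-wise median: {median_error:.2f}")
```

The sample mean is pulled toward the contaminating cluster in proportion to its share of the data, while the coordinate-wise median moves only slightly.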

One of the feature extraction techniques to reduce dimensionality has been Principal Component Analysis (PCA). This technique has been based on a mapping function that linked a set of vectors {x1,...,xn} in some dimension k to a set of vectors in another dimension d such that scalar multiplication and addition were preserved (that is, the mapping was linear). In n-dimensional space, the transpose of the transformation matrix X must equal its inverse, meaning $X^T = X^{-1}$, so that $X^T X = I$, where I was the identity matrix. At the end of the transformation, the resulting vectors will have been mutually orthogonal and hence linearly independent. These linearly independent vectors were called principal components. It should be noted that the mapping can be n x n, or n x (n-j) for j > 0. The principal components have been ordered by the size of their variance, in descending order. The rationale of this statistical procedure has been that in high-dimensional space, the directions of low variance tended to be dominated by noise. Specifically, with more of the total variance concentrated in the first principal components relative to the same noise variance, the proportionate effect of the noise would diminish, since the first few components would achieve a higher signal-to-noise ratio, while the later principal components could be discarded since they may be dominated by noise. Therefore, since PCA was used to reduce dimensionality with little loss of information, its use has contributed to a reduction of Gaussian noise in high dimensional data. This reduction in dimensionality has been a useful step for processing and visualizing high-dimensional datasets and has been applied in face recognition and image compression techniques, where there has been a need to find patterns in large high dimensional data (Smith, 2002).
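
The PCA procedure and its noise-reducing effect can be sketched as follows. This is an illustrative example, not the implementation used in the cited work: it assumes NumPy, synthetic data with 2 underlying directions embedded in 20 dimensions plus Gaussian noise, and a hypothetical helper `pca` built on an eigendecomposition of the sample covariance matrix.

```python
import numpy as np

def pca(X, d):
    """Project the rows of X onto the d orthonormal directions of
    largest variance. Returns scores, directions, and all variances."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-rank by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :d]                       # k x d, orthonormal columns
    return Xc @ W, W, eigvals

# Synthetic data (an assumption for illustration): a 2-dimensional signal
# embedded in 20 dimensions, then corrupted by Gaussian noise.
rng = np.random.default_rng(0)
n, k, d = 500, 20, 2
latent = rng.normal(size=(n, d)) * np.array([5.0, 3.0])
mixing = np.linalg.qr(rng.normal(size=(k, d)))[0]   # k x d, orthonormal columns
signal = latent @ mixing.T
noisy = signal + rng.normal(scale=0.5, size=(n, k))

scores, W, eigvals = pca(noisy, d)
denoised = scores @ W.T + noisy.mean(axis=0)  # map back to k dimensions

err_before = float(np.mean((noisy - signal) ** 2))
err_after = float(np.mean((denoised - signal) ** 2))
print("variances of leading components:", np.round(eigvals[:4], 2))
print(f"mean squared error before PCA: {err_before:.3f}")
print(f"mean squared error after PCA:  {err_after:.3f}")
```

Keeping only the first d components discards the noise that lives in the remaining k - d directions, so the reconstruction should sit closer to the noise-free signal than the raw data does.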



Comments on JOFDT


I think you got the wrong Cp. Where are the references? Seems like scat -- the singing kind.

Posted: 2014-03-19 11:32:50
