Probability & Statistics
In this tutorial I will try to explain the basic, fundamental concepts in statistics and probability that every data analyst should know. We don't need sophisticated math; a basic knowledge of addition, subtraction, multiplication and division should be enough.
Statistics
According to Wikipedia, statistics is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data.
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation). Descriptive statistics are most often concerned with two sets of properties of a distribution (sample or population): its central tendency (where the center of the data lies) and its dispersion (how variable the data are).
Data Types
Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales.
Nominal measurements do not have meaningful rank order among values, and permit any one-to-one (injective) transformation.
Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation.
Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in Celsius or Fahrenheit), and permit any linear transformation.
Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.
Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature.
Quantitative
· Graphs
o Line chart
o Relative frequency histogram
o Dot plot
o Stem and leaf
Qualitative
· Measurement
o Frequency
o Relative frequency
o Percentage
· Graphs
o Pie chart
o Bar chart
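The three measurements listed for qualitative data can be sketched in a few lines of Python. This is only an illustration: the colors list is made-up data and frequency_table is a hypothetical helper name, not something from a library.

```python
from collections import Counter

# Hypothetical sample of a qualitative (categorical) variable.
colors = ["red", "blue", "red", "green", "blue", "red", "red", "green"]


def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)          # frequency: how often each category occurs
    total = len(values)
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


table = frequency_table(colors)
for category, (freq, rel, pct) in table.items():
    print(f"{category}: frequency={freq}, relative frequency={rel:.3f}, percentage={pct:.1f}%")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick sanity check for a table like this.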
Center of the data
Mean (Average)
The sum of the values divided by the number of values being averaged.
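As a small sketch, the mean can be computed like this in Python. The ages list is made-up example data, and the function name mean is just an illustrative choice.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


ages = [23, 29, 31, 35, 42]  # hypothetical data
print(mean(ages))  # 32.0
```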
Median
This is the midpoint of the data: we sort the data and choose the value in the middle, so that an equal number of data points lie above and below it. If the number of data points is even, we take the average of the two middle data points.
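The procedure for the median (sort, then pick the middle value or average the two middle values) can be sketched as follows. The input lists are made-up examples.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)          # the data must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


print(median([7, 1, 5, 3, 9]))  # 5
print(median([7, 1, 5, 3]))     # 4.0
```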
Mode
This is the data point that is repeated the most in our data set. A data set can have no mode, one mode, or two or more modes. The frequency of the mode can give valuable information about our data. A histogram can be used to visualize how often each data point occurs, which makes the mode easy to spot.
By combining the measures described above, we will be able to describe the data better.
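Because a data set can have zero, one, or several modes, a sketch of the mode is easiest to write as a function that returns a list. One assumption is baked in here: when no value repeats, the function reports no mode at all (some textbooks and libraries instead treat every value as a mode in that case).

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most often; an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:                  # assumption: no repeated value means no mode
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 2, 3, 3, 4]))  # [2, 3]  (two modes)
print(modes([4, 4, 1]))           # [4]     (one mode)
print(modes([1, 2, 3]))           # []      (no mode)
```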
Variability of the data
Range
The difference between the largest and smallest data points in our data set.
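A minimal sketch of the range in Python, using made-up numbers (data_range is a hypothetical name, chosen to avoid clashing with Python's built-in range):

```python
def data_range(values):
    """Difference between the largest and smallest data points."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


print(data_range([4, 8, 15, 16, 23, 42]))  # 38
```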
Standard Deviation
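This is the square root of the variance, where the variance is the average squared distance from the mean. It describes how far the data points typically lie from the mean, in the same units as the data.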
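To make the calculation concrete, here is a small sketch that first computes the variance (the average squared distance from the mean) and then takes its square root to get the standard deviation. It assumes the data are the whole population, so it divides by the number of values; for a sample, the usual convention is to divide by one less than that. The scores list is made-up data.

```python
import math


def variance(values):
    """Population variance: the average squared distance from the mean."""
    if not values:
        raise ValueError("the variance of an empty data set is undefined")
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def std_dev(values):
    """Population standard deviation: the square root of the variance."""
    return math.sqrt(variance(values))


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean is 5
print(variance(scores))  # 4.0
print(std_dev(scores))   # 2.0
```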
Outlier
An outlier is a data point that lies an abnormal distance from the other values in the data set. Outliers can be visualized with tables or charts.
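What counts as an "abnormal distance" is a judgment call. One common rule of thumb, used here purely as an assumption and not as the only definition, flags values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The sketch below uses statistics.quantiles from Python's standard library (Python 3.8+) with its default method; the data and the find_outliers name are made up for illustration.

```python
import statistics


def find_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (a rule of thumb, not a universal definition)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical data
print(find_outliers(readings))  # [102]
```

Other quartile conventions or a different multiplier k can move the cut-offs slightly, so borderline points may be classified differently.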