Data science is often described as one of the most attractive careers, both for the knowledge it demands and the salaries it pays. To excel in this field, you need a strong foundation in statistics. This Statistics for Data Science tutorial introduces the core concepts used to solve data science problems and shows how to apply them to different types of data.
Statistics is the mathematical science of gathering, describing, analyzing, and interpreting data. It is widely used to understand complex real-world problems and simplify them so that good decisions can be made. Statistical principles, functions, and algorithms can be used to analyze data, build a statistical model, and predict outcomes. Let’s get started with this statistics tutorial.
Analysis can be done in two ways:
- Statistical Analysis: It is the science of collecting, exploring, and presenting a large amount of data to identify patterns and trends. It is also known as Quantitative Analysis.
- Non-Statistical Analysis: It provides general, non-numerical information drawn from text, images, sound, and video. It is also known as Qualitative Analysis.
Types of Statistics:
There are mainly two categories of Statistics:
- Descriptive Statistics: Helps to organize data and focuses on the main characteristics of the data. It provides a numerical or graphical summary of the data. Numerical measures like average, standard deviation, mode, and correlation are used to describe the features of a data set.
- Inferential Statistics: It uses a sample and probability theory to draw conclusions about the larger population the sample comes from. It also uses mathematical equations to describe the relationship between two or more variables.
Where statistics is used:
- Stock Market
- Weather Forecasting
- Medical Studies
- Genetics, etc.
Basic terms in statistics:
- Population: The whole group from which data is to be collected
- Sample: A subset of the population
- Variable: A feature or characteristic of a member of the population that differs in quality or quantity from one member to another
- Quantitative Variable: A variable that differs in quantity
- Qualitative Variable: A variable that differs in quality
- Discrete Variable: No value can be assumed between two given values
- Continuous Variable: Any value can be assumed between two given values
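To make the terms above concrete, here is a minimal sketch using only Python's standard library. The population of ages is made-up data and the variable names are illustrative assumptions, not part of the tutorial. It draws a sample from a population and compares the two means.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages (a discrete, quantitative variable)
# of 10,000 people. The data is made up for illustration.
population = [random.randint(18, 80) for _ in range(10_000)]

# A sample is a subset of the population, drawn here without replacement.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population size: {len(population)}, mean age: {population_mean:.2f}")
print(f"sample size:     {len(sample)}, mean age: {sample_mean:.2f}")
```

The sample mean will usually land close to the population mean without matching it exactly, which is the gap inferential statistics tries to quantify.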
Descriptive measures fall into four groups:
- Measures of Frequency: The number of times a particular data value occurs in the given data set. Count and Percentage.
- Measures of Central Tendency: Indicate where the data values cluster, whether in the middle of the distribution or toward one end. Mean, Median, Mode.
- Measures of Spread: How similar or varied the observed values of a particular variable are. Standard deviation, Variance, Quartiles.
- Measures of Position: The location of a particular data value within the given data set. Percentiles, Quartiles, Standard Scores.
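The four groups of measures above can be computed with Python's standard library. This is a small sketch on made-up data; the variable names and the choice of the "inclusive" quartile method are assumptions for illustration, and other tools may use slightly different quartile definitions.

```python
import statistics
from collections import Counter

# Made-up data: number of tips a waiter received on 10 shifts.
data = [2, 3, 3, 5, 7, 8, 8, 8, 10, 12]

# Measures of frequency: count and percentage of each value.
counts = Counter(data)
percentages = {value: 100 * n / len(data) for value, n in counts.items()}

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of spread (sample variance and sample standard deviation).
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Measures of position: quartiles and the standard score (z-score) of 12.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
z_score_of_12 = (12 - mean) / std_dev

print("frequency of 8:", counts[8], f"({percentages[8]:.0f}%)")
print("mean, median, mode:", mean, median, mode)
print("variance, std dev:", round(variance, 3), round(std_dev, 3))
print("quartiles:", q1, q2, q3)
print("z-score of 12:", round(z_score_of_12, 3))
```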
A distribution tells us how likely each value or event is to occur. Distribution plots are suitable for comparing the ranges and distributions of groups of numerical data. They help visualize a large amount of data, identify outliers, and summarize how the data is distributed.
- Box and Whisker Plot: Shows the five-number summary (minimum, lower quartile, median, upper quartile, maximum) of the given data set. It is used to find outliers in the data set and to see whether the distribution is skewed.
The lower quartile and upper quartile form the two ends of the box. The line inside the box marks the median. The two lines (whiskers) extending outside the box reach the minimum and maximum observations; many plotting tools draw points far beyond the box separately as outliers.
The Seaborn library provides visualization plots and also ships with built-in sample data sets. We use the tips data set, which records restaurant bills and the tips customers paid.
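The numbers behind a box and whisker plot can be computed directly. The sketch below uses made-up bill amounts and two hypothetical helpers, `five_number_summary` and `find_outliers`. The text does not say how outliers are decided, so the 1.5 × IQR rule used here is an assumed (though common) convention.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def find_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the box (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1  # interquartile range: the length of the box
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up bill amounts; 48.0 is deliberately far from the rest.
bills = [10.0, 12.5, 13.0, 14.0, 15.5, 16.0, 17.0, 18.5, 20.0, 48.0]

print("five-number summary:", five_number_summary(bills))
print("outliers:", find_outliers(bills))
```

A long upper whisker or a median sitting off-center in the box, as with this data, suggests a skewed distribution.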
import seaborn as sns
import matplotlib.pyplot as plt
from warnings import filterwarnings
df = sns.load_dataset('tips')
DistPlot: A distribution plot shows the distribution of a single variable as a histogram. It applies to univariate data, and here it is drawn with sns.histplot.
ax = sns.histplot(df['total_bill'], kde=False, color='green', bins=25)
Output: Only the total_bill column is plotted, because a histogram describes a single (univariate) variable.
JointPlot: When there are two variables, a joint plot shows their bivariate relationship together with the univariate distribution of each variable along the margins.
ax2 = sns.jointplot(x='total_bill', y='tip', data=df, color='blue')
sns.jointplot(x='total_bill', y='tip', data=df, kind='kde', color='blue')
Here, KDE stands for Kernel Density Estimation. It shows the density of the data, that is, where most of the points fall.
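To show what a kernel density estimate computes, here is a hand-written one-dimensional sketch. It is not how Seaborn implements it: `make_kde` is a hypothetical helper, the Gaussian kernel and the bandwidth of 1.0 are assumptions, and the data is made up. Each observation contributes a small bell-shaped bump, and the estimate is the average of those bumps.

```python
import math

def make_kde(points, bandwidth):
    """Return a function that estimates the density of 1-D `points`.

    Each observation contributes a Gaussian bump of width `bandwidth`;
    the estimate at x is the average of those bumps.
    """
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density

# Made-up bill amounts clustered around 15, plus one isolated value.
bills = [14.0, 14.5, 15.0, 15.2, 15.8, 16.0, 30.0]
density = make_kde(bills, bandwidth=1.0)  # bandwidth chosen by hand

print(f"density near the cluster (x=15): {density(15.0):.3f}")
print(f"density in the gap (x=23):       {density(23.0):.3f}")
```

The estimate is high where observations crowd together and close to zero in the gap, which is what the shaded contours of a KDE joint plot convey in two dimensions.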
PairPlot: It shows the pairwise relationships between the variables in a data set, so you can see how every numerical variable relates to every other.
sns.pairplot(df, hue='sex', palette='coolwarm')
RugPlot: It takes only one variable. Like a distplot it shows a univariate distribution, but instead of bars it draws one dash along the axis for each observation.
Probability distributions can be classified into two types:
- Discrete Probability Distribution: The variable can take only a limited (countable) number of values in the given range.
  - Binomial Distribution
  - Poisson Distribution
- Continuous Probability Distribution: The variable can take any value within a given range.
  - Normal Distribution
Binomial Distribution: It describes a random experiment repeated for a fixed number of independent trials (n). Every trial has two possible outcomes, success and failure, with probability p of success and probability q of failure.
p + q = 1
q = 1 - p
$P(X = r) = \binom{n}{r}\, p^{r} q^{n-r}$ gives the probability of exactly r successes in n trials.
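The binomial formula can be evaluated in a few lines of Python. This is a minimal sketch; `binomial_pmf` is a hypothetical helper name, and the coin-toss numbers are an illustrative example rather than data from the tutorial.

```python
import math

def binomial_pmf(r, n, p):
    """P(X = r): probability of exactly r successes in n independent trials."""
    q = 1 - p  # probability of failure
    return math.comb(n, r) * p**r * q ** (n - r)

# Example: 10 fair coin tosses (n = 10, p = 0.5), exactly 3 heads.
print(binomial_pmf(3, n=10, p=0.5))  # 120 / 1024 = 0.1171875

# The probabilities over all possible r (0 to n) add up to 1.
print(sum(binomial_pmf(r, n=10, p=0.5) for r in range(11)))
```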
Poisson Distribution: When the number of trials n is very large and the probability of success p is very small, the binomial distribution becomes impractical to use. The Poisson distribution is used in that case, that is, when successful events are rare compared with the total number of trials n.
Probability of r successes: $P(r) = \dfrac{e^{-m}\, m^{r}}{r!}$
e ≈ 2.71828,
m = np (arithmetic mean)
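As a rough check of the Poisson formula, the sketch below compares it with the exact binomial probability for a large n and a small p. The helper names and the scenario (1,000 trials with a 0.2% success chance) are made-up assumptions for illustration.

```python
import math

def poisson_pmf(r, m):
    """P(r) = e^(-m) * m^r / r!, the probability of exactly r successes."""
    return math.exp(-m) * m**r / math.factorial(r)

def binomial_pmf(r, n, p):
    """Exact binomial probability, used here only for comparison."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

# Made-up scenario: 1,000 trials, each with a 0.2% chance of success.
n, p = 1000, 0.002
m = n * p  # mean number of successes, m = np

print("r  binomial  poisson")
for r in range(5):
    print(r, round(binomial_pmf(r, n, p), 4), round(poisson_pmf(r, m), 4))
```

With these values the two columns should agree to about three decimal places, which is why the simpler Poisson formula is preferred for rare events.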
Normal Distribution: It is also called the Gaussian distribution. The variable can take any value within the given range, so the probability distribution is continuous. It is bell-shaped, and the mean, median, and mode all lie at the center.
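These properties of the normal distribution can be checked with `NormalDist` from Python's standard `statistics` module. The height figures (mean 170 cm, standard deviation 10 cm) are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical variable: heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

# The bell curve is symmetric, so mean, median, and mode coincide.
print(heights.mean, heights.median, heights.mode)

# Symmetry: the density is the same 10 cm below and 10 cm above the center,
# and it is highest at the center itself.
print(heights.pdf(160), heights.pdf(170), heights.pdf(180))

# Probability of a value within one standard deviation of the mean.
within_one_sd = heights.cdf(180) - heights.cdf(160)
print(round(within_one_sd, 4))
```

Because the variable is continuous, probabilities are read from ranges of values (via `cdf`) rather than from single points.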
For more details, follow our Statistics for Data Science tutorial: https://www.greatlearning.in/academy/learn-for-free/courses/statistics-for-data-science