Exploratory Data Analysis

EDA is one of the very first steps of my data science projects.

Purpose of EDA: the questions to be answered serve as guides in EDA.

- Polish the questions: check whether the questions to be answered are valid and well stated. If not, modify them or come up with new ones.
- Validate data I/O methods: check and validate the methods used to load and save the datasets.
- Is the dataset good enough for the problem? Are the features/variables required for the project included? If not, what other data should be included?
- What is the general quality of the dataset? Can one answer the questions semi-quantitatively using the data?
- Retrieve domain knowledge and anomalies.
- Propose the next steps.

Communicate with domain experts:

- What are the features? Pay attention to the units.
- Do the results from EDA make sense to the experts?
- What do the experts want to know from the data?

Workflow:

- Polish the questions: will there be any new restrictions on the solutions related to the dataset?
- Data quality and summary: produce the data quality and summary statistics report.
- Does the result make sense? This is a crucial step in EDA. Use techniques such as Fermi estimates to evaluate the summary, and check the consistency between the summary and expert expectations.

Data Quality and Summary

Validate the data quality and generate summary statistics reports.

Rows and Columns:

- Rows
  - Description: what does a row mean?
  - Count: how many rows?
- Columns
  - Description: what does each column mean?
  - Count: how many columns?
  - Possible values or ranges: list the theoretical limits on the values and validate them against the data.

Types and Formats:

- Data types
  - What does each column consist of?
  - Types of data: ordinal, nominal, interval, generative, etc.
  - Is the type of the data correct?
- Data formats
  - Are the dates loaded as dates?
  - Are the numbers loaded as numbers, or are they strings?
  - Are the financial values correct? Are they strings or numbers?
  - Are they in EU format or US format?

Missing Values:

- Are there missing values in each column?
- Different types of missing values: the notation for missing values differs between datasets. Read the documentation of the dataset to find out.
  - Standard missing values: NaN, NaT, None, NA, null, ...
  - Missing values represented with a specific value: -1, 0, MISSING, ...
- Percentage of missing values in each column
- Visualizations, e.g., the missingno Python package

Duplications:

- Are there duplicated rows or columns?
- Validate by yourself: do not trust the metadata and documentation of the dataset. Duplicated fields may occur even when the documentation says they are unique.

Distributions:

- What is the generation process? Is it a histogram analysis of another row? Is it a linear combination of other rows?
- Visualize the distributions of the values
  - Know all the values
  - Value count bar plot: for discrete data, list all possible values and their counts.
  - Histogram and KDE: for continuous data, use histograms or KDE.
  - Boxplot: a boxplot is easier to understand for business people.
  - Scatter plot: gives a gut feeling of where the data points are located.
  - Contour plot
- Numerical summarization: use summary statistics to find out the moments.
  - Locations: mean, median, quartiles, mode, ...
  - Spreads: range, variance, standard deviation, IQR
  - Skewness: asymmetries
  - Kurtosis

Correlations, Similarities:

- Pairplot
- Correlations: Pearson, Kendall tau
- Distances: calculate the distance between features or rows to understand the relations between them, e.g., Euclidean distance, Mahalanobis distance, Minkowski distance, Jaccard distance, ...

Size:

- How much space will the data take on our storage device?
- Memory usage: to estimate the hardware requirements when deploying the model.
- Storage on hard drive in different formats: how much space will the dataset take in different formats?

Combining Data Files: one dataset may come in different files; combine them carefully.

- Concat: the files should be concatenated with caution.
- Validate overlap: check whether there is an overlap between the files.
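
The two points under Combining Data Files can be sketched roughly as follows. This is a minimal example assuming the files are already loaded as pandas DataFrames; the helper `combine_parts`, the toy frames `jan` and `feb`, and the key column `id` are hypothetical names used only for illustration.

```python
import pandas as pd


def combine_parts(parts, key):
    """Concatenate partial datasets after checking schema and key overlap.

    `parts` is a list of DataFrames and `key` is the column that is
    supposed to identify a row (both are illustrative assumptions).
    """
    # All parts should share the same columns before being stacked.
    expected = list(parts[0].columns)
    for i, part in enumerate(parts[1:], start=1):
        if list(part.columns) != expected:
            raise ValueError(f"part {i} has different columns: {list(part.columns)}")

    combined = pd.concat(parts, ignore_index=True)

    # Keys that occur more than once point to an overlap between files
    # (or to duplicates inside one file); return them for inspection.
    repeated = combined[key].duplicated(keep=False)
    overlap = sorted(combined.loc[repeated, key].unique().tolist())
    return combined, overlap


# Toy data standing in for two files of the same dataset.
jan = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 11.5, 9.8]})
feb = pd.DataFrame({"id": [3, 4], "value": [9.8, 12.1]})

combined, overlap = combine_parts([jan, feb], key="id")
print(len(combined), overlap)  # prints: 5 [3]
```

The function only reports the overlapping keys and does not drop anything, since deciding which copy to keep usually needs domain knowledge.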
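
Several of the checks in Data Quality and Summary (column types, missing values, duplications, numerical summarization, memory usage) can be collected into one small report. The sketch below assumes pandas; the function `quality_report`, its `sentinels` parameter, and the toy columns are hypothetical, and the sentinel values must come from the documentation of the actual dataset.

```python
import numpy as np
import pandas as pd


def quality_report(df, sentinels=(-1, "MISSING")):
    """Return (overview, per-column report) for a DataFrame.

    `sentinels` lists values that the dataset documentation declares as
    "missing"; the defaults are placeholders, not a recommendation.
    """
    overview = {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
    }
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        # Standard missing values (NaN, NaT, None).
        "pct_missing": df.isna().mean() * 100,
        # Missing values encoded as specific values.
        "pct_sentinel": df.isin(list(sentinels)).mean() * 100,
        "memory_bytes": df.memory_usage(index=False, deep=True),
    })
    # Locations, spreads, skewness and kurtosis for numeric columns only.
    numeric = df.select_dtypes(include="number")
    stats = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        "iqr": numeric.quantile(0.75) - numeric.quantile(0.25),
        "skew": numeric.skew(),
        "kurtosis": numeric.kurt(),
    })
    # Non-numeric columns get NaN in the statistics columns.
    return overview, report.join(stats)


# Toy data: rows 1 and 2 are exact duplicates.
df = pd.DataFrame({
    "id": [1, 2, 2, 4, 5],
    "age": [34, -1, -1, 51, 28],
    "price": [9.5, 7.0, 7.0, np.nan, 12.0],
    "city": ["Berlin", "MISSING", "MISSING", "Paris", None],
})

overview, report = quality_report(df)
print(overview)
print(report)
```

Note that sentinel values such as -1 still enter the mean and standard deviation here, which is one reason to compare the summary against expert expectations before trusting it.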