
Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics
- Length: 602 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-01-21
- ISBN-10: 1801072132
- ISBN-13: 9781801072137
- Sales Rank: #3828308
This book links data cleaning and preprocessing to help you design effective data analytics solutions.
Key Features
- Develop the skills to perform data cleaning, data integration, data reduction, and data transformation
- Get ready to make the most of your data with powerful data transformation and massaging techniques
- Perform thorough data cleaning, such as dealing with missing values and outliers
Book Description
Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.
This book will equip you with the optimum data preprocessing techniques from multiple perspectives. You’ll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment. This book will provide a comprehensive articulation of data preprocessing, its whys and hows, and help you identify opportunities where data analytics could lead to more effective decision making. It also demonstrates the role of data management systems and technologies for effective analytics and how to use APIs to pull data.
By the end of this Python data preprocessing book, you’ll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques; and handle outliers or missing values to effectively prepare data for analytic tools.
What you will learn
- Use Python to perform analytics functions on your data
- Understand the role of databases and how to effectively pull data from databases
- Perform data preprocessing steps defined by your analytics goals
- Recognize and resolve data integration challenges
- Identify the need for data reduction and execute it
- Detect opportunities to improve analytics with data transformation
Who this book is for
Junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data will find this book useful. Basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are assumed.
Table of Contents
- Review of the Core Modules of NumPy and Pandas
- Review of Another Core Module – Matplotlib
- Data – What Is It Really?
- Databases
- Data Visualization
- Prediction
- Classification
- Clustering Analysis
- Data Cleaning Level I – Cleaning Up the Table
- Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
- Data Cleaning Level III – Missing Values, Outliers, and Errors
- Data Fusion and Data Integration
- Data Reduction
- Data Transformation and Massaging
- Case Study 1 – Mental Health in Tech
- Case Study 2 – Predicting COVID-19 Hospitalizations
- Case Study 3: United States Counties Clustering Analysis
- Summary, Practice Case Studies, and Conclusions
Hands-On Data Preprocessing in Python
Front matter
- Contributors; About the author; About the reviewers; Preface; Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share Your Thoughts
Part 1: Technical Needs
Chapter 1: Review of the Core Modules of NumPy and Pandas
- Technical requirements; Overview of the Jupyter Notebook; Are we analyzing data via computer programming?; Overview of the basic functions of NumPy; The np.arange() function; The np.zeros() and np.ones() functions; The np.linspace() function; Overview of Pandas; Pandas data access; Boolean masking for filtering a DataFrame; Pandas functions for exploring a DataFrame; Pandas applying a function; The Pandas groupby function; Pandas multi-level indexing; Pandas pivot and melt functions; Summary; Exercises
Chapter 2: Review of Another Core Module – Matplotlib
- Technical requirements; Drawing the main plots in Matplotlib; Summarizing numerical attributes using histograms or boxplots; Observing trends in the data using a line plot; Relating two numerical attributes using a scatterplot; Modifying the visuals; Adding a title to visuals and labels to the axis; Adding legends; Modifying ticks; Modifying markers; Subplots; Resizing visuals and saving them; Resizing; Saving; Example of Matplotlib assisting data preprocessing; Summary; Exercises
Chapter 3: Data – What Is It Really?
- Technical requirements; What is data?; Why this definition?; DIKW pyramid; Data preprocessing for data analytics versus data preprocessing for machine learning; The most universal data structure – a table; Data objects; Data attributes; Types of data values; Analytics standpoint; Programming standpoint; Information versus pattern; Understanding everyday use of the word "information"; Statistical use of the word "information"; Statistical meaning of the word "pattern"; Summary; Exercises; References
Chapter 4: Databases
- Technical requirements; What is a database?; Understanding the difference between a database and a dataset; Types of databases; The differentiating elements of databases; Relational databases (SQL databases); Unstructured databases (NoSQL databases); A practical example that requires a combination of both structured and unstructured databases; Distributed databases; Blockchain; Connecting to, and pulling data from, databases; Direct connection; Web page connection; API connection; Request connection; Publicly shared; Summary; Exercises
Part 2: Analytic Goals
Chapter 5: Data Visualization
- Technical requirements; Summarizing a population; Example of summarizing numerical attributes; Example of summarizing categorical attributes; Comparing populations; Example of comparing populations using boxplots; Example of comparing populations using histograms; Example of comparing populations using bar charts; Investigating the relationship between two attributes; Visualizing the relationship between two numerical attributes; Visualizing the relationship between two categorical attributes; Visualizing the relationship between a numerical attribute and a categorical attribute; Adding visual dimensions; Example of a five-dimensional scatter plot; Showing and comparing trends; Example of visualizing and comparing trends; Summary; Exercise
Chapter 6: Prediction
- Technical requirements; Predictive models; Forecasting; Regression analysis; Linear regression; Example of applying linear regression to perform regression analysis; MLP; How does MLP work?; Example of applying MLP to perform regression analysis; Summary; Exercises
Chapter 7: Classification
- Technical requirements; Classification models; Example of designing a classification model; Classification algorithms; KNN; Example of using KNN for classification; Decision Trees; Example of using Decision Trees for classification; Summary; Exercises
Chapter 8: Clustering Analysis
- Technical requirements; Clustering model; Clustering example using a two-dimensional dataset; Clustering example using a three-dimensional dataset; K-Means algorithm; Using K-Means to cluster a two-dimensional dataset; Using K-Means to cluster a dataset with more than two dimensions; Centroid analysis; Summary; Exercises
Part 3: The Preprocessing
Chapter 9: Data Cleaning Level I – Cleaning Up the Table
- Technical requirements; The levels, tools, and purposes of data cleaning – a roadmap to chapters 9, 10, and 11; Purpose of data analytics; Tools for data analytics; Levels of data cleaning; Mapping the purposes and tools of analytics to the levels of data cleaning; Data cleaning level I – cleaning up the table; Example 1 – unwise data collection; Example 2 – reindexing (multi-level indexing); Example 3 – intuitive but long column titles; Summary; Exercises
Chapter 10: Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
- Technical requirements; Example 1 – unpacking columns and reformulating the table; Unpacking FileName; Unpacking Content; Reformulating a new table for visualization; The last step – drawing the visualization; Example 2 – restructuring the table; Example 3 – level I and II data cleaning; Level I cleaning; Level II cleaning; Doing the analytics – using linear regression to create a predictive model; Summary; Exercises
Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors
- Technical requirements; Missing values; Detecting missing values; Example of detecting missing values; Causes of missing values; Types of missing values; Diagnosis of missing values; Dealing with missing values; Outliers; Detecting outliers; Dealing with outliers; Errors; Types of errors; Dealing with errors; Detecting systematic errors; Summary; Exercises
Chapter 12: Data Fusion and Data Integration
- Technical requirements; What are data fusion and data integration?; Data fusion versus data integration; Directions of data integration; Frequent challenges regarding data fusion and integration; Challenge 1 – entity identification; Challenge 2 – unwise data collection; Challenge 3 – index mismatched formatting; Challenge 4 – aggregation mismatch; Challenge 5 – duplicate data objects; Challenge 6 – data redundancy; Example 1 (challenges 3 and 4); Example 2 (challenges 2 and 3); Example 3 (challenges 1, 3, 5, and 6); Checking for duplicate data objects; Designing the structure for the result of data integration; Filling songIntegrate_df from billboard_df; Filling songIntegrate_df from songAttribute_df; Filling songIntegrate_df from artist_df; Checking for data redundancy; The analysis; Example summary; Summary; Exercise
Chapter 13: Data Reduction
- Technical requirements; The distinction between data reduction and data redundancy; The objectives of data reduction; Types of data reduction; Performing numerosity data reduction; Random sampling; Stratified sampling; Random over/undersampling; Performing dimensionality data reduction; Linear regression as a dimension reduction method; Using a decision tree as a dimension reduction method; Using random forest as a dimension reduction method; Brute-force computational dimension reduction; PCA; Functional data analysis; Summary; Exercises
Chapter 14: Data Transformation and Massaging
- Technical requirements; The whys of data transformation and massaging; Data transformation versus data massaging; Normalization and standardization; Binary coding, ranking transformation, and discretization; Example one – binary coding of nominal attribute; Example two – binary coding or ranking transformation of ordinal attributes; Example three – discretization of numerical attributes; Understanding the types of discretization; Discretization – the number of cut-off points; A summary – from numbers to categories and back; Attribute construction; Example – construct one transformed attribute from two attributes; Feature extraction; Example – extract three attributes from one attribute; Example – Morphological feature extraction; Feature extraction examples from the previous chapters; Log transformation; Implementation – doing it yourself; Implementation – the working module doing it for you; Smoothing, aggregation, and binning; Smoothing; Aggregation; Binning; Summary; Exercise
Part 4: Case Studies
Chapter 15: Case Study 1 – Mental Health in Tech
- Technical requirements; Introducing the case study; The audience of the results of analytics; Introduction to the source of the data; Integrating the data sources; Cleaning the data; Detecting and dealing with outliers and errors; Detecting and dealing with missing values; Analyzing the data; Analysis question one – is there a significant difference between the mental health of employees across the attribute of gender?; Analysis question two – is there a significant difference between the mental health of employees across the Age attribute?; Analysis question three – do more supportive companies have mentally healthier employees?; Analysis question four – does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?; Summary
Chapter 16: Case Study 2 – Predicting COVID-19 Hospitalizations
- Technical requirements; Introducing the case study; Introducing the source of the data; Preprocessing the data; Designing the dataset to support the prediction; Filling up the placeholder dataset; Supervised dimension reduction; Analyzing the data; Summary
Chapter 17: Case Study 3: United States Counties Clustering Analysis
- Technical requirements; Introducing the case study; Introduction to the source of the data; Preprocessing the data; Transforming election_df to partisan_df; Cleaning edu_df, employ_df, pop_df, and pov_df; Data integration; Data cleaning level III – missing values, errors, and outliers; Checking for data redundancy; Analyzing the data; Using PCA to visualize the dataset; K-Means clustering analysis; Summary
Chapter 18: Summary, Practice Case Studies, and Conclusions
- A summary of the book; Part 1 – Technical requirements; Part 2 – Analytics goals; Part 3 – The preprocessing; Part 4 – Case studies; Practice case studies; Google Covid-19 mobility dataset; Police killings in the US; US accidents; San Francisco crime; Data analytics job market; FIFA 2018 player of the match; Hot hands in basketball; Wildfires in California; Silicon Valley diversity profile; Recognizing fake job posting; Hunting more practice case studies; Conclusions
Back matter
- Why subscribe?; Other Books You May Enjoy; Packt is searching for authors like you; Share Your Thoughts
How to download source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.