Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics

Length: 602 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2022-01-21
ISBN-10: 1801072132
ISBN-13: 9781801072137
Sales Rank: #3828308

This book links data cleaning and preprocessing to help you design effective data analytics solutions.

Key Features

Develop the skills to perform data cleaning, data integration, data reduction, and data transformation
Get ready to make the most of your data with powerful data transformation and massaging techniques
Perform thorough data cleaning, such as dealing with missing values and outliers

Book Description

Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.

This book will equip you with the optimum data preprocessing techniques from multiple perspectives. You’ll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment. This book will provide a comprehensive articulation of data preprocessing, its whys and hows, and help you identify opportunities where data analytics could lead to more effective decision making. It also demonstrates the role of data management systems and technologies for effective analytics and how to use APIs to pull data.

By the end of this Python data preprocessing book, you’ll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques; and handle outliers or missing values to effectively prepare data for analytic tools.

What you will learn

Use Python to perform analytics functions on your data
Understand the role of databases and how to effectively pull data from databases
Perform data preprocessing steps defined by your analytics goals
Recognize and resolve data integration challenges
Identify the need for data reduction and execute it
Detect opportunities to improve analytics with data transformation

Who this book is for

Junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data will find this book useful. Basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are assumed.

Review of the Core Modules of NumPy and Pandas
Review of Another Core Module – Matplotlib
Data – What Is It Really?
Databases
Data Visualization
Prediction
Classification
Clustering Analysis
Data Cleaning Level I – Cleaning Up the Table
Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
Data Cleaning Level III – Missing Values, Outliers, and Errors
Data Fusion and Data Integration
Data Reduction
Data Transformation and Massaging
Case Study 1 – Mental Health in Tech
Case Study 2 – Predicting COVID-19 Hospitalizations
Case Study 3: United States Counties Clustering Analysis
Summary, Practice Case Studies, and Conclusions

Hands-On Data Preprocessing in Python
Contributors
About the author
About the reviewers
Preface
    Who this book is for
    What this book covers
    To get the most out of this book
    Download the example code files
    Download the color images
    Conventions used
    Get in touch
    Share Your Thoughts
Part 1: Technical Needs
Chapter 1: Review of the Core Modules of NumPy and Pandas
    Technical requirements
    Overview of the Jupyter Notebook
    Are we analyzing data via computer programming?
    Overview of the basic functions of NumPy
        The np.arange() function
        The np.zeros() and np.ones() functions
        The np.linspace() function
    Overview of Pandas
        Pandas data access
        Boolean masking for filtering a DataFrame
        Pandas functions for exploring a DataFrame
        Pandas applying a function
        The Pandas groupby function
        Pandas multi-level indexing
        Pandas pivot and melt functions
    Summary
    Exercises
Chapter 2: Review of Another Core Module – Matplotlib
    Technical requirements
    Drawing the main plots in Matplotlib
        Summarizing numerical attributes using histograms or boxplots
        Observing trends in the data using a line plot
        Relating two numerical attributes using a scatterplot
    Modifying the visuals
        Adding a title to visuals and labels to the axis
        Adding legends
        Modifying ticks
        Modifying markers
    Subplots
    Resizing visuals and saving them
        Resizing
        Saving
    Example of Matplotlib assisting data preprocessing
    Summary
    Exercises
Chapter 3: Data – What Is It Really?
    Technical requirements
    What is data?
        Why this definition?
        DIKW pyramid
        Data preprocessing for data analytics versus data preprocessing for machine learning
    The most universal data structure – a table
        Data objects
        Data attributes
    Types of data values
        Analytics standpoint
        Programming standpoint
    Information versus pattern
        Understanding everyday use of the word "information"
        Statistical use of the word "information"
        Statistical meaning of the word "pattern"
    Summary
    Exercises
    References
Chapter 4: Databases
    Technical requirements
    What is a database?
        Understanding the difference between a database and a dataset
    Types of databases
        The differentiating elements of databases
        Relational databases (SQL databases)
        Unstructured databases (NoSQL databases)
        A practical example that requires a combination of both structured and unstructured databases
        Distributed databases
        Blockchain
    Connecting to, and pulling data from, databases
        Direct connection
        Web page connection
        API connection
        Request connection
        Publicly shared
    Summary
    Exercises
Part 2: Analytic Goals
Chapter 5: Data Visualization
    Technical requirements
    Summarizing a population
        Example of summarizing numerical attributes
        Example of summarizing categorical attributes
    Comparing populations
        Example of comparing populations using boxplots
        Example of comparing populations using histograms
        Example of comparing populations using bar charts
    Investigating the relationship between two attributes
        Visualizing the relationship between two numerical attributes
        Visualizing the relationship between two categorical attributes
        Visualizing the relationship between a numerical attribute and a categorical attribute
    Adding visual dimensions
        Example of a five-dimensional scatter plot
    Showing and comparing trends
        Example of visualizing and comparing trends
    Summary
    Exercise
Chapter 6: Prediction
    Technical requirements
    Predictive models
        Forecasting
        Regression analysis
    Linear regression
        Example of applying linear regression to perform regression analysis
    MLP
        How does MLP work?
        Example of applying MLP to perform regression analysis
    Summary
    Exercises
Chapter 7: Classification
    Technical requirements
    Classification models
        Example of designing a classification model
        Classification algorithms
    KNN
        Example of using KNN for classification
    Decision Trees
        Example of using Decision Trees for classification
    Summary
    Exercises
Chapter 8: Clustering Analysis
    Technical requirements
    Clustering model
        Clustering example using a two-dimensional dataset
        Clustering example using a three-dimensional dataset
    K-Means algorithm
        Using K-Means to cluster a two-dimensional dataset
        Using K-Means to cluster a dataset with more than two dimensions
        Centroid analysis
    Summary
    Exercises
Part 3: The Preprocessing
Chapter 9: Data Cleaning Level I – Cleaning Up the Table
    Technical requirements
    The levels, tools, and purposes of data cleaning – a roadmap to chapters 9, 10, and 11
        Purpose of data analytics
        Tools for data analytics
        Levels of data cleaning
        Mapping the purposes and tools of analytics to the levels of data cleaning
    Data cleaning level I – cleaning up the table
        Example 1 – unwise data collection
        Example 2 – reindexing (multi-level indexing)
        Example 3 – intuitive but long column titles
    Summary
    Exercises
Chapter 10: Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
    Technical requirements
    Example 1 – unpacking columns and reformulating the table
        Unpacking FileName
        Unpacking Content
        Reformulating a new table for visualization
        The last step – drawing the visualization
    Example 2 – restructuring the table
    Example 3 – level I and II data cleaning
        Level I cleaning
        Level II cleaning
        Doing the analytics – using linear regression to create a predictive model
    Summary
    Exercises
Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors
    Technical requirements
    Missing values
        Detecting missing values
        Example of detecting missing values
        Causes of missing values
        Types of missing values
        Diagnosis of missing values
        Dealing with missing values
    Outliers
        Detecting outliers
        Dealing with outliers
    Errors 
        Types of errors
        Dealing with errors
        Detecting systematic errors
    Summary
    Exercises
Chapter 12: Data Fusion and Data Integration
    Technical requirements
    What are data fusion and data integration?
        Data fusion versus data integration
        Directions of data integration
    Frequent challenges regarding data fusion and integration
        Challenge 1 – entity identification
        Challenge 2 – unwise data collection
        Challenge 3 – index mismatched formatting
        Challenge 4 – aggregation mismatch
        Challenge 5 – duplicate data objects
        Challenge 6 – data redundancy
    Example 1 (challenges 3 and 4)
    Example 2 (challenges 2 and 3)
    Example 3 (challenges 1, 3, 5, and 6)
        Checking for duplicate data objects
        Designing the structure for the result of data integration
        Filling songIntegrate_df from billboard_df
        Filling songIntegrate_df from songAttribute_df
        Filling songIntegrate_df from artist_df
        Checking for data redundancy
        The analysis
        Example summary
    Summary 
    Exercise
Chapter 13: Data Reduction
    Technical requirements
    The distinction between data reduction and data redundancy
        The objectives of data reduction
    Types of data reduction
    Performing numerosity data reduction
        Random sampling
        Stratified sampling
        Random over/undersampling
    Performing dimensionality data reduction
        Linear regression as a dimension reduction method
        Using a decision tree as a dimension reduction method
        Using random forest as a dimension reduction method
        Brute-force computational dimension reduction
        PCA
        Functional data analysis
    Summary
    Exercises
Chapter 14: Data Transformation and Massaging
    Technical requirements
    The whys of data transformation and massaging
        Data transformation versus data massaging
    Normalization and standardization
    Binary coding, ranking transformation, and discretization
        Example one – binary coding of nominal attribute
        Example two – binary coding or ranking transformation of ordinal attributes
        Example three – discretization of numerical attributes
        Understanding the types of discretization
        Discretization – the number of cut-off points
        A summary – from numbers to categories and back
    Attribute construction
        Example – construct one transformed attribute from two attributes
    Feature extraction
        Example – extract three attributes from one attribute
        Example – Morphological feature extraction
        Feature extraction examples from the previous chapters
    Log transformation
        Implementation – doing it yourself
        Implementation – the working module doing it for you
    Smoothing, aggregation, and binning
        Smoothing
        Aggregation
        Binning
    Summary
    Exercise
Part 4: Case Studies
Chapter 15: Case Study 1 – Mental Health in Tech
    Technical requirements
    Introducing the case study
        The audience of the results of analytics
        Introduction to the source of the data
    Integrating the data sources
    Cleaning the data
        Detecting and dealing with outliers and errors
        Detecting and dealing with missing values
    Analyzing the data
        Analysis question one – is there a significant difference between the mental health of employees across the attribute of gender?
        Analysis question two – is there a significant difference between the mental health of employees across the Age attribute?
        Analysis question three – do more supportive companies have mentally healthier employees?
        Analysis question four – does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?
    Summary
Chapter 16: Case Study 2 – Predicting COVID-19 Hospitalizations
    Technical requirements
    Introducing the case study
        Introducing the source of the data
    Preprocessing the data
        Designing the dataset to support the prediction
        Filling up the placeholder dataset
        Supervised dimension reduction
    Analyzing the data
    Summary
Chapter 17: Case Study 3: United States Counties Clustering Analysis
    Technical requirements
    Introducing the case study
        Introduction to the source of the data
    Preprocessing the data
        Transforming election_df to partisan_df
        Cleaning edu_df, employ_df, pop_df, and pov_df 
        Data integration
        Data cleaning level III – missing values, errors, and outliers
        Checking for data redundancy
    Analyzing the data
        Using PCA to visualize the dataset
        K-Means clustering analysis
    Summary
Chapter 18: Summary, Practice Case Studies, and Conclusions
    A summary of the book
        Part 1 – Technical requirements
        Part 2 – Analytics goals
        Part 3 – The preprocessing
        Part 4 – Case studies
    Practice case studies
        Google Covid-19 mobility dataset
        Police killings in the US
        US accidents
        San Francisco crime
        Data analytics job market
        FIFA 2018 player of the match
        Hot hands in basketball
        Wildfires in California
        Silicon Valley diversity profile
        Recognizing fake job posting
        Hunting more practice case studies
    Conclusions
    Why subscribe?
Other Books You May Enjoy
    Packt is searching for authors like you
    Share Your Thoughts

How to download the source code?

1. Go to: https://github.com/PacktPublishing

2. In the Find a repository… box, search for the book title: Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics. If the full title returns no results, search for the main title only.