**by Rafael A. Irizarry**

### Book Description

The demand for skilled data science practitioners in industry, academia, and government is rapidly growing. This book introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling with dplyr, data visualization with ggplot2, algorithm building with caret, file organization with the UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation with knitr and R markdown. The book is divided into six parts: R; Data Visualization; Data Wrangling; Probability, Inference and Regression with R; Machine Learning; and Productivity Tools. Each part has several chapters, each meant to be presented as one lecture. The book includes dozens of exercises distributed across most chapters.

This open book is licensed under a Creative Commons License (CC BY-NC-SA). You can download the Introduction to Data Science ebook for free in PDF format (55.8 MB).

### Table of Contents

Part I: R

- Chapter 1: Getting Started with R and RStudio
- Chapter 2: R Basics
- Chapter 3: Programming basics
- Chapter 4: The tidyverse
- Chapter 5: Importing data

Part II: Data Visualization

- Chapter 6: Introduction to data visualization
- Chapter 7: ggplot2
- Chapter 8: Visualizing data distributions
- Chapter 9: Data visualization in practice
- Chapter 10: Data visualization principles
- Chapter 11: Robust summaries

Part III: Statistics with R

- Chapter 12: Introduction to Statistics with R
- Chapter 13: Probability
- Chapter 14: Random variables
- Chapter 15: Statistical Inference
- Chapter 16: Statistical models
- Chapter 17: Regression
- Chapter 18: Linear Models
- Chapter 19: Association is not causation

Part IV: Data Wrangling

- Chapter 20: Introduction to Data Wrangling
- Chapter 21: Reshaping data
- Chapter 22: Joining tables
- Chapter 23: Web Scraping
- Chapter 24: String Processing
- Chapter 25: Parsing Dates and Times
- Chapter 26: Text mining

Part V: Machine Learning

- Chapter 27: Introduction to Machine Learning
- Chapter 28: Smoothing
- Chapter 29: Cross validation
- Chapter 30: The caret package
- Chapter 31: Examples of algorithms
- Chapter 32: Machine learning in practice
- Chapter 33: Large datasets
- Chapter 34: Clustering

Part VI: Productivity tools

- Chapter 35: Introduction to productivity tools
- Chapter 36: Organizing with Unix
- Chapter 37: Git and GitHub
- Chapter 38: Reproducible projects with RStudio and R markdown

### Book Details

- Title: Introduction to Data Science
- Subject: Computer Science
- Publisher: Leanpub
- Published: 2019
- Pages: 722
- Edition: 1
- Language: English
- PDF Size: 55.8 MB
- License: CC BY-NC-SA

### Related Books

This engaging and clearly written textbook/reference provides a must-have introduction to the rapidly emerging interdisciplinary field of data science. It focuses on the principles fundamental to becoming a good data scientist and the key skills needed to build systems for collecting, analyzing, and interpreting data. The Data Science Design Manual...

OpenIntro Statistics offers a traditional introduction to statistics at the college level. This textbook is widely used at the college level and offers an exceptional and accessible introduction for students from community colleges to the Ivy League. The textbook has been thoroughly vetted with an estimated 20,000 students using it annually. Ext...

We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data? This report examines the many sides of data science - the technologi...

For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all - IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other relate...

R is one of the most popular, powerful data analytics languages and environments in use by data scientists. Actionable business data is often stored in Relational Database Management Systems (RDBMS), and one of the most widely used RDBMS is Microsoft SQL Server. Much more than a database server, it's a rich ecostructure with advanced analytic ...

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packe...