Data Science at the Command Line

Obtain, Scrub, Explore, and Model Data with Unix Power Tools

by Jeroen Janssens

Book Description

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools - useful whether you work with Windows, macOS, or Linux.

You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.

- Obtain data from websites, APIs, databases, and spreadsheets;
- Perform scrub operations on text, CSV, HTML, XML, and JSON files;
- Explore data, compute descriptive statistics, and create visualizations;
- Manage your data science workflow;
- Create your own tools from one-liners and existing Python or R code;
- Parallelize and distribute data-intensive pipelines;
- Model data with dimensionality reduction, regression, and classification algorithms;
- Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark.

This open book is licensed under a Creative Commons License (CC BY-NC-ND). A free PDF download is not available, but you can read Data Science at the Command Line online for free.

- Chapter 1: Introduction
- Chapter 2: Getting Started
- Chapter 3: Obtaining Data
- Chapter 4: Creating Command-line Tools
- Chapter 5: Scrubbing Data
- Chapter 6: Project Management with Make
- Chapter 7: Exploring Data
- Chapter 8: Parallel Pipelines
- Chapter 9: Modeling Data
- Chapter 10: Polyglot Data Science
- Chapter 11: Conclusion

Book Details

- Title: Data Science at the Command Line
- Subject: Computer Science
- Publisher: O'Reilly Media
- Published: 2021
- Pages: 282
- Edition:
- Language: English
- ISBN13 Digital: 9781492087915
- ISBN10 Digital: 1492087912
- License: CC BY-NC-ND

Related Books

The Data Journalism Handbook

When you combine the sheer scale and range of digital information now available with a journalist's "nose for news" and her ability to tell a compelling story, a new world of possibility opens up. With The Data Journalism Handbook, you'll explore the potential, limits, and applied uses of this new and fascinating field. This ...

Data Science with Microsoft SQL Server 2016

R is one of the most popular, powerful data analytics languages and environments in use by data scientists. Actionable business data is often stored in Relational Database Management Systems (RDBMS), and one of the most widely used RDBMS is Microsoft SQL Server. Much more than a database server, it's a rich ecostructure with advanced analytic ...

What Is Data Science?

We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data? This report examines the many sides of data science - the technologi...

The Data Science Design Manual

This engaging and clearly written textbook/reference provides a must-have introduction to the rapidly emerging interdisciplinary field of data science. It focuses on the principles fundamental to becoming a good data scientist and the key skills needed to build systems for collecting, analyzing, and interpreting data. The Data Science Design Manual...

The Linux Command Line

The Linux Command Line takes you from your very first terminal keystrokes to writing full programs in Bash, the most popular Linux shell (or command line). Along the way you'll learn the timeless skills handed down by generations of experienced, mouse-shunning gurus: file navigation, environment configuration, command chaining, pattern matchin...

Data Journeys in the Sciences

This groundbreaking, open volume analyses and compares data practices across several fields through the analysis of specific cases of data journeys. It brings together leading scholars in the philosophy, history and social studies of science to achieve two goals: tracking the travel of data across different spaces, times and domains of research pra...
