Book Description
Web scraping, or crawling, is the practice of fetching data from a third-party website by downloading and parsing its HTML to extract the data you want. It is often tricky: malformed HTML, heavy JavaScript use, and anti-bot techniques all get in the way. Many companies use it to track competitor prices, aggregate news, or collect email addresses in bulk.
This book teaches you how to extract data from any website, deal with AJAX- and JavaScript-heavy sites, solve CAPTCHAs, deploy your scrapers in the cloud, and apply many other advanced techniques.
This open book is licensed under a Creative Commons Attribution license (CC BY). You can download the Java Web Scraping Handbook ebook for free in PDF format (4.7 MB).
Table of Contents
Chapter 1: Introduction to web scraping
Chapter 2: Web fundamentals
Chapter 3: Extracting the data you want
Chapter 4: Handling forms
Chapter 5: Dealing with JavaScript
Chapter 6: Captcha solving, PDF parsing, and OCR
Chapter 7: Stay under cover
Chapter 8: Cloud scraping