Data Structures and Algorithms
for Big Data Applications

A technical book about popular space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications.

ISBN: 978-3748190486 Paperback: 220 pages

Table of Contents Where to buy Follow @gakhov

The best Probabilistic Algorithms ebooks of all time

About the book

Probabilistic data structures is a common name for data structures based mostly on different hashing techniques. Unlike regular (or deterministic) data structures, they always provide approximated answers but with reliable ways to estimate possible errors. Fortunately, the potential losses or errors are fully compensated for by extremely low memory requirements, constant query time, and scaling, three factors that become important in Big Data applications.

While it is impossible to cover all the existing amazing solutions, this book is to highlight their common ideas and important areas of application, including membership querying, counting, stream mining, and similarity estimation.

The purpose of this book is to introduce technology practitioners, including software architects and developers, as well as technology decision makers to probabilistic data structures and algorithms. Reading this book, you will get a theoretical and practical understanding of probabilistic data structures and learn about their common uses.

  •  Learn how to solve practical issues of massive data handling
  •  Master the theoretical aspects of probabilistic data structures
  •  Identify the right data structures for your particular problems

This is not a book for scientists, but to gain the most out of it you will need to have basic mathematical knowledge and an understanding of the general theory of data structures and algorithms.

Table Of Contents


We give a brief overview of popular hash functions and hash tables that are widely used in probabilistic data structures.

  • Cryptographic hash functions
  • Non-Cryptographic hash functions
  • Hash tables


We study the most well-known known probabilistic data structures that are used to answer approximate membership queries.

  • Bloom filter
  • Counting Bloom filter
  • Quotient filter
  • Cuckoo filter


We learn data structures that allow to estimate the number of unique elements in a dataset.

  • Linear Counting
  • Probabilistic Counting
  • LogLog and HyperLogLog


We discuss solutions for common problems in streaming applications, such as finding frequency of some element, filtering most frequent elements in the stream, detecting trending elements, etc.

  • Majority algorithm
  • Frequent algorithm
  • Count Sketch
  • Count–Min Sketch


We study data structures to estimate quantiles and percentiles in a data stream using only one pass through the data.

  • Random sampling
  • q-digest
  • t-digest


We learn how to effciently solve the nearest neighbor problem for large datasets using sub-linear in time solutions.

  • Locality-Sensitive hashing
  • MinHash
  • SimHash

Andrii Gakhov, Ph.D.

Andrii Gakhov is a mathematician and software engineer holding a mathematical modeling and numerical methods. He has been a teacher in the School of Computer Science at V. Karazin Kharkiv National University in Ukraine for a number of years and currently works as a software practitioner for ferret go GmbH, the leading community moderation, automation, and analytics company in Germany. His fields of interests include machine learning, stream mining, and data analysis.

Where to buy

Paperback edition
Published by BoD

Country Store Link
 United States Visit
 Canada Visit
 Mexico Visit
 Brazil Visit
 United Kingdom Visit
 Germany BoD (Verlag) Visit
Thalia Visit Visit
 Austria BoD (Verlag) Visit Visit
 France Visit
 Spain Visit
 Italy Visit
 Netherlands BoD (Verlag) Visit
 India Visit
 Japan Visit
 Australia Visit
BoD Publisher delivers  to all EU countires for FREE.

E-book edition
Exclusively at Amazon

Country Store Link
 United States Visit
 Canada Visit
 Mexico Visit
 Brazil Visit
 United Kingdom Visit
 Germany Visit
 Austria Visit
 France Visit
 Spain Visit
 Italy Visit
 Netherlands Visit
 India Visit
 Japan Visit
 Australia Visit

Get a free preview
This book on Google Books

Download Download


If you have found any error that is not in the list below, please contact us by email As soon as the error is confirmed and fixed, its description will appear in this section (details are omitted in mobile view due to space restrictions).

Edition Location Original text Corrected text
1st, 2019 p. vii footnote "What Is Big Data?" [removed]
1st, 2019 p. 31 "... by bitwise XOR, however ..." "... by bitwise AND, however ..."
1st, 2019 p. 38, Algorithm 2.8 $$\mathbf{QF}[i] \gets prev[i]$$ $$\mathbf{QF}[i] \gets prev$$
1st, 2019 p. 41, Example 2.9 ... \(16\)-bit fingerprints ... ... \(32\)-bit fingerprints ...
1st, 2019 p. 41, Example 2.9 $$h(x) = \text{MurmurHash3}(x) \bmod 16.$$ $$h(x) = \text{MurmurHash3}(x).$$
1st, 2019 p. 41, Example 2.9 $$p=13$$ $$p=29$$
1st, 2019 pp. 41, 43, Example 2.9 $$2^{13}$$ $$2^{29}$$
1st, 2019 pp. 42-45 ... remainder \(f_q\) ... ... remainder \(f_r\) ...
1st, 2019 p. 62, Example 3.1 "... of unique elements is 23 GB." "... of unique elements is 2.3 GB."
1st, 2019 p. 65, Algorithm 3.2 $$\mathbf{Add}(e, LINEARCOUNTER)$$ $$\mathbf{Add}(x, LINEARCOUNTER)$$
1st, 2019 p. 66, Example 3.3 $$V = \frac{9}{16} = 0.5625$$ $$V = \frac{7}{16} = 0.4375$$
1st, 2019 p. 66, Example 3.3 "... which is pretty close to ..." "... which is close but overestimates ...
1st, 2019 p. 66, Example 3.3 $$n \approx - 16 \cdot \ln{0.5625} \approx 9.206$$ $$n \approx - 16 \cdot \ln{0.4375} \approx 13.22$$
1st, 2019 p. 67 ... a fixed constant \(O(N)\), where \(N\) is the total number of elements, including duplicates. Thus, the algorithm has \(O(N)\) time complexity. ... a fixed constant, and thus the algorithm has \(O(N)\) time complexity, where \(N\) is the total number of elements, including duplicates.
1st, 2019 p. 81, Formula 3.12 $$\hat n \approx \alpha_{m}\cdot m^{2} \cdot \left(\sum_{j=0}^{m-1}2^{-COUNTER[j]}\right)$$ $$\hat n \approx \alpha_{m}\cdot m^{2} \cdot \left(\sum_{j=0}^{m-1}2^{-COUNTER[j]}\right)^{-1}$$
1st, 2019 p. 96, Example 4.2 "... 500 billion ..." "... 500 million ..."
1st, 2019 p. 135, Example 5.3 "... remaining empty buffer \(B_{2}\)." "... remaining empty buffer \(B_{4}\)."
1st, 2019 p. 173, Example 6.8 $$c(d_1, d_4) = \frac{255 \cdot 0 + 0 \cdot 191 + 0 \cdot 255}{255 \cdot 255} = 0$$ $$c(d_1, d_4) = \frac{255 \cdot 0 + 0 \cdot 191 + 0 \cdot 255}{\sqrt{255^{2} + 0^2 + 0^2} \cdot \sqrt{0^{2} + 191^2 + 255^2}} = 0$$