Free cookie consent management tool by TermsFeed Policy Generator

Probabilistic
Data Structures and Algorithms
for Big Data Applications

A technical book about popular space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications.

ISBN: 978-3748190486 Paperback: 220 pages

Table of Contents Where to buy Follow @gakhov

The best Probabilistic Algorithms ebooks of all time

About the book

Probabilistic data structures is a common name for data structures based mostly on different hashing techniques. Unlike regular (or deterministic) data structures, they always provide approximated answers but with reliable ways to estimate possible errors. Fortunately, the potential losses or errors are fully compensated for by extremely low memory requirements, constant query time, and scaling, three factors that become important in Big Data applications.

While it is impossible to cover all the existing amazing solutions, this book is to highlight their common ideas and important areas of application, including membership querying, counting, stream mining, and similarity estimation.

The purpose of this book is to introduce technology practitioners, including software architects and developers, as well as technology decision makers to probabilistic data structures and algorithms. Reading this book, you will get a theoretical and practical understanding of probabilistic data structures and learn about their common uses.

  •  Learn how to solve practical issues of massive data handling
  •  Master the theoretical aspects of probabilistic data structures
  •  Identify the right data structures for your particular problems

This is not a book for scientists, but to gain the most out of it you will need to have basic mathematical knowledge and an understanding of the general theory of data structures and algorithms.

Table Of Contents

Hashing

We give a brief overview of popular hash functions and hash tables that are widely used in probabilistic data structures.

  • Cryptographic hash functions
  • Non-Cryptographic hash functions
  • Hash tables

Membership

We study the most well-known known probabilistic data structures that are used to answer approximate membership queries.

  • Bloom filter
  • Counting Bloom filter
  • Quotient filter
  • Cuckoo filter

Cardinality

We learn data structures that allow to estimate the number of unique elements in a dataset.

  • Linear Counting
  • Probabilistic Counting
  • LogLog and HyperLogLog

Frequency

We discuss solutions for common problems in streaming applications, such as finding frequency of some element, filtering most frequent elements in the stream, detecting trending elements, etc.

  • Majority algorithm
  • Frequent algorithm
  • Count Sketch
  • Count–Min Sketch

Rank

We study data structures to estimate quantiles and percentiles in a data stream using only one pass through the data.

  • Random sampling
  • q-digest
  • t-digest

Similarity

We learn how to effciently solve the nearest neighbor problem for large datasets using sub-linear in time solutions.

  • Locality-Sensitive hashing
  • MinHash
  • SimHash

Andrii Gakhov, Ph.D.

Andrii Gakhov is a mathematician and software engineer holding a Ph.D.in mathematical modeling and numerical methods. He has been a teacher in the School of Computer Science at V. Karazin Kharkiv National University in Ukraine for a number of years and currently works as a software practitioner for ferret go GmbH, the leading community moderation, automation, and analytics company in Germany. His fields of interests include machine learning, stream mining, and data analysis.

Where to buy

Paperback edition
Published by BoD

Country Store Link
America
 United States Amazon.com Visit
 Canada Amazon.ca Visit
 Mexico Amazon.com.mx Visit
 Brazil Amazon.com.br Visit
Europe
 United Kingdom Amazon.co.uk Visit
 Germany BoD (Verlag) Visit
Thalia Visit
Amazon.de Visit
 Austria BoD (Verlag) Visit
Amazon.de Visit
 France Amazon.fr Visit
 Spain Amazon.es Visit
 Italy Amazon.it Visit
 Netherlands BoD (Verlag) Visit
Asia
 India Amazon.in Visit
 Japan Amazon.co.jp Visit
Oceania
 Australia Amazon.com.au Visit
BoD Publisher delivers  to all EU countires for FREE.

E-book edition
Exclusively at Amazon

Country Store Link
America
 United States Amazon.com Visit
 Canada Amazon.ca Visit
 Mexico Amazon.com.mx Visit
 Brazil Amazon.com.br Visit
Europe
 United Kingdom Amazon.co.uk Visit
 Germany Amazon.de Visit
 Austria Amazon.de Visit
 France Amazon.fr Visit
 Spain Amazon.es Visit
 Italy Amazon.it Visit
 Netherlands Amazon.nl Visit
Asia
 India Amazon.in Visit
 Japan Amazon.co.jp Visit
Oceania
 Australia Amazon.com.au Visit

Get a free preview
This book on Google Books


Download Download

Errata

If you have found any error that is not in the list below, please contact us by email pdsa@gakhov.com. As soon as the error is confirmed and fixed, its description will appear in this section (details are omitted in mobile view due to space restrictions).

Edition Location Original text Corrected text
1st, 2019 p. vii footnote "What Is Big Data? https://www.ibm.com/software/data/bigdata/what-is-big-data.html" [removed]
1st, 2019 p. 31 "... by bitwise XOR, however ..." "... by bitwise AND, however ..."
1st, 2019 p. 38, Algorithm 2.8 $$\mathbf{QF}[i] \gets prev[i]$$ $$\mathbf{QF}[i] \gets prev$$
1st, 2019 p. 41, Example 2.9 ... \(16\)-bit fingerprints ... ... \(32\)-bit fingerprints ...
1st, 2019 p. 41, Example 2.9 $$h(x) = \text{MurmurHash3}(x) \bmod 16.$$ $$h(x) = \text{MurmurHash3}(x).$$
1st, 2019 p. 41, Example 2.9 $$p=13$$ $$p=29$$
1st, 2019 pp. 41, 43, Example 2.9 $$2^{13}$$ $$2^{29}$$
1st, 2019 pp. 42-45 ... remainder \(f_q\) ... ... remainder \(f_r\) ...
1st, 2019 p. 62, Example 3.1 "... of unique elements is 23 GB." "... of unique elements is 2.3 GB."
1st, 2019 p. 65, Algorithm 3.2 $$\mathbf{Add}(e, LINEARCOUNTER)$$ $$\mathbf{Add}(x, LINEARCOUNTER)$$
1st, 2019 p. 66, Example 3.3 $$V = \frac{9}{16} = 0.5625$$ $$V = \frac{7}{16} = 0.4375$$
1st, 2019 p. 66, Example 3.3 "... which is pretty close to ..." "... which is close but overestimates ...
1st, 2019 p. 66, Example 3.3 $$n \approx - 16 \cdot \ln{0.5625} \approx 9.206$$ $$n \approx - 16 \cdot \ln{0.4375} \approx 13.22$$
1st, 2019 p. 67 ... a fixed constant \(O(N)\), where \(N\) is the total number of elements, including duplicates. Thus, the algorithm has \(O(N)\) time complexity. ... a fixed constant, and thus the algorithm has \(O(N)\) time complexity, where \(N\) is the total number of elements, including duplicates.
1st, 2019 p. 81, Formula 3.12 $$\hat n \approx \alpha_{m}\cdot m^{2} \cdot \left(\sum_{j=0}^{m-1}2^{-COUNTER[j]}\right)$$ $$\hat n \approx \alpha_{m}\cdot m^{2} \cdot \left(\sum_{j=0}^{m-1}2^{-COUNTER[j]}\right)^{-1}$$
1st, 2019 p. 96, Example 4.2 "... 500 billion ..." "... 500 million ..."
1st, 2019 p. 109, Example 4.5 $$s_1(4) = 1,\text{ }s_2(4) = 1,\text{ }s_2(4) = -1.$$ $$s_1(4) = 1,\text{ }s_2(4) = 1,\text{ }s_3(4) = -1.$$
1st, 2019 p. 135, Example 5.3 "... remaining empty buffer \(B_{2}\)." "... remaining empty buffer \(B_{4}\)."
1st, 2019 p. 159, Example 5.10 $$q(c_3) + \frac{1}{20} = \frac{6 + 11 + 1}{20} + \frac{1}{20} = 0.95,$$ $$q(c_3) - \frac{1}{20} = \frac{6 + 11 + 1}{20} - \frac{1}{20} = 0.85,$$
1st, 2019 p. 173, Example 6.8 $$c(d_1, d_4) = \frac{255 \cdot 0 + 0 \cdot 191 + 0 \cdot 255}{255 \cdot 255} = 0$$ $$c(d_1, d_4) = \frac{255 \cdot 0 + 0 \cdot 191 + 0 \cdot 255}{\sqrt{255^{2} + 0^2 + 0^2} \cdot \sqrt{0^{2} + 191^2 + 255^2}} = 0$$