Intro to Big Data
Is this "data" in the room with us right now?
#️⃣   ⌛  ~1 h 🗿  Beginner
09.12.2022

This post is a part of the Working with data educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in which they appear in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary material. Stay tuned!


The term big data often appears in both marketing pamphlets and academic papers, and one might suspect it to be a buzzword used by businesses to justify the need for large-scale computational infrastructure. Nevertheless, big data is more than just an overused phrase: it represents the deluge of massive, often heterogeneous datasets that can no longer be processed on a single, conventional machine using traditional database tools or standard computational frameworks. When we talk about "big data", we refer not only to the size of the data but also its velocity, variety, and a host of other factors that make it challenging to handle using classic methods.

Machine learning and data science professionals cannot ignore big data. Among other tasks, data scientists need to build and deploy analytical models capable of detecting new patterns, adjusting to shifting data sources, and scaling efficiently as the volume and speed of data increases. The typical pipeline includes collecting large volumes of data, cleaning them up, analyzing them with sophisticated statistical or machine learning methods, and finally interpreting or visualizing results to derive insights for decision-making. When these steps need to be performed on terabytes (or petabytes) of information in near real-time, the complexities grow enormously.

Companies, governments, and research organizations are capturing and storing data at rates unimaginable decades ago. A single industrial sensor might produce gigabytes per day, and a network of sensors could easily reach terabytes or petabytes daily. The variety can span everything from text logs and transactional data to audio, images, video, and beyond. Handling such data means adopting architectures specifically designed for horizontal scaling, distributed storage, and parallel processing.

In advanced machine learning research (e.g., Zhang et al., ICML 2023), new architectures and algorithms have emerged that leverage these vast datasets without drowning in them. By combining streaming ingestion, real-time analytics, and robust distributed computing frameworks, data science teams can execute computations on large clusters of commodity hardware in a way that remains cost-effective and fault-tolerant.

1.2 key characteristics (volume, velocity, variety, veracity, and value)

A widely accepted mnemonic for describing big data is the "five Vs": volume, velocity, variety, veracity, and value. While these terms may appear primarily in marketing contexts, they accurately highlight key issues:

  • Volume: Refers to the sheer size of data. Traditionally, any dataset that could not fit in a single machine's memory (or even on its disk) might be called "big". Today, many organizations handle data volumes measured in terabytes or petabytes. Physical storage alone is not enough; the compute layer must also scale to handle these volumes efficiently.
  • Velocity: Emphasizes the speed at which data are generated, collected, and processed. Systems such as financial transaction processing platforms, sensor networks, high-frequency trading, or social media streams can produce enormous volumes of data each second, requiring near-real-time analysis.
  • Variety: Modern data can come in a dizzying range of formats — structured, semi-structured, or unstructured. CSV files, relational databases, JSON documents, logs, images, videos, audio recordings, and plain text are often all used at once. Handling and reconciling these types in a unified analytical framework requires specialized techniques (e.g., data lakes, schema-on-read approaches).
  • Veracity: Addresses data quality and data trustworthiness. With large data volumes, errors or inconsistencies may slip in, creating biases or inaccuracies in downstream models. Cleaning, validating, and ensuring robust data pipelines is a critical step in production big data environments.
  • Value: Ultimately, data (no matter how big) remain pointless if they do not deliver insights or practical improvements in decision-making. This dimension focuses on extracting genuine business, scientific, or strategic value from massive datasets.

Big data, especially with these characteristics in mind, is both an opportunity and a challenge. The data scientist's job includes continuously experimenting with new ways to store, process, clean, model, and interpret these enormous streams of bits.

2 evolution of data management

2.1 the limitations of traditional relational databases

For decades, relational database management systems (RDBMS) dominated enterprise computing and data warehousing. SQL-based systems like Oracle, MySQL, PostgreSQL, and Microsoft SQL Server remain integral to most businesses today. However, the growth of data in both size and complexity has laid bare several limitations of classic RDBMS when it comes to big data:

  1. Rigid schema requirements: Traditional relational databases typically require a fixed schema defined up front. This means that each row in a table must conform to a well-defined structure. In high-velocity, rapidly evolving data scenarios, constantly updating schemas can be painful.

  2. Vertical scalability constraints: Classic databases scale up rather than out. If more compute capacity is needed, the typical approach is to buy a bigger machine or add more RAM or CPU power. This approach soon becomes impractical and incredibly expensive when data volumes grow quickly.

  3. Performance bottlenecks with unstructured data: Handling large volumes of unstructured data (e.g., text logs, sensor data, images, or binary formats) can be awkward in an RDBMS that expects data to be stored in neatly organized rows and columns.

  4. High cost of scale: Large enterprise-grade RDBMS solutions can be extremely expensive in terms of licensing, hardware, and maintenance, making them prohibitive for many large-scale tasks.

In the context of big data, these limitations often lead organizations to search for alternative solutions, such as distributed file systems, NoSQL databases, data lakes, or specialized big data processing frameworks.

2.2 the rise of NoSQL solutions

"NoSQL" began as a grassroots movement, loosely grouping a family of databases that deviate from traditional relational approaches in at least one critical way:

  • They generally scale horizontally (across multiple commodity servers),
  • They often sacrifice strict relational schemas or ACID guarantees in favor of high availability and partition tolerance,
  • They frequently introduce flexible data models, such as document stores, key-value pairs, graph structures, or wide-column models.

Examples include MongoDB (document-oriented), Cassandra (wide-column), Redis (key-value), and Neo4j (graph-based). While each system uses different data models, they share a design philosophy: scale out cheaply on clusters of commodity hardware, provide extremely fast writes and reads, and accommodate flexible or rapidly changing data schemas.

NoSQL solutions made modern distributed computing more accessible, enabling organizations to store massive volumes of unstructured or semi-structured data in a more fluid manner. However, these systems can have more complicated consistency models and may demand advanced architectural decisions.
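
To make the document-store flavor concrete, below is a minimal sketch using the pymongo client against a hypothetical local MongoDB instance; the connection string, database, and collection names are assumptions made for the example.

from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB instance
client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["user_events"]

# Documents in the same collection may have different shapes: there is no fixed schema
events.insert_one({"user_id": "u42", "event": "click", "page": "/home"})
events.insert_one({"user_id": "u42", "event": "purchase", "amount": 19.99, "currency": "USD"})

# Query by field, regardless of each document's exact structure
for doc in events.find({"user_id": "u42"}):
    print(doc)

The two inserted documents carry different fields, yet both live in the same collection and can be queried together, which is exactly the schema flexibility described above.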

2.3 data warehouses vs. data lakes

Another paradigm shift came with the notion of data lakes. Traditionally, organizations used data warehouses: large relational systems that store curated, structured data in a schema known in advance. Data lakes, by contrast, adopt a schema-on-read approach. They ingest raw, unstructured or semi-structured data directly into a scalable storage system (like a distributed file system), and only when the data are read (queried, analyzed) is the final schema or structure applied.

  • Data warehouses: Typically centralized, curated systems designed for highly structured data. They prioritize high-performance queries and consistent data quality.
  • Data lakes: Large pools of raw data in their native format. Analysts can define multiple "views" or transformations on the data in an ad hoc manner. Data lakes excel at agility because new data sources can be appended immediately without needing to design or reorganize existing schemas.

In many real-world big data deployments, a combination of a data warehouse and a data lake is used, sometimes referred to as a lakehouse concept, bridging structured and unstructured data storage in a single architecture (see Databricks Delta Lake or Apache Iceberg as examples).
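
To make the schema-on-read idea concrete, here is a minimal PySpark sketch. The lake path is hypothetical, and reading an s3a:// path assumes the appropriate Hadoop/S3 connector is configured on the cluster; the point is that raw JSON files sit in the lake untouched, and a schema is applied only when they are read for analysis.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# The schema is declared by the reader, not enforced at ingestion time
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("amount", DoubleType()),
])

# Raw JSON events were dumped into the lake as-is (hypothetical path)
events = spark.read.schema(schema).json("s3a://my-data-lake/raw/events/")

# Another team could read the very same files with a different schema or subset of fields
events.groupBy("event").count().show()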

2.4 transition to distributed storage and processing

As data volumes soared, it became evident that a single machine, no matter how powerful, could not cost-effectively handle all tasks. To solve this, big data architectures distribute both storage and computation across a network of interconnected computers:

  • Distributed storage: Replicated across multiple nodes to ensure availability and fault tolerance. Notable examples include HDFS (Hadoop Distributed File System).
  • Distributed computing: Parallel processing across many commodity servers or virtual machines, using frameworks like Apache Hadoop's MapReduce, Apache Spark, and others.

This shift to distributed systems allows organizations to scale linearly by adding more machines, rather than endlessly upgrading a single, monolithic server.

3 distributed systems

3.1 basics of distributed computing

In a distributed system, multiple autonomous machines cooperate to achieve a task that might be impractical on a single system. Each node handles a subset of the data or computations, and partial results can be combined or reduced to produce the final output. This approach yields:

  1. Scalability: Adding more nodes provides more storage and computing power, ideally improving performance linearly or near-linearly.
  2. Fault tolerance: If a node fails, other nodes can seamlessly take over or replicate the missing data. Mechanisms like data replication and heartbeats keep distributed clusters robust and available.
  3. Cost effectiveness: Clusters of commodity hardware can, in many situations, be cheaper and more flexible than a single high-end server.

3.2 eventual consistency and the CAP theorem

When building large distributed data systems, strict consistency can be challenging or even impossible to guarantee at all times across widely distributed nodes. The CAP theorem states that a distributed system can offer only two out of the following three guarantees simultaneously:

  • Consistency: all nodes see the same data at the same time.
  • Availability: every request receives a response, even if one or more nodes are down.
  • Partition tolerance: the system continues to operate despite network failures that split the cluster into segments.

Systems like Cassandra, for instance, prioritize availability and partition tolerance, relaxing immediate consistency in favor of so-called eventual consistency: over time, all nodes will converge to the same state, but at any instant, data might be stale on some node. In practice, this design choice ensures that the system remains responsive even during node or network outages, which is critical in high-volume applications.
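
The toy, in-process sketch below (not a real replication protocol) illustrates the intuition: two replicas accept writes independently, temporarily disagree, and converge once an anti-entropy step propagates the newer value.

import time

class Replica:
    """Toy replica: holds a single value plus the timestamp of its last write."""
    def __init__(self, name):
        self.name = name
        self.value = None
        self.ts = 0.0

    def write(self, value):
        self.value, self.ts = value, time.time()

    def sync_from(self, other):
        # Last-writer-wins merge: adopt the other replica's value if it is newer
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts

a, b = Replica("A"), Replica("B")
a.write("v1")                 # a client writes to A; B has not seen the update yet
print(a.value, b.value)       # -> v1 None (the replicas temporarily disagree)

b.sync_from(a)                # anti-entropy / gossip step
print(a.value, b.value)       # -> v1 v1 (the replicas have converged)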

3.3 common distributed system architectures

Distributed systems can follow different architectural styles:

  • Master-slave: A central master coordinates tasks among multiple workers or slaves. Examples: Early Hadoop MapReduce had a JobTracker (master) orchestrating TaskTrackers (slaves).
  • Peer-to-peer: All nodes are more or less equivalent, with no single point of control. This architecture can be resilient but is often more complex to manage.
  • Shared-nothing: Each node is independent and self-sufficient, avoiding specialized resources. Very popular in large-scale data warehousing systems.
  • Sharded and replicated: Datasets are split (sharded) across nodes for horizontal scalability, and replicated for fault tolerance.

Different big data frameworks adopt different models to achieve their performance and scalability goals, but the overarching principle is dividing the data or tasks in ways that allow parallel computation with minimal overhead or communication bottlenecks.

4 mapreduce paradigm

First popularized by the seminal Google research paper by Dean and Ghemawat (OSDI, 2004), MapReduce offered a simple but powerful way to implement distributed processing at scale. Its success triggered the entire wave of big data technologies that followed.

MapReduce is essentially a specialized form of parallel computation:

  • Map: The input data are divided into splits. The map function processes each split and emits intermediate <Key, Value> pairs.
  • Shuffle: The system redistributes these pairs so that all values corresponding to the same key end up on the same node.
  • Reduce: A reduce function aggregates the values associated with each key to produce the final result.

4.1 the map phase and the reduce phase

  • Map phase:

    \text{map}: (k_1, v_1) \rightarrow [(k_2, v_2)]

    This function transforms input data from one representation to another, typically filtering or sorting along the way. Each mapper runs independently on different chunks of data.

  • Reduce phase:

    \text{reduce}: (k_2, [v_2]) \rightarrow [\text{result}]

    The reduce function receives all values associated with a single key and merges them (e.g., sum, count, filter, etc.) to produce aggregated results.

In the formula:

  • k_1 and v_1 are the keys and values in the original dataset.
  • k_2 and v_2 represent the intermediate (mapped) keys and values.
  • The reducer merges or transforms all v_2 values that share the same k_2.

4.2 example workflows and implementations

simple word count example

A classic demonstration of MapReduce is counting occurrences of words across thousands of text files:


def map_function(file_line):
    # Emit each word with an initial count of 1
    for w in file_line.strip().split():
        yield (w, 1)

def reduce_function(key, values):
    # Sum all the counts for the given key (word)
    yield (key, sum(values))

Behind the scenes, a MapReduce job parallelizes this process across multiple nodes, merging partial counts into a global count for each word.
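
To see how a framework glues these two functions together, here is a minimal single-machine simulation of the map, shuffle, and reduce phases, reusing map_function and reduce_function from above. The run_job helper is purely illustrative and not part of any real MapReduce API.

from collections import defaultdict
from itertools import chain

def run_job(lines, map_fn, reduce_fn):
    # Map phase: apply the map function to every input record
    mapped = chain.from_iterable(map_fn(line) for line in lines)

    # Shuffle phase: group all intermediate values by key (real frameworks do this across nodes)
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: aggregate the values collected for each key
    return dict(chain.from_iterable(reduce_fn(k, vs) for k, vs in groups.items()))

print(run_job(["big data is big", "data is everywhere"], map_function, reduce_function))
# -> {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}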

complex analytics

In industry, MapReduce can be used for tasks like log analysis, web indexing, personalized recommendations, or large-scale machine learning tasks (like training linear models, random forests, or naive Bayes classifiers). However, iterative algorithms (e.g., gradient descent) are less efficient in raw MapReduce because they require multiple read/write cycles to disk. This limitation spurred the development of more advanced in-memory distributed frameworks such as Apache Spark.

5 system design principles

5.1 load balancing strategies

Load balancing ensures that no single node in a distributed cluster becomes a bottleneck. Techniques include:

  • Round-robin: Evenly distributing incoming tasks or data across nodes in a cyclical fashion.
  • Least connections: Tracking the number of ongoing connections per node and routing new requests to the node with the fewest connections.
  • Consistent hashing: Ensuring that data with a particular key always map to the same subset of nodes, facilitating easy lookups and minimal reassignments when nodes join or leave (see the sketch below).

By balancing loads, the system can ensure better throughput and more uniform usage of cluster resources.
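
Below is a minimal consistent-hashing sketch; the node names and the number of virtual nodes are arbitrary choices for the example.

import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []                          # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise from the key's position to the first node on the ring
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.get_node(key))

Because each physical node owns many small slices of the hash space, adding or removing a node remaps only a fraction of the keys instead of reshuffling everything.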

5.2 caching mechanisms and their impact on performance

Distributed caches (e.g., Redis, Memcached, or caching in Spark RDDs) keep frequently accessed data in memory across the cluster, dramatically reducing repeated computations or disk reads. Caching strategies can include:

  • Write-through: Writes go to both cache and persistent storage simultaneously.
  • Write-behind: Writes are batched and written asynchronously to persistent storage, potentially risking data loss.
  • Eviction policies (LRU, LFU, FIFO): Decide which cached data to discard when memory is full.

Caching is vital for iterative analytics and real-time data processing, where large swaths of data might otherwise be repeatedly computed or fetched.
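
Below is a minimal sketch of a write-through cache with LRU eviction; the "persistent store" here is just a Python dictionary standing in for a database or distributed store.

from collections import OrderedDict

class WriteThroughLRUCache:
    """Tiny write-through cache with LRU eviction over a dict-backed store."""
    def __init__(self, store, capacity=3):
        self.store = store                  # stands in for a database or distributed store
        self.cache = OrderedDict()
        self.capacity = capacity

    def _evict_if_needed(self):
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # drop the least recently used entry

    def put(self, key, value):
        self.store[key] = value             # write-through: persist immediately
        self.cache[key] = value
        self.cache.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key in self.cache:               # cache hit: avoid touching the slow store
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.store[key]             # cache miss: read through from the store
        self.cache[key] = value
        self._evict_if_needed()
        return value

db = {}
cache = WriteThroughLRUCache(db, capacity=2)
cache.put("a", 1); cache.put("b", 2); cache.put("c", 3)
print(list(cache.cache))                    # -> ['b', 'c'] ("a" was evicted from the cache)
print(db["a"])                              # -> 1 (but it is still safe in the store)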

5.3 message queues: kafka and rabbitmq

In large-scale data pipelines, decoupling data ingestion from data processing is essential. Message queues or streaming platforms can capture streams of data from producers (e.g., application servers, sensors, logging agents) and make them available to consumers in real-time or near-real-time:

  • Apache Kafka: A high-throughput, low-latency platform widely used for building real-time streaming data pipelines. Data are published to topics, partitioned, and replicated across the cluster for fault tolerance.
  • RabbitMQ: A more traditional message broker implementing AMQP (Advanced Message Queuing Protocol). It excels at reliable inter-process communication, queuing, and routing.

These systems help handle high-velocity data sources and facilitate streaming analytics or asynchronous processing.
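
Below is a minimal sketch of publishing and consuming events with the kafka-python client; it assumes a Kafka broker is reachable at localhost:9092, and the topic name and message contents are made up for the example.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish sensor readings to a topic (broker address and topic are assumptions)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor_id": "s-17", "temperature": 21.4})
producer.flush()

# Consumer side: read the stream and hand records to downstream processing
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)    # e.g., {'sensor_id': 's-17', 'temperature': 21.4}
    break                  # stop after one message, just for the example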

5.4 scalability and fault tolerance considerations

Scalability in big data systems generally implies either scaling up individual machines with more CPU, memory, or storage (vertical scaling) or adding nodes to increase parallel processing power (horizontal scaling). Additionally, robust fault tolerance is vital:

  • Replication: Storing multiple copies of data on different nodes.
  • Node heartbeat and leader election: Quickly detecting failed nodes and electing new leaders or coordinators if the master node fails.
  • Data center or multi-cloud redundancy: Deploying across multiple geographic regions or cloud providers to ensure continuity even if a local outage occurs.

From a design standpoint, ephemeral node failures should not interrupt the entire system. Instead, tasks should be re-assigned automatically, ensuring minimal downtime.
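
The toy sketch below shows heartbeat-based failure detection; the timeout and node names are arbitrary. Workers report heartbeats to a monitor, and any node whose last heartbeat is too old is treated as failed so that its tasks can be reassigned.

import time

class HeartbeatMonitor:
    """Marks a node as failed if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node):
        self.last_seen[node] = time.time()

    def failed_nodes(self):
        now = time.time()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=1.0)
monitor.heartbeat("worker-1")
monitor.heartbeat("worker-2")
time.sleep(1.5)
monitor.heartbeat("worker-2")      # only worker-2 keeps reporting
print(monitor.failed_nodes())      # -> ['worker-1'] (its tasks would be reassigned)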

6 state-of-the-art technologies

Big data is a fast-moving field, with new frameworks and platforms emerging regularly. Certain technologies, however, have established themselves as mainstays:

  1. Hadoop ecosystem:

    • HDFS (Hadoop Distributed File System): A foundational piece of the Hadoop ecosystem, storing data redundantly across many nodes.
    • YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules jobs in a Hadoop ecosystem.
    • Hive: A data warehouse software layer that provides a SQL-like interface (HiveQL) to query data stored in Hadoop.
    • Pig: A data flow language and execution framework for parallel computation.
  2. Apache Spark:
    Spark introduced the concept of in-memory computing through Resilient Distributed Datasets (RDDs). It greatly speeds up iterative machine learning and graph computations. Additionally, Spark offers modules for SQL (Spark SQL), streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX).

  3. Real-time processing frameworks:

    • Apache Storm: Open-sourced by Twitter, used for real-time event processing, highly suitable for streaming.
    • Apache Flink: A stream processing framework that can handle batch jobs as well, known for its advanced windowing features and support for event time.
    • Apache Samza: Developed by LinkedIn, uses Kafka for messaging and YARN for resource management.
  4. Cloud-based big data platforms:

    • AWS services like EMR (Elastic MapReduce), Kinesis, Redshift, Glue, Athena, and S3.
    • Azure services like HDInsight, Databricks, Data Lake Storage, and Event Hubs.
    • Google Cloud services like Dataproc, Dataflow, BigQuery, and Pub/Sub.

    These platforms allow organizations to rent compute and storage on demand, paying only for the resources they use, and scaling quickly.

  5. Emerging trends in hybrid and multi-cloud deployments:
    Many enterprises adopt a hybrid cloud (on-premises plus one or more public clouds) or multi-cloud strategies to avoid vendor lock-in and leverage specialized capabilities across different providers. Technologies such as Kubernetes, containerization, and unified data orchestration frameworks are making it easier to manage distributed data pipelines across heterogeneous environments.

Big data technologies continue to evolve toward easier real-time stream processing, improved cost efficiency, and simplified data management. The synergy between large-scale data processing and advanced machine learning (e.g., deep learning at scale, GPU acceleration, distributed training) is likewise propelling research frontiers in systems design (e.g., new scheduling algorithms, advanced cluster orchestration, adaptive scaling methods).

7 use cases

Real-world scenarios for big data processing are vast. Below is a sampling of common, high-impact domains:

  1. Business intelligence and analytics: Large e-commerce or retail companies analyze billions of customer interactions to discover purchasing patterns, optimize marketing campaigns, and predict demand surges for specific products.

  2. Social media platforms: Platforms like Facebook, Twitter, TikTok, or LinkedIn collect and process petabytes of data (posts, likes, follows, friend connections) to personalize feeds, recommend content, detect fraudulent activity, and optimize user engagement.

  3. Healthcare and biomedical: Genomic sequencing and analysis generate enormous data volumes, as do electronic health records, medical imaging, and wearable device data. Big data platforms allow for faster research into diseases, personalized medicine, and real-time patient monitoring.

  4. Sensor networks and IoT: Industrial Internet of Things (IIoT) sensors monitor equipment performance and enable predictive maintenance, anticipating device failures before they happen. Utilities ingest live data from smart meters, while cities run real-time traffic monitoring systems to improve transportation.

  5. Financial services and insurance: Massive amounts of transaction data, market data, or trade data are processed to detect fraud, perform algorithmic trading, generate dynamic pricing, or forecast risk scenarios. Speed and scalability are paramount in these often time-sensitive environments.

  6. Natural disaster and climate prediction: Meteorological institutions handle data from satellite images, sensor arrays, and historical climate models to better predict hurricanes, floods, and other natural disasters, hopefully mitigating their impact.

  7. Government and public sector: Public agencies use big data to track and respond to security threats, manage resources, and plan large-scale infrastructure. Efficient data sharing among agencies can improve service delivery and societal outcomes.

additional perspectives and expansions

Because "Introduction to Big Data" as a topic is broad, it is worth expanding on a few more advanced considerations, bridging the gap between theoretical frameworks and production realities.

advanced distributed file system concepts

While HDFS is the canonical example, modern big data environments sometimes use object storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob Storage) integrated with distributed computation engines. Consistency guarantees vary by provider and have improved over time (Amazon S3, for example, now offers strong read-after-write consistency). For analytics, many frameworks wrap these object stores with Hadoop-compatible file system interfaces.
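
For example, a pipeline might list raw objects in an S3 bucket with boto3 before handing their paths to a processing engine; the bucket and prefix names below are assumptions, and credentials are expected to come from the usual AWS configuration.

import boto3

s3 = boto3.client("s3")

# List raw log objects under a prefix in a hypothetical analytics bucket
response = s3.list_objects_v2(Bucket="my-analytics-bucket", Prefix="raw/logs/2024/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# A Spark job could then read the same data through a Hadoop-compatible interface,
# e.g. spark.read.json("s3a://my-analytics-bucket/raw/logs/2024/")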

advanced streaming and real-time analytics

Many organizations focus on real-time processing to gain immediate insights from data as they arrive. Common approaches include:

  • Micro-batching (Spark Streaming): Data are aggregated into small batches (e.g., 1-second intervals), then processed like mini batch jobs (see the sketch after this list).
  • Continuous dataflow (Apache Flink): Each data event is processed as soon as it arrives, with advanced time semantics (event time, processing time, ingestion time).
  • Complex event processing (CEP): Patterns and anomalies are detected across data streams with sophisticated event correlation.
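
Below is a minimal micro-batching sketch using Spark Structured Streaming (Spark's current streaming API); it assumes pyspark is available and that something is emitting lines of text on localhost:9999, both of which are assumptions made for the example.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()

# Read a text stream from a socket source (host and port are assumptions for the example)
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split incoming lines into words and maintain a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Process the stream in small micro-batches (roughly 1-second intervals) and print to the console
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .trigger(processingTime="1 second")
               .start())
query.awaitTermination()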

machine learning on big data

Distributing ML tasks can be crucial when data volumes are enormous. Key strategies:

  • Parallel model training: Data are partitioned, and multiple workers train partial models that are then aggregated (e.g., in a parameter server architecture).
  • In-memory computations: Tools like Apache Spark MLlib handle iterative algorithms more efficiently by caching partial results in memory (see the sketch after this list).
  • Approximation and sampling: In extremely large datasets, it can be computationally cheaper to sample a portion of data that is statistically representative, training a model on that subset while retaining performance guarantees (Smith et al., JMLR 2021).
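
Below is a minimal sketch of in-memory iterative training with Spark MLlib's DataFrame-based API; the toy data and parameter values are made up for the example.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# Toy training data; in practice this DataFrame would be read from distributed storage
train = spark.createDataFrame(
    [
        (0.0, Vectors.dense([0.0, 1.1, 0.1])),
        (1.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5])),
    ],
    ["label", "features"],
)
train.cache()  # keep partitions in memory: gradient-based optimizers scan the data repeatedly

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)          # training runs in parallel across the executors
print(model.coefficients)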

data lake challenges: governance, security, and data quality

Even though data lakes are flexible, they introduce complexities:

  • Governance: Without a well-defined process to label, classify, or track data sources, a data lake can become a "data swamp," where it is unclear what the data represent or how they should be used.
  • Security: Storing raw data can lead to privacy concerns. Access control, encryption, and compliance with regulations such as GDPR become non-trivial in large distributed settings.
  • Quality: Ingestion of raw data can lead to duplication, missing fields, or inconsistencies. Downstream transformations must incorporate data cleaning and validation steps.

big data and container orchestration

Modern big data clusters often use containerization (Docker) and orchestration platforms (Kubernetes) for consistent deployment across various hardware or cloud platforms. They unify resource management, job scheduling, and scaling, bridging the gap between ephemeral big data workloads and stable infrastructure.

Kubernetes-based Spark or Kubernetes-based Flink are popular examples, offering streamlined cluster creation, resource isolation, and improved elasticity.

code snippet: launching a spark job on kubernetes

Below is a simplified Python snippet for running a PySpark job on a Kubernetes cluster, demonstrating the synergy between big data processing and modern container orchestration:


import os
import subprocess

# Set environment variables required by Spark on Kubernetes
os.environ["KUBECONFIG"] = "/home/user/.kube/config"
os.environ["SPARK_HOME"] = "/path/to/spark"

# Use the spark-submit binary from SPARK_HOME so the script does not depend on PATH
spark_submit = os.path.join(os.environ["SPARK_HOME"], "bin", "spark-submit")

# Example Spark submit command
spark_submit_cmd = [
    spark_submit,
    "--master", "k8s://https://<your-k8s-api-server>:443",
    "--deploy-mode", "cluster",
    "--name", "big-data-example",
    "--conf", "spark.executor.instances=4",
    "--conf", "spark.kubernetes.container.image=my-spark-image:latest",
    "/path/to/your_spark_script.py"
]

# Fail loudly if the submission returns a non-zero exit code
subprocess.run(spark_submit_cmd, check=True)

In real deployments, additional settings for memory, CPU requests, or ephemeral volumes would be configured. This approach encapsulates big data logic within containers, making clusters more portable and flexible.

advanced research & references

Academic and industrial research in big data is broad. Some advanced directions include:

  • Edge computing and fog computing (Faltings et al., ICML 2022): Handling partial data processing near the data source (e.g., on IoT devices) before shipping them to the central cluster.
  • Adaptive resource scheduling (e.g., Sparrow, Hawk, and Eagle scheduling in Spark clusters) that dynamically adjusts job priority based on load or deadlines.
  • MLflow integration for managing machine learning experiments across big data pipelines.
  • Federated learning at scale, combining data from multiple organizations or devices without centralizing raw data, thus respecting privacy constraints and compliance regulations.

potential illustrations

[Figure placeholder] The Five Vs of Big Data: Volume, Velocity, Variety, Veracity, and Value

[Figure placeholder] A simplified architecture for a distributed big data pipeline, with ingestion, processing, and storage layers

These figures can help visualize the conceptual model of big data, from data ingestion to distributed processing and final analytics or machine learning tasks.

final thoughts

In summary, "big data" is not just about large files or advanced marketing slogans: it is a paradigm shift in how data are collected, stored, processed, and analyzed. Methods and tools that seemed adequate a decade ago can rapidly become unworkable when confronted with the scale and speed of modern data streams. From the MapReduce pattern and HDFS to more recent in-memory and stream-processing solutions, an intricate ecosystem of systems has emerged to help organizations harness the value in their data.

Key takeaways include:

  • Architectures must be designed for horizontal scalability and fault tolerance, distributing both data and compute tasks.
  • Data management requires a shift from rigid schemas to more flexible, schema-on-read approaches, especially in data lakes, with advanced governance to prevent data swamps.
  • System design principles like load balancing, caching, and message queuing make real-time or near-real-time analytics feasible and resilient.
  • Technological stacks such as Hadoop (HDFS, YARN), Spark, and various real-time frameworks have matured, with robust offerings now available via major cloud providers.
  • Use cases span virtually every industry, from e-commerce personalization to scientific research and IoT sensor analytics, exemplifying how big data solutions enable new possibilities.

For data scientists and engineers, becoming proficient with distributed processing frameworks, big data design principles, and the constantly evolving ecosystem of tools is essential. Mastery of these subjects unlocks the power to transform raw, chaotic information flows into actionable intelligence on a truly massive scale.
