Big Data

Big Data is the term for datasets so large, fast-moving or varied that traditional databases and tools cannot capture, store or process them within a reasonable time. It is characterised by the "3 Vs" — volume, velocity and variety — and requires distributed architectures to extract usable business value.

The 3 Vs: what makes data "big"

What defines Big Data is not a fixed size but the point at which data outgrows the capacity of a single conventional server and relational database. The concept is commonly framed around three core dimensions, introduced by analyst Doug Laney in 2001.

  • Volume — the sheer quantity of data, often measured in terabytes or petabytes, generated by transactions, sensors, logs, applications and user interactions.
  • Velocity — the speed at which data arrives and must be processed, ranging from periodic batches to continuous real-time streams.
  • Variety — the mix of formats: structured data (database tables), semi-structured data (JSON, XML, logs) and unstructured data (text, images, video, audio).
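
To make the variety dimension concrete, here is a minimal sketch of reading the three kinds of input with the Python standard library. The function names and sample records are hypothetical and chosen only for illustration; a real pipeline would add validation and type conversion.

```python
import csv
import io
import json


def from_csv(text):
    """Structured input: every row shares the same named columns."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json_lines(text):
    """Semi-structured input: one JSON object per line, fields may vary."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def from_free_text(text):
    """Unstructured input: no fields at all, so keep the raw text and its length."""
    return [{"text": text, "length": len(text)}]


# Hypothetical sample data for each format.
csv_rows = from_csv("order_id,amount\n1,19.90\n2,5.00\n")
json_rows = from_json_lines('{"order_id": 3, "amount": 7.5, "coupon": "SPRING"}\n')
text_rows = from_free_text("Customer called about a late delivery.")
```

Note that the CSV reader returns every value as a string, while the JSON record keeps its numeric types and carries an extra field. Reconciling such differences is a large part of the work that variety creates.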

Two further Vs are frequently added in practice: veracity (the reliability and quality of the data) and value (the actual business benefit that can be extracted). Data only qualifies as a Big Data problem when at least one of the three core dimensions (volume, velocity or variety) exceeds what classic tools can handle efficiently.

Traditional approach vs Big Data approach

The fundamental difference lies in scaling strategy. Traditional systems scale vertically (a bigger, more powerful machine), while Big Data systems scale horizontally by distributing storage and computation across many machines.
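
As a rough illustration of horizontal scaling, the sketch below assigns record keys to machines with a stable hash, so that each machine stores only a share of the data. The names node_for and partition and the customer keys are hypothetical; this is a simplified model, not how any particular product distributes data.

```python
import hashlib


def node_for(key: str, node_count: int) -> int:
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partition(keys, node_count):
    """Group keys by the node that would store them."""
    nodes = {index: [] for index in range(node_count)}
    for key in keys:
        nodes[node_for(key, node_count)].append(key)
    return nodes


# Hypothetical keys: 1,000 customer records spread over 3, then 4 machines.
keys = [f"customer-{i}" for i in range(1000)]
three_nodes = partition(keys, 3)
four_nodes = partition(keys, 4)

for index, stored in four_nodes.items():
    print(f"node {index}: {len(stored)} keys")
```

With plain modulo hashing, adding a machine moves most keys to a different node. Distributed stores commonly use techniques such as consistent hashing to limit that reshuffling.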

Criterion       | Traditional data processing       | Big Data processing
--------------- | --------------------------------- | --------------------------------------------
Data volume     | Gigabytes, fits on one server     | Terabytes to petabytes, distributed
Scaling model   | Vertical (upgrade the machine)    | Horizontal (add more machines)
Data structure  | Mostly structured, fixed schema   | Structured, semi-structured and unstructured
Typical storage | Relational database (SQL)         | Distributed file systems, NoSQL, data lakes
Processing      | Single server, sequential queries | Distributed/parallel (e.g. MapReduce, Spark)

Common technologies in this space include the Hadoop ecosystem and its HDFS distributed file system, Apache Spark for in-memory processing, NoSQL databases (such as document, column-family or key-value stores), and data lakes that retain raw data in its native format until it is needed.
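
The MapReduce style of processing mentioned above can be sketched in plain Python. This is a single-process imitation under stated assumptions: the three strings stand in for data blocks that would live on different machines, and the function names are hypothetical. Frameworks such as Hadoop or Spark run the same phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of data."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical log fragments, one per "machine".
chunks = ["error timeout error", "ok ok error", "timeout ok"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'error': 3, 'timeout': 2, 'ok': 3}
```

Because each chunk is mapped independently, the map phase can run on as many machines as there are chunks. Only the shuffle step needs to move data between them.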

Business use cases

For an SME or mid-market company, Big Data becomes relevant once data sources multiply and a single database can no longer answer the questions the business asks of it. Concrete applications include:

  • Predictive maintenance — analysing sensor and machine telemetry to anticipate equipment failures before they happen.
  • Customer behaviour analysis — aggregating clickstreams, transactions and support interactions to segment audiences and personalise offers.
  • Fraud and anomaly detection — processing high-velocity transaction streams to flag suspicious patterns in near real time.
  • Supply chain and logistics optimisation — combining inventory, demand and external data to improve forecasting and routing.
  • Recommendation engines — using large interaction histories to suggest products or content.
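
The fraud and anomaly detection case can be illustrated with a small sketch that scans a stream of transaction amounts and flags values far from the recent average. The window size, the threshold and the sample amounts are assumptions made for the example; production systems typically use richer features and models.

```python
from collections import deque
from statistics import mean, pstdev


def flag_anomalies(amounts, window=5, threshold=3.0):
    """Return the indexes of amounts that deviate strongly from the recent window."""
    recent = deque(maxlen=window)
    flagged = []
    for index, amount in enumerate(amounts):
        # Only score a value once the window holds enough history.
        if len(recent) == window:
            average = mean(recent)
            spread = pstdev(recent)
            if spread > 0 and abs(amount - average) / spread > threshold:
                flagged.append(index)
        recent.append(amount)
    return flagged


# Hypothetical transaction amounts arriving one at a time.
stream = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 950.0, 21.0, 20.0]
suspicious = flag_anomalies(stream)
print(suspicious)  # [6]
```

The function keeps only the last few values in memory, which is what lets this kind of check run on a continuous stream rather than on a stored batch.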

Big Data is also the foundation for machine learning and AI projects, which depend on large, varied training datasets. In practice, the technical challenge is rarely storing the data — it is engineering pipelines that turn raw, high-volume data into reliable, queryable information that supports decisions.

Frequently asked questions

How large does a dataset have to be to count as Big Data?

There is no universal size threshold. Data is considered Big Data when its volume, velocity or variety exceeds what a single conventional server and relational database can process efficiently. The trigger is the technical limit of traditional tools, not a specific number of gigabytes.

What is the difference between Big Data and a regular database?

A regular relational database typically stores structured data on a single server and scales by upgrading that machine. Big Data systems distribute storage and processing across many machines, handle structured, semi-structured and unstructured formats, and scale by adding more servers. The two are complementary rather than mutually exclusive.

Is Big Data the same as AI or machine learning?

No. Big Data is about storing and processing very large, varied datasets, while AI and machine learning are about building models that learn from data. However, the two are closely linked: most machine learning models depend on large, high-quality datasets, so Big Data infrastructure often serves as the foundation for AI projects.

Does every company need Big Data technologies?

Not necessarily. Many organisations operate well with traditional databases and analytics tools. Big Data technologies become worthwhile only when data volume, speed or format diversity overwhelms classic systems, or when the business needs real-time processing of large streams. The right choice depends on actual data characteristics, not on the trend.
