Cassandra is open source and is in development at Apache.
The Apache Cassandra project brings together Dynamo’s fully distributed design
and Bigtables Column family based data model. Cassandra is adapting to recent
advances in distributed algorithms like Accural style failure detection and
others. Cassandra is proven as it is in use by Digg, Facebook, Twitter, Reddit,
Rackspace, Cloudkick, Cisco. The largest production cluster has over 100 TB of
data in over 150 machines. It is Fault tolerant, decentralizes and gives the
control to developers to choose between synchronous and asynchronous data
replication. It offers rich data model, to efficiently compute using key and
value pairs. It is highly scalable both in terms of storage volume and request
throughput while not being subject to any single point of failure. It is
durable and supports third party applications. Cassandra aims to run on top of
an infrastructure of hundreds of nodes (possibly spread across different data
centers).
At this scale, small and large components fail
continuously. The way Cassandra manages the persistent state in the face of
these failures drives the reliability and scalability of the software systems
relying on this service. While in many ways Cassandra resembles a database and
shares many design and implementation strategies therewith, Cassandra does not
support a full relational data model; instead, it provides clients with a
simple data model that supports dynamic control over data layout and format.
Cassandra system was designed to run on cheap commodity hardware and handle
high write throughput while not sacrificing read efficiency.
What is Cassandra?
Apache Cassandra is a highly scalable and
high-performance distributed database management system that can serve as both
an operational datastore (the “system of record”) for online/transactional
applications, and as a read-intensive database for business intelligence
systems. Cassandra is able to manage the distribution of data across multiple
data centers and offers incremental scalability with no single points of
failure. It is a NoSQL database that is decentralized (No single point of
failure), elastic (Linear Scalability), fault Tolerant(Replication), optimized
for writes, reads. It is a structured storage system over a P2P network.
Cassandra uses a synthesis of well-known techniques to achieve scalability and
availability. Cassandra is a distributed storage system for managing structured
data that is designed to scale to a very large size across many commodity
servers, with no single point of failure.
The idea is to run on top of an infrastructure of
hundreds of nodes, where small and large components in the data centers fail
continuously. Over the edge, Cassandra achieves scalability, high performance,
high availability and applicability. It does not support a full relational data
model. Instead it provides clients with a simple data model as explained later.
Many modern businesses have outgrown the typical RDBMS
use case and are in need of data management software that offers more. Sharding
was a stop-gap measure, but architectural limitations, and the management
complexity it requires, make it unacceptable for many mainstream organizations.
Successful web companies like Facebook, Yahoo, Google, and others like them,
first exposed the need for a more forward-thinking method beyond sharding that
managed all types of data, but it wasn’t long before that need became prevalent
in nearly every industry.
“Apache Cassandra is an open source, distributed,
decentralized, elastically scalable, highly available, fault-tolerant, tuneably
consistent, column-oriented database that bases its distribution design on
Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook,
it is now used at some of the most popular sites on the Web.” Here we see a lot
of complicated words such as distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, column-oriented etc. so
let’s examine them in brief.
Comparing the Cassandra Data Model to a Relational
Database:
The Cassandra data model is designed for distributed
data on a very large scale. Although it is natural to want to compare the
Cassandra data model to a relational database, they are really quite different.
In a relational database, data is stored in tables and the tables comprising an
application are typically related to each other. Data is usually normalized to
reduce redundant entries, and tables are joined on common keys to satisfy a
given query. For example, consider a simple application that allows users to
create blog entries. In this application, blog entries are categorized by
subject area (sports, fashion, etc.).
In Cassandra, the keyspace is the container for your
application data, similar to a database or schema in a relational database.
Inside the keyspace are one or more column family objects, which are analogous
to tables. Column families contain columns, and a set of related columns is
identified by an application-supplied row key. Each row in a column family is
not required to have the same set of columns. Cassandra does not enforce
relationships between column families the way that relational databases do between
tables: there are no formal foreign keys in cassandra, and joining column
families at query time is not supported. Each column family has a
self-contained set of columns that are intended to be accessed together to
satisfy specific queries from your application.
For example, using the blog application example, you
might have a column family for user data and blog entries similar to the
relational model. Other column families (or secondary indexes) could then be
added to support the queries your application needs to perform. For example, to
answer the queries "what users subscribe to my blog" or "show me
all of the blog entries about fashion" or "show me the most recent
entries for the blogs I subscribe to", you would need to design additional
column families (or add secondary indexes) to support those queries. Keep in
mind that some denormalization of data is usually required.
Cassandra Architecture:
Cassandra can satisfy many data-driven application use
cases through a carefully thought-out architecture designed to manage all forms
of modern data, scale to meet the requirements of “big data” management, offer
linear performance scale-out capabilities, and deliver the type of high
availability that most every online, 24x7 application needs. At its foundation,
Cassandra is a peer-to-peer distributed data management system where every node
is essentially the same with respect to how it functions in the cluster. In
Cassandra, there is no concept of a “master node” or anything similar, with the
benefit being derived that no single point of failure exists for any key
process or function.
The scale-out aspect of Cassandra allows node
additions to occur with no disruption to application uptime. Capacity to handle
increasing I/O traffic or incoming data volumes is added easily and requires no
special ETL processes or other data movement work to be performed manually.
Instead, Cassandra automatically partitions data across nodes once one or more
nodes have been added to a cluster and “seeds” the new nodes from existing
machines in the cluster. Data redundancy to protect against hardware failure
and other data loss scenarios is also built into and managed transparently by
Cassandra. Further, this capability can be configured to be quite sophisticated
so data can be distributed across multiple, geographically dispersed data
centers, between different physical racks in a data center, and between public
cloud providers and on-premise managed data centers.
Comments
Post a Comment