Cassandra 101 : Understanding What Cassandra Is

加入dbDao.com Oracle技术学习QQ群:171092051

 

如果自己搞不定可以找诗檀软件专业ORACLE数据库修复团队成员帮您恢复!

诗檀软件专业数据库修复团队

服务热线 : 400-690-3643 备用电话: 13764045638 邮箱:service@parnassusdata.com

 

As some of you may know, in my current role at Pythian, I am tackling OSDB and currently Cassandra is on my radar. So one of the things I have been trying to do is learn what Cassandra is, so in this series, I’m going to share a bit of what I have been able to learn.

According to the whitepaper “Solving Big Data Challenges for Enterprise Application Performance Management” , Cassandra is a “distributed key value store developed at Facebook. It was designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service without single point of failure allowing replication even across multiple data centers as well as for choosing between synchronous or asynchronous replication for each update.”

Cassandra, in layman’s terms, is a NoSQL database developed in JavaOne. One of Cassandra’s many benefits is that it’s an open source DB with deep developer support. It is also a fully distributed DB, meaning that there is no master DB, unlike Oracle or MySQL, so this allows this database to have no point of failure. It also touts being linearly scalable, meaning that if you have 2 nodes and a throughput of 100,000 transactions per second, and you added 2 more nodes, you would now get 200,000 transactions per second, and so forth.

2015-03-12_1145

Cassandra is based on 2 core technologies, Google’s Big Table and Amazon’s Dynamo, which Facebook uses to power their Inbox Search feature and released it as an open source project on Google Code and then incubated at Apache, and is nowadays a Top-Level-Project. Currently there exists 2 versions of Cassandra:

Since Cassandra is a distributed system, it follows the CAP Theorem, which is awesomely explained here, and it states that, in a distributed system, you can only have two out of the following three guarantees across a write/read pair:

  • Consistency.- A read is guaranteed to return the most recent write for a given client.
  • Availability.-A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
  • Partition Tolerance.-The system will continue to function when network partitions occur.

Also Cassandra is a BASE (Basically Available, Soft state, Eventually consistent) type system, not an ACID (Atomicity, Consistency, Isolation, Durability) type system, meaning that the system is optimistic and accepts that the database consistency will be in a state of flux, not like ACID which is pessimistic and it forces consistency at the end of every transaction.

Cassandra stores data according to the column family data model where:

  • Keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application.It also defines the replication strategy and data objects belong to a single keyspace
  • Column Family is a set of  one,two or more individual rows with a similar structure
  • Row is a collection of sorted columns, it is the the smallest unit that stores related data in Cassandra, and any component of a Row can store data or metadata
    •  Row Key uniquely identifies a row in a column family

      •  Column key uniquely identifies a column value in a row
      •  Column value stores one value or a collection of values
keyspace

Also we need to understand the basic architecture of Cassandra, which has the following key structures:

  • Node is one Cassandra instance and is the basic infrastructure component in Cassandra. Cassandra assigns data to nodes in the cluster, each node is assigned a part of the database based on the Row Key. Usually corresponds to a host, but not necessarily, specially in Dev or Test environments.
  • Rack is a logical set of nodes
  • Data Center is a logical set of Racks, a data center can be a physical data center or virtual data center. Replication is set by data center
  • Cluster contains one or more data centers and is the full set of nodes which map to a single complete token ring
Cassandra_Arch

Conclusion

Hopefully this will help you understand the basic Cassandra concepts. In the next series, I will go over architecture concepts of what a Seed node is, the purpose of the Snitch and topologies, the Coordinator node, replication factors, etc

Note 1:

André Araújo, a great friend of mine and previous Pythianite, wrote about his first experience with Cassandra : My First Experience with Cassandra – Part 1

Note 2:

This post was originally published in my personal blog: rene-ace.com

关注dbDao.com的新浪微博

扫码关注dbDao.com 微信公众号:

关注dbDao.com的新浪微博

扫码加入微信Oracle小密圈,了解Oracle最新技术下载分享资源

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569