Thursday 9 February 2017

Understanding MemSQL in 5 Points


MemSQL is a relational database management system with a SQL interface and, it stores data in memory and runs in a cluster.
1. What is MemSQL?
MemSQL is a relational database management system with a SQL interface. It has the following additional properties:
·         Distributed architecture: MemSQL typically runs on a “cluster” of servers. However, users address a single interface to get data in or out of MemSQL. MemSQL handles details like which servers store certain volumes of data, and distributed query execution logic. Users benefit from the resources of many servers without needing distributed systems expertise.

·         Memory-optimized: MemSQL gives users the option to read and write data directly to and from main memory. Accessing data in DRAM is orders of magnitude faster than disk or even flash/SSD, and this provides extreme performance benefits for high-throughput transaction processing and real-time analytics. However, MemSQL does not require that all data fit in memory, and includes the option to store data, in a compressed format, on flash/SSD/disk.

·         Software only: MemSQL is distributed as software, and can run in variety of Linux environments including on physical hardware (“bare metal”), in virtual machines, or in software containers. It can be run on premises, in your data center, or in a public cloud. Unlike legacy database vendors that require the use of proprietary hardware, MemSQL can run on commodity servers.
2. Where does MemSQL store its data?
When creating a table in MemSQL, you specify whether the data for that table will reside in memory or on disk.
MemSQL has two table types: rowstore tables and columnstore tables.
Rowstore tables are primarily used for operational or transactional workloads, particularly workloads that require rapid updates. MemSQL rowstore tables store data entirely in memory, with logs and full database snapshots written to disk for durability.
Columnstore tables, on the other hand, are typically used for analytical workloads and data warehousing, as the format naturally lends itself to compression, efficient scanning, and rapidly appending data. MemSQL columnstore tables store data primarily on disk, but cache data in memory when possible.
3. How does MemSQL ensure durability for data stored in memory?
MemSQL writes logs and full database snapshots to disk, which can be used to recover state in the event that a machine turns off. MemSQL writes logs as data changes, and periodically triggers a full backup of the data in memory (snapshot). Users can configure the frequency of full database snapshots.
MemSQL supports transaction processing, and exposes parameters for the user to tune performance. The most common question we get is how do we ensure durability when processing transactions in memory. In MemSQL, a transaction is “committed” when it has been written to the in-memory transaction buffer. MemSQL keeps a thread running in the background that is constantly writing blocks of data from the in-memory transaction buffer to disk as logs. The size of the transaction buffer is configurable. For instance, setting the size of the transaction buffer to 0 MB means that a transaction will not be committed until it has been logged to disk. In practice, it is not necessary to set the transaction buffer to 0 MB since using a buffer allows MemSQL to write large chunks of data to disk all at once, and because doing so requires only sequential I/O.

4. How is MemSQL architected?
A MemSQL cluster consists of two types of nodes: aggregator and leaf nodes. Client applications connect to an aggregator, which serves as the query router. Aggregators are aware of the entire cluster and know where specific data reside. Leaf nodes handle data storage and most of the computation during query execution. When the client sends a query, the aggregator splits it into several queries which are sent to each leaf node. The leaf computes its query and sends the result back to the aggregator. The aggregator consolidates the results from each leaf and sends the final result back to the client.
Every cluster has a special aggregator called the Master Aggregator (non-Master aggregators are called child aggregators). Any aggregator, master or child, can process data manipulation language (DML) queries including SELECT, INSERT, UPDATE, and DELETE. Data definition language (DDL) commands, such as CREATE, DROP, or ALTER TABLE, must run on the Master Aggregator. In the event that the Master Aggregator machine fails, an administrator can “promote” a child aggregator to Master.
Database administrators can add (and remove) nodes to the cluster at any time while keeping the cluster online, even while running a workload. Simply provision additional nodes, then add them to the MemSQL cluster. The easiest way to do this is through MemSQL Ops, but it can also be accomplished by sending commands to the Master Aggregator.
5. How does MemSQL licensing work? How do I know how many machines I will need in my cluster?
MemSQL is licensed based on the total RAM capacity of a cluster (the sum of the RAM in each server/VM/container in the cluster). Unlike some vendors in the database space, we don’t license based on number of CPU cores. This allows our customers to maximize CPU resources without licensing overhead.
When planning, note that the cluster must have enough RAM for the following:
·         Rowstore data storage in memory
·         Rowstore and columnstore query execution
·         The operating system (kernel)
MemSQL enables several types of workloads ranging from stream processing and real-time analytics, to transaction processing, to data warehousing. Each of these workloads has different performance characteristics and latency requirements. 

No comments:

Post a Comment