Author: MC Brown

MariDB to Hadoop in Spanish

Nicolas Tobias has written an awesome guide to setting up replication from MariaDB to Hadoop/HDFS using Tungsten Replicator, in Spanish! He’s planning more of these so if you like what you see, please let him know!

Semana santa y yo con nuevas batallas que contar.
Me hayaba yo en el trabajo, pensando en que iba a invertir la calma que acompa;a a los dias de vacaciones que libremente podemos elegir trabajar y pense: No seria bueno terminar esa sincronizacion entre los servidores de mariaDB y HIVE?
Ya habia buscado algo de info al respecto en Enero hasta tenia una PoC montada con unas VM que volvi a encender, pero estaba todo podrido: no arrancaba, no funcionba ni siquiera me acordaba como lo habia hecho y el history de la shell er un galimatias. Decidi que si lo rehacia todo desde cero iba a poder dejarlo escrito en un playbook y ademas, aprenderlo y automatizarlo hasta el limite de poder desplegar de forma automatica on Ansible.

via De MariaDB a HDFS: Usando Continuent Tungsten. Parte 1 | run.levelcin.co

End of an Era: Neither MC nor Continuent Attending Percona Live

Continuent have been a long term sponsor of the Percona Live conference, and the MySQL conference as it was before that, for many years. We have attended the conference both as a Diamond sponsor, and members of our staff attending and presenting our products and experience at the conference.
The nature of these conferences always changes over time, and we have seen over the last few years how the Percona Live conference has moved from being a pure MySQL conference to an open source database conference. Although Continuent continue to provide open source software and integrate with many open source databases, our core operation still revolves around MySQL clustering and replication for MySQL and Oracle.
Continuent is also evolving and changing and we are increasingly deploying and moving towards pure cloud-based environments, building and developing products that are used on the cloud or explicitly leverage cloud computing technology. We have a number of new products and initiatives specifically targeting these areas.
Over the course of the next year we will be releasing cloud editions of our clustering, replication and new backup and proxy services both directly and through our partners.
As such, this year we have made the difficult decision not to sponsor or attend the Percona Live conference, directing our energies to other conferences, webinars and meetups. We will be attending the AWS conference, for example, and we fully intend to be at some other select conferences this year dealing with analytics, in-memory computing, and cloud-based deployments.
To stay up-to-date with what Continuent are doing, keep reading the Continuent blog and follow us on Twitter and Facebook.

Replicating into Elasticsearch Webinar Followup

We had a great webinar on Wednesday looking at how we can use Tungsten Replicator for moving data into Elasticsearch, whether that’s for analytics, searching, or reporting.
You can ahead and watch the video recording of that session here
We had one question on that session, which I wanted to answer in full:
Can UPDATE and DELETE be converted to INSERT?
One of the interesting issues with replicating databases between databases is that we don’t always want the information in the same format when it goes over another side. Typically when replicating within a homogeneous environment the reason we are using replication is that we want an identical copy over on the other side of the process. In heterogeneous, especially when we move the data out to an analytics environment like Elasticsearch, we might not. That covers a whole range of aspects, from the datatypes and other columns.
When it comes to the point in this question, what we’ve really got is an auditing rather than strict data replicating style scenario, and it sets up a different set of problems.
The simple answer is that currently, no, we don’t. 
The longer answer is that we want to do this correctly and it needs some careful considerations depending on your target environment.
Usually, we want to update the target database with a like-for-like operation for the target database, and we do that by making use of a primary key or other identifiers to update the right information. There can only be one primary key or identifier and if we are tracking modifications rather than the current record state, we need to not use the primary ID. But that also means we don’t normally track what operation created the entry we just modify the record accordingly which would additionally require another field. That means for the above request to make sense within the typical replication deployment, we need to do two things:

Stop using a primary key as the record identifier
Record the operation information (and changes) in the target. For most situations, this means you want to know the old and new data, for example, in an update operation.

Fortunately, within Elasticsearch we can do this a bit more easily, first by using the auto-generated ID, and second by using the document structure within Elasticsearch to model the old/new information, or add a new field to say what operation that is.
We haven’t that within Elasticsearch yet but have learnt some good practices for handling that type of change within our Kafka applier. Those changes will be in a release shortly and we’ll be adding them into Elasticsearch soon after.
Have many more questions about Elasticsearch or Tungsten Replicator? Feel free to go ahead and ask!
 
 
 

Moving data in real-time into Amazon Redshift follow-up Questions

We had a really great session yesterday during the webinar on Amazon Redshift replication, but sadly ran out of time for questions. So as promised, let’s try and answer the questions asked now!
Are you going to support SAP IQ as a target ?
This is the first request we’ve ever received, and so therefore there are no firm plans for supporting it. As a target, SAP IQ supports a JDBC interface, and therefore is not hugely complicated to achieve compared to more complex, customer appliers.
Can you replicate from SAP ASE ?
No, and no current plans.
What about latin1 character ? Is it supported ?
Yes, although the default mode is for us to use UTF-8 for all extract and apply jobs. You can certainly change the character set, although leaving UTF-8 should also work.
Is it a problem not to have primary keys on Redshift ?
Yes, and no. The replicator needs some way to identify a unique row within the target database so that we don’t inadvertently corrupt the data. Because of the way the data is loaded, we also need to have enough information to be able to run a suitable materialise process, and that relies on having that primary key information to use to identify those rows. Obviously, primary keys provide us with the information we need to do that.
In a future release, we will support the ability to add an explicit custom primary key to tables, even if the source/target table doesn’t have explicit primary keys. This means that if you can use multiple columns to uniquely identify a row, we can use this to perform the materialisation.
What if someone runs a truncate on the source ?
Currently, TRUNCATE is not supported as it is a DDL operation that is normally filtered. With the new DDL translation functionality however we have the ability to support this, and decide how it is treated. Support for TRUNCATE is not yet supported even in the new DDL translation support. In a future release we will process a TRUNCATE operation as we would any other DDL, and therefore choose whether it deletes or truncates the table, or creates a copy or archive version of the table.
Is it possible to filter out DML operations per table ? Forbid DELETEs on a given table.
Yes. We have a filter called SkipEventByType, which allows you to select any table and decide whether operations apply to it. That means that you can configure by table and/or schema whether you want to allow INSERT, UPDATE or DELETE operations.
Is replication multi-threaded ?
Yes. The replicator already handles multiple threads internally (so for example the reading of the THL and the applying are two separate threads). You can also configure the applier to handle data by multiple threads and this works with our ‘sharding’ system. This is not sharding of the data, but instead sharding of the THL stream within the replication process. To ensure we don’t corrupt data that crosses multiple tables within transactions (the replicator is always transactionally consistent) the sharding is handled on a per-schema basis. So if you are replicating 10 schemas, you can configure 10 threads, and each thread will be handled and applied separately to the target. Also keep in mind that batch-based appliers like Redshift automatically handle multiple simultaneous targets and threads for each table, since each table is loaded individually into the target.
If you are condensing multiple streams (i.e. multiple MySQL or Oracle sources), then each stream also has it’s own set of threads.
However, even with multiple threads, it’s worth remembering that both the MySQL binary log and the Oracle Redo logs are sequential transaction logs, and so we don’t extract with multiple threads because there is only one source of transaction changes, and therefore we only ever extract complete applied and committed transactions.

Hopefully this has answered all the questions (and some other information). But if you have more, please feel free to ask.

Analytical Replication Performance from MySQL to Vertica on MemCloud

I’ve recently been trying to improve the performance of the Vertica replicator, particularly in the form of the of the new single schema replication. We’ve done a lot in the new Tungsten Replicator 5.3.0 release to improve (and ultimately support) the new single schema model.
As part of that, I’ve also been personally looking to Kodiak MemCloud as a deployment platform. The people at Kodiak have been really helpful (disclaimer: I’ve worked with some of them in the past). MemCloud is a high-performance cloud platform that is based on hardware with high speed (and volume) RAM, SSD and fast Ethernet connections. This means that even without any adjustment and tuning you’ve got a fast platform to work on.
However, if you are willing to put in some extra time, you can tune things further. Once you have a super quick environment, you find you can tweak and update the settings a little more because you have more options available.  Ultimately you can then make use of that faster environment to stretch things a little bit further. And that’s exactly what I did when trying to determine how quickly I could get data into Vertica from MySQL.
In fact, the first time I ran my high-load test suite on MemCloud infrastructure, replicating data from MySQL into Vertica, I made this comment to my friend at Kodiak:
The whole thing went so quick I thought it hadn’t executed at all!
In general, there are two things you want to test when using replication to move data from a transactional environment into an analytical one:

Latency of moving the data
Apply rate for moving the data

The two are subtly different. The first measures how long it takes to get the data from the source to the target. The second measures how much data you can move in a set period of time.
Depending on your deployment and application, either, or both, can be critical. For example, if you are using analytics to perform real-time analysis and charging on your data, the first one is the most important, because you want the info up to date as quickly as possible. If you performing log analysis or longer-term trends, the second is probably more important. You may not worry about being a few seconds, but you want many thousands of transactions to be transferred. I concentrated on the former rather than the latter, because latency in the batch applier is something you can control by setting the batch interval.
So what did I test?
At a basic level, I was replicating data from MySQL directly into Vertica. That is, extracting data from the MySQL binary log, and writing that into a table within HPE Vertica cluster using 3 nodes. Each is running in MemCloud, which each running with 64GB of RAM and 2TB of SSD disk space. I’ve deliberately made no configuration changes to Vertica at this point.
The first thing I did was set-up a basic replication pipeline between the two. Replication into Vertica works by batch-loading data through CSV files into Vertica tables and then ‘materialising’ the changes into the carbon copy tables. Because it’s done in batches, the latency is effectively governed by the batch apply settings, which were configured for 10,000 rows or 5 seconds.
To generate the load, I’ve written a script that generates 100,000 rows of random data, then updates about 70% of those randomly and deletes the other 30%. So each schema is generating about 200,000 rows of changes for each load. This is designed to test the specific batch replication scenario. Ultimately it does this across multiple schemas (same structure). I specifically use this because I match this with the replication to get replication (rather than transaction) statistics. I need to be able to effectively monitor the apply rate into Vertica from MySQL.
The first time I ran the process of just generating the data and inserting into MySQL, the command returned almost immediately. I seriously thought it had failed because I couldn’t believe I’d just inserted 200,000 rows into MySQL that quick. Furthermore, over on the Vertica side, I’m monitoring the application through the trepctl perf command, which provides the live output of the process. And for a second I see the blip as the data is replicated and then applied. I thought it was so quick, it was a single row (or even failed transaction) that caused the blip.
The first time I ran the tests, I got some good results with 20 simultaneous schemas:

460,000 rows/minute from a single MySQL source into Vertica. 

Then I doubled up the source MySQL servers, so two servers, 40 simultaneous schemas, and ultimately writing in about 8 million rows:

986,000 rows/minute into Vertica across 40 schemas from 2 sources

In both cases, the latency was between 3-7 seconds for each batch write (remember we are handling 10,000 rows or 5s per batch). We are also doing this across *different* schemas at this stage. These are not bad figures.
I did some further tweaks, this time, reconfiguring the batch writes to do larger blocks and larger intervals. This increases the potential latency (because there will be bigger gaps between writes into Vertica), but increases the overall row-apply performance. Now we are handling 100,000 rows or 10s intervals. The result? A small bump for a single source server:

710,000 rows/minute into Vertica across 20 schemas from 1 source

Latency has increased though, with us topping out at around 11.5s for the write when performing the very big blocks. Remember this is single-source, and so I know that the potential is there to basically double that with a second MySQL source since the scaling seems almost linear.
Now I wanted to move on to test a specific scenario I added into the applier, which is the ability to replicate from multiple source schemas into a single target schema. Each source is identical, and to ensure that the ‘materialise’ step works correctly, and that we can still analyse the data, a filter is inserted into the replication that adds the source schema to each row.
The sample data inserts look like this:
insert into msg values (0,”RWSAjXaQEtCf8nf5xhQqbeta”);
insert into msg values (0,”4kSmbikgaeJfoZ6gLnkNbeta”);
insert into msg values (0,”YSG4yeG1RI6oDW0ohG6xbeta”);
With the filter, what gets inserted is;
0, “RWSAjXaQEtCf8nf5xhQqbeta”, “sales1″
0,”4kSmbikgaeJfoZ6gLnkNbeta”, “sales1”
Etc, where ‘sales1’ is the source schema, added as an extra column to each row.
This introduces two things we need to handle on the Vertica side:

We now have to merge taking into account the source schema (since the ID column of the data could be the same across multiple schemas). For example, whereas before we did ‘DELETE WHERE ID IN (xxxx)’, and now we have to do ‘DELETE WHERE ID IN (xxxx) AND dbname = ‘sales1”.
It increases the contention ratio on the Vertica table because we now effectively write into only one table. This increases the locks and the extents processed by Vertica.

The effect of this change is that the overall apply rate slows down slightly due to the increased contention on a single table. Same tests, 20 schemas from one MySQL source database and we get the following:

760,000 rows/minute into Vertica with a single target table

This is actually not as bad as I was expecting when you consider that we are modifying every row of incoming data, and are no longer able to multi-thread the apply.
I then tried increasing that using the two sources and 40 schemas into the same single table. Initially, the performance was no longer linear, and I failed to get any improvement beyond about 10% above that 760K/min figure.
Now it was time to tune other things. First of all, I changed some of the properties on the Vertica side in terms of the queries I was running, tweaking the selecting and query parameters for the DELETE operations. For batch loading, what we do is DELETE and then INSERT, or, in some case, DELETE and UPDATE if you’ve configured it that way. Tweaking the subquery that is being used increased the performance a little.
Changing the projections used also increased the performance of the single schema apply. But the biggest gains, perhaps unsurprisingly, were to change the way the tables were defined in the first place and to use partitions in the table definition. To do this, I modified the original filter that was adding the schema name, and instead had it add the schema hash, a unique Java ID for the string. Then I created the staging and base tables in Vertica using the integer hash as the partition. Then I modified the queries to ensure that the partitions would be used effectively.
The result was that the 760k/min rate was now scalable. I couldn’t get any faster when writing into a single schema, but the rate remains relatively constant whether I am replicating five schemas into a single one, or 40 schemas from two or three sources into the same. In fact, it does ultimately start to dip slightly as you add more source schemas.
Even better, the changes I’d made to the queries and the overall Vertica batch applier also improved the speed of the standard (i.e. multi-schema) applier. I also added partitions to the ID field for too to improve the general apply rate. After testing for a couple days, the average rate:

1,900,000 rows/minute from a single MySQL source into Vertica.

This was also scalable. Five MySQL sources elicited a rate of 8.8 million rows/minute into Vertica, making the applier rate linear with a 1% penalty for each additional MySQL source. The latency stayed the same, hovering around the same level as before of around 11s for most of the time. Occasionally you’d get a spike because Vertica was having trouble keeping up, but
Essentially, we are replicating data from MySQL into Vertica for analytics at a rate I simply called ‘outstandingly staggering’. I still do.
The new single schema applier, database name filter (rowadddbname) and performance improvements are all incorporated into the new Tungsten Replicator.

Filed under: Articles Tagged: memcloud, mysql, tungsten-replicator, vertica

Keynote and Session at Percona Live Dublin 2017

On Sunday I will travel over to Dublin for Percona Live 2017.
I have two sessions, a keynote on the Wednesday morning where I’ll be talking about all the fun new stuff we have planned at Continuent and some new directions we’re working on.
I also have a more detailed session on our new appliers for Kafka, Elasticsearch and Cassandra, that’s Tuesday morning.

If you haven’t already booked to come along, feel free to use the discount code SeeMeSpeakPLE17 which will get you 15% off!
Filed under: Presentations and Conferences Tagged: continuent, mysql, percona live, perconalive, tungsten, tungsten-replicator

Tungsten Replicator and Clustering 5.2.0 Released

Continuent are pleased to announce the release of Tungsten Replicator and Tungsten Clustering 5.2.0
This release is one of our most exciting new releases for a while, as it contains some significant new features and lays the groundwork for some additional new functionality in the upcoming 5.3.0 and 6.0 releases due later this year.
In particular, this release includes the following new features:

New replicator filtering environment to make filtering quicker and easier to use, and more flexible

New filter configuration standard for new filters
New filter to make replication out of a cluster easier
New filters for filtering events and data

New applier for sending Apache Kafka messages directly from an incoming data stream
New applier for adding incoming records directly to Elasticsearch for indexing
New applier for writing data into Apache Cassandra
New commands for the trepctl tool:

qs (quick status) command to show simplified output for the current replicator status
perf (performance) command to show detailed performance statistics in real-time for the replication process
Improvements to the status to show the time in a long-running transactions

Improvements to the thl command:

You can now determine the size of events within the THL to identify large transactions (rows or SQL)
Used in combination with trepctl you can identify long-running transactions more easily

The release also contains a number of fixes and other minor improvements.
 

Upcoming Webinar, 19th July, What is New in Tungsten Replicator 5.2 and Tungsten Clustering 5.2?

Continuent Tungsten 5.2 is just around the corner. This is one of our most exciting Tungsten product releases for some time!
In this webinar we’re going to have a look at a host of new features in the new release, including
Three new Replication Applier Targets (Kafka, Cassandra and Elasticsearch)
New improvements to our core command-line tools trepctl and thl
New foundations for our filtering services, and
Improvements to the compatibility between replication and clustering
This webinar is going be a packed session and we’ll show all the exciting stuff with more in-depth follow-up sessions in the coming weeks.
 
You’ll also learn about some more exciting changes coming in the upcoming Tungsten releases (5.2.1 and 5.3), and our major Tungsten 6.0 release due out by the end of the year.
So come and join us to get the low down on everything related to Tungsten Replicator 5.2 and Tungsten Clustering 5.2. on Wednesday, July 19, 2017 9:00 AM – 9:30 AM PDT at https://attendee.gotowebinar. com/register/ 4108437731342545667
Filed under: Presentations and Conferences Tagged: continuent, mysql, replication, tungsten-replicator

New Continuent Webinar Wednesdays and Training Tuesdays

We are just starting to get into the swing of setting up our new training and webinar schedule. 

Initially, there will be one Webinar session (typically on a Wednesday) and one training session (on a Tuesday) every week from now. We’ll be covering a variety of different topics at each. 

Typically our webinars will be about products and features, comparisons to other products, mixed in with

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569