Monday, November 28, 2005

MySQL 5 general purpose routine library - II: Using arrays

As you know, the MySQL 5 stored routines language does not provide any sort of array structure. This is not a problem in normal SQL, where arrays are a symptom of non-normalized data, but in a programming language their absence is quite a strain: the programmer often struggles to achieve results that would be natural and simple with an array, and are unnecessarily hard without one.
Thus comes into existence this library of routines that emulate arrays in MySQL 5.

Implementation

Arrays, as part of the MySQL General Purpose Stored Routines Library, are implemented as datasets, all belonging to one table, and identified by a unique name. The details of the implementation are hidden by the interface, and you should not need to worry about them.

General concepts about the array library

Whenever possible, each routine is implemented both as a procedure and as a function. Only when a function returns an array value or an array property is there no corresponding procedure. This is a design choice, so that you are free to call each routine in the most convenient way. When using a procedure that has a function synonym, a user variable with the same name as the procedure is set to the function's return value. For instance, calling the function array_create() you get the array ID as the return value.

If you choose to call the procedure instead, then @array_create will contain the array ID.
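For instance (a minimal sketch; the array names are made up for the occasion):

select array_create('my_array', 0) as array_id; -- function call: the ID is the return value
call array_create('my_other_array', 0);         -- procedure call:
select @array_create;                           -- the ID is now in @array_create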

Array creation

You can create an array using array_create(name, size). 'name' is any string not yet used as an array name. If you want to be sure that the name is not taken, use the function array_exists(name). 'size' is the number of items that you want to create (they will be set to NULL).
Using a zero size will create an empty array.

call array_create('myfirst',0);
You can also create an array from a string containing a comma separated list, with the function array_from_list(string, array_name).

If you want to use a different separator, then you may want to look at array_from_list_complete(string, array_name, separator).

call array_from_list('one,two,three', 'myarray');
set @list = 'Doc:Grumpy:Happy:Sleepy:Bashful:Sneezy:Dopey';
call array_from_list_complete(@list, 'snow_white_friends', ':');

array_from_list returns the array name, so it's easy to combine it with a routine that requires an array:


select array_size(array_from_list('one,two,three','myarray'));

Inserting data

You can treat these library arrays in four different ways:
normal arrays

Thus you can insert a value by index: array_set_value_by_index(array_name, index, value) does this, and array_set(array_name, index, value) is its universal alias.
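For instance (a sketch, reusing the 'myfirst' array created above, and assuming the array grows as needed):

call array_set_value_by_index('myfirst', 0, 'first item');
call array_set('myfirst', 1, 'second item'); -- same effect, through the generic alias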

associative arrays

Also known as hashes in Perl or maps in C++. You can set the value by key rather than by index.
array_set_value_by_key(array_name, key, value) is the routine for this, and array_set(array_name, key, value) can be used as an alias for it as well. Notice that when you set a value by key, the array still maintains its automatic index, so you can always access an array by index, even if you set its values by key.
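For instance (again a sketch based on the signatures above):

call array_set_value_by_key('myfirst', 'color', 'blue');
call array_set('myfirst', 'shape', 'round'); -- the alias works by key as well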

queues

First in, first out is the principle of queues. To treat an array as a queue, insert your values using array_unshift(array_name, value), which adds a value at the beginning of the array, and retrieve them using array_shift(array_name), which returns (and deletes) the last element of the array.
Notice that array_unshift is the most expensive method in this collection, because all the existing array indexes need to be shifted by 1.
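A quick sketch of the queue behavior (the array name is made up for the example):

call array_create('jobs', 0);
call array_unshift('jobs', 'first job');
call array_unshift('jobs', 'second job');
select array_shift('jobs'); -- returns (and removes) 'first job': first in, first out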


stacks

Last in, first out. This is one of the most commonly used devices in programming. array_push(array_name, value) adds an item to the end of an array, and array_pop(array_name) returns the last item and deletes it.
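And a matching sketch of the stack behavior:

call array_create('undo_steps', 0);
call array_push('undo_steps', 'step 1');
call array_push('undo_steps', 'step 2');
select array_pop('undo_steps'); -- returns (and removes) 'step 2': last in, first out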

retrieving data

No matter how you entered data into an array, you can always retrieve your items by index. The lowest index is zero, and the highest one is returned by array_max_index(array_name). Knowing this, you can quickly create a loop to walk through an array, using array_get_value_by_index(array_name, index). If you inserted your values by key, you can retrieve them the same way: array_get_value_by_key(array_name, key) does the job. For your convenience, there is also an array_get(array_name, key_or_index) function that you can use to get values by key or index. If you want to set a key for an existing array value, you can use array_set_key_by_index(array_name, index, key).
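As a sketch, such a walk-through loop inside a stored procedure could look like this (the procedure name, parameter size, and loop variable are made up for the example):

delimiter //
create procedure walk_array(in p_array_name varchar(100))
begin
  declare i int default 0;
  -- from the lowest index (zero) to the highest one
  while i <= array_max_index(p_array_name) do
    select array_get_value_by_index(p_array_name, i) as item;
    set i = i + 1;
  end while;
end //
delimiter ;

call walk_array('myarray');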

information on arrays

There are some convenience routines to get information about arrays.
array lists

To see the list of all available arrays, use array_list(), which prints a list of all public arrays.
Private arrays are all arrays whose name starts with an underscore character ('_'). These arrays are usually created and used inside a procedure and should not be shared.
If you want to see them as well, use array_full_list().

array contents

To see all the items in an array, call array_show(array_name).




advanced array usage


inserting several items at once

The array routines were designed to allow flexibility in insertion. In several programming languages you can fill an array with just one statement, be it a declaration or an action. Here, of course, the quickest way is the array_from_list function, but if you have an already existing array and you need to add more items, you can do one of the following:

several array_set calls in a SELECT statement

SELECT array_set('name', 'key1', 'value1'),
       array_set('name', 'key2', 'value2'),
       array_set('name', 'key3', 'value3') \G

chain insertion calls using array_setn

array_setn returns the array name, so it can be used in each statement where an array name is required.

select array_setn(
         array_setn(
           array_setn('name', 'key1', 'value1'),
           'key2', 'value2'),
         'key3', 'value3');


Populate an array from a dataset

To achieve this result, we need help from another part of the library. For example, if we want to insert into an array the names of the USA cities as keys, with their population as values, we can do this:

call for_each_table_value(
'world', -- the database
'City', -- table name
'Name', -- the first column to use
'Population', -- the second column to use
null, -- no need for a third column: null
'CountryCode="USA"', -- filter clause
'call array_set("USA_cities","$I1","$I2")');
In this statement we invoke the for_each_table_value engine to execute our query for each row in table City where CountryCode = 'USA'. For each row, it calls the array_set procedure, using the given columns as arguments ($I1 and $I2). As a result, we'll have an array of 264 items, containing all the USA cities with their population.

call array_show('USA_cities');
+-------------+-------------------------+-------------+
| array_index | array_key | array_value |
+-------------+-------------------------+-------------+
| 0 | New York | 8008278 |
| 1 | Los Angeles | 3694820 |
| 2 | Chicago | 2896016 |
| 3 | Houston | 1953631 |
| 4 | Philadelphia | 1517550 |
... ... ...
| 258 | Fall River | 90555 |
| 259 | Kenosha | 89447 |
| 260 | Elgin | 89408 |
| 261 | Odessa | 89293 |
| 262 | Carson | 89089 |
| 263 | Charleston | 89063 |
+-------------+-------------------------+-------------+
264 rows in set (0.20 sec)

If your dataset is reasonably short, there is another trick you can pull, without using a 'for each' routine. Let's assume that you want to put into an array just the districts for Italy, taken from table 'City' in the example world database.


select array_from_list(
(select group_concat(distinct District)
from world.City where CountryCode='ITA'),
'italian_districts')\G

call array_show('italian_districts');
+-------------+-----------+----------------------+
| array_index | array_key | array_value |
+-------------+-----------+----------------------+
| 0 | NULL | Latium |
| 1 | NULL | Lombardia |
| 2 | NULL | Campania |
| 3 | NULL | Piemonte |
| 4 | NULL | Sicily |
| 5 | NULL | Liguria |
| 6 | NULL | Emilia-Romagna |
| 7 | NULL | Toscana |
| 8 | NULL | Apulia |
| 9 | NULL | Veneto |
| 10 | NULL | Friuli-Venezia Giuli |
| 11 | NULL | Calabria |
| 12 | NULL | Sardinia |
| 13 | NULL | Umbria |
| 14 | NULL | Abruzzi |
| 15 | NULL | Trentino-Alto Adige |
| 16 | NULL | Marche |
+-------------+-----------+----------------------+
Well, a few are missing, but not because of our routine, which seems to work fine. Notice that if the list of entries is longer than 1024 bytes, group_concat will not work as expected, and you need to increase the group_concat_max_len system variable to get correct results. But for short lists this is a trick to keep in mind.
Even better, using array_from_pair_list, you can populate an array with a list of key/value pairs.

set group_concat_max_len = 8000; -- necessary, or group_concat
                                 -- can't work well in this case
select array_from_pair_list(
  (select group_concat( concat(Name, "=>", Population) )
   from world.City where CountryCode = 'USA'),
  'USA_cities')\G
And USA_cities will get the same contents we have seen when using the for_each_table_value method.

What else can you do with arrays? Wonderful things, as you'll see in the next articles about this library.

Sunday, November 27, 2005

MySQL 5 general purpose routine library - I : general concepts

A few days ago I started a forum thread that was quite well received.

There are rumours in the MySQL community that some sort of routine repository is going to be organized sooner or later, and after hearing them I almost dismissed my initial idea.
However, some people convinced me that starting such an initiative could only benefit the community, and after all, if the official repository shows up, it will be easy to merge this work into it.

Therefore, here it goes. This is the first article of a series that will explain what I have made in the field of general purpose routines.

The project is available at savannah.org. The documentation is still in the making, so I will start with a general overview.

PURPOSE


This routine library will collect general purpose MySQL 5 stored
procedures and functions. What is to be considered "general purpose"?
Any routine that enhances the language itself or that can be useful
in an abstract database fits the bill.

Since its appearance, the MySQL stored routines language has proved itself
useful and powerful, and yet it lacks many features that we have become
accustomed to in other languages, and that we would like to see in MySQL::SP
as well. Stored routines that improve the language's expressiveness and
usability fall into this category.

Furthermore, there are common tasks that can get accomplished by a stored
routine, and are not related to a particular database. Also these routines
are good candidates for this library.

CURRENT CATEGORIES


At the very beginning of this project, I have put some categories that got
populated faster than others, and thus they became, in my view, the ones with
the highest priority.

ARRAYS AND BASIC DATA STRUCTURES


Perhaps the biggest disappointment when MySQL::SP language was disclosed was
the lack of arrays. Not as a SQL type, of which frankly I can do without, as
it breaks normalization, but as a language component. Without arrays, and
without things like queues and stacks, I feel that my programming abilities
are severely cut down.
Thus this first component, which implements named arrays where you access
items by index, but which can also be treated like Perl hashes, where items
are accessed by name. Built in with them are some stack and queue features,
so that an array is automatically a queue, if accessed with shift/unshift,
or a stack, if accessed with push/pop.
Further routines to transform arrays to strings and strings to arrays make
this library ready for prime time.

"FOR EACH" LOOPS


I don't know about you, but I find myself quite often doing things like
$ for X in `mysql -B -N -e 'show tables'` ;  \
do perl -pe "something with $X" | \
mysql -e "something else with $X" ; done

i.e. I get a list out of mysql, treat the list with some sort of filter,
and then feed it back to the database.
Sometimes I fire up a quick Perl script to do the whole task, but most of the
times the whole thing would be just a simple matter if MySQL had a FOR loop
that you can use from a regular SQL statement.
This component came out of this need. Now I have routines that accept
some parameters and perform for me a loop with a counter, or using a
dataset of table values, or walk through an array.

NAMED PARAMETERS


In the process of making the "for each" routines, I came to the disappointing
discovery that I had to use quite a lot of parameters if I wanted to achieve
some powerful results. When you have routines that need more than three
parameters, as every programmer knows, it is quite difficult to get the order
right, and many mistakes can come from this situation. Thus, I wanted to
implement a simple way of calling my complex routines with named parameters,
something that in Perl I do like this:
function_name(
db_name => 'world',
table_name => 'City',
expression => 'where CountryCode = "ITA"'
);

This way, if I misplace one of the parameters, it does not matter, because
they are retrieved by name. My routine does almost this ("almost", because
lack of language power prevents an exact implementation). It lets you pass
an array of named parameters (key + value) and a hidden mechanism will
sort out the parameters and pass them to the receiving routine in the
correct order.
The corresponding MySQL code would be:
  select array_set('mypar', 'db_name', 'world'),
         array_set('mypar', 'table_name', 'City'),
         array_set('mypar', 'expression', 'where CountryCode = "ITA"')\G
call simple_sp('function_name', 'mypar');

As a bonus, I now have an array of parameters that I can reuse when I need
them, perhaps after changing one or two parameter values.

SYNTAX HELPERS


As I said, my routines can grow their parameters quite fast, and remembering
all of them can be difficult even for their own author. Here comes another
component that registers all the routines syntax, and has a way of showing
you a quick excerpt of their usage.
This helper is also a key component of the routines that deal with named
parameters.
Just as an example the reference guide that is currently in the repository
was created by just these two calls:
call for_each_table_value_simple(
database(),
'_routine_syntax',
'routine_name',
'routine_type="function"',
'select fsyntax("$I1")')\G
call for_each_table_value_simple(
database(),
'_routine_syntax',
'routine_name',
'routine_type="procedure"',
'select psyntax("$I1")')\G


Curious for more? Visit the project site and if you are interested in contributing, subscribe to the developers mailing list.

Friday, November 25, 2005

Hiding connection parameters with Perl/DBI

In a Perl application, using the widely known DBI module, a database connection is fired up by an instruction such as this one:
my $dbh = DBI->connect("DBI:mysql:test",
"user_name", "my_secret_password", { RaiseError => 1 })
or die "something went wrong ($DBI::errstr)";
The first parameter is compulsory, and it's made of "DBI", which is the same for all connections, the database driver, which in our case is "mysql", and the database name, in our case "test". For a self-respecting database, we should also specify the user name and password.
Nothing to complain about if this script is safely stored in your home directory, screened from prying eyes by the OS permissions.
But what happens when you need to distribute this script to somebody? You need to remember to change the username and password to some dummy text, or else, by dumb chance, you may send out your script with your real credentials for everybody to see.

Luckily, there is an easy solution to hide connection variables from others. You just need to store them in a configuration file, which can be kept in a guarded location and never distributed. So it will be visible to your script as long as it is within the boundaries of your system, but it will be invisible to others, and moreover it won't follow the script if you need to send it away.

It works like this. In your home directory (or the home directory of the user that will execute the script, for example, for a cron job) you create a file named ".my.cnf", containing the following:
[client]
user=user_name
password=my_secret_password
The file needs to be protected with the right privileges (chmod 600 .my.cnf) so that it can't be accessed by other users. Then, you change the connection statement like this:
my $dbh = DBI->connect("DBI:mysql:test"
. ";mysql_read_default_file=$ENV{HOME}/.my.cnf",
undef, undef) or die "something went wrong ($DBI::errstr)";
The "mysql_read_default_file" option will point to the file with the connection variables. Notice that instead of username and password there are two null values (undef). The file name, in this case, is built from the environment variable "$HOME" (in Unix like systems). You can also state it literally, but this trick will make the script usable from different users without modifications. The only thing each of them needs is a configuration file.
For more stuff on Perl/DBI, see DBI recipes.

Tuesday, November 08, 2005

Data on demand for data warehouses

Here is a presentation given at the Open Source Database Conference on November 8, 2005.

Abstract:

Data on Demand is a paradigm for storing data in a data warehouse with huge storage requirements. There are cases where creating statistical tables is not enough, and the detail data must be kept online for long periods. This technique has been applied to MySQL data warehouses associated with industrial production control, saving up to 80% of storage space and improving performance and scalability.


Available presentation material:

* open document (OpenOffice 2) [1.3 MB]

* portable document format (pdf) [2.0 MB]

* openoffice 1.1 [1.3 MB]

* powerpoint [1.2 MB]

* HTML (S5 format) + source code [736 KB]

* on-line browsable presentation (s5 format)

Saturday, November 05, 2005

New data replication paradigms

MySQL's success is due not only to its efficiency, but also to its ease of administration.
And despite what its detractors say about lack of features (that was before
5.0, of course), there is a feature that is crucial for enterprise database
management, and MySQL has had it built in since its early versions. I am talking
about data replication, which can transform a single server into a scalable and
highly reliable system with minimum effort and expense.

Recently, one more element was added to MySQL's set of features. MySQL Cluster is an ambitious engine that turns your servers into a full-fledged cluster that can be accessed from any node without worry. Every node is master and slave at the same time. However, these features come at the price of higher hardware requirements (MySQL Cluster is, as of today, an all-in-RAM database).

MySQL data transfer methods

There are differences between these two approaches. Choosing MySQL Cluster or standard replication is a trade off.

method      : replication
description : asynchronous transfer, based on one master and N slaves
the good    : fast, reliable, robust, easy to set up, no special HW required, transaction-safe
the bad     : needs some care in the code

method      : MySQL Cluster
description : synchronous transfer, based on a cluster engine
the good    : fast, no code modification needed
the bad     : requires huge amounts of RAM, not easy to set up, minimal transaction support
If you start from scratch, it's easy to build your applications in such a way that they are replication-aware, thus benefiting from the low-cost, high scalability of this paradigm.
OTOH, if you have an application that is already established (and perhaps complex and large), adapting it to the one-master-many-slaves architecture could be a problem.
There are external tools that promise to fill the gap and offer clever solutions, mostly based on in-between filtering layers that give your application the illusion that there is just one server, while in truth there are several. These solutions also cost you something, not only in mere money (since they are, with limited exceptions, commercial tools), but also in features, because such filters are mostly not transaction-safe.

Different needs for data transfers

Aside from the mere method of data movement, there are other needs from the enterprise. Notice that this is not some stretch of my imagination, but the practical result of several years of consulting in various database-related matters.

One thing that most medium to large companies want is the reverse of replication, i.e. they want the master to be updated by many slaves. The need is clear when you think of several branches of a company that must send updates about sales records and stock levels. Unfortunately, simple replication does not allow this feature, and even the cluster cannot easily be bent to this need.
Another thing that companies want is conditional replication, based on rules that are far beyond the limited configuration range of both replication and the cluster. For example, some companies wanted a table migrated from server A to server B, but to different tables, depending on a time frame.

Enter MySQL 5, and a new world of possibilities is open to the clever administrator.

New data transfers paradigms: FEDERATED tables

The reverse replication can be achieved with tables using the FEDERATED engine.
For each table you need to replicate from the branch database to the main house, you just need to establish a FEDERATED table and a couple of triggers.
Let's say, just for the sake of it, that you have the main server at big.central.mycompany.com, and the branches at far.away.mycompany.com (code name: FAW) and at fairly.distant.mycompany.com (code name: FDI). The main table you need to update at big.central lives in the "biz" database and looks like this:
CREATE TABLE all_sales
(
branch_id char(3) not null,
sale_date datetime not null,
item_id int not null,
quantity int not null,
primary key (branch_id, sale_date, item_id)
) engine=innodb;
In each branch, you have one table for your branch sales, which you update every day with the consolidated data from your detail tables.
CREATE TABLE branch_sales
(
sale_date datetime not null,
item_id int not null,
quantity int not null,
primary key (sale_date, item_id)
) engine=innodb;
To get the data from branch_sales to the main central table, you build a mirror of that table in each branch, but instead of being InnoDB or MyISAM, it will be FEDERATED.
CREATE TABLE central_all_sales
(
branch_id char(3) not null,
sale_date datetime not null,
item_id int not null,
quantity int not null,
primary key (branch_id, sale_date, item_id)
) engine=FEDERATED
CONNECTION='mysql://user:pwd@big.central.mycompany.com/biz/all_sales';
As a final touch, you add a few triggers to your branch table, to get the data moving:
create trigger branch_sales_ai after INSERT on branch_sales
for each row
insert into central_all_sales
values ( 'FAW', new.sale_date, new.item_id, new.quantity);

create trigger branch_sales_au after UPDATE on branch_sales
for each row
update central_all_sales
set sale_date = new.sale_date,
item_id = new.item_id,
quantity = new.quantity
where
branch_id = 'FAW' -- restrict the update to this branch's rows
and
sale_date = old.sale_date
and
item_id = old.item_id;

create trigger branch_sales_ad after DELETE on branch_sales
for each row
delete from central_all_sales
where
branch_id = 'FAW'
and
sale_date = old.sale_date
and
item_id = old.item_id;

Similar triggers (with a branch_id of 'FDI') will be created in the second branch. Now, every record in branch_sales is replicated to the central table without human intervention. For extra safety, you may add some code to check that the record does not already exist before inserting.

New data transfers paradigms: BLACKHOLE tables

Have you heard about the BLACKHOLE engine?
It's a simple concept: a table with a full description, that does not store anything. Everything you send to such a table is simply discarded, and that is not surprising, since there is no storage device associated with BLACKHOLE tables.
What's the use, then, you may ask. The use is to take advantage of the side effects of such a table, rather than its storage capabilities. For one, records sent to a BLACKHOLE table will leave a trail in the binary log, thus allowing you to replicate data through a server that does not have much storage available, when you need it just as an intermediate relay to spread your data to several more slaves.
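As a sketch, on such an intermediate slave you would keep the table definition identical and just swap the engine (using the biz.all_sales table from the previous section as an example; the server must also run with the log-bin and log-slave-updates options, so that the statements it applies reach its own binary log):

ALTER TABLE biz.all_sales ENGINE=BLACKHOLE;
-- rows replicated into this table are discarded locally,
-- but the statements still end up in the binary log,
-- ready to be served to the downstream slaves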
Moreover, and here we come to the point of our quest, BLACKHOLE tables can fire triggers, and we are going to use them to meet the second need we outlined above, i.e. conditional updates.
Let's say that big.central.mycompany.com wants the sales records sent to different tables depending on the time of day (it's a silly example, but bear with me for a while, just pretending it's something sensible). However, the application that deals with the data transfer is already working, and nobody dares touch it, for fear of disrupting something. But a simple change can achieve the goal.
Let's change the structure of all_sales:
CREATE TABLE all_sales
(
branch_id char(3) not null,
sale_date datetime not null,
item_id int not null,
quantity int not null,
primary key (branch_id, sale_date, item_id)
) engine=BLACKHOLE;
Just one word changed, and now the table will discard any data sent to it.
Before seeing what we can do with it, let's create three more tables.
CREATE TABLE all_morning_sales
(
branch_id char(3) not null,
sale_date datetime not null,
item_id int not null,
quantity int not null,
primary key (branch_id, sale_date, item_id)
) engine=innodb;
CREATE TABLE all_afternoon_sales
(
branch_id char(3) not null,
sale_date datetime not null,
item_id int not null,
quantity int not null,
primary key (branch_id, sale_date, item_id)
) engine=innodb;
CREATE TABLE all_late_sales
(
branch_id char(3) not null,
sale_date datetime not null,
item_id int not null,
quantity int not null,
primary key (branch_id, sale_date, item_id)
) engine=innodb;
Here we have everything ready. Now let's add one trigger:
delimiter //

create trigger all_sales_ai after insert on all_sales
for each row
begin
case
when time(new.sale_date) between '08:00:00' and '13:00:00'
then insert into all_morning_sales
values (new.branch_id, new.sale_date, new.item_id,
new.quantity);
when time(new.sale_date) between '13:00:01' and '18:00:00'
then insert into all_afternoon_sales
values (new.branch_id, new.sale_date, new.item_id,
new.quantity);
else
insert into all_late_sales
values (new.branch_id, new.sale_date, new.item_id,
new.quantity);
end case;
end //

delimiter ;

Similarly, you can create triggers for updates and deletes, and of course you can get creative and update several tables at once with different parts of the incoming data.
There you have it. Now you have some food for thought, and you can launch yourself into new data transfer enterprises.