The Data Charmer: 2005

Saturday, December 31, 2005

New Year quiz

A few days ago I was asked to analyze some data related to a particular service usage in a large communications provider. The (simplified) table looks like this:

CREATE TABLE `xusage` (
`item_id` int(11) NOT NULL,
`begin_date` date NOT NULL,
`xusage` int(11) default NULL,
PRIMARY KEY  (`item_id`,`begin_date`)
) ;

And here is some sample data:

select * from xusage order by begin_date,item_id;
+---------+------------+--------+
| item_id | begin_date | xusage |
+---------+------------+--------+
|       1 | 2005-09-01 |     93 |
|       1 | 2005-09-02 |     90 |
|       2 | 2005-09-02 |     54 |
|       1 | 2005-09-03 |     66 |
|       2 | 2005-09-03 |     74 |
|       3 | 2005-09-03 |     90 |
|       1 | 2005-09-04 |     38 |
|       2 | 2005-09-04 |     49 |
|       3 | 2005-09-04 |     91 |
|       4 | 2005-09-04 |     91 |
|       2 | 2005-09-05 |     47 |
|       3 | 2005-09-05 |     49 |
|       4 | 2005-09-05 |     33 |
|       5 | 2005-09-05 |     83 |
|       3 | 2005-09-06 |     23 |
|       4 | 2005-09-06 |     99 |
|       5 | 2005-09-06 |     93 |
|       4 | 2005-09-07 |     75 |
|       5 | 2005-09-07 |     47 |
|       5 | 2005-09-08 |      3 |
+---------+------------+--------+

The task at hand was to produce another table (for charting purposes), having the same columns as xusage, with some additional pieces of information: the total usage for that day and the following day, the usage for item_id on day +1, the percentage of such usage from the total of all usage in day + 1;
For example, in the given data, the total usage on September 1st was 93, and on September 2nd was 144. The wanted columns for the first two days would be

+---------+------------+--------+-----------+-------------+-------------+------------------+
| item_id | begin_date | xusage | total day | total day+1 | usage day+1 | percentage day+1 |
+---------+------------+--------+-----------+-------------+-------------+------------------+
|       1 | 2005-09-01 |     93 |        93 |          93 |          90 |          62.5000 |
|       1 | 2005-09-02 |     90 |       144 |         144 |          66 |          28.6957 |
|       2 | 2005-09-02 |     54 |       144 |         144 |          74 |          32.1739 |
+---------+------------+--------+-----------+-------------+-------------+------------------+

The original table had 50 million records. Therefore I had to find an efficient way of fulfilling the task.
It can be done with a single query, although performance would suffer. I chose to create a supporting table to ease the task.

My solution is here.
How would you have solved it? (BTW, there is no prize, unless Arjen wants to provide some.
Happy New Year!

Friday, December 30, 2005

function WELL_KNOWN_FUNC does not exist

That came today in a newsgroup. A user was complaining that he was able to encrypt and decrypt some data with an early release of MySQL 5.0 and not being able to decrypt it with the latest one.
The message he received was, allegedly
function AES_DECRYPT does not exist
That puzzled me, of course. I have the latest MySQL version installed in one of my servers, and I tried it straight away:


select version(), AES_DECRYPT(AES_ENCRYPT('abc','y'),'y');
+----------------+-----------------------------------------+
| version()      | AES_DECRYPT(AES_ENCRYPT('abc','y'),'y') |
+----------------+-----------------------------------------+
| 5.0.17-max-log | abc                                     |
+----------------+-----------------------------------------+

So the function does exist. But what was the poster complaining about? Perhaps he did not report the error message correctly.
Let's try to put a space between the function name and the parenthesis.


select version(), AES_DECRYPT (AES_ENCRYPT('abc','y'),'y');
ERROR 1305 (42000): FUNCTION test.AES_DECRYPT does not exist

Aha! That was the problem. The message was missing one piece that would have given away the answer immediately.
The problem was that he was calling the function with a space between name and parenthesis and the manual tells you how it will accepted by the interpteter.
If you are used to spaces in front of parentheses and you'd like the interpreter to make some additional effort before reporting failure, you can always use the sql mode to inform the server of your preferences. Either start the server with the option --sql-mode=IGNORE_SPACE, or change it run-time:


set sql_mode='IGNORE_SPACE';

# or if you have the SUPER privilege:
set global sql_mode='IGNORE_SPACE';

select version(), AES_DECRYPT (AES_ENCRYPT('abc','y'),'y');
+----------------+------------------------------------------+
| version()      | AES_DECRYPT (AES_ENCRYPT('abc','y'),'y') |
+----------------+------------------------------------------+
| 5.0.17-max-log | abc                                      |
+----------------+------------------------------------------+

Tuesday, December 20, 2005

MySQL 5 general purpose routine library - V : more on arrays

Hi.
It's me again.
That one who is crazy for data structures and believes they are at all useful.
I know you've seen an article about arrays just a few days ago, and it was mine as well, but I have something new to share and here goes.
I told you that arrays were useful.
So useful, in fact, that improving the library itself became quite an easy task.
And so I made some more addition, such as operations on whole arrays, rather than on single items.
(Fifth part of the MySQL 5 general purpose routine library presentation.)

multiple users

It occurred to me that having arrays in a multi-user environment could be either a blessing or a disaster, depending on which side you are.
If you are likely to see the benefits, having a way of sharing data among users could be a pro.
If you are more focused on the drawbacks, then you'll see that having a user modify an array while you are doing something highly sensitive on it would not be desirable.
So I said, let's have it both ways.
And I made a few adjustments to the library, so that arrays can be restricted or shareable depending on one single function that will influence all the routines.

  create function array_user()
 returns varchar(50)
 reads sql data
 begin
     -- To have separate arrays for each user
     -- uncomment the following line.
     return substring_index(user(),'@',1);
     --
     -- To have arrays shared among users
     -- use the statement below.
     -- return 'all users';
 end//

This is the "multi-user" implementation. Installing the library this way, each user will see only the arrays (s)he has created. Using the library with this function, different users can use an array with the same name, and they will stay separated and independent. Changing the return statement so that it gives 'all users', all arrays are unique among multiple users, i.e. if user A creates an array 'X', that array can be accessed, modified and erased by user B as well.

copying

Before you had arrays, you didn't care much. But now that they exist and are usable, you start wondering if you can make a copy of an array without being forced to insert items one by one. Of course you could do that using a for_each like function, but why bother, since there is a low-level array_copy function that will do it for you?
(Yes, I know that a few days ago it was not there. I just made it yesterday, it passed the tests today, and now it's ready for general usage. )
Back to the issue. To copy an array, just call array_copy, supplying the source and destination name.

    select array_copy('A', 'B');
 # now array 'B' contains the same items as array 'A'

While copying, you can also do some additional actions, such as filtering, appending and more. As you suspect, there is a array_copy_complete that allows a wide degree of control on your copy.

sorting

Sorting is a special case of copying, since it will create a new array, with the elements sorted. By default, array_sort will create a new array with the elements sorted by value. If this is not enough for you, there is array_sort_complete that lets you choose among different ways of sorting (by value, by key/value, by value as numeric) and the sorting direction (ascending, descending).

    call array_show('A');
 +-------------+-----------+------------+
 | array_index | array_key | "A" values |
 +-------------+-----------+------------+
 |           0 | g         | sheep      |
 |           1 | b         | cow        |
 |           2 | c         | ox         |
 |           3 | d         | donkey     |
 |           4 | a         | dog        |
 +-------------+-----------+------------+

 select array_sort('A','B');

 call array_show('B');
 +-------------+-----------+------------+
 | array_index | array_key | "B" values |
 +-------------+-----------+------------+
 |           0 | b         | cow        |
 |           1 | a         | dog        |
 |           2 | d         | donkey     |
 |           3 | c         | ox         |
 |           4 | g         | sheep      |
 +-------------+-----------+------------+

 # sorted by value

 select array_sort_complete('A','B','K','asc');

 call array_show('B');
 +-------------+-----------+------------+
 | array_index | array_key | "B" values |
 +-------------+-----------+------------+
 |           0 | a         | dog        |
 |           1 | b         | cow        |
 |           2 | c         | ox         |
 |           3 | d         | donkey     |
 |           4 | g         | sheep      |
 +-------------+-----------+------------+
 # sorted by key

Notice that the original array is not changed.

merging

If you want to create an array from the contents of two existing arrays, array_merge is your function. One complication that arises when merging arrays is the 'keys', which must be unique in each array. Thus, if both arrays have one or more key in common, only one value can be taken. By default, array_merge uses the key from the first array. As usual, if you want to control such detail, there is array_merge_complete.

    call array_show('A');
 +-------------+-----------+------------+
 | array_index | array_key | "A" values |
 +-------------+-----------+------------+
 |           0 | b         | cow        |
 |           1 | c         | ox         |
 |           2 | d         | donkey     |
 |           3 | a         | dog        |
 +-------------+-----------+------------+

 call array_show('B');
 +-------------+-----------+------------+
 | array_index | array_key | "B" values |
 +-------------+-----------+------------+
 |           0 | z         | hare       |
 |           1 | y         | rabbit     |
 |           2 | w         | mole       |
 |           3 | a         | beaver     |
 +-------------+-----------+------------+

 select array_merge('A','B','C');
 +--------------------------+
 | array_merge('A','B','C') |
 +--------------------------+
 | C                        |
 +--------------------------+

 call array_show('C');
 +-------------+-----------+------------+
 | array_index | array_key | "C" values |
 +-------------+-----------+------------+
 |           0 | b         | cow        |
 |           1 | c         | ox         |
 |           2 | d         | donkey     |
 |           3 | a         | dog        |
 |           4 | z         | hare       |
 |           5 | y         | rabbit     |
 |           6 | w         | mole       |
 +-------------+-----------+------------+

Notice that 'C' does not have 8 elements, but just 7. That's because key 'a' was present in both 'A' and 'B' source arrays. The conflict was resolved taking the value from the first array ('dog' from array 'A'). To change this behaviour, you could have done this:

    select array_merge_complete('A','B','C','use_second');
 +------------------------------------------------+
 | array_merge_complete('A','B','C','use_second') |
 +------------------------------------------------+
 | C                                              |
 +------------------------------------------------+

 call array_show('C');
 +-------------+-----------+------------+
 | array_index | array_key | "C" values |
 +-------------+-----------+------------+
 |           0 | b         | cow        |
 |           1 | c         | ox         |
 |           2 | d         | donkey     |
 |           3 | a         | beaver     |
 |           4 | z         | hare       |
 |           5 | y         | rabbit     |
 |           6 | w         | mole       |
 +-------------+-----------+------------+

appending

Appending is a variation of merging. Instead of creating a new array, an existing one gets the contents of another. Conflicting keys are resolved in favour of the second array.

    select array_append('A','B');
 +-----------------------+
 | array_append('A','B') |
 +-----------------------+
 | A                     |
 +-----------------------+

 call array_show('A');
 +-------------+-----------+------------+
 | array_index | array_key | "A" values |
 +-------------+-----------+------------+
 |           0 | b         | cow        |
 |           1 | c         | ox         |
 |           2 | d         | donkey     |
 |           3 | a         | beaver     |
 |           4 | z         | hare       |
 |           5 | y         | rabbit     |
 |           6 | w         | mole       |
 +-------------+-----------+------------+

filtering

Perl programmers know what grep is. It's a filter applied to an array with a regular expression, so that the resulting array is made of all elements matching such expression. array_grep does the same thing. Given an array and a regular expression, it will create an array containing all the elements from the source array where the elements match the pattern.

    call array_show('A');
 +-------------+-----------+------------+
 | array_index | array_key | "A" values |
 +-------------+-----------+------------+
 |           0 | b         | cow        |
 |           1 | c         | ox         |
 |           2 | d         | donkey     |
 |           3 | a         | dog        |
 +-------------+-----------+------------+

 select array_grep('A','B','^d');
 +--------------------------+
 | array_grep('A','B','^d') |
 +--------------------------+
 | B                        |
 +--------------------------+

 call array_show('B');
 +-------------+-----------+------------+
 | array_index | array_key | "B" values |
 +-------------+-----------+------------+
 |           0 | d         | donkey     |
 |           1 | a         | dog        |
 +-------------+-----------+------------+

The regular expression is anything that MySQL built-in REGEXP operator recognizes as valid. In this example, only the elements starting by 'd' are selected.

piping

The astute reader must have noticed that all the functions mentioned in this section return the array name, and that all array routine require an array name as their first argument. This fact leads to the pleasant result that array functions can be piped together. Wherever you need an array name, you can use an array function returning an array name instead. The above operation about grep could have been done with just one query:

    call array_show( array_grep('A','B','^d'));
 +-------------+-----------+------------+
 | array_index | array_key | "B" values |
 +-------------+-----------+------------+
 |           0 | d         | donkey     |
 |           1 | a         | dog        |
 +-------------+-----------+------------+

Here the result of array_grep (which is 'B') is given as first argument to array_show. You can take one step further, and sort the elements in the same step:

    call array_show( array_sort(array_grep('A','B','^d'),'C'));
 +-------------+-----------+------------+
 | array_index | array_key | "C" values |
 +-------------+-----------+------------+
 |           0 | a         | dog        |
 |           1 | d         | donkey     |
 +-------------+-----------+------------+

Array 'B' (as result of array_grep) is passed to array_sort, and its result, 'C', is used by array_show. If you think that this looks very much like Perl array syntax, where you can pipe grep, sort and so on, you are right. Since MYSQL SP syntax is so primitive, I wanted to add some perlish power to it. I hope I succeeded.

Monday, December 19, 2005

Dumping GIS data

Today I needed to dump a table containing GIS data and I could not find a quick way of exporting such data into a WKT format. Since I had also some more small problem with my data, I called MySQL technical support (I have a MySQL Network account) and I got the small problem solved, but the dump seems to be still a pending issue.
If you are used to dump data often, you may know that mysqldump has two useful options: --hex-blob, which will dump binary data in hexadecimal format, and --compatible=target, which will adjust the dump format to the requested target. However, none of these options has any effect on GIS data. It is always dumped in binary format, hardly compatible with any other DBMS.

So I cooked up an alternative way of dumping such data.
First, I created a stored function to create the query I needed


delimiter //

drop function if exists gis_dump_query //

create function gis_dump_query
(
 p_db_name     varchar(50),
 p_table_name  varchar(50),
 p_file_name   varchar(100)
)
returns text
reads sql data
begin
 declare done            boolean default false;
 declare is_first        boolean default true;
 declare col_list        text;
 declare cname           varchar(50);
 declare ctype           varchar(50);
 declare column_descr    varchar(100);
 declare get_cols        cursor for
     select column_name, data_type
     from
         information_schema.columns
     where
             table_schema= p_db_name
             and
             table_name = p_table_name;
 declare continue        handler for not found
     set done = true;

 set col_list = '';
 open get_cols;
 GET_COLUMNS:
 loop
     fetch get_cols into cname,ctype;
     if (done) then
         leave GET_COLUMNS;
     end if;
     if (is_first) then
         set is_first = false;
     else
         set col_list = concat(col_list, ',');
     end if;
     set column_descr = cname;
     if is_geometry(ctype) then
         set column_descr = concat('astext(',cname, ')');
     end if;
     set col_list = concat(col_list, column_descr);
 end loop;
 return concat('SELECT ', col_list, ' INTO OUTFILE ', "'",
     p_file_name, "'",
     ' FIELDS TERMINATED BY ";" OPTIONALLY ENCLOSED BY "''" ESCAPED BY ''\\\\'' ',
     ' FROM ', p_db_name, '.', p_table_name);
end //

drop function if exists is_geometry //
create function is_geometry (ctype varchar(50))
returns boolean
deterministic
begin
 return ctype in (
     'geometry',             'polygon',      'linestring',
     'point',                'curve'   ,     'surface',
     'geometrycollection',   'multipoint',   'multicurve',
     'multilinestring',      'multisurface', 'multipolygon');
end//

delimiter ;

After that, I had a quick way of exporting data. All I have to do is a select:

desc test_gis;
+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| id         | int(11)      | NO   | PRI |         |       |
| sometext   | varchar(100) | YES  |     | NULL    |       |
| somemore   | varchar(100) | YES  |     | NULL    |       |
| gis1       | polygon      | NO   | MUL |         |       |
| gis2       | linestring   | NO   | MUL |         |       |
+------------+--------------+------+-----+---------+-------+

SELECT gis_dump_query(database(),'test_gis','/path/dumpfile.csv') as q\G
*************************** 1. row ***************************
q: SELECT id,sometext,somemore,astext(gis1),astext(gis2) INTO OUTFILE '/path/dumpfile.csv' FIELDS TERMINATED BY ";" OPTIONALLY ENCLOSED BY "'" ESCAPED BY '\\'  FROM mydb.test_gis

and I get a ready to use query to export my data. But of course, instead of cutting and pasting, I could go the whole nine yards, with another simple stored procedure.

drop procedure if exists dump_gis_table //

create procedure dump_gis_table (
 p_db_name     varchar(50),
 p_table_name  varchar(50),
 p_file_name   varchar(100)
)
reads sql data
begin
 set @_dump_query = gis_dump_query(p_db_name, p_table_name, p_file_name);
 prepare dump_query from @_dump_query;
 execute dump_query;
 deallocate prepare dump_query;
 set @_dump_query = null;
end //

Now, the task is even simpler:

call dump_gis_table(database,'test_gis','/path/dumpfile.csv');

Monday, December 12, 2005

MySQL 5 general purpose routine library - IV: Syntax helpers and named parameters

(Fourth part of the MySQL 5 general purpose routine library presentation.)

SYNTAX HELPERS

No matter how skilled you are as a programmer, when dealing with a rich language (or a rich library),
there is little chance that you will remember the syntax of all the routines, especially if such routines need more than two parameters each.
This is even truer for MySQL stored routines, where the language does not allow niceties like routines with variable number of parameters or default values for parameters that are not included in the call list.
I solved the problem storing the syntax in a couple of tables, and having a few functions returning such syntax for each routine and some procedures to list routines by a regular expression.
Let's start with these last ones.
The most general procedure is my_routines(pattern)
Here are all routines having "array_get" in their names:

    call my_routines('array_get');
  +----------------+--------------------------+--------------+
  | routine_schema | routine_name             | routine_type |
  +----------------+--------------------------+--------------+
  | my_db          | array_get                | function     |
  | my_db          | array_get_key_by_index   | function     |
  | my_db          | array_get_value_by_index | function     |
  | my_db          | array_get_value_by_key   | function     |
  +----------------+--------------------------+--------------+

And here are all the routines outside the "arrays" section that use arrays.

    call my_routines('^[^a].*array');
  +----------------+------------------------------+--------------+
  | routine_schema | routine_name                 | routine_type |
  +----------------+------------------------------+--------------+
  | my_db          | for_each_array_item          | procedure    |
  | my_db          | for_each_array_item_complete | procedure    |
  | my_db          | for_each_array_item_simple   | procedure    |
  +----------------+------------------------------+--------------+

To get the list of all routines, then

    call my_routines('.');

If I want only functions, then I use my_functions, instead of my_routines, and if I want the list of procedures, there is my_procedures.
You can see the detailed syntax for a routine with fsyntax (= function syntax) and psyntax (= procedure syntax). Why two separate routines for what seems to be a unique job? Because a function and a routine can have the same name (I said some reasoning about this in the user guide for arrays). Anyway, here are some examples:

    select fsyntax('array_get')\G
  *************************** 1. row ***************************
  fsyntax('array_get'):
   ** function my_db.array_get

      function array_get
          p_array_name     varchar(30),
          p_array_ndx_key  varchar(30)
      --
      retrieves a value from array p_array_name.
      If p_array_ndx_key is a number, then array_get_value_by_index is used,
      otherwise it is treated as a key, and array_get_value_by_key is used.
      (simulates Perl hashes or PHP arrays)
      --
      returns the array value
      --
      see also: array_get_value_by_index, array_get_value_by_key

  select psyntax('array_clear')\G
  *************************** 1. row ***************************
  psyntax('array_clear'):
   ** procedure my_db.array_clear

      procedure array_clear
          p_array_name    varchar(30),
      --
      removes all items from a given array
      --
      returns 1 on success, 0 on failure (if p_array_name does not exist)
      --
      see also: array_drop

      --
      sets variable @array_clear.

The latter case is an example of what I was saying. You can have array_clear as a function and array_clear as a procedure, with separate (albeit similar) syntax pages.

USING NAMED PARAMETERS

So far, it was the easy part. It may seem a long job writing all these reference pages, but it is not hard. The difficult part is putting the command together, making sure that you are using the right number of parameters in the correct order.
This is something that makes me love Perl, for instance, where I can write routines that accept named parameters. For example, the for_each_counter routine could be called in Perl as

    for_each_counter(
          counter_start => 1,
          counter_delta => 1,
          counter_end   => 10,
          sql_command   => 'SELECT $N * $N ');

Notice that the order of parameters does not matter, since they are recognized by name, rather than by position. My main goal was to emulate this calling syntax as much as I could, and I came up with simple_sp(routine_to_call, array_parameters). With this, I need to enter the parameters into an array, and then call the routine through simple_sp.

    select array_set('test', 'sql_command',   'SELECT $N * $N '),
         array_set('test', 'counter_start', 1),
         array_set('test', 'counter_end',   10),
         array_set('test', 'counter_delta'  1)\G

  call simple_sp('for_each_counter', 'test');

The result is very much like the one I was looking for. I can set the parameters, in the order that I remember them, and simple_sp will put them in the right order. Stop for a moment, now. I know you are asking "how does it do that?"
There was an additional part in the syntax stuff that I haven't mentioned yet. So far, we know that we have a list of all routines with their syntax. However, there is also a table containing, for each routine, a list of all parameters in the appropriate order. This list was a bit painful, because MySQL information_schema still lacks a PARAMETERS table, so I had to provide one of my own, and to avoid populating that table manually I had to use an external program (in Perl, if you must ask) that will read the SHOW CREATE {PROCEDURE|FUNCTION} statement and extract the parameters.
After that, I have a table where the parameters for each routine are stored in calling order. When I call simple_sp with my array of parameters, the routine creates another array with the parameters in the correct order, and fills its values from the corresponding keys in the first array. Then, it transforms the array into a string, the string is used to create a dynamic query, which is executed and everybody is happy.

Even easier named parameters

But I kept thinking about this problem, until I came up with the idea of exploiting the DBMS a bit more, and making it able to parse a list of named parameters similar to the one that Perl accepts. When I created array_from_pair_list, I knew that the solution was there:

    call simple_spl(
          'for_each_counter';
          'counter_start   => 1;
           counter_delta   => 1;
           counter_end     => 10;
           sql_command     => SELECT $N * $N ');

This simple_spl (=simple stored procedure list) actually works! The only drawback is that I need to set the default list separator to a semicolon (";") instead of a comma, because commas are quite common inside a SQL statement. In case you feel that the default list separator and pair separator are clashing with something in your statements, you can set them with the user variables @list_separator (default ";") and @pair_separator (default "=>").
Parting thought for the MySQL developers: wouldn't be good if this feature were included in MySQL language itself?

Friday, December 09, 2005

Creating triggers in MSQL 5.0 without SUPER privilege

Trigger grants blues

Triggers are a great feature, and used with care they can prove very useful.
Unfortunately, since they arrived quite late in the schedule, there is no specific privilege designed to grant users the power to create a trigger. Instead, as of 5.0.17, users require the SUPER privilege to create triggers. Which is kind of unconvenient, because the SUPER privilege also allows users to do more things than creating triggers, such as killing threads and setting global variables.
This it is understandable why a system administrator would be quite reluctant to give the SUPER privilege to (possibly) unknown users who have otherwise full control of one database. Think, for example, to a ISP, or to the great database provider db4free, for which this page is dedicated. In these services, you can get an account, which leads to full grants to one database. Within that database you can do almost anything, except creating triggers.
Now, if triggers are really needed and important, your only resort would be asking the system administrator to create it, and then to modify it if it turns out to be wrong, and to drop it when you no longer need it. That's probably a task that the system administrator is not willing to burden himself with.

What then?

I remember that a few years ago I had a similar need. I was the administrator of a large human resources and finance database for a international organization. I was the only one who could physically create new users, but only the Chief of Personnel (CPO) or her deputies could authorize such creation. I did not want to give root privileges to the CPO, because, as much as I trusted her judgement, she had no technical computer skills, and besides, I did not feel confortable sharing my root password with anyone or creating a user with root prvileges.
For a few weeks I was creating users, and it was a tedious work, because there were about twenty different profiles, depending on the read/write accesses on personnel and financial matters. Then I worked out a solution that made everybody's life easier.
I created a table of user creation requests, with a front end for the CPO and her deputies, where they could fill in the request for a new user or for modifying/removing existing ones. The table contained the user name and department, and the profile to which (s)he was to be allowed.
A cron job every 30 minutes checked such table, and it fulfilled the request using a template for each profile, generatng a password for the user, who was then informed by e-mail about the new account, with the CPO in CC.
The system worked very well, and it had the side effect that my vacations could be planned with more freedom than before.
Remembering this experience, I thought that the same device could be applied to triggers, and here it is how it works.

Bending the rules

Assuming an organization like db4free, we have several databases, each one assigned to one user, who does not have powers on any other DB. Currently, these users can't create triggers. Each user who wants to use triggers should create a table named trigger_request with three mandatory fields: trigger_name, which is the name of the trigger, trigger_body, containing the query to create the trigger, and done, a numeric field with default to zero.

CREATE TABLE `trigger_request` (
`trigger_name` varchar(50) NOT NULL,
`trigger_body` text,
`ts` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`done` int(11) default '0',
PRIMARY KEY  (`trigger_name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

There is a cron job every N minutes (the interval depending on how many users want to use this feature). My personal choice to implement this feature is Perl, but it can be easily done in any language. When the addtriggers program runs, it checks every database with a trigger_request table (the INFORMATION_SCHEMA helps here), and reads the table to see if there is a record where done is set to zero. For each record, it attempts to create the trigger, and reports the result to the user, by means of another table, which is created if needed, called trigger_answer

CREATE TABLE `trigger_answer` (
`trigger_name` varchar(50) NOT NULL,
`TS` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`result` text,
PRIMARY KEY  (`trigger_name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

Safety checks

Of course, the program I made does not execute blindly any query that is passed to it. Some sanitizing checks are performed before the query is given to the DBMS. First of all, it checks that it is a CREATE TRIGGER query, and it is created within the same database. Then, it checks that there are no references to tables in a different database.
If the query does not pass these checks, the user is informed that something unacceptable was passed. When the program is satisfied that nothing fishy was tried, it executes the query, reporting the result in trigger_answer.
Let's see a practical example. Assume that I have a database called "charmer" and I put the following in my trigger_request table.

select * from trigger_request\G
*************************** 1. row ***************************
trigger_name: city_bi
trigger_body: create trigger city_bi /*THIS SHOULD RETURN AN ERROR*/
     ts: 2005-12-09 12:58:01
   done: 0
*************************** 2. row ***************************
trigger_name: city_ai
trigger_body: create trigger city_ai after insert on City  for each row set @XY = 1
         /*this should be OK*/
     ts: 2005-12-09 12:58:01
   done: 0
*************************** 3. row ***************************
trigger_name: city_bu
trigger_body: grant all on *.* to TheWiz  identified by "gotcha" with grant option
         /*THIS SHOULD BE REJECTED*/
     ts: 2005-12-09 12:58:01
   done: 0
*************************** 4. row ***************************
trigger_name: city_au
trigger_body: create trigger city_au after update on world.City for each row set @city_au = 1
         /*THIS SHOULD BE REJECTED*/
     ts: 2005-12-09 12:58:01
   done: 0
*************************** 5. row ***************************
trigger_name: city_bd
trigger_body: create trigger city_bd before delete on charmer.City for each row set @city_bd = 1
         /*THIS SHOULD BE OK*/
     ts: 2005-12-09 12:58:01
   done: 0
5 rows in set (0.00 sec)

Running addtriggers against this dataset will produce two triggers and tree errors. The first record is accepted, because starts with "CREATE TRIGGER trigger_name", but will prduce an error because the SQL is incomplete. The second record is OK, and will execute without fuss. The third record is rejected because it is not a "CREATE TRIGGER" statement. The fourth one is rejected because it references a table in a different database, and the fifth one is accepted because, even though it references a table with the "database.table" syntax, it is using the authorized database name.
Thus, the user will see the following in trigger_answer

select * from trigger_answer\G
*************************** 1. row ***************************
trigger_name: city_bi
     TS: 2005-12-09 13:03:05
 result: You have an error in your SQL syntax; check the manual that corresponds to your
         MySQL server version for the right syntax to use near ''
*************************** 2. row ***************************
trigger_name: city_ai
     TS: 2005-12-09 13:03:05
 result: OK
*************************** 3. row ***************************
trigger_name: city_bu
     TS: 2005-12-09 13:03:05
 result: SQL command not recognized as a CREATE TRIGGER
*************************** 4. row ***************************
trigger_name: city_au
     TS: 2005-12-09 13:03:05
 result: REJECTED: Attempt at using database world from database charmer
*************************** 5. row ***************************
trigger_name: city_bd
     TS: 2005-12-09 13:03:05
 result: OK
5 rows in set (0.00 sec)

The source code for addtriggers is quite long. You can see it online at MySQL General Purpose Stored Routines Library CVS tree.

(Update) Modifying triggers

To drop a trigger, set the trigger's body in trigger_request to NULL or to an empty string. To modify an existing trigger, change the trigger's body, and set the done column to zero. The program will first drop the trigger (because it sees that it exists, from the INFORMATION SCHEMA tables) and will execute the trigger creation query.

A final plea

Does anybody see any security risks in this paradigm? Any suggestion to strenghten the checking mechanism in the cron job program would be welcome.

Sunday, December 04, 2005

MySQL 5 general purpose routine library - III : Using "FOR EACH" loops

(Third part of the MySQL 5 general purpose routine library presentation.)
How many times, while using a MySQL client, you were facing a problem that seemed quite simple to solve, if only you could perform a loop of some sort without creating an external script?
I often found myself doing things like

  $ for $X in `mysql -B -N -e 'show tables'` ; \
  do perl -pe "something with $X" |  \
  mysql -e "something else with $X" ; done

Or perhaps I write a quick Perl script to achieve the result I need. But how much better it would be if I could use a FOR EACH loop from mysql command line! Hence, when stored procedures became available, I started planning this kind of features, and now I can say that I have made my life simpler with these powerful language extensions.

complete, normal, simple.: Each group of routines has one "complete" method, i.e. a routine with as many details as it is humanly bearable. This routine is called for_each_SOMETHING_complete. Then there is another routine that does the same thing, but without so many arguments, and it is called for_each_SOMETHING. In some cases, it is possible to reduce the number of parameters even more, an then you'll find a routine called for_each_SOMETHING_simple. The reference guide has the details of what you can do with each one of them.
placeholders: Within the loop, there are elements that you may want to use in your SQL statements. For this reason, in almost each routine of this class you are provided with placeholders that refer to such elements. For example, when using a counter, there is a placeholder for the counter itself, when looping through a table list, you have placeholders for the table name, the database name, the engine type, and so on. Again, the reference guide tells you which placeholders are available for each routine.
user variables: Sometimes, there are too many details that you should take care of, but there are already so many parameters in your routine, that adding more would make its convenience fade away. Instead of adding parameters for every small detail of your routine, we prefer passing these less important arguments as user variables. For example, one routine allows up to 6 different placeholders, and every one of them can be customized. Rather than adding 6 more parameters, you can change one or more of those placeholders by setting the appropriate user variable, as explained in the reference guide.

For each table

The idea of this routine is to provide a database name, and to execute an SQL command for each table of the database, possibly with a filter to choose for which tables the task should be carried out. This option is easier if we take into account the tables available in MySQL 5 INFORMATION_SCHEMA database. Using the information listed in INFORMATION_SCHEMA.TABLES, it is easy to filter tables by name, type, engine, number of rows.
The easiest call is for_each_table(database_name, condition_text, sql_command). The first parameter is self explanatory. The second one is a condition that applies to INFORMATION_SCHEMA.TABLES. For example, if you want to limit the command applicability to the tables with a name starting by 'emp', then set condition_text to 'table_name like "emp%"'. Notice my usage of single and double quotes. Since you are passing a string, every quote inside the string must be escaped, or you should use different quote types. The last parameter, sql_command, is the query you need to execute for each table. This query, to be really effective, needs to use some table attributes, and this feat can be achieved by using placeholders.
A simple example:

    call for_each_table(
 'myoffice',
 'table_name like "tmp%"',
 'drop table $D.$T');

WARNING! You can really do this kind of commands, and you can drop all tables in your database in a second. Use this function with extreme care!

Another example

    call for_each_table(
 'myoffice',
 'table_name like "emp%"',
 'create table another_db.$T like $D.$T' );

This one will create a copy of all tables with name starting by 'emp' in another_db.
If your requirements are more complex, then you may want to try for_each_table_complete, which works like [RP]for_each_table, but has three additional parameters: a query to execute before the main command, another to execute after the main command, and a modifier telling if such queries should be executed before and after the loop ('once') or for each command in the loop ('many'). Let's see an example first:

    call for_each_table_complete(
 'world',
 'table_name !="mycounts"',
 'insert into world.mycounts select "$T", count(*) from $D.$T ',
 'create table if not exists  world.mycounts(t varchar(50), nrows int)',
 'select sum(nrows) from world.mycounts',
 'once' );

The first three parameters work just like for_each_table. The fourth parameter is a query to be executed before the main command. As you see, it creates a table used by the main command to insert some values from each table. The fifth parameter is a query to get the summary of the operation. The last parameter says that these two queries should be executed just once, not within the loop.
Let's see an example of these two different behaviours:

    call for_each_table_complete(
 'world',                                                 -- database
 '',                                                      -- no filter
 'select @SUM := @SUM + ( select count(*) from $D.$T )',  -- main command
 'select @SUM := 0',                                      -- before
 'select  @SUM',                                          -- after
 'once' );                                                -- modifier

First, let's call the routine with a 'once' modifier, and see what happens:

+-----------+
| @SUM := 0 |
+-----------+
|         0 |
+-----------+

+----------------------------------------------------+
| @SUM := @SUM + ( select count(*) from world.City ) |
+----------------------------------------------------+
|                                               4079 |
+----------------------------------------------------+

+-------------------------------------------------------+
| @SUM := @SUM + ( select count(*) from world.Country ) |
+-------------------------------------------------------+
|                                                  4318 |
+-------------------------------------------------------+

+---------------------------------------------------------------+
| @SUM := @SUM + ( select count(*) from world.CountryLanguage ) |
+---------------------------------------------------------------+
|                                                          5302 |
+---------------------------------------------------------------+

+------+
| @SUM |
+------+
| 5302 |
+------+

What happened was that the variable @SUM was set before the loop, and it got increased by the number of rows in each table. The final query gave us the total. Let's see what happens if we call the same routine with a 'many' modifier:

    call for_each_table_complete(
 'world',                                                 -- database
 '',                                                      -- no filter
 'select @SUM := @SUM + ( select count(*) from $D.$T )',  -- main command
 'select @SUM := 0',                                      -- before
 'select  @SUM',                                          -- after
 'many' );                                                -- modifier

+-----------+
| @SUM := 0 |
+-----------+
|         0 |
+-----------+

+----------------------------------------------------+
| @SUM := @SUM + ( select count(*) from world.City ) |
+----------------------------------------------------+
|                                               4079 |
+----------------------------------------------------+

+------+
| @SUM |
+------+
| 4079 |
+------+

+-----------+
| @SUM := 0 |
+-----------+
|         0 |
+-----------+

+-------------------------------------------------------+
| @SUM := @SUM + ( select count(*) from world.Country ) |
+-------------------------------------------------------+
|                                                   239 |
+-------------------------------------------------------+

+------+
| @SUM |
+------+
| 239  |
+------+

+-----------+
| @SUM := 0 |
+-----------+
|         0 |
+-----------+

+---------------------------------------------------------------+
| @SUM := @SUM + ( select count(*) from world.CountryLanguage ) |
+---------------------------------------------------------------+
|                                                           984 |
+---------------------------------------------------------------+

+------+
| @SUM |
+------+
| 984  |
+------+

Now the variable gets set to 0 for each loop, and thus the final value we get is not the total, but just the number of rows of the last table processed.
If you look at the reference guide for for_each_table_complete, you'll see that there are many more things you can do. Do some experiment and you'll realize that this routine, coupled with some imagination, can do really wonderful things.

For each counter

This is the classic "for loop" that you can find in programming language such as C, Perl, Java. You have a counter with a starting and ending value and a delta value to increment the counter. As in the for_each_table routine, you have a for_each_counter and a [RP]for_each_counter_complete to play with. The principle is the same: the normal routine gets just the minimum parameters to get the job done (counter_start, counter_end, counter_delta, sql_command) and the "complete" version has the same additional parameters seen for for_each_table_complete (sql_before, sql_after, {'once', 'many'} ).
Here is some simple usage of this routine:

 call for_each_counter(1,4,1, 'create table test.test_$N(i int)');
Query OK, 0 rows affected (0.06 sec)

show tables from test like 'test%';
+------------------------+
| Tables_in_test (test%) |
+------------------------+
| test_1                 |
| test_2                 |
| test_3                 |
| test_4                 |
+------------------------+

call for_each_counter(1,400,1, 'insert into test.test_1 values ($N)');
Query OK, 0 rows affected (0.08 sec)

select count(*) from test_1;
+----------+
| count(*) |
+----------+
|      400 |
+----------+

Here's another common example from real life experience. One of the most frustrating problems is when you have a table with date values, and you are supposed to have values for every date in a given period. To check that this is correct, you should either use an external program, or, if you have a for_each_counter routine, you can get a quick solution:

 create table test.mydates(d date);

call for_each_counter(1,365,1,
 'insert into test.mydates values ("2005-01-01" + interval $N day)');

select d from mydates limit 10;
+------------+
| d          |
+------------+
| 2005-01-02 |
| 2005-01-03 |
| 2005-01-04 |
| 2005-01-05 |
| 2005-01-06 |
| 2005-01-07 |
| 2005-01-08 |
| 2005-01-09 |
| 2005-01-10 |
| 2005-01-11 |
+------------+
10 rows in set (0.01 sec)

Now all you need to do to check that your original table is complete, is a RIGHT OUTER JOIN with mydates and you get a list of the missing dates very quick.

   SELECT
 d
FROM
 mydates
 LEFT OUTER JOIN my_real_table on (d=date_column)
WHERE
 my_real_table.PK_column is null;

For each dataset row

This is possibly the most complicated routine in this collection. It allows you to walk through a dataset and execute a query for each row, referencing in your query up to three values per row. Does it sound complicated? It is. But it is also powerful, and it works. The syntax for for_each_table_value is

  call for_each_table_value(
 database_name,
 table_name,
 wanted_column1,
 wanted_column2,
 wanted_column3,
 search_condition,
 sql_command);

database_name and table_name are just the identifiers of the place from where we get the values. wanted_column1, wanted_column2, and wanted_column3 are the columns you want to get from the table to use in your query. search_condition is the WHERE clause to select which rows we should get from the table. And sql_command is the query where you can use placeholders to reference database and table ($D, $T), each of the three columns ($I1, $I2, $I3) and a counter ($N).
Here's an example. It may seem silly, but this is not something you can easily do in normal SQL.

  call for_each_table_value(
 'world',               -- database
 'City',                -- table
 'District',            -- first column
 'Name',                -- second column
 NULL,                  -- third (not needed) column
 'CountryCode="ITA" limit 10', -- filter
 'create table world.$I2(`$I1` varchar(50))');

This query will create a table named after a city, having a column named after the city's district.

 show tables;
+-----------------+
| Tables_in_world |
+-----------------+
| Bari            |
| Bologna         |
| Catania         |
| City            |
| Country         |
| CountryLanguage |
| Firenze         |
| Genova          |
| Milano          |
| Napoli          |
| Palermo         |
| Roma            |
| Torino          |
+-----------------+

desc Firenze;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| Toscana | varchar(50) | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+

desc Genova;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| Liguria | varchar(50) | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+

Using the same method, you can drop those tables at once:

  call for_each_table_value(
 'world',
 'City',
 'Name',NULL,NULL,
 'CountryCode="ITA" limit 10', 'drop table world.$I1');

show tables;
+-----------------+
| Tables_in_world |
+-----------------+
| City            |
| Country         |
| CountryLanguage |
+-----------------+

For each array item

If you have seen the documentation about the array routines, you knew that this one was coming. Executing a loop for each element of an array is a common programming need, and this was in my mind since I started planning the array routines.
The simplest case is for_each_array_item_simple, which requires an array name and a sql command. Then the routine walks through the array items, executing the SQL command for each one of them, allowing the SQL to reference the item and the counter using placeholders.
So, for instance, if your array 'colors' contains three values: 'red', 'blue', and 'white', calling

    for_each_array_item_simple('colors','select "$N","$I"')

will execute three times, printing each array item to the screen.
A more advanced routine, for_each_array_item, will take two more parameters to mark the minimum and maximum array index to use in the loop.
As you may guess, having seen the previous routines, for_each_array_item_complete will also take parameters for queries to be executed before and after the main command.

For each list item

Last, but not least, routines for treating lists are the most dynamic of them all. You know that you can execute a query for each item of an array, and you know (if you don't, see array_from_list) that you can create an array from a comma separated list. Therefore, you can execute an SQL command from a list converted into an array. Rather than forcing the user to write

    call for_each_array_item(
     array_from_list('red,blue,white','colors'),'select "$N","$I"' );

there is a routine for_each_list_item that does just that behind the scenes.

    call for_each_list_item('red,blue,white', 'select "$N","$I"' );

which does the same thing, with less effort.
As usual, there is also a for_each_list_item_complete to do additional wonders, as the reference guide will tell you.

Monday, November 28, 2005

MySQL 5 general purpose routine library - II: Using arrays

As you know, MySQL 5 stored routines language does not provide any sort of array structure. While this is not a problem when dealing with normal SQL, since arrays are a symptom of non-normalized data, when using a programming language, it becomes quite a strain, and the programmer is often struggling to achieve results that would be really natural and simple using an array, but are made unnecessarily hard by lack of them.
Thus comes into existence this library of routines that emulate arrays in MySQL 5.

Implementation

Arrays, as part of the MySQL General Purpose Stored Routines Library are implemented as datasets, all belonging to one table, and identified by a unique name. The details of the implementation are hiddenby the interface, and you should not worry a lot about it.

General concepts about the array library

Whenever possible, each routine is implemented both as a procedure and as a function. Only when a function returns an array value or an array property, then there is no corresponding procedure. This is a design choice, so that you are free of calling the routine in the most convenient way. When using a procedure that has a function synonym, a user variable with the same name of the procedure is set with the function return value. For instance, calling function array_create() you get the array ID.

If you choose to call the procedure instead, then @array_create will contain the array ID.

Array creation

You can create an array using array_create(name, size). 'Name' is any string not yet used as an array name. If you want to be sure that the name is not taken, use the function array_exists(name). 'size' is the number of items that you want to create (they will be set to NULL).
Using a zero size will create an empty array.


call array_create('myfirst',0);

You can also create an array from string containing a comma separated list, with the function array_from_list(string, array_name).

If you want to use a different separator, then you may want to look at array_from_list_complete(string, array_name, separator).

   call array_from_list('one,two,three', 'myarray');
set @list = 'Doc:Grumpy:Happy:Sleepy:Bashful:Sneezy:Dopey';
call array_from_list_complete(@list, 'snow_white_friends', ':');

array_from_list returns the array name, so it's easy to use it combined with a routine that requires an array:


select array_size(array_from_list('one,two,three','myarray'));

Inserting data

You can treat these library arrays in four different ways:

normal arrays: Thus you can insert a value by index.
array_set_value_by_index(array_name,index,value) does this, with its universal alias array_set(array_name, index, value);
associative arrays: Also known as hashes in Perl or maps in C++. You can set the value by key rather than by index.
array_set_value_by_key(array_name, key, value) is the routine for this, and array_set(array_name,key, value) can be an alias for that routine as well. Notice that when you set a value by key, the array will still have its automatic index. So you can alway access an array by index, even if you set the values by key.
queues: First in, first out, is the principle of queues. If you want to do this, insert your values using array_unshift(array_name, value) which adds a value at the beginning of the array, and retrieve them using array_shift(array_name), which returns (and deletes) the last element in
the array.
Notice that array_unshift is the most expensive method in this collection, because all the array_indexes need to be decreased by 1.
stacks: Last in, first out. This is one of the most used device in programming. array_push(array_name,value) adds an item to the end of an array, and array_pop(array_name) retrieves the last item and deletes it.

retrieving data

No matter how you entered data into an array, you can always retrieve your items by index. The lowest index is zero, the highest one is returned by array_max_index(array_name). Knowing this, you can quickly create a loop to walk through an array, using array_get_value_by_index(array_name, index). If you inserted your values by key, you can retrieve them in the same way. array_get_value_by_key(array_name, key) does the job. For your convenience, there is also a array_get(array_name, key_or_index) function that you can use to get values by key or index. If you want to set a key for an existing array value, you can use array_set_key_by_index(array_name, index, key).

information on arrays

There are some convenience routines to get information about arrays.

array lists: to see the list of all available arrays, use array_list(), that prints a
list of all public arrays.
Private arrays are all arrays with a name starting with an underscore character
('_'). These arrays are usually created and used inside a procedure and should
not be shared.
If you want to see them as well, use array_full_list().
array contents: To see all the items in an array, call [RP]array_show(array_name) .

advanced array usage

inserting several items at once

The array routines were designed to allow flexibility in insertion. In several programming languages you can fill an array with just a statement, be it a declaration or an action. Of course, the quickest way is to use the array_from_list function, but if you have an already existing array and you need to add more items, you can do one of the following:

more array_set functions in a SELECT statement


SELECT array_set('name', 'key1', 'value1'),
array_set('name', 'key2', 'value2'),
array_set('name', 'key3', 'value2') \G

chain insertion calls using array_setn

array_setn returns the array name, so it can be used in each statement where an array name is required.


select array_setn( array_setn( array_setn(
                  'name', 'key1', 'value1'),
                          'key2', 'value2'),
                          'key3', 'value3');

Populate an array from a dataset

To achieve this result, we need the help from another part of the library. For example, if we want to insert into an array the list of USA cities as keys with their population as value, we can do this:


call for_each_table_value(
'world',               -- the database
'City',                -- table name
'Name',                -- the first column to use
'Population',          -- the second column to use
null,                  -- no need for a third column: null
'CountryCode="USA"',   -- filter clause
'call array_set("USA_cities","$I1","$I2")');

In this statement we invoke a for_each_table_value engine to execute our query for each row in table City where CountryCode = 'USA'. The query to execute will call the array_set procedure, using the columns as arguments ($I1 and $I2). As a result, we'll have an array of 264 rows, containing all USA cities with their population.


call array_show('USA_cities');
+-------------+-------------------------+-------------+
| array_index | array_key               | array_value |
+-------------+-------------------------+-------------+
|           0 | New York                | 8008278     |
|           1 | Los Angeles             | 3694820     |
|           2 | Chicago                 | 2896016     |
|           3 | Houston                 | 1953631     |
|           4 | Philadelphia            | 1517550     |
   ...   ...                       ...
|         258 | Fall River              | 90555       |
|         259 | Kenosha                 | 89447       |
|         260 | Elgin                   | 89408       |
|         261 | Odessa                  | 89293       |
|         262 | Carson                  | 89089       |
|         263 | Charleston              | 89063       |
+-------------+-------------------------+-------------+
264 rows in set (0.20 sec)

If your dataset is reasonably short, there is another trick you can pull, without using a 'for each' routine. Let's assume that you want to put into an array just the districts for Italy, taken from table 'City' in the example world database.


select array_from_list(
(select group_concat(distinct District)
from world.City where CountryCode='ITA'),
'italian_districts')\G


call array_show('italian_districts');
+-------------+-----------+----------------------+
| array_index | array_key | array_value          |
+-------------+-----------+----------------------+
|           0 | NULL      | Latium               |
|           1 | NULL      | Lombardia            |
|           2 | NULL      | Campania             |
|           3 | NULL      | Piemonte             |
|           4 | NULL      | Sicily               |
|           5 | NULL      | Liguria              |
|           6 | NULL      | Emilia-Romagna       |
|           7 | NULL      | Toscana              |
|           8 | NULL      | Apulia               |
|           9 | NULL      | Veneto               |
|          10 | NULL      | Friuli-Venezia Giuli |
|          11 | NULL      | Calabria             |
|          12 | NULL      | Sardinia             |
|          13 | NULL      | Umbria               |
|          14 | NULL      | Abruzzi              |
|          15 | NULL      | Trentino-Alto Adige  |
|          16 | NULL      | Marche               |
+-------------+-----------+----------------------+

Well, a few are missing, but not because of our routine, which seems to work fine. Notice that if the list of entries is larger than 1024 bytes, group_concat will not work as expected, and you need to increase a system variable to get correct results. But for short lists this is a trick to keep in mind.
Even better, using array_from_pair_list, you can populate an array with a list of key/value pairs.


set group_concat_max_len = 8000; -- necessary, or group_concat can't work
                          -- well in this case
select array_from_pair_list(
                 (select group_concat( concat(Name,"=>",population ))
                  from world.City where CountryCode ='USA'),
                 'USA_cities')\G

And USA_cities will get the same contents we have seen when using the for_each_table_value method.

What else can you do with arrays? Wonderful things, as you'll see in the next articles about this library.

Sunday, November 27, 2005

MySQL 5 general purpose routine library - I : general concepts

A few days ago I started a forum thread that got quite a good acceptance.

There are rumours from the MySQL community that some sort of routine repository is going to be organized sooner or later, and after I heard them I almost dismissed my initial idea.
However, some people convinced me that starting such an initiative could always be beneficial for the community, and after all, if the official repository shows up, it would be easy to merge into it.

Therefore, here it goes. This is the first article of a series that will explain what I have made in the field of general purpose routines.

The project is available at savannah.org. The documentation is till in the making, so I will start with a general overview.

PURPOSE

This routine library will collect general purpose MySQL 5 stored
procedures and functions. What is to be considered "general purpose"?
Any routine that enhances the language itself or that can be useful
in an abstract database fits the bill.

Since its appearance, MySQL stored routines language has proved itself
useful and powerful, and yet it lacks many features that we have become
accustomed to in other languages, and we would like to see in MySQL::SP
as well. Stored routines that improve the language espressiveness and
usability fall in this category.

Furthermore, there are common tasks that can get accomplished by a stored
routine, and are not related to a particular database. Also these routines
are good candidates for this library.

CURRENT CATEGORIES

At the very beginning of this project, I have put some categories that got
populated faster than others, and thus they became, in my view, the ones with
the highest priority.

ARRAYS AND BASIC DATA STRUCTURES

Perhaps the biggest disappointment when MySQL::SP language was disclosed was
the lack of arrays. Not as a SQL type, of which frankly I can do without, as
it breaks normalization, but as a language component. Without arrays, and
without things like queues and stacks, I feel that my programming abilities
are severely cut down.
Thus this first component, which inmplements named arrays where you access
items by index, but that can also be treated like Perl hashes, where items
are accessed by name. Built-in with them are some stack and queue features,
so that an array is automatically a queue, if accessed with shift/unshift,
or a stack, if accessed with push/pop.
Further routines to transform arrays to strings and strings to arrays make
this library ready for prime time.

"FOR EACH" LOOPS

I don't know you, but I find myself doing quite often things like

$ for $X in `mysql -B -N -e 'show tables'` ;  \
    do perl -pe "something with $X" | \
    mysql -e "something else with $X" ; done

i.e. I get a list out of mysql, treat the list with some sort of filter,
and then feed it back to the database.
Sometimes I fire up a quick Perl script to do the whole task, but most of the
times the whole thing would be just a simple matter if MySQL had a FOR loop
that you can use from a regular SQL statement.
This component came out of this need. Now I have routines that accept
some parameters and perform for me a loop with a counter, or using a
dataset of table values, or walk through an array.

NAMED PARAMETERS

In the process of making the "for each" routines, I came to the disappointing
discovery that I had to use quite a lot of parameters if I wanted to achieve
some powerful results. When you have routines that need more than three
parameters, as every programmer knows, it is quite difficult to get the order
right, and many mistakes can come from this situation. Thus, I wanted to
implement a simple way of calling my complex routines with maned parameters,
something that in Perl I do like this:

function_name(
db_name    => 'world',
table_name => 'City',
expression => 'where CountryCode = "ITA"'
);

This way, if I misplace one of the parameters, it does not matter, because
they are retrieved by name. My routine does almost this ("almost", because
lack of language power prevents an exact implementation). It lets you pass
an array of named parameters (key + value) and a hidden mechanism will
sort out the parameters and pass them to the receiving routine in the
correct order.
The corresponding MySQL code would be:

  select array_set('mypar', 'db_name', 'world'),
      array_set('mypar', 'table_name', 'City'),
      array_set('mypar', 'expression', 'where CountryCode = "ITA"')\G
call simple_sp('function_name', 'mypar');

As a bonus, I now have an array of parameters that I can reuse when I need
them, perhaps after changing one or two parameter values.

SYNTAX HELPERS

As I said, my routines can grow their parameters quite fast, and remembering
all of them can be difficult even for their own author. Here comes another
component that registers all the routines syntax, and has a way of showing
you a quick excerpt of their usage.
This helper is also a key component of the routines that deal with named
parameters.
Just as an example the reference guide that is currently in the repository
was created by just these two calls:

call for_each_table_value_simple(
           database(),
           '_routine_syntax',
           'routine_name',
           'routine_type="function"',
           'select fsyntax("$I1")')\G   
call for_each_table_value_simple(
           database(),
           '_routine_syntax',
           'routine_name',
           'routine_type="procedure"',
           'select psyntax("$I1")')\G

Curious for more? Visit the project site and if you are interested in contributing, subscribe to the developers mailing list.

Friday, November 25, 2005

Hiding connection parameters with Perl/DBI

In a Perl application, using the widely known DBI module, a database connection is fired up by an instruction such as this one:

my $dbh = DBI->connect("DBI:mysql:test",
  "user_name", "my_secret_password", { RaiseError => 1 })
or die "something went wrong ($DBI::errstr)";

The first parameter is compulsory, and it's made of "dbi", which is the same for all connections, the database driver, which in our case is "mysql", and the database name, in our case "test". For a self respecting database, we should also specify the user name and password.
Nothing to complain about if this script is safely stored in your home directory, screened from prying eyes by the OS permissions.
But what happens when you need to distribute this script to somebody? You need to remember to change username and password to some dummy text, to avoid the burden of having to change our real credentials if by dumb chance we sent away our script with the password for everybody to see it.

Luckily, there is an easy solution to hide connection variables from others. You just need to store them in a configuration file, which can be kept in a guarded location and never distributed. So it will be visible to your script as long as it is within the boundaries of your system, but it will be invisible to others, and moreover it won't follow the script if you need to send it away.

It works like this. In your home directory (or the home directory of the user that will execute the script, for example, for a cron job) you create afile named ".my.cnf", containing the following:

[client]
user=user_name
password=my_secret_password

The file needs to be protected with the right privileges (chmod 600 .my.cnf) so that it can't be accessed by other users. Then, you change the connection statement like this:

my $dbh = DBI->connect("DBI:mysql:test"
  . ";mysql_read_default_file=$ENV{HOME}/.my.cnf",
  undef, undef) or die "something went wrong ($DBI::errstr)";

The "mysql_read_default_file" option will point to the file with the connection variables. Notice that instead of username and password there are two null values (undef). The file name, in this case, is built from the environment variable "$HOME" (in Unix like systems). You can also state it literally, but this trick will make the script usable from different users without modifications. The only thing each of them needs is a configuration file.
For more stuff on Perl/DBI, see DBI recipes.

Tuesday, November 08, 2005

Data on demand for data warehouses

Here is a presentation given at the Open Source Database Conference on November 8, 2005.

Abstract:

Data on Demand is a paradigm of storing data for a data warehouse with huge storage requirements. There are cases where creating statistical tables is not enough, and the detail data should be kept online for long periods. This technique has been applied in cases of MySQL DW associated with industrial production control, saving up to 80% storage space and improving performance and scalability.

Available presentation material:

* open document (OpenOffice 2) [1.3 MB]

* portable document format (pdf) [2.0 MB]

* openoffice 1.1 [1.3 MB]

* powerpoint [1.2 MB]

* HTML (S5 format) + source code [736 KB]

* on-line browsable presentation (s5 format)

Saturday, November 05, 2005

New data replication paradigms

MySQL success is due to not only to its efficiency, but also to its ease of administration.
And despite what its detractors say about lack of features (that was before
5.0, of course), there is a feature that is crucial for enterprise database
management, and MySQL has had it built-in since early version. I am talking
about data replication that can trasform a single server into a scalable and
highly reliable system with minimum effort and expense.

Recently, one more element was added to MySQL set of features. MySQL Cluster is an ambitious engine that turns your servers into a full fledged cluster that can be accessed from any node without worry. Any node is master and slaveat the same time. However, these features come at the price of some higherhardware requirement (MySQL Cluster is, as of today, an all-in-RAM database).

MySQL data transfer methods

There are differences between these two approaches. Choosing MySQL Cluster or standard replication is a trade off.

method	description	the good	the bad
replication	asynchronous transfer, based on one master and N slaves	Fast, reliable, robust, easy to set up, no special HW required, transaction-safe	needs some care in the code
MSQL Cluster	synchronous transfer, based on a cluster engine	fast, no code modification needed	require huge amounts of RAM, not easy to set up, minimal transaction support

If you start from scratch, it's easy to build your applications in such a way that they are replication-aware, thus benefitting from the low-cost high scalability of this paradigm.
OTOH, if you have an application that is already established (and perhaps complex and large), adapting it to the one-master-many-slaves architecture could be a problem.
There are external tools that promise to fill the gap, and offer clever solutions, mostly based on in-between filtering layers that give your application the illusion that there is just one server, while in truth there are several ones. Also these solutions cost you something, not only in mere money, (since they are - with limited exceprions - commercial tools), but also in features, because filters mostly are not transaction safe.

Different needs for data transfers

Aside from the mere method of data movement, there are other needs from the enterprise. Notice that this is not some stretch of my imagination, but the practical result of several years of consulting in various database-related matters.

One thing that most medium to large companies wanted is the reverse of replication, i.e. they want the master to be updated by many slaves. The need is clear when you think of several branches of a company in need of sending updates about sales records and store levels. Unfortunately, simple replication does not allow this feature, and even the cluster would not be bent easily to this need.
Another thing that companies want is a conditional replication, based on rules that are far beyond the limited configuration range of both replication and cluster. For example some companies wanted a table migrated from server A to server B, but to different tables, depending on a time frame.

Enter MySQL 5, and a new world of possibilities is open to the clever administrator.

New data transfers paradigms: FEDERATED tables

The reverse replication can be achieved with tables using the FEDERATED engine .
For each table you need to replicate from the branch database to the main house, you just need to establish a FEDERATED table and a couple of triggers.
Let's day, just for the sake of it, that you have the main server at big.central.mycompany.com, and the branches as far.away.mycompany.com (codename : FAW) and at fairly.distant.mycompany.com (code name: FDI). The main table you need to update at big.central lays in the "biz" database and looks like this:

CREATE TABLE all_sales
(
  branch_id       char(3) not null,
  sale_date       datetime not null,
  item_id         int not null,
  quantity        int not null,
  primary key (branch_id, sale_date, item_id)
) engine=innodb;

In each branch, you have one table for your branch sales, that you update every day with the consolidate data from your detail tables.

CREATE TABLE branch_sales
(
  sale_date       datetime not null,
  item_id         int not null,
  quantity        int not null,
  primary key     (sale_date, item_id)
) engine=innodb;

To get the data from branch_sales to the main central table, you build a mirror of such table in each branch, but instead of being InnoDB or MyISAM, it will be FEDERATED.

CREATE TABLE central_all_sales
(
  branch_id       char(3) not null,
  sale_date       datetime not null,
  item_id         int not null,
  quantity        int not null,
  primary key (branch_id, sale_date, item_id)
) engine=FEDERATED
CONNECTION='mysql://user:pwd@big.central.mycompany.com/biz/all_sales';

As a final touch, you add a few triggers to your branch table, to get the data moving:

create trigger branch_sales_ai after INSERT on branch_sales
for each row
  insert into central_all_sales
    values ( 'FAW', new.sale_date, new.item_id, new.quantity);

create trigger branch_sales_au after UPDATE on branch_sales
for each row
  update central_all_sales
    set sale_date   = new.sale_date,
        item_id     = new.item_id,
        quantity    = new.quantity
  where 
        sale_date   = old.sale_date
        and
        item_id     = old.item_id;

create trigger branch_sales_ad after DELETE on branch_sales
for each row
  delete from central_all_sales
  where
    branch_id   = 'FAW'
    and
    sale_date   = old.sale_date
    and
    item_id     = old.item_id;

Similar triggers (with a branch_id of 'FDI') will be created in the second branch. Now, every record in branch_sales is replicated to the central table without himan intervention. For extra safety, you may add some codeto check that the record does not exist yet, before inserting.

New data transfers paradigms: BLACKHOLE tables

Have you heard about the BLACKHOLE engine?
It's a simple concept. A table with a full description, that does not store anything. Every thing you send to such table is simply discarded, and that is no surprising, since there is no storage device associated with BLACKHOLE tables.
What's the use, then? You may ask. The use is to take advantage of the side effects of such table, rather that its storage capabilities. For once, records sent to a BLACKHOLE table will leave a trail in the binary log, thus allowing you to replicate its data from a server that does not have much storage available, but you need it just as an intermediate level to spread your data to several more slaves.
Moreover, and here we come to the point of our quest, BLACKHOLE tables can fire triggers, and we are going to use them to meet the second need we have outlined, i.e. the conditional update.
Let's say that big.central.mycompany.com want the sales records sent to different tables depending on the time of the day (it's a silly example, but bear with me for a while, just pretending it's something sensible). However, the application that deals with the data transfer is already working, and nobody dares to touch it, for fear of disrupting something. But a simple change can achieve the goal.
Let's change the structure of all_sales:

CREATE TABLE all_sales
(
  branch_id       char(3) not null,
  sale_date       datetime not null,
  item_id         int not null,
  quantity        int not null,
  primary key (branch_id, sale_date, item_id)
) engine=BLACKHOLE;

Just one word changed, and now the table will discard any data sent to it.
Before seeing what we can do with it, let's create three more tables.

CREATE TABLE all_morning_sales
(
  branch_id       char(3) not null,
  sale_date       datetime not null,
  item_id         int not null,
  quantity        int not null,
  primary key (branch_id, sale_date, item_id)
) engine=innodb;

CREATE TABLE all_afternoon_sales
(
  branch_id       char(3) not null,
  sale_date       datetime not null,
  item_id         int not null,
  quantity        int not null,
  primary key (branch_id, sale_date, item_id)
) engine=innodb;

CREATE TABLE all_late_sales
(
  branch_id       char(3) not null,
  sale_date       datetime not null,
  item_id         int not null,
  quantity        int not null,
  primary key (branch_id, sale_date, item_id)
) engine=innodb;

Here we have everything ready. Now let's add one trigger:

delimiter //

create trigger all_sales_ai after insert on all_sales
for each row
begin
  case
    when time(sale_date) between '08:00:00' and '13:00:00'
      then insert into all_morning_sales
        values (new.branch_id, new.sale_date, new.item_id,
        new.quantity);
    when time(sale_date) between '13:00:01' and '18:00:00'
      then insert into all_afternoon_sales
        values (new.branch_id, new.sale_date, new.item_id,
        new.quantity);
    else
      insert into all_late_sales
        values (new.branch_id, new.sale_date, new.item_id,
        new.quantity);
  end case;
end //

delimiter ;

Similarly, you can create a trigger for updates and deletes, and of course you can get creative and update several tables at once with different parts of the incoming data.
Here. Now you have some food for thought and launch yourselves into a new data transfer enterprise.

Tuesday, October 25, 2005

The power of dynamic queries

Dynamic queries (a.k.a. prepared statements) have been rather neglected in the list of new features available in MySQL 5.
However, they are simply priceless when you try to achieve something unusual.

Let's start with a common problem. You would like to get the list of your tables with the number of records for each one.
In the old times before MySQL 5, you would do SHOW TABLE STATUS. But that output is large, difficult to display, and it has a serious fault, since for InnoDB tables the number of rows is a calculated guess, rather than a precise count.
What you'd really want is this (using the world sample database)


SELECT 'City' AS `table`, COUNT(*) as records FROM City
UNION
SELECT 'Country', COUNT(*) FROM Country
UNION
SELECT 'CountryLanguage', COUNT(*) FROM CountryLanguage

+-----------------+---------+
| table           | records |
+-----------------+---------+
| City            |    4079 |
| Country         |     239 |
| CountryLanguage |     984 |
+-----------------+---------+
3 rows in set (0.00 sec)

But such a query is too much to write for three tables. Simply unbearable when we have 10 or more.
We can do better than that.


delimiter //

create database if not exists glob //

use glob //

drop function if exists get_create_table_list //

create function get_create_table_list ( db_name varchar(30) )
returns varchar(8000)
deterministic
begin
 declare query varchar(8000);
 set group_concat_max_len=8000;
 set query = (select group_concat(concat("SELECT '",
 table_name, "' AS `table`, COUNT(*) as records
 FROM ", table_schema,".", table_name) separator ' UNION ' )
 from information_schema.tables where table_schema = db_name  );
 return query;
end//

drop procedure if exists show_table_list //

create procedure show_table_list  ( db_name varchar(30) )
deterministic
begin
 set @query = get_create_table_list(db_name);
 prepare table_list from @query;
 execute table_list;
end //

delimiter ;

The first routine (get_create_table_list) will return a query like the one I made manually. The second routine is a procedure (show_table_list) that will get that query and execute it.


select glob.get_create_table_list('world') \G
*************************** 1. row ***************************
get_create_table_list('world'): SELECT 'City' AS `table`, COUNT(*) as records
 FROM world.City UNION SELECT 'Country' AS `table`, COUNT(*) as records
 FROM world.Country UNION SELECT 'CountryLanguage' AS `table`, COUNT(*) as records
 FROM world.CountryLanguage

call glob.show_table_list('world');
+-----------------+---------+
| table           | records |
+-----------------+---------+
| City            |    4079 |
| Country         |     239 |
| CountryLanguage |     984 |
+-----------------+---------+

call glob.show_table_list('mysql');
+---------------------------+---------+
| table                     | records |
+---------------------------+---------+
| columns_priv              |       0 |
| db                        |       3 |
| func                      |       0 |
| help_category             |      29 |
| help_keyword              |     325 |
| help_relation             |     548 |
| help_topic                |     405 |
| host                      |       0 |
| proc                      |      17 |
| procs_priv                |       0 |
| tables_priv               |       2 |
| time_zone                 |       0 |
| time_zone_leap_second     |       0 |
| time_zone_name            |       0 |
| time_zone_transition      |       0 |
| time_zone_transition_type |       0 |
| user                      |       5 |
| user_info                 |       0 |
+---------------------------+---------+

Isn't it a beauty?

Saturday, December 31, 2005

Friday, December 30, 2005

Tuesday, December 20, 2005

multiple users

copying

sorting

merging

appending

filtering

piping

Monday, December 19, 2005

Monday, December 12, 2005

SYNTAX HELPERS

USING NAMED PARAMETERS

Even easier named parameters

Friday, December 09, 2005

Trigger grants blues

What then?

Bending the rules

Safety checks

(Update) Modifying triggers

A final plea

Sunday, December 04, 2005

For each table

For each counter

For each dataset row

For each array item

For each list item

Monday, November 28, 2005

Implementation

General concepts about the array library

Array creation

Inserting data

retrieving data

information on arrays

advanced array usage

inserting several items at once

Populate an array from a dataset

Sunday, November 27, 2005

PURPOSE

CURRENT CATEGORIES

ARRAYS AND BASIC DATA STRUCTURES

"FOR EACH" LOOPS

NAMED PARAMETERS

SYNTAX HELPERS

Friday, November 25, 2005

Tuesday, November 08, 2005

Saturday, November 05, 2005

MySQL data transfer methods

Different needs for data transfers

New data transfers paradigms: FEDERATED tables

New data transfers paradigms: BLACKHOLE tables

Tuesday, October 25, 2005

Pages

Giuseppe Maxia

Like my posts? Buy me a book!

Julian's force

Yes, I am one of those

Sapere, saper fare, fare, far sapere

Blog Archive

Planet MySQL