Tuesday, February 28, 2006

puzzled by date functions

Today I was analyzing the behavior of the query optimizer, and I stumbled into a most curious case.
I created two functions returning the extremes of a date range, and I wanted to see how many times those functions were called when used in a WHERE clause So I added log tracing instructions to both of them. The result was quite surprising.
Let's set the environment first.
create database if not exists test ;
use test ;

delimiter //

drop function if exists today_start //
CREATE FUNCTION today_start() RETURNS datetime
begin
insert into mylog (routine_name) values ('today_start');
-- return current_date();
return '2006-02-28 00:00:00';
end //

drop function if exists today_end //
CREATE FUNCTION today_end() RETURNS datetime
begin
insert into mylog (routine_name) values ('today_end');
-- return current_date() + interval 1 day - interval 1 second;
return '2006-02-28 23:59:59';
end //

delimiter ;

drop table if exists t1;
create table t1 (
id int(11) NOT NULL auto_increment,
dt datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

drop table if exists mylog;
create table mylog (
id int not null auto_increment primary key,
routine_name varchar(20) not null,
TS timestamp
);

INSERT INTO `t1` VALUES
(1,'2006-02-28 11:19:35'), (2,'2006-02-28 11:19:38'),
(3,'2006-02-28 11:19:40'), (4,'2006-03-01 11:20:09'),
(5,'2006-03-01 11:20:11'), (6,'2006-03-01 11:20:12'),
(7,'2006-03-01 11:20:13');

select * from t1;
+----+---------------------+
| id | dt |
+----+---------------------+
| 1 | 2006-02-28 11:19:35 |
| 2 | 2006-02-28 11:19:38 |
| 3 | 2006-02-28 11:19:40 |
| 4 | 2006-03-01 11:20:09 |
| 5 | 2006-03-01 11:20:11 |
| 6 | 2006-03-01 11:20:12 |
| 7 | 2006-03-01 11:20:13 |
+----+---------------------+
Now I select all today's rows from t1.
select * from t1 where dt between today_start() and today_end();
+----+---------------------+
| id | dt |
+----+---------------------+
| 1 | 2006-02-28 11:19:35 |
| 2 | 2006-02-28 11:19:38 |
| 3 | 2006-02-28 11:19:40 |
+----+---------------------+
That's correct. Now, let's see how many times the function was called:
select * from mylog;
+----+--------------+---------------------+
| id | routine_name | TS |
+----+--------------+---------------------+
| 1 | today_start | 2006-02-28 12:26:24 |
| 2 | today_end | 2006-02-28 12:26:24 |
+----+--------------+---------------------+
And that too was what I expected. But the story changes if I use a slightly different table. This one has the same columns as t1, but the primary key is the datetime column.
drop table if exists t2;
create table t2 (
id int not null,
dt datetime NOT NULL,
PRIMARY KEY (dt)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

insert into t2 (id, dt) select id, dt from t1;
Now I did the same experiment with this table:
truncate mylog;
select * from t2 where dt between today_start() and today_end();
+----+---------------------+
| id | dt |
+----+---------------------+
| 1 | 2006-02-28 11:19:35 |
| 2 | 2006-02-28 11:19:38 |
| 3 | 2006-02-28 11:19:40 |
+----+---------------------+
The query finds the same records. Let's see what happens to mylog:
select * from mylog;
+----+--------------+---------------------+
| id | routine_name | TS |
+----+--------------+---------------------+
| 1 | today_start | 2006-02-28 12:30:00 |
| 2 | today_end | 2006-02-28 12:30:00 |
| 3 | today_start | 2006-02-28 12:30:00 |
| 4 | today_end | 2006-02-28 12:30:00 |
+----+--------------+---------------------+
I can't imagine why this is happening. The only difference is that dt is now primary key. Instead of being called once, the routine is called twice. If I simply drop the primary key in t2, then the routine is called once per query, as expected.
The result does not change if I use InnoDB tables instead of MyISAM.

Can anyone explain what is happening here?

1 comment:

Todd Farmer said...

I think it is because the optimizer is determining which index to use. I tested this out and found the following:

1. If you create a normal (non-primary key) index, the same results appear - each function is called twice.
2. If you add an additional index on (dt, id), each function is called three times.
3. If you use IGNORE INDEX (dt) on your example with the primary key index on dt, each function is called only once.
4. Adding DETERMINISTIC to the declaration of the functions has no impact on how many times the function is called. The notes in the online manual say:

A procedure or function is considered “deterministic” if it always produces the same result for the same input parameters, and “not deterministic” otherwise. If neither DETERMINISTIC nor NOT DETERMINISTIC is given in the routine definition, the default is NOT DETERMINISTIC.

...

Currently, the DETERMINISTIC characteristic is accepted, but not yet used by the optimizer.

All of the above makes me believe that the optimizer evaluates the function for every possible key because it doesn't recognize the DETERMINISTIC nature of the function (whether it is explicitly declared so or not).

Vote on Planet MySQL