The Data Charmer: New Year quiz

A few days ago I was asked to analyze some data related to a particular service usage in a large communications provider. The (simplified) table looks like this:

CREATE TABLE `xusage` (
`item_id` int(11) NOT NULL,
`begin_date` date NOT NULL,
`xusage` int(11) default NULL,
PRIMARY KEY  (`item_id`,`begin_date`)
) ;

And here is some sample data:

select * from xusage order by begin_date,item_id;
+---------+------------+--------+
| item_id | begin_date | xusage |
+---------+------------+--------+
|       1 | 2005-09-01 |     93 |
|       1 | 2005-09-02 |     90 |
|       2 | 2005-09-02 |     54 |
|       1 | 2005-09-03 |     66 |
|       2 | 2005-09-03 |     74 |
|       3 | 2005-09-03 |     90 |
|       1 | 2005-09-04 |     38 |
|       2 | 2005-09-04 |     49 |
|       3 | 2005-09-04 |     91 |
|       4 | 2005-09-04 |     91 |
|       2 | 2005-09-05 |     47 |
|       3 | 2005-09-05 |     49 |
|       4 | 2005-09-05 |     33 |
|       5 | 2005-09-05 |     83 |
|       3 | 2005-09-06 |     23 |
|       4 | 2005-09-06 |     99 |
|       5 | 2005-09-06 |     93 |
|       4 | 2005-09-07 |     75 |
|       5 | 2005-09-07 |     47 |
|       5 | 2005-09-08 |      3 |
+---------+------------+--------+

The task at hand was to produce another table (for charting purposes), having the same columns as xusage, with some additional pieces of information: the total usage for that day and the following day, the usage for item_id on day +1, the percentage of such usage from the total of all usage in day + 1;
For example, in the given data, the total usage on September 1st was 93, and on September 2nd was 144. The wanted columns for the first two days would be

+---------+------------+--------+-----------+-------------+-------------+------------------+
| item_id | begin_date | xusage | total day | total day+1 | usage day+1 | percentage day+1 |
+---------+------------+--------+-----------+-------------+-------------+------------------+
|       1 | 2005-09-01 |     93 |        93 |          93 |          90 |          62.5000 |
|       1 | 2005-09-02 |     90 |       144 |         144 |          66 |          28.6957 |
|       2 | 2005-09-02 |     54 |       144 |         144 |          74 |          32.1739 |
+---------+------------+--------+-----------+-------------+-------------+------------------+

The original table had 50 million records. Therefore I had to find an efficient way of fulfilling the task.
It can be done with a single query, although performance would suffer. I chose to create a supporting table to ease the task.

My solution is here.
How would you have solved it? (BTW, there is no prize, unless Arjen wants to provide some.
Happy New Year!

The Data Charmer

Pages

Saturday, December 31, 2005

New Year quiz

1 comment: