The Data Charmer: New Year quiz

Saturday, December 31, 2005

New Year quiz

A few days ago I was asked to analyze some data related to a particular service usage in a large communications provider. The (simplified) table looks like this:

CREATE TABLE `xusage` (
`item_id` int(11) NOT NULL,
`begin_date` date NOT NULL,
`xusage` int(11) default NULL,
PRIMARY KEY  (`item_id`,`begin_date`)
) ;

And here is some sample data:

select * from xusage order by begin_date,item_id;
+---------+------------+--------+
| item_id | begin_date | xusage |
+---------+------------+--------+
|       1 | 2005-09-01 |     93 |
|       1 | 2005-09-02 |     90 |
|       2 | 2005-09-02 |     54 |
|       1 | 2005-09-03 |     66 |
|       2 | 2005-09-03 |     74 |
|       3 | 2005-09-03 |     90 |
|       1 | 2005-09-04 |     38 |
|       2 | 2005-09-04 |     49 |
|       3 | 2005-09-04 |     91 |
|       4 | 2005-09-04 |     91 |
|       2 | 2005-09-05 |     47 |
|       3 | 2005-09-05 |     49 |
|       4 | 2005-09-05 |     33 |
|       5 | 2005-09-05 |     83 |
|       3 | 2005-09-06 |     23 |
|       4 | 2005-09-06 |     99 |
|       5 | 2005-09-06 |     93 |
|       4 | 2005-09-07 |     75 |
|       5 | 2005-09-07 |     47 |
|       5 | 2005-09-08 |      3 |
+---------+------------+--------+

The task at hand was to produce another table (for charting purposes), having the same columns as xusage, with some additional pieces of information: the total usage for that day and the following day, the usage for item_id on day +1, the percentage of such usage from the total of all usage in day + 1;
For example, in the given data, the total usage on September 1st was 93, and on September 2nd was 144. The wanted columns for the first two days would be

+---------+------------+--------+-----------+-------------+-------------+------------------+
| item_id | begin_date | xusage | total day | total day+1 | usage day+1 | percentage day+1 |
+---------+------------+--------+-----------+-------------+-------------+------------------+
|       1 | 2005-09-01 |     93 |        93 |          93 |          90 |          62.5000 |
|       1 | 2005-09-02 |     90 |       144 |         144 |          66 |          28.6957 |
|       2 | 2005-09-02 |     54 |       144 |         144 |          74 |          32.1739 |
+---------+------------+--------+-----------+-------------+-------------+------------------+

The original table had 50 million records. Therefore I had to find an efficient way of fulfilling the task.
It can be done with a single query, although performance would suffer. I chose to create a supporting table to ease the task.

My solution is here.
How would you have solved it? (BTW, there is no prize, unless Arjen wants to provide some.
Happy New Year!

1 comment:

Giuseppe Maxia said...: Kim,
The primary is already set in the original table.
Rebuilding a 50-million records table
to change the primary key is hardly what I'd call an efficient approach. :); December 31, 2005 at 4:18:00 PM GMT+1