Sunday, September 04, 2011

Finding tables without primary keys

I was checking a third-party server, and I needed to find out whether there were tables without primary keys. This is important to know, not only because the lack of primary keys affects performance and data accuracy in general, but also because with row-based replication performance can degrade beyond belief when updating tables without primary keys. Anyway, I did not remember off the bat any method to get this information from a server with thousands of tables, and thus I went to find a solution on my own. My first instinct called for using the COLUMNS table from the INFORMATION_SCHEMA, and so I came up with this query, where I sum the number of columns that are inside either a PRIMARY or UNIQUE key and filter only the ones where such sum is zero (i.e. no primary or unique keys):
select 
    table_schema,table_name 
from  
    information_schema.columns  
group by 
    table_schema,table_name   
having 
    sum(if(column_key in ('PRI','UNI'), 1,0)) = 0;
This query got the job done, and it was quite quick as well. Then, since I was chatting with Sheeri Cabral about other matters, I asked her if she could come up with an alternative solution. She suggested a LEFT JOIN between information_schema.tables and information_schema.statistics, which I translated into this query:
select 
    t.table_schema, t.table_name 
from 
    information_schema.tables  t 
    left join information_schema.statistics s 
       on t.table_schema=s.table_schema and t.table_name=s.table_name 
       and s.non_unique=0 
where 
    s.table_name is null;
This query works on the principle of keeping only the tables for which there is no corresponding unique index in the statistics table. This query also works. Using both queries on a relatively empty server did not show any significant difference. But since I knew that I had to use this method on a very busy server with a lot of tables, I quickly created 1,000 databases, each containing 5 tables, two of which did not have any primary or unique key. Now came the first surprise. The query with GROUP BY took about 0.5 seconds, while the one using LEFT JOIN took 11 seconds. I was about to congratulate myself for my acumen when I realized that, in addition to schema and table names, I also needed the table engine. For the second query, that is not a problem: adding the engine to the column list works fine, and the query runs in 11 seconds as before. The first query, though, can't list the engine, because there is no 'engine' column in the COLUMNS table. So I needed a JOIN, and thus my query became
select 
    t.table_schema,t.table_name,engine 
from 
    information_schema.tables t 
    inner join information_schema.columns c  
        on t.table_schema=c.table_schema and t.table_name=c.table_name 
group by 
    t.table_schema,t.table_name   
having 
    sum(if(column_key in ('PRI','UNI'), 1,0)) =0;
Guess what? This query ran in 17 seconds (!). So much for my instinct! Joins without keys are not efficient in MySQL, and the tables in the information schema are no exception. If anyone has a more efficient method of getting a list of tables without a primary key (including schema name, table name, and engine), I am curious to know about it.

9 comments:

shlominoach said...

Ah!
I can see you have not used common_schema as yet.
Take a look at no_pk_innodb_tables.

It strictly checks for InnoDB tables (a restriction that will probably be removed later on).

You should really take a look at common_schema (and spread the word, while at it). There are some interesting queries in there (and by tomorrow, I believe, a new version as well).

shlominoach said...

BTW, except for my excellent advertising, my query does not run faster than yours. However, I do have a plan to make "clones" of the INFORMATION_SCHEMA tables. That is, create an INFORMATION_SCHEMA_GHOST schema, with TABLES, COLUMNS, STATISTICS etc. (the schema-related tables), which are updated by a script using SHOW commands.
Such a schema does not need to be re-evaluated all the time (one's schema does not change all the time), so it can be refreshed just once in a while, at your discretion.
The tables in that schema would be your standard MyISAM tables, so queries will be very fast.
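A minimal sketch of that idea, assuming a hypothetical information_schema_ghost schema that a periodic script rebuilds (the schema name and the choice of tables are illustrative, not an existing tool):

```sql
CREATE DATABASE IF NOT EXISTS information_schema_ghost;

-- Rebuild the MyISAM clone of each metadata table;
-- rerun whenever the schema is known to have changed.
DROP TABLE IF EXISTS information_schema_ghost.tables;
CREATE TABLE information_schema_ghost.tables ENGINE=MyISAM
    AS SELECT * FROM information_schema.tables;

DROP TABLE IF EXISTS information_schema_ghost.statistics;
CREATE TABLE information_schema_ghost.statistics ENGINE=MyISAM
    AS SELECT * FROM information_schema.statistics;
```

The expensive queries would then run against the static MyISAM copies instead of hitting INFORMATION_SCHEMA directly.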

If you're interested in helping out, I'll be very happy.

datacharmer said...

Shlomi,
I didn't know about common_schema, and I should have a deeper look at it. It seems to be a useful addition to every DBA's box of tricks.

In this particular task of mine, though, no_pk_innodb_tables is not helpful, as I need to find all tables without PK, and my preliminary data suggests that many of them may be MyISAM.
Regarding your project of exporting I_S tables to regular ones, I had the same thought myself a few years ago.

Roland Bouman said...

Hi Giuseppe!

Just noticed this:

"I came up with this query, where I sum the number of columns that are inside either a PRIMARY or UNIQUE key and filter only the ones where such sum is zero (i.e. no primary or unique keys):"

This query has two flaws:

#1
You check all columns, regardless of the table type. This means that VIEWs will be flagged as not having a primary key. Strictly speaking this is correct, but this is probably not the intention.

#2
If you can settle for a UNIQUE constraint instead of a PRIMARY KEY, you should also check that all columns of that constraint are NOT NULL. This complicates the query considerably, because you have to check nullability *per index*, and can't just sum it over all table columns that happen to be in an index.

If I have some time later today, I may send in a solution.

datacharmer said...

Roland,
Point taken on both counts.
The query gets more complex, with probably a bigger hit on the server.

I wanted to find a solution that does not require external scripts or stored routines, but it seems that I will have to consider one of these ways.

Roland Bouman said...

Hi Giuseppe,

here's my proposed solution:
select tables.table_schema
     , tables.table_name
     , tables.engine
from information_schema.tables
left join (
    select table_schema
         , table_name
    from information_schema.statistics
    group by table_schema
           , table_name
           , index_name
    having
        sum(
            case
                when non_unique = 0
                and  nullable != 'YES' then 1
                else 0
            end
        ) = count(*)
) puks
on  tables.table_schema = puks.table_schema
and tables.table_name   = puks.table_name
where puks.table_name is null
and   tables.table_type = 'BASE TABLE';

The heart of the query is the puks subquery in the FROM clause. This query selects one row for each unique index in which every column is non-nullable, i.e. where the number of unique, non-nullable columns equals the total number of columns in the index. This is the set of primary keys and unique constraints that have only non-nullable columns.

The outer query matches the tables (but only base tables) against this subquery using an outer join in order to pinpoint those tables that do not have a corresponding row in the subquery, that is - tables without a primary key or non-nullable unique constraint.

datacharmer said...

Roland,
Well done!
Your query is 3 times faster than the previous winner (< 3 seconds, against 11) and it filters out the views.

Thanks

Anonymous said...

Hi Giuseppe,

How do you compare execution times? Doesn't it depend on the table/FS cache?

Please also check this query (IN should be faster than an outer join if the result is only a few tables):

SELECT table_schema, table_name
FROM information_schema.tables
WHERE (table_catalog, table_schema, table_name) NOT IN
      (SELECT table_catalog, table_schema, table_name
       FROM information_schema.table_constraints
       WHERE constraint_type IN ('PRIMARY KEY', 'UNIQUE'))
  AND table_schema NOT IN ('information_schema', 'mysql');

datacharmer said...

@Anonymous
I compare execution times by running the query after restarting the database. That will take care of caches.
Your query is extremely inefficient. It takes 2 minutes and 38 seconds to run.
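As an aside, the same table_constraints idea can also be phrased as a LEFT JOIN instead of NOT IN; a sketch, not benchmarked here:

```sql
-- Keep only base tables with no PRIMARY KEY or UNIQUE constraint;
-- the anti-join leaves tc.table_name NULL when no constraint matches.
SELECT t.table_schema, t.table_name, t.engine
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tc
       ON  t.table_schema = tc.table_schema
       AND t.table_name   = tc.table_name
       AND tc.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
WHERE tc.table_name IS NULL
  AND t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('information_schema', 'mysql');
```

Whether this beats the subquery form depends on how the server materializes the INFORMATION_SCHEMA tables, so it would need the same restart-and-measure test as the other queries.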
