Monday, May 18, 2009

Cleaning up Wordpress comment tables


cleaning wp

In Montreal with Dups and Kaj, we were looking at a number of technical problems, and each one of you got something valuable from the meeting.
One of Kaj's problems was a collection of Wordpress blogs infested by spam. Kaj has done something already but the situation was critical. Before applying Akismet to his comments, he needed to cleanup the majority of the spam in same easy way.

It is not rocket science, really, but it needs some care in the implementation.
The comment tables in a wordpress installation are called wp_X_comments, where X is a progressive integer. This makes it a bit problematic to run a single query against all comments. Additionally, the same problem exists in two different servers, with a different number of blogs. Thus, I needed to come up with a portable remedy.
There's the starting point.

SELECT
table_name, table_rows
from
information_schema.tables
where
table_schema=schema() and table_name like '%comments';
+----------------+------------+
| table_name | table_rows |
+----------------+------------+
| wp_10_comments | 385 |
| wp_11_comments | 1 |
| wp_12_comments | 7 |
| wp_13_comments | 0 |
| wp_14_comments | 80 |
| wp_15_comments | 0 |
| wp_16_comments | 24 |
| wp_17_comments | 1 |
| wp_18_comments | 2 |
| wp_19_comments | 3 |
| wp_1_comments | 6 |
| wp_2_comments | 34457 |
| wp_3_comments | 582 |
| wp_4_comments | 103 |
| wp_5_comments | 1 |
| wp_6_comments | 0 |
| wp_7_comments | 3 |
| wp_8_comments | 1 |
| wp_9_comments | 24 |
+----------------+------------+

DESC wp_1_comments;
+----------------------+---------------------+------+-----+
| Field | Type | Null | Key |
+----------------------+---------------------+------+-----+
| comment_ID | bigint(20) unsigned | NO | PRI |
| comment_post_ID | int(11) | NO | MUL |
| comment_author | tinytext | NO | |
| comment_author_email | varchar(100) | NO | |
| comment_author_url | varchar(200) | NO | |
| comment_author_IP | varchar(100) | NO | |
| comment_date | datetime | NO | |
| comment_date_gmt | datetime | NO | MUL |
| comment_content | text | NO | | <--
| comment_karma | int(11) | NO | |
| comment_approved | varchar(20) | NO | MUL | <--
| comment_agent | varchar(255) | NO | |
| comment_type | varchar(20) | NO | |
| comment_parent | bigint(20) | NO | |
| user_id | bigint(20) | NO | |
+----------------------+---------------------+------+-----+

To get rid of most of the probable spam, we can use a regular expression to search for unapproved comments that contain at least three links. There is a chance of legitimate comments like that, but this is the best shot.
First, let's see how big is the problem:
set group_concat_max_len = 65000;

set @sum=0;
set @q = (
select group_concat(
concat(
'SELECT "',
table_name,
'" as T, count(*), @sum := @sum + COUNT(*) C FROM ',
table_name,
' where comment_approved=0'
) SEPARATOR ' UNION '
)
from information_schema.tables
where table_schema = schema()
and table_name like 'wp_%_comments');

prepare q from @q;
execute q;
deallocate prepare q;
select @sum;

This code will tell us how many unapproved comments there are in each comment table.
The way if works is by creating a query using the list of comment table, taken from the information schema.

+----------------+----------+-------+
| T | count(*) | C |
+----------------+----------+-------+
| wp_10_comments | 339 | 339 |
| wp_11_comments | 0 | 339 |
| wp_12_comments | 2 | 341 |
| wp_13_comments | 0 | 341 |
| wp_14_comments | 73 | 414 |
| wp_15_comments | 0 | 414 |
| wp_16_comments | 3 | 417 |
| wp_17_comments | 0 | 417 |
| wp_18_comments | 0 | 417 |
| wp_19_comments | 0 | 417 |
| wp_1_comments | 0 | 417 |
| wp_2_comments | 33784 | 34201 |
| wp_3_comments | 579 | 34780 |
| wp_4_comments | 90 | 34870 |
| wp_5_comments | 0 | 34870 |
| wp_6_comments | 0 | 34870 |
| wp_7_comments | 1 | 34871 |
| wp_8_comments | 0 | 34871 |
| wp_9_comments | 24 | 34895 |
+----------------+----------+-------+
+-------+
| @sum |
+-------+
| 34895 |
+-------+

Cleaning up the comments manually is thus ruled out.
The next step is to see how many comments contain at least three links. To create the search pattern, we use the REPEAT function.

set @w = concat('%', repeat('http%',3));
set @sum1 =0;
set @q = (
select group_concat(
concat(
'SELECT "',
table_name,
'" as T, count(*), @sum1 := @sum1 + COUNT(*) C FROM ', table_name,
' where comment_content LIKE "', @w, '" and comment_approved = 0'
) SEPARATOR ' UNION '
)
from information_schema.tables
where table_schema = schema()
and table_name like 'wp_%_comments');

prepare q from @q;
execute q;
select @sum, @sum1, @sum1 / @sum * 100 ;

This second statement will show how many comments will be removed by simply looking at the links.

+----------------+----------+-------+
| T | count(*) | C |
+----------------+----------+-------+
| wp_10_comments | 51 | 51 |
| wp_11_comments | 0 | 51 |
| wp_12_comments | 0 | 51 |
| wp_13_comments | 0 | 51 |
| wp_14_comments | 0 | 51 |
| wp_15_comments | 0 | 51 |
| wp_16_comments | 0 | 51 |
| wp_17_comments | 0 | 51 |
| wp_18_comments | 0 | 51 |
| wp_19_comments | 0 | 51 |
| wp_1_comments | 0 | 51 |
| wp_2_comments | 26340 | 26391 |
| wp_3_comments | 402 | 26793 |
| wp_4_comments | 57 | 26850 |
| wp_5_comments | 0 | 26850 |
| wp_6_comments | 0 | 26850 |
| wp_7_comments | 0 | 26850 |
| wp_8_comments | 0 | 26850 |
| wp_9_comments | 9 | 26859 |
+----------------+----------+-------+

+-------+-------+--------------------+
| @sum | @sum1 | @sum1 / @sum * 100 |
+-------+-------+--------------------+
| 34895 | 26859 | 76.9709 |
+-------+-------+--------------------+

This filter will eliminate almost 77% of the comments.
The last step is now the production of the DELETE queries .

drop table if exists remove_queries;
create table remove_queries (q varchar(600));

insert into remove_queries
select
concat(
'DELETE FROM ', table_name,
' where comment_content LIKE "', @w, '" and comment_approved = 0;'
)
from information_schema.tables
where table_schema = schema()
and table_name like 'wp_%_comments';

select * from remove_queries;

This latest step creates the DELETE queries, the ones that will remnve the 77% of spam.
One they are created, you can either cut-and-paste them, or pipe them between client runs:

mysql wordpress -N -e 'select * from remove_queries' | mysql -v -v -v wordpress

The bigger bulk is gone. But we can do something more to clean up. The query to create is this one:

DELETE FROM wp_XXX_comments
where
comment_approved="spam"
or (
((comment_content regexp "casino|cialis|viagra|sex|loan")
or length(comment_content) < 10 )
and comment_approved = 0
);

It will remove another 20% of spam, leaving the situation so light that a few loops of the akismet plugin will remove the remaining spam.
In the other server, these two scripts removed 90% of the spam, leaving a few hundred comments to consider manually.

No comments: