The World Wide Blogger Culture, And Comment Filter Training

The new comment filtering system has been in place now for almost 6 months, and we are slowly starting to see improvements, with signs that spam (aka "bulk") content is becoming more individualised.

Recently, discussions in Blogger Help Forum: Something Is Broken suggest that the filters are now being put in place in Asia. Reports of unfair filtering of Chinese, Japanese, and Thai language comments are now appearing, similar to the complaints made 6 months ago about English language filtering.

When the English language filtering was put into place, the filters had to be trained from the bottom up. The issue is slightly different for Asian language filtering.

New Blogger features generally start out provided in English, and best supported in English. Next come non-English languages that use Roman character sets (West European languages), and finally languages that use non-Roman character sets. Asian languages are going to be the hardest to support.

Spam filtering in Asian languages is going to present a challenge, because the filters have already been trained on English language blogs. Some spammers have been using Asian (Chinese, Japanese, and Thai) characters to disguise their content for some time. Many English / European blog owners, working together, have already trained the spam filters to see any comments containing non-Roman characters as spam.

English language blog owners (unwillingly, in some cases) helped train the spam filters from the beginning, and had to deal with many false negatives (spam content not detected, in all languages). Later, some blog owners had to deal with false positives (non-spam content, falsely labeled as spam, in English).
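To make the two kinds of error concrete, here is a minimal sketch that counts them over a handful of made-up moderation decisions. The function name `count_errors` and the sample data are assumptions for illustration only; nothing here reflects how Blogger records or measures its filtering.

```python
def count_errors(decisions):
    """Count filtering errors.

    decisions: list of (is_spam, flagged) pairs, where is_spam is what the
    comment really was, and flagged is what the filter decided.
    Returns (false_positives, false_negatives).
    """
    # False positive: a legitimate comment that the filter flagged as spam.
    false_positives = sum(1 for is_spam, flagged in decisions
                          if not is_spam and flagged)
    # False negative: a spam comment that the filter let through.
    false_negatives = sum(1 for is_spam, flagged in decisions
                          if is_spam and not flagged)
    return false_positives, false_negatives


# Hypothetical sample: five comments and the filter's verdict on each.
sample = [
    (True, True),    # spam, caught
    (True, False),   # spam, missed (false negative)
    (False, False),  # legitimate, published
    (False, True),   # legitimate, flagged (false positive)
    (False, True),   # legitimate, flagged (false positive)
]

fp, fn = count_errors(sample)
print(f"false positives: {fp}, false negatives: {fn}")
```

A filter in early training tends to miss spam (false negatives); a filter trained on one-sided markings tends to flag legitimate comments (false positives).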

Asian blog owners, similarly, will have to deal with many false positives (non spam content detected as spam, in Asian languages), as the spam filters will falsely see many legitimate comments, written in Asian languages, as spam. Some false positives will happen because many English / European blog owners, long used to marking all Asian language comments as spam, may continue to do so.

Many people who actively mark spam, when posted in European languages other than English, can identify spam by the phrasing and structure - even if they cannot actually read the specific language used. That is because many West European languages (the Romance languages, for example) are closely related to each other, and are written in the same Roman character set.

Many people who can adequately identify spam posted in European languages will have no such ability with Chinese, Indian, Japanese, or Thai - and will mistakenly mark all comments in those languages as spam. People who publish blogs in Chinese, Indian, Japanese, or Thai will need to be very active in marking false positives as "Not Spam", frequently and promptly.
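The effect of one-sided markings, and of prompt "Not Spam" corrections, can be sketched with a deliberately crude toy filter. Everything here is an assumption made for illustration: the `ToyFilter` class, the `has_non_roman` check (which treats any letter beyond the basic Latin ranges as non-Roman), and the sample comments. Blogger's real filter is not public, and is certainly more sophisticated than this.

```python
from collections import defaultdict


def has_non_roman(text):
    # Crude, hypothetical feature: does the text contain any letter
    # beyond the Latin blocks that end at U+024F?
    return any(ch.isalpha() and ord(ch) > 0x024F for ch in text)


class ToyFilter:
    """Learns from owner markings, using one feature only."""

    def __init__(self):
        # counts[feature] = [spam markings, not-spam markings]
        self.counts = defaultdict(lambda: [0, 0])

    def mark(self, text, is_spam):
        # Record one blog owner's decision about one comment.
        self.counts[has_non_roman(text)][0 if is_spam else 1] += 1

    def looks_like_spam(self, text):
        # Flag the comment if similar comments were mostly marked spam.
        spam, not_spam = self.counts[has_non_roman(text)]
        return spam > not_spam


toy = ToyFilter()

# Owners mark five comments in Asian characters as spam,
# and one English comment as legitimate.
for _ in range(5):
    toy.mark("安い 時計 買う", True)
toy.mark("Nice post, thanks!", False)

legitimate = "素晴らしい記事をありがとう"
before = toy.looks_like_spam(legitimate)   # flagged: a false positive

# Asian language blog owners mark six legitimate comments "Not Spam".
for _ in range(6):
    toy.mark(legitimate, False)
after = toy.looks_like_spam(legitimate)    # no longer flagged

print(before, after)
```

In this sketch, the legitimate comment is flagged only until the "Not Spam" markings outnumber the earlier spam markings, which is why frequent and prompt corrections matter.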
