The World Wide Blogger Culture, And Comment Filter Training

The new comment filtering system has been in place now for almost 6 months, and we are slowly starting to see improvements, with signs that spam (aka "bulk") content is becoming more individualised.

Recently, discussions in Blogger Help Forum: Something Is Broken suggest that the filters are now being put in place in Asia. We now see reports of unfair filtering of Chinese, Japanese, and Thai language comments, similar to the complaints about English language filtering that we saw 6 months ago.

When the English language filtering was put into place, the filters had to be trained from the bottom up. The issue is slightly different, for Asian language filtering.

New Blogger features generally start out provided in English, and best supported in English. Next come non English languages that use Roman character sets (those of West European countries), and finally languages that use non Roman character sets. Asian languages are going to be the hardest to support.

The spam filtering, in Asian languages, is going to present a challenge, because the filters have already been trained on English language blogs. Some spammers have been using Asian (Chinese, Japanese, and Thai) characters, to disguise their content, for some time. Many English / European blog owners, working together, have already trained the spam filters to see any comments containing non Roman characters as spam.

English language blog owners (unwillingly, in some cases) helped train the spam filters from the beginning, and had to deal with many false negatives (spam content not detected, in all languages). Later, some blog owners had to deal with false positives (non spam content, falsely labeled as spam, in English).

Asian blog owners, similarly, will have to deal with many false positives (non spam content detected as spam, in Asian languages), as the spam filters will falsely see many legitimate comments, written in Asian languages, as spam. Some false positives will happen because many English / European blog owners, long used to marking all Asian language comments as spam, may continue to do so.

Many people who actively mark spam, when posted in European languages other than English, can identify spam by the phrasing and structure - even if they cannot actually read the specific language used. That is because most West European languages share the Roman alphabet, and many share common origins and vocabulary.

Many people who can adequately identify spam, posted in European languages, will have no such ability with Chinese, Japanese, Thai, or Indian languages - and will mistakenly mark all comments, in those languages, as spam. People who publish blogs in those languages will need to be very active, in marking false positives as "Not Spam", frequently and promptly.
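The training effect described above can be illustrated with a toy sketch. This is not how the Blogger filters actually work (their internals are not public); it is a minimal illustration, assuming a hypothetical filter that learns a spam rate from a single feature: whether a comment contains any non Roman letters. The names `script_feature`, `train`, and `is_flagged`, the sample comments, and the 0.9 threshold are all assumptions made for the example.

```python
import unicodedata
from collections import Counter


def script_feature(comment: str) -> str:
    """Return 'non_roman' if any letter is outside the Latin script, else 'roman'."""
    for ch in comment:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return "non_roman"
    return "roman"


def train(labelled):
    """Learn the fraction of comments marked as spam, per feature value."""
    spam, total = Counter(), Counter()
    for text, is_spam in labelled:
        feature = script_feature(text)
        total[feature] += 1
        spam[feature] += int(is_spam)
    return {feature: spam[feature] / total[feature] for feature in total}


def is_flagged(comment, rates, threshold=0.9):
    """Flag a comment when its feature's learned spam rate reaches the threshold."""
    return rates.get(script_feature(comment), 0.0) >= threshold


# Hypothetical labels from English language blog owners, who marked
# every comment containing Asian characters as spam.
labels = [
    ("Great post, thanks!", False),
    ("Buy cheap watches now", True),
    ("便宜手表 click here", True),
    ("สวัสดี buy now", True),
]

legit = "素晴らしい記事です"  # a legitimate Japanese comment

rates_before = train(labels)
flagged_before = is_flagged(legit, rates_before)  # False positive: flagged

# Asian language blog owners mark legitimate comments as "Not Spam".
labels += [(legit, False), (legit, False)]
rates_after = train(labels)
flagged_after = is_flagged(legit, rates_after)  # No longer flagged
```

In this sketch, the legitimate Japanese comment is flagged only because every earlier non Roman comment was labeled spam. Two "Not Spam" corrections bring the learned rate for that feature from 1.0 down to 0.5, which is why prompt and frequent marking of false positives matters.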
