A research team on the mainland claims to have developed a text censor that can filter “harmful information” on the internet with unprecedented accuracy using artificial intelligence.
Traditional machine censors rely mainly on keywords to do this and struggle to achieve 70 per cent accuracy, while AI technology – which needs to be trained by humans – has taken that to about 80 per cent in recent years.
The team from Shenyang Ligong University and the Chinese Academy of Sciences said their AI technology did not need to be trained by humans and “outperforms other approaches” to attain more than 91 per cent accuracy. It would be particularly useful to “identify and filter sensitive information from online news media”, lead researcher Li Shu and her colleagues wrote in the Journal of Chinese Computer Systems.
The mainland has more than 900 million internet users, more than any other country, and is building the world’s largest 5G networks to boost speed. But the internet is tightly controlled: many sites are blocked, including Google, Facebook, Twitter and some foreign news outlets, and many topics are banned on the sites that remain available.
Banned topics include cults, pornography, drug abuse, firearm use, terrorism and attacks on the Communist Party and its leaders.
But identifying them is a challenge for computers. Chinese is one of the most complex languages in the world, with nearly 10,000 characters. And sensitive words – gun, for example – could get picked up in a non-sensitive context, triggering a false alarm, or illegal information could be posted online without the use of any sensitive words.
The Chinese government and internet companies have instead relied on a huge army of censors to manually vet online content, but manual vetting is too costly and inefficient to keep pace with the growth of information on China’s internet.
Li, an associate professor of computer science at Shenyang Ligong University, said the technology developed by her team could keep up with the fast-evolving language used online in China, with a powerful dictionary containing not only sensitive words but their changing forms.
She said it could also read between the lines to find illegal content hidden in a different context, improving its ability to identify text written to evade machine censors.
Many internet users in China avoid using sensitive words and instead use homonyms or add hyphens to fool the censors.
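The hyphen trick, and the false alarms that plain keyword matching produces, can be illustrated with a minimal Python sketch. It is not the researchers' method: the `VARIANTS` dictionary, the separator list and the function names are hypothetical, and English stands in for Chinese to keep the example readable.

```python
import re

# Characters people might insert to break up a word (an assumed list).
SEPARATORS = re.compile(r"[-_.*]+")

# Hypothetical dictionary mapping known spellings to a canonical term.
VARIANTS = {"gun": "gun", "gvn": "gun"}

def normalize(text: str) -> str:
    """Lower-case the text and strip inserted separators such as hyphens."""
    return SEPARATORS.sub("", text.lower())

def find_terms(text: str, variants: dict = VARIANTS) -> list:
    """Return the canonical terms whose known spellings appear in the text."""
    norm = normalize(text)
    return sorted({canon for spelling, canon in variants.items() if spelling in norm})

hyphenated = find_terms("where to buy a g-u-n")   # caught after normalization
false_alarm = find_terms("the race has begun")    # "begun" contains "gun"
```

Both calls report the term, although only the first text is about firearms. This is the kind of false alarm that context-aware models are meant to reduce.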
Part of the team’s text censor technology came from Google, Li said. In 2018, Google released an open-source language model known as bidirectional encoder representations from transformers, or BERT, to help its search engine better understand users’ search terms. BERT can read words in different contexts – such as “running a business” versus “running a marathon” – as a result of being trained on huge collections of text, including the entire Wikipedia site.
But BERT is not a censor and cannot process text longer than 512 tokens, the word pieces or characters it reads. To make it work, Li’s machine breaks a long text into segments, lets BERT read the shorter parts and uses another AI tool to assess the results using an up-to-date dictionary.
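The segment-and-assess idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's system: `score_segment` is a hypothetical stand-in for the model that reads each segment, and flagging a document when any one segment scores above a threshold is an assumed rule, since the article does not say how the results are combined. Two positions are reserved for the `[CLS]` and `[SEP]` markers that BERT adds to every input.

```python
from typing import Callable, List

MAX_LEN = 512   # BERT's input limit, counted in tokens
RESERVED = 2    # positions taken by the [CLS] and [SEP] markers

def split_into_segments(tokens: List[str], max_len: int = MAX_LEN) -> List[List[str]]:
    """Cut a token list into consecutive segments that fit the input limit."""
    window = max_len - RESERVED
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def flag_document(tokens: List[str],
                  score_segment: Callable[[List[str]], float],
                  threshold: float = 0.5) -> bool:
    """Flag the document if any segment scores at or above the threshold.

    score_segment is a hypothetical scorer returning a value in [0, 1];
    the any-segment rule is an assumption made for this sketch.
    """
    return any(score_segment(seg) >= threshold
               for seg in split_into_segments(tokens))

# Toy usage: a 1,200-token text with one marked token near the end.
toy_tokens = ["w"] * 1199 + ["x"]
toy_scorer = lambda seg: 1.0 if "x" in seg else 0.0
segments = split_into_segments(toy_tokens)
flagged = flag_document(toy_tokens, toy_scorer)
```

A plain cut like this can split a sentence across two segments. Overlapping the windows is a common way to soften that, though the article does not say whether the team does so.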
Source: South China Morning Post