The Importance of Natural Language Processing for Non-English Languages
- Author Limarc Ambalina
- Published September 9, 2020
- Word count 1,107
Natural Language Processing (NLP) is growing in use and plays a vital role in many systems, from resume parsing for hiring to automated telephone services. You can also find it in commonly used technology such as chatbots, virtual assistants, and modern spam detection. However, the development and implementation of NLP technology is not as equitable as it may appear.
To put it into perspective: although more than 7,000 languages are spoken around the world, the vast majority of NLP work amplifies just seven key languages: English, Chinese, Urdu, Farsi, Arabic, French, and Spanish.
Even among these seven languages, the vast majority of technological advances have been achieved in English-based NLP systems. For example, optical character recognition (OCR) is still limited for non-English languages. And anyone who has used an online automatic translation service knows the severe limitations once you venture beyond the key languages referenced above.
How are NLP Pipelines Developed?
To understand the language disparity in NLP, it is helpful to first understand how these systems are developed. A typical pipeline starts by gathering and labeling data. A large dataset is essential here as data will be needed to both train and test the algorithm.
When a pipeline is developed for a language with little available data, it is helpful to have strong patterns within the language. Small datasets can be augmented by techniques such as synonym replacement to simplify the language, back-translation to create similarly phrased sentences to bulk up the dataset, and replacing words with other related parts of speech.
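As a minimal illustration of the first of these techniques, the sketch below performs synonym replacement using a small hand-built synonym table. The table and function are hypothetical, purely for illustration; in practice the synonyms might come from a lexical resource such as WordNet, where one exists for the language.

```python
import random

# Hypothetical synonym table for illustration only; a real pipeline
# would draw on a lexical resource for the target language.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
}

def synonym_replace(sentence, p=0.5, seed=0):
    """Return a copy of `sentence` with eligible words swapped for synonyms.

    Each word found in SYNONYMS is replaced with probability `p`,
    yielding a new training example with (roughly) the same meaning.
    """
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word]))
        else:
            out.append(word)
    return " ".join(out)
```

Each call with a different seed yields another plausible variant of the sentence, which is how a small dataset can be multiplied several times over.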
Language data also requires significant cleaning. When a non-English language with special characters is used, such as Chinese, proper Unicode normalization is typically required. Normalization ensures that visually identical text maps to a single, consistent encoded form that all computer systems can recognize, reducing the risk of processing errors. The issue is amplified for languages like Croatian, which rely heavily on diacritics to change the meaning of a word: in Croatian, a single accent mark can turn a positive word into a negative one. These terms therefore have to be manually encoded to ensure a robust dataset.
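A short Python sketch shows why normalization matters. The same accented letter can be encoded either as one composed code point or as a base letter plus a combining mark; the two forms render identically but compare as different strings until they are normalized. (The example uses the Croatian letter "č".)

```python
import unicodedata

# The Croatian letter "č" has two valid Unicode encodings:
composed = "\u010D"     # single code point: LATIN SMALL LETTER C WITH CARON
decomposed = "c\u030C"  # 'c' followed by COMBINING CARON

# They look identical on screen but are different strings...
assert composed != decomposed

# ...until both are normalized to the same canonical form:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Applying one normalization form (commonly NFC) to an entire corpus before tokenization is a standard way to avoid duplicate vocabulary entries for visually identical words.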
Finally, the dataset can be divided into a training and testing split, and sent through the machine learning process of feature engineering, modeling, evaluation, and refinement.
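A minimal sketch of that split step, using only the Python standard library. The 80/20 ratio and fixed seed are illustrative choices, not something the pipeline prescribes.

```python
import random

def train_test_split(examples, test_ratio=0.2, seed=42):
    """Shuffle labeled examples and divide them into train and test sets."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled dataset: (text, label) pairs
data = [("text %d" % i, i % 2) for i in range(100)]
train, test = train_test_split(data)
```

The training portion then feeds feature engineering and modeling, while the held-out test portion is reserved for evaluation.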
[Image: NLP pipeline diagram, originally featured on Towards Data Science]
One commonly used NLP tool is Google’s Bidirectional Encoder Representations from Transformers (BERT), which is purported to train a “state of the art” model in 30 minutes on a single tensor processing unit (TPU). The project’s GitHub page reports support for the 100 languages with the largest Wikipedias, but actual evaluation and refinement have only been performed on 15 of those languages. So while BERT technically does support more languages, the lower accuracy and lack of proper testing limit the applicability of the technology. Other NLP systems, such as Word2Vec and the Natural Language Toolkit (NLTK), have similar limitations.
In summary, the NLP pipeline is a challenge for less popular languages. The datasets are smaller, they often need augmentation work, and the cleaning process requires time and effort. The less access you have to native-language resources, the less data available when building an NLP pipeline. This makes the barrier to entry for less popular languages very high, and in many cases, too high.
The Importance of Diverse Language Support in NLP
There are two overarching perspectives that support the expansion of NLP:
The reinforcement of social disadvantages
Language expansion to improve ML technology
Let’s look at each in more detail:
Reinforcement of Social Disadvantages
From a societal perspective, it is important to keep in mind that technology is only accessible when its tools are available in your language. On a basic level, the lack of spell-check support impairs those who speak and write less common languages, and this disparity compounds further up the technological chain.
What’s more, psychological research has shown that the language you speak shapes the way you think. A built-in language preference in the systems that drive the internet therefore embeds the societal norms of the dominant languages.
The fact is that supported systems continue to thrive while it is challenging to introduce new aspects to a deeply ingrained program. This means that as NLP continues to develop without bringing in a diverse range of languages, it will be more challenging to incorporate them in the future, endangering the global variety of languages.
English and English-adjacent languages are not representative of other world languages, as they have grammatical structures that many languages do not share. By supporting mostly English, however, the internet and other technologies progressively treat English as the normal, default language setting.
When an otherwise language-agnostic system is trained on English, it learns the norms of one specific language and all the cultural implications that come with that limitation. This one-sided approach will only become more apparent as NLP is applied to more intelligent processes that have an international audience.
Language Expansion to Improve ML Technology
When we apply machine learning techniques to only a handful of languages, we program implicit bias into the systems. As machine learning and NLP continue to advance while supporting only a few languages, we not only make it more challenging to introduce new languages later, but risk making it fundamentally impossible to do so.
For example, subword tokenization performs very poorly on languages that feature reduplication (repeating a word or part of a word to change its meaning), which is common to many languages, including Afrikaans, Irish, Punjabi, and Armenian.
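A toy greedy subword tokenizer makes the problem concrete. Assuming a vocabulary learned mostly from non-reduplicated forms, a reduplicated plural such as Indonesian "buku-buku" ("books", from "buku") fragments into several pieces. Both the one-word vocabulary and the greedy longest-match rule are simplifications for illustration; real systems use learned merge tables as in byte-pair encoding.

```python
def subword_tokenize(word, vocab):
    """Split `word` using greedy longest-match against `vocab`.

    Falls back to single characters when no vocabulary entry matches,
    mimicking how subword models handle out-of-vocabulary material.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# A vocabulary that contains the base form but not the reduplicated plural:
vocab = {"buku"}
singular = subword_tokenize("buku", vocab)       # stays whole: one piece
plural = subword_tokenize("buku-buku", vocab)    # fragments into three pieces
```

The reduplicated form ends up three tokens long where the base form is one, inflating sequence lengths and splitting a single grammatical unit across pieces the model must relearn to associate.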
Languages also have a variety of word order norms, which tend to stump the common neural models used in English-based NLP.
What Can Be Done?
In the current discourse around NLP, when the words “natural language” are spoken, the general assumption is that the researcher is working on an English dataset. To break out of this mold and raise awareness of other language systems, researchers should first always state the language being worked on. This practice of always naming the language under study is colloquially referred to as the Bender Rule.
Simple awareness of the issue, of course, is not enough. However, being mindful of the issue can help in the development of more broadly applicable tools.
When looking to introduce more languages into an NLP pipeline, it is also important to consider the size of the dataset. If you are creating a new dataset, a significant portion of your budget should be allocated to building data in another language. Of course, additional research into optimizing current cleaning and annotation tools for other languages is also vital to broadening NLP technologies around the globe.