Get Machine Learning Training Data Using The Lionbridge Method [A How-To Guide]


  • Author Limarc Ambalina
  • Published March 5, 2020

In the field of machine learning, training data preparation is one of the most important and time-consuming tasks. In fact, many data scientists claim that a large portion of data science is pre-processing and some studies have shown that the quality of your training data is more important than the type of algorithm you use.

As a result, more and more companies like Lionbridge have entered the AI market to help serve this demand for training data.

How Do You Get Machine Learning Training Data?

There are three main ways to get training data:

  1. Find open-source datasets online through websites like Kaggle, Google Dataset Search, or a dataset aggregator.

  2. Build the dataset yourself: collect/create the data and annotate it internally.

  3. Outsource data collection and annotation services from a training data provider.

For personal projects or school assignments, sometimes open datasets can provide a sufficient amount of data for the tasks you need to complete. However, when building and training AI solutions for commercial purposes, open datasets are often not available for your use case or can’t be used for profit.

Furthermore, sourcing and annotating your own training data in-house is often inefficient when you have thousands of pieces of data and just a handful of staff. This leaves us with the third option: outsourcing training data services.

Machine Learning Training Data Services

Lionbridge helps clients improve their models through a variety of machine learning training data services.

Some of our core services include:

Data Collection: speech/utterance data, handwritten data, chatbot training phrases

Image & Video Annotation: bounding boxes, polygons, circles, lines, keypoints

Text Annotation: sentiments, entities, entity linking, classification

Audio Annotation: verbatim transcription, intelligent verbatim, audio classification

Content Evaluation: ad evaluation, search evaluation, geo-local data evaluation

Lionbridge AI: From Translation to Training Data

At Lionbridge, we harness the expertise of our global community of data scientists, computational linguists, translators, and annotators to create high-quality machine learning training data for a variety of use cases. With our expert community and all-in-one data annotation platform, we provide development teams with tailored training data solutions for their machine learning models.

Why Translation Companies are Perfect for Data Annotation

Why did we expand into AI? The reason is simple. We realized our global community is the perfect workforce for data annotation.

For natural language processing (NLP) especially, professional linguists are the perfect annotators for entity extraction, search query classification, and other language-based annotation projects. After thorough testing and training, this same workforce is easily able to perform various image annotation tasks for computer vision.

Now, for both NLP and computer vision, some of the world’s largest companies turn to Lionbridge for data annotation outsourcing. Our expertise in localization and linguistics has equipped us with the tools, the knowledge, the contacts, and the workforce to provide training data services at scale.

Does Quality Translation = Quality Training Data?

Not necessarily. However, quality assurance processes in translation are incredibly similar to QA protocols for AI training data.

For example, one of the QA processes for localization projects is editor review. With translation, we normally have one or multiple editors review a translator’s output. Similarly, with many of our AI projects we have multiple contributors annotate the same piece of data to check for agreement.
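
As a rough illustration of how overlapping annotations can be checked for agreement, the sketch below takes the labels several contributors gave to one item, picks the majority label, and flags the item for a reviewer when agreement is low. This is a minimal sketch, not Lionbridge's actual tooling: the function name `consolidate` and the `min_agreement` cutoff of 0.6 are assumptions made for the example.

```python
from collections import Counter


def consolidate(labels, min_agreement=0.6):
    """Return (majority_label, agreement, needs_review) for one item.

    labels: the labels different contributors gave to the same item.
    min_agreement: hypothetical cutoff; items below it go to a reviewer.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    # most_common(1) gives the label with the highest count
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    return top_label, agreement, agreement < min_agreement


if __name__ == "__main__":
    # Two of three contributors agree, so the item passes the cutoff.
    print(consolidate(["positive", "positive", "negative"]))
    # All three disagree, so the item is flagged for review.
    print(consolidate(["positive", "negative", "neutral"]))
```

Items that come back flagged would go to the equivalent of an editor review rather than straight into the dataset.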

A lot of the time, managing quality means managing contributors. We have numerous gates that your data must get through to ensure accuracy. At Lionbridge, our community guards each of those gates, making sure the end product matches your specifications.

Managing Output

Our community is now 1 million strong, and as our network grows, we grow with it.

We have numerous protocols in place to make sure each contributor is performing to the best of their ability. For example, we check for inter-annotator agreement to make sure that each annotation is accurate. This process also helps us verify that the data itself is clear and that the task is straightforward. For some projects, we’ve had up to five contributors annotate the same data. Furthermore, we can also implement self-agreement checks to ensure that each contributor is consistent with their work.
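
To make the idea of inter-annotator agreement concrete, here is a minimal sketch using Cohen's kappa, a standard statistic that corrects raw agreement for chance. The article does not say which agreement metric is used in practice, so the choice of kappa, the averaging over annotator pairs, and all function names below are assumptions for illustration only. The last function shows one simple way to express a self-agreement check: the share of repeated items a contributor labeled the same way twice.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over every pair of annotators.

    annotations: dict of annotator id -> list of labels, same item order.
    """
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def self_agreement(first_pass, second_pass):
    """Share of repeated items one contributor labeled the same both times."""
    if len(first_pass) != len(second_pass) or not first_pass:
        raise ValueError("passes must be non-empty and the same length")
    matches = sum(x == y for x, y in zip(first_pass, second_pass))
    return matches / len(first_pass)
```

A low average kappa can point to an unreliable contributor, but it can also mean the guidelines or the data itself are ambiguous, which is why the same check helps verify that the task is straightforward.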

A great example of QA for machine learning training data is our process for utterance/speech data collection:

First, we have sound engineers make sure that each contributor said the phrase correctly. They make sure that the contributor hasn’t missed a word and that they speak in their natural tone of voice (as opposed to monotone reading).

Next, we send the audio files to native speakers of each language who review the sound clips according to the script.

Lastly, we send the files for audio quality checks to make sure that noise stays within a certain threshold, among other criteria requested by the customer.
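
The last of these steps, the noise check, is the easiest to automate. The sketch below is one hypothetical way to do it: measure the RMS level of a silent stretch of the recording in dBFS and compare it to a maximum allowed noise level. The -50 dBFS default, the function names, and the assumption that samples are floats between -1.0 and 1.0 are all illustrative; real thresholds would come from the customer's specification.

```python
import math


def rms_dbfs(samples):
    """RMS level in dBFS for float samples in the range [-1.0, 1.0]."""
    if not samples:
        raise ValueError("no samples")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Digital silence has no finite level in decibels
    return -math.inf if rms == 0 else 20 * math.log10(rms)


def passes_noise_check(silence_samples, max_noise_dbfs=-50.0):
    """True if the background noise level is at or below the threshold.

    silence_samples: samples from a stretch where the speaker is silent.
    max_noise_dbfs: hypothetical limit; set per project in practice.
    """
    return rms_dbfs(silence_samples) <= max_noise_dbfs
```

A clip that fails the check would be sent back for re-recording rather than passed on to the customer.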

These are just some of the QA measures we have in place, which are constantly being adjusted to match each project and improve our crowd.

Data Quality is Subjective

At the end of the day, we know that the definition of data quality is dependent on the project. "When you speak of quality in terms of training data, there is no objective definition. It depends on what you are trying to do," says Cedric Wagrez (Lionbridge’s Director of AI Services for Japan). "Quality is relative to your end goals and various factors, such as your KPIs, precision, and tailored use case."

High-quality machine learning training data is data that is collected, annotated, and calibrated in a way that helps you achieve your goal.

At Lionbridge, we know that before we can start to manage quality, we first have to understand what it means to you.

Trial Projects

Before the project even begins, we provide you with a free consultation to explain the best ways to collect or annotate your data.

Next, we run tests and a trial project to align with your expectations. Let’s say you have 10,000 pieces of data to be annotated. To ensure that we’re all on the same page, we would take the first 100 pieces of the data, set the project up in our system, and have our community label the data. If the end result is exactly how you imagined it to be, we then go ahead with the rest of the data. If there are things to be changed, we would recalibrate based on your feedback.
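
The trial workflow described above can be sketched in a few lines: carve off the first 100 items as a pilot batch, then decide whether to proceed or recalibrate based on the client's review. This is a simplified illustration; the function names and the `required_rate` parameter are assumptions, with the default of 1.0 mirroring the "exactly how you imagined it" condition.

```python
def split_pilot(items, pilot_size=100):
    """Return (pilot, remainder): the first pilot_size items and the rest."""
    if pilot_size < 0:
        raise ValueError("pilot_size must be non-negative")
    return items[:pilot_size], items[pilot_size:]


def pilot_decision(approved_flags, required_rate=1.0):
    """Decide the next step from the client's review of the pilot batch.

    approved_flags: one bool per pilot item (True = client approved it).
    required_rate: hypothetical share of approvals needed to proceed.
    """
    if not approved_flags:
        raise ValueError("no reviewed items")
    rate = sum(approved_flags) / len(approved_flags)
    decision = "proceed" if rate >= required_rate else "recalibrate"
    return decision, rate


if __name__ == "__main__":
    items = [f"item-{i}" for i in range(10_000)]
    pilot, remainder = split_pilot(items)
    print(len(pilot), len(remainder))
    print(pilot_decision([True] * 99 + [False]))
```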

It’s important to remember that quality data is not just about clear images and tight bounding boxes. The people you choose to label the data, the guidelines you give them, and the environment in which you collect the data all have to be taken into account.

Data Collection and Annotation Tools for Text, Audio, Images & Video

Have the workforce to label your data, but need a platform to label it on? We recently announced the release of our data annotation platform as a consumer product. Our engineering team and internal data scientists have built this state-of-the-art platform from the ground up.

Our platform has a simple and seamless UX with a short learning curve, allowing you to create quality training data. Furthermore, you can easily manage your project, monitor progress, and track worker statistics via the dashboard. Now, you and your team can label data internally through our intuitive annotation interface, with no coding required!

The AI industry is expected to add 15 trillion dollars to the world economy within the next 10 years. As the market continues to grow, so will the demand for training data. Thus, we will likely see more and more companies like Lionbridge enter the machine learning training data industry.

Whether you need 1,000 or 1 million pieces of data, Lionbridge can help you construct the best training data solution. Contact our team to learn more about how we can help you collect and label the data for your project.
