Data labeling
3 Signs You Are Ready to Annotate Data for Machine Learning
Sep 2, 2021
When is the right time to start annotating data for machine learning? Read this short guide to see whether you have all the tools and resources you need to start working on your machine learning projects.
Alberto Rizzoli
Co-founder & CEO
Running a tech company?
Machine learning has now become essential.
Your company has unique detection challenges, whether it’s across images, video, or other forms of information, and Artificial Intelligence (AI) can learn how to solve them.
The thing is—
It can only solve them if it gets access to what is known as labeled data. And this is where things get a little tricky.
Check out What is Data Labeling and How to Do It Efficiently to learn more.
Labeled data is data that has been annotated, for example with tags that describe an image, bounding boxes that highlight regions, or polygons that segment objects.
Annotating data is like showing the AI which objects or parts of imagery you are interested in re-identifying, and how to name them. This data allows you to train AI models to solve a wide range of visual tasks.
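To make the three annotation types above concrete, here is a minimal sketch of how a single labeled image might be represented in code. The class and field names are hypothetical and chosen for illustration only; they are not the export format of V7 or any other annotation tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BoundingBox:
    """An axis-aligned rectangle that highlights a region (hypothetical layout)."""
    label: str
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float


@dataclass
class Polygon:
    """A closed outline that segments an object (hypothetical layout)."""
    label: str
    points: List[Tuple[float, float]]  # (x, y) vertices, in pixels


@dataclass
class ImageAnnotation:
    """All labels attached to one image."""
    filename: str
    tags: List[str] = field(default_factory=list)          # describe the whole image
    boxes: List[BoundingBox] = field(default_factory=list)
    polygons: List[Polygon] = field(default_factory=list)


# Made-up example: one image with two tags, one box, and one polygon.
example = ImageAnnotation(
    filename="factory_0001.jpg",
    tags=["indoor", "night_shift"],
    boxes=[BoundingBox("person", x=120, y=80, width=60, height=180)],
    polygons=[Polygon("defect", [(300, 200), (340, 210), (330, 250)])],
)
```

Real tools add more detail (image dimensions, annotator IDs, review status), but the core idea is the same: every label pairs a class name with a location or a description.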
This is obviously useful stuff—but when is it the right time for you to start annotating your raw image or video data?
If you're considering using AI models in your business, the answer depends on a number of points that we’re going to take a look at in this article.
By the end of the piece, you’ll know exactly when it’s time for you to start annotating data, as well as what you need to do to make a start.
1. When you hire your first computer vision or deep learning engineer
Training data is to a machine learning engineer what fuel is to a rocket.
You'll need a large amount of it to get lift-off, a bit more to reach escape velocity—and periodic top-ups when you're in orbit.
And while we definitely recommend that you get a computer vision, deep learning, or machine learning engineer on board with your business, for any of them to have a positive impact, they need access to quality training data.
Machine learning talent is attracted to companies with good, well-maintained datasets, which they can use to build engineering marvels.
Without training data unique to your business, their job will be reduced to approximating your data with models built on open datasets. This is neither fun nor productive.
Check out our list of 65+ Best Free Datasets for Machine Learning.
When hiring your first computer vision or machine learning talent, ask them how they'd like to see the training data composed for them so that they can solve a particular challenge at your company.
As Tesla's director of AI, Andrej Karpathy, says, the best ML engineers spend 80% of their time on training data, so it’s really important that you know what their needs and expectations are.
By the way—
V7 is hiring for a variety of engineering roles. Check out our Careers Page to apply.
2. When you have selected an annotation platform
You might have an idea of how you want to label your data. Maybe you want to place bounding boxes around defects in manufacturing, segment masks on people, or something similar.
But now, it's time for you to pick a platform that efficiently places and maintains these labels.
Picking an annotation platform is more than just a selection of functionalities. Your AI development partner will play a major role in:
Keeping your confidential image and video data secure.
Maintaining complex datasets in cloud environments that need to be available all year round for your team.
Managing the human aspects of annotation and connecting you to BPO workers to apply labels manually before automation takes over.
Ensuring that labeling can be automated, at least in part, to reduce ongoing costs.
V7 offers a platform that takes care of all of the points above. However, you might want to take a broader look at the market. In fact, we’ve prepared a list of the 13 best image annotation tools for you to browse!
3. When you have collected raw data (or have decided to collect it)
Some teams get started with open-source pre-trained models or open-source data, but that approach has limits.
Contrary to popular belief, neural networks aren't oracles of general understanding.
Rather, they’re efficient compressors of training data. If that data is yours, they'll learn about your business. If that data is someone else's, they'll never learn the nuances of your business's or clients' imagery or video.
For example, if your business wants to detect people in a factory, and you use a model pre-trained on the MS COCO dataset, it will do a decent job at detecting easy and nearby cases, but a model trained directly on footage from the factory will usually outperform it by a wide margin.
Check out The Ultimate Guide to Object Detection to build your own object detection model in hours, not weeks.
How do you collect this data?
The simplest answer is: You’ve probably already done it!
Confused?
Let me explain...
If you're a SaaS platform or you’re analyzing CCTV footage, you probably already have more than enough data.
The key is to learn how to use it.
On the other hand, if you're engineering a brand new solution, you'll have to decide where to place cameras and start collecting some raw footage for training that covers enough expected edge cases.
Many teams new to computer vision worry about hardware, camera placement, or lens profiles. Our friendly V7 team is always happy to help you out and point you in the right direction to pick your suppliers.
And remember that not everything needs to be decided on day one. In fact, these are things you can decide after or during your annotation process.
1. Your annotation schema: What to label and how?
Knowing how to label data can be just as challenging as knowing how to train neural networks.
Unfortunately, it’s often a trade secret, and academic researchers rarely label data themselves, preferring to work on pre-labeled images to keep annotation as a control in experiments.
As such, resources are scarce.
Labeling schemas change over time, and great computer vision teams change theirs yearly to account for better network architectures that can do more than just detect boxes.
For example, many teams that label data for object detection are moving to instance segmentation approaches.
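One reason this move is low-risk is that a polygon label contains everything a bounding box does. The sketch below, with a hypothetical helper name, shows how a segmentation polygon can be reduced to the tightest box around it, so a team can keep training a box detector while it builds up richer labels.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_bbox(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, width, height) of the tightest axis-aligned box
    around a polygon given as (x, y) vertices."""
    if len(points) < 3:
        raise ValueError("a polygon needs at least three vertices")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


# Made-up polygon around a small defect.
box = polygon_to_bbox([(300, 200), (340, 210), (330, 250)])
```

The reverse is not possible: a box cannot be turned back into an accurate outline, which is why upgrading a schema usually means relabeling.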
Several teams have moved from still images to video in order to multiply the amount of training data they process.
Pro tip: Read our Guide to Data Preprocessing to improve the quality of your training data.
You will want to start with a simple schema to launch your MVP and get more ambitious as the team grows.
At V7, we see hundreds of labeling schemas a day being established on the platform and can advise you on industry standard practice.
2. The quantity of data needed: How do you reach and maintain perfection?
A question that gets asked a lot in the machine learning world is: “How much training data do I need?”
The answer is: It depends.
Let me elaborate...
The amount of data you'll need to collect depends on how varied it is and how critical your challenge is.
You can get away with hundreds of training samples for very easy tasks. Highly varied ones, such as navigation, robotic picking, or medical image analysis, see ambitious teams picking through millions (or even tens of millions) of training samples.
As a rule, AI performance improves as the quantity of training data grows, although each additional batch tends to help a little less than the last.
As mentioned earlier, neural networks are clever compressors of data. If you don't feed them many examples, they won't know how to react to "out of distribution" samples. Instead, they’ll just guess!
Have a look at 12 Types of Neural Network Activation Functions to learn more about this process.
Start with as much data as you can afford to collect and label, and once you have your first model trained, extrapolate how much more you will need so that you can cover the edge cases you expect to encounter.
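One rough way to do that extrapolation is to train on a few subsets of increasing size, record the validation error for each, and fit a curve through the results. The sketch below assumes the error follows a power law, error ≈ a · n^(-b), which is a common simplification rather than a guarantee; the function names and the measurements are made up for illustration.

```python
import math
from typing import List, Tuple


def fit_power_law(sizes: List[int], errors: List[float]) -> Tuple[float, float]:
    """Fit error ≈ a * n**(-b) with least squares in log-log space.

    Returns (a, b). Assumes every size and error is positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def samples_needed(a: float, b: float, target_error: float) -> int:
    """Solve a * n**(-b) = target_error for n, rounded up."""
    if b <= 0:
        raise ValueError("error is not decreasing with more data; cannot extrapolate")
    return math.ceil((a / target_error) ** (1 / b))


# Hypothetical measurements: validation error at three training-set sizes.
sizes = [100, 400, 1600]
errors = [0.40, 0.20, 0.10]

a, b = fit_power_law(sizes, errors)
estimate = samples_needed(a, b, target_error=0.05)
```

With these made-up numbers the estimate comes out at roughly 6,400 samples. Treat any such figure as a planning aid only: real learning curves often flatten earlier than a power law predicts, and rare edge cases may need targeted collection rather than simply more data.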
3. Who oversees training data at your company?
The strongest ML teams in the world now have directors of data operations on board. They ensure that the company maintains excellent training data, much like lead engineers maintain excellent code.
When you're starting small, however, the machine learning lead can be the person in charge of training data.
As you scale, you'll notice that training data becomes the fulcrum of your ML-Ops strategy, and you might feel the need to have either an individual or a whole team in charge of a) making training data available and b) maintaining it throughout the business.
If you are looking for a career in data science or want to hire a data scientist, check out 40+ Data Science Interview Questions and Answers.
Whoever you hire to oversee training data at your company, it’s super important that you control who has access to it.
Whilst training data isn’t as obviously compromising as personal data, it can still indicate what your models do.
And the last thing you want is your competitors uncovering this information. As such, make sure that you enforce restricted access procedures once you’ve put a team together.
Key takeaways
Annotating data is a complex challenge, and it’s one part of AI that requires humans to take the lead.
When’s the right time to start?
Only once you’ve hired your first engineer and armed them with the right tools should you start annotating data.
From there, they’ll need access to training data that’s unique to your business, the right annotation platform—and enough data to work with.
If you are looking for a free image annotation tool, check out The Complete Guide to CVAT—Pros & Cons.
Then, once you’ve figured out what to label and how, you can start annotating your data so that your machine learning models can start making more accurate predictions, thus helping your business grow.
Of course, it's a process, and it’s really important that you measure your team’s performance to keep improving and reaching your goals.