Dataset Usage in ML Projects
- Andromeda AI
- May 3, 2021
In machine learning, you will often hear the word "dataset". It refers to the data that the machine is trained on, that is, the data it learns from. The basic idea is that the machine is given example inputs together with their expected outputs and works out a method for turning one into the other, so that it can take a new input and produce an output similar to those it was trained on. For example, we could teach a machine to identify cats by giving it a collection of images, each labeled with whether or not it contains a cat. From these examples, the machine learns what a cat looks like and, when given a new image, can tell you whether or not a cat is visible in it.
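To make the input/output idea concrete, here is a minimal sketch in Python. It is not how real image classifiers work: the two numbers per example are made-up features standing in for an image, and the `predict` helper is a hypothetical nearest-neighbor rule chosen only because it is short.

```python
import math

# Hypothetical toy dataset: each example pairs an input (two made-up
# numeric features standing in for an image) with a label.
dataset = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.7, 0.7), "cat"),
    ((0.1, 0.2), "not cat"),
    ((0.2, 0.1), "not cat"),
    ((0.3, 0.3), "not cat"),
]


def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(example[0], features)
    )
    return nearest_label


# New inputs that were not part of the dataset.
print(predict((0.85, 0.75), dataset))  # prints "cat"
print(predict((0.15, 0.25), dataset))  # prints "not cat"
```

The machine never sees a rule for what a cat is. It only sees labeled examples and answers new questions by comparing them to what it was given.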
These images and their labels make up the dataset. For machines, as for humans, more data generally leads to better results, provided the data is relevant and accurately labeled. The reason is that each additional example gives the machine more information to learn from.
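One way to see the effect of dataset size is to train the same simple model on more and more examples and measure how often it is right on data it has not seen. The sketch below is only an illustration under stated assumptions: the data is synthetic (two overlapping clouds of random points, not real images), and the nearest-centroid model and all function names are hypothetical choices made to keep the example small.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def make_examples(n):
    """Generate n labeled points from two overlapping synthetic classes."""
    examples = []
    for _ in range(n):
        label = random.choice([0, 1])
        center = 2.0 * label  # class 0 around (0, 0), class 1 around (2, 2)
        point = (random.gauss(center, 1.0), random.gauss(center, 1.0))
        examples.append((point, label))
    return examples


def train(examples):
    """Learn one centroid (mean point) for each class that appears."""
    centroids = {}
    for label in (0, 1):
        points = [p for p, y in examples if y == label]
        if points:
            centroids[label] = tuple(statistics.fmean(axis) for axis in zip(*points))
    return centroids


def accuracy(centroids, examples):
    """Fraction of examples whose nearest centroid matches their label."""
    correct = 0
    for point, label in examples:
        guess = min(centroids, key=lambda c: math.dist(centroids[c], point))
        correct += guess == label
    return correct / len(examples)


test_set = make_examples(2000)  # held-out data the model never trains on
results = {}
for size in (2, 20, 200, 2000):
    # Average over several training sets to smooth out the luck of the draw.
    scores = [accuracy(train(make_examples(size)), test_set) for _ in range(20)]
    results[size] = statistics.fmean(scores)
    print(f"{size:>5} training examples -> mean accuracy {results[size]:.3f}")
```

In a run like this, accuracy typically climbs quickly at first and then levels off, because the two classes overlap and no amount of extra data can separate them perfectly.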
In fact, China has reportedly gone as far as creating what is described as the world's largest satellite image database for machines to train on. More info can be found here: https://www.scmp.com/news/china/science/article/3131819/china-makes-worlds-largest-satellite-image-database-train-ai
Next time you are deciding which dataset to use to train your machine learning model, take the size of the dataset into account and consider whether more data would help.
