Early AI was knowledge-based (Hayes-Roth and Jacobstein, 1994), while today’s AI is data-driven (Yu and Kumbier, 2018). In particular, Machine Learning (ML) is the discipline that provides models and methods for transforming data into task-specific knowledge, and most of AI’s recent success can be attributed to ML. Consequently, creating an appropriate training dataset is of paramount importance; arguably, it is the most critical step in building an AI system. We therefore introduce the topic “Dataset Design for AI” (DD-AI, for short): how a proper dataset can be generated, built, and developed, including its update and maintenance.
The first question concerns data quality, which assesses whether a piece of information can serve its purpose in a particular context.
Main Question
Is the data quality of the (training) dataset considered and guaranteed?
Sub-Questions
- Is the training data representative of the purpose?
- Is it accurate enough for the purpose of training (e.g., classifying objects with CR ≥ 90%)? (Botta et al., 2019)
- Is the size of the dataset large enough for the purpose?
- Is the information comprehensive (that is, are all the data you need available)?
- Is the information reliable, that is, does it corroborate or contradict other trusted resources?
- Is redundancy of information considered?
- Is the information relevant, that is, is it really necessary?
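Several of these sub-questions lend themselves to simple automated checks before training. The sketch below illustrates this idea for a tabular dataset represented as a list of records; the field names, thresholds, and the heuristics themselves are hypothetical examples, not a definitive quality-assurance procedure.

```python
# Illustrative pre-training checks for a few of the sub-questions above.
# Dataset layout, field names, and all thresholds are hypothetical.
from collections import Counter

def dataset_quality_report(rows, label_key,
                           min_rows=1000,
                           max_missing_frac=0.05,
                           max_duplicate_frac=0.01):
    """rows: list of dicts, one per example. Returns pass/fail flags."""
    n = len(rows)
    # Comprehensiveness: fraction of missing (None) cells.
    missing = sum(1 for r in rows for v in r.values() if v is None)
    total_cells = sum(len(r) for r in rows) or 1
    # Redundancy: fraction of exact duplicate records.
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
    # Representativeness (crude proxy): no class is nearly absent.
    labels = Counter(r[label_key] for r in rows)
    min_share = min(labels.values()) / n if n else 0.0
    return {
        "size_ok": n >= min_rows,
        "comprehensive_ok": missing / total_cells <= max_missing_frac,
        "redundancy_ok": (duplicates / n <= max_duplicate_frac) if n else False,
        "representative_ok": (min_share >= 0.5 / len(labels)) if labels else False,
    }

# Tiny synthetic example:
rows = [
    {"feature": 1.0, "label": "a"},
    {"feature": 2.0, "label": "a"},
    {"feature": 3.0, "label": "b"},
    {"feature": 4.0, "label": "b"},
]
report = dataset_quality_report(rows, "label", min_rows=4)
```

Checks like accuracy for the task (CR ≥ 90%) and reliability against trusted sources cannot be automated this way; they require evaluation on the trained model and manual cross-checking, respectively.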
References
- Botta, M. et al. (2019) ‘Real-time detection of driver distraction: Random projections for pseudo-inversion-based neural training’, Knowledge and Information Systems, 60(3), pp. 1549–1564. doi: https://doi.org/10.1007/s10115-019-01339-0.
- Hayes-Roth, F. and Jacobstein, N. (1994) ‘The state of knowledge-based systems’, Communications of the ACM, 37(3), pp. 26–39. doi: https://doi.org/10.1145/175247.175249.
- Yu, B. and Kumbier, K. (2018) ‘Artificial intelligence and statistics’, Frontiers of Information Technology & Electronic Engineering, 19(1), pp. 6–9. doi: https://doi.org/10.1631/fitee.1700813.