This question addresses two critical aspects of designing a dataset for the training and testing of AI models. The first is bias: during the data collection phase, patterns that produce a false or inaccurate depiction of reality should be avoided. The second is variability: it must be present to a sufficient extent to guarantee that the collected data represent reality, but it must not be excessive, in order to avoid poor generalization and overfitting. A trade-off is therefore necessary. (Salay et al., 2017), (Ghahramani, 2015)
Main Question
Are critical aspects in designing the dataset taken into account?
Sub-Questions
- Are there methods implemented to minimize bias during the dataset design process?
- Does the AI-based solution introduce bias?
- Are there methods implemented to limit variability (while ensuring the necessary coverage of the dataset) during the design process?
- Is the dataset balanced? (i.e., a similar number of samples per class)
- Has an EDA (exploratory data analysis) been conducted to identify potential sources of bias?
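As a concrete illustration of the balance and EDA sub-questions above, the following is a minimal sketch of a class-balance check. It assumes labels are available as a plain list; the function name `class_balance_report` and the threshold `max_imbalance_ratio` are illustrative choices, not a standard metric.

```python
from collections import Counter

def class_balance_report(labels, max_imbalance_ratio=1.5):
    """Report per-class counts and flag imbalance.

    The dataset is considered balanced here when the ratio between the
    most and least frequent classes does not exceed max_imbalance_ratio
    (an illustrative threshold, not a standard value).
    """
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "imbalance_ratio": ratio,
        "balanced": ratio <= max_imbalance_ratio,
    }

# Example: a hypothetical label column from a collected dataset
labels = ["car"] * 50 + ["pedestrian"] * 45 + ["cyclist"] * 10
report = class_balance_report(labels)
```

In this hypothetical example the "cyclist" class is five times rarer than the "car" class, so the check would flag the dataset as imbalanced; a fuller EDA would also inspect feature distributions, not just label counts.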
References
- Salay, R., Queiroz, R. and Czarnecki, K. (2017) ‘An Analysis of ISO 26262: Using Machine Learning Safely’. Automotive Software. doi:https://doi.org/10.48550/arXiv.1709.02435.
- Ghahramani, Z. (2015) ‘Probabilistic machine learning and artificial intelligence’. Nature, 521(7553), pp.452–459. doi:https://doi.org/10.1038/nature14541.