AI – Design of proper Dataset (including critical aspects)

This question addresses two critical aspects of designing a dataset for training and testing AI models. The first is bias: during data collection, specific patterns that could produce a false or inaccurate depiction of reality should be avoided. The second is variability: the data must vary enough to represent reality, but excessive variability should be avoided, since it can lead to poor generalization and overfitting. A trade-off is therefore necessary (Salay et al., 2017; Ghahramani, 2015).

Main Question

Are critical aspects in designing the dataset taken into account?

Sub-Questions

  1. Are methods implemented to minimize bias during the dataset design process?
  2. Does the AI-based solution introduce bias?
  3. Are methods implemented to minimize variability during the design process, while ensuring that the dataset remains sufficiently extensive?
  4. Is the dataset balanced (i.e., does it contain a similar number of samples for each class)?
  5. Has an exploratory data analysis (EDA) been conducted to identify potential sources of bias?
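
As a rough illustration of how the balance and EDA sub-questions might be checked in practice, the sketch below counts samples per class, computes an imbalance ratio (largest class divided by smallest), and cross-tabulates labels against a collection attribute. It is a minimal example under stated assumptions, not a prescribed method: the function names, the `site` attribute, and the sample records are all hypothetical, and what counts as an acceptable ratio depends on the application.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts and the imbalance ratio (largest / smallest class)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    ratio = max(counts.values()) / min(counts.values())
    return dict(counts), ratio


def counts_by_group(records, group_key, label_key):
    """Count samples for every (group, label) pair to expose collection patterns."""
    return Counter((r[group_key], r[label_key]) for r in records)


# Hypothetical records, for illustration only: "site" stands for any
# attribute of the collection process (sensor, location, operator, ...).
records = (
    [{"site": "A", "label": "cat"}] * 80
    + [{"site": "B", "label": "cat"}] * 10
    + [{"site": "B", "label": "dog"}] * 10
)

counts, ratio = class_balance([r["label"] for r in records])
by_group = counts_by_group(records, "site", "label")

print(counts)  # samples per class
print(ratio)   # 1.0 would mean perfectly balanced classes
for (site, label), n in sorted(by_group.items()):
    print(site, label, n)
```

In this made-up data every "dog" sample comes from site B, so a model could learn to associate site-specific artifacts with the class. This is the kind of collection pattern that an EDA is meant to surface before training.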

References