Code | Completion | Credits | Range | Language |
---|---|---|---|---|
F7ADTARVD | ZK | 5 | 14P+7C | English |
The course gives an overview of tools for knowledge extraction from data and demonstrates their use on practical tasks with the open-source R environment (the R Project). Special attention is paid to presenting intermediate results in an illustrative way, which makes it easier to communicate with the data owner (e.g. a doctor), who can then better cooperate in choosing further directions of analysis. Topics include clustering; improving model quality by combining multiple base models (bagging, boosting, AdaBoost); data dimension reduction and feature selection (e.g. PCA, ICA, factor analysis); and anomaly detection.
Form of verification of study results: oral examination.
As a standard, the course is taught in contact form with lectures and exercises. If fewer than 5 students are enrolled, teaching can take the form of guided self-study with regular consultations. In that case, in addition to the examination, the student is required to produce a written study on an assigned topic.
For combined study:
Teaching takes the form of guided self-study with regular consultations. In addition to the examination, the student is required to prepare a written study on a given topic.
1. Basic concepts for data description, machine learning and recognition: observation, feature, feature space, classification.
2. Knowledge mining - description and methodology of the CRISP-DM process. Exploratory analysis and visualization of multidimensional data.
3. Clustering for modelling unclassified data - basic algorithms. Evaluation of the resulting model and its application.
4. Basic procedures for modeling classified data - nearest neighbor method, decision tree formation, and their properties. Examples of applications.
5. Measures for comparing the performance of different classification models (accuracy, specificity, ..., ROC curve). Methods for estimating model performance: cross-validation, bootstrapping, learning curve.
6. Support vector machines (SVM) and changing the data representation. Example illustrating the use of a derived attribute to replace several others.
7. Construction of association rules for unclassified data and their use.
8. Methods for improving the quality of processed data - identification of outliers and incorrect values. Understanding data and data preparation: procedures for discretization, normalization and imputation of missing values, data aggregation.
9. Improving model quality by combining multiple base models - bagging, boosting, AdaBoost.
10. Data dimension reduction and feature selection (principal component analysis - PCA, PCA for classification tasks, factor analysis, regression, partial least squares).
11. Several strategies for testing the emerging models (multiple testing and various corrections).
12. Examples of other tools for data modelling: creation of regression trees, use of neural networks.
13. Recognition of anomalies in multivariate data.
14. Prospective topics in DM, e.g. working with structured data.
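The performance measures named in item 5 can be illustrated with a small sketch. It is written in Python purely for illustration (the course itself works in R); the function names and the toy labels are hypothetical and are not part of the course material. It assumes a binary task with labels 0 and 1, where 1 is the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_measures(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        # Share of all cases that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of positive cases that were detected (true positive rate).
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of negative cases that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels (made up for this example): 3 positive and 5 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
measures = classification_measures(y_true, y_pred)
print(measures)
```

Reporting sensitivity and specificity next to accuracy matters when the classes are imbalanced, because a model that always predicts the majority class can still reach a high accuracy.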
Exercises will be solved in the form of practical projects in which students will verify the knowledge acquired in lectures.
Memon Q. A., Khoja S. A.: Data Science: Theory, Analysis and Applications. CRC Press, 2019
Recommended:
Daróczi G.: Mastering Data Analysis with R. Packt Publishing, 2015, ISBN 978-1783982028
R software, freely downloadable from https://www.r-project.org/