20570 - DATA ANALYTICS AND VISUALIZATION
Course taught in English
Class 22: RAFFAELLA PICCARRETA
For an effective learning experience, it is strongly recommended to have basic notions of statistics: univariate and bivariate descriptive statistics, and the most relevant inferential concepts (samples, statistics, estimators, hypothesis testing, p-values). To this end, an online preparatory course (20354) is available, including online tests to verify the level of knowledge and understanding of the concepts used during the course. Students are also expected to have basic skills in Excel and Word.
Modern graduates need to use data to a much greater extent than their predecessors. Data management (retrieving, filtering, or cleaning), exploratory data analysis, and appropriate data visualization are becoming increasingly relevant in every field. In this course, students are introduced to big datasets and gain an applied understanding of the most relevant techniques of multivariate data analysis. The key goal of the course is to illustrate methods useful for analyzing and summarizing the most salient features of large datasets with respect to both the variables and the cases. The course features hands-on classes, where the application of each technique is discussed with reference to real datasets.
Description and summary of multivariate samples:
- Variable reduction. Optimal indicators and Principal Components Analysis.
- Variable reduction. Latent concepts and Factor Analysis.
- Case reduction. Finding groups in data and Cluster Analysis.
- Summarizing association using Simple Correspondence Analysis.
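As an illustration of the variable-reduction idea behind principal components analysis, the following minimal sketch extracts components from the correlation matrix of a small dataset. The syllabus does not name the software used in the course, so Python with NumPy, the function name `principal_components`, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np


def principal_components(X):
    """Sketch of a PCA on standardized variables (correlation-matrix PCA).

    Returns the eigenvalues (component variances), the loadings
    (eigenvectors, one column per component) and the component scores.
    """
    # Standardize each variable: zero mean, unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the variables (columns).
    R = np.corrcoef(Z, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    # Reorder so that the component with the largest variance comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scores: coordinates of each case on the components.
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores


# Hypothetical data: four noisy measurements of one common latent quantity.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.3 * rng.normal(size=(200, 1)) for _ in range(4)])

eigvals, loadings, scores = principal_components(X)
explained = eigvals / eigvals.sum()  # share of total variance per component
print(np.round(explained, 3))
```

In this artificial example the four variables are strongly correlated, so the first component accounts for most of the total variance and could serve as a single summary indicator. With real data, the number of components to retain is a choice left to the analyst.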
- Identify the technique most suitable to simplify large datasets with reference to a specific goal of analysis.
- Recognize appropriate and inappropriate applications and approaches with reference to a specific goal of analysis.
- Justify the adoption of a specific path of analysis and the choices made during the analysis.
- Compare the results obtained using different approaches, evaluate the stability of results.
- Prepare data for the analysis.
- Analyze data using statistical software.
- Interpret and critically analyze results, emphasizing the most relevant conclusions both from a technical and from an interpretative point of view.
- Effectively present the output, using suitable visualization tools allowing an immediate and unbiased understanding of the most salient features in data.
- Face-to-face lectures
- Exercises (exercises, database, software etc.)
- Interactive class activities (role playing, business game, simulation, online forum, instant polls)
The course combines different teaching methods:
- Face-to-face classes introducing the most relevant theoretical concepts relative to each technique.
- Face-to-face classes discussing the appropriate application of the technique with reference to a specific problem and set of data. The choices left to the analyst and the possible available methods are presented. Criteria to evaluate and compare results are discussed.
- Exercises. Hands-on classes (in lab). Students are guided through the analysis of a real set of data with reference to a specific goal of analysis. In particular, attention is focused on:
- Proper definition of a suitable path of analysis.
- Development from scratch of a program allowing the analysis of data using standard functions and macro functions created for the course.
- Discussion of results and comparison of alternative strategies.
- Criteria to choose one out of the available solutions.
- Visualization of results and interpretation.
- Discussion (interactive class activities). Lessons where students are divided into small groups. A problem is described with reference to a set of data, and the output obtained under different choices and using alternative approaches is made available. The groups discuss the path of analysis that they would follow. The alternative views and proposals made by different groups are contrasted and discussed.
Continuous assessment | Partial exams | General exam
For attending students the exam has the same structure as that described for non-attending students. However, attending students can split the theoretical exam into two partial exams. Students qualify as attending if:
- They attend at least 4 of the labs dedicated to principal components and factor analysis.
- They attend at least 4 of the labs dedicated to cluster analysis and simple correspondence analysis.
- They attend at least one lab for each technique.
- For a detailed description of the practical and theoretical exams, please refer to the description of the exam for non-attending students.
The exam consists of two parts:
- A theoretical exam (40%).
- A practical exam (60%).
- The practical exam is the same for all students, whereas attending students are offered the possibility to take two partial exams (20% each) instead of the general theoretical exam.
- The theoretical exam is a written exam (open questions). It aims at assessing knowledge of the techniques introduced in the course, also with respect to the quantities that can be obtained using software. The exam includes some “paper-and-pencil” derivation questions, as well as questions about results obtained by applying the illustrated techniques to specific datasets. The latter questions do not test knowledge of the software, but do require an understanding of typical output.
- Exams are closed-book and closed-notes, but students need to use a simple calculator to answer some questions.
- The practical exam is an in-class (lab) computer assignment, and it is the same for both attending and non-attending students. It consists of the analysis of a dataset using the techniques illustrated during the course, and of the interpretation of the obtained results. Students must write their own computer programs from scratch. The exam is closed-book and closed-notes, but students can consult documents illustrating the commands needed to obtain different output.
- The final grade is obtained by combining the grades achieved in the different parts. Additional (decimal) points are assigned to students who took at least one of the three tests proposed in the preparatory course 20354 (Data Analysis) no later than one week before the beginning of the course. Students whose final result is not an integer receive extra decimal points for each submitted test (max 0.25 per test). The resulting grade is rounded up to the next integer only when its decimal part is higher than 0.7.
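The grade arithmetic described above can be sketched as follows. This is only one possible reading of the rules, not an official calculator: the function name `final_grade` is hypothetical, the bonus for each preparatory test is taken as an input capped at 0.25, and the grade is assumed to be truncated to the lower integer when the decimal part does not exceed 0.7 (the text only states when rounding up applies). For attending students, the theory grade would be the combination of the two partial exams (20% each).

```python
import math
from fractions import Fraction

THEORY_WEIGHT = Fraction(40, 100)      # theoretical exam: 40%
PRACTICAL_WEIGHT = Fraction(60, 100)   # practical exam: 60%
MAX_BONUS_PER_TEST = Fraction(1, 4)    # at most 0.25 per preparatory test
ROUND_UP_THRESHOLD = Fraction(7, 10)   # round up only above 0.7


def final_grade(theory, practical, test_bonuses=()):
    """Combine the two exam grades under the assumptions stated above."""
    # Exact fractions avoid floating-point surprises near the 0.7 threshold.
    raw = (THEORY_WEIGHT * Fraction(str(theory))
           + PRACTICAL_WEIGHT * Fraction(str(practical)))
    # Bonus decimals apply only when the combined result is not an integer.
    if raw.denominator != 1:
        raw += sum(min(Fraction(str(b)), MAX_BONUS_PER_TEST)
                   for b in test_bonuses)
    whole = math.floor(raw)
    # Assumption: truncate unless the decimal part is strictly above 0.7.
    return whole + 1 if raw - whole > ROUND_UP_THRESHOLD else whole


# Hypothetical grades: 0.4 * 28 + 0.6 * 27 = 27.4
print(final_grade(28, 27))                 # 27
print(final_grade(28, 27, [0.25, 0.25]))   # 27.9 -> 28
```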
Slides of the theoretical lessons are uploaded to Bboard. These notes are complete and cover the whole program. For more detailed discussions of the topics, students can refer to:
- R.A. JOHNSON, D.W. WICHERN, Applied Multivariate Statistical Analysis, Prentice Hall, 2002, 5th ed.
- J. LATTIN, J.D. CARROLL, P.E. GREEN, Analyzing Multivariate Data, Thomson, 2003.