20570 - DATA ANALYTICS AND VISUALIZATION
Course taught in English
Go to class group/s: 22
Class-group lessons delivered on campus
For an effective learning experience, it is strongly recommended to have basic notions of statistics, in particular of univariate and bivariate descriptive statistics and of the most relevant inferential concepts (samples, statistics, estimators, hypothesis testing, p-values). To this end, an online preparatory course (20354) is available, including online tests to verify the level of knowledge and understanding of the concepts used during the course. Students are expected to be able to work with Excel and Word (basic skills).
Modern graduates need to use data to a much greater extent than their predecessors. Data management (retrieving, filtering, or cleaning), exploratory data analysis, and appropriate data visualization are becoming more and more relevant in any field. In this course, students are introduced to big datasets and gain an applied understanding of the most relevant techniques of multivariate data analysis, with specific reference to unsupervised learning. The key goal of the course is to illustrate methods useful to analyze and summarize the most salient features of large datasets with respect to both the variables and the cases. The course features hands-on classes, where the application of each technique is discussed with reference to real datasets.
- Description and summary of multivariate samples.
- Variable reduction. Optimal indicators and Principal Components Analysis.
- Variable reduction. Latent concepts and Factor Analysis.
- Case reduction. Finding groups in data and Cluster Analysis.
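As a rough illustration of how the topics above translate into R, the following minimal sketch runs each step on datasets shipped with base R. The choice of `USArrests` and `ability.cov`, the single-factor model and the number of clusters (4) are assumptions made only for this example; they are not taken from the course material.

```r
# Illustrative sketch only: datasets and settings are assumptions,
# not the ones used in the course.
data(USArrests)

# Description and summary of a multivariate sample
print(summary(USArrests))
print(round(cor(USArrests), 2))

# Variable reduction: principal components on standardized variables
pca <- prcomp(USArrests, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))  # cumulative share of variance

# Variable reduction: one-factor model fitted from a covariance matrix
# (same call as in the examples of the factanal help page)
fa <- factanal(factors = 1, covmat = ability.cov)
print(fa$loadings)

# Case reduction: Ward hierarchical clustering on standardized data,
# cut into 4 groups (the number of groups is an arbitrary choice here)
z <- scale(USArrests)
hc <- hclust(dist(z), method = "ward.D2")
groups <- cutree(hc, k = 4)
print(table(groups))

# Visualization of variables (biplot) and of cases (dendrogram)
biplot(pca)
plot(hc, cex = 0.6)
```

In practice, choices such as standardizing the variables, the number of components or factors retained, and the number of clusters must be justified case by case.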
- Identify the technique most suitable to simplify relevant information in a dataset with reference to a specific goal of analysis.
- Recognize appropriate and inappropriate applications and approaches with reference to a specific goal of analysis.
- Justify the adoption of a specific path of analysis and the choices made during the analysis.
- Compare the results obtained using different approaches and evaluate the stability of results.
- Write R scripts to analyze data.
- Design and develop a script in the R programming language that reads, manipulates, analyzes and visualizes data.
- Interpret and critically analyze results, emphasizing the most relevant conclusions both from a technical and from an interpretative point of view.
- Effectively present the output, using suitable visualization tools allowing an immediate and unbiased understanding of the most salient features in data.
- Face-to-face lectures
- Exercises (exercises, database, software etc.)
- Group assignments
- Interactive class activities (role playing, business game, simulation, online forum, instant polls)
The course combines different types of lessons:
- Theory. Lessons introducing the most relevant theoretical concepts relative to each technique.
- Theory&app. Lessons illustrating and discussing the appropriate application of the technique with reference to a specific problem and set of data. The choices left to the analyst and the possible available methods are presented. Criteria to evaluate and compare results are discussed.
- R-labs. Lessons illustrating and discussing the scripts in the R-programming language employed to obtain the results discussed during the theory&app lessons.
- Hands-on. In-class assignments in which students, divided into groups, are required to analyze a set of data. In particular, attention is focused on:
- Proper definition of a suitable path of analysis.
- Development from scratch of a program allowing the analysis of data using standard functions and macro functions created for the course.
- Discussion of results and comparison of alternative strategies.
- Criteria to choose one out of the available solutions.
- Visualization, interpretation and discussion of results.
- Discussion. Interactive class activities where students discuss the analysis developed during the hands-on classes. Instant polls and tests are used to identify the alternative views and choices, which are then contrasted and discussed.
During the course, there will be 3 blocks of hands-on and discussion classes, one for each of the three techniques taught during the course. Each block of lessons is dedicated to an assignment requiring the analysis of a set of data using the corresponding technique.
These assignments aim at assessing the ability to design a workflow to analyze data using the software R, as well as the ability to draw substantive conclusions on the data at hand based on the software output.
These classes are open to all students, and all students present in class are required to actively participate and contribute to the discussion.
Nonetheless, students who regularly attend the classes dedicated to each technique, work on the assignments, take the tests proposed during classes and actively participate in the discussion of the results obtained qualify as attending students. Attendance at the lessons, quality of the answers given in tests, participation in the discussion and peer evaluation all contribute to the grade assigned to the activity in each block (2 points are assigned for the work related to each technique, thus 6 points in total, counting for 20% of the final grade).
Students who get at least 3 points (out of 6) in the above-mentioned activities can take the exam as attending students. Attending students are evaluated through a theoretical exam (counting for 12 points, 40% of the final grade, the same as for non-attending students) but have the opportunity to take a simplified practical exam (counting for 12 points, 40% of the final grade) rather than the complete practical exam prepared for non-attending students. This opportunity can be used only once, on the first occasion when the student registers for the practical exam.
For details on the theoretical and practical exams, please refer to the assessment methods for non-attending students.
For non-attending students the exam consists of two parts:
- A theoretical exam (40%).
- A practical exam (60%).
- The practical exam (denoted as S - scritto - on the Bocconi website) is an in-class (lab) computer assignment. It consists of the analysis of a dataset using the techniques illustrated during the course, and of the interpretation of the results obtained. It aims at assessing the ability to design a workflow to analyze data using the software R, as well as the ability to draw substantive conclusions on the data at hand based on the software output.
Students must write their own R programs from scratch. The exam is closed-book and closed-notes, and access to the internet during the exam is prohibited. For the exam, students will use their personal laptops. The desktop must be clean and all the material distributed during the course must be removed from the laptop; the instructor may inspect the content of the laptop before or during the exam. Only the material made available by the instructor on Blackboard can be consulted; this will include the manuals, prepared for the course, on the R commands needed to obtain the different outputs.
- The theoretical exam (denoted as O - orale - on the Bocconi website) is an online exam with open-ended and closed-ended questions. It aims at assessing the knowledge of the techniques introduced in the course, also with respect to the output obtained using software. The exam includes some “paper-and-pencil” derivation questions, as well as questions about results obtained applying the illustrated techniques to specific datasets. The latter questions do not test the knowledge of the software, but do require an understanding of typical output. The exam is closed-book and closed-notes, but students need to use a simple calculator to answer some questions. The exam will be run using the students' laptops and Lockdown Browser.
For both attending and non-attending students the theoretical exam can be split into two partial exams. The first partial will cover Principal Components Analysis (and will count for 4 points). The second partial will cover Factor Analysis and Cluster Analysis (and will count for 8 points).
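The point allocations stated above can be summarized as follows. This is only a restatement of the figures already given, under the assumption that the final grade is expressed on a 30-point scale (which follows from 12 points corresponding to 40%); the 18 points for the complete practical exam are derived as 60% of 30 and are not stated explicitly in the text.

```latex
\[
\begin{aligned}
\text{Attending:}\quad
  & \underbrace{6}_{\text{class activities, } 20\%}
  + \underbrace{12}_{\text{theoretical, } 40\%}
  + \underbrace{12}_{\text{simplified practical, } 40\%} = 30 \\
\text{Non-attending:}\quad
  & \underbrace{12}_{\text{theoretical, } 40\%}
  + \underbrace{18}_{\text{practical, } 60\%} = 30 \\
\text{Theoretical exam taken as partials:}\quad
  & \underbrace{4}_{\text{first partial}}
  + \underbrace{8}_{\text{second partial}} = 12
\end{aligned}
\]
```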
Slides of the theoretical lessons are uploaded on Blackboard. These notes are complete and cover the whole program. For more detailed discussions of the topics, students can refer to:
- R.A. JOHNSON, D.W. WICHERN, Applied Multivariate Statistical Analysis, Prentice Hall, 2002, 5th ed or subsequent editions OR
- J. LATTIN, J.D. CARROLL, P.E. GREEN, Analyzing Multivariate Data, Thomson, 2003.