20596 - MACHINE LEARNING
Course taught in English
Class 23: DANIELE DURANTE
Prerequisites: For a fruitful and effective learning experience, basic preliminary knowledge of mathematics and linear algebra, descriptive statistics, probability and random variables, simple and multiple linear regression, likelihood-based inference, and generalized linear models is strongly recommended. Students should also be familiar with basic statistical software.
In 2009, the Chief Economist of Google, Hal Varian, predicted that Data Science would be the most attractive job of the following ten years. He also claimed that understanding, processing and extracting value from data were going to be hugely important skills in many careers. He was right: Data Scientist has been listed among the top jobs in the United States for several years now. The reason for this huge demand is simple and can be found in the words of Eric Schmidt, former CEO of Google: "we create as much information in two days now as we did from the dawn of man through 2003". But information (data) is not knowledge. Translating one into the other requires skills in database management, statistical learning, machine learning and computational statistics, along with good intuition, the ability to deal with data, and the capacity to understand the analytic goals and interpret the final outputs. The course in Machine Learning aims to foster these skills and to provide students with the instruments and the mindset to successfully deal with the wide range of data analytic problems they may encounter in their future jobs.
- INTRODUCTION: A smooth introduction to Machine Learning
- LINEAR METHODS: High-dimensional linear regression; Logistic regression; Linear and quadratic discriminant analysis
- MODEL ASSESSMENT AND SELECTION: Bias-variance trade-off; Training, test and validation sets; Cross-validation; Bootstrap
- REGULARIZATION AND SHRINKAGE: Subset selection; Ridge regression; Lasso and related algorithms
- METHODS BEYOND LINEARITY: Regression and smoothing splines; Local linear regression; Kernel methods; Generalized additive models
- TREE-BASED METHODS: Regression and classification trees; Bagging; Random forests; Boosting
- BEYOND TREE-BASED METHODS: Support vector machines; Neural networks
The above methods will also be implemented during LAB SESSIONS on real-world case studies. Code and implementation in classical statistical software, with a main focus on R, are also part of the course topics.
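As an illustration of the kind of implementation covered in the lab sessions, here is a minimal, hypothetical sketch (not official course material) of one of the listed techniques, ridge regression with a held-out validation set, written in Python with NumPy; the simulated data and the grid of penalty values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 100 observations, 10 predictors, only 3 truly active
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta + rng.normal(scale=0.5, size=n)

# Training / validation split (first 70 rows train, last 30 validate)
X_tr, X_te = X[:70], X[70:]
y_tr, y_te = y[:70], y[70:]

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) b = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Choose the penalty by mean squared error on the held-out set
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
errors = {lam: np.mean((y_te - X_te @ ridge_fit(X_tr, y_tr, lam)) ** 2)
          for lam in lambdas}
best_lam = min(errors, key=errors.get)
print(best_lam, errors[best_lam])
```

The closed-form solution makes the role of the penalty explicit; in the labs, equivalent fits would typically be obtained through dedicated R packages rather than hand-coded linear algebra.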
- Explain the methodology and theory underlying the classical machine learning methods
- Illustrate the technical aspects related to the implementation of classical machine learning methods
- Recognize the distinctive properties of each machine learning technique
- Identify the most suitable machine learning technique for a given data analytic problem
- Summarize differences and similarities between multiple machine learning techniques
- Examine the relevant research questions underlying a real-data analytic problem
- Choose a machine learning technique coherent with the analytic question and apply it to the data at hand
- Identify relevant structures underlying the data and effectively predict unobserved events
- Discuss the empirical output produced by a machine learning technique
- Connect different machine learning techniques to improve predictive performance in complex analytic problems
- Face-to-face lectures
- Exercises (exercises, database, software etc.)
- Case studies / Incidents (traditional, online)
- Individual assignments
- Group assignments
- Interactive class activities (role playing, business game, simulation, online forum, instant polls)
Classical face-to-face lectures will focus on the presentation and discussion of the machine learning techniques covered by the course, with particular attention to methodology, theory and computational methods. To improve the learning experience and encourage interaction, illustrative case studies and in-class exercises may also be considered.
A series of lab sessions, with students working on their own laptops, will also be offered. These classes will typically (but not always) consist of two main parts:
- The students will be guided in the implementation of the machine learning techniques in standard statistical software, with a main focus on R. Some Python code will also be made available as supplementary material. To download R or RStudio, see https://www.r-project.org/ and https://www.rstudio.com/.
- After the guided implementation, an in-class individual assignment (performed on a data competition platform) will ask the students to solve a specific predictive problem from a data analytic case study, leveraging suitable machine learning tools. This interactive class activity is expected to improve the students' autonomy in answering a variety of real-world analytic questions, and will serve as a self-assessment opportunity. Further online data competitions may be offered as optional individual or group homework, providing additional training material for interested students.
Due to the nature of the course, attending and non-attending students will be evaluated with the same criteria, through a final general exam only. This assessment will consist of two main parts.
- A traditional written individual exam consisting of open-ended questions and small exercises. The focus is on evaluating the students' methodological, theoretical and computational understanding of the machine learning techniques presented in the face-to-face lectures.
- An individual assignment based on a data challenge, in which students are asked to develop and apply a data analytic strategy to answer a predictive problem. This data challenge will be a longer and more structured version of those proposed in the lab sessions, and will take place towards the end of the course. The assignment is managed via an online data competition platform, and the evaluation will consider both the predictive performance of the analytic approach proposed by the student and the quality of a document describing the methods considered, the code, the final results and related comments.
Grading Rule: Let X denote the grade of the traditional written individual exam and let Y be the grade of the individual assignment. Then, if Y is greater than or equal to X, the final grade is 0.3*Y+0.7*X. Otherwise, if Y is less than X, the final grade is X.
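The grading rule above can be made concrete with a short, hypothetical Python snippet (the function name and the example grades are illustrative, not part of the course materials):

```python
def final_grade(x, y):
    """Combine written-exam grade x and assignment grade y per the stated rule."""
    # The assignment counts 30% only when it exceeds (or equals) the exam grade,
    # so it can never lower the final mark below the written-exam grade.
    return 0.3 * y + 0.7 * x if y >= x else x

print(final_grade(26, 30))  # assignment above exam: 0.3*30 + 0.7*26 = 27.2
print(final_grade(28, 24))  # assignment below exam: final grade stays 28
```

In other words, the assignment grade Y can only improve the final mark relative to the written exam grade X, never reduce it.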
The course relies on two books that complement each other and are freely available online.
- Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition). Springer.
- James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
Slides summarizing the contents presented in class will also be provided. Students interested in individually deepening specific concepts will receive additional reading materials upon request. These additional materials will not be part of the final evaluation.