Statistics Complements for Data Science

Base Knowledge

The Statistics Complements for Data Science curricular unit builds on the fundamentals of Statistics, particularly those covered in the Statistical Data Analysis curricular unit. Programming knowledge from the Programming for Data Science curricular unit is an advantage.

Teaching Methodologies

The classes are designed, according to the curriculum plan, to be both theoretical and practical. They are planned and prepared to actively engage students at various moments or throughout the entire class.

In the theoretical part of the lesson, the expository method will frequently be used to introduce concepts, fundamental results, and methods, interspersed with tasks that encourage active participation by all students (interactive lectures). These tasks include questions asked by the teacher and by students, orally and/or on a platform, as well as small-group debates and discussions on selected topics from the lecture.

The practical part is designed to develop the listed skills comprehensively. This will be achieved through worked examples with commentary and/or problem-solving under the guidance of the teacher. Autonomous work and work in small groups will be encouraged, progressing towards project-based learning with the completion of an assignment. Theory and practice will be closely linked, with a central focus on visualizing and dealing with real scenarios. The main computing tool is Google Colab with the Python language.

It is assumed that students attend classes regularly and are available for continued involvement beyond the classroom. This includes initiating or completing tasks agreed upon during class.

All supporting materials are available on the InforEstudante|Nonio platform. Other platforms that allow for interaction may also be used.

Learning Outcomes

Statistical analysis and techniques play a fundamental role in Data Science, which begins, according to the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology and the Foundational Methodology for Data Science, with understanding and preparing the data. Statistics also supports resampling and simulation. The goals and skills of this curricular unit reflect these roles.

Goals:

  • Examine/understand data as part of a specific Data Science task, using proper statistical tools.
  • Select and sequentially execute data preparation techniques appropriate to the Data Science task at hand, with a special focus on the statistical techniques.
  • Identify opportunities for applying resampling and simulation techniques and perform them in simple cases.

Skills:

  • Integrate data from multiple sources to create a dataset in simple cases.
  • Describe the dataset, using statistical measures and plots.
  • Perform exploratory analysis of the dataset.
  • Check the quality of the dataset according to predefined quality dimensions.
  • Clean the dataset, with a special focus on outliers and missing values, by applying proper statistical techniques.
  • Engineer new features relevant to the problem at hand and/or for dimensionality reduction.
  • Select features, based on statistical criteria not dependent on the machine learning techniques to be used later, for the purpose of redundancy elimination and dimensionality reduction.
  • List and properly apply basic resampling techniques.
  • Outline a simple simulation.
  • Implement the required code in the Python programming language.
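
As an illustration of how some of the skills above might look in practice, the following is a minimal sketch, assuming pandas and NumPy are available (as they are by default in Google Colab). The two small tables, their column names, and the choice of median imputation and the 1.5 × IQR rule are illustrative assumptions, not the techniques prescribed by the course.

```python
import numpy as np
import pandas as pd

# Two hypothetical sources sharing an "id" key (made-up illustrative data).
customers = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                          "age": [23, 35, np.nan, 41, 29, 38]})
purchases = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                          "spend": [120.0, 135.0, 128.0, 131.0, 900.0, 126.0]})

# Integrate the two sources into a single dataset.
data = customers.merge(purchases, on="id", how="inner")

# Describe the dataset with summary statistics.
summary = data[["age", "spend"]].describe()

# Quality check: count missing values per column.
missing_counts = data.isna().sum()

# Clean: impute missing ages with the median (one simple option among many).
data["age"] = data["age"].fillna(data["age"].median())

# Flag outliers in "spend" with the 1.5 * IQR rule.
q1, q3 = data["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
data["spend_outlier"] = (
    (data["spend"] < q1 - 1.5 * iqr) | (data["spend"] > q3 + 1.5 * iqr)
)

print(summary)
print(missing_counts)
print(data)
```

Flagging outliers in a separate column, rather than deleting them, keeps the decision about how to treat them open until the Data Science task is better understood.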

Program

1. Data understanding
1.1. Stages and tasks
1.2. Statistical toolkit
1.3. The role of Exploratory Data Analysis

2. Statistical perspective on data preparation
2.1. Stages and tasks
2.2. Statistical tools: missing values treatment; outlier treatment; discretization, normalization and other transformations; techniques to eliminate redundancy; techniques for dimensionality reduction.
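
Three of the tools listed in 2.2 (normalization, discretization, and redundancy elimination) can be sketched as follows. This is a hedged example on synthetic data, assuming pandas and NumPy; the column names, the number of bins, and the 0.95 correlation threshold are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "x_copy" is a linear function of "x", so it is redundant.
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + 1,
    "z": rng.normal(0, 1, size=200),
})

# Normalization: z-score standardization (mean 0, standard deviation 1).
standardized = (df - df.mean()) / df.std(ddof=0)

# Discretization: equal-frequency binning of "x" into 4 bins (codes 0..3).
x_bins = pd.qcut(df["x"], q=4, labels=False)

# Redundancy elimination: drop one feature from each pair whose absolute
# Pearson correlation exceeds the (assumed) threshold of 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

The correlation filter depends only on the data, not on any downstream model, which matches the kind of model-independent statistical criterion mentioned in the skills list.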

3. Topics of Computational Statistics
3.1. Introduction to resampling
3.2. Introduction to simulation
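
To give a flavour of topics 3.1 and 3.2, the sketch below uses only the Python standard library. It shows a percentile bootstrap confidence interval for a mean and a simple Monte Carlo estimate of π. The sample values, the function names `bootstrap_ci` and `estimate_pi`, and the numbers of resamples and points are hypothetical choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

# Hypothetical sample of measurements (made-up values).
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.8, 12.5, 9.5, 11.0]


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for a statistic."""
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat(random.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def estimate_pi(n_points=100_000):
    """Monte Carlo estimate of pi from random points in the unit square."""
    # The fraction of points inside the quarter circle approximates pi / 4.
    inside = sum(
        1 for _ in range(n_points)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4 * inside / n_points


ci_low, ci_high = bootstrap_ci(sample)
pi_hat = estimate_pi()

print(f"95% bootstrap CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Monte Carlo estimate of pi: {pi_hat:.4f}")
```

Both examples follow the same pattern: generate many random replicates, compute a quantity on each, and summarize the resulting distribution.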

Curricular Unit Teachers

Internship(s)

No

Bibliography

Fundamental:

  • Bruce, P., Bruce, A., Gedeck, P. (2020). Practical Statistics for Data Scientists, 2nd Edition. O’Reilly.
  • Ciaburro, G. (2020). Hands-On Simulation Modeling with Python. Packt Publishing.
  • Gama, J., Carvalho, A.P.L., Faceli, K., Lorena, A.C., Oliveira, M. (2017). Extração de Conhecimento de Dados (Data Mining), 3.ª Edição. Edições Sílabo.
  • Jafari, R. (2022). Hands-On Data Preprocessing in Python. Packt Publishing.
  • James, G., Witten, D., Hastie, T., Tibshirani, R., Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. https://www.statlearning.com/
  • Slides and hands-on available at InforEstudante|Nonio.

Complementary:

  • García, S., Luengo, J., Herrera, F. (2014). Data Preprocessing in Data Mining. Springer.
  • Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E. (2019). Multivariate Data Analysis, 8th Edition. Cengage.
  • Kuhn, M., Johnson, K. (2020). Feature Engineering and Selection – A Practical Approach for Predictive Models. CRC Press.
  • Moreira, J., Carvalho, A., Horvath, T. (2018). A General Introduction to Data Analytics. Wiley.
  • Mount, G. (2021). Advancing into Analytics: From Excel to Python and R. O’Reilly.