Teaching Methodologies
Teaching takes place in class, where concepts, techniques, and methods are presented with a strong focus on solving practical
problems. Software is used to support problem-solving.
Learning Results
Everyday life generates vast amounts of data through websites, cell phones, wearable devices, and sensors associated with the
Internet of Things, among other sources. Processing this volume of data requires specialized tools, because it exceeds the capacity of
personal computers and even of some servers, which makes distributed systems necessary. The main objective of this course is to
familiarize students with the most important information technologies used to manipulate, store, and analyze large amounts of data;
a significant example is the Apache Spark framework, used for distributed computing.
Program
1. Big Data Fundamentals
1.1 Concepts and motivation.
1.2 The 5 Vs and data types.
1.3 Architectures and applications.
2. The Hadoop Ecosystem
2.1 HDFS and distributed storage.
2.2 MapReduce: principles and examples.
2.3 Ecosystem components.
3. Apache Spark
3.1 Core concepts and advantages.
3.2 RDDs, DataFrames, and transformations.
3.3 Persistence and actions.
4. Large-Scale Data Processing
4.1 Data pipelines.
4.2 Integration with distributed systems.
4.3 Use cases.
5. Machine Learning in Big Data
5.1 MLlib: basic models.
5.2 Distributed evaluation.
5.3 Applied examples.
Internship(s)
No
Bibliography
Rajaraman, A., & Ullman, J. (2011). Mining of massive datasets. Cambridge University Press.
Ryza, S., et al. (2017). Advanced analytics with Spark: Patterns for learning from data at scale. O’Reilly Media.
Mendelevitch, O., Stella, C., & Eadline, D. (2016). Practical data science with Hadoop and Spark: Designing and building effective analytics
at scale. Addison-Wesley.
Deitel, P., & Deitel, H. (2019). Intro to Python for computer science and data science: Learning to program with AI, big data and the cloud.
Pearson.
Klosterman, S. (2019). Data science projects with Python: A case study approach to successful data science projects using Python,
pandas, and scikit-learn. Packt Publishing.
Triguero, I., & Galar, M. (2023). Large-scale data analytics with Python and Spark: A hands-on guide to implementing machine learning
solutions. Cambridge University Press.