Data Engineering

UC Berkeley, Spring 2021

Joe Hellerstein

hellerstein@berkeley.edu

Pronouns: he/him/his

OH: Tu 10:30-11:30AM, Th 1-2PM

Aditya Parameswaran

adityagp@eecs.berkeley.edu

Pronouns: he/him/his

OH: W 3-4PM, F 11AM-12PM

The schedule and dates listed below are tentative and subject to change. See the syllabus for course information.


Week Date Lecture Assignment
1 Tu 1/19 1. Introduction & Data Science Lifecycle  
  Th 1/21 2. Logistics and Relational model & algebra  
2 Tu 1/26 3. Relational algebra, cont.; SQL intro  
  Th 1/28 4. SQL intro, Views  
3 Tu 2/2 5. SQL subqueries and aggregation  
  Th 2/4 6. More SQL: window functions, sampling, string manipulation Project 1 (due 2/19)
4 Tu 2/9 7. SQL updates, DDL, referential integrity, constraints  
  Th 2/11 8. Index selection and performance tuning  
5 Tu 2/16 9. Index selection and performance tuning (II)  
  Th 2/18 10. Index selection and performance tuning (III); 11. Three Data Models: Relations, Tensors and Dataframes Multivitamin 1 (due 3/4)
6 Tu 2/23 11. Relations, Tensors and Dataframes, cont.  
  Th 2/25 12. Data Preparation Full notebook and Slide-oriented notebook Project 2 (due 3/12)
7 Tu 3/2 12b. Data Preparation Slide-oriented notebook  
  Th 3/4 12b. Data Preparation, cont.  
8 Tu 3/9 13. Data Cleaning Slide-oriented notebook Multivitamin 2 (due 3/17)
  Th 3/11 13. Data Cleaning, cont.  
9 Tu 3/16 14. Normalization and ER  
  Th 3/18 15. Semistructured Data Multivitamin 3 (due 3/31)
  Tu 3/23 Spring Break  
  Th 3/25 Spring Break  
10 Tu 3/30 16. Querying semistructured data Project 3 (due 4/15)
  Th 4/1 16. Querying semistructured data, cont.  
11 Tu 4/6 17. Spreadsheets  
  Th 4/8 18. Graph data: Property graph models, triples/RDF; 19. BI: OLAP, summarization, and visualization Multivitamin 4 (due 4/19)
12 Tu 4/13 20. Transactions  
  Th 4/15 21. Data Pipelines  
13 Tu 4/20 22. Approximation: sampling and sketching Project 4 (due 5/4)
  Th 4/22 23. Storage: Column vs. row, Compression, Exchange formats  
14 Tu 4/27 24. Parallelization: Map-Reduce, Spark, Parallel DBMS, Dask/Modin Multivitamin 5 (due 5/3 at 10 AM)
  Th 4/29 25. Security and Privacy; 26. Reflections Project 5 (due 5/14)
15 Tu 5/4 RRR Week  
  Th 5/6 RRR Week