* Timespan: December 2022 - January 2023
* Description:
- Supervised university project with a technical deliverable.
- Set of Python scripts for statistical processing of textual data.
* Improvement paths:
- Extend the project by using Twitter's API to retrieve raw data continuously.
- Resume the project, varying input data sources and adding new processing operations.
This project involved producing Python scripts for statistical processing of textual data, based on data fields extracted from Tweets.
From an educational point of view, the project was organized in two sections, each intended to build the knowledge needed to master a specific data structure:
- Understanding and handling lists.
- Understanding and using dictionaries.
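
As an illustration of how the two structures work together, here is a minimal sketch that counts word frequencies in a few texts. The sample tweets and the function names (`tokenize`, `word_frequencies`) are assumptions made for this example, not the project's actual code.

```python
# Hypothetical sample data: a list of tweet texts (not taken from the project).
tweets = [
    "Learning Python lists today",
    "Python dictionaries map keys to values",
    "Lists and dictionaries work well together",
]


def tokenize(text):
    """Split a text into lowercase words, returned as a list of strings."""
    return text.lower().split()


def word_frequencies(texts):
    """Count how often each word occurs, using a plain dictionary."""
    counts = {}
    for text in texts:
        for word in tokenize(text):
            # dict.get returns 0 the first time a word is seen.
            counts[word] = counts.get(word, 0) + 1
    return counts


frequencies = word_frequencies(tweets)

# Turn the dictionary into a list of (word, count) pairs,
# most frequent first, ties broken alphabetically.
top_words = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))[:3]
print(top_words)
```

The list keeps the order of the texts and words, while the dictionary gives direct access to a count by word.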
In particular, the JavaScript Object Notation (JSON) and comma-separated values (CSV) data formats were used, and regular-expression-based methods were defined to identify data.
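
The sketch below shows one way these three tools might be combined: Tweets are read from JSON, hashtags and mentions are identified with regular expressions, and the result is written as CSV. The field names (`id`, `user`, `text`), the sample data and the two patterns are assumptions made for illustration; the project's real inputs and expressions may differ.

```python
import csv
import io
import json
import re

# Hypothetical raw input; the field names are assumptions for this example.
RAW_JSON = """[
  {"id": 1, "user": "alice", "text": "Loving #Python and #data, thanks @bob!"},
  {"id": 2, "user": "bob", "text": "No tags here"}
]"""

# Simplified patterns: a marker character followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def extract_fields(tweet):
    """Build one flat record from a Tweet dictionary."""
    text = tweet["text"]
    return {
        "id": tweet["id"],
        "user": tweet["user"],
        "hashtags": " ".join(HASHTAG_RE.findall(text)),
        "mentions": " ".join(MENTION_RE.findall(text)),
    }


def tweets_to_csv(raw_json):
    """Parse a JSON array of Tweets and return the extracted fields as CSV text."""
    rows = [extract_fields(tweet) for tweet in json.loads(raw_json)]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["id", "user", "hashtags", "mentions"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = tweets_to_csv(RAW_JSON)
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; replacing it with a file opened using `newline=""` would produce a CSV file on disk.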
From a personal point of view, this algorithmic experiment provided me with a detailed introduction to data processing methods and operations.
I was introduced to the main stages of a complete data processing procedure:
- Gathering raw data, which forms the input for data processing. This stage involves reading the information emitted by a data source, reorganizing the identified data and storing it in initial data structures.
- Data cleaning, which involves keeping only the data relevant to the processing at hand.
- Actual processing of extracted data, involving statistical studies, vector representations and other processing methods.
- Graphical display of the various statistical analyses carried out.
- Analysis and interpretation of results.
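
The first four stages above can be sketched as a small pipeline. This is a minimal illustration using only the standard library: the sample data, the function names (`gather`, `clean`, `process`, `display`) and the text-based bar chart are assumptions chosen for this example, and a plotting library could replace the display step. The final stage, interpretation, is left to the reader of the chart.

```python
import json
import re
from collections import Counter

# Hypothetical raw data standing in for what a data source would emit.
RAW = (
    '[{"text": "Python is fun! https://example.com"},'
    ' {"text": "Data, data and more DATA"},'
    ' {"text": ""}]'
)


def gather(raw):
    """Stage 1: read the raw input into an initial structure (list of dicts)."""
    return json.loads(raw)


def clean(records):
    """Stage 2: keep the text field only, dropping URLs and empty texts."""
    texts = []
    for record in records:
        text = re.sub(r"https?://\S+", "", record.get("text", "")).strip()
        if text:
            texts.append(text.lower())
    return texts


def process(texts):
    """Stage 3: compute word counts and one bag-of-words vector per text."""
    tokens = [re.findall(r"[a-z]+", text) for text in texts]
    counts = Counter(word for words in tokens for word in words)
    vocabulary = sorted(counts)
    vectors = [[words.count(term) for term in vocabulary] for words in tokens]
    return counts, vocabulary, vectors


def display(counts, width=20):
    """Stage 4: render the counts as a simple text bar chart."""
    top = max(counts.values())
    lines = []
    for word, n in counts.most_common():
        bar = "#" * round(width * n / top)
        lines.append(f"{word:<8}{bar} {n}")
    return "\n".join(lines)


records = gather(RAW)
texts = clean(records)
counts, vocabulary, vectors = process(texts)
print(display(counts))
```

Each stage takes the output of the previous one, so a stage can be replaced (for example, a different data source in `gather`) without touching the others.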
Progress on the project can be followed through the documents that define the expectations for each part. No report was required at the end of the project.
Documents and deliverables