Hover over tabs to view Projects!
Real Estate Webscrape Stream
Problem:
-
Real estate data needs to be aggregated, cleaned, and analyzed in real-time to provide up-to-date market insights.
​​
Solution:
-
Data Orchestration - Configured Airflow to run Ingestion job daily, followed by a Spark Job.
-
Data Ingestion - Scraped data from Redfin using Python and OpenAI API.
-
Data Processing - Used Kafka for streaming and Spark for processing and transforming the data.
-
Data Storage - Stored processed data in Cassandra.​
​​

System Architecture
Web-scrapping:
-
Python - Basic info and photos
-
OpenAI API - Extract public listing information
​
Requirements: Beautifulsoup4, playwright, openai, kafka-python, pyspark, cassandra-driver​​​​​​

Video Walkthrough
Watch this video to see the full implementation of the data pipeline. It demonstrates the process of scraping the housing data from Redfin, processing the data through Kafka & Spark, and storing the data in Cassandra.
​​
Conclusion
This project provided valuable experience in building end-to-end data pipelines using cutting-edge technologies. The ability to scrape, process, and store real estate data in real-time showcases the practical application of data engineering principles. The insights gained from this project are directly applicable to scenarios requiring efficient data handling and analysis, such as market trend analysis and investment decision-making. This project highlights the potential for leveraging automated data workflows to drive business intelligence.