top of page

Real Estate Webscrape Stream

Problem:

  • Real estate data needs to be aggregated, cleaned, and analyzed in real-time to provide up-to-date market insights.

​​

Solution:

  • Data Orchestration - Configured Airflow to run Ingestion job daily, followed by a Spark Job.

  • Data Ingestion - Scraped data from Redfin using Python and OpenAI API.

  • Data Processing - Used Kafka for streaming and Spark for processing and transforming the data.

  • Data Storage - Stored processed data in Cassandra.​

​​

redfin-logo-square-red-1200.png

System Architecture

Web-scrapping:

  • Python - Basic info and photos

  • OpenAI API - Extract public listing information

​

Requirements: Beautifulsoup4, playwright, openai, kafka-python, pyspark, cassandra-driver​​​​​​

system_architecture.png

Video Walkthrough

Watch this video to see the full implementation of the data pipeline. It demonstrates the process of scraping the housing data from Redfin, processing the data through Kafka & Spark, and storing the data in Cassandra. 

​​

Conclusion

This project provided valuable experience in building end-to-end data pipelines using cutting-edge technologies. The ability to scrape, process, and store real estate data in real-time showcases the practical application of data engineering principles. The insights gained from this project are directly applicable to scenarios requiring efficient data handling and analysis, such as market trend analysis and investment decision-making. This project highlights the potential for leveraging automated data workflows to drive business intelligence.

Other Stream Projects!

Real Time Voting

Developed a real-time voting application for a simulated election

AWS Food Delivery

Developed a real-time data pipeline for food delivery order data.

bottom of page