Project Walkthrough

Introduction

Datadrip is a web application developed as a capstone project by a team of data scientists (Yifan Wang, Trevor Dalton, Diana Nguyen, Kamil Mielczarek, and Noah Prozan) as part of the UC Berkeley Masters in Data Science program. The project had two objectives: to give retail investors clearer insight into quarterly earnings reports and financial documents so they can make informed decisions, and to streamline the workflow of financial analysts by converting presentation charts into Excel sheets that feed directly into downstream modeling.

The core functionality of Datadrip relies on generative AI models. For retail investors, the application uses text summarization to distill complex financial information into digestible insights. For financial analysts, it uses a chart derendering process, with YOLOv5 for chart extraction and MatCha for chart summarization and derendering, to produce Excel-ready versions of presentation charts for rapid modeling.
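As a rough illustration of what "Excel-ready" can mean in practice, the sketch below turns a linearized table string into CSV text that a spreadsheet can open. It is not Datadrip's actual code. The `<0x0A>` row separator and `|` cell separator are assumptions about the derendering output format, the function names are hypothetical, and the sample values are made up for illustration.

```python
import csv
import io


def linearized_table_to_rows(text, row_sep="<0x0A>", cell_sep="|"):
    """Split a linearized table string into rows of trimmed cells.

    The separators are assumptions about the model output, not a
    documented Datadrip format; adjust them to match the real output.
    """
    rows = []
    for line in text.split(row_sep):
        line = line.strip()
        if not line:
            continue  # skip empty fragments between separators
        rows.append([cell.strip() for cell in line.split(cell_sep)])
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that Excel can open directly."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical derendered chart, with made-up numbers.
sample = "Quarter | Revenue <0x0A> Q1 | 10.5 <0x0A> Q2 | 12.0"
rows = linearized_table_to_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

CSV is used here only because it needs nothing beyond the standard library; writing a native Excel workbook would require a third-party package.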

Throughout the 16-week development period, the Datadrip team used Python, PyTorch, generative AI, and computer vision to build the project. The project underscored the potential of AI to automate financial workflows and broaden access to critical financial information. Datadrip represents a step forward in AI-driven financial analysis tools and demonstrates how these technologies can support data-driven decision-making within the financial sector.

Architecture

Figure 1: Datadrip Architecture Diagram

Datadrip follows a three-step process:

  1. Extract all charts and text from the given financial documents using the computer vision model YOLOv5 and the OCR engine Tesseract
  2. Understand each extracted chart and create a summary and table for it using MatCha
  3. Use Mistral to create an insightful summary of the presentation, then serve both the summary and the derendered charts to the user

The threading together of multiple sophisticated models is part of what makes Datadrip unique.
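The three steps above can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions, not the team's implementation: every name in it (`run_pipeline`, `ChartResult`, `Report`, and the four stage callables) is hypothetical, and the stages are passed in as plain functions so that no particular YOLOv5, Tesseract, MatCha, or Mistral API is implied. Feeding the chart summaries into the final summarizer is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChartResult:
    summary: str  # natural-language description of the chart
    table: str    # derendered data table for the chart


@dataclass
class Report:
    summary: str
    charts: List[ChartResult] = field(default_factory=list)


def run_pipeline(
    pages: List[str],
    extract_charts: Callable[[str], List[str]],      # step 1: chart detection
    extract_text: Callable[[str], str],              # step 1: OCR
    derender_chart: Callable[[str], ChartResult],    # step 2: chart to table
    summarize: Callable[[str], str],                 # step 3: LLM summary
) -> Report:
    """Chain the three stages: extract, derender, summarize."""
    texts: List[str] = []
    charts: List[ChartResult] = []
    for page in pages:
        texts.append(extract_text(page))
        for crop in extract_charts(page):
            charts.append(derender_chart(crop))
    # Assumption: the summarizer sees page text plus chart summaries.
    context = "\n".join(texts + [c.summary for c in charts])
    return Report(summary=summarize(context), charts=charts)


# Usage with stand-in stages; a real system would plug in model calls here.
report = run_pipeline(
    pages=["page-1", "page-2"],
    extract_charts=lambda page: [f"{page}/chart-0"],
    extract_text=lambda page: f"text of {page}",
    derender_chart=lambda crop: ChartResult(f"summary of {crop}", "A | B"),
    summarize=lambda context: f"{len(context.splitlines())} lines summarized",
)
print(report.summary, len(report.charts))
```

Passing the stages in as callables keeps each model swappable and lets the orchestration be tested without loading any model weights.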

About the Team

Trevor Dalton

Trevor is a Data Engineer at equity research firm M Science, where he works on the video games team to deliver data and insights to AAA game publishers. He worked on the Computer Vision, Data Engineering, and UI/UX portions of Datadrip. Outside of Datadrip, Trevor is an avid reader of both data science and science fiction literature.

Diana Nguyen

Diana brings a decade of expertise in analytics, data science and machine learning to the table. Her proficiency in data engineering, natural language processing, and computer vision empowers her to develop innovative, data-driven solutions for challenging business problems. As a dedicated data scientist, Diana is passionate about leveraging her skills to drive meaningful impact and deliver tangible results.

Noah Prozan

Noah currently works as a data scientist at DoorDash, using data to drive marketing spend decisions for merchants. In the past he worked for Samsung, where he focused on workforce planning and quarterly HR analytics reviews for executives. He was primarily responsible for data collection and LLM functionality for Datadrip. Outside of work he enjoys skiing, sports, and travel; Japan is next on his bucket list. Noah also went to Berkeley for undergrad and is a proud Cal Bear.

Yifan Wang

Yifan Wang is currently a research analyst on the Barclays US Credit Research team, covering the US High Grade Bank, Insurance, and Non-Bank Financial sectors. She joined Barclays in 2022 from Lord Abbett, where she was an Investment Analyst covering Asset-Backed Securities. Yifan will be joining Chicago Booth as a PhD student in Financial Economics this fall.

Kamil Mielczarek

Data Scientist

© Copyright 2024 Datadrip