GSoC 2025 Project Ideas

Please ask questions through issues on the respective project's repo.

Tags available: @henrykironde, @bw4sz, @jveitchmichaelis, @ethanwhite

Preferred names (Henry, Ben, Josh, Ethan)
Preferred_greeting (Hi|Hello|Dear|Thanks|Thank you [First_name])

The code of conduct should be your first read.

Proposal 1: Efficient Detection of Unique Images from Overlapping Images

Rationale:

Develop a workflow to compute unique image detections from overlapping images using either the weecology/DoubleCounting repository or the open-forest-observatory/geograypher repository. This project will focus on implementing an efficient algorithm for removing double counting among overlapping images.

Approach:

Choose a suitable repository, either weecology/DoubleCounting or open-forest-observatory/geograypher, for implementing the unique image detections workflow.
Develop an efficient algorithm for removing double counting among overlapping images.
Evaluate the performance of the workflow on various datasets.
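
As a rough illustration of what removing double counts can look like, the sketch below greedily merges detections that overlap strongly. It assumes every box has already been projected into one shared coordinate frame (for example, georeferenced coordinates); that projection step is not shown. The names `iou`, `unique_detections`, and `iou_threshold` are hypothetical and are not taken from weecology/DoubleCounting or open-forest-observatory/geograypher.

```python
from typing import List, Tuple

# (xmin, ymin, xmax, ymax, score) in a shared coordinate frame (assumption)
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def unique_detections(boxes: List[Box], iou_threshold: float = 0.5) -> List[Box]:
    """Keep the highest-scoring box of each overlapping group (greedy)."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        # A box is new only if it does not overlap any box already kept.
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


detections = [
    (10.0, 10.0, 20.0, 20.0, 0.9),  # object seen in image A
    (11.0, 11.0, 21.0, 21.0, 0.8),  # same object seen in overlapping image B
    (40.0, 40.0, 50.0, 50.0, 0.7),  # a different object
]
print(len(unique_detections(detections)))  # prints 2
```

The greedy pass compares each box against every kept box, so it is quadratic in the number of detections; an efficient version for large surveys would likely need a spatial index, which is part of what this project would explore.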

Expected Outcomes:

A workflow for computing unique image detections from overlapping images.
Documentation on using the workflow.

Source Code: DeepForest

Degree of Difficulty:

Intermediate, long (350 hours)

Skills:

Deep learning
Git/GitHub
Machine learning
Software testing
Python and Python package deployment

Mentors:

@bw4sz
@jveitchmichaelis
@henrysenyondo
@ethanwhite

Proposal 2: Developing an Active Learning Module for DeepForest

Rationale:

Implement an active learning module for DeepForest, allowing users to select new images for model training based on current model scores. This project will focus on integrating the BOEM repository's active learning code into DeepForest, enabling more efficient model training and improved accuracy.

Approach:

Integrate the BOEM repository's active learning code into DeepForest.
Develop a user-friendly interface for selecting new images based on model scores.
Evaluate the effectiveness of the active learning module in improving model accuracy.
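
To make "selecting new images based on model scores" concrete, here is a minimal sketch of one common strategy, least-confidence sampling. It is an assumption for illustration only: the BOEM repository's actual selection logic is not described here, and `select_for_annotation` is a hypothetical name. The input is a mapping from image name to the confidence scores of the model's detections on that image.

```python
from statistics import mean
from typing import Dict, List


def select_for_annotation(scores_by_image: Dict[str, List[float]], n: int) -> List[str]:
    """Return the n images the current model is least confident about.

    Images with no detections get a confidence of 0.0 and are ranked first.
    """
    def confidence(item):
        _, scores = item
        return mean(scores) if scores else 0.0

    ranked = sorted(scores_by_image.items(), key=confidence)
    return [image for image, _ in ranked[:n]]


scores = {
    "a.png": [0.90, 0.95],  # confident predictions
    "b.png": [0.30, 0.40],  # uncertain predictions
    "c.png": [],            # nothing detected
}
print(select_for_annotation(scores, 2))  # prints ['c.png', 'b.png']
```

Other ranking rules (for example, score entropy or disagreement between models) would slot into the same interface by swapping the `confidence` function.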

Expected Outcomes:

An active learning module for DeepForest using BOEM.
Documentation on using the active learning module.

Source Code: DeepForest

Degree of Difficulty:

Intermediate, long (350 hours)

Skills:

Deep learning
Git/GitHub
Active learning
Python and Python package deployment

Mentors:

@bw4sz
@jveitchmichaelis
@henrysenyondo
@ethanwhite

Proposal 3: The Airborne Wildlife benchmark dataset.

https://github.com/landing-ai/vision-agent?tab=readme-ov-file

Rationale:

There are hundreds of airborne wildlife datasets, but most are unavailable, come in many different formats and organizations, and cannot be used directly for machine learning model training. We have identified hundreds of datasets and will work with partners to collect and standardize them and to train a general airborne animal detector.

Approach:

Download and organize datasets from previously identified sources
Clone the MillionTrees repo https://milliontrees.idtrees.org/en/latest/ to create a MillionAnimals benchmark. The organization, evaluation and structure are already well defined.
Develop baseline models for a single general animal detector across taxa and backgrounds for screening images, much as MegaDetector https://github.com/agentmorris/MegaDetector does for camera traps. This work may be in partnership with https://github.com/microsoft/CameraTraps, depending on the readiness and state of the repo.
Connect and document both MillionTrees and MillionAnimals with DeepForest for reproducible model training.
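
Standardizing datasets mostly means mapping each source format onto one annotation table. The sketch below converts COCO-style box annotations into rows with the `image_path, xmin, ymin, xmax, ymax, label` columns that DeepForest uses for box annotations. Treating that layout as the target schema for MillionAnimals is an assumption, and `coco_to_rows` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["image_path", "xmin", "ymin", "xmax", "ymax", "label"]


def coco_to_rows(coco: Dict) -> List[Dict]:
    """Convert COCO [x, y, width, height] boxes to corner-coordinate rows."""
    images = {img["id"]: img["file_name"] for img in coco["images"]}
    categories = {cat["id"]: cat["name"] for cat in coco["categories"]}
    rows = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        rows.append({
            "image_path": images[ann["image_id"]],
            "xmin": x, "ymin": y, "xmax": x + w, "ymax": y + h,
            "label": categories[ann["category_id"]],
        })
    return rows


def rows_to_csv(rows: List[Dict]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy COCO-style input, made up for illustration.
coco = {
    "images": [{"id": 1, "file_name": "flight_01.png"}],
    "categories": [{"id": 7, "name": "Bird"}],
    "annotations": [{"image_id": 1, "category_id": 7, "bbox": [10, 20, 20, 15]}],
}
rows = coco_to_rows(coco)
print(rows_to_csv(rows))
```

Each additional source format would get its own small converter that emits the same rows, so everything downstream (evaluation, training) only has to understand one table.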

Expected Outcomes:

A MillionAnimals benchmark of standardized airborne wildlife datasets.
Baseline models for a general airborne animal detector.
Documentation connecting MillionTrees and MillionAnimals with DeepForest.

Source Code: DeepForest

Degree of Difficulty:

Intermediate, long (350 hours)

Skills:

Deep learning
Git/GitHub
Active learning
Python and Python package deployment

Mentors:

@bw4sz
@jveitchmichaelis
@henrysenyondo
@ethanwhite

Proposal 4: DeepForest Vision Agent connection with LandingAI

https://github.com/landing-ai/vision-agent?tab=readme-ov-file

Rationale:

Text-based queries of images for labeling and organization.

Approach:

Create configuration for DeepForest users to register LLM keys
Object detection and segmentation workflows
Develop a user-friendly interface for selecting new images based on agent responses
Evaluate the effectiveness of the active learning module in improving model accuracy.
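
One possible shape for the key-registration step is to read the key from an environment variable rather than storing it in a config file. This is only a sketch under stated assumptions: `AgentConfig`, `load_agent_config`, and the variable name `DEEPFOREST_AGENT_API_KEY` are hypothetical and are not part of DeepForest or VisionAgent.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True, repr=False)
class AgentConfig:
    """Holds credentials for a vision agent backend (hypothetical)."""
    api_key: str
    provider: str = "vision-agent"

    def __repr__(self) -> str:
        # Never print the key itself, so it cannot leak into logs.
        return f"AgentConfig(provider={self.provider!r}, api_key='***')"


def load_agent_config(
    env: Optional[Mapping[str, str]] = None,
    variable: str = "DEEPFOREST_AGENT_API_KEY",  # hypothetical variable name
) -> AgentConfig:
    """Build a config from environment variables, failing loudly if unset."""
    env = os.environ if env is None else env
    key = env.get(variable, "").strip()
    if not key:
        raise ValueError(f"Set {variable} before using the agent module.")
    return AgentConfig(api_key=key)


# Usage with an explicit mapping, so the example runs without a real key.
config = load_agent_config({"DEEPFOREST_AGENT_API_KEY": "example-key"})
print(config)
```

Keeping the key out of the printed representation and out of version-controlled config files is the main design point; the storage mechanism itself could just as well be a keyring or a user-level config file.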

Expected Outcomes:

An agent-interaction module for DeepForest using VisionAgent.
Documentation on using the agent module.

Source Code: DeepForest

Degree of Difficulty:

Intermediate, long (350 hours)

Skills:

Deep learning
Git/GitHub
Active learning
Python and Python package deployment

Mentors:

@bw4sz
@jveitchmichaelis
@henrysenyondo
@ethanwhite

Proposal 5: Integrating BIOCLIP Backbone into DeepForest's CropModel for Improved Accuracy

https://huggingface.co/imageomics/bioclip https://github.com/Imageomics/bioclip/tree/main

The DeepForest crop model Source Code: CropModel

Rationale:

Develop a BIOCLIP backbone for the CropModel, enabling improved accuracy and efficiency in crop classification tasks. This project will focus on integrating the BIOCLIP architecture into the CropModel framework.

About BioCLIP

BioCLIP is a foundation model for the tree of life, built on the CLIP architecture as a vision model for general organismal biology. It is trained on TreeOfLife-10M, a dataset its authors created that covers over 450K taxa and that they describe as the most biologically diverse ML-ready dataset available to date. In their benchmarking on a diverse set of fine-grained biological classification tasks, BioCLIP consistently outperformed existing baselines by 16% to 17% absolute. Their intrinsic evaluation found that BioCLIP learned a hierarchical representation aligned to the tree of life, which suggests potential for robust generalization.

Approach:

Integrate the BIOCLIP architecture into the CropModel framework.
Evaluate the performance of the BIOCLIP backbone on various airborne wildlife classification datasets. Can it be fine-tuned? What kind of zero-shot performance does it have?
Develop a user-friendly interface for using the BIOCLIP backbone within CropModel, including combining text prompts with images.
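
For readers unfamiliar with how a CLIP-style model classifies without fine-tuning, the sketch below shows the scoring step only: cosine similarity between an image embedding and one text-prompt embedding per class, followed by a softmax. The vectors are made-up stand-ins, not BioCLIP outputs, no model is loaded, and `zero_shot_probs` and `temperature` are hypothetical names chosen for illustration.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probs(
    image_embedding: Sequence[float],
    prompt_embeddings: Dict[str, Sequence[float]],
    temperature: float = 0.01,
) -> Dict[str, float]:
    """Softmax over temperature-scaled similarities to each class prompt."""
    logits = {
        label: cosine(image_embedding, emb) / temperature
        for label, emb in prompt_embeddings.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: v / total for label, v in exps.items()}


# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions.
image = [0.9, 0.1, 0.0]
prompts = {
    "a photo of a heron": [1.0, 0.0, 0.0],
    "a photo of a gull": [0.0, 1.0, 0.0],
    "a photo of a seal": [0.0, 0.0, 1.0],
}
probs = zero_shot_probs(image, prompts)
print(max(probs, key=probs.get))  # prints: a photo of a heron
```

In an actual integration the embeddings would come from the model's image and text encoders, and fine-tuning would replace or augment the prompt-similarity head with a trained classifier.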

Expected Outcomes:

A BIOCLIP backbone for the CropModel.
Documentation on using the BIOCLIP backbone within CropModel.

Degree of Difficulty:

Intermediate, long (350 hours)

Skills:

Deep learning
Git/GitHub
Crop classification models
Python and Python package deployment

Mentors:

@bw4sz
@jveitchmichaelis
@henrysenyondo
@ethanwhite

Proposal 6: Creating a multi-sensor airborne benchmark for Tree Species classification using National Ecological Observatory Network.

The aim of this proposal is to generate a novel machine learning benchmark covering over 40,000 tree stems, 80 species, and 20 geographic areas across the United States. Airborne data include RGB orthophotos, airborne LiDAR, and 369-band hyperspectral imagery. We have hand-annotated tree crowns for each field-surveyed tree stem and aim to wrap these data into challenging train-test splits to increase the accuracy and realism of remote sensing tree identification research.

Rationale:

Organizing the data into an easy-to-use benchmark will allow rapid verification and multi-sensor integration. While NEON data are public, they are not easy for machine learning researchers to access.

Approach:

Organize ground truth stems and airborne tree crown annotations alongside multi-sensor airborne imagery.
Create a pull request at torchgeo to add to their dataset storage. https://github.com/microsoft/torchgeo
Create a reproducible baseline model to show per-sensor and cross-sensor performance.
Connect these data to the DeepForest repo https://deepforest.readthedocs.io/ using a simple two-step process of object detection and classification, and provide a place for a leaderboard and dataset information.
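
One way to make a train-test split challenging is to hold out whole geographic areas, so that a model is tested on sites it never saw during training. The sketch below shows that idea on toy records. Holding out sites is an assumption about what the splits could look like, and the field names (`site`, `species`) and site codes are placeholders, not NEON identifiers.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]


def split_by_site(
    records: List[Record], test_sites: Set[str]
) -> Tuple[List[Record], List[Record]]:
    """Put every record from a held-out site in test, the rest in train."""
    train = [r for r in records if r["site"] not in test_sites]
    test = [r for r in records if r["site"] in test_sites]
    return train, test


# Placeholder records; a real benchmark would carry crown geometry and sensor paths.
records = [
    {"stem_id": "1", "site": "SITE_A", "species": "species_1"},
    {"stem_id": "2", "site": "SITE_A", "species": "species_2"},
    {"stem_id": "3", "site": "SITE_B", "species": "species_1"},
    {"stem_id": "4", "site": "SITE_C", "species": "species_3"},
]
train, test = split_by_site(records, test_sites={"SITE_B"})
print(len(train), len(test))  # prints 3 1
```

A split like this can leave some species absent from the training set (here none are, but it is easy to construct), so a benchmark would also need a rule for species that occur only at held-out sites.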

Expected Outcomes:

A completed pull request to torchgeo organizing the data
A well-organized GitHub repo with a simple baseline model
A completed pull request to the DeepForest repo connecting the benchmark to tree detection and CropModel

Degree of Difficulty:

Intermediate, long (350 hours)

Skills:

Deep learning
Git/GitHub
Crop classification models
Python and Python package deployment

Mentors:

@bw4sz
@jveitchmichaelis
@henrysenyondo
@ethanwhite
