My Commit Headache

Note - this is admittedly somewhat overboard. Quitting, or opting for something simpler with the tradeoff of being less functional, would have been reasonable. However, I was doing this for fun! So enjoy.

Below, you can see a simple commit graph, the same as you would see on my GitHub.

[Interactive commit heat map: code commits from Sat Sep 07 2024 to Sun Sep 07 2025, broken down by month, with the number of contributions per day.]

Disclaimer - this is incomplete. Enterprise work (hosted on servers outside of GitHub), along with any work that has been scrapped or is not yet ready for release, is not visible.

You may be wondering why this deserves its own page. If your first thought was, "There's probably an API for that," welcome to the club.

What API?

Looking into the GitHub documentation, I found a couple of candidates that could be used in different ways. I implemented both.

Option 1: REST w/ Repo + Commit

This option is simple.

  1. Fetch all repositories currently associated with me, then
  2. For each repository, fetch all of its commit data and add it to a list.
  3. Process the list into something usable by day (see the sketch after this list).
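To make that concrete, here's a minimal sketch of the approach using the REST endpoints involved (pagination and error handling omitted; the GITHUB_TOKEN environment variable is an assumption of this sketch, not part of the original setup):

```python
import os
from collections import Counter

import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def fetch_daily_commits(username: str) -> Counter:
    """Count commits per day across all of a user's repositories."""
    daily = Counter()
    repos = requests.get(f"{API}/user/repos", headers=HEADERS).json()
    for repo in repos:
        # One extra round trip per repository, plus pagination in practice.
        commits = requests.get(
            f"{API}/repos/{repo['full_name']}/commits",
            headers=HEADERS,
            params={"author": username},
        ).json()
        for commit in commits:
            day = commit["commit"]["author"]["date"][:10]  # "YYYY-MM-DD"
            daily[day] += 1
    return daily
```

Even this toy version makes the cost obvious: one request for the repository list, then at least one more per repository.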

The first immediate issue is that that's a lot of requests just to populate the map. Not all of that data is necessary anyway, and the latency of my site going back and forth with GitHub would add up. That leads us to the more promising

Option 2: GraphQL

GraphQL is a query language that simplifies complicated, interconnected data queries. The developer is given an API that accepts any request following a certain schema (a set of rules defining its input and output), so they can ask for exactly what they need, with all of the resources necessary, in 1 query for the service (in this case the GitHub API) to return. Sounds promising! It

  1. Solves our issue of waiting on potentially many requests,
  2. Only requests what we need, and
  3. Simplifies our code dramatically. Now we can interconnect all of our data in 1 query, rather than writing our own extraction logic (see the query sketch below).
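For a sense of what that looks like, GitHub's GraphQL API exposes the contribution calendar directly; a single request along these lines returns a year of per-day counts (a sketch, again assuming a GITHUB_TOKEN environment variable):

```python
import os

import requests

# One query fetches the entire contribution calendar at once.
QUERY = """
query ($login: String!) {
  user(login: $login) {
    contributionsCollection {
      contributionCalendar {
        weeks { contributionDays { date contributionCount } }
      }
    }
  }
}
"""

def fetch_calendar(login: str) -> list[dict]:
    """Fetch a year of contribution days in a single GraphQL request."""
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"login": login}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    calendar = resp.json()["data"]["user"]["contributionsCollection"][
        "contributionCalendar"
    ]
    return [day for week in calendar["weeks"] for day in week["contributionDays"]]
```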

This option would be more than suitable, right? Looking at the output of this GraphQL query, I was excited! Until I ran into

The Problems

Problem 1

Our first problem, which could likely be acceptable given that this site is unlikely to demand much scale, is that our code is

  • pretty slow for a browser and
  • pretty wasteful.

Every time a user wants this information, their browser has to fetch ALL of it from GitHub's servers, which have to connect all of the data together before returning it to us to render. This on its own is not disqualifying. However,

Problem 2

There are no commits from my work! At the moment, I've worked at 2 companies: a civil engineering firm named Torres & Associates LLC, and the Cigna Group. My commits at the Cigna Group were on their Enterprise servers, so there's no way I'll ever display those commits, as I no longer have access. Torres & Associates, on the other hand, is connected directly to my GitHub, just under a different organization that I do not have ownership over. Here, I have 3 options:

  1. Accept that I won't be able to show commits from my work as my GitHub commit graph does,
  2. Ask for a Personal Access Token from a coworker to authorize repository access, or
  3. Figure out another way.

I opted to figure out another way. After all, I may work at another company in the future, one that might not want to give me that type of access for a personal project. It would also stop working if I left, assuming they invalidated the token. Besides, GitHub displays it; there's got to be a way for me to display it as well!

Little did I know that this was seemingly uncharted territory.

Identifying Requirements

Now that I had reached territory where any fix would require a more creative approach, I needed to identify what would characterize an acceptable solution. I came up with a few key points:

Maintainable

I want something that doesn't require me to keep developing around things such as new organizations to request permissions from, unreadable and messy code, or anything that requires significant administration. The solution should be implemented once, and then work in perpetuity.

Cheap

I'm a college student! And while I do work summers and part time during the school year, I'm not looking to spend a whole lot for a whole lot of nothing. This is a simple graph on my page. It should be free in perpetuity.

Detailed

I want all of my commits that would show on my GitHub commit graph! That needs to be a requirement for any good solution. It also needs to update past records when code becomes newly visible, such as when it is merged into main (as the commit graph only shows commits that are applied to the main branch).

Efficient

This is for a website. It can't require 6-second loading times, bloat the code of my site, or waste resources unnecessarily. Any solution in that boat would not be worth doing.

Exploring Options

I looked further and further for any kind of API that I could use, but I ended up realizing that without permissions, there was no good option using GitHub's APIs alone. However, I got an idea:

Part 1: Scraping

GitHub displays this data publicly for any user, including their private commit information, as long as they choose to display it. That graph holds everything I need, namely how many commits there were and on what day, so it should be enough to recreate it.

Why not just use that data?

Our Tools

BeautifulSoup

This is a Python library (Python being the language I developed the scraper in) that allows a program to read through a web page and extract useful information. It was needed to parse the page where the GitHub commit graph lives, and then to pull out the commit counts by date. This, along with the other libraries necessary for execution, was incorporated within a Lambda Layer, which is a bundle of packages a Lambda needs in order to execute.
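For illustration, the core of the scraper looks something like the sketch below. GitHub's markup changes over time, so the URL pattern and the selectors here are assumptions that may need updating, not necessarily the exact code running on my site:

```python
import re

import requests
from bs4 import BeautifulSoup

def scrape_contributions(username: str) -> dict[str, int]:
    """Scrape commits-per-day from a user's public contribution graph."""
    url = f"https://github.com/users/{username}/contributions"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    counts: dict[str, int] = {}
    # Each day in the graph is rendered as an element tagged with its date.
    for cell in soup.find_all(attrs={"data-date": True}):
        date = cell["data-date"]
        # The count is described in nearby text like "3 contributions on ...".
        match = re.search(r"(No|\d+) contributions?", cell.get_text())
        if match:
            n = match.group(1)
            counts[date] = 0 if n == "No" else int(n)
    return counts
```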

Lambda

AWS, or Amazon Web Services, offers a wonderful resource called AWS Lambda. This is a service that lets a developer host code on AWS's servers and then execute it programmatically, via a direct request or trigger. It differs from more traditional models, where companies either manage their own infrastructure or have a company like AWS or Microsoft Azure manage their servers. Those models incur costs regardless of how much the server is used: if I don't need to scrape GitHub more than once a day, that doesn't matter, and I pay the same amount for my services.

Lambda, on the other hand, is called serverless, meaning that developers and companies don't pay for servers, but instead pay based on their usage of certain resources. In our case, that usage is a script that extracts GitHub commit data for our commit graph.

This minimizes the need to manage server resources, which cuts down on cost, keeping this free forever under Lambda's free tier, as well as on the maintenance overhead that managing servers would incur.

This code will extract, and eventually save, the data such that it can be queried for our commit graph.

EventBridge

Another AWS service, EventBridge enabled me to schedule my Lambda script to execute once a day, keeping my extraction up to date as time goes on. I no longer have to worry about refreshing my commit information, as the process is now automated.
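For reference, the schedule is just a rule with a rate expression pointed at the Lambda. A sketch using boto3 (the rule name is hypothetical, and the function ARN is elided):

```python
import boto3

events = boto3.client("events")

# Fire once a day.
events.put_rule(Name="daily-commit-scrape", ScheduleExpression="rate(1 day)")

# Point the rule at the scraper Lambda.
events.put_targets(
    Rule="daily-commit-scrape",
    Targets=[{"Id": "scraper-lambda", "Arn": "arn:aws:lambda:..."}],
)
```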

Problems Leftover

  1. Where is this data being stored?
  2. How is it being accessed and loaded onto the frontend?

Part 2: Persistence

This refers to the first leftover problem mentioned above: where are we storing this data? It takes time to extract all of this information from GitHub, and I certainly would not want to make too many requests and have my traffic blocked, so doing this for every request wouldn't be good. That gets in the way of our efficiency and our maintainability, as mentioned above. So let's see if we can get this data to persist, and then load it quickly for every user.

Our Tools

Remember cost. We want to store this data in a way that ideally incurs no cost whatsoever. This is just a commit graph! Yes, spending this much time minimizing cost might be counterproductive, as my time is sometimes worth more than the money saved by doing things like this. But that's part of the learning and the fun!

DynamoDB

As someone whose work experience over the past year and a half has focused heavily on relational databases, especially PostgreSQL, with S3 for object storage, NoSQL databases have made me shudder in the past. I've felt that the "flexibility" they offer in not enforcing a schema often leads to inflexibility down the line, when software requirements inevitably change. For this application, however, I only need a key-value store from which I can simply fetch the last year of commit data. That use case has a clearly defined and simple scope, which makes it perfect for a store like this. I was easily able to set up a table storing the date and number of commits, along with query logic to insert, update, and retrieve the commits per day, for practically free over time thanks to the limited use and pay-as-you-go nature of this serverless solution.

Boto3 AWS SDK

This library was critical for interacting with our database, and was used in both the scraping and querying steps. It allows us to easily query, insert into, and update our database from our Lambdas.
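Concretely, the persistence layer needs little more than the following sketch (the table name and key schema are illustrative assumptions; I'm treating the date string as the partition key):

```python
import boto3

# Table keyed on "date" (YYYY-MM-DD); name and schema are illustrative.
table = boto3.resource("dynamodb").Table("commit-days")

def save_counts(counts: dict[str, int]) -> None:
    """Upsert one item per day from the scraper's output."""
    with table.batch_writer() as batch:
        for date, commits in counts.items():
            batch.put_item(Item={"date": date, "commits": commits})

def get_count(date: str) -> int:
    """Fetch the stored commit count for a single day."""
    item = table.get_item(Key={"date": date}).get("Item")
    return int(item["commits"]) if item else 0
```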

Part 3: Querying

The goal here is to make a simple API for use in our frontend to fetch the last year of commit data in a usable format for our commit graph. For this, I used

Our Tools

Lambda

This was used alongside an HTTPS endpoint configured for our application to query. It simply takes a request, determines the current date and the date one year prior, pulls the most recent commit numbers associated with that range, and returns them to the caller. This allowed a simple interface to access the system from our static site.
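A naive version of that handler might look like the following (a sketch reusing the illustrative table above; a real version would batch or query the range rather than issue one read per day):

```python
import json
from datetime import date, timedelta

import boto3

table = boto3.resource("dynamodb").Table("commit-days")  # illustrative name

def handler(event, context):
    """Return the last year of commit counts as a {date: commits} map."""
    today = date.today()
    day = today - timedelta(days=365)
    results: dict[str, int] = {}
    while day <= today:
        # One read per day: simple, but a real version should batch.
        item = table.get_item(Key={"date": day.isoformat()}).get("Item")
        results[day.isoformat()] = int(item["commits"]) if item else 0
        day += timedelta(days=1)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(results),
    }
```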

(Less Interesting) Part 4: View

Now that I have all of my data ready to fetch, I just need to be able to access it. For this, I found a heat map component that could be installed via Node, looked into what it needed, then used a fetch call on page load via a React hook to get the data from my API. Then, I passed that data in and formatted it to match the rest of the site.

The End :)