Step 0, find all the awesomeness

How to find good repos on GitHub?

  • Starred repos top lists…? Boring… and we’d need more than 100 for this idea.

  • All those awesome- repos are usually fun to browse! What if we try to find some of them instead?

What’s a pragmatic way to programmatically get many of those though?
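
For completeness: one could also ask GitHub’s search API for candidates, but unauthenticated search is rate-limited and the results are noisy, so a handful of curated meta lists felt like a better seed. A sketch of what that API route could look like:

import json
import urllib.request

# hypothetical alternative: let GitHub's search API suggest awesome- repos by name
url = "https://api.github.com/search/repositories?q=awesome+in:name&sort=stars&per_page=100"
with urllib.request.urlopen(url) as response:
    results = json.load(response)
for item in results["items"]:
    print(item["full_name"], item["stargazers_count"])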

So I started out by manually finding and cloning a few of those into a local directory named ./awesome-awesome.

Step 1, clone all the repos

Now let’s ask ChatGPT to find and clone all the awesome- repos that our meta repos on disk refer to. That is: let it write some sort of file-reader/parser that also invokes git clone, which we can run locally.

First I tried to ask ChatGPT to do this in Golang, but it got quite verbose and ChatGPT stalled more than once, perhaps due to the length of the script, or perhaps just an intermittent performance issue, who knows. In any case, as plan B I switched over to Python and that seemed to go a bit smoother.

After settling on Python, this is the prompt I started with:

Write me a python script that iterates over all directories inside the directory ./awesome-awesome with no recursion involved and in each of them reads the readme.md regardless of filename case.

within each readme it should look for links to github where the repo name is awesome-[something] with case insensitive match, where [something] is something other than “awesome” with case insensitive match

the script should proceed by cloning each repository found into the directory structure ./awesome-repos/[github-user]/awesome-[something] - keep the same case as the original link. However, if the directory already exists, it should do nothing and proceed to the next repo.

For each significant step including the beginning of every loop, output friendly debug output to stdout. Also, for each iteration over many items, show a count n of m for every related debug message. If the cloning or parsing fails at any step, the script should print a warning to stderr and continue to the next link or repo.

Here is a sample implementation of the script:
import os
import re
import subprocess
import urllib.parse  # needed for the hand-patched URL quoting further down

def clone_repo(repo_url, repo_name, repo_path):
    try:
        if not os.path.exists(repo_path):
            os.makedirs(repo_path)
        cmd = f"git clone {repo_url} {repo_path}"
        subprocess.run(cmd, check=True, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        print(f"Successfully cloned {repo_name} at {repo_path}")
    except Exception as e:
        print(f"Error cloning {repo_name}: {e}", file=sys.stderr)

def parse_readme(directory):
    readme_file = None
    for filename in os.listdir(directory):
        if filename.lower() == "readme.md":
            readme_file = os.path.join(directory, filename)
            break
    if readme_file is None:
        print(f"No README.md found in {directory}", file=sys.stderr)
        return

    with open(readme_file, "r") as f:
        readme_content = f.read()

    regex = re.compile(r"https://github.com/([\w-]+)/awesome-(?!awesome)[\w-]+", re.IGNORECASE)
    repo_links = regex.findall(readme_content)

    if not repo_links:
        print(f"No matching repositories found in {readme_file}", file=sys.stderr)
        return

    for i, repo_link in enumerate(repo_links):
        repo_name = repo_link.split("/")[-1]
        repo_path = os.path.join("./awesome-repos", repo_name.lower())
        # repo_url = f"https://github.com/{repo_link}" # dangerous
        repo_url = "https://github.com/{}".format(urllib.parse.quote(repo_link))

        print(f"Processing {i+1} of {len(repo_links)}: {repo_name}")
        clone_repo(repo_url, repo_name, repo_path)

def main():
    base_directory = "./awesome-awesome"
    if not os.path.exists(base_directory):
        print(f"Error: {base_directory} does not exist", file=sys.stderr)
        return

    directories = [os.path.join(base_directory, d) for d in os.listdir(base_directory) if os.path.isdir(os.path.join(base_directory, d))]
    for i, directory in enumerate(directories):
        print(f"Processing directory {i+1} of {len(directories)}: {directory}")
        parse_readme(directory)

if __name__ == "__main__":
    main()

Looks promising!

It forgot to import sys though, so the code isn’t even executable:

NameError: name 'sys' is not defined

It also tried to clone the user profile:

Processing directory 1 of 7: ./awesome-awesome/awesome-all
Processing 1 of 33: bradoyler
Error cloning bradoyler: Command 'git clone https://github.com/bradoyler ./awesome-repos/bradoyler' returned non-zero exit status 128.
Processing 2 of 33: bradoyler
Error cloning bradoyler: Command 'git clone https://github.com/bradoyler ./awesome-repos/bradoyler' returned non-zero exit status 128.
Processing 3 of 33: bradoyler
Error cloning bradoyler: Command 'git clone https://github.com/bradoyler ./awesome-repos/bradoyler' returned non-zero exit status 128.
Processing 4 of 33: JStumpp
Error cloning JStumpp: Command 'git clone https://github.com/JStumpp ./awesome-repos/jstumpp' returned non-zero exit status 128.
Processing 5 of 33: J2TeaM
Error cloning J2TeaM: Command 'git clone https://github.com/J2TeaM ./awesome-repos/j2team' returned non-zero exit status 128.
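
For the curious, my read on what went wrong: re.findall() returns only the capturing group, so every “link” the script extracted collapsed to just the username:

import re

# same pattern as in the script above; findall() yields group(1) only, i.e. the username
regex = re.compile(r"https://github.com/([\w-]+)/awesome-(?!awesome)[\w-]+", re.IGNORECASE)
print(regex.findall("see https://github.com/JStumpp/awesome-android for apps"))
# -> ['JStumpp'], hence `git clone https://github.com/JStumpp` and exit status 128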

What a noob! But easy things to fix. I might have stopped from boredom and procrastination three times over before getting to this point on my own, whereas these trivial errors of someone else’s making are a breeze to complain about and power through. But before spending even that amount of brainpower, let’s see if ChatGPT can fix it for us.

I posted the first prompt again and added this clarification to the end:

Remember to import all dependencies. Also note that a github clone url for the repo https://github.com/[github-user]/awesome-[something] has the form https://github.com/[github-user]/awesome-[something].git

In response to this, ChatGPT came up with a different "cloning" strategy and decided to try to download zip files from GitHub instead. I aborted that run and appended this to the list of clarifications:
Use `git clone` when cloning from the .git url.

And it still didn't remember to import sys so I also added:
...including sys if needed.

It also messed up by looking for urls ending with .git in the readme files, so I rewrote that clarification too:

Also note that if you find a repo url with this form in the readme https://github.com/[github-user]/awesome-[something] then you need to append .git before you hand it to `git clone`

This time it came up with something that worked out of the box! The thing started cloning repos left and right:

Processing directory 1 of 7: awesome-all
Cloning all from https://github.com/bradoyler/awesome-all.git
Cloning into './awesome-repos/bradoyler/all'...
remote: Enumerating objects: 80, done.
remote: Total 80 (delta 0), reused 0 (delta 0), pack-reused 80
Receiving objects: 100% (80/80), 12.19 KiB | 135.00 KiB/s, done.
Resolving deltas: 100% (40/40), done.
Skipping all as it already exists
Skipping all as it already exists
Cloning android from https://github.com/JStumpp/awesome-android.git
Cloning into './awesome-repos/JStumpp/android'...
remote: Enumerating objects: 1371, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 1371 (delta 0), reused 2 (delta 0), pack-reused 1367
Receiving objects: 100% (1371/1371), 586.94 KiB | 985.00 KiB/s, done.
Resolving deltas: 100% (689/689), done.

...

It got stuck on the 9th one though:

Cloning cobol from https://github.com/dshimy/awesome-cobol.git
Cloning into './awesome-repos/dshimy/cobol'...
Username for 'https://github.com':

At this point I didn’t bother asking ChatGPT to come up with a solution or amend the script. Instead I used those good old Google/StackOverflow muscles. Or ServerFault in this case, where the suggestion of adding GIT_TERMINAL_PROMPT=0 sounded like just what I needed.

My Python subprocess memories were a bit rusty though, and I don’t remember ever using check_call(). Instead of RTFM, I asked ChatGPT if it knew how to modify that single line:

How would you modify this line:

subprocess.check_call(["git", "clone", repo_url, target_dir])

…so that GIT_TERMINAL_PROMPT=0 is set as an environment variable when the subprocess runs?

You can modify this line to set the GIT_TERMINAL_PROMPT environment variable as follows:
subprocess.check_call(
    ["git", "clone", repo_url, target_dir],
    env={'GIT_TERMINAL_PROMPT': '0'})

Couldn’t have been much easier, so RTFM-ing would have paid off. But ChatGPT took us straight to the answer too, given that we solved the hard part via ServerFault. Wonder if it would have known the full answer on its own…
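
One nuance worth flagging: passing env= like that replaces the child process’s entire environment rather than extending it. It worked fine here, but a more defensive variant (my tweak, not ChatGPT’s) merges the flag into the environment we already have:

import os
import subprocess

# keep the parent environment (PATH, proxies, ...) and just add the git flag;
# repo_url and target_dir come from the surrounding script
subprocess.check_call(
    ["git", "clone", repo_url, target_dir],
    env={**os.environ, "GIT_TERMINAL_PROMPT": "0"})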

Time to try again. This time with 8 repos cloned already, we also get to see if the script is idempotent or not.

Processing directory 1 of 7: awesome-all
Skipping all as it already exists
Skipping all as it already exists
Skipping all as it already exists
Skipping android as it already exists
Skipping AutoIt as it already exists
Skipping shell as it already exists
Skipping bigdata as it already exists
Skipping c as it already exists
Skipping cpp as it already exists
Skipping clojure as it already exists
Cloning cobol from https://github.com/dshimy/awesome-cobol.git
Cloning into './awesome-repos/dshimy/cobol'...
fatal: could not read Username for 'https://github.com': terminal prompts disabled
Failed to clone cobol from https://github.com/dshimy/awesome-cobol.git
Cloning cl from https://github.com/kozross/awesome-cl.git
Cloning into './awesome-repos/kozross/cl'...

Great success! It skipped over the existing ones. And it skipped the private repo too, that’s perfect!

This run took a while: many awesome repos to clone!

Step 2, stay in sync!

Next up, before proceeding with the data gathered, I thought it would also be useful to ask ChatGPT for an update script, i.e. a way to pull the latest version. Being Git-based is quite handy in that scenario.

ChatGPT did OK; I tweaked the prompt a few times in minor ways, mostly to get the verbosity to my taste:

Write me a python script that iterates over all local clones of git repos with the directory structure ./awesome-repos/[github-user]/[git-repo]/ and pulls the latest change from the remote into the currently checked out branch, ignoring local changes, and use pull --rebase instead of fetch + checkout, to reduce the verbosity of those steps.

For each significant step including the beginning of every loop, output friendly debug output to stdout.

Make sure to include the total count of repos overall and which repo overall we’re currently on when doing that, i.e. make two passes over all the directories and subdirectories, one where you count all the repos and another one where you fetch while knowing the total count.

If anything fails at any point, the script should print a warning to stderr and continue to the next step or repo. And remember to import all dependencies, including sys if needed.

Here's a python script that meets your requirements:
import os
import sys
import subprocess

repo_root = './awesome-repos/'

def run_command(command, cwd=None, log_error=True):
    try:
        result = subprocess.run(command, cwd=cwd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True, shell=True, text=True)
        return result.stdout.strip()
    except subprocess.CalledProcessError as error:
        if log_error:
            print(f"Error: {error.stderr}", file=sys.stderr)
        return error.stderr

# Get a list of all local git repos
repo_directories = []
for root, dirs, files in os.walk(repo_root):
    if '.git' in dirs:
        repo_directories.append(root)

total_repos = len(repo_directories)
print(f"Total repos: {total_repos}")

# Iterate over all local git repos
for index, repo_dir in enumerate(repo_directories):
    print(f"Processing repo {index + 1} of {total_repos}: {repo_dir}")
    try:
        run_command("git pull --rebase", cwd=repo_dir)
        # print("Success")
    except Exception:
        print(f"Failed to pull the latest changes for {repo_dir}", file=sys.stderr)

And here is the run log:

Total repos: 308
Processing repo 1 of 308: ./awesome-repos/0xtokens/blockchain
Success
Processing repo 2 of 308: ./awesome-repos/AcalephStorage/devops
Success
Processing repo 3 of 308: ./awesome-repos/AdrienTorris/blazor
Success
Processing repo 4 of 308: ./awesome-repos/AngularClass/angular
Success
Processing repo 5 of 308: ./awesome-repos/AntBranch/github
Success
Processing repo 6 of 308: ./awesome-repos/BubuAnabelas/markdown
Success

The success message is a bit redundant in hindsight; one line per pull, with extra detail only on actual updates and failures, would have been sufficient.
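
A rough sketch of that quieter logging (my own, reusing the repo_directories list from the script above):

import os
import subprocess
import sys

def pull(repo_dir):
    """Run `git pull --rebase` in repo_dir and return a short status string."""
    result = subprocess.run(
        ["git", "pull", "--rebase"],
        cwd=repo_dir, capture_output=True, text=True,
        env={**os.environ, "GIT_TERMINAL_PROMPT": "0"})
    if result.returncode != 0:
        return "FAILED"
    return "up to date" if "up to date" in result.stdout.lower() else "updated"

# one line per repo; repo_directories is the list built by the walk in the script above
for index, repo_dir in enumerate(repo_directories):
    status = pull(repo_dir)
    stream = sys.stderr if status == "FAILED" else sys.stdout
    print(f"{index + 1}/{len(repo_directories)} {repo_dir}: {status}", file=stream)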



Step 3, find the actual awesomeness

Next let’s try to extract info about the actually awesome repos into individual units.

We could insert them into a database, but for now I thought something structured and table-like that shows ChatGPT is on track would suffice. So I settled on .csv as an intermediary format; here’s the next prompt:

Write me a python script that iterates over all local clones of git repos with the directory structure ./awesome-repos/[github-user]/[git-repo]/ and reads the readme.md regardless of case in each of them. [git-repo] is a string consisting of the prefix awesome- followed by our [topic] for that file.

within each readme, look for list items and extract a title i.e. the link text, the url and the description that might or might not be included at the end of the line.

if there are markdown headlines and sub-headlines that group the list-items logically, it should keep track of them too, up to three levels deep. whenever headlines aren’t detected use the empty string.

The output should be a csv with the columns topic, title, url and desc which should be written to ./awesome-csv/[github-user]/awesome-[topic].csv replacing earlier versions of itself if any.

For each significant step including the beginning of every loop, output friendly debug output to stdout. Also, for each iteration over many items, show a count n of m for every related debug message. If the cloning or parsing fails at any step, the script should print a warning to stderr and continue to the next link or repo. And remember to import all dependencies, including sys if needed. And make sure to use path.join() when concatenating filesystem paths.

Here is the script to accomplish the tasks you described:
import os
import re
import csv

def create_output_directory(github_user):
    csv_directory = f"./awesome-csv/{github_user}"
    if not os.path.exists(csv_directory):
        os.makedirs(csv_directory)
    return csv_directory

def main():
    for github_user in os.listdir('./awesome-repos/'):
        if not os.path.isdir(f'./awesome-repos/{github_user}/'):
            continue
        for git_repo in os.listdir(f'./awesome-repos/{github_user}/'):
            topic = git_repo.replace('awesome-', '')
            repo_path = f'./awesome-repos/{github_user}/{git_repo}/'
            if not os.path.isdir(repo_path):
                continue
            readme_path = os.path.join(repo_path, 'readme.md')
            
            with open(readme_path, 'r', encoding='utf-8') as f:
                readme = f.read()
            
            lines = readme.split('\n')
            title, url, desc = '', '', ''
            data = []
            for line in lines:
                line = line.strip()
                if line.startswith(('-', '*')):
                    match = re.match(r'^[-*]\s+(.*?)\s*(?:\[(.*?)\]\((.*?)\))?', line)
                    if match:
                        title, url, desc = match.groups()
                        data.append([topic, title, url, desc or ''])
                        
            create_output_directory(github_user)
            csv_path = f'./awesome-csv/{github_user}/awesome-{topic}.csv'
            with open(csv_path, 'w', encoding='utf-8', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(['topic', 'title', 'url', 'description'])
                writer.writerows(data)
                
if __name__ == '__main__':
    main()
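
Worth noting: the prompt asked for headline tracking up to three levels deep, which this version quietly skipped. A minimal sketch of how that could slot into the line loop (not what ended up in my files):

# track up to three levels of markdown headings while scanning the readme lines
h1 = h2 = h3 = ''
for line in lines:
    stripped = line.strip()
    if stripped.startswith('### '):
        h3 = stripped[4:].strip()
    elif stripped.startswith('## '):
        h2, h3 = stripped[3:].strip(), ''
    elif stripped.startswith('# '):
        h1, h2, h3 = stripped[2:].strip(), '', ''
    # ...the list-item branch would then record (h1, h2, h3) alongside each row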

Four times out of five it did not add an is-directory check:

NotADirectoryError: [Errno 20] Not a directory: './awesome-repos/.DS_Store'

so I added it manually in a couple of places:

if not os.path.isdir(repo_path):
    continue

It also forgot to create the output directory structure before writing the file:

FileNotFoundError: [Errno 2] No such file or directory: './awesome-csv/0xtokens/awesome-blockchain.csv'
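
The fix is small; one way to do it, roughly what the create_output_directory() helper in the listing above ends up doing:

import os

# create ./awesome-csv/<github-user>/ before writing the csv into it (github_user comes from the loop)
os.makedirs(os.path.join('./awesome-csv', github_user), exist_ok=True)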


Step 4, come up with something useful

Let’s try to build a webapp with our data.

As a first stab, why not give every awesome thing its own blog post, and why not use Hugo, since it is fast and simple to host.

Then we could also assign a persistent number to each of them, to be able to pick one at random easily: all we need is an RNG and the total count, no lookups necessary beyond that. Perfect for a static site without a database.
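
In code, picking a random post then boils down to something like this (the /[count]/ URL shape is an assumption about how the posts end up being addressed in Hugo):

import random

max_count = 1234  # in practice read from config.json; the value here is made up
random_post_url = "/{}/".format(random.randint(1, max_count))  # no database lookup needed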

First, to ensure we don’t randomly re-assign a post’s count when updating existing posts and adding new ones, I decided to keep track of what our max count is, by creating this file by hand:

config.json

{
  "max_count": 0
}

Next, here’s a prompt to generate Hugo blog posts from the CSVs produced in the previous step:

Write me a python script that begins by reading the json root key "max_count" from the file ./config.json into a variable max_count, or fails if it is not able to read a positive integer.

Then, expecting the directory structure ./awesome-csv/<github-user>/awesome-<something>.csv, look up all the CSVs within it and parse them, expecting the columns topic, title, url and desc

Then compute a variant of title called slug where you only use alphanumeric ascii characters, convert dashes to hyphens and omit anything else.

Then produce a set of markdown files on disk, one for each row in the csv file, excluding the header rows

use the following file structure for the output: ./awesome-hugo/[current_count]/awesome-[topic]-[slug].md

if the file already exists, read its frontmatter and extract count and use it as current_count. if the file doesn’t exist, increment max_count by 1, update config.json with the new max_count and use the new max_count as current_count

add some front matter to the beginning of the markdown file and include the following fields:

  • count, with the value of current_count
  • tags, which should be a list with the single item with the value of topic
  • awesome_url, with the value of url
  • title, with the value of title
  • github, which is a boolean with the value true if url represents a github repo and with the value false otherwise

The body of the markdown file should consist of an empty line followed by a markdown link with the value of title as link text and the value of url as link url, followed by another empty line, followed by the value of desc if present, followed by another empty line.

However, if a given markdown file already exists, make sure to preserve its original count value in the front matter.

If the cloning or parsing fails at any step, the script should print a warning to stderr and continue to the next link or repo. And remember to import all dependencies, including sys if needed.

Here is the implementation:
import os
import re
import csv
import json
import sys

def clean_slug(title):
    return re.sub(r'[^a-zA-Z0-9-]+', '', title).replace('-', '_')

def get_current_count(file_path, max_count):
    if os.path.exists(file_path):
        with open(file_path, 'r') as f:
            content = f.read()
            match = re.search(r'count: (\d+)', content)
            if match:
                return int(match.group(1))
    max_count += 1
    with open('./config.json', 'w') as f:
        json.dump({'max_count': max_count}, f)
    return max_count

def parse_csv(csv_file):
    with open(csv_file, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            if 'topic' not in row or 'title' not in row or 'url' not in row:
                print(f'Skipping row in {csv_file} as it does not have required columns', file=sys.stderr)
                continue
            topic = row['topic']
            title = row['title']
            url = row['url']
            desc = row.get('desc', '')
            slug = clean_slug(title)
            current_count = get_current_count(f'./awesome-hugo/{current_count}/awesome-{topic}-{slug}.md', max_count)
            github = url.startswith('https://github.com/')
            with open(f'./awesome-hugo/{current_count}/awesome-{topic}-{slug}.md', 'w') as md:
                md.write(f'---\ncount: {current_count}\ntags: [{topic}]\nawesome_url: {url}\ntitle: {title}\ngithub: {github}\n---\n\n')
                md.write(f'[{title}]({url})\n\n')
                md.write(f'{desc}\n\n')

if __name__ == '__main__':
    try:
        with open('./config.json', 'r') as f:
            config = json.load(f)
            max_count = int(config['max_count'])
    except (FileNotFoundError, KeyError, TypeError, ValueError) as e:
        print(f'Error reading max_count from config.json: {e}', file=sys.stderr)
        sys.exit(1)
    if max_count <= 0:
        print(f'Invalid max_count value: {max_count}', file=sys.stderr)
        sys.exit(1)

    csv_dir = './awesome-csv'
    if not os.path.exists(csv_dir):
        print(f'Directory {csv_dir} does not exist', file=sys.stder
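
For reference, the intention is for each generated post to end up looking roughly like this (illustrative values, following the front matter spec from the prompt):

---
count: 42
tags: [python]
awesome_url: https://github.com/pallets/flask
title: Flask
github: true
---

[Flask](https://github.com/pallets/flask)

A lightweight WSGI web application framework.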

Sidenote 1: Warning for unsafe code!

Notice how ChatGPT is using f-strings to build file and url paths with direct concatenation and no escaping! Horrible practice… Halfway through writing this article I decided to rewrite all of those by hand in case someone copy-pastes from the article, so you will see a few of these comments in the code examples:

# repo_url = f"https://github.com/{repo_link}" # dangerous
repo_url = "https://github.com/{}".format(urllib.parse.quote(repo_link))

At this point I did ask ChatGPT how to do it safely:

take the python expression repo_url = f"https://github.com/{repo_link}" and make it safer by applying url component escaping
We can use the urllib.parse module in Python to safely escape the components of a URL. Here's the updated code:
import urllib.parse

repo_link = "my_repo/my_project"
repo_url = "https://github.com/{}".format(urllib.parse.quote(repo_link))
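
One detail that makes this work for user/repo paths: quote() leaves "/" unescaped by default (its safe parameter defaults to '/'), so the path keeps its structure while spaces and other oddities get percent-encoded:

import urllib.parse

print(urllib.parse.quote("JStumpp/awesome-android"))  # -> JStumpp/awesome-android
print(urllib.parse.quote("weird name/awesome-x y"))   # -> weird%20name/awesome-x%20y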

I tried giving it some context with the “Do you understand?” trick before the actual prompt…

Pretend you're a seasoned Python developer. I am going to ask you to write a few Python scripts. The code you produce should always build urls and file paths with proper escaping when injecting variables. Do you understand?
Yes, I understand. I can write scripts in Python that properly escape variables when building URLs and file paths.

…but when repeating the first prompt after the “Do you understand?” prompt it immediately fell into the trap again:

...

def clone_repo(repo_url, repo_name):
    repo_path = f'./awesome-repos/{repo_name}'

    ...

Adding the instructions to the end of the code prompt itself worked though:

Here's the updated script that properly escapes variables when building URLs and file paths:
...

def clone_repo(repo_url, repo_name):
    repo_path = os.path.join('./awesome-repos', repo_name)

    ...

But I have seen it blatantly ignoring instructions like that at times, so never trust it blindly to do the right thing! If the complexity of the prompt is large enough I guess it starts cutting corners.

With the free version of ChatGPT you also run into the resource quotas per prompt if you make the prompt too complicated.

I also had to keep the complexity down in some of the later prompts in this article so that it wouldn’t stall due to running into quotas while generating the output. A commercial or self-hosted AI bot would probably solve some of those problems.



Sidenote 2: Encapsulate it with Docker

If you’re like me and no longer have Python installed, but use Docker daily, then this helper might be useful:

#!/bin/bash

set -e

cd "$(dirname "$0")"

docker run -ti --rm -v "$PWD:$PWD" -w "$PWD" python:3.11 "$@"

With this harness I use ./run.sh python step1-clone.py instead of just python step1-clone.py.

This has the added benefit of mitigating some of the catastrophic scenarios that can come from running badly written code.

If I run the script as my regular user on my day-to-day computer, it can more or less nuke all my files on disk. Whereas with the Docker workflow I only mount this particular project into the container, so it can’t do much harm to the file system beyond that.

The script can still be a bad citizen on the internet when requesting data though, and it sits inside your LAN and can talk to your other computers, and even localhost services… so don’t take containerisation as blanket protection from all kinds of hazards when running someone else’s code downloaded from the internet or generated by mighty robots in the cloud with unknown intentions. It is just one small mitigating factor in the grand scheme of good security practices and thorough code review.

Putting it all together

This article is still in progress. We will revisit it shortly and update it with links to a working Hugo site as well as links to a GitHub repo with the latest version of the files. Stay tuned!



