Written in 2018, 2019 by Eliah Kagan <degeneracypressure@gmail.com>.
To the extent possible under law, the author(s) have dedicated all copyright and related and neighboring rights to this software to the public domain worldwide. This software is distributed without any warranty.
You should have received a copy of the CC0 Public Domain Dedication along with this software. If not, see http://creativecommons.org/publicdomain/zero/1.0/.
newrepo
and findrepo
are utilities that allow users with shell access to
manage Git repositories. They are intended primarily for remote use via SSH,
though you can also use them locally. They run on Unix-like systems, but there
is of course no limit on what system you SSH in from. The user running either
tool needs access to traverse into and read the directory where repositories
are kept, and needs write access to create new repositories with newrepo
.
I use these tools in production. However, for general use, I suggest considering them beta-quality (at best). They're missing some desirable features and haven't been widely tested. As far as I'm aware, they're only deployed long-term on one machine in the world.
Note that findrepo
will take a long time and consume substantial RAM if
repositories have unusually long names. In practice this is not a problem. In
the recommended use case, anyone who can run findrepo
on a server can also
run their own CPU and memory-intensive programs, so it shouldn't have severe
security implications.
See below for usage. See nrr-frr for the suggested frontends. You may also want to set up a lightweight web-based interface on your Git server; I use cgit (see setup-cgit).
Currently, the path /repos
is hard-coded as the directory where repositories
are kept. This can be modified by editing the source code of findrepo
and
newrepo
. The only external configuration is the domain name (or IP address)
that will be used in the ssh://
URLs the tools output. Using the output of
hostname
, or other such methods, would not be reliable, especially when the
server is a machine in a home or small office connected to the Internet with a
NAT router, which is the primary use case for these tools.
newrepo
and findrepo
look for /etc/newrepo-findrepo.conf
and expect it to
contain a line of the form
host=HOST
where HOST
is a domain name or IP address.
For example, if that file consists of the line host=example.com
, then an
argument of gnomovision
passed to newrepo
or findrepo
refers to a
repository intended for access by the remote URL
ssh://example.com/repos/gnomovision.git
.
I suggest making a group, e.g. coders
, membership in which confers the
ability to write to /repos
. Make /repos
a setgid directory with root
as
its owner and coders
(or your chosen group name) as its group owner.
For example, on Debian with coders mary
and larry
:
sudo addgroup coders
sudo mkdir /repos
sudo chgrp coders /repos
sudo chmod g+ws,o-rwx /repos
sudo adduser mary coders
sudo adduser larry coders
This setup assumes all members of coders
are trusted with access to all
repositories. In particular, it permits users in that group to modify and even
delete their fellow users' repositories.
One way to use these utilities from client machines is to run commands like:
ssh example.com newrepo gnomovision
ssh example.com findrepo gnomovision
You may wish to automate use of newrepo
and findrepo
with dedicated
frontends, implemented as scripts, shell functions, or the like.
The nrr
and frr
frontends, provided in the (currently separate)
nrr-frr repository, are suggested for
this purpose. They're $GIT_SSH
-aware, retrieve the appropriate hostname from
a configuration file in the user's home directory, and are run like:
nrr gnomovision
frr gnomovision
Those commands could correspond (roughly) to the ssh
commands shown above.
newrepo
creates a new bare Git repository inside a top-level directory of
repositories and reports the URL that can be used to clone it or add it as a
remote. The name of the repository to be created is passed as a command-line
argument.
newrepo NAME
It looks like this when it succeeds at creating a repository:
$ newrepo gnomovision
Initialized empty Git repository in /repos/gnomovision.git/
The URL of the newly created repository is:
ssh://example.com/repos/gnomovision.git
When it fails, its output is unlikely to be misunderstood as success:
$ newrepo gnomovision
mkdir: cannot create directory ‘gnomovision.git’: File exists
findrepo
searches repositories in a top-level repo directory. It assumes the
user intends an exact match but might be wrong, perhaps due to typos, but as
likely due to incomplete or mistaken memory about the name of the repository
they're looking for. The search pattern is passed as a command-line argument.
findrepo GUESS
Your guess can be pretty wildly wrong, and findrepo
still usually manages to
include what you meant in its small handful for suggestions.
findrepo
shows a URL when it's pretty sure that specific URL is what you
actually want (for cloning a repo or adding a remote). When there is any
substantial ambiguity, is shows its own guesses but not a URL. You can run it
again with one of those guesses to get a URL.
Even when there is no real ambiguity, it warns about situations like incorrect capitalization that might cause confusion under some circumstances.
When your guess is correct, it gives you the URL for the repository:
$ findrepo gnomovision
The URL for the repository "gnomovision" is:
ssh://example.com/repos/gnomovision.git
When there are other repositories that would have matched except for differences in capitalization (a situation that would typically arise only by accident), it gives you the URL but also warns:
$ findrepo TestRepo
The URL for the repository "TestRepo" is:
ssh://example.com/repos/TestRepo.git
BEWARE! 2 other repos are named like this but capitalized differently.
TeStRePo
* TestRepo
testrepo
When your guess is off only by case, findrepo
responds appropriately
depending on whether or not there is ambiguity:
I have no "GnomoVision" repo. Perhaps you mean "gnomovision"?
The URL for the repository "gnomovision" is:
ssh://example.com/repos/gnomovision.git
BEWARE! The filesystem on this server is case-sensitive.
$ findrepo TESTREPO
I have no "TESTREPO" repo. Maybe you want one of these...
TeStRePo
TestRepo
testrepo
Those repos are like what you requested, but with different capitalization.
If any of those look EXACTLY the same, they may have weird Unicode characters.
The more interesting situation is when you spell the repository name incorrectly, or just flat-out guess wrong:
$ findrepo gnomviissn
I have no "gnomviissn" repo. These repos have vaguely similar names...
gnomovision
NVI
Visit
AsIs
Disposing
$ findrepo visual-moon
I have no "visual-moon" repo. These repos have vaguely similar names...
gnomovision
foo
Pool
Visit
Mob
NVI
The short handful of "similar names" counter-guessed by findrepo
often
includes some you may find totally irrelevant and that apparently dissimilar.
But the one you're looking for is usually among them.
findrepo
matches guesses according to cosine
similarity of
nonempty-substring frequency vectors.
Ordinary spell-checkers address the use case of spellings that are wrong but
very nearly right, and variations on that problem. They're typically based on
the idea of edit distance. I
considered this approach for findrepo
but decided against it, because when
users guess the name of a repository, they're often not guessing just about
fine points of spelling. My concern wasn't that conceptual errors wouldn't be
caught, but that they wouldn't be ranked high enough compared to typos or, if
they were, then practically unlikely conceptual errors wouldn't be ranked low
enough compared to practically likely ones.
In hindsight, I'd characterize the difference between spell-checking and what
findrepo
does this way: spell-checkers find near-matches among many possible
candidates, while findrepo
finds farther-away matches among fairly few
candidates. Since newrepo-findrepo is intended for small-scale use, it's
unlikely to be used with more than a few thousand repositories. My own use is
with hundreds.
I briefly considered the approach of attempting to write something that would glean meaning from repositories' names. That seemed both interesting and quite hard. It also wasn't obvious to me that it would work better than a simpler approach. Repository names, which are often names of software projects or abbreviations thereof, are unusual applications of natural language; I didn't think a natural language processing approach would provide any clear advantage. More important, I wanted to start by writing a prototype that my own knowledge of algorithms and data structures would facilitate, and then iterate or redesign from there if it proved inadequate.
I had heard of cosine similarity as a measure of document
distance. One approach, which
does not apply to the problem findrepo
solves but which is nonetheless
illustrative, is to represent the similarity of natural-language documents as
the normalized inner product of their word frequency vectors. I very briefly
considered using cosine similarity of character frequency vectors, but this was
obviously wrong—for example, the name of a repository and a scrambling thereof
shouldn't match each other nearly as well as the name matches itself.
What I settled on was to enumerate all nonempty substrings of existing
repository names and the user's guess, counting the frequencies of each, then
rate proximity of each repository name to the guess by the normalized inner
product of that frequency vector. In this way, the mere presence of some of the
same letters (in similar amounts) counts for something, but not too much, since
most names contain many more substrings of greater lengths. In particular, this
addresses the case of forgotten word order within a name, in such a way that
findrepo
doesn't need to understand which substrings are the words. For
example, with two words, a name has a left and right side (not necessarily the
same length), and exchanging the sides removes the contributions of the
substrings that cut across the break, but not of those on one side or the other
of the break.
Case is ignored for purposes of assessing name similarity. I didn't think of a way of incorporating it that would justify the additional complexity. Cosine similarity is only used when the guess neither exactly matches the name of any existing repository nor matches it under case-folding equivalence. Thus all repository names have to be considered regardless of the guess (even given an exact match, there might be ambiguity to warn about), but cosine similarities are only computed—and nonempty substring frequency vectors only built—when necessary.
The reason substrings must be nonempty to be considered is so length doesn't dominate too much. An n-character string contains n + 1 empty substrings (one at each character position, and one at the end).