[Today’s random Sourcerer profile: https://sourcerer.io/jgosmann]

How to: add your favorite language & libraries to Sourcerer

Alexander Surkov
Published in Sourcerer Blog
5 min read · May 23, 2018


Profiles

Sourcerer profiles are built around an engineer’s skills. The core of a profile is the set of programming languages the engineer has expertise in. If you have ever worked with Java or C++, your profile reflects this. Another key piece of a profile is the technologies the engineer is proficient with. If you work on a project that involves HTML, CSS and JavaScript, you are granted a web technology header housing these individual technologies. If you use SQLAlchemy or Redis, you are given a database technology header as well. Curious to take a look at a profile sample? Here’s a real one.

All languages and technologies that you have ever worked with should be reflected on your profile so that it describes your work experience accurately. Your profile should also highlight the most valuable parts and stay precise about the subtle differences between similar technologies on the market, all in an effort to make it comprehensive, accurate, and individual to you.

Under the hood, a mechanism recognizes and extracts all the languages and technologies from your commits and source code. We started an open source project, named sourcerer-app, to do this work. The exact same routine happens if you link your GitHub account to Sourcerer. In this case, the app runs on our servers and goes through your public GitHub projects. It just happens behind the scenes.

Language and technology detection is central to the sourcerer-app logic. Profile quality depends directly on how accurate and detailed the app is. Needless to say, if the app fails to identify a language or technology, your profile is incomplete.

Where are my languages?

If you see a language or a technology not showing up on your profile, and you are curious why it wasn’t detected, then the short answer is that we probably do not support it *yet*.

According to Quora, there are about 256 programming languages. It’s an impressive number and shows just how rich and wide an engineer’s choices are. What’s more, the combinations of those languages, plus the numerous libraries and frameworks created for them, form an endless number of variations.

We work very hard to add new languages and technologies as often as we can, in order to meet your expectations and make profiles as comprehensive as possible. Still, we could definitely use some help.

How can I help?

Yes, your contributions are welcome! Only engineers can make engineering profiles better! So, if you see that your favorite language or technology is missing from your profile, you now have the ability to fix it yourself. All you need to do is teach sourcerer-app to recognize it. Here’s how it’s done.

Languages

First of all, you need to build sourcerer-app on your machine. Make sure that your system meets the prerequisites, which are basically Linux or OS X and Java, and please also check out our contribution guidelines. Now you are all set! It’s coding time!

Language detection logic lives in this folder. First, we attempt to recognize a language by its file extension. If the extension is unique to the language, you can breathe out: it’s an easy one. All you need to do is add a mapping between the file extension and the language name.

Let’s say your favorite programming language, named ‘Baklava Code’, is not yet supported, and its source files use the ‘.code’ extension. You can add support by simply adding a block like this to our Heuristics structure:

"code" to { _ ->
    CommonExtractor("Baklava Code")
},

CommonExtractor is a dummy extractor class that has no knowledge of the language, so your profile will show no statistics for this language other than its name. No worries, though: we discuss below how the statistics can be refined with libraries and technologies. Keep reading!
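
To make the idea concrete, here is a minimal, self-contained sketch of an extension-to-extractor lookup. It is not the app’s actual code: the CommonExtractor class below is a stand-in, and the names heuristics and detect are assumptions made for illustration only.

```kotlin
// Stand-in for the app's extractor class; the real one lives in sourcerer-app.
class CommonExtractor(val language: String)

// Hypothetical map from a file extension to a factory that receives the file's lines.
val heuristics: Map<String, (List<String>) -> CommonExtractor> = mapOf(
    "code" to { _: List<String> -> CommonExtractor("Baklava Code") }
)

// Looks up an extractor by the extension of a file name.
// Returns null when the extension is unknown, i.e. the language is not supported yet.
fun detect(fileName: String, lines: List<String>): CommonExtractor? {
    val extension = fileName.substringAfterLast('.', "")
    return heuristics[extension]?.invoke(lines)
}
```

With this sketch, detect("pastry.code", emptyList()) yields an extractor named ‘Baklava Code’, while a file without a known extension yields null.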

If the file extension is not unique, i.e. it may be used for different languages, we attempt to identify patterns specific to a language. For example, ‘.h’ is a file extension used for both ‘c’ and ‘c++’. When we get a ‘.h’ file, we check its content against c++-specific constructs such as ‘class’, ‘template’ and the like. If we find them, we report ‘c++’; otherwise, ‘c’.
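
A rough sketch of such a content check might look like the following. The keyword list and the names cppOnly and detectHeaderLanguage are assumptions for illustration; the app’s real pattern is likely more thorough.

```kotlin
// Hypothetical pattern: a few keywords that start a line in C++ headers but rarely in plain C.
val cppOnly = Regex("""^\s*(class|template|namespace)\b""", RegexOption.MULTILINE)

// Returns "c++" when any C++-only construct is found in the header, otherwise "c".
fun detectHeaderLanguage(lines: List<String>): String {
    val buf = lines.joinToString("\n")
    return if (cppOnly.containsMatchIn(buf)) "c++" else "c"
}
```

Like any heuristic, this one can be fooled (a C file may use ‘class’ as an identifier at the start of a line), which is why sample files and tests matter.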

Let’s get back to our ‘Baklava Code’ example. Suppose the ‘.code’ file extension is also used by another programming language, say ‘Hydro Code’. In this case, the extension alone is not enough, and the app may fail to detect the language correctly. To distinguish the two languages, the app has to recognize patterns and structures specific to each of them. In our toy example, the code may look like:

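"code" to { lines ->
    // containsMatchIn looks for the pattern anywhere in the file;
    // matches would require the whole buffer to match.
    if (Regex("baklava code patterns").containsMatchIn(toBuf(lines)))
        CommonExtractor("Baklava Code")
    else
        CommonExtractor("Hydro Code")
},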

Last but not least, please make sure to put sample files of the language into the samples folder. They are picked up automatically by our testing script, which checks each folder’s name against the name of the detected language. If the names match, the test reports success. It may happen that we already have samples for the language. If so, they are likely listed under todo tests, and you should adjust our todo list by removing them.
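
The check described above can be sketched roughly as follows. This is not the project’s testing script: it uses an in-memory map instead of walking the samples folder on disk, and the function name checkSamples and the detector signature are assumptions for illustration.

```kotlin
// samples maps a folder name (the expected language) to its files (file name -> lines).
// detect is a hypothetical detector: file name and lines in, language name or null out.
// Returns a list of mismatch descriptions; an empty list means every sample passed.
fun checkSamples(
    samples: Map<String, Map<String, List<String>>>,
    detect: (fileName: String, lines: List<String>) -> String?
): List<String> {
    val failures = mutableListOf<String>()
    for ((expected, files) in samples) {
        for ((fileName, lines) in files) {
            val actual = detect(fileName, lines)
            if (actual != expected) {
                failures.add("$expected/$fileName: detected $actual")
            }
        }
    }
    return failures
}
```

The point of the sketch is that the folder name is the only expectation a sample needs, so naming the folder exactly as the language is reported matters.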

You can run the tests on your machine by running ‘gradle test’ in the sourcerer-app folder to make sure your patch is correct and the language is now recognized.

Libraries and technologies

You can improve language statistics further by adding deeper analysis of the source code. First, you need to teach sourcerer-app to understand the syntax of the target language: create a new extractor class that parses a source file and extracts its import directives. See, for example, the C++ extractor class, which looks for #include directives and then runs the extracted imports against known libraries.
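
As a simplified illustration of that idea, the sketch below pulls #include paths out of a file and compares them against a set of known import prefixes. The names includeRegex, extractIncludes and matchLibraries are hypothetical, and the plain prefix lookup is a simplification that does not reflect how the app actually resolves libraries.

```kotlin
// Matches both #include <header> and #include "header"; captures the path inside the delimiters.
val includeRegex = Regex("""^\s*#\s*include\s*[<"]([^>"]+)[>"]""")

// Collects the included paths from a source file's lines, in order of appearance.
fun extractIncludes(lines: List<String>): List<String> =
    lines.mapNotNull { line -> includeRegex.find(line)?.groupValues?.get(1) }

// Hypothetical lookup: keeps the known import prefixes that at least one include path starts with.
fun matchLibraries(includes: List<String>, knownImports: Set<String>): Set<String> =
    knownImports.filter { known ->
        includes.any { it == known || it.startsWith("$known/") }
    }.toSet()
```

For a file containing #include <cinder/app/App.h>, extractIncludes returns the path cinder/app/App.h, which matchLibraries would associate with the known prefix cinder.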

Library taxonomies live in a separate repository, which is also open source, so you’re encouraged to hack on it too :) Taxonomies are stored as JSON files, one file per language. Each file contains an array of items, each of which describes a library. A library has a number of properties, such as its name and the technologies it is related to.

Here’s an example of such a record for the Cinder C++ library:

{
    "id": "cpp.cinder",
    "imports": ["cinder"],
    "name": "Cinder",
    "repo": "cinder/Cinder",
    "tags": ["toolkit"],
    "tech": ["graphics-gaming"]
}

Sourcerer-app doesn’t work with the JSON taxonomies directly. Instead, it uses an AI system that is trained on the taxonomies. The app then runs through the source code line by line, and the AI makes an educated guess about which library was used. Happily enough, you don’t have to know much about these internals. All you need to do is tweak the JSON files to support a new library :)

To start, please check whether we already have a taxonomy for your language. If not, add one. Then add a new record for each library you want to support. If a library should be connected to a technology that we don’t have yet, please add that technology too. Like the libraries, the technologies are described in a JSON file, and all you need to do is add a new record. Please refer to our guidelines for more information.

That’s about it. Sounds exciting enough to start experimenting? You’re welcome to contribute! Happy coding!


Software engineer and author passionate about web, open source, accessibility and user-facing technologies.