Open Development data @ Eclipse: Help yourself!

Published on 2015-06-18

This article discusses what open development data is and how it can be used in the context of the Eclipse forge. It details the available development data sources, gives advice on how to access and use them, and provides a basic example showing how this information can be integrated and displayed.


Introduction

Open development data has received increasing interest in the last 10 years, with the development of software repository mining and the advent of big data analytics. The open source movement has largely contributed to this dissemination by providing free access to code and tools, and also by promoting openness and transparency.

What has changed in recent years is that data has become an essential asset for companies, projects and teams. People have become aware of its richness and of the possibilities it offers. Projects and collaborative systems now often provide a centralised repository that delivers organised data in an easy-to-use format. Tools have been developed to easily retrieve, analyse and display it.

As a major open-source forge, the Eclipse Foundation has stepped in and set up dedicated services for data collection, usage and redistribution.

About open data

Knowledge is open if anyone is free to access, use, modify, and share it.

What is open data

Development data comes in many different colours, shapes and forms. In a practical sense it may be composed of:

  • Metadata from development tools like Git, Bugzilla and Gerrit.
  • Community exchanges: mails, forums, wiki.
  • Process information: IP, planning, resources, organisational metadata, etc.
  • Product: code, software metrics, rule-checking analysis, etc.
  • External sources: question-and-answer sites, forums, publications, user ratings, etc.

Most of the time this information can be retrieved directly, either from the tools or from the data sources themselves. Projects and forges can go one step further and set up an infrastructure to ease the dissemination of open data by:

  • Cleaning and organising data assets. Providing pre-processed, ready-to-use data organised in neat structures flattens the learning curve and facilitates usage.
  • Using standardised formats like JSON, XML or CSV files through a REST API or flat files. There are plenty of tools and examples in all languages for these formats.
  • Encouraging reuse through compatibility and documentation. If the data structure is carefully explained, with code and usage examples to extract and use it, more people will come in and play.

Why open data is important

Development data serves several purposes. First of all, it makes it possible to track the activity, progress and practices of the project. As such it constitutes a rich source of knowledge for stakeholders, developers and users of projects, giving them clear insights into the project and thus helping them perform better.

Having data easily available also allows people to follow the project and get news and updates. Showing the activity of the project is a definite plus when users look for a tool to help them solve their problem. In that sense it greatly helps to nurture and foster the community. Open development data makes people feel at home and helps them get involved.

Development data is factual. Whatever people may think or feel about the project, development data brings numbers, figures and facts that can be verified and acted upon. These give a clear perspective on complex code, process or code issues, and may provide essential insights for resolving them.

Last but not least, data is the best friend of knowledge. Open development data helps research help you: researchers need real data to explore and study, and to develop and test new models. New models, methods and tools can then be developed to make a better tomorrow. Hey, if we can do it, why not?


Data Sources

The PMI

Every project entering the Eclipse umbrella has to be registered in the Project Management Infrastructure database, and must provide information about its available resources, roadmap, documentation and process. This information has been standardised and can be seen on the projects.eclipse.org web interface.

The PMI web interface for any project is available at the following URL: projects.eclipse.org/projects/<project_id>.
As an example, for Sirius the URL is projects.eclipse.org/projects/modeling.sirius.

Data can also be exported directly in JSON, using the following URL: projects.eclipse.org/json/project/<project_id>.
As an example, for Sirius the URL is projects.eclipse.org/json/project/modeling.sirius. An example of a JSON export of this information is provided below:

{
  "projects": {
    "modeling.sirius": {
      "title": "Sirius",
      "description": [
        {
          "value": "Sirius enables the specification of a modeling workbench [SNIP]",
          "summary": "Sirius enables the specification of a modeling workbench  [SNIP]"
        }
      ],
      "bugzilla": [
        {
          "product": "Sirius",
          "create_url": "https:\/\/bugs.eclipse.org\/bugs\/enter_bug.cgi?product=Sirius",
          "query_url": "https:\/\/bugs.eclipse.org\/bugs\/buglist.cgi?product=Sirius"
        }
      ],
      "build_url": [
        {
          "url": "https:\/\/hudson.eclipse.org\/sirius\/",
          "title": "Private Hudson instance for Sirius"
        }
      ],
      "documentation_url": [
        { "url": "http:\/\/www.eclipse.org\/sirius\/doc" }
      ],
      "download_url": [
        { "url": "http:\/\/www.eclipse.org\/sirius\/download.html" }
      ],
      "gettingstarted_url": [
        { "url": "http:\/\/wiki.eclipse.org\/Sirius\/Getting_Started" }
      ],
      "id": [
        { "value": "modeling.sirius" }
      ]
    }
  }
}
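
As a minimal sketch of how such an export might be consumed, the Python snippet below builds the export URL for a project ID and picks a few fields out of the decoded document. The https scheme, the function names and the choice of fields are assumptions made for illustration, and the structure is taken from the extract above; the live service may differ.

```python
import json
from urllib.request import urlopen

# Assumption: the article gives the address without a scheme; https is a guess.
PMI_JSON_BASE = "https://projects.eclipse.org/json/project/"


def pmi_json_url(project_id):
    """Build the PMI JSON export URL for a project ID such as 'modeling.sirius'."""
    return PMI_JSON_BASE + project_id


def summarise_pmi(document, project_id):
    """Pick a few fields out of a decoded PMI export shaped like the extract above."""
    project = document["projects"][project_id]

    def first(field, key):
        # Most PMI fields are lists of small objects: return the first entry's
        # value for `key`, or None when the field is absent or empty.
        entries = project.get(field) or []
        return entries[0].get(key) if entries else None

    return {
        "title": project.get("title"),
        "build_url": first("build_url", "url"),
        "documentation_url": first("documentation_url", "url"),
        "bugzilla_product": first("bugzilla", "product"),
    }


def fetch_pmi(project_id, timeout=10):
    """Download and decode the export (requires network access)."""
    with urlopen(pmi_json_url(project_id), timeout=timeout) as response:
        return json.load(response)
```

With network access, summarise_pmi(fetch_pmi("modeling.sirius"), "modeling.sirius") would return a small dictionary holding the title, build URL, documentation URL and Bugzilla product of the project.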
          

Grimoire and the Eclipse Dashboard

The Eclipse Dashboard (dashboard.eclipse.org) uses the Grimoire open-source tool to provide community metrics gathered from configuration management, issue tracking systems, mailing lists, and Gerrit reviews.

All data can be directly visualised on the web dashboard, and raw data files can be downloaded in JSON format. As an example, Grimoire data for the Sirius project can be visualised in the dashboard UI at http://dashboard.eclipse.org/project.html?project=modeling.sirius. The corresponding JSON file for the issue tracking system is located at http://dashboard.eclipse.org/data/json/modeling.sirius-its-prj-static.json. Here is the full content of the file, with the numbers current at the time of writing:

{
  "changed": 870,
  "changers": 99,
  "closed": 572,
  "closed_30": 52,
  "closed_365": 470,
  "closed_7": 8,
  "closers": 23,
  "closers_30": 6,
  "closers_365": 18,
  "closers_7": 4,
  "diff_netclosed_30": 1,
  "diff_netclosed_365": 335,
  "diff_netclosed_7": 2,
  "diff_netclosers_30": -1,
  "diff_netclosers_365": 7,
  "diff_netclosers_7": 1,
  "opened": 875,
  "openers": 99,
  "percentage_closed_30": 1,
  "percentage_closed_365": 248,
  "percentage_closed_7": 33,
  "percentage_closers_30": 14,
  "percentage_closers_365": 63,
  "percentage_closers_7": 33,
  "trackers": 1
}
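
A small sketch of what can be derived from this file: the snippet below builds the file URL from the project ID and computes the number of issues still open and the share of issues closed. It assumes that "opened" and "closed" are the total counts of issues opened and closed, which the file itself does not document; the function names are illustrative only.

```python
GRIMOIRE_JSON_BASE = "http://dashboard.eclipse.org/data/json/"


def its_static_url(project_id):
    """URL of the issue tracking summary file, following the pattern shown above."""
    return f"{GRIMOIRE_JSON_BASE}{project_id}-its-prj-static.json"


def issue_backlog(metrics):
    """Derive simple indicators from a decoded '-its-prj-static.json' document.

    Assumption: 'opened' and 'closed' are cumulative totals of issues.
    """
    opened = metrics["opened"]
    closed = metrics["closed"]
    return {
        "backlog": opened - closed,
        # Guard against a project with no issue at all.
        "closed_ratio": closed / opened if opened else 0.0,
    }
```

With the figures shown above (875 opened, 572 closed), the backlog comes out at 303 issues and roughly 65% of all issues are closed.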
          

Hudson

Many Eclipse projects use Hudson as a continuous integration engine, and make its interface available to the public. The Hudson system provides a REST API to access information about the instance and its jobs. Data mungers can also use Hudson RSS feeds to track builds and jobs.

The following JSON file extract has been retrieved from https://hudson.eclipse.org/sirius/api/json?depth=1.

{
  "nodeDescription": "the master Hudson node",
  "numExecutors": 8,
  "description": "Eclipse.org Sirius Builds\r\n",
  "jobs": [
    {
      "description": "",
      "displayName": "ecoretools-2.0",
      "name": "ecoretools-2.0",
      "url": "https://hudson.eclipse.org/sirius/job/ecoretools-2.0/",
      "buildable": true,
      "builds": [
        { "number": 21, "url": "https://hudson.eclipse.org/sirius/job/ecoretools-2.0/21/" },
        { "number": 20, "url": "https://hudson.eclipse.org/sirius/job/ecoretools-2.0/20/" }
      ],
      "color": "yellow",
      "healthReport": [
        { "description": "Test Result: 116 tests failing out of a total of 485 tests.", "iconUrl": "health-60to79.png", "score": 76 },
        { "description": "Build stability: No recent builds failed.", "iconUrl": "health-80plus.png", "score": 100 }
      ]
    }
  ]
}
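
As a rough illustration, the following Python sketch turns such a document into one line per job, with its colour and its lowest health score. The field names come from the extract above; the function names and the choice of the lowest score as a summary are assumptions, not part of the Hudson API.

```python
def hudson_api_url(instance_url, depth=1):
    """Append the JSON API path used above to a Hudson instance URL."""
    return instance_url.rstrip("/") + f"/api/json?depth={depth}"


def job_health(instance):
    """Return (name, colour, lowest health score) for each job of an instance.

    `instance` is the decoded JSON document; jobs without any health report
    get None as their score.
    """
    rows = []
    for job in instance.get("jobs", []):
        scores = [report["score"] for report in job.get("healthReport", [])]
        rows.append((job["name"], job.get("color"), min(scores) if scores else None))
    return rows
```

Applied to the extract above, job_health returns a single row for the ecoretools-2.0 job, coloured yellow, with a lowest score of 76.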

Code

As with all open-source projects, another important resource for open development data is the code itself. Many aspects of the development, such as coding practices or architectural patterns, can be identified in code. Several tools may be used to analyse code, from well-known rule-checking tools like PMD, FindBugs or CheckStyle, to metrics and software analytics systems.

As an example, the PolarSys dashboard provides a zip of the XML results produced by the PMD and FindBugs analyses. These can in turn be used to run a semantic analysis of the results, as demonstrated here.

Other sources

Beyond these official data sources there are other means to get relevant information about projects. More internal sources could be analysed, like recent changes to the web site and wiki, or downloads. Much can be extracted from these using text analysis techniques or custom scripts.

External web sites may also shed light on the project's community (e.g. articles and publications), user feedback (e.g. user ratings like OpenHub), or support (question-and-answer web sites like Stack Overflow).


Dashboard UI

An example dashboard

A basic example of a dashboard has been developed as a proof of concept. This is merely a quick 'n dirty hack to pull information from some of the data repositories mentioned and present it in a single place. There is much room for improvement once all the data is at hand.

The magic of the smooth integration and compatibility of the Eclipse data sources is that a single element, the project ID, is enough to identify and retrieve the information from the various repositories. In this case, from this single entry we are able to:

  • Extract PMI data, including name, description, Hudson URL, documentation, etc.
  • Extract Dashboard data (current and history) to draw community figures.
  • Extract Continuous Integration data, including number of jobs, status of builds, etc.
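
The idea of starting from the project ID alone could be sketched as follows. The PMI and dashboard addresses follow the patterns given earlier in this article (the https scheme for the PMI is an assumption), while the Hudson address cannot be derived from the ID and is read from the "build_url" field of the PMI export. The function name and dictionary keys are hypothetical.

```python
def data_sources(project_id, pmi_document=None):
    """Map a project ID to the addresses of its open development data.

    `pmi_document` is an optional decoded PMI JSON export; when given, the
    Hudson API address is derived from its first 'build_url' entry.
    """
    sources = {
        "pmi_page": "https://projects.eclipse.org/projects/" + project_id,
        "pmi_json": "https://projects.eclipse.org/json/project/" + project_id,
        "dashboard_page": "http://dashboard.eclipse.org/project.html?project=" + project_id,
        "dashboard_its_json": "http://dashboard.eclipse.org/data/json/"
        + project_id
        + "-its-prj-static.json",
    }
    if pmi_document is not None:
        builds = pmi_document["projects"][project_id].get("build_url") or []
        if builds:
            # The build URL points at the Hudson instance; add the API path.
            sources["hudson_json"] = builds[0]["url"].rstrip("/") + "/api/json"
    return sources
```

Calling data_sources("modeling.sirius") gives the PMI and dashboard addresses; passing the decoded PMI export as second argument adds the Hudson one.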

This data is processed and displayed in a basic dashboard, but it could also feed:

  • More complex dashboards with timelines and deep analysis: HTML, JS, D3js, etc.
  • PDF automatic reports with Markdown or Knitr generated documents.
  • Actions: send emails on thresholds, connect to an analytics framework, etc.
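
As an illustration of the threshold idea in the last item, the sketch below compares a set of metrics against minimum values and returns the messages that an email or any other action could carry. The function name, the message wording and the example thresholds are all hypothetical; sending the email itself is left out.

```python
def threshold_alerts(metrics, minimums):
    """Return one message per metric that is missing or below its minimum.

    `metrics` is a decoded metrics document (e.g. the issue tracking file
    shown earlier); `minimums` maps metric names to the lowest accepted value.
    """
    alerts = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no data")
        elif value < minimum:
            alerts.append(f"{name}: {value} is below the minimum of {minimum}")
    return alerts
```

An empty list means that every monitored metric is at or above its minimum, so no action is needed.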

List of available resources

This section is a summary of the various places where open development data can be found.

PMI data can be visualised on the http://projects.eclipse.org web site.

The Grimoire data files can be visualised on the Eclipse dashboard at http://dashboard.eclipse.org.

  • A single project can be retrieved via http://dashboard.eclipse.org/project.html?project=modeling.sirius.
  • And the JSON files can be retrieved via http://dashboard.eclipse.org/data/json.

API documentation for Hudson instances can be retrieved from the engine itself at https://hudson.eclipse.org/sirius/api/. Many access points are defined from there:

  • JSON data for the whole instance: https://hudson.eclipse.org/sirius/api/json.
  • JSON data for a specific job: https://hudson.eclipse.org/sirius/job/sirius-3.0.x/api/json.

The PolarSys dashboard is also a good example of data integration for Eclipse projects: http://dashboard.polarsys.org. It is further described in the PolarSys wiki at https://polarsys.org/wiki/Maturity_Assessment_WG.