Linux Kernel benchmarks versions 3.11 to 3.16

Linux 3.16 has been out for one and a half weeks now, which is enough time to benchmark the new kernel against the five most recent kernel versions. All benchmarks target the kernel CPU scheduler. There has also been a lot of progress on the cbenchsuite visualization code, improving the usability and the level of detail of the plots. The new developments include box plots and a whole new web frontend for browsing the benchmark results.

As you can see in the result browsers above, the new box plots make far more details visible.

Some results

I will show some results here that differ more or less significantly from older kernel versions.

Let’s start with a benchmark that measures the performance of the fork syscall.

Fork syscall benchmark on Linux Kernels from 3.11 to 3.16 with 4 threads.

This is the performance of the fork() syscall when 4 threads run the same fork loop on different kernels. You can clearly see that 3.16 completes about 10,000 fewer forks than previous kernel versions, roughly 4% fewer. That is only the case for 4 threads, though; in other benchmark configurations the difference is not as large.
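The measured loop is presumably similar to this minimal single-process sketch (the actual cbenchsuite plugin is written in C and runs one such loop per thread; the duration chosen here is arbitrary):

```python
import os
import time

def fork_loop(duration=0.2):
    """Count how many fork()+wait() cycles complete in `duration` seconds."""
    count = 0
    end = time.monotonic() + duration
    while time.monotonic() < end:
        pid = os.fork()
        if pid == 0:
            os._exit(0)        # child exits immediately
        os.waitpid(pid, 0)     # parent reaps the child
        count += 1
    return count
```

Dividing the returned count by the duration gives forks per second, the kind of metric compared in the plot.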

Continuing with the popular kernel compile benchmark: how fast is a kernel compiled on different kernel versions with different numbers of threads?

Runtime for compiling the linux kernel with 2 threads.

This plot shows the runtime comparison with 2 threads. As you can see, 3.16 performs better than 3.15 and the other versions. With more than 4 threads, this performance benefit of 3.16 disappears.

Hackbench is also popular as a scheduler benchmark. It spawns a number of groups of processes that communicate with each other; the groups can be set up to communicate over pipes or not.

Hackbench with 90 groups and without pipes.

Hackbench with 90 groups communicating over pipes.

The plots show that performance with pipes got worse in 3.16, while performance without pipes improved.
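Reduced to a single sender/receiver pair, hackbench's communication pattern looks roughly like this sketch (the real benchmark spawns many such pairs per group, runs them in parallel and measures the total runtime; the message count and size here are arbitrary):

```python
import os
import socket

def transfer(use_pipe, messages=100, size=100):
    """Send `messages` fixed-size messages from a writer process to a
    reader process, over a pipe or a socketpair (hackbench's two modes)."""
    if use_pipe:
        rfd, wfd = os.pipe()
    else:
        rsock, wsock = socket.socketpair()
        rfd, wfd = rsock.fileno(), wsock.fileno()
    pid = os.fork()
    if pid == 0:                       # child acts as the receiver
        for _ in range(messages):
            left = size
            while left:
                left -= len(os.read(rfd, left))
        os._exit(0)
    for _ in range(messages):          # parent acts as the sender
        os.write(wfd, b"x" * size)
    os.waitpid(pid, 0)
    if use_pipe:
        os.close(rfd)
        os.close(wfd)
    else:
        rsock.close()
        wsock.close()
```

The pipe/socketpair switch is exactly the configuration dimension the two plots above compare.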

Setup

The detailed setup for all plots is visible in the result browsers linked above. I used a development snapshot of cbenchsuite for the measurements, the plots and the web frontend. The latest features are not released yet, but probably will be soon.

If you have any questions, don’t hesitate to ask them.


UART/RS232 BeagleBone black cape

There are many capes and lots of documentation for the BeagleBone Black out there. However, the most annoying thing about using the BeagleBone has always been the terribly designed pin headers, which are simply not meant for temporary connections to different components like I2C devices. Having to look up the pins before connecting anything is not convenient.

Since I was interested in how PCB design works, I simply designed a cape for exactly this purpose: a much easier way to connect to the essential pins of the BeagleBone. To also be able to use it as a remote console for other embedded boards, I added converters from 3.3V-level UART to RS232.

In the end, this cape provides 2 I2C pin headers, 1 SPI, 8 GPIOs, 4 UARTs and 4 RS232 ports (although the latter can only be used when the corresponding UART is disabled via a switch). Most pin headers include additional power pins, 3.3V and/or 5V.

Finished self-designed and self-soldered cape

Design

To design this PCB I used KiCad, an open source toolchain for PCB design. I had never used any PCB software before, but after reading a bit of the documentation it was quite clear how to use it. I deliberately chose small SMD components to find out how reflow soldering works.

Download

Here are all the files of the design:

I uploaded the revision 1 Gerber files directly to OSH Park. All photos you can see show the PCBs from OSH Park.

Soldering

The SMD components I used for this board are probably still within hand-soldering range for many people. However, as I wanted to learn something new with this project, I tried solder paste and a 'reflow oven'. It sometimes sounds quite difficult, but it is actually very easy.

If you have one, you can use a solder paste stencil for your board, which makes placing the solder paste correctly even easier, but it also works with a solder paste syringe. After putting a little solder paste on every SMD pad, you place the component into the paste. For SMD ICs you can simply draw a small line of solder paste across all pads. You don't have to align the components extremely accurately, as they will all straighten up when the solder becomes liquid.

I reflowed all three boards I have and got a feeling for how much solder paste the pads need. Essentially you can't do much wrong, as you always have the option to resolder components after they were reflow-soldered. I had to do that with nearly all the ICs because there were some solder bridges between pins, but even that was quick and easy. Next time, however, I will use a stencil for the solder paste, as I recently discovered OSH Stencils.

For reflowing you don't need an expensive reflow oven; any oven will do. The only thing you should have is a thermometer that you can put into the oven to monitor the temperature.

Reflowed PCB, all SMD components correctly soldered and solder bridges removed

Software

As you have to support this hardware from Linux somehow, I wrote a devicetree patch for the am335x-bone-common file, although I am not sure whether this cape will really work on the BeagleBone White. I tested the UART ports; I2C and SPI are still untested as I didn't have an I2C/SPI device at hand. If you run into problems, let me know. The patch is currently based on 3.16-rc.

am335x devicetree patch

Of course all of this is provided 'as is', without warranty. Everything you see is open source, and the hardware design is open hardware, so feel free to modify it. It would be great to hear about any modifications or experiences, so feel free to contact me.


egpms_ctl 1.0 released for Gembird EG-PMS

egpms_ctl combines a Linux kernel module and a Python control script to give you reliable access to Gembird EG-PMS or SilverShield programmable power outlets.

Last weekend I released version 1.0. The kernel module got a complete cleanup, including a format change of the schedule sysfs files: entries are now separated by ',' and ';'. For example, the following schedule enables the outlet after 1 minute and disables it again after another 2 minutes.

1,on;2,off;

As this example shows, you can now use real words to describe the desired outlet state. Possible values are 'on'/'enabled'/'1' and 'off'/'disabled'/'0'.
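For illustration, a schedule string in this format can be parsed along these lines (a sketch, not the actual egpms_ctl code; it accepts exactly the state words listed above):

```python
ON_WORDS = {"on", "enabled", "1"}
OFF_WORDS = {"off", "disabled", "0"}

def parse_schedule(text):
    """Parse a schedule like '1,on;2,off;' into (minutes, state) tuples,
    where state is True for on and False for off."""
    entries = []
    for item in text.split(";"):
        if not item:
            continue  # tolerate the trailing ';'
        minutes, state = item.split(",")
        word = state.strip().lower()
        if word in ON_WORDS:
            entries.append((int(minutes), True))
        elif word in OFF_WORDS:
            entries.append((int(minutes), False))
        else:
            raise ValueError("unknown outlet state: %r" % state)
    return entries
```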

You can obtain the source code from git.allfex.org or github.com.


How to install cbenchsuite on Debian stable

There are some requirements you have to install before starting to clone and install cbenchsuite.

apt-get install git autoconf libtool pkg-config gperf ncurses-dev uuid uuid-dev sqlite3 libsqlite3-dev

Some explanation of these dependencies:

  • git – to clone cbenchsuite
  • autoconf, libtool, pkg-config, gperf, ncurses-dev – to build kconfig-frontends
  • uuid, uuid-dev – to generate UUIDs for each individual test execution
  • sqlite3, libsqlite3-dev – cbenchsuite's storage backend

Now you can simply install cbenchsuite:

git clone git://git.allfex.org/cbenchsuite.git
cd cbenchsuite
make

At this point a menuconfig dialog will start where you can configure your cbenchsuite instance. It is safe to exit the dialog directly and store the default configuration; you can change it afterwards.

make

This second make call compiles cbenchsuite itself based on the configuration file '.config'. Afterwards you can use cbenchsuite as described in the README file.


Linux 3.6 – 3.9 CPU scheduler performance

This is the first big series of benchmarks with the high-accuracy cbenchsuite. The focus of these benchmarks was the performance of the Linux CPU scheduler across different kernel versions. I used two different systems: an old AMD dual core and a newer, slightly overclocked Core i7-2600 quad core. The systems were benchmarked with kernels 3.6, 3.7, 3.8 and 3.9-rc4. This post presents a few interesting results, but it is not a summary and draws no conclusion; it is really about releasing the raw data of my measurements.

Measurements

  • 10 different benchmarks, e.g. kernel compile, hackbench, fork-bench, …
  • Many more benchmark configurations, e.g. the number of threads used
  • 4 monitors used to record memory, scheduler and other statistics
  • Kernel caches and swap were reset before each run for higher accuracy

Results

Let’s start with some hackbench results. Hackbench spawns a number of groups of processes which communicate within the group. The runtime is shown in the following two charts.


Dual core system Hackbench results.

You can see that Linux 3.7.0 was actually the fastest kernel overall in this benchmark. With pipes, the latest kernel, 3.9, is on the same poor level as 3.6. Without pipes, 3.9 does better, but it is still not the fastest in this benchmark on a dual core.


Hackbench on a quad-core.

The results are similar to those on the dual core system. Unfortunately this is not the best benchmark for 3.9, while 3.7 is still very good.

Continuing with the standard kernel compile benchmark.


Kernel compile benchmark on dual and quad core.

In this benchmark, 3.9 outperforms the other kernels significantly on the dual core, but no performance difference is visible on the quad core system. I couldn't find any explanation for this difference in the monitor results. However, I found an interesting difference in the number of context switches on the dual core system.


Number of context switches while compiling the kernel with 1 thread.

You can see that the kernels have different task switching behavior, especially on the dual core system. But if you compare those differences with the performance differences above, no significant correlation is visible.

Result links

The following links will point you to the generated websites with a lot of results. There are three complete sets of visualized results available, all on the same dataset. The first contains both systems, the second shows just the dual core, and the third only the quad core. All generated HTML files require JavaScript.


cbenchsuite – High accuracy C benchmark suite framework

cbenchsuite is a benchmark framework written in C and Python for easy, high quality system performance measurements and visualizations. It was specifically designed and tested for Linux. The first set of plugins included in the benchmark suite is aimed at measuring the performance of the CPU/Linux CPU scheduler.

This is the first release (0.1). After a lot of testing and development, it should be stable enough. If you find bugs, please contact me.

Download, Documentation and more Information


Features

  • Complete framework
    • Benchmarking systems
    • Storage in sqlite3 databases
    • Automatically generate figures and websites for the results
  • Efficient benchmark framework
    • Written in C
    • Unloads unnecessary modules before benchmarking
  • High accuracy
    • Efficient framework implementation
    • Setting scheduling priorities to get accurate monitor information
    • Warmup benchmark executions to exclude cache warmup effects from the measurements
    • Data is stored with the number of preceding benchmark executions to be able to determine the cache hotness after the benchmarks
    • Repeat benchmarks until the standard error drops below a given threshold
    • Plugins to reset Linux kernel caches/swap etc.
    • Store system information and plugin component versions to separate different benchmark environments
  • Extendable
    • Extensive module/plugin/benchsuite interface that can be used without changing any core source files
    • Storage backend interface
  • Configurable
    • Many configuration options like minimum/maximum runs/runtime per benchmark
    • Execute combinations of plugins defined via command line arguments or as a separate benchsuite inside a module
    • Plugins can have options and different versions
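One of the accuracy features listed above, repeating a benchmark until the standard error drops below a given threshold, can be sketched as follows (a minimal Python illustration, not cbenchsuite's actual C implementation; the 1% threshold and the minimum of 5 runs are assumed values):

```python
import math

def relative_standard_error(samples):
    """Standard error of the mean, divided by the mean."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    return math.sqrt(var / n) / mean

def needs_more_runs(samples, threshold=0.01, min_runs=5):
    """Decide whether the benchmark should be repeated: always run at
    least `min_runs` times, then continue until the relative standard
    error drops below `threshold`."""
    if len(samples) < min_runs:
        return True
    return relative_standard_error(samples) > threshold
```

In the real suite this loop is additionally bounded by minimum/maximum runs and runtime per benchmark, as the configuration options above mention.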

Example plots


Hackbench stacked bar chart.


Kernel compile memory monitor recording the amount of cached data.



How to DIY 42 watt LED room light

Last week, one of my three halogen light bulbs broke. So I had the choice of replacing that one bulb or looking for a different, possibly better way to light my room. I really like LEDs, so I looked for ideas to light a complete room with them.

My goal was to reach at least 3000 lumens. I found some nice high-power LED strips with a high CRI of 90. Unfortunately they need heatsinks, which I had never used before in any project. In the following I will give you a complete parts list and a rough assembly guide with some hints. But first, some pictures of the finished LEDs with heatsinks. As you can see, I am really not very good at combining the parts, but it works, and that's the important thing ;).

You need basic soldering skills to connect wires to the LED strips.


LEDs with heatsinks. ~21W per module.


LEDs in action. You should never look directly into the LEDs; even the camera has problems with the amount of light ;)

The whole system has a total power consumption of 42W, and the heatsinks heat up to 45°C. Theoretically it produces about 2800 lumens, but I can't measure that. It is bright enough to light my whole room; my previous lamp had two 50W halogen bulbs and was much darker.



Partial sorted list scheduler – pslist

pslist is a fair O(1) Linux CPU scheduler with a very simple design. The concept of a partially sorted list only works for scheduling problems; I will explain the data structure later in this post, but first some benchmark results.

Benchmark Results

The results were measured on a dual core AMD Athlon X2 5000+ at 2.6GHz with Cool'n'Quiet disabled. CFS and pslist were measured based on Linux 3.6, without stable updates (more details). All plots include 0.95 confidence intervals. All measurements were repeated at least 5 times and for at least 10 minutes. Depending on the relative standard errors of the results, the measurements were repeated up to a maximum runtime of 60 minutes per benchmark. If you have questions about the setup, ask in the comments.

Here are some benchmarks with significant differences between the two schedulers. We begin with the influence of the complexity classes: pslist is O(1), CFS is O(log n). The yield benchmark tries to make as many sched_yield calls as possible in a limited amount of time. The calls are executed in parallel by the given number of threads. The plot shows the number of iterations reached; more is better.
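The yield loop described above can be sketched like this (a minimal Python illustration; the actual benchmark plugin is written in C and runs one such loop per thread):

```python
import os
import time

def yield_loop(duration=0.2):
    """Call sched_yield() as often as possible for `duration` seconds
    and return the number of iterations reached (more is better)."""
    count = 0
    end = time.monotonic() + duration
    while time.monotonic() < end:
        os.sched_yield()  # enter the scheduler and yield the CPU
        count += 1
    return count
```

Every iteration enters the scheduler's pick-next path, so the iteration count directly exposes the cost difference between an O(1) list and an O(log n) red-black tree.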

You can clearly see the advantage of pslist, which has very low overhead: list operations are much cheaper than red-black tree operations. Especially with an increasing number of threads, pslist reaches significantly more iterations than CFS.

The same complexity class influence is visible in the hackbench benchmark. With 400 processes, pslist again beats CFS. This plot shows the runtime needed to complete the benchmark, so less is better.



egpms_ctl – Gembird EnerGenie kernel module and control tool

The EG-PMS is a USB-programmable power outlet. This kernel module adds reliable Linux support for all devices in this family, including the older devices known as SiS-PM.

As you probably know, there is already a project supporting these USB devices, called sispmctl; they did a great job reverse engineering the USB protocol. Unfortunately, the device itself isn't very reliable: it may reconnect randomly or return a wrong serial number, so addressing a specific device may become impossible. That is the reason for porting the protocol into a kernel module, which is much more reliable and can address devices by USB IDs.

Download and documentation at github.

Requirements

  • Python 3
  • Linux 3.3 or above

Thanks

Thanks to the sispmctl project for their tool and for the USB protocol, which I used for the kernel module.


Linux High Throughput Fair Scheduler

HTFS was intended as a Completely Fair Scheduler (CFS) replacement for servers; I developed it for my bachelor thesis. The idea was to have a fair scheduler with a complexity class of O(1). The design supports most of the kernel scheduler features, including task groups. This post releases my bachelor thesis, all measurements and the HTFS kernel patch. I am no longer developing this scheduler, because reducing calculation and context switch overheads is more important than reducing the complexity class.

Abstract

The server market is continuously growing to fulfill the demands of cloud computing, internet-related servers such as HTTP or email servers, high throughput computing and much more. To reach the highest possible resource utilization, modern operating system kernels are highly optimized. This is also the case for the Linux CPU scheduler. But especially for servers, the Completely Fair Scheduler has some performance flaws.

In this bachelor thesis a new CPU scheduler design is proposed. The High Throughput Fair Scheduler (HTFS) is a multi-queue design which is able to fulfill O(1) limitations. To assure fairness to all tasks, this classical queue design is extended with virtual runtimes. Through non-strict fairness, HTFS can work with fewer task switches, which results in higher throughput. HTFS, aimed at high scheduling speed, fairness and throughput, is able to compete with the Linux 2.6.38 CPU scheduler.

Material

Thanks to Matthias Grawinkel, Tobias Beisel and Andre Brinkmann for supporting me.
