Coding 2015

Wednesday, November 23, 2016

Kesalahan Statistik

http://www.cs.cornell.edu/~asampson/blog/statsmistakes.html

Computer scientists in systemsy fields, myself included, aren’t great at using statistics. Maybe it’s because there are so many other potential problems with empirical evaluations that solid statistical reasoning doesn’t seem that important. Other subfields, like HCI and machine learning, have much higher standards for data analysis. Let’s learn from their example.
Here are three kinds of avoidable statistics mistakes that I notice in published papers.

No Statistics at All

The most common blunder is not using statistics at all when your paper clearly uses statistical data. If your paper uses the phrase “we report the average time over 20 runs of the algorithm,” for example, you should probably use statistics.
Here are two easy things that every paper should do when it deals with performance data or anything else that can randomly vary:
First, plot the error bars. In every figure that represents an average, compute the standard error of the mean or just the plain old standard deviation and add little whiskers to each bar. Explain what the error bars mean in the caption.
(a) Just noise. (b) Meaningful results. (c) Who knows???

(a) Just noise. (b) Meaningful results. (c) Who knows???

Second, do a simple statistical test. If you ever say “our system’s average running time is X seconds, which is less than the baseline running time of Y seconds,” you need show that the difference is statistically significant. Statistical significance tells the reader that the difference you found was more than just “in the noise.”
For most CS papers I read, a really basic test will work: Student’s

t

-test checks that two averages that look different actually are different. The process is easy. Collect some

N

samples from the two conditions, compute the mean

\overline{X}

and the standard deviation

s

for each, and plug them into this formula:

t = \frac{ \overline{X}_1 - \overline{X}_2 } { \sqrt{ \frac{s_1^2}{N_1} + \frac{s_2^2}{N_2} } }

then plug that

t

into the cumulative distribution function of the

t

-distribution to get a

p

-value. If your

p

-value is below a threshold

\alpha

that you chose ahead of time (0.05 or 0.01, say), then you have a statistically significant difference. Your favorite numerical library probably already has an implementation that does all the work for you.
If you’ve taken even an intro stats course, you know all this already! But you might be surprised to learn how many computer scientists don’t. Program committees don’t require that papers use solid statistics, so the literature is full of statistics-free but otherwise-good papers, so standards remain low, and Prof. Ouroboros keeps drawing figures without error bars. Other fields are moving beyond the

p

-value, and CS isn’t even there yet.

Failure to Reject = Confirmation

When you do use a statistical test in a paper, you need to interpret its results correctly. When your test produces a

p

-value, here are the correct interpretations:

If $p < \alpha$ : The difference between our average running time and the baseline’s average running time is statistically significant. Pedantically, we reject the null hypothesis that says that the averages might be the same.
Otherwise, if $p \ge \alpha$ : We conclude nothing at all. Pedantically, we fail to reject that null hypothesis.

It’s tempting to think, when

p \ge \alpha

, that you’ve found the opposite thing from the

p < \alpha

case: that you get to conclude that there is no statistically significant difference between the two averages. Don’t do that!
Simple statistical tests like the

t

-test only tell you when averages are different; they can’t tell you when they’re the same. When they fail to find a difference, there are two possible explanations: either there is no difference or you haven’t collected enough data yet. So when a test fails, it could be your fault: if you had run a slightly larger experiment with a slightly larger

N

, the test might have successfully found the difference. It’s always wrong to conclude that the difference does not exist.
If you want to claim that two means are equal, you’ll need to use a different test where the null hypothesis says that they differ by at least a certain amount. For example, an appropriate one-tailed

t

-test will do.

The Multiple Comparisons Problem

In most ordinary evaluation sections, it’s probably enough to use only a handful of statistical tests to draw one or two bottom-line conclusions. But you might find yourself automatically running an unbounded number of comparisons. Perhaps you have

n

benchmarks, and you want to compare the running time on each one to a corresponding baseline with a separate statistical test. Or maybe your system works in a feedback loop: it tries one strategy, performs a statistical test to check whether the strategy worked, and starts over with a new strategy otherwise.
Repeated statistical tests can get you into trouble. The problem is that every statistical test has a probability of lying to you. The probability that any single test is wrong is small, but if you do lots of test, the probability amplifies quickly.
For example, say you choose

\alpha = 0.05

and run one

t

-test. When the test succeeds—when it finds a significant difference—it’s telling you that there’s at most an

\alpha

chance that the difference arose from random chance. In 95 out of 100 parallel universes, your paper found a difference that actually exists. I’d take that bet.
Now, say you run a series of

n

tests in the scope of one paper. Then every test has an

\alpha

chance of going wrong. The chances that your paper has more than

k

errors in it is given by the binomial distribution:

1 - \sum_{i=0}^{k} {n \choose i} \alpha^i (1-\alpha)^{n-i}

which grows exponentially with the number of tests,

n

. If you use just 10 tests with

\alpha = 0.05

, for example, your chance of having one test go wrong grows to 40%. If you do 100, the probability is above 99%. At that point, it’s a near certainty that your paper is misreporting some result.
(To compute these probabilities yourself, set

k = 0

so you get the chance of at least one error. Then the CDF above simplifies down to

1 - (1 - \alpha) ^ n

.)
This pitfall is called the multiple comparisons problem. If you really need to run lots of tests, all is not lost: there are standard ways to compensate for the increased chance of error. The simplest is the Bonferroni correction, where you reduce your per-test

\alpha

\frac{\alpha}{n}

to preserve an overall

\alpha

chance of going wrong.

Thursday, August 18, 2016

Using Drupal's autocomplete on a custom form element

http://www.jochenhebbrecht.be/site/2011-01-10/drupal/using-drupals-autocomplete-a-custom-form-element

Did you ever created a custom form? A place where you couldn't use the Form API of Drupal?
Imagine this form, and you want to create an autocomplete feature on the nickname. The moment you start typing a nickname, a list of usernames get in the textfield.

<form action="/foo">
   ...
   <input type="text" value="Nickname"></input>
   <input type="submit" value="Submit"></input>
</form>

Step 1: adjust the HTML code

Start from a normal text field

<input id="txt-existing-bidder" type="text"></input>

Add attributes to normal text field

<input id="txt-nickname" type="text" class="form-text form-autocomplete text"
       autocomplete="OFF"></input>

Add a hidden value to capture the autocomplete values. Put this field immediately after the textfield

<input class="autocomplete" type="hidden" id="txt-nickname-autocomplete"
       value="/##path_to_autocomplete##"/autocomplete" disabled="disabled" />

Step 2: adjust the Javascript code

You need the following JS files to have a fully working version:

drupal_add_js("misc/autocomplete.js");
drupal_add_js("misc/ahah.js");

If the HTML (see above) is added by AJAX, Drupal's autocomplete will not discover the autocomplete fields. Therefore, execute following code after loading the AJAX call
```
Drupal.attachBehaviors($("##CONTEXT##"));
```
##CONTEXT##: reference to a div which hold the input

Step 3: create a autocomplete PHP function which returns a JSON object

/**
 * Searches all auction users and returns a JSON object of all users
 * @param $search the value to be searched
 * @return JSON object with all users
 */
function autocomplete_users($search = '') {
        // TODO: implement search function here
 
        // DEBUG
 $users['admin {uid:1}'] = 'admin';
 $users['jochen {uid:2}'] = 'jochen';
 
 return drupal_json($users);
}

Step 4: the result

Saturday, August 6, 2016

A x64 OS #1: UEFI

http://kazlauskas.me/entries/x64-uefi-os-1.html

As a part of the OS project for the university there has been a request to also write up the experiences and challenges encountered. This is the first post of the series on writing a x64 operating system when booting straight from UEFI. Please keep in mind that these posts are written by a not-even-hobbyist and content in these posts should be taken with a grain of salt.

Kernel and UEFI

I’ve decided to write a kernel targeting x64 as a UEFI application. There is a number of appeals to write a kernel as UEFI application as opposed to writing a multiboot kernel. Namely:

For x86 family of processors, you avoid the work necessary to upgrade from real mode to protected mode and then from protected mode to long mode which is more commonly known as 64-bit mode. As a UEFI application your kernel gets a fully working x64 environment from a get-go;
Unlike BIOS, UEFI is a well documented firmware. Most of the interfaces provided by BIOS are de facto and you’re lucky if they work at all, while most of these provided by UEFI are de jure and usually just work;
UEFI is extensible, whereas BIOS is not really;
Finally, UEFI is the modern technology which is to stay around, while BIOS is a 40 years old piece of technology on death row. Learning about soon-to-be-dead technology is waste of the effort.

Despite my strong attachment to the Rust community and Rust’s perfect suitability for kernels¹, I’ll be writing the kernel in C. Mostly because it is unlikely people inside the university will be familiar with Rust, but also because GNU-EFI is a C library and I cannot be bothered to bind it. I’d surely be writing it in Rust were I more serious about the project.

Toolchain

As it turns out, developing a x64 kernel on a x64 host greatly simplifies setting up the build tool-chain. I’ll be using:

clang to compile the C code (no cross-compiler is necessary²!);
gnu-efi library to interact with the UEFI firmware;
qemu emulator to run my kernel; and
OVMF as the UEFI firmware.

The UEFI “Hello, world!”

The following snippet of code is all the code you need to print something on the screen as an UEFI application:

// main.c
#include <efi.h>
#include <efilib.h>
#include <efiprot.h>

EFI_STATUS
efi_main (EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
{
    InitializeLib(ImageHandle, SystemTable);
    Print(L"Hello, world from x64!");
    for(;;) __asm__("hlt");
}

However, compiling this code correctly is not as trivial. Following three commands are necessary to produce a working UEFI application:

clang -I/usr/include/efi -I/usr/include/efi/x86_64 -I/usr/include/efi/protocol -fno-stack-protector -fpic -fshort-wchar -mno-red-zone -DHAVE_USE_MS_ABI -c -o src/main.o src/main.c
ld -nostdlib -znocombreloc -T /usr/lib/elf_x86_64_efi.lds -shared -Bsymbolic -L /usr/lib /usr/lib/crt0-efi-x86_64.o src/main.o -o huehuehuehuehue.so -lefi -lgnuefi
objcopy -j .text -j .sdata -j .data -j .dynamic -j .dynsym  -j .rel -j .rela -j .reloc --target=efi-app-x86_64 huehuehuehuehue.so huehuehuehuehue.efi

The clang command is pretty self-explanatory: we tell the compiler where to look for the EFI headers and what to compile into a object file. Probably the most non-trivial option here is the -DHAVE_USE_MS_ABI one – x64 UEFI uses the Windows’ x64 calling convention, and not the regular C one, thus all arguments in calls to UEFI functions must be passed in a different way than it is usually done in C code. Historically this conversion was done by the uefi_call_wrapper wrapper, but clang supports the calling convention natively, and we tell that fact to the gnu-efi library with this option³.
Then, I manually link my object file and UEFI-specific C runtime up into a shared library using a custom linker script provided by the gnu-efi library. The result is an ELF library about 250KB in size. However, UEFI expects its applications in PE executable format, so we must convert our library into the desired format with the objcopy command. At this point huehuehuehuehue.efi file should be produced and majority of UEFI firmwares should be able to run it.
In practice, I’ve automated these steps along with a considerably complex sequence of building image files I’ve stolen from OSDEV’s tutorial on creating images into a Makefile. Feel free to copy it in parts or in whole for your own use cases.

UEFI boot and runtime services

A UEFI application has 2 distinct stages over its lifetime: a stage where so-called boot services are available and stage after these boot services are disabled. An UEFI application will be launched by the UEFI firmware and both boot and runtime services will be available to the application. Most notably, boot services provide APIs for loading other UEFI applications (e.g. implementing bootloaders), handling (allocating and deallocating) memory and using protocols (speaking to other active UEFI applications).
Once the kernel is done with using boot services it calls ExitBootServices which is a method provided by… a boot service. Past that point only runtime services are available and you cannot ever return to a state where boot services are available except by resetting the system. Managing UEFI variables, system clock and resetting the system is pretty much the only things you can do with the runtime services.
For my kernel, I will use the graphics output protocol to set up the video frame buffer, exit the boot services and, finally, shut down the machine before reaching the hlt instruction. Following piece of code implements the described sequence. I left some code out, you can see it in full at Gitlab. For example, the definition of init_graphics.

EFI_STATUS
efi_main (EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
{
    EFI_STATUS status;
    InitializeLib(ImageHandle, SystemTable);

    // Initialize graphics
    EFI_GRAPHICS_OUTPUT_PROTOCOL *graphics;
    EFI_GUID graphics_proto = EFI_GRAPHICS_OUTPUT_PROTOCOL_GUID;
    status = SystemTable->BootServices->LocateProtocol(&graphics_proto, NULL, (void **)&graphics);
    if(status != EFI_SUCCESS) return status;
    status = init_graphics(graphics);
    if(status != EFI_SUCCESS) return status;

    // Figure out the memory map (should be identity mapping)
    boot_state.memory_map = LibMemoryMap(&boot_state.memory_map_size,
                                         &boot_state.map_key,
                                         &boot_state.descriptor_size,
                                         &boot_state.descriptor_version);
    // Exit the boot services...
    SystemTable->BootServices->ExitBootServices(ImageHandle, boot_state.map_key);
    // and set up the memory map we just found.
    SystemTable->RuntimeServices->SetVirtualAddressMap(boot_state.memory_map_size,
                                                       boot_state.descriptor_size,
                                                       boot_state.descriptor_version,
                                                       boot_state.memory_map);
    // Once we’re done we power off the machine.
    SystemTable->RuntimeServices->ResetSystem(EfiResetShutdown, EFI_SUCCESS, 0, NULL);
    for(;;) __asm__("hlt");
}

Note, that some protocols can either be attached to your own EFI_HANDLE or some other EFI_HANDLE (i.e. protocol is provided by another UEFI application). Graphics output protocol I’m using here is an example of a protocol attached to another EFI_HANDLE, therefore we use LocateProtocol boot service to find it. In the off-chance the protocol is attached to application’s own EFI_HANDLE, the HandleProtocol method should be used instead:

EFI_LOADED_IMAGE *loaded_image = NULL;
EFI_GUID loaded_image_protocol = LOADED_IMAGE_PROTOCOL;
EFI_STATUS status = SystemTable->BootServices->HandleProtocol(ImageHandle, &loaded_image_protocol, &loaded_image);

Next steps

At this point I have a bare bones frame for my awesome kernel called “huehuehuehuehue”. From this point onwards the development of the kernel should not differ much from the traditional development of any other x64 kernel. Next, I’ll be implementing software and hardware interrupts; expect a post on that.

Tuesday, August 2, 2016

"Reverse Engineering for Beginners" free book

http://beginners.re/

Reverse Engineering challenges

Contrived by Dennis Yurichev (yurichev.com).
The website has been inspired by Project Euler and "the matasano crypto challenges".
http://challenges.re/#Solutions

The Matasano Crypto Challenges

https://blog.pinboard.in/2013/04/the_matasano_crypto_challenges/

Android: Collecting and Plotting Accelerometer Data

https://androidstream.wordpress.com/2013/01/16/android-collecting-and-plotting-accelerometer-data/

Loggin accelerometer from Android to PC
http://simena86.github.io/blog/2013/04/30/logging-accelerometer-from-android-to-pc/

Sunday, July 31, 2016

High frequency security bug hunting: 120 days, 120 bugs

https://shubs.io/high-frequency-security-bug-hunting-120-days-120-bugs/

Wednesday, November 23, 2016

Kesalahan Statistik

No Statistics at All

Failure to Reject = Confirmation

The Multiple Comparisons Problem

Thursday, August 18, 2016

Using Drupal's autocomplete on a custom form element

Step 1: adjust the HTML code

Step 2: adjust the Javascript code

Step 3: create a autocomplete PHP function which returns a JSON object

Step 4: the result

Saturday, August 6, 2016

A x64 OS #1: UEFI

Ker­nel and UEFI

Tool­chain

The UEFI “Hel­lo, world!”

UEFI boot and runtime ser­vices

Next steps

Tuesday, August 2, 2016

"Reverse Engineering for Beginners" free book

"Reverse Engineering for Beginners" free book

Reverse Engineering challenges

The Matasano Crypto Challenges

Android: Collecting and Plotting Accelerometer Data

Android: Collecting and Plotting Accelerometer Data

Sunday, July 31, 2016

High frequency security bug hunting: 120 days, 120 bugs

Kernel and UEFI

Toolchain

The UEFI “Hello, world!”

UEFI boot and runtime services