		Keeping data small

When many applets are compiled into busybox, all rw data and
bss for each applet are concatenated, including those from libc
if a static busybox is built. When busybox is started, _all_ this data
is allocated, not just the part for the selected applet.

What "allocated" means exactly depends on the arch.
On NOMMU it probably bites the most, actually using real
RAM for rwdata and bss. On i386, bss is lazily allocated
by COWed zero pages. Not sure about rwdata - also COW?

In order to keep busybox friendly to NOMMU and small-mem systems,
we should avoid large global data in our applets, and should
minimize usage of libc functions which implicitly use
such structures.

Small experiment to measure "parasitic" bbox memory consumption:
here we start 1000 "busybox sleep 10" in parallel.
The busybox binary is practically an allyesconfig static one,
built against uclibc. Run on an x86-64 machine with a 64-bit kernel:

bash-3.2# nmeter '%t %c %m %p %[pn]'
23:17:28 ..........  168M    0   147
23:17:29 ..........  168M    0   147
23:17:30 U.........  168M    1   147
23:17:31 SU........  181M  244   391
23:17:32 SSSSUUU...  223M  757  1147
23:17:33 UUU.......  223M    0  1147
23:17:34 U.........  223M    1  1147
23:17:35 ..........  223M    0  1147
23:17:36 ..........  223M    0  1147
23:17:37 S.........  223M    0  1147
23:17:38 ..........  223M    1  1147
23:17:39 ..........  223M    0  1147
23:17:40 ..........  223M    0  1147
23:17:41 ..........  210M    0   906
23:17:42 ..........  168M    1   147
23:17:43 ..........  168M    0   147

This requires 55M of memory. Thus 1 trivial busybox applet
takes 55k of memory on a 64-bit x86 kernel.

On a 32-bit kernel we need ~26k per applet.

Script:

i=1000; while test $i != 0; do
	echo -n .
	busybox sleep 30 &
	i=$((i - 1))
done
echo
wait

(Data from NOMMU arches are sought.
Provide 'size busybox' output too)


		Example 1

One example of how to reduce global data usage is in
archival/libarchive/decompress_gunzip.c:

/* This is somewhat complex-looking arrangement, but it allows
 * to place decompressor state either in bss or in
 * malloc'ed space simply by changing #defines below.
 * Sizes on i386:
 * text    data     bss     dec     hex
 * 5256       0     108    5364    14f4 - bss
 * 4915       0       0    4915    1333 - malloc
 */
#define STATE_IN_BSS 0
#define STATE_IN_MALLOC 1

(see the rest of the file to get the idea)

This example completely eliminates globals in that module.
Required memory is allocated in unpack_gz_stream() [its main module]
and then passed down as a parameter to all subroutines which need
to access 'globals'.


		Example 2

In case you don't want to pass this additional parameter everywhere,
take a look at archival/gzip.c. Here all global data is replaced by
a single global pointer (ptr_to_globals) to allocated storage.

In order to not duplicate ptr_to_globals in every applet, you can
reuse a single common one. It is defined in libbb/ptr_to_globals.c
as struct globals *const ptr_to_globals, but the struct globals is
NOT defined in libbb.h. You first define your own struct:

struct globals { int a; char buf[1000]; };

and then declare that ptr_to_globals is a pointer to it:

#define G (*ptr_to_globals)

ptr_to_globals is declared as a constant pointer.
This helps gcc understand that it won't change, resulting in noticeably
smaller code. In order to assign it, use the SET_PTR_TO_GLOBALS macro:

	SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));

Typically this is done in <applet>_main(). Another variation is
to use the stack:

int <applet>_main(...)
{
#undef G
	struct globals G;
	memset(&G, 0, sizeof(G));
	SET_PTR_TO_GLOBALS(&G);

Now you can reference "globals" by G.a, G.buf and so on, in any function.
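The 'G.' arrangement from Example 2 can be tried outside the busybox
tree. The following is only a minimal sketch under stated assumptions:
every sketch_* name is hypothetical, calloc() stands in for xzalloc(),
and a plain writable pointer replaces the const ptr_to_globals and its
SET_PTR_TO_GLOBALS macro, so it does not reproduce the code size benefit
of the const pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Applet-private state, same shape as the struct in the text. */
struct globals {
	int a;
	char buf[1000];
};

/* Hypothetical stand-in for the shared pointer from libbb.
 * The real ptr_to_globals is const and is assigned through
 * SET_PTR_TO_GLOBALS; a writable pointer keeps this sketch portable. */
static struct globals *sketch_ptr_to_globals;

#define G (*sketch_ptr_to_globals)

/* Hypothetical helper: allocate zeroed state (calloc instead of xzalloc). */
static void sketch_init_globals(void)
{
	sketch_ptr_to_globals = calloc(1, sizeof(G));
	assert(sketch_ptr_to_globals != NULL);
}

/* Any function can use G.a and G.buf without an extra parameter. */
static int sketch_record(const char *msg)
{
	strncpy(G.buf, msg, sizeof(G.buf) - 1);
	G.buf[sizeof(G.buf) - 1] = '\0';
	return ++G.a;
}

/* Hypothetical helper: release the state again. */
static void sketch_free_globals(void)
{
	free(sketch_ptr_to_globals);
	sketch_ptr_to_globals = NULL;
}
```

Call sketch_init_globals() once at the start of the applet's main
function, before any code touches G.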


		bb_common_bufsiz1

There is one big common buffer in bss - bb_common_bufsiz1. It is a much
earlier mechanism to reduce bss usage. Each applet can use it for
its needs. Library functions are prohibited from using it.

The 'G.' trick can be done using bb_common_bufsiz1 instead of a malloced
buffer:

#define G (*(struct globals*)&bb_common_bufsiz1)

Be careful, though, and use it only if globals fit into bb_common_bufsiz1.
Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and BUFSIZ can change
from one libc to another, you have to add a compile-time check for it:

if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
	BUG_<applet>_globals_too_big();


		Drawbacks

You have to initialize it by hand. xzalloc() can be helpful in clearing
allocated storage to 0, but anything more must be done by hand.

All global variables are prefixed by 'G.' now. If this makes code
less readable, use #defines:

#define dev_fd (G.dev_fd)
#define sector (G.sector)


		Finding non-shared duplicated strings

strings busybox | sort | uniq -c | sort -nr


		gcc's data alignment problem

Adding the following attribute in vi.c:

static int tabstop;
static struct termios term_orig __attribute__ ((aligned (4)));
static struct termios term_vi __attribute__ ((aligned (4)));

reduces bss size by 32 bytes, because gcc sometimes aligns structures to
ridiculously large values. asm output diff for the above example:

 tabstop:
	.zero	4
	.section	.bss.term_orig,"aw",@nobits
-	.align 32
+	.align 4
	.type	term_orig, @object
	.size	term_orig, 60
 term_orig:
	.zero	60
	.section	.bss.term_vi,"aw",@nobits
-	.align 32
+	.align 4
	.type	term_vi, @object
	.size	term_vi, 60

gcc doesn't seem to have options for altering this behaviour.
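The effect of the attribute can be checked on a standalone file. This is
a rough sketch, not code from vi.c: the sketch_* names are hypothetical,
and the 60-byte struct merely mimics the .size values in the diff above.
What alignment gcc picks for the unannotated object may differ by gcc
version and target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 60-byte object, matching the .size values in the diff. */
struct sketch_state {
	char raw[60];
};

/* No attribute: gcc is free to pick a larger alignment for this object. */
struct sketch_state sketch_default;

/* Alignment requested explicitly, as done for term_orig and term_vi. */
struct sketch_state sketch_capped __attribute__ ((aligned (4)));

/* Return the address of an object modulo `boundary`. */
static unsigned sketch_misalignment(const void *p, unsigned boundary)
{
	return (unsigned)((uintptr_t)p % boundary);
}
```

Compiling this file with "gcc -S" and comparing the .align directives
emitted for the two objects shows whether the toolchain in use
behaves as described.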

gcc 3.4.3 and 4.1.1 tested:
char c = 1;
// gcc aligns to 32 bytes if sizeof(struct) >= 32
struct {
	int a,b,c,d;
	int i1,i2,i3;
} s28 = { 1 };    // struct will be aligned to 4 bytes
struct {
	int a,b,c,d;
	int i1,i2,i3,i4;
} s32 = { 1 };    // struct will be aligned to 32 bytes
// same for arrays
char vc31[31] = { 1 }; // unaligned
char vc32[32] = { 1 }; // aligned to 32 bytes

-fpack-struct=1 reduces alignment of s28 to 1 (but probably
will break layout of many libc structs) but s32 and vc32
are still aligned to 32 bytes.

I will try to cook up a patch to add a gcc option for disabling it.
Meanwhile, this is where it can be disabled in gcc source:

gcc/config/i386/i386.c
int
ix86_data_alignment (tree type, int align)
{
#if 0
  if (AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)
       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
       && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
           || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
    return 256;
#endif

Result (non-static busybox built against glibc):

# size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
   text    data     bss     dec     hex filename
 634416    2736   23856  661008   a1610 busybox
 632580    2672   22944  658196   a0b14 busybox_noalign



		Keeping code small

Use scripts/bloat-o-meter to check that introduced changes
did not generate unnecessary bloat. This script needs unstripped binaries
to generate a detailed report. To automate this, just use
"make bloatcheck". It requires a busybox_old binary to be present;
use "make baseline" to generate it from unmodified source, or
copy busybox_unstripped to busybox_old before modifying sources
and rebuilding.

Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
run "make bloatcheck", and see the biggest auto-inlined functions.
Now, set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
to some of these functions.
In 1.16.x timeframe, the results were
(annotated "make bloatcheck" output):

function             old     new   delta
expand_vars_to_list    -    1712   +1712 win
lzo1x_optimize         -    1429   +1429 win
arith_apply            -    1326   +1326 win
read_interfaces        -    1163   +1163 loss, leave w/o NOINLINE
logdir_open            -    1148   +1148 win
check_deps             -    1148   +1148 loss
rewrite                -    1039   +1039 win
run_pipe             358    1396   +1038 win
write_status_file      -    1029   +1029 almost the same, leave w/o NOINLINE
dump_identity          -     987    +987 win
mainQSort3             -     921    +921 win
parse_one_line         -     916    +916 loss
summarize              -     897    +897 almost the same
do_shm                 -     884    +884 win
cpio_o                 -     863    +863 win
subCommand             -     841    +841 loss
receive                -     834    +834 loss

855 bytes saved in total.

scripts/mkdiff_obj_bloat may be useful to automate this process: run
"scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
and select modules which shrank.
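
For reference, marking a single-caller function NOINLINE looks roughly
like this. It is a minimal sketch with hypothetical sketch_* functions;
the local NOINLINE definition is an assumption made so the file compiles
on its own with gcc, since inside the busybox tree the macro comes from
the project headers. Whether a given function is a "win" or a "loss"
still has to be measured with "make bloatcheck".

```c
#include <assert.h>

/* Assumption: standalone fallback for busybox's NOINLINE macro. */
#ifndef NOINLINE
# define NOINLINE __attribute__((__noinline__))
#endif

/* Hypothetical helper with a single call site. Without NOINLINE,
 * gcc may inline it into its only caller. */
static NOINLINE int sketch_sum_digits(unsigned n)
{
	int sum = 0;

	while (n != 0) {
		sum += (int)(n % 10);
		n /= 10;
	}
	return sum;
}

/* Hypothetical caller, the only user of the helper. */
static int sketch_checksum(unsigned n)
{
	return sketch_sum_digits(n) % 9;
}
```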