Maxalt
This is the MaxAlt optimization page.
Launchpad Entry: https://launchpad.net/distros/ubuntu/+spec/optimize-glibc-multi-core
Created: 2006-06-22
Contributors: MaxAlt
Packages affected:
Summary
GCC 4.2 has integrated Intel's patch that adds default optimizations for Conroe/Merom (Core 2 Duo) at -O2. The patch has been backported to GCC 4.1 and is attached to this page. When building, make sure the -m64 biarch patch is enabled with:
#define DRIVER_SELF_SPECS "%{m64:%{!mtune:-mtune=x86-64}}"
This has not been tested on a Bensley server, but on a Core 2 Duo laptop I get:
- with -O3, a 7% improvement
- with -O2, a 5% improvement for the static interpreter
- with -O2, a 15% improvement for the dynamically linked interpreter
Rationale
Use Cases
Scope
- Changes in glibc
Design
Resolve the overhead of the interception algorithm that chooses between the unchanged glibc/strncmp and the optimized strcmp, both when the call is inlined and when it is made as a regular function call.
strcmp will contain generic optimizations and will not be microarchitecture-specific. The code itself is single-threaded, so the shared cache architecture does not affect the optimizations directly. The proposed code would:
- take care of the alignment and length of the strings
- prefetch into cache if the data is reused or threaded
- use the correct optimizing compiler flags and intrinsics
- account for cache and cache line size
- use SSE/SSE2
- reduce branch mispredictions
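As an illustration of the SSE2 and string-length points above, here is a minimal sketch of a 16-bytes-at-a-time comparison. It is not the proposed glibc code. The function name sketch_strcmp_len is hypothetical, and it assumes the caller already knows both string lengths, so that every vector load stays inside the buffers. A real strcmp does not know the lengths and has to rely on alignment checks instead. On builds without SSE2 the sketch falls back to the plain byte loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, not a glibc interface.
 * Compares two NUL-terminated strings whose lengths (excluding the NUL)
 * are already known. Returns <0, 0 or >0 with the same sign as strcmp.
 */
int sketch_strcmp_len(const char *s1, size_t len1,
                      const char *s2, size_t len2)
{
    /* Bytes that are valid in both buffers, terminator included. */
    size_t n = (len1 < len2 ? len1 : len2) + 1;
    size_t i = 0;

#if defined(__SSE2__)
    /* Vector part: compare 16 bytes per iteration with unaligned loads. */
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(s1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(s2 + i));
        /* Bit k of eq is set when byte k is equal in both blocks. */
        unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
        if (eq != 0xFFFFu) {
            /* Index of the first differing byte (GCC/Clang builtin). */
            unsigned idx = (unsigned)__builtin_ctz(~eq & 0xFFFFu);
            return (unsigned char)s1[i + idx] - (unsigned char)s2[i + idx];
        }
    }
#endif

    /* Scalar tail: the remaining bytes, up to and including the NUL. */
    for (; i < n; i++) {
        unsigned char c1 = (unsigned char)s1[i];
        unsigned char c2 = (unsigned char)s2[i];
        if (c1 != c2)
            return c1 - c2;
    }
    return 0;
}
```

The vector loop has a single branch per 16 bytes, which is one way to reduce mispredictions compared with a per-byte loop. Prefetching and cache line handling are left out of the sketch.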
Run the new strcmp through the harness tests.
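The page does not say which harness is meant. As a rough sketch of what such a check could look like, the code below compares a candidate against the libc strcmp over a range of buffer offsets and string lengths. The names count_strcmp_mismatches and bytewise_strcmp are hypothetical, and bytewise_strcmp only stands in for the new implementation.

```c
#include <string.h>

typedef int (*strcmp_fn)(const char *, const char *);

static int sign_of(int v) { return (v > 0) - (v < 0); }

/* Stand-in candidate: a plain byte loop using unsigned comparison. */
int bytewise_strcmp(const char *s1, const char *s2)
{
    const unsigned char *p = (const unsigned char *)s1;
    const unsigned char *q = (const unsigned char *)s2;
    while (*p != 0 && *p == *q) {
        p++;
        q++;
    }
    return *p - *q;
}

/* Returns 1 when the candidate and libc strcmp disagree in sign. */
static int disagrees(strcmp_fn candidate, const char *s, const char *t)
{
    return sign_of(candidate(s, t)) != sign_of(strcmp(s, t));
}

/*
 * Hypothetical harness: tries every combination of start offset
 * (0..15 for each string) and length (0..40), first with equal strings,
 * then with one byte changed or the second string cut short.
 * Returns the number of disagreements with libc strcmp.
 */
int count_strcmp_mismatches(strcmp_fn candidate)
{
    enum { MAX_OFF = 16, MAX_LEN = 40 };
    static char a[MAX_OFF + MAX_LEN + 1];
    static char b[MAX_OFF + MAX_LEN + 1];
    int mismatches = 0;

    for (int off_a = 0; off_a < MAX_OFF; off_a++) {
        for (int off_b = 0; off_b < MAX_OFF; off_b++) {
            for (int len = 0; len <= MAX_LEN; len++) {
                char *s = a + off_a;
                char *t = b + off_b;
                for (int i = 0; i < len; i++)
                    s[i] = t[i] = (char)('a' + i % 26);
                s[len] = t[len] = '\0';

                mismatches += disagrees(candidate, s, t);

                for (int pos = 0; pos < len; pos++) {
                    char saved = t[pos];
                    /* High-bit byte: catches signed-char comparison bugs. */
                    t[pos] = (char)(saved ^ 0x80);
                    mismatches += disagrees(candidate, s, t);
                    /* Early terminator: t becomes a prefix of s. */
                    t[pos] = '\0';
                    mismatches += disagrees(candidate, s, t);
                    t[pos] = saved;
                }
            }
        }
    }
    return mismatches;
}
```

A result of 0 from count_strcmp_mismatches(bytewise_strcmp) means the candidate agreed with libc on every case tried. Only the sign is compared, because strcmp guarantees the sign of the result and not its magnitude.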
ld.so would benefit from the optimization as well: the optimized loader would be aware of the shared cache architecture and would prefetch shared strings into cache for multi-threaded comparison.
Implementation
Outstanding Issues
BoF agenda and discussion
Maxalt (last edited 2008-08-06 16:16:29 by localhost)