vpx_codec_dec_init: check that the iface is a decoder

Make sure the given interface is actually a decoder interface before initializing it.

Change-Id: Ie48d737f2956cc2f0891666de5ea87251e96bc49
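The check described in this commit message can be sketched roughly as follows. This is a minimal illustration only, not the code from the commit: every name here (`sketch_iface_t`, `sketch_ctx_t`, `sketch_dec_init`, the `SKETCH_*` constants) is hypothetical and stands in for the library's real types, which are not shown in this excerpt. The assumption is that an interface advertises its role through a capability bitmask, and that initialization is refused when the decoder bit is missing.

```c
#include <stddef.h>

/* Hypothetical capability bits; assumed for illustration only. */
#define SKETCH_CAP_DECODER 0x1UL
#define SKETCH_CAP_ENCODER 0x2UL

/* Hypothetical error codes returned by the init routine. */
typedef enum
{
    SKETCH_OK = 0,
    SKETCH_INVALID_PARAM,
    SKETCH_INCAPABLE
} sketch_err_t;

/* Hypothetical interface descriptor: a name plus a capability bitmask. */
typedef struct
{
    const char   *name;
    unsigned long caps;
} sketch_iface_t;

/* Hypothetical codec context; only remembers which interface it was
 * initialized with. */
typedef struct
{
    const sketch_iface_t *iface;
} sketch_ctx_t;

/* Initialize a decoder context, refusing interfaces that do not
 * advertise the decoder capability. */
sketch_err_t sketch_dec_init(sketch_ctx_t *ctx, const sketch_iface_t *iface)
{
    if (!ctx || !iface)
        return SKETCH_INVALID_PARAM;

    /* The point of the commit: reject anything that is not a decoder
     * before touching the context. */
    if (!(iface->caps & SKETCH_CAP_DECODER))
        return SKETCH_INCAPABLE;

    ctx->iface = iface;
    return SKETCH_OK;
}
```

With this shape, passing an encoder-only interface to the decoder init routine yields an error code and leaves the context untouched, instead of initializing a context that later calls would misuse.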
Remove unused vp8_get4x4sse_cs_mmx declaration
2011-03-24 15:05:10 +02:00 · 2011-03-24 15:05:09 +02:00
264 changed files with 13124 additions and 15084 deletions
--- a/.mailmap
+++ b/.mailmap
@@ -2,4 +2,3 @@ Adrian Grange <agrange@google.com>
 Johann Koenig <johannkoenig@google.com>
 Tero Rintaluoma <teror@google.com> <tero.rintaluoma@on2.com>
 Tom Finegan <tomfinegan@google.com>
-Ralph Giles <giles@xiph.org> <giles@entropywave.com>
--- a/12
+++ b/12
@@ -4,11 +4,8 @@
 Aaron Watry <awatry@gmail.com>
 Adrian Grange <agrange@google.com>
 Alex Converse <alex.converse@gmail.com>
-Alexis Ballier <aballier@gentoo.org>
-Alok Ahuja <waveletcoeff@gmail.com>
 Andoni Morales Alastruey <ylatuya@gmail.com>
 Andres Mejia <mcitadel@gmail.com>
-Aron Rosenberg <arosenberg@logitech.com>
 Attila Nagy <attilanagy@google.com>
 Fabio Pedretti <fabio.ped@libero.it>
 Frank Galligan <fgalligan@google.com>
@@ -25,29 +22,20 @@ Jeff Muizelaar <jmuizelaar@mozilla.com>
 Jim Bankoski <jimbankoski@google.com>
 Johann Koenig <johannkoenig@google.com>
 John Koleszar <jkoleszar@google.com>
-Joshua Bleecher Snyder <josh@treelinelabs.com>
 Justin Clift <justin@salasaga.org>
 Justin Lebar <justin.lebar@gmail.com>
-Lou Quillio <louquillio@google.com>
 Luca Barbato <lu_zero@gentoo.org>
 Makoto Kato <makoto.kt@gmail.com>
 Martin Ettl <ettl.martin78@googlemail.com>
 Michael Kohler <michaelkohler@live.com>
-Mike Hommey <mhommey@mozilla.com>
 Mikhal Shemer <mikhal@google.com>
 Pascal Massimino <pascal.massimino@gmail.com>
 Patrik Westin <patrik.westin@gmail.com>
 Paul Wilkins <paulwilkins@google.com>
 Pavol Rusnak <stick@gk2.sk>
 Philip Jägenstedt <philipj@opera.com>
-Rafael Ávila de Espíndola <rafael.espindola@gmail.com>
-Ralph Giles <giles@xiph.org>
-Ronald S. Bultje <rbultje@google.com>
 Scott LaVarnway <slavarnway@google.com>
-Stefan Holmer <holmer@google.com>
-Taekhyun Kim <takim@nvidia.com>
 Tero Rintaluoma <teror@google.com>
-Thijs Vermeir <thijsvermeir@gmail.com>
 Timothy B. Terriberry <tterribe@xiph.org>
 Tom Finegan <tomfinegan@google.com>
 Yaowu Xu <yaowu@google.com>
--- a/112
+++ b/112
@@ -1,115 +1,3 @@
-2011-08-15 v0.9.7-p1 "Cayuga" patch 1
-  This is an incremental bugfix release against Cayuga. All users of that
-  release are strongly encouraged to upgrade.
-
-    - Fix potential OOB reads (cdae03a)
-
-          An unbounded out of bounds read was discovered when the
-          decoder was requested to perform error concealment (new in
-          Cayuga) given a frame with corrupt partition sizes.
-
-          A bounded out of bounds read was discovered affecting all
-          versions of libvpx. Given an multipartition input frame that
-          is truncated between the mode/mv partition and the first
-          residiual paritition (in the block of partition offsets), up
-          to 3 extra bytes could have been read from the source buffer.
-          The code will not take any action regardless of the contents
-          of these undefined bytes, as the truncated buffer is detected
-          immediately following the read based on the calculated
-          starting position of the coefficient partition.
-
-    - Fix potential error concealment crash when the very first frame
-      is missing or corrupt (a609be5)
-
-    - Fix significant artifacts in error concealment (a4c2211, 99d870a)
-
-    - Revert 1-pass CBR rate control changes (e961317)
-      Further testing showed this change produced undesirable visual
-      artifacts, rolling back for now.
-
-
-2011-08-02 v0.9.7 "Cayuga"
-  Our third named release, focused on a faster, higher quality, encoder.
-
-  - Upgrading:
-    This release is backwards compatible with Aylesbury (v0.9.5) and
-    Bali (v0.9.6). Users of older releases should refer to the Upgrading
-    notes in this document for that release.
-
-  - Enhancements:
-          Stereo 3D format support for vpxenc
-          Runtime detection of available processor cores.
-          Allow specifying --end-usage by enum name
-          vpxdec: test for frame corruption
-          vpxenc: add quantizer histogram display
-          vpxenc: add rate histogram display
-          Set VPX_FRAME_IS_DROPPABLE
-          update configure for ios sdk 4.3
-          Avoid text relocations in ARM vp8 decoder
-          Generate a vpx.pc file for pkg-config.
-          New ways of passing encoded data between encoder and decoder.
-
-  - Speed:
-      This release includes across-the-board speed improvements to the
-      encoder. On x86, these measure at approximately 11.5% in Best mode,
-      21.5% in Good mode (speed 0), and 22.5% in Realtime mode (speed 6).
-      On ARM Cortex A9 with Neon extensions, real-time encoding of video
-      telephony content is 35% faster than Bali on single core and 48%
-      faster on multi-core. On the NVidia Tegra2 platform, real time
-      encoding is 40% faster than Bali.
-
-      Decoder speed was not a priority for this release, but improved
-      approximately 8.4% on x86.
-
-          Reduce motion vector search on alt-ref frame.
-          Encoder loopfilter running in its own thread
-          Reworked loopfilter to precalculate more parameters
-          SSE2/SSSE3 optimizations for build_predictors_mbuv{,_s}().
-          Make hor UV predict ~2x faster (73 vs 132 cycles) using SSSE3.
-          Removed redundant checks
-          Reduced structure sizes
-          utilize preload in ARMv6 MC/LPF/Copy routines
-          ARM optimized quantization, dfct, variance, subtract
-          Increase chrow row alignment to 16 bytes.
-          disable trellis optimization for first pass
-          Write SSSE3 sub-pixel filter function
-          Improve SSE2 half-pixel filter funtions
-          Add vp8_sub_pixel_variance16x8_ssse3 function
-          Reduce unnecessary distortion computation
-          Use diamond search to replace full search
-          Preload reference area in sub-pixel motion search (real-time mode)
-
-  - Quality:
-      This release focused primarily on one-pass use cases, including
-      video conferencing. Low latency data rate control was significantly
-      improved, improving streamability over bandwidth constrained links.
-      Added support for error concealment, allowing frames to maintain
-      visual quality in the presence of substantial packet loss.
-
-          Add rc_max_intra_bitrate_pct control
-          Limit size of initial keyframe in one-pass.
-          Improve framerate adaptation
-          Improved 1-pass CBR rate control
-          Improved KF insertion after fades to still.
-          Improved key frame detection.
-          Improved activity masking (lower PSNR impact for same SSIM boost)
-          Improved interaction between GF and ARFs
-          Adding error-concealment to the decoder.
-          Adding support for independent partitions
-          Adjusted rate-distortion constants
-
-
-  - Bug Fixes:
-          Removed firstpass motion map
-          Fix parallel make install
-          Fix multithreaded encoding for 1 MB wide frame
-          Fixed iwalsh_neon build problems with RVDS4.1
-          Fix semaphore emulation, spin-wait intrinsics on Windows
-          Fix build with xcode4 and simplify GLOBAL.
-          Mark ARM asm objects as allowing a non-executable stack.
-          Fix vpxenc encoding incorrect webm file header on big endian
-
-
 2011-03-07 v0.9.6 "Bali"
  Our second named release, focused on a faster, higher quality, encoder.

--- a/build/make/Makefile
+++ b/build/make/Makefile
@@ -82,8 +82,8 @@ qexec=$(if $(quiet),@)
 #
 # Common rules"
 #
-.PHONY: all
-all:
+.PHONY: all-$(target)
+all-$(target):

 .PHONY: clean
 clean::
@@ -98,11 +98,11 @@ install::
 $(BUILD_PFX)%.c.d: %.c
 	$(if $(quiet),@echo "    [DEP] $@")
 	$(qexec)mkdir -p $(dir $@)
-	$(qexec)$(CC) $(INTERNAL_CFLAGS) $(CFLAGS) -M $< | $(fmt_deps) > $@
+	$(qexec)$(CC) $(CFLAGS) -M $< | $(fmt_deps) > $@

 $(BUILD_PFX)%.c.o: %.c
 	$(if $(quiet),@echo "    [CC] $@")
-	$(qexec)$(CC) $(INTERNAL_CFLAGS) $(CFLAGS) -c -o $@ $<
+	$(qexec)$(CC) $(CFLAGS) -c -o $@ $<

 $(BUILD_PFX)%.asm.d: %.asm
 	$(if $(quiet),@echo "    [DEP] $@")
@@ -124,12 +124,6 @@ $(BUILD_PFX)%.s.o: %.s
 	$(if $(quiet),@echo "    [AS] $@")
 	$(qexec)$(AS) $(ASFLAGS) -o $@ $<

-.PRECIOUS: %.c.S
-%.c.S: CFLAGS += -DINLINE_ASM
-$(BUILD_PFX)%.c.S: %.c
-	$(if $(quiet),@echo "    [GEN] $@")
-	$(qexec)$(CC) -S $(CFLAGS) -o $@ $<
-
 .PRECIOUS: %.asm.s
 $(BUILD_PFX)%.asm.s: %.asm
 	$(if $(quiet),@echo "    [ASM CONVERSION] $@")
@@ -194,7 +188,7 @@ define linker_template
 $(1): $(filter-out -%,$(2))
 $(1):
 	$(if $(quiet),@echo    "    [LD] $$@")
-	$(qexec)$$(LD) $$(strip $$(INTERNAL_LDFLAGS) $$(LDFLAGS) -o $$@ $(2) $(3) $$(extralibs))
+	$(qexec)$$(LD) $$(strip $$(LDFLAGS) -o $$@ $(2) $(3) $$(extralibs))
 endef
 # make-3.80 has a bug with expanding large input strings to the eval function,
 # which was triggered in some cases by the following component of
@@ -336,10 +330,12 @@ ifneq ($(call enabled,DIST-SRCS),)
    DIST-SRCS-$(CONFIG_MSVS)  += build/make/gen_msvs_proj.sh
    DIST-SRCS-$(CONFIG_MSVS)  += build/make/gen_msvs_sln.sh
    DIST-SRCS-$(CONFIG_MSVS)  += build/x86-msvs/yasm.rules
-    DIST-SRCS-$(CONFIG_MSVS)  += build/x86-msvs/obj_int_extract.bat
    DIST-SRCS-$(CONFIG_RVCT) += build/make/armlink_adapter.sh
-    # Include obj_int_extract if we use offsets from asm_*_offsets
-    DIST-SRCS-$(ARCH_ARM)$(ARCH_X86)$(ARCH_X86_64)    += build/make/obj_int_extract.c
+    #
+    # This isn't really ARCH_ARM dependent, it's dependent on whether we're
+    # using assembly code or not (CONFIG_OPTIMIZATIONS maybe). Just use
+    # this for now.
+    DIST-SRCS-$(ARCH_ARM)    += build/make/obj_int_extract.c
    DIST-SRCS-$(ARCH_ARM)    += build/make/ads2gas.pl
    DIST-SRCS-yes            += $(target:-$(TOOLCHAIN)=).mk
 endif
@@ -359,6 +355,6 @@ ifeq ($(CONFIG_EXTERNAL_BUILD),yes)
 endif
 BUILD_TARGETS += .docs .libs .bins
 INSTALL_TARGETS += .install-docs .install-srcs .install-libs .install-bins
-all: $(BUILD_TARGETS)
+all-$(target): $(BUILD_TARGETS)
 install:: $(INSTALL_TARGETS)
 dist: $(INSTALL_TARGETS)
--- a/build/make/ads2gas.pl
+++ b/build/make/ads2gas.pl
@@ -21,14 +21,8 @@ print "@ This file was created from a .asm file\n";
 print "@  using the ads2gas.pl script.\n";
 print "\t.equ DO1STROUNDING, 0\n";

-# Stack of procedure names.
-@proc_stack = ();
-
 while (<STDIN>)
 {
-    # Load and store alignment
-    s/@/,:/g;
-
    # Comment character
    s/;/@/g;

@@ -85,10 +79,7 @@ while (<STDIN>)
    s/CODE([0-9][0-9])/.code $1/;

    # No AREA required
-    # But ALIGNs in AREA must be obeyed
-    s/^\s*AREA.*ALIGN=([0-9])$/.text\n.p2align $1/;
-    # If no ALIGN, strip the AREA and align to 4 bytes
-    s/^\s*AREA.*$/.text\n.p2align 2/;
+    s/^\s*AREA.*$/.text/;

    # DCD to .word
    # This one is for incoming symbols
@@ -123,8 +114,8 @@ while (<STDIN>)
    # put the colon at the end of the line in the macro
    s/^([a-zA-Z_0-9\$]+)/$1:/ if !/EQU/;

-    # ALIGN directive
-    s/ALIGN/.balign/g;
+    # Strip ALIGN
+    s/\sALIGN/@ ALIGN/g;

    # Strip ARM
    s/\sARM/@ ARM/g;
@@ -136,23 +127,9 @@ while (<STDIN>)
    # Strip PRESERVE8
    s/\sPRESERVE8/@ PRESERVE8/g;

-    # Use PROC and ENDP to give the symbols a .size directive.
-    # This makes them show up properly in debugging tools like gdb and valgrind.
-    if (/\bPROC\b/)
-    {
-        my $proc;
-        /^_([\.0-9A-Z_a-z]\w+)\b/;
-        $proc = $1;
-        push(@proc_stack, $proc) if ($proc);
-        s/\bPROC\b/@ $&/;
-    }
-    if (/\bENDP\b/)
-    {
-        my $proc;
-        s/\bENDP\b/@ $&/;
-        $proc = pop(@proc_stack);
-        $_ = "\t.size $proc, .-$proc".$_ if ($proc);
-    }
+    # Strip PROC and ENDPROC
+    s/\sPROC/@/g;
+    s/\sENDP/@/g;

    # EQU directive
    s/(.*)EQU(.*)/.equ $1, $2/;
@@ -171,6 +148,3 @@ while (<STDIN>)
    next if /^\s*END\s*$/;
    print;
 }
-
-# Mark that this object doesn't need an executable stack.
-printf ("\t.section\t.note.GNU-stack,\"\",\%\%progbits\n");
--- a/build/make/ads2gas_apple.pl
+++ b/build/make/ads2gas_apple.pl
@@ -41,9 +41,6 @@ sub trim($)

 while (<STDIN>)
 {
-    # Load and store alignment
-    s/@/,:/g;
-
    # Comment character
    s/;/@/g;

@@ -100,10 +97,7 @@ while (<STDIN>)
    s/CODE([0-9][0-9])/.code $1/;

    # No AREA required
-    # But ALIGNs in AREA must be obeyed
-    s/^\s*AREA.*ALIGN=([0-9])$/.text\n.p2align $1/;
-    # If no ALIGN, strip the AREA and align to 4 bytes
-    s/^\s*AREA.*$/.text\n.p2align 2/;
+    s/^\s*AREA.*$/.text/;

    # DCD to .word
    # This one is for incoming symbols
@@ -143,8 +137,8 @@ while (<STDIN>)
    # put the colon at the end of the line in the macro
    s/^([a-zA-Z_0-9\$]+)/$1:/ if !/EQU/;

-    # ALIGN directive
-    s/ALIGN/.balign/g;
+    # Strip ALIGN
+    s/\sALIGN/@ ALIGN/g;

    # Strip ARM
    s/\sARM/@ ARM/g;
--- a/build/make/configure.sh
+++ b/build/make/configure.sh
@@ -412,14 +412,11 @@ EOF
 write_common_target_config_h() {
    cat > ${TMP_H} << EOF
 /* This file automatically generated by configure. Do not edit! */
-#ifndef VPX_CONFIG_H
-#define VPX_CONFIG_H
 #define RESTRICT    ${RESTRICT}
 EOF
    print_config_h ARCH   "${TMP_H}" ${ARCH_LIST}
    print_config_h HAVE   "${TMP_H}" ${HAVE_LIST}
    print_config_h CONFIG "${TMP_H}" ${CONFIG_LIST}
-    echo "#endif /* VPX_CONFIG_H */" >> ${TMP_H}
    mkdir -p `dirname "$1"`
    cmp "$1" ${TMP_H} >/dev/null 2>&1 || mv ${TMP_H} "$1"
 }
@@ -629,7 +626,7 @@ process_common_toolchain() {
    case ${toolchain} in
        sparc-solaris-*)
            add_extralibs -lposix4
-            disable fast_unaligned
+            add_cflags "-DMUST_BE_ALIGNED"
            ;;
        *-solaris-*)
            add_extralibs -lposix4
@@ -642,8 +639,8 @@ process_common_toolchain() {
    # on arm, isa versions are supersets
    enabled armv7a && soft_enable armv7 ### DEBUG
    enabled armv7 && soft_enable armv6
-    enabled armv7 || enabled armv6 && soft_enable armv5te
-    enabled armv7 || enabled armv6 && soft_enable fast_unaligned
+    enabled armv6 && soft_enable armv5te
+    enabled armv6 && soft_enable fast_unaligned
    enabled iwmmxt2 && soft_enable iwmmxt
    enabled iwmmxt && soft_enable armv5te

@@ -692,7 +689,7 @@ process_common_toolchain() {
            if enabled armv7
                then
                    check_add_cflags --cpu=Cortex-A8 --fpu=softvfp+vfpv3
-                    check_add_asflags --cpu=Cortex-A8 --fpu=softvfp+vfpv3
+                    check_add_asflags --cpu=Cortex-A8 --fpu=none
                else
                    check_add_cflags --cpu=${tgt_isa##armv}
                    check_add_asflags --cpu=${tgt_isa##armv}
@@ -732,18 +729,19 @@ process_common_toolchain() {
            add_cflags -arch ${tgt_isa}
            add_ldflags -arch_only ${tgt_isa}

-            add_cflags  "-isysroot ${SDK_PATH}/SDKs/iPhoneOS4.3.sdk"
+            add_cflags  "-isysroot /Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS4.2.sdk"

            # This should be overridable
-            alt_libc=${SDK_PATH}/SDKs/iPhoneOS4.3.sdk
+            alt_libc=${SDK_PATH}/SDKs/iPhoneOS4.2.sdk

            # Add the paths for the alternate libc
-            for d in usr/include usr/include/gcc/darwin/4.2/ usr/lib/gcc/arm-apple-darwin10/4.2.1/include/; do
+#            for d in usr/include usr/include/gcc/darwin/4.0/; do
+            for d in usr/include usr/include/gcc/darwin/4.0/ usr/lib/gcc/arm-apple-darwin10/4.2.1/include/; do
                try_dir="${alt_libc}/${d}"
                [ -d "${try_dir}" ] && add_cflags -I"${try_dir}"
            done

-            for d in lib usr/lib usr/lib/system; do
+            for d in lib usr/lib; do
                try_dir="${alt_libc}/${d}"
                [ -d "${try_dir}" ] && add_ldflags -L"${try_dir}"
            done
@@ -754,24 +752,41 @@ process_common_toolchain() {
        linux*)
            enable linux
            if enabled rvct; then
-                # Check if we have CodeSourcery GCC in PATH. Needed for
-                # libraries
-                hash arm-none-linux-gnueabi-gcc 2>&- || \
-                  die "Couldn't find CodeSourcery GCC from PATH"
+                # Compiling with RVCT requires an alternate libc (glibc) when
+                # targetting linux.
+                disabled builtin_libc \
+                    || die "Must supply --libc when targetting *-linux-rvct"

-                # Use armcc as a linker to enable translation of
-                # some gcc specific options such as -lm and -lpthread.
-                LD="armcc --translate_gcc"
+                # Set up compiler
+                add_cflags --library_interface=aeabi_glibc
+                add_cflags --no_hide_all
+                add_cflags --dwarf2

-                # create configuration file (uses path to CodeSourcery GCC)
-                armcc --arm_linux_configure --arm_linux_config_file=arm_linux.cfg
+                # Set up linker
+                add_ldflags --sysv --no_startup --no_ref_cpp_init
+                add_ldflags --entry=_start
+                add_ldflags --keep '"*(.init)"' --keep '"*(.fini)"'
+                add_ldflags --keep '"*(.init_array)"' --keep '"*(.fini_array)"'
+                add_ldflags --dynamiclinker=/lib/ld-linux.so.3
+                add_extralibs libc.so.6 -lc_nonshared crt1.o crti.o crtn.o

-                add_cflags --arm_linux_paths --arm_linux_config_file=arm_linux.cfg
-                add_asflags --no_hide_all --apcs=/interwork
-                add_ldflags --arm_linux_paths --arm_linux_config_file=arm_linux.cfg
-                enabled pic && add_cflags --apcs=/fpic
-                enabled pic && add_asflags --apcs=/fpic
-                enabled shared && add_cflags --shared
+                # Add the paths for the alternate libc
+                for d in usr/include; do
+                    try_dir="${alt_libc}/${d}"
+                    [ -d "${try_dir}" ] && add_cflags -J"${try_dir}"
+                done
+                add_cflags -J"${RVCT31INC}"
+                for d in lib usr/lib; do
+                    try_dir="${alt_libc}/${d}"
+                    [ -d "${try_dir}" ] && add_ldflags -L"${try_dir}"
+                done
+
+
+                # glibc has some struct members named __align, which is a
+                # storage modifier in RVCT. If we need to use this modifier,
+                # we'll have to #undef it in our code. Note that this must
+                # happen AFTER all libc inclues.
+                add_cflags -D__align=x_align_x
            fi
        ;;

@@ -870,8 +885,6 @@ process_common_toolchain() {
                link_with_cc=gcc
                tune_cflags="-march="
            setup_gnu_toolchain
-                #for 32 bit x86 builds, -O3 did not turn on this flag
-                enabled optimizations && check_add_cflags -fomit-frame-pointer
                ;;
        esac

@@ -939,23 +952,15 @@ process_common_toolchain() {
    enabled gcov &&
        check_add_cflags -fprofile-arcs -ftest-coverage &&
        check_add_ldflags -fprofile-arcs -ftest-coverage
-
    if enabled optimizations; then
-        if enabled rvct; then
-            enabled small && check_add_cflags -Ospace || check_add_cflags -Otime
-        else
-            enabled small && check_add_cflags -O2 ||  check_add_cflags -O3
-        fi
+        enabled rvct && check_add_cflags -Otime
+        enabled small && check_add_cflags -O2 || check_add_cflags -O3
    fi

    # Position Independent Code (PIC) support, for building relocatable
    # shared objects
    enabled gcc && enabled pic && check_add_cflags -fPIC

-    # Work around longjmp interception on glibc >= 2.11, to improve binary
-    # compatibility. See http://code.google.com/p/webm/issues/detail?id=166
-    enabled linux && check_add_cflags -D_FORTIFY_SOURCE=0
-
    # Check for strip utility variant
    ${STRIP} -V 2>/dev/null | grep GNU >/dev/null && enable gnu_strip

@@ -974,9 +979,6 @@ EOF
        esac
    fi

-    # for sysconf(3) and friends.
-    check_header unistd.h
-
    # glibc needs these
    if enabled linux; then
        add_cflags -D_LARGEFILE_SOURCE
--- a/build/make/gen_msvs_proj.sh
+++ b/build/make/gen_msvs_proj.sh
@@ -365,7 +365,7 @@ generate_vcproj() {
                            DebugInformationFormat="1" \
                            Detect64BitPortabilityProblems="true" \

-                        $uses_asm && tag Tool Name="YASM"  IncludePaths="$incs" Debug="true"
+                        $uses_asm && tag Tool Name="YASM"  IncludePaths="$incs" Debug="1"
                    ;;
                    *)
                        tag Tool \
@@ -379,7 +379,7 @@ generate_vcproj() {
                            DebugInformationFormat="1" \
                            Detect64BitPortabilityProblems="true" \

-                        $uses_asm && tag Tool Name="YASM"  IncludePaths="$incs" Debug="true"
+                        $uses_asm && tag Tool Name="YASM"  IncludePaths="$incs" Debug="1"
                    ;;
                esac
            ;;
@@ -447,8 +447,6 @@ generate_vcproj() {
                    obj_int_extract)
                        tag Tool \
                            Name="VCCLCompilerTool" \
-                            Optimization="2" \
-                            FavorSizeorSpeed="1" \
                            AdditionalIncludeDirectories="$incs" \
                            PreprocessorDefinitions="WIN32;NDEBUG;_CONSOLE;_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_DEPRECATE" \
                            RuntimeLibrary="$release_runtime" \
@@ -464,8 +462,6 @@ generate_vcproj() {

                        tag Tool \
                            Name="VCCLCompilerTool" \
-                            Optimization="2" \
-                            FavorSizeorSpeed="1" \
                            AdditionalIncludeDirectories="$incs" \
                            PreprocessorDefinitions="WIN32;NDEBUG;_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_DEPRECATE;$defines" \
                            RuntimeLibrary="$release_runtime" \
@@ -480,8 +476,6 @@ generate_vcproj() {
                        tag Tool \
                            Name="VCCLCompilerTool" \
                            AdditionalIncludeDirectories="$incs" \
-                            Optimization="2" \
-                            FavorSizeorSpeed="1" \
                            PreprocessorDefinitions="WIN32;NDEBUG;_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_DEPRECATE;$defines" \
                            RuntimeLibrary="$release_runtime" \
                            UsePrecompiledHeader="0" \
--- a/build/make/obj_int_extract.c
+++ b/build/make/obj_int_extract.c
@@ -9,13 +9,25 @@
 */


-#include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
-#include <string.h>

 #include "vpx_config.h"
+
+#if defined(_MSC_VER) || defined(__MINGW32__)
+#include <io.h>
+#include <share.h>
 #include "vpx/vpx_integer.h"
+#else
+#include <stdint.h>
+#include <unistd.h>
+#endif
+
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdarg.h>

 typedef enum
 {
@@ -35,6 +47,7 @@ int log_msg(const char *fmt, ...)
 }

 #if defined(__GNUC__) && __GNUC__
+
 #if defined(__MACH__)

 #include <mach-o/loader.h>
@@ -212,6 +225,73 @@ bail:

 }

+int main(int argc, char **argv)
+{
+    int fd;
+    char *f;
+    struct stat stat_buf;
+    uint8_t *file_buf;
+    int res;
+
+    if (argc < 2 || argc > 3)
+    {
+        fprintf(stderr, "Usage: %s [output format] <obj file>\n\n", argv[0]);
+        fprintf(stderr, "  <obj file>\tMachO format object file to parse\n");
+        fprintf(stderr, "Output Formats:\n");
+        fprintf(stderr, "  gas  - compatible with GNU assembler\n");
+        fprintf(stderr, "  rvds - compatible with armasm\n");
+        goto bail;
+    }
+
+    f = argv[2];
+
+    if (!((!strcmp(argv[1], "rvds")) || (!strcmp(argv[1], "gas"))))
+        f = argv[1];
+
+    fd = open(f, O_RDONLY);
+
+    if (fd < 0)
+    {
+        perror("Unable to open file");
+        goto bail;
+    }
+
+    if (fstat(fd, &stat_buf))
+    {
+        perror("stat");
+        goto bail;
+    }
+
+    file_buf = malloc(stat_buf.st_size);
+
+    if (!file_buf)
+    {
+        perror("malloc");
+        goto bail;
+    }
+
+    if (read(fd, file_buf, stat_buf.st_size) != stat_buf.st_size)
+    {
+        perror("read");
+        goto bail;
+    }
+
+    if (close(fd))
+    {
+        perror("close");
+        goto bail;
+    }
+
+    res = parse_macho(file_buf, stat_buf.st_size);
+    free(file_buf);
+
+    if (!res)
+        return EXIT_SUCCESS;
+
+bail:
+    return EXIT_FAILURE;
+}
+
 #elif defined(__ELF__)
 #include "elf.h"

@@ -660,24 +740,96 @@ bail:
    return 1;
 }

+int main(int argc, char **argv)
+{
+    int fd;
+    output_fmt_t mode;
+    char *f;
+    struct stat stat_buf;
+    uint8_t *file_buf;
+    int res;
+
+    if (argc < 2 || argc > 3)
+    {
+        fprintf(stderr, "Usage: %s [output format] <obj file>\n\n", argv[0]);
+        fprintf(stderr, "  <obj file>\tELF format object file to parse\n");
+        fprintf(stderr, "Output Formats:\n");
+        fprintf(stderr, "  gas  - compatible with GNU assembler\n");
+        fprintf(stderr, "  rvds - compatible with armasm\n");
+        goto bail;
+    }
+
+    f = argv[2];
+
+    if (!strcmp(argv[1], "rvds"))
+        mode = OUTPUT_FMT_RVDS;
+    else if (!strcmp(argv[1], "gas"))
+        mode = OUTPUT_FMT_GAS;
+    else
+        f = argv[1];
+
+
+    fd = open(f, O_RDONLY);
+
+    if (fd < 0)
+    {
+        perror("Unable to open file");
+        goto bail;
+    }
+
+    if (fstat(fd, &stat_buf))
+    {
+        perror("stat");
+        goto bail;
+    }
+
+    file_buf = malloc(stat_buf.st_size);
+
+    if (!file_buf)
+    {
+        perror("malloc");
+        goto bail;
+    }
+
+    if (read(fd, file_buf, stat_buf.st_size) != stat_buf.st_size)
+    {
+        perror("read");
+        goto bail;
+    }
+
+    if (close(fd))
+    {
+        perror("close");
+        goto bail;
+    }
+
+    res = parse_elf(file_buf, stat_buf.st_size, mode);
+    free(file_buf);
+
+    if (!res)
+        return EXIT_SUCCESS;
+
+bail:
+    return EXIT_FAILURE;
+}
+#endif
 #endif
-#endif /* defined(__GNUC__) && __GNUC__ */


-#if defined(_MSC_VER) || defined(__MINGW32__) || defined(__CYGWIN__)
+#if defined(_MSC_VER) || defined(__MINGW32__)
 /*  See "Microsoft Portable Executable and Common Object File Format Specification"
    for reference.
 */
 #define get_le32(x) ((*(x)) | (*(x+1)) << 8 |(*(x+2)) << 16 | (*(x+3)) << 24 )
 #define get_le16(x) ((*(x)) | (*(x+1)) << 8)

-int parse_coff(uint8_t *buf, size_t sz)
+int parse_coff(unsigned __int8 *buf, size_t sz)
 {
    unsigned int nsections, symtab_ptr, symtab_sz, strtab_ptr;
    unsigned int sectionrawdata_ptr;
    unsigned int i;
-    uint8_t *ptr;
-    uint32_t symoffset;
+    unsigned __int8 *ptr;
+    unsigned __int32 symoffset;

    char **sectionlist;  //this array holds all section names in their correct order.
    //it is used to check if the symbol is in .bss or .data section.
@@ -755,7 +907,7 @@ int parse_coff(uint8_t *buf, size_t sz)

    for (i = 0; i < symtab_sz; i++)
    {
-        int16_t section = get_le16(ptr + 12); //section number
+        __int16 section = get_le16(ptr + 12); //section number

        if (section > 0 && ptr[16] == 2)
        {
@@ -826,21 +978,20 @@ bail:

    return 1;
 }
-#endif /* defined(_MSC_VER) || defined(__MINGW32__) || defined(__CYGWIN__) */

 int main(int argc, char **argv)
 {
-    output_fmt_t mode = OUTPUT_FMT_PLAIN;
+    int fd;
+    output_fmt_t mode;
    const char *f;
-    uint8_t *file_buf;
+    struct _stat stat_buf;
+    unsigned __int8 *file_buf;
    int res;
-    FILE *fp;
-    long int file_size;

    if (argc < 2 || argc > 3)
    {
        fprintf(stderr, "Usage: %s [output format] <obj file>\n\n", argv[0]);
-        fprintf(stderr, "  <obj file>\tobject file to parse\n");
+        fprintf(stderr, "  <obj file>\tELF format object file to parse\n");
        fprintf(stderr, "Output Formats:\n");
        fprintf(stderr, "  gas  - compatible with GNU assembler\n");
        fprintf(stderr, "  rvds - compatible with armasm\n");
@@ -856,22 +1007,15 @@ int main(int argc, char **argv)
    else
        f = argv[1];

-    fp = fopen(f, "rb");
+    fd = _sopen(f, _O_BINARY, _SH_DENYNO, _S_IREAD | _S_IWRITE);

-    if (!fp)
-    {
-        perror("Unable to open file");
-        goto bail;
-    }
-
-    if (fseek(fp, 0, SEEK_END))
+    if (_fstat(fd, &stat_buf))
    {
        perror("stat");
        goto bail;
    }

-    file_size = ftell(fp);
-    file_buf = malloc(file_size);
+    file_buf = malloc(stat_buf.st_size);

    if (!file_buf)
    {
@@ -879,30 +1023,19 @@ int main(int argc, char **argv)
        goto bail;
    }

-    rewind(fp);
-
-    if (fread(file_buf, sizeof(char), file_size, fp) != file_size)
+    if (_read(fd, file_buf, stat_buf.st_size) != stat_buf.st_size)
    {
        perror("read");
        goto bail;
    }

-    if (fclose(fp))
+    if (_close(fd))
    {
        perror("close");
        goto bail;
    }

-#if defined(__GNUC__) && __GNUC__
-#if defined(__MACH__)
-    res = parse_macho(file_buf, file_size);
-#elif defined(__ELF__)
-    res = parse_elf(file_buf, file_size, mode);
-#endif
-#endif
-#if defined(_MSC_VER) || defined(__MINGW32__) || defined(__CYGWIN__)
-    res = parse_coff(file_buf, file_size);
-#endif
+    res = parse_coff(file_buf, stat_buf.st_size);

    free(file_buf);

@@ -912,3 +1045,4 @@ int main(int argc, char **argv)
 bail:
    return EXIT_FAILURE;
 }
+#endif
--- a/21
+++ b/21
@@ -31,16 +31,14 @@ Advanced options:
  ${toggle_md5}                   support for output of checksum data
  ${toggle_static_msvcrt}         use static MSVCRT (VS builds only)
  ${toggle_vp8}                   VP8 codec support
-  ${toggle_internal_stats}        output of encoder internal stats for debug, if supported (encoders)
+  ${toggle_psnr}                  output of PSNR data, if supported (encoders)
  ${toggle_mem_tracker}           track memory usage
  ${toggle_postproc}              postprocessing
  ${toggle_multithread}           multithreaded encoding and decoding.
  ${toggle_spatial_resampling}    spatial sampling (scaling) support
  ${toggle_realtime_only}         enable this option while building for real-time encoding
-  ${toggle_error_concealment}     enable this option to get a decoder which is able to conceal losses
  ${toggle_runtime_cpu_detect}    runtime cpu detection
  ${toggle_shared}                shared library support
-  ${toggle_static}                static library support
  ${toggle_small}                 favor smaller size over speed
  ${toggle_postproc_visualizer}   macro block / block level visualizers

@@ -154,7 +152,6 @@ enabled doxygen && php -v >/dev/null 2>&1 && enable install_docs
 enable install_bins
 enable install_libs

-enable static
 enable optimizations
 enable fast_unaligned #allow unaligned accesses, if supported by hw
 enable md5
@@ -214,7 +211,6 @@ HAVE_LIST="
    alt_tree_layout
    pthread_h
    sys_mman_h
-    unistd_h
 "
 CONFIG_LIST="
    external_build
@@ -244,7 +240,7 @@ CONFIG_LIST="
    runtime_cpu_detect
    postproc
    multithread
-    internal_stats
+    psnr
    ${CODECS}
    ${CODEC_FAMILIES}
    encoders
@@ -252,9 +248,7 @@ CONFIG_LIST="
    static_msvcrt
    spatial_resampling
    realtime_only
-    error_concealment
    shared
-    static
    small
    postproc_visualizer
    os_support
@@ -287,16 +281,14 @@ CMDLINE_SELECT="
    dc_recon
    postproc
    multithread
-    internal_stats
+    psnr
    ${CODECS}
    ${CODEC_FAMILIES}
    static_msvcrt
    mem_tracker
    spatial_resampling
    realtime_only
-    error_concealment
    shared
-    static
    small
    postproc_visualizer
 "
@@ -385,7 +377,6 @@ process_targets() {
    if [ -f "${source_path}/build/make/version.sh" ]; then
        local ver=`"$source_path/build/make/version.sh" --bare $source_path`
        DIST_DIR="${DIST_DIR}-${ver}"
-        VERSION_STRING=${ver}
        ver=${ver%%-*}
        VERSION_PATCH=${ver##*.}
        ver=${ver%.*}
@@ -394,8 +385,6 @@ process_targets() {
        VERSION_MAJOR=${ver%.*}
    fi
    enabled child || cat <<EOF >> config.mk
-
-PREFIX=${prefix}
 ifeq (\$(MAKECMDGOALS),dist)
 DIST_DIR?=${DIST_DIR}
 else
@@ -403,8 +392,6 @@ DIST_DIR?=\$(DESTDIR)${prefix}
 endif
 LIBSUBDIR=${libdir##${prefix}/}

-VERSION_STRING=${VERSION_STRING}
-
 VERSION_MAJOR=${VERSION_MAJOR}
 VERSION_MINOR=${VERSION_MINOR}
 VERSION_PATCH=${VERSION_PATCH}
@@ -499,7 +486,7 @@ process_toolchain() {
        check_add_cflags -Wpointer-arith
        check_add_cflags -Wtype-limits
        check_add_cflags -Wcast-qual
-        enabled extra_warnings || check_add_cflags -Wno-unused-function
+        enabled extra_warnings || check_add_cflags -Wno-unused
    fi

    if enabled icc; then
--- a/examples.mk
+++ b/examples.mk
@@ -16,7 +16,7 @@ UTILS-$(CONFIG_DECODERS)    += vpxdec.c
 vpxdec.SRCS                 += md5_utils.c md5_utils.h
 vpxdec.SRCS                 += vpx_ports/vpx_timer.h
 vpxdec.SRCS                 += vpx/vpx_integer.h
-vpxdec.SRCS                 += args.c args.h
+vpxdec.SRCS                 += args.c args.h vpx_ports/config.h
 vpxdec.SRCS                 += tools_common.c tools_common.h
 vpxdec.SRCS                 += nestegg/halloc/halloc.h
 vpxdec.SRCS                 += nestegg/halloc/src/align.h
@@ -30,7 +30,7 @@ vpxdec.DESCRIPTION           = Full featured decoder
 UTILS-$(CONFIG_ENCODERS)    += vpxenc.c
 vpxenc.SRCS                 += args.c args.h y4minput.c y4minput.h
 vpxenc.SRCS                 += tools_common.c tools_common.h
-vpxenc.SRCS                 += vpx_ports/mem_ops.h
+vpxenc.SRCS                 += vpx_ports/config.h vpx_ports/mem_ops.h
 vpxenc.SRCS                 += vpx_ports/mem_ops_aligned.h
 vpxenc.SRCS                 += libmkv/EbmlIDs.h
 vpxenc.SRCS                 += libmkv/EbmlWriter.c
@@ -77,11 +77,6 @@ GEN_EXAMPLES-$(CONFIG_ENCODERS) += decode_with_drops.c
 endif
 decode_with_drops.GUID           = CE5C53C4-8DDA-438A-86ED-0DDD3CDB8D26
 decode_with_drops.DESCRIPTION    = Drops frames while decoding
-ifeq ($(CONFIG_DECODERS),yes)
-GEN_EXAMPLES-$(CONFIG_ERROR_CONCEALMENT) += decode_with_partial_drops.c
-endif
-decode_with_partial_drops.GUID           = 61C2D026-5754-46AC-916F-1343ECC5537E
-decode_with_partial_drops.DESCRIPTION    = Drops parts of frames while decoding
 GEN_EXAMPLES-$(CONFIG_ENCODERS) += error_resilient.c
 error_resilient.GUID             = DF5837B9-4145-4F92-A031-44E4F832E00C
 error_resilient.DESCRIPTION      = Error Resiliency Feature
@@ -127,8 +122,8 @@ else
    LIB_PATH := $(call enabled,LIB_PATH)
    INC_PATH := $(call enabled,INC_PATH)
 endif
-INTERNAL_CFLAGS = $(addprefix -I,$(INC_PATH))
-INTERNAL_LDFLAGS += $(addprefix -L,$(LIB_PATH))
+CFLAGS += $(addprefix -I,$(INC_PATH))
+LDFLAGS += $(addprefix -L,$(LIB_PATH))


 # Expand list of selected examples to build (as specified above)
@@ -167,10 +162,8 @@ BINS-$(NOT_MSVS)           += $(addprefix $(BUILD_PFX),$(ALL_EXAMPLES:.c=))

 # Instantiate linker template for all examples.
 CODEC_LIB=$(if $(CONFIG_DEBUG_LIBS),vpx_g,vpx)
-CODEC_LIB_SUF=$(if $(CONFIG_SHARED),.so,.a)
 $(foreach bin,$(BINS-yes),\
-    $(if $(BUILD_OBJS),$(eval $(bin):\
-        $(LIB_PATH)/lib$(CODEC_LIB)$(CODEC_LIB_SUF)))\
+    $(if $(BUILD_OBJS),$(eval $(bin): $(LIB_PATH)/lib$(CODEC_LIB).a))\
    $(if $(BUILD_OBJS),$(eval $(call linker_template,$(bin),\
        $(call objs,$($(notdir $(bin)).SRCS)) \
        -l$(CODEC_LIB) $(addprefix -l,$(CODEC_EXTRA_LIBS))\
@@ -221,8 +214,7 @@ $(1): $($(1:.vcproj=).SRCS)
            --ver=$$(CONFIG_VS_VERSION)\
            --proj-guid=$$($$(@:.vcproj=).GUID)\
            $$(if $$(CONFIG_STATIC_MSVCRT),--static-crt) \
-            --out=$$@ $$(INTERNAL_CFLAGS) $$(CFLAGS) \
-            $$(INTERNAL_LDFLAGS) $$(LDFLAGS) -l$$(CODEC_LIB) -lwinmm $$^
+            --out=$$@ $$(CFLAGS) $$(LDFLAGS) -l$$(CODEC_LIB) -lwinmm $$^
 endef
 PROJECTS-$(CONFIG_MSVS) += $(ALL_EXAMPLES:.c=.vcproj)
 INSTALL-BINS-$(CONFIG_MSVS) += $(foreach p,$(VS_PLATFORMS),\
--- a/examples/decode_to_md5.txt
+++ b/examples/decode_to_md5.txt
@@ -34,8 +34,8 @@ MD5Init(&md5);
 for(plane=0; plane < 3; plane++) {
    unsigned char *buf =img->planes[plane];

-    for(y=0; y < (plane ? (img->d_h + 1) >> 1 : img->d_h); y++) {
-        MD5Update(&md5, buf, (plane ? (img->d_w + 1) >> 1 : img->d_w));
+    for(y=0; y<img->d_h >> (plane?1:0); y++) {
+        MD5Update(&md5, buf, img->d_w >> (plane?1:0));
        buf += img->stride[plane];
    }
 }
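
Note on the hunk above: the two sides compute the chroma plane size differently. One side uses `(d + 1) >> 1` (round up), the other `d >> 1` (truncate). They agree for even dimensions and differ by one row or column for odd ones. The following is a minimal sketch of that difference, assuming 4:2:0 subsampling; the helper name `plane_dim` and its `round_up` switch are hypothetical and do not appear in the source tree.

```c
/* Hypothetical helper for illustration only; not part of libvpx.
 * Returns the number of rows (or bytes per row) to process for a plane,
 * assuming 4:2:0 data where planes 1 and 2 are half-size chroma planes. */
static unsigned int plane_dim(unsigned int d, int plane, int round_up)
{
    if (plane == 0)
        return d;                 /* luma plane: full size */

    if (round_up)
        return (d + 1) >> 1;      /* rounds up, keeps the last odd row/column */

    return d >> 1;                /* truncates, drops the last odd row/column */
}
```

With `d = 175`, the rounding form yields 88 and the truncating form yields 87, so the checksum or the written file covers one chroma row and column fewer when the truncating form is used on odd-sized frames.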
--- a/examples/decode_with_partial_drops.txt
+++ b/examples/decode_with_partial_drops.txt
@@ -1,238 +0,0 @@
-@TEMPLATE decoder_tmpl.c
-Decode With Partial Drops Example
-=========================
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ INTRODUCTION
-This is an example utility which drops a series of frames (or parts of frames),
-as specified on the command line. This is useful for observing the error
-recovery features of the codec.
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ INTRODUCTION
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ EXTRA_INCLUDES
-#include <time.h>
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ EXTRA_INCLUDES
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ HELPERS
-struct parsed_header
-{
-    char key_frame;
-    int version;
-    char show_frame;
-    int first_part_size;
-};
-
-int next_packet(struct parsed_header* hdr, int pos, int length, int mtu)
-{
-    int size = 0;
-    int remaining = length - pos;
-    /* Uncompressed part is 3 bytes for P frames and 10 bytes for I frames */
-    int uncomp_part_size = (hdr->key_frame ? 10 : 3);
-    /* number of bytes yet to send from header and the first partition */
-    int remainFirst = uncomp_part_size + hdr->first_part_size - pos;
-    if (remainFirst > 0)
-    {
-        if (remainFirst <= mtu)
-        {
-            size = remainFirst;
-        }
-        else
-        {
-            size = mtu;
-        }
-
-        return size;
-    }
-
-    /* second partition; just slot it up according to MTU */
-    if (remaining <= mtu)
-    {
-        size = remaining;
-        return size;
-    }
-    return mtu;
-}
-
-void throw_packets(unsigned char* frame, int* size, int loss_rate,
-                   int* thrown, int* kept)
-{
-    unsigned char loss_frame[256*1024];
-    int pkg_size = 1;
-    int pos = 0;
-    int loss_pos = 0;
-    struct parsed_header hdr;
-    unsigned int tmp;
-    int mtu = 1500;
-
-    if (*size < 3)
-    {
-        return;
-    }
-    putc('|', stdout);
-    /* parse uncompressed 3 bytes */
-    tmp = (frame[2] << 16) | (frame[1] << 8) | frame[0];
-    hdr.key_frame = !(tmp & 0x1); /* inverse logic */
-    hdr.version = (tmp >> 1) & 0x7;
-    hdr.show_frame = (tmp >> 4) & 0x1;
-    hdr.first_part_size = (tmp >> 5) & 0x7FFFF;
-
-    /* don't drop key frames */
-    if (hdr.key_frame)
-    {
-        int i;
-        *kept = *size/mtu + ((*size % mtu > 0) ? 1 : 0); /* approximate */
-        for (i=0; i < *kept; i++)
-            putc('.', stdout);
-        return;
-    }
-
-    while ((pkg_size = next_packet(&hdr, pos, *size, mtu)) > 0)
-    {
-        int loss_event = ((rand() + 1.0)/(RAND_MAX + 1.0) < loss_rate/100.0);
-        if (*thrown == 0 && !loss_event)
-        {
-            memcpy(loss_frame + loss_pos, frame + pos, pkg_size);
-            loss_pos += pkg_size;
-            (*kept)++;
-            putc('.', stdout);
-        }
-        else
-        {
-            (*thrown)++;
-            putc('X', stdout);
-        }
-        pos += pkg_size;
-    }
-    memcpy(frame, loss_frame, loss_pos);
-    memset(frame + loss_pos, 0, *size - loss_pos);
-    *size = loss_pos;
-}
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ HELPERS
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ DEC_INIT
-/* Initialize codec */
-flags = VPX_CODEC_USE_ERROR_CONCEALMENT;
-res = vpx_codec_dec_init(&codec, interface, &dec_cfg, flags);
-if(res)
-    die_codec(&codec, "Failed to initialize decoder");
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ DEC_INIT
-
-Usage
-----
-This example adds a single argument to the `simple_decoder` example,
-which specifies the range or pattern of frames to drop. The parameter is
-parsed as follows:
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ USAGE
-if(argc < 4 || argc > 6)
-    die("Usage: %s <infile> <outfile> [-t <num threads>] <N-M|N/M|L,S>\n",
-        argv[0]);
-{
-    char *nptr;
-    int arg_num = 3;
-    if (argc == 6 && strncmp(argv[arg_num++], "-t", 2) == 0)
-        dec_cfg.threads = strtol(argv[arg_num++], NULL, 0);
-    n = strtol(argv[arg_num], &nptr, 0);
-    mode = (*nptr == '\0' || *nptr == ',') ? 2 : (*nptr == '-') ? 1 : 0;
-
-    m = strtol(nptr+1, NULL, 0);
-    if((!n && !m) || (*nptr != '-' && *nptr != '/' &&
-        *nptr != '\0' && *nptr != ','))
-        die("Couldn't parse pattern %s\n", argv[3]);
-}
-seed = (m > 0) ? m : (unsigned int)time(NULL);
-srand(seed);thrown_frame = 0;
-printf("Seed: %u\n", seed);
-printf("Threads: %d\n", dec_cfg.threads);
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ USAGE
-
-
-Dropping A Range Of Frames
--------------------------
-To drop a range of frames, specify the starting frame and the ending
-frame to drop, separated by a dash. The following command will drop
-frames 5 through 10 (base 1).
-
-  $ ./decode_with_partial_drops in.ivf out.i420 5-10
-
-
-Dropping A Pattern Of Frames
----------------------------
-To drop a pattern of frames, specify the number of frames to drop and
-the number of frames after which to repeat the pattern, separated by
-a forward-slash. The following command will drop 3 of 7 frames.
-Specifically, it will decode 4 frames, then drop 3 frames, and then
-repeat.
-
-  $ ./decode_with_partial_drops in.ivf out.i420 3/7
-
-Dropping Random Parts Of Frames
-------------------------------
-A third argument tuple is available to split the frame into 1500 bytes pieces
-and randomly drop pieces rather than frames. The frame will be split at
-partition boundaries where possible. The following example will seed the RNG
-with the seed 123 and drop approximately 5% of the pieces. Pieces which
-are depending on an already dropped piece will also be dropped.
-
-  $ ./decode_with_partial_drops in.ivf out.i420 5,123
-
-
-Extra Variables
---------------
-This example maintains the pattern passed on the command line in the
-`n`, `m`, and `is_range` variables:
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ EXTRA_VARS
-int              n, m, mode;
-unsigned int     seed;
-int              thrown=0, kept=0;
-int              thrown_frame=0, kept_frame=0;
-vpx_codec_dec_cfg_t  dec_cfg = {0};
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ EXTRA_VARS
-
-
-Making The Drop Decision
------------------------
-The example decides whether to drop the frame based on the current
-frame number, immediately before decoding the frame.
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PRE_DECODE
-/* Decide whether to throw parts of the frame or the whole frame
-   depending on the drop mode */
-thrown_frame = 0;
-kept_frame = 0;
-switch (mode)
-{
-case 0:
-    if (m - (frame_cnt-1)%m <= n)
-    {
-        frame_sz = 0;
-    }
-    break;
-case 1:
-    if (frame_cnt >= n && frame_cnt <= m)
-    {
-        frame_sz = 0;
-    }
-    break;
-case 2:
-    throw_packets(frame, &frame_sz, n, &thrown_frame, &kept_frame);
-    break;
-default: break;
-}
-if (mode < 2)
-{
-    if (frame_sz == 0)
-    {
-        putc('X', stdout);
-        thrown_frame++;
-    }
-    else
-    {
-        putc('.', stdout);
-        kept_frame++;
-    }
-}
-thrown += thrown_frame;
-kept += kept_frame;
-fflush(stdout);
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PRE_DECODE
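
Note on the removed `throw_packets` helper above: it reads the first three bytes of a frame as a little-endian 24-bit value and extracts four fields from it. The sketch below isolates just that step so the bit layout is easier to follow. The names `frame_tag` and `parse_frame_tag` are hypothetical, and the bounds check mirrors the `*size < 3` guard in the removed code.

```c
/* Hypothetical names for illustration; the field layout follows the
 * bit operations in the removed throw_packets() helper. */
struct frame_tag
{
    int key_frame;        /* 1 for a key frame (the stored bit is inverted) */
    int version;          /* 3-bit bitstream version */
    int show_frame;       /* 1 if the frame is meant to be displayed */
    int first_part_size;  /* 19-bit size of the first partition, in bytes */
};

/* Returns 0 on success, -1 if fewer than 3 bytes are available. */
static int parse_frame_tag(const unsigned char *frame, int size,
                           struct frame_tag *tag)
{
    unsigned int tmp;

    if (size < 3)
        return -1;

    /* The three bytes are stored least significant first. */
    tmp = ((unsigned int)frame[2] << 16) |
          ((unsigned int)frame[1] << 8)  |
          frame[0];

    tag->key_frame       = !(tmp & 0x1);        /* bit 0, inverse logic */
    tag->version         = (tmp >> 1) & 0x7;    /* bits 1..3 */
    tag->show_frame      = (tmp >> 4) & 0x1;    /* bit 4 */
    tag->first_part_size = (tmp >> 5) & 0x7FFFF; /* bits 5..23 */
    return 0;
}
```

For example, the bytes `0x90 0x0C 0x00` decode to a shown key frame, version 0, with a 100-byte first partition.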
--- a/examples/decoder_tmpl.c
+++ b/examples/decoder_tmpl.c
@@ -42,8 +42,6 @@ static void die(const char *fmt, ...) {

@DIE_CODEC

-@HELPERS
-
 int main(int argc, char **argv) {
    FILE            *infile, *outfile;
    vpx_codec_ctx_t  codec;
--- a/examples/decoder_tmpl.txt
+++ b/examples/decoder_tmpl.txt
@@ -47,9 +47,8 @@ while((img = vpx_codec_get_frame(&codec, &iter))) {
 for(plane=0; plane < 3; plane++) {
    unsigned char *buf =img->planes[plane];

-    for(y=0; y < (plane ? (img->d_h + 1) >> 1 : img->d_h); y++) {
-        if(fwrite(buf, 1, (plane ? (img->d_w + 1) >> 1 : img->d_w),
-           outfile));
+    for(y=0; y<img->d_h >> (plane?1:0); y++) {
+        if(fwrite(buf, 1, img->d_w >> (plane?1:0), outfile));
        buf += img->stride[plane];
    }
 }
--- a/examples/encoder_tmpl.c
+++ b/examples/encoder_tmpl.c
@@ -111,6 +111,8 @@ int main(int argc, char **argv) {
    vpx_codec_ctx_t      codec;
    vpx_codec_enc_cfg_t  cfg;
    int                  frame_cnt = 0;
+    unsigned char        file_hdr[IVF_FILE_HDR_SZ];
+    unsigned char        frame_hdr[IVF_FRAME_HDR_SZ];
    vpx_image_t          raw;
    vpx_codec_err_t      res;
    long                 width;
--- a/examples/postproc.txt
+++ b/examples/postproc.txt
@@ -21,7 +21,7 @@ res = vpx_codec_dec_init(&codec, interface, NULL,
 if(res == VPX_CODEC_INCAPABLE) {
    printf("NOTICE: Postproc not supported by %s\n",
           vpx_codec_iface_name(interface));
-    res = vpx_codec_dec_init(&codec, interface, NULL, flags);
+    res = vpx_codec_dec_init(&codec, interface, NULL, 0);
 }
 if(res)
    die_codec(&codec, "Failed to initialize decoder");
--- a/libmkv/EbmlIDs.h
+++ b/libmkv/EbmlIDs.h
@@ -120,7 +120,7 @@ enum mkv
    //video
    Video = 0xE0,
    FlagInterlaced = 0x9A,
-    StereoMode = 0x53B8,
+//  StereoMode = 0x53B8,
    PixelWidth = 0xB0,
    PixelHeight = 0xBA,
    PixelCropBottom = 0x54AA,
--- a/libmkv/EbmlWriter.c
+++ b/libmkv/EbmlWriter.c
@@ -11,7 +11,6 @@
 #include <stdlib.h>
 #include <wchar.h>
 #include <string.h>
-#include <limits.h>
 #if defined(_MSC_VER)
 #define LITERALU64(n) n
 #else
@@ -34,7 +33,7 @@ void Ebml_WriteLen(EbmlGlobal *glob, long long val)

    val |= (LITERALU64(0x000000000000080) << ((size - 1) * 7));

-    Ebml_Serialize(glob, (void *) &val, sizeof(val), size);
+    Ebml_Serialize(glob, (void *) &val, size);
 }

 void Ebml_WriteString(EbmlGlobal *glob, const char *str)
@@ -61,26 +60,21 @@ void Ebml_WriteUTF8(EbmlGlobal *glob, const wchar_t *wstr)

 void Ebml_WriteID(EbmlGlobal *glob, unsigned long class_id)
 {
-    int len;
-
    if (class_id >= 0x01000000)
-        len = 4;
+        Ebml_Serialize(glob, (void *)&class_id, 4);
    else if (class_id >= 0x00010000)
-        len = 3;
+        Ebml_Serialize(glob, (void *)&class_id, 3);
    else if (class_id >= 0x00000100)
-        len = 2;
+        Ebml_Serialize(glob, (void *)&class_id, 2);
    else
-        len = 1;
-
-    Ebml_Serialize(glob, (void *)&class_id, sizeof(class_id), len);
+        Ebml_Serialize(glob, (void *)&class_id, 1);
 }
-
 void Ebml_SerializeUnsigned64(EbmlGlobal *glob, unsigned long class_id, uint64_t ui)
 {
    unsigned char sizeSerialized = 8 | 0x80;
    Ebml_WriteID(glob, class_id);
-    Ebml_Serialize(glob, &sizeSerialized, sizeof(sizeSerialized), 1);
-    Ebml_Serialize(glob, &ui, sizeof(ui), 8);
+    Ebml_Serialize(glob, &sizeSerialized, 1);
+    Ebml_Serialize(glob, &ui, 8);
 }

 void Ebml_SerializeUnsigned(EbmlGlobal *glob, unsigned long class_id, unsigned long ui)
@@ -103,8 +97,8 @@ void Ebml_SerializeUnsigned(EbmlGlobal *glob, unsigned long class_id, unsigned l
    }

    sizeSerialized = 0x80 | size;
-    Ebml_Serialize(glob, &sizeSerialized, sizeof(sizeSerialized), 1);
-    Ebml_Serialize(glob, &ui, sizeof(ui), size);
+    Ebml_Serialize(glob, &sizeSerialized, 1);
+    Ebml_Serialize(glob, &ui, size);
 }
 //TODO: perhaps this is a poor name for this id serializer helper function
 void Ebml_SerializeBinary(EbmlGlobal *glob, unsigned long class_id, unsigned long bin)
@@ -125,14 +119,14 @@ void Ebml_SerializeFloat(EbmlGlobal *glob, unsigned long class_id, double d)
    unsigned char len = 0x88;

    Ebml_WriteID(glob, class_id);
-    Ebml_Serialize(glob, &len, sizeof(len), 1);
-    Ebml_Serialize(glob,  &d, sizeof(d), 8);
+    Ebml_Serialize(glob, &len, 1);
+    Ebml_Serialize(glob,  &d, 8);
 }

 void Ebml_WriteSigned16(EbmlGlobal *glob, short val)
 {
    signed long out = ((val & 0x003FFFFF) | 0x00200000) << 8;
-    Ebml_Serialize(glob, &out, sizeof(out), 3);
+    Ebml_Serialize(glob, &out, 3);
 }

 void Ebml_SerializeString(EbmlGlobal *glob, unsigned long class_id, const char *s)
@@ -149,6 +143,7 @@ void Ebml_SerializeUTF8(EbmlGlobal *glob, unsigned long class_id, wchar_t *s)

 void Ebml_SerializeData(EbmlGlobal *glob, unsigned long class_id, unsigned char *data, unsigned long data_length)
 {
+    unsigned char size = 4;
    Ebml_WriteID(glob, class_id);
    Ebml_WriteLen(glob, data_length);
    Ebml_Write(glob,  data, data_length);
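
Note on the `Ebml_WriteID` hunk above: both sides pick the serialized length of an element ID from the same thresholds and differ only in how that length is handed to `Ebml_Serialize`. The sketch below shows the length selection on its own. The function names are hypothetical, and the big-endian byte writer is an assumption about what a serializer does with the length; the body of `Ebml_Serialize` is not shown in this patch.

```c
/* Hypothetical helper: number of bytes used to store an element ID,
 * using the same thresholds as Ebml_WriteID in the hunk above. */
static int ebml_id_len(unsigned long class_id)
{
    if (class_id >= 0x01000000UL)
        return 4;
    if (class_id >= 0x00010000UL)
        return 3;
    if (class_id >= 0x00000100UL)
        return 2;
    return 1;
}

/* Assumed behaviour, for illustration only: write the low `len` bytes
 * of `val` into `out`, most significant byte first. */
static void put_be(unsigned char *out, unsigned long val, int len)
{
    int i;

    for (i = 0; i < len; i++)
        out[i] = (unsigned char)((val >> (8 * (len - 1 - i))) & 0xFF);
}
```

With the IDs that appear in `EbmlIDs.h`, `Video = 0xE0` takes one byte and `StereoMode = 0x53B8` takes two.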
--- a/libmkv/EbmlWriter.h
+++ b/libmkv/EbmlWriter.h
@@ -15,7 +15,7 @@
 #include "vpx/vpx_integer.h"

 typedef struct EbmlGlobal EbmlGlobal;
-void  Ebml_Serialize(EbmlGlobal *glob, const void *, int, unsigned long);
+void  Ebml_Serialize(EbmlGlobal *glob, const void *, unsigned long);
 void  Ebml_Write(EbmlGlobal *glob, const void *, unsigned long);
 /////

--- a/libmkv/WebMElement.c
+++ b/libmkv/WebMElement.c
@@ -35,11 +35,11 @@ void writeSimpleBlock(EbmlGlobal *glob, unsigned char trackNumber, short timeCod
    Ebml_WriteID(glob, SimpleBlock);
    unsigned long blockLength = 4 + dataLength;
    blockLength |= 0x10000000; //TODO check length < 0x0FFFFFFFF
-    Ebml_Serialize(glob, &blockLength, sizeof(blockLength), 4);
+    Ebml_Serialize(glob, &blockLength, 4);
    trackNumber |= 0x80;  //TODO check track nubmer < 128
    Ebml_Write(glob, &trackNumber, 1);
    //Ebml_WriteSigned16(glob, timeCode,2); //this is 3 bytes
-    Ebml_Serialize(glob, &timeCode, sizeof(timeCode), 2);
+    Ebml_Serialize(glob, &timeCode, 2);
    unsigned char flags = 0x00 | (isKeyframe ? 0x80 : 0x00) | (lacingFlag << 1) | discardable;
    Ebml_Write(glob, &flags, 1);
    Ebml_Write(glob, data, dataLength);
--- a/libs.mk
+++ b/libs.mk
@@ -35,7 +35,6 @@ ifeq ($(CONFIG_VP8_ENCODER),yes)
  CODEC_SRCS-yes += $(addprefix $(VP8_PREFIX),$(call enabled,VP8_CX_SRCS))
  CODEC_EXPORTS-yes += $(addprefix $(VP8_PREFIX),$(VP8_CX_EXPORTS))
  CODEC_SRCS-yes += $(VP8_PREFIX)vp8cx.mk vpx/vp8.h vpx/vp8cx.h vpx/vp8e.h
-  CODEC_SRCS-$(ARCH_ARM) += $(VP8_PREFIX)vp8cx_arm.mk
  INSTALL-LIBS-yes += include/vpx/vp8.h include/vpx/vp8e.h include/vpx/vp8cx.h
  INSTALL_MAPS += include/vpx/% $(SRC_PATH_BARE)/$(VP8_PREFIX)/%
  CODEC_DOC_SRCS += vpx/vp8.h vpx/vp8cx.h
@@ -48,7 +47,6 @@ ifeq ($(CONFIG_VP8_DECODER),yes)
  CODEC_SRCS-yes += $(addprefix $(VP8_PREFIX),$(call enabled,VP8_DX_SRCS))
  CODEC_EXPORTS-yes += $(addprefix $(VP8_PREFIX),$(VP8_DX_EXPORTS))
  CODEC_SRCS-yes += $(VP8_PREFIX)vp8dx.mk vpx/vp8.h vpx/vp8dx.h
-  CODEC_SRCS-$(ARCH_ARM) += $(VP8_PREFIX)vp8dx_arm.mk
  INSTALL-LIBS-yes += include/vpx/vp8.h include/vpx/vp8dx.h
  INSTALL_MAPS += include/vpx/% $(SRC_PATH_BARE)/$(VP8_PREFIX)/%
  CODEC_DOC_SRCS += vpx/vp8.h vpx/vp8dx.h
@@ -91,7 +89,6 @@ $(eval $(if $(filter universal%,$(TOOLCHAIN)),LIPO_LIBVPX,BUILD_LIBVPX):=yes)

 CODEC_SRCS-$(BUILD_LIBVPX) += build/make/version.sh
 CODEC_SRCS-$(BUILD_LIBVPX) += vpx/vpx_integer.h
-CODEC_SRCS-$(BUILD_LIBVPX) += vpx_ports/asm_offsets.h
 CODEC_SRCS-$(BUILD_LIBVPX) += vpx_ports/vpx_timer.h
 CODEC_SRCS-$(BUILD_LIBVPX) += vpx_ports/mem.h
 CODEC_SRCS-$(BUILD_LIBVPX) += $(BUILD_PFX)vpx_config.c
@@ -103,7 +100,7 @@ CODEC_SRCS-$(BUILD_LIBVPX) += vpx_ports/x86_abi_support.asm
 CODEC_SRCS-$(BUILD_LIBVPX) += vpx_ports/x86_cpuid.c
 endif
 CODEC_SRCS-$(ARCH_ARM) += vpx_ports/arm_cpudetect.c
-CODEC_SRCS-$(ARCH_ARM) += vpx_ports/arm.h
+CODEC_SRCS-$(ARCH_ARM) += $(BUILD_PFX)vpx_config.asm
 CODEC_EXPORTS-$(BUILD_LIBVPX) += vpx/exports_com
 CODEC_EXPORTS-$(CONFIG_ENCODERS) += vpx/exports_enc
 CODEC_EXPORTS-$(CONFIG_DECODERS) += vpx/exports_dec
@@ -124,7 +121,7 @@ INSTALL-LIBS-$(CONFIG_SHARED) += $(foreach p,$(VS_PLATFORMS),$(LIBSUBDIR)/$(p)/v
 INSTALL-LIBS-$(CONFIG_SHARED) += $(foreach p,$(VS_PLATFORMS),$(LIBSUBDIR)/$(p)/vpx.exp)
 endif
 else
-INSTALL-LIBS-$(CONFIG_STATIC) += $(LIBSUBDIR)/libvpx.a
+INSTALL-LIBS-yes += $(LIBSUBDIR)/libvpx.a
 INSTALL-LIBS-$(CONFIG_DEBUG_LIBS) += $(LIBSUBDIR)/libvpx_g.a
 endif

@@ -132,14 +129,6 @@ CODEC_SRCS=$(call enabled,CODEC_SRCS)
 INSTALL-SRCS-$(CONFIG_CODEC_SRCS) += $(CODEC_SRCS)
 INSTALL-SRCS-$(CONFIG_CODEC_SRCS) += $(call enabled,CODEC_EXPORTS)

-
-# Generate a list of all enabled sources, in particular for exporting to gyp
-# based build systems.
-libvpx_srcs.txt:
-	@echo "    [CREATE] $@"
-	@echo $(CODEC_SRCS) | xargs -n1 echo | sort -u > $@
-
-
 ifeq ($(CONFIG_EXTERNAL_BUILD),yes)
 ifeq ($(CONFIG_MSVS),yes)

@@ -188,15 +177,14 @@ endif
 else
 LIBVPX_OBJS=$(call objs,$(CODEC_SRCS))
 OBJS-$(BUILD_LIBVPX) += $(LIBVPX_OBJS)
-LIBS-$(if $(BUILD_LIBVPX),$(CONFIG_STATIC)) += $(BUILD_PFX)libvpx.a $(BUILD_PFX)libvpx_g.a
+LIBS-$(BUILD_LIBVPX) += $(BUILD_PFX)libvpx.a $(BUILD_PFX)libvpx_g.a
 $(BUILD_PFX)libvpx_g.a: $(LIBVPX_OBJS)

 BUILD_LIBVPX_SO         := $(if $(BUILD_LIBVPX),$(CONFIG_SHARED))
 LIBVPX_SO               := libvpx.so.$(VERSION_MAJOR).$(VERSION_MINOR).$(VERSION_PATCH)
-LIBS-$(BUILD_LIBVPX_SO) += $(BUILD_PFX)$(LIBVPX_SO)\
-                           $(notdir $(LIBVPX_SO_SYMLINKS))
+LIBS-$(BUILD_LIBVPX_SO) += $(BUILD_PFX)$(LIBVPX_SO)
 $(BUILD_PFX)$(LIBVPX_SO): $(LIBVPX_OBJS) libvpx.ver
-$(BUILD_PFX)$(LIBVPX_SO): extralibs += -lm
+$(BUILD_PFX)$(LIBVPX_SO): extralibs += -lm -pthread
 $(BUILD_PFX)$(LIBVPX_SO): SONAME = libvpx.so.$(VERSION_MAJOR)
 $(BUILD_PFX)$(LIBVPX_SO): SO_VERSION_SCRIPT = libvpx.ver
 LIBVPX_SO_SYMLINKS      := $(addprefix $(LIBSUBDIR)/, \
@@ -210,41 +198,12 @@ libvpx.ver: $(call enabled,CODEC_EXPORTS)
 	$(qexec)echo "local: *; };" >> $@
 CLEAN-OBJS += libvpx.ver

-define libvpx_symlink_template
-$(1): $(2)
-	@echo "    [LN]      $$@"
-	$(qexec)ln -sf $(LIBVPX_SO) $$@
-endef
-
-$(eval $(call libvpx_symlink_template,\
-    $(addprefix $(BUILD_PFX),$(notdir $(LIBVPX_SO_SYMLINKS))),\
-    $(BUILD_PFX)$(LIBVPX_SO)))
-$(eval $(call libvpx_symlink_template,\
-    $(addprefix $(DIST_DIR)/,$(LIBVPX_SO_SYMLINKS)),\
-    $(DIST_DIR)/$(LIBSUBDIR)/$(LIBVPX_SO)))
+$(addprefix $(DIST_DIR)/,$(LIBVPX_SO_SYMLINKS)):
+	@echo "    [LN]      $@"
+	$(qexec)ln -sf $(LIBVPX_SO) $@

 INSTALL-LIBS-$(CONFIG_SHARED) += $(LIBVPX_SO_SYMLINKS)
 INSTALL-LIBS-$(CONFIG_SHARED) += $(LIBSUBDIR)/$(LIBVPX_SO)
-
-LIBS-$(BUILD_LIBVPX) += vpx.pc
-vpx.pc: config.mk libs.mk
-	@echo "    [CREATE] $@"
-	$(qexec)echo '# pkg-config file from libvpx $(VERSION_STRING)' > $@
-	$(qexec)echo 'prefix=$(PREFIX)' >> $@
-	$(qexec)echo 'exec_prefix=$${prefix}' >> $@
-	$(qexec)echo 'libdir=$${prefix}/lib' >> $@
-	$(qexec)echo 'includedir=$${prefix}/include' >> $@
-	$(qexec)echo '' >> $@
-	$(qexec)echo 'Name: vpx' >> $@
-	$(qexec)echo 'Description: WebM Project VPx codec implementation' >> $@
-	$(qexec)echo 'Version: $(VERSION_MAJOR).$(VERSION_MINOR).$(VERSION_PATCH)' >> $@
-	$(qexec)echo 'Requires:' >> $@
-	$(qexec)echo 'Conflicts:' >> $@
-	$(qexec)echo 'Libs: -L$${libdir} -lvpx' >> $@
-	$(qexec)echo 'Cflags: -I$${includedir}' >> $@
-INSTALL-LIBS-yes += $(LIBSUBDIR)/pkgconfig/vpx.pc
-INSTALL_MAPS += $(LIBSUBDIR)/pkgconfig/%.pc %.pc
-CLEAN-OBJS += vpx.pc
 endif

 LIBS-$(LIPO_LIBVPX) += libvpx.a
@@ -278,24 +237,8 @@ $(filter %$(ASM).o,$(OBJS-yes)): $(BUILD_PFX)vpx_config.asm
 #
 # Calculate platform- and compiler-specific offsets for hand coded assembly
 #
-
-ifeq ($(filter icc gcc,$(TGT_CC)), $(TGT_CC))
-    $(BUILD_PFX)asm_com_offsets.asm: $(BUILD_PFX)$(VP8_PREFIX)common/asm_com_offsets.c.S
-	grep -w EQU $< | tr -d '$$\#' $(ADS2GAS) > $@
-    $(BUILD_PFX)$(VP8_PREFIX)common/asm_com_offsets.c.S: $(VP8_PREFIX)common/asm_com_offsets.c
-    CLEAN-OBJS += $(BUILD_PFX)asm_com_offsets.asm $(BUILD_PFX)$(VP8_PREFIX)common/asm_com_offsets.c.S
-
-    $(BUILD_PFX)asm_enc_offsets.asm: $(BUILD_PFX)$(VP8_PREFIX)encoder/asm_enc_offsets.c.S
-	grep -w EQU $< | tr -d '$$\#' $(ADS2GAS) > $@
-    $(BUILD_PFX)$(VP8_PREFIX)encoder/asm_enc_offsets.c.S: $(VP8_PREFIX)encoder/asm_enc_offsets.c
-    CLEAN-OBJS += $(BUILD_PFX)asm_enc_offsets.asm $(BUILD_PFX)$(VP8_PREFIX)encoder/asm_enc_offsets.c.S
-
-    $(BUILD_PFX)asm_dec_offsets.asm: $(BUILD_PFX)$(VP8_PREFIX)decoder/asm_dec_offsets.c.S
-	grep -w EQU $< | tr -d '$$\#' $(ADS2GAS) > $@
-    $(BUILD_PFX)$(VP8_PREFIX)decoder/asm_dec_offsets.c.S: $(VP8_PREFIX)decoder/asm_dec_offsets.c
-    CLEAN-OBJS += $(BUILD_PFX)asm_dec_offsets.asm $(BUILD_PFX)$(VP8_PREFIX)decoder/asm_dec_offsets.c.S
-else
-  ifeq ($(filter rvct,$(TGT_CC)), $(TGT_CC))
+ifeq ($(CONFIG_EXTERNAL_BUILD),) # Visual Studio uses obj_int_extract.bat
+  ifeq ($(ARCH_ARM), yes)
    asm_com_offsets.asm: obj_int_extract
    asm_com_offsets.asm: $(VP8_PREFIX)common/asm_com_offsets.c.o
 	./obj_int_extract rvds $< $(ADS2GAS) > $@
@@ -303,19 +246,23 @@ else
    CLEAN-OBJS += asm_com_offsets.asm
    $(filter %$(ASM).o,$(OBJS-yes)): $(BUILD_PFX)asm_com_offsets.asm

-    asm_enc_offsets.asm: obj_int_extract
-    asm_enc_offsets.asm: $(VP8_PREFIX)encoder/asm_enc_offsets.c.o
+    ifeq ($(CONFIG_VP8_ENCODER), yes)
+      asm_enc_offsets.asm: obj_int_extract
+      asm_enc_offsets.asm: $(VP8_PREFIX)encoder/asm_enc_offsets.c.o
 	./obj_int_extract rvds $< $(ADS2GAS) > $@
-    OBJS-yes += $(VP8_PREFIX)encoder/asm_enc_offsets.c.o
-    CLEAN-OBJS += asm_enc_offsets.asm
-    $(filter %$(ASM).o,$(OBJS-yes)): $(BUILD_PFX)asm_enc_offsets.asm
+      OBJS-yes += $(VP8_PREFIX)encoder/asm_enc_offsets.c.o
+      CLEAN-OBJS += asm_enc_offsets.asm
+      $(filter %$(ASM).o,$(OBJS-yes)): $(BUILD_PFX)asm_enc_offsets.asm
+    endif

-    asm_dec_offsets.asm: obj_int_extract
-    asm_dec_offsets.asm: $(VP8_PREFIX)decoder/asm_dec_offsets.c.o
+    ifeq ($(CONFIG_VP8_DECODER), yes)
+      asm_dec_offsets.asm: obj_int_extract
+      asm_dec_offsets.asm: $(VP8_PREFIX)decoder/asm_dec_offsets.c.o
 	./obj_int_extract rvds $< $(ADS2GAS) > $@
-    OBJS-yes += $(VP8_PREFIX)decoder/asm_dec_offsets.c.o
-    CLEAN-OBJS += asm_dec_offsets.asm
-    $(filter %$(ASM).o,$(OBJS-yes)): $(BUILD_PFX)asm_dec_offsets.asm
+      OBJS-yes += $(VP8_PREFIX)decoder/asm_dec_offsets.c.o
+      CLEAN-OBJS += asm_dec_offsets.asm
+      $(filter %$(ASM).o,$(OBJS-yes)): $(BUILD_PFX)asm_dec_offsets.asm
+    endif
  endif
 endif

--- a/tools/author_first_release.sh
+++ b/tools/author_first_release.sh
@@ -1,15 +0,0 @@
-#!/bin/bash
-##
-## List the release each author first contributed to.
-##
-## Usage: author_first_release.sh [TAGS]
-##
-## If the TAGS arguments are unspecified, all tags reported by `git tag`
-## will be considered.
-##
-tags=${@:-$(git tag)}
-for tag in $tags; do
-  git shortlog -n -e -s $tag |
-      cut -f2- |
-      awk "{print \"${tag#v}\t\"\$0}"
-done | sort -k2  | uniq -f2
--- a/vp8/common/alloccommon.c
+++ b/vp8/common/alloccommon.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "blockd.h"
 #include "vpx_mem/vpx_mem.h"
 #include "onyxc_int.h"
@@ -27,9 +27,6 @@ static void update_mode_info_border(MODE_INFO *mi, int rows, int cols)

    for (i = 0; i < rows; i++)
    {
-        /* TODO(holmer): Bug? This updates the last element of each row
-         * rather than the border element!
-         */
        vpx_memset(&mi[i*cols-1], 0, sizeof(MODE_INFO));
    }
 }
@@ -46,11 +43,9 @@ void vp8_de_alloc_frame_buffers(VP8_COMMON *oci)

    vpx_free(oci->above_context);
    vpx_free(oci->mip);
-    vpx_free(oci->prev_mip);

    oci->above_context = 0;
    oci->mip = 0;
-    oci->prev_mip = 0;

 }

@@ -70,9 +65,9 @@ int vp8_alloc_frame_buffers(VP8_COMMON *oci, int width, int height)

    for (i = 0; i < NUM_YV12_BUFFERS; i++)
    {
-        oci->fb_idx_ref_cnt[i] = 0;
-        oci->yv12_fb[i].flags = 0;
-        if (vp8_yv12_alloc_frame_buffer(&oci->yv12_fb[i], width, height, VP8BORDERINPIXELS) < 0)
+      oci->fb_idx_ref_cnt[0] = 0;
+
+      if (vp8_yv12_alloc_frame_buffer(&oci->yv12_fb[i],  width, height, VP8BORDERINPIXELS) < 0)
        {
            vp8_de_alloc_frame_buffers(oci);
            return 1;
@@ -115,21 +110,6 @@ int vp8_alloc_frame_buffers(VP8_COMMON *oci, int width, int height)

    oci->mi = oci->mip + oci->mode_info_stride + 1;

-    /* allocate memory for last frame MODE_INFO array */
-#if CONFIG_ERROR_CONCEALMENT
-    oci->prev_mip = vpx_calloc((oci->mb_cols + 1) * (oci->mb_rows + 1), sizeof(MODE_INFO));
-
-    if (!oci->prev_mip)
-    {
-        vp8_de_alloc_frame_buffers(oci);
-        return 1;
-    }
-
-    oci->prev_mi = oci->prev_mip + oci->mode_info_stride + 1;
-#else
-    oci->prev_mip = NULL;
-    oci->prev_mi = NULL;
-#endif

    oci->above_context = vpx_calloc(sizeof(ENTROPY_CONTEXT_PLANES) * oci->mb_cols, 1);

@@ -140,9 +120,6 @@ int vp8_alloc_frame_buffers(VP8_COMMON *oci, int width, int height)
    }

    update_mode_info_border(oci->mi, oci->mb_rows, oci->mb_cols);
-#if CONFIG_ERROR_CONCEALMENT
-    update_mode_info_border(oci->prev_mi, oci->mb_rows, oci->mb_cols);
-#endif

    return 0;
 }
@@ -152,32 +129,32 @@ void vp8_setup_version(VP8_COMMON *cm)
    {
    case 0:
        cm->no_lpf = 0;
-        cm->filter_type = NORMAL_LOOPFILTER;
+        cm->simpler_lpf = 0;
        cm->use_bilinear_mc_filter = 0;
        cm->full_pixel = 0;
        break;
    case 1:
        cm->no_lpf = 0;
-        cm->filter_type = SIMPLE_LOOPFILTER;
+        cm->simpler_lpf = 1;
        cm->use_bilinear_mc_filter = 1;
        cm->full_pixel = 0;
        break;
    case 2:
        cm->no_lpf = 1;
-        cm->filter_type = NORMAL_LOOPFILTER;
+        cm->simpler_lpf = 0;
        cm->use_bilinear_mc_filter = 1;
        cm->full_pixel = 0;
        break;
    case 3:
        cm->no_lpf = 1;
-        cm->filter_type = SIMPLE_LOOPFILTER;
+        cm->simpler_lpf = 1;
        cm->use_bilinear_mc_filter = 1;
        cm->full_pixel = 1;
        break;
    default:
        /*4,5,6,7 are reserved for future use*/
        cm->no_lpf = 0;
-        cm->filter_type = NORMAL_LOOPFILTER;
+        cm->simpler_lpf = 0;
        cm->use_bilinear_mc_filter = 0;
        cm->full_pixel = 0;
        break;
@@ -186,13 +163,13 @@ void vp8_setup_version(VP8_COMMON *cm)
 void vp8_create_common(VP8_COMMON *oci)
 {
    vp8_machine_specific_config(oci);
-
+    vp8_default_coef_probs(oci);
    vp8_init_mbmode_probs(oci);
    vp8_default_bmode_probs(oci->fc.bmode_prob);

    oci->mb_no_coeff_skip = 1;
    oci->no_lpf = 0;
-    oci->filter_type = NORMAL_LOOPFILTER;
+    oci->simpler_lpf = 0;
    oci->use_bilinear_mc_filter = 0;
    oci->full_pixel = 0;
    oci->multi_token_partition = ONE_PARTITION;
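
Note on the `vp8_setup_version` hunks above: the switch maps a bitstream version to four decoder settings, and versions 4 through 7 fall back to the version 0 values. The same mapping can be read more easily as a table. This is a sketch only; `version_cfg` and `version_cfg_for` are hypothetical names, and the field names follow the `simpler_lpf` side of the diff.

```c
/* Hypothetical table form of the switch in vp8_setup_version(). */
struct version_cfg
{
    int no_lpf;
    int simpler_lpf;
    int use_bilinear_mc_filter;
    int full_pixel;
};

static struct version_cfg version_cfg_for(int version)
{
    /* Index is the bitstream version: 0, 1, 2, 3. */
    static const struct version_cfg table[4] = {
        { 0, 0, 0, 0 },
        { 0, 1, 1, 0 },
        { 1, 0, 1, 0 },
        { 1, 1, 1, 1 },
    };
    /* Versions 4..7 are reserved and use the version 0 settings. */
    static const struct version_cfg reserved = { 0, 0, 0, 0 };

    if (version < 0 || version > 3)
        return reserved;

    return table[version];
}
```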
--- a/vp8/common/arm/arm_systemdependent.c
+++ b/vp8/common/arm/arm_systemdependent.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "vpx_ports/arm.h"
 #include "vp8/common/g_common.h"
 #include "vp8/common/pragmas.h"
@@ -24,17 +24,14 @@ void vp8_arch_arm_common_init(VP8_COMMON *ctx)
 #if CONFIG_RUNTIME_CPU_DETECT
    VP8_COMMON_RTCD *rtcd = &ctx->rtcd;
    int flags = arm_cpu_caps();
+    int has_edsp = flags & HAS_EDSP;
+    int has_media = flags & HAS_MEDIA;
+    int has_neon = flags & HAS_NEON;
    rtcd->flags = flags;

    /* Override default functions with fastest ones for this CPU. */
-#if HAVE_ARMV5TE
-    if (flags & HAS_EDSP)
-    {
-    }
-#endif
-
 #if HAVE_ARMV6
-    if (flags & HAS_MEDIA)
+    if (has_media)
    {
        rtcd->subpix.sixtap16x16   = vp8_sixtap_predict16x16_armv6;
        rtcd->subpix.sixtap8x8     = vp8_sixtap_predict8x8_armv6;
@@ -54,11 +51,9 @@ void vp8_arch_arm_common_init(VP8_COMMON *ctx)
        rtcd->loopfilter.normal_b_v  = vp8_loop_filter_bv_armv6;
        rtcd->loopfilter.normal_mb_h = vp8_loop_filter_mbh_armv6;
        rtcd->loopfilter.normal_b_h  = vp8_loop_filter_bh_armv6;
-        rtcd->loopfilter.simple_mb_v =
-                vp8_loop_filter_simple_vertical_edge_armv6;
+        rtcd->loopfilter.simple_mb_v = vp8_loop_filter_mbvs_armv6;
        rtcd->loopfilter.simple_b_v  = vp8_loop_filter_bvs_armv6;
-        rtcd->loopfilter.simple_mb_h =
-                vp8_loop_filter_simple_horizontal_edge_armv6;
+        rtcd->loopfilter.simple_mb_h = vp8_loop_filter_mbhs_armv6;
        rtcd->loopfilter.simple_b_h  = vp8_loop_filter_bhs_armv6;

        rtcd->recon.copy16x16   = vp8_copy_mem16x16_v6;
@@ -71,7 +66,7 @@ void vp8_arch_arm_common_init(VP8_COMMON *ctx)
 #endif

 #if HAVE_ARMV7
-    if (flags & HAS_NEON)
+    if (has_neon)
    {
        rtcd->subpix.sixtap16x16   = vp8_sixtap_predict16x16_neon;
        rtcd->subpix.sixtap8x8     = vp8_sixtap_predict8x8_neon;
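
Review note: the hunks above cache the capability bits in has_media and has_neon and then override the default function pointers with faster variants. The pattern can be sketched in plain C as follows. This is a minimal illustration, not the libvpx code: the CAP_* values, the dispatch_table struct and the predict_* stand-ins are hypothetical names, and the HAVE_ARMV6 / HAVE_ARMV7 compile-time guards are left out.

```c
/* Hypothetical capability bits; the real HAS_* values come from vpx_ports/arm.h. */
enum { CAP_EDSP = 1 << 0, CAP_MEDIA = 1 << 1, CAP_NEON = 1 << 2 };

typedef int (*predict_fn)(int);

/* Stand-ins for the real prediction routines (assumptions for illustration). */
static int predict_c(int x)     { return x; }      /* portable fallback */
static int predict_armv6(int x) { return x + 6; }  /* pretend ARMv6 version */
static int predict_neon(int x)  { return x + 7; }  /* pretend NEON version */

struct dispatch_table {
    int        flags;
    predict_fn sixtap16x16;
};

/* Start from the portable default, then let each detected capability
 * override it. Later overrides win, so NEON takes precedence over ARMv6. */
static void init_dispatch(struct dispatch_table *t, int flags)
{
    int has_media = flags & CAP_MEDIA;
    int has_neon  = flags & CAP_NEON;

    t->flags = flags;
    t->sixtap16x16 = predict_c;

    if (has_media)
        t->sixtap16x16 = predict_armv6;

    if (has_neon)
        t->sixtap16x16 = predict_neon;
}
```

Callers only ever go through the table entry, so the choice of implementation is made once at init time rather than on every call.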
--- a/vp8/common/arm/armv6/bilinearfilter_v6.asm
+++ b/vp8/common/arm/armv6/bilinearfilter_v6.asm
@@ -30,12 +30,12 @@
    ldr     r4, [sp, #36]                   ; width

    mov     r12, r3                         ; outer-loop counter
-
-    add     r7, r2, r4                      ; preload next row
-    pld     [r0, r7]
-
    sub     r2, r2, r4                      ; src increment for height loop

+    ;;IF ARCHITECTURE=6
+    pld     [r0]
+    ;;ENDIF
+
    ldr     r5, [r11]                       ; load up filter coefficients

    mov     r3, r3, lsl #1                  ; height*2
@@ -96,8 +96,9 @@
    add     r0, r0, r2                      ; move to next input row
    subs    r12, r12, #1

-    add     r9, r2, r4, lsl #1              ; adding back block width
-    pld     [r0, r9]                        ; preload next row
+    ;;IF ARCHITECTURE=6
+    pld     [r0]
+    ;;ENDIF

    add     r11, r11, #2                    ; move over to next column
    mov     r1, r11
--- a/vp8/common/arm/armv6/copymem16x16_v6.asm
+++ b/vp8/common/arm/armv6/copymem16x16_v6.asm
@@ -22,7 +22,9 @@
    ;push   {r4-r7}

    ;preload
-    pld     [r0, #31]                ; preload for next 16x16 block
+    pld     [r0]
+    pld     [r0, r1]
+    pld     [r0, r1, lsl #1]

    ands    r4, r0, #15
    beq     copy_mem16x16_fast
@@ -88,8 +90,6 @@ copy_mem16x16_1_loop
    ldrneb  r6, [r0, #2]
    ldrneb  r7, [r0, #3]

-    pld     [r0, #31]               ; preload for next 16x16 block
-
    bne     copy_mem16x16_1_loop

    ldmia       sp!, {r4 - r7}
@@ -121,8 +121,6 @@ copy_mem16x16_4_loop
    ldrne   r6, [r0, #8]
    ldrne   r7, [r0, #12]

-    pld     [r0, #31]               ; preload for next 16x16 block
-
    bne     copy_mem16x16_4_loop

    ldmia       sp!, {r4 - r7}
@@ -150,7 +148,6 @@ copy_mem16x16_8_loop

    add     r2, r2, r3

-    pld     [r0, #31]               ; preload for next 16x16 block
    bne     copy_mem16x16_8_loop

    ldmia       sp!, {r4 - r7}
@@ -174,7 +171,6 @@ copy_mem16x16_fast_loop
    ;stm        r2, {r4-r7}
    add     r2, r2, r3

-    pld     [r0, #31]               ; preload for next 16x16 block
    bne     copy_mem16x16_fast_loop

    ldmia       sp!, {r4 - r7}
--- a/vp8/common/arm/armv6/filter_v6.asm
+++ b/vp8/common/arm/armv6/filter_v6.asm
@@ -10,8 +10,6 @@


    EXPORT  |vp8_filter_block2d_first_pass_armv6|
-    EXPORT  |vp8_filter_block2d_first_pass_16x16_armv6|
-    EXPORT  |vp8_filter_block2d_first_pass_8x8_armv6|
    EXPORT  |vp8_filter_block2d_second_pass_armv6|
    EXPORT  |vp8_filter4_block2d_second_pass_armv6|
    EXPORT  |vp8_filter_block2d_first_pass_only_armv6|
@@ -42,6 +40,11 @@
    add     r12, r3, #16                    ; square off the output
    sub     sp, sp, #4

+    ;;IF ARCHITECTURE=6
+    ;pld        [r0, #-2]
+    ;;pld       [r0, #30]
+    ;;ENDIF
+
    ldr     r4, [r11]                       ; load up packed filter coefficients
    ldr     r5, [r11, #4]
    ldr     r6, [r11, #8]
@@ -98,10 +101,15 @@

    bne     width_loop_1st_6

+    ;;add       r9, r2, #30                 ; attempt to load 2 adjacent cache lines
+    ;;IF ARCHITECTURE=6
+    ;pld        [r0, r2]
+    ;;pld       [r0, r9]
+    ;;ENDIF
+
    ldr     r1, [sp]                        ; load and update dst address
    subs    r7, r7, #0x10000
    add     r0, r0, r2                      ; move to next input line
-
    add     r1, r1, #2                      ; move over to next column
    str     r1, [sp]

@@ -112,192 +120,6 @@

    ENDP

-; --------------------------
-; 16x16 version
-; -----------------------------
-|vp8_filter_block2d_first_pass_16x16_armv6| PROC
-    stmdb   sp!, {r4 - r11, lr}
-
-    ldr     r11, [sp, #40]                  ; vp8_filter address
-    ldr     r7, [sp, #36]                   ; output height
-
-    add     r4, r2, #18                     ; preload next low
-    pld     [r0, r4]
-
-    sub     r2, r2, r3                      ; inside loop increments input array,
-                                            ; so the height loop only needs to add
-                                            ; r2 - width to the input pointer
-
-    mov     r3, r3, lsl #1                  ; multiply width by 2 because using shorts
-    add     r12, r3, #16                    ; square off the output
-    sub     sp, sp, #4
-
-    ldr     r4, [r11]                       ; load up packed filter coefficients
-    ldr     r5, [r11, #4]
-    ldr     r6, [r11, #8]
-
-    str     r1, [sp]                        ; push destination to stack
-    mov     r7, r7, lsl #16                 ; height is top part of counter
-
-; six tap filter
-|height_loop_1st_16_6|
-    ldrb    r8, [r0, #-2]                   ; load source data
-    ldrb    r9, [r0, #-1]
-    ldrb    r10, [r0], #2
-    orr     r7, r7, r3, lsr #2              ; construct loop counter
-
-|width_loop_1st_16_6|
-    ldrb    r11, [r0, #-1]
-
-    pkhbt   lr, r8, r9, lsl #16             ; r9 | r8
-    pkhbt   r8, r9, r10, lsl #16            ; r10 | r9
-
-    ldrb    r9, [r0]
-
-    smuad   lr, lr, r4                      ; apply the filter
-    pkhbt   r10, r10, r11, lsl #16          ; r11 | r10
-    smuad   r8, r8, r4
-    pkhbt   r11, r11, r9, lsl #16           ; r9 | r11
-
-    smlad   lr, r10, r5, lr
-    ldrb    r10, [r0, #1]
-    smlad   r8, r11, r5, r8
-    ldrb    r11, [r0, #2]
-
-    sub     r7, r7, #1
-
-    pkhbt   r9, r9, r10, lsl #16            ; r10 | r9
-    pkhbt   r10, r10, r11, lsl #16          ; r11 | r10
-
-    smlad   lr, r9, r6, lr
-    smlad   r11, r10, r6, r8
-
-    ands    r10, r7, #0xff                  ; test loop counter
-
-    add     lr, lr, #0x40                   ; round_shift_and_clamp
-    ldrneb  r8, [r0, #-2]                   ; load data for next loop
-    usat    lr, #8, lr, asr #7
-    add     r11, r11, #0x40
-    ldrneb  r9, [r0, #-1]
-    usat    r11, #8, r11, asr #7
-
-    strh    lr, [r1], r12                   ; result is transposed and stored, which
-                                            ; will make second pass filtering easier.
-    ldrneb  r10, [r0], #2
-    strh    r11, [r1], r12
-
-    bne     width_loop_1st_16_6
-
-    ldr     r1, [sp]                        ; load and update dst address
-    subs    r7, r7, #0x10000
-    add     r0, r0, r2                      ; move to next input line
-
-    add     r11, r2, #34                    ; adding back block width(=16)
-    pld     [r0, r11]                       ; preload next low
-
-    add     r1, r1, #2                      ; move over to next column
-    str     r1, [sp]
-
-    bne     height_loop_1st_16_6
-
-    add     sp, sp, #4
-    ldmia   sp!, {r4 - r11, pc}
-
-    ENDP
-
-; --------------------------
-; 8x8 version
-; -----------------------------
-|vp8_filter_block2d_first_pass_8x8_armv6| PROC
-    stmdb   sp!, {r4 - r11, lr}
-
-    ldr     r11, [sp, #40]                  ; vp8_filter address
-    ldr     r7, [sp, #36]                   ; output height
-
-    add     r4, r2, #10                     ; preload next low
-    pld     [r0, r4]
-
-    sub     r2, r2, r3                      ; inside loop increments input array,
-                                            ; so the height loop only needs to add
-                                            ; r2 - width to the input pointer
-
-    mov     r3, r3, lsl #1                  ; multiply width by 2 because using shorts
-    add     r12, r3, #16                    ; square off the output
-    sub     sp, sp, #4
-
-    ldr     r4, [r11]                       ; load up packed filter coefficients
-    ldr     r5, [r11, #4]
-    ldr     r6, [r11, #8]
-
-    str     r1, [sp]                        ; push destination to stack
-    mov     r7, r7, lsl #16                 ; height is top part of counter
-
-; six tap filter
-|height_loop_1st_8_6|
-    ldrb    r8, [r0, #-2]                   ; load source data
-    ldrb    r9, [r0, #-1]
-    ldrb    r10, [r0], #2
-    orr     r7, r7, r3, lsr #2              ; construct loop counter
-
-|width_loop_1st_8_6|
-    ldrb    r11, [r0, #-1]
-
-    pkhbt   lr, r8, r9, lsl #16             ; r9 | r8
-    pkhbt   r8, r9, r10, lsl #16            ; r10 | r9
-
-    ldrb    r9, [r0]
-
-    smuad   lr, lr, r4                      ; apply the filter
-    pkhbt   r10, r10, r11, lsl #16          ; r11 | r10
-    smuad   r8, r8, r4
-    pkhbt   r11, r11, r9, lsl #16           ; r9 | r11
-
-    smlad   lr, r10, r5, lr
-    ldrb    r10, [r0, #1]
-    smlad   r8, r11, r5, r8
-    ldrb    r11, [r0, #2]
-
-    sub     r7, r7, #1
-
-    pkhbt   r9, r9, r10, lsl #16            ; r10 | r9
-    pkhbt   r10, r10, r11, lsl #16          ; r11 | r10
-
-    smlad   lr, r9, r6, lr
-    smlad   r11, r10, r6, r8
-
-    ands    r10, r7, #0xff                  ; test loop counter
-
-    add     lr, lr, #0x40                   ; round_shift_and_clamp
-    ldrneb  r8, [r0, #-2]                   ; load data for next loop
-    usat    lr, #8, lr, asr #7
-    add     r11, r11, #0x40
-    ldrneb  r9, [r0, #-1]
-    usat    r11, #8, r11, asr #7
-
-    strh    lr, [r1], r12                   ; result is transposed and stored, which
-                                            ; will make second pass filtering easier.
-    ldrneb  r10, [r0], #2
-    strh    r11, [r1], r12
-
-    bne     width_loop_1st_8_6
-
-    ldr     r1, [sp]                        ; load and update dst address
-    subs    r7, r7, #0x10000
-    add     r0, r0, r2                      ; move to next input line
-
-    add     r11, r2, #18                    ; adding back block width(=8)
-    pld     [r0, r11]                       ; preload next low
-
-    add     r1, r1, #2                      ; move over to next column
-    str     r1, [sp]
-
-    bne     height_loop_1st_8_6
-
-    add     sp, sp, #4
-    ldmia   sp!, {r4 - r11, pc}
-
-    ENDP
-
 ;---------------------------------
 ; r0    short         *src_ptr,
 ; r1    unsigned char *output_ptr,
@@ -440,10 +262,6 @@
 |vp8_filter_block2d_first_pass_only_armv6| PROC
    stmdb   sp!, {r4 - r11, lr}

-    add     r7, r2, r3                      ; preload next low
-    add     r7, r7, #2
-    pld     [r0, r7]
-
    ldr     r4, [sp, #36]                   ; output pitch
    ldr     r11, [sp, #40]                  ; HFilter address
    sub     sp, sp, #8
@@ -512,15 +330,16 @@

    bne     width_loop_1st_only_6

+    ;;add       r9, r2, #30                 ; attempt to load 2 adjacent cache lines
+    ;;IF ARCHITECTURE=6
+    ;pld        [r0, r2]
+    ;;pld       [r0, r9]
+    ;;ENDIF
+
    ldr     lr, [sp]                        ; load back output pitch
    ldr     r12, [sp, #4]                   ; load back output pitch
    subs    r7, r7, #1
    add     r0, r0, r12                     ; updata src for next loop
-
-    add     r11, r12, r3                    ; preload next low
-    add     r11, r11, #2
-    pld     [r0, r11]
-
    add     r1, r1, lr                      ; update dst for next loop

    bne     height_loop_1st_only_6
--- a/vp8/common/arm/armv6/loopfilter_v6.asm
+++ b/vp8/common/arm/armv6/loopfilter_v6.asm
@@ -53,11 +53,14 @@ count       RN  r5

 ;r0     unsigned char *src_ptr,
 ;r1     int src_pixel_step,
-;r2     const char *blimit,
+;r2     const char *flimit,
 ;r3     const char *limit,
 ;stack  const char *thresh,
 ;stack  int  count

+;Note: All 16 elements in flimit are equal, so the code needs only one load
+;for flimit. The same applies to limit and thresh.
+
 ;-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 |vp8_loop_filter_horizontal_edge_armv6| PROC
 ;-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
@@ -69,18 +72,14 @@ count       RN  r5
    sub         sp, sp, #16                 ; create temp buffer

    ldr         r9, [src], pstep            ; p3
-    ldrb        r4, [r2]                    ; blimit
+    ldr         r4, [r2], #4                ; flimit
    ldr         r10, [src], pstep           ; p2
-    ldrb        r2, [r3]                    ; limit
+    ldr         r2, [r3], #4                ; limit
    ldr         r11, [src], pstep           ; p1
-    orr         r4, r4, r4, lsl #8
-    ldrb        r3, [r6]                    ; thresh
-    orr         r2, r2, r2, lsl #8
+    uadd8       r4, r4, r4                  ; flimit * 2
+    ldr         r3, [r6], #4                ; thresh
    mov         count, count, lsl #1        ; 4-in-parallel
-    orr         r4, r4, r4, lsl #16
-    orr         r3, r3, r3, lsl #8
-    orr         r2, r2, r2, lsl #16
-    orr         r3, r3, r3, lsl #16
+    uadd8       r4, r4, r2                  ; flimit * 2 + limit

 |Hnext8|
    ; vp8_filter_mask() function
@@ -254,6 +253,12 @@ count       RN  r5

    subs        count, count, #1

+    ;pld            [src]
+    ;pld            [src, pstep]
+    ;pld            [src, pstep, lsl #1]
+    ;pld            [src, pstep, lsl #2]
+    ;pld            [src, pstep, lsl #3]
+
    ldrne       r9, [src], pstep            ; p3
    ldrne       r10, [src], pstep           ; p2
    ldrne       r11, [src], pstep           ; p1
@@ -276,18 +281,14 @@ count       RN  r5
    sub         sp, sp, #16                 ; create temp buffer

    ldr         r9, [src], pstep            ; p3
-    ldrb        r4, [r2]                    ; blimit
+    ldr         r4, [r2], #4                ; flimit
    ldr         r10, [src], pstep           ; p2
-    ldrb        r2, [r3]                    ; limit
+    ldr         r2, [r3], #4                ; limit
    ldr         r11, [src], pstep           ; p1
-    orr         r4, r4, r4, lsl #8
-    ldrb        r3, [r6]                    ; thresh
-    orr         r2, r2, r2, lsl #8
+    uadd8       r4, r4, r4                  ; flimit * 2
+    ldr         r3, [r6], #4                ; thresh
    mov         count, count, lsl #1        ; 4-in-parallel
-    orr         r4, r4, r4, lsl #16
-    orr         r3, r3, r3, lsl #8
-    orr         r2, r2, r2, lsl #16
-    orr         r3, r3, r3, lsl #16
+    uadd8       r4, r4, r2                  ; flimit * 2 + limit

 |MBHnext8|

@@ -589,19 +590,15 @@ count       RN  r5
    sub         sp, sp, #16                 ; create temp buffer

    ldr         r6, [src], pstep            ; load source data
-    ldrb        r4, [r2]                    ; blimit
+    ldr         r4, [r2], #4                ; flimit
    ldr         r7, [src], pstep
-    ldrb        r2, [r3]                    ; limit
+    ldr         r2, [r3], #4                ; limit
    ldr         r8, [src], pstep
-    orr         r4, r4, r4, lsl #8
-    ldrb        r3, [r12]                   ; thresh
-    orr         r2, r2, r2, lsl #8
+    uadd8       r4, r4, r4                  ; flimit * 2
+    ldr         r3, [r12], #4               ; thresh
    ldr         lr, [src], pstep
    mov         count, count, lsl #1        ; 4-in-parallel
-    orr         r4, r4, r4, lsl #16
-    orr         r3, r3, r3, lsl #8
-    orr         r2, r2, r2, lsl #16
-    orr         r3, r3, r3, lsl #16
+    uadd8       r4, r4, r2                  ; flimit * 2 + limit

 |Vnext8|

@@ -860,26 +857,18 @@ count       RN  r5
    sub         src, src, #4                ; move src pointer down by 4
    ldr         count, [sp, #40]            ; count for 8-in-parallel
    ldr         r12, [sp, #36]              ; load thresh address
-    pld         [src, #23]                  ; preload for next block
    sub         sp, sp, #16                 ; create temp buffer

    ldr         r6, [src], pstep            ; load source data
-    ldrb        r4, [r2]                    ; blimit
-    pld         [src, #23]
+    ldr         r4, [r2], #4                ; flimit
    ldr         r7, [src], pstep
-    ldrb        r2, [r3]                    ; limit
-    pld         [src, #23]
+    ldr         r2, [r3], #4                ; limit
    ldr         r8, [src], pstep
-    orr         r4, r4, r4, lsl #8
-    ldrb        r3, [r12]                   ; thresh
-    orr         r2, r2, r2, lsl #8
-    pld         [src, #23]
+    uadd8       r4, r4, r4                  ; flimit * 2
+    ldr         r3, [r12], #4               ; thresh
    ldr         lr, [src], pstep
    mov         count, count, lsl #1        ; 4-in-parallel
-    orr         r4, r4, r4, lsl #16
-    orr         r3, r3, r3, lsl #8
-    orr         r2, r2, r2, lsl #16
-    orr         r3, r3, r3, lsl #16
+    uadd8       r4, r4, r2                  ; flimit * 2 + limit

 |MBVnext8|
    ; vp8_filter_mask() function
@@ -919,7 +908,6 @@ count       RN  r5
    str         lr, [sp, #8]
    ldr         lr, [src], pstep

-
    TRANSPOSE_MATRIX r6, r7, r8, lr, r9, r10, r11, r12

    ldr         lr, [sp, #8]                ; load back (f)limit accumulator
@@ -968,7 +956,6 @@ count       RN  r5
    beq         mbvskip_filter               ; skip filtering


-
    ;vp8_hevmask() function
    ;calculate high edge variance

@@ -1136,7 +1123,6 @@ count       RN  r5
    smlabb      r8, r6, lr, r7
    smlatb      r6, r6, lr, r7
    smlabb      r9, r10, lr, r7
-
    smlatb      r10, r10, lr, r7
    ssat        r8, #8, r8, asr #7
    ssat        r6, #8, r6, asr #7
@@ -1256,13 +1242,9 @@ count       RN  r5
    sub         src, src, #4
    subs        count, count, #1

-    pld         [src, #23]                  ; preload for next block
    ldrne       r6, [src], pstep            ; load source data
-    pld         [src, #23]
    ldrne       r7, [src], pstep
-    pld         [src, #23]
    ldrne       r8, [src], pstep
-    pld         [src, #23]
    ldrne       lr, [src], pstep

    bne         MBVnext8
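
Review note: the loopfilter_v6.asm hunks above swap two ways of building the packed 4-byte limit word. The removed lines load a single byte and replicate it with two orr/lsl steps. The added lines load a whole word whose bytes are already equal and combine flimit and limit with uadd8. The C model below is a sketch of that arithmetic only. The function names are hypothetical, and uadd8() models just the per-byte wrap-around sum of the instruction, not its GE flag side effects.

```c
#include <stdint.h>

/* Removed approach: copy one byte into all four lanes
 * (orr r, r, r, lsl #8 followed by orr r, r, r, lsl #16). */
static uint32_t replicate_byte(uint8_t b)
{
    uint32_t r = b;
    r |= r << 8;
    r |= r << 16;
    return r;
}

/* Model of UADD8: four independent unsigned 8-bit additions.
 * Each lane wraps modulo 256; no carry crosses into the next lane. */
static uint32_t uadd8(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t sum = ((a >> (8 * lane)) & 0xffu) + ((b >> (8 * lane)) & 0xffu);
        out |= (sum & 0xffu) << (8 * lane);
    }
    return out;
}

/* Added approach: flimit4 and limit4 each hold four equal bytes. */
static uint32_t combined_limit(uint32_t flimit4, uint32_t limit4)
{
    uint32_t twice = uadd8(flimit4, flimit4);  /* flimit * 2 */
    return uadd8(twice, limit4);               /* flimit * 2 + limit */
}
```

Because each lane wraps independently, the result is only meaningful while flimit * 2 + limit stays below 256 in every byte.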
--- a/vp8/common/arm/armv6/simpleloopfilter_v6.asm
+++ b/vp8/common/arm/armv6/simpleloopfilter_v6.asm
@@ -45,28 +45,35 @@
    MEND


-
 src         RN  r0
 pstep       RN  r1

 ;r0     unsigned char *src_ptr,
 ;r1     int src_pixel_step,
-;r2     const char *blimit
+;r2     const char *flimit,
+;r3     const char *limit,
+;stack  const char *thresh,
+;stack  int  count
+
+; All 16 elements in flimit are equal, so the code needs only one load
+; for flimit. The same applies to limit. thresh is not used in the simple loopfilter.

 ;-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 |vp8_loop_filter_simple_horizontal_edge_armv6| PROC
 ;-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    stmdb       sp!, {r4 - r11, lr}

-    ldrb        r12, [r2]                   ; blimit
+    ldr         r12, [r3]                   ; limit
    ldr         r3, [src, -pstep, lsl #1]   ; p1
    ldr         r4, [src, -pstep]           ; p0
    ldr         r5, [src]                   ; q0
    ldr         r6, [src, pstep]            ; q1
-    orr         r12, r12, r12, lsl #8       ; blimit
+    ldr         r7, [r2]                    ; flimit
    ldr         r2, c0x80808080
-    orr         r12, r12, r12, lsl #16      ; blimit
-    mov         r9, #4                      ; double the count. we're doing 4 at a time
+    ldr         r9, [sp, #40]               ; count for 8-in-parallel
+    uadd8       r7, r7, r7                  ; flimit * 2
+    mov         r9, r9, lsl #1              ; double the count. we're doing 4 at a time
+    uadd8       r12, r7, r12                ; flimit * 2 + limit
    mov         lr, #0                      ; need 0 in a couple places

 |simple_hnext8|
@@ -141,32 +148,30 @@ pstep       RN  r1
 ;-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    stmdb       sp!, {r4 - r11, lr}

-    ldrb        r12, [r2]                   ; r12: blimit
+    ldr         r12, [r2]                   ; r12: flimit
    ldr         r2, c0x80808080
-    orr         r12, r12, r12, lsl #8
+    ldr         r7, [r3]                    ; limit

    ; load soure data to r7, r8, r9, r10
    ldrh        r3, [src, #-2]
-    pld         [src, #23]                  ; preload for next block
    ldrh        r4, [src], pstep
-    orr         r12, r12, r12, lsl #16
+    uadd8       r12, r12, r12               ; flimit * 2

    ldrh        r5, [src, #-2]
-    pld         [src, #23]
    ldrh        r6, [src], pstep
+    uadd8       r12, r12, r7                ; flimit * 2 + limit

    pkhbt       r7, r3, r4, lsl #16

    ldrh        r3, [src, #-2]
-    pld         [src, #23]
    ldrh        r4, [src], pstep
+    ldr         r11, [sp, #40]              ; count (r11) for 8-in-parallel

    pkhbt       r8, r5, r6, lsl #16

    ldrh        r5, [src, #-2]
-    pld         [src, #23]
    ldrh        r6, [src], pstep
-    mov         r11, #4                     ; double the count. we're doing 4 at a time
+    mov         r11, r11, lsl #1            ; 4-in-parallel

 |simple_vnext8|
    ; vp8_simple_filter_mask() function
@@ -254,23 +259,19 @@ pstep       RN  r1

    ; load soure data to r7, r8, r9, r10
    ldrneh      r3, [src, #-2]
-    pld         [src, #23]                  ; preload for next block
    ldrneh      r4, [src], pstep

    ldrneh      r5, [src, #-2]
-    pld         [src, #23]
    ldrneh      r6, [src], pstep

    pkhbt       r7, r3, r4, lsl #16

    ldrneh      r3, [src, #-2]
-    pld         [src, #23]
    ldrneh      r4, [src], pstep

    pkhbt       r8, r5, r6, lsl #16

    ldrneh      r5, [src, #-2]
-    pld         [src, #23]
    ldrneh      r6, [src], pstep

    bne         simple_vnext8
--- a/vp8/common/arm/armv6/sixtappredict8x4_v6.asm
+++ b/vp8/common/arm/armv6/sixtappredict8x4_v6.asm
@@ -32,12 +32,9 @@
    beq         skip_firstpass_filter

 ;first-pass filter
-    adr         r12, filter8_coeff
+    ldr         r12, _filter8_coeff_
    sub         r0, r0, r1, lsl #1

-    add         r3, r1, #10                 ; preload next low
-    pld         [r0, r3]
-
    add         r2, r12, r2, lsl #4         ;calculate filter location
    add         r0, r0, #3                  ;adjust src only for loading convinience

@@ -113,9 +110,6 @@

    add         r0, r0, r1                  ; move to next input line

-    add         r11, r1, #18                ; preload next low. adding back block width(=8), which is subtracted earlier
-    pld         [r0, r11]
-
    bne         first_pass_hloop_v6

 ;second pass filter
@@ -127,7 +121,7 @@ secondpass_filter
    cmp         r3, #0
    beq         skip_secondpass_filter

-    adr         r12, filter8_coeff
+    ldr         r12, _filter8_coeff_
    add         lr, r12, r3, lsl #4         ;calculate filter location

    mov         r2, #0x00080000
@@ -251,6 +245,8 @@ skip_secondpass_hloop
 ;-----------------
 ;One word each is reserved. Label filter_coeff can be used to access the data.
 ;Data address: filter_coeff, filter_coeff+4, filter_coeff+8 ...
+_filter8_coeff_
+    DCD     filter8_coeff
 filter8_coeff
    DCD     0x00000000,     0x00000080,     0x00000000,     0x00000000
    DCD     0xfffa0000,     0x000c007b,     0x0000ffff,     0x00000000
--- a/vp8/common/arm/filter_arm.c
+++ b/vp8/common/arm/filter_arm.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include <math.h>
 #include "vp8/common/filter.h"
 #include "vp8/common/subpixel.h"
@@ -25,28 +25,6 @@ extern void vp8_filter_block2d_first_pass_armv6
    const short *vp8_filter
 );

-// 8x8
-extern void vp8_filter_block2d_first_pass_8x8_armv6
-(
-    unsigned char *src_ptr,
-    short         *output_ptr,
-    unsigned int src_pixels_per_line,
-    unsigned int output_width,
-    unsigned int output_height,
-    const short *vp8_filter
-);
-
-// 16x16
-extern void vp8_filter_block2d_first_pass_16x16_armv6
-(
-    unsigned char *src_ptr,
-    short         *output_ptr,
-    unsigned int src_pixels_per_line,
-    unsigned int output_width,
-    unsigned int output_height,
-    const short *vp8_filter
-);
-
 extern void vp8_filter_block2d_second_pass_armv6
 (
    short         *src_ptr,
@@ -165,12 +143,12 @@ void vp8_sixtap_predict8x8_armv6
    {
        if (yoffset & 0x1)
        {
-            vp8_filter_block2d_first_pass_8x8_armv6(src_ptr - src_pixels_per_line, FData + 1, src_pixels_per_line, 8, 11, HFilter);
+            vp8_filter_block2d_first_pass_armv6(src_ptr - src_pixels_per_line, FData + 1, src_pixels_per_line, 8, 11, HFilter);
            vp8_filter4_block2d_second_pass_armv6(FData + 2, dst_ptr, dst_pitch, 8, VFilter);
        }
        else
        {
-            vp8_filter_block2d_first_pass_8x8_armv6(src_ptr - (2 * src_pixels_per_line), FData, src_pixels_per_line, 8, 13, HFilter);
+            vp8_filter_block2d_first_pass_armv6(src_ptr - (2 * src_pixels_per_line), FData, src_pixels_per_line, 8, 13, HFilter);
            vp8_filter_block2d_second_pass_armv6(FData + 2, dst_ptr, dst_pitch, 8, VFilter);
        }
    }
@@ -207,12 +185,12 @@ void vp8_sixtap_predict16x16_armv6
    {
        if (yoffset & 0x1)
        {
-            vp8_filter_block2d_first_pass_16x16_armv6(src_ptr - src_pixels_per_line, FData + 1, src_pixels_per_line, 16, 19, HFilter);
+            vp8_filter_block2d_first_pass_armv6(src_ptr - src_pixels_per_line, FData + 1, src_pixels_per_line, 16, 19, HFilter);
            vp8_filter4_block2d_second_pass_armv6(FData + 2, dst_ptr, dst_pitch, 16, VFilter);
        }
        else
        {
-            vp8_filter_block2d_first_pass_16x16_armv6(src_ptr - (2 * src_pixels_per_line), FData, src_pixels_per_line, 16, 21, HFilter);
+            vp8_filter_block2d_first_pass_armv6(src_ptr - (2 * src_pixels_per_line), FData, src_pixels_per_line, 16, 21, HFilter);
            vp8_filter_block2d_second_pass_armv6(FData + 2, dst_ptr, dst_pitch, 16, VFilter);
        }
    }
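
Review note: the row counts passed to the first pass in the filter_arm.c hunks above (11 or 13 for an 8-row block, 19 or 21 for a 16-row block) follow from the length of the vertical filter that runs afterwards. The sketch below assumes, based on the function names and on the source pointer offsets, that vp8_filter4_block2d_second_pass_armv6 is a 4-tap filter and vp8_filter_block2d_second_pass_armv6 a 6-tap filter. The helper names are hypothetical.

```c
/* Rows the horizontal first pass must produce so that a vertical filter
 * with `taps` taps can emit `out_height` output rows. */
static unsigned int first_pass_rows(unsigned int out_height, unsigned int taps)
{
    return out_height + taps - 1;
}

/* Rows above the block that the first pass starts from:
 * 1 for a 4-tap filter (src_ptr - pitch), 2 for a 6-tap filter
 * (src_ptr - 2 * pitch). */
static unsigned int rows_above(unsigned int taps)
{
    return taps / 2 - 1;
}
```

With these helpers, an 8x8 block gives 8 + 4 - 1 = 11 and 8 + 6 - 1 = 13 rows, and a 16x16 block gives 19 and 21, matching the constants in the calls.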
--- a/vp8/common/arm/loopfilter_arm.c
+++ b/vp8/common/arm/loopfilter_arm.c
@@ -9,107 +9,135 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
+#include <math.h>
 #include "vp8/common/loopfilter.h"
 #include "vp8/common/onyxc_int.h"

-#if HAVE_ARMV6
 extern prototype_loopfilter(vp8_loop_filter_horizontal_edge_armv6);
 extern prototype_loopfilter(vp8_loop_filter_vertical_edge_armv6);
 extern prototype_loopfilter(vp8_mbloop_filter_horizontal_edge_armv6);
 extern prototype_loopfilter(vp8_mbloop_filter_vertical_edge_armv6);
-#endif
+extern prototype_loopfilter(vp8_loop_filter_simple_horizontal_edge_armv6);
+extern prototype_loopfilter(vp8_loop_filter_simple_vertical_edge_armv6);

-#if HAVE_ARMV7
-typedef void loopfilter_y_neon(unsigned char *src, int pitch,
-        unsigned char blimit, unsigned char limit, unsigned char thresh);
-typedef void loopfilter_uv_neon(unsigned char *u, int pitch,
-        unsigned char blimit, unsigned char limit, unsigned char thresh,
-        unsigned char *v);
+extern prototype_loopfilter(vp8_loop_filter_horizontal_edge_y_neon);
+extern prototype_loopfilter(vp8_loop_filter_vertical_edge_y_neon);
+extern prototype_loopfilter(vp8_mbloop_filter_horizontal_edge_y_neon);
+extern prototype_loopfilter(vp8_mbloop_filter_vertical_edge_y_neon);
+extern prototype_loopfilter(vp8_loop_filter_simple_horizontal_edge_neon);
+extern prototype_loopfilter(vp8_loop_filter_simple_vertical_edge_neon);

-extern loopfilter_y_neon vp8_loop_filter_horizontal_edge_y_neon;
-extern loopfilter_y_neon vp8_loop_filter_vertical_edge_y_neon;
-extern loopfilter_y_neon vp8_mbloop_filter_horizontal_edge_y_neon;
-extern loopfilter_y_neon vp8_mbloop_filter_vertical_edge_y_neon;
+extern loop_filter_uvfunction vp8_loop_filter_horizontal_edge_uv_neon;
+extern loop_filter_uvfunction vp8_loop_filter_vertical_edge_uv_neon;
+extern loop_filter_uvfunction vp8_mbloop_filter_horizontal_edge_uv_neon;
+extern loop_filter_uvfunction vp8_mbloop_filter_vertical_edge_uv_neon;

-extern loopfilter_uv_neon vp8_loop_filter_horizontal_edge_uv_neon;
-extern loopfilter_uv_neon vp8_loop_filter_vertical_edge_uv_neon;
-extern loopfilter_uv_neon vp8_mbloop_filter_horizontal_edge_uv_neon;
-extern loopfilter_uv_neon vp8_mbloop_filter_vertical_edge_uv_neon;
-#endif

 #if HAVE_ARMV6
 /*ARMV6 loopfilter functions*/
 /* Horizontal MB filtering */
 void vp8_loop_filter_mbh_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                               int y_stride, int uv_stride, loop_filter_info *lfi)
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_horizontal_edge_armv6(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_horizontal_edge_armv6(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_horizontal_edge_armv6(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_horizontal_edge_armv6(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_mbloop_filter_horizontal_edge_armv6(v_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_horizontal_edge_armv6(v_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);
+}
+
+void vp8_loop_filter_mbhs_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                                int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_armv6(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }

 /* Vertical MB Filtering */
 void vp8_loop_filter_mbv_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                               int y_stride, int uv_stride, loop_filter_info *lfi)
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_vertical_edge_armv6(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_vertical_edge_armv6(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_vertical_edge_armv6(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_vertical_edge_armv6(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_mbloop_filter_vertical_edge_armv6(v_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_vertical_edge_armv6(v_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);
+}
+
+void vp8_loop_filter_mbvs_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                                int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_armv6(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }

 /* Horizontal B Filtering */
 void vp8_loop_filter_bh_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                              int y_stride, int uv_stride, loop_filter_info *lfi)
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_horizontal_edge_armv6(y_ptr + 4 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_armv6(y_ptr + 8 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_armv6(y_ptr + 12 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_horizontal_edge_armv6(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_armv6(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_armv6(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_horizontal_edge_armv6(u_ptr + 4 * uv_stride, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_horizontal_edge_armv6(u_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_loop_filter_horizontal_edge_armv6(v_ptr + 4 * uv_stride, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_horizontal_edge_armv6(v_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);
 }

-void vp8_loop_filter_bhs_armv6(unsigned char *y_ptr, int y_stride,
-                               const unsigned char *blimit)
+void vp8_loop_filter_bhs_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_horizontal_edge_armv6(y_ptr + 4 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_armv6(y_ptr + 8 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_armv6(y_ptr + 12 * y_stride, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_armv6(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_armv6(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_armv6(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }

 /* Vertical B Filtering */
 void vp8_loop_filter_bv_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                              int y_stride, int uv_stride, loop_filter_info *lfi)
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_vertical_edge_armv6(y_ptr + 4, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_armv6(y_ptr + 8, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_armv6(y_ptr + 12, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_vertical_edge_armv6(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_armv6(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_armv6(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_vertical_edge_armv6(u_ptr + 4, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_vertical_edge_armv6(u_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_loop_filter_vertical_edge_armv6(v_ptr + 4, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_vertical_edge_armv6(v_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);
 }

-void vp8_loop_filter_bvs_armv6(unsigned char *y_ptr, int y_stride,
-                               const unsigned char *blimit)
+void vp8_loop_filter_bvs_armv6(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_vertical_edge_armv6(y_ptr + 4, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_armv6(y_ptr + 8, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_armv6(y_ptr + 12, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_armv6(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_armv6(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_armv6(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }
 #endif

@@ -117,60 +145,93 @@ void vp8_loop_filter_bvs_armv6(unsigned char *y_ptr, int y_stride,
 /* NEON loopfilter functions */
 /* Horizontal MB filtering */
 void vp8_loop_filter_mbh_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                              int y_stride, int uv_stride, loop_filter_info *lfi)
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    unsigned char mblim = *lfi->mblim;
-    unsigned char lim = *lfi->lim;
-    unsigned char hev_thr = *lfi->hev_thr;
-    vp8_mbloop_filter_horizontal_edge_y_neon(y_ptr, y_stride, mblim, lim, hev_thr);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_horizontal_edge_y_neon(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_horizontal_edge_uv_neon(u_ptr, uv_stride, mblim, lim, hev_thr, v_ptr);
+        vp8_mbloop_filter_horizontal_edge_uv_neon(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, v_ptr);
+}
+
+void vp8_loop_filter_mbhs_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_neon(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }

 /* Vertical MB Filtering */
 void vp8_loop_filter_mbv_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                              int y_stride, int uv_stride, loop_filter_info *lfi)
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    unsigned char mblim = *lfi->mblim;
-    unsigned char lim = *lfi->lim;
-    unsigned char hev_thr = *lfi->hev_thr;
-
-    vp8_mbloop_filter_vertical_edge_y_neon(y_ptr, y_stride, mblim, lim, hev_thr);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_vertical_edge_y_neon(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_vertical_edge_uv_neon(u_ptr, uv_stride, mblim, lim, hev_thr, v_ptr);
+        vp8_mbloop_filter_vertical_edge_uv_neon(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, v_ptr);
+}
+
+void vp8_loop_filter_mbvs_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_neon(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }

 /* Horizontal B Filtering */
 void vp8_loop_filter_bh_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                             int y_stride, int uv_stride, loop_filter_info *lfi)
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    unsigned char blim = *lfi->blim;
-    unsigned char lim = *lfi->lim;
-    unsigned char hev_thr = *lfi->hev_thr;
-
-    vp8_loop_filter_horizontal_edge_y_neon(y_ptr + 4 * y_stride, y_stride, blim, lim, hev_thr);
-    vp8_loop_filter_horizontal_edge_y_neon(y_ptr + 8 * y_stride, y_stride, blim, lim, hev_thr);
-    vp8_loop_filter_horizontal_edge_y_neon(y_ptr + 12 * y_stride, y_stride, blim, lim, hev_thr);
+    (void) simpler_lpf;
+    vp8_loop_filter_horizontal_edge_y_neon(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_y_neon(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_y_neon(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_horizontal_edge_uv_neon(u_ptr + 4 * uv_stride, uv_stride, blim, lim, hev_thr, v_ptr + 4 * uv_stride);
+        vp8_loop_filter_horizontal_edge_uv_neon(u_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, v_ptr + 4 * uv_stride);
+}
+
+void vp8_loop_filter_bhs_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_neon(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_neon(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_neon(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }

 /* Vertical B Filtering */
 void vp8_loop_filter_bv_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                             int y_stride, int uv_stride, loop_filter_info *lfi)
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    unsigned char blim = *lfi->blim;
-    unsigned char lim = *lfi->lim;
-    unsigned char hev_thr = *lfi->hev_thr;
-
-    vp8_loop_filter_vertical_edge_y_neon(y_ptr + 4, y_stride, blim, lim, hev_thr);
-    vp8_loop_filter_vertical_edge_y_neon(y_ptr + 8, y_stride, blim, lim, hev_thr);
-    vp8_loop_filter_vertical_edge_y_neon(y_ptr + 12, y_stride, blim, lim, hev_thr);
+    (void) simpler_lpf;
+    vp8_loop_filter_vertical_edge_y_neon(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_y_neon(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_y_neon(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_vertical_edge_uv_neon(u_ptr + 4, uv_stride, blim, lim, hev_thr, v_ptr + 4);
+        vp8_loop_filter_vertical_edge_uv_neon(u_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, v_ptr + 4);
+}
+
+void vp8_loop_filter_bvs_neon(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_neon(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_neon(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_neon(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }
 #endif
--- a/vp8/common/arm/loopfilter_arm.h
+++ b/vp8/common/arm/loopfilter_arm.h
@@ -12,17 +12,15 @@
 #ifndef LOOPFILTER_ARM_H
 #define LOOPFILTER_ARM_H

-#include "vpx_config.h"
-
 #if HAVE_ARMV6
 extern prototype_loopfilter_block(vp8_loop_filter_mbv_armv6);
 extern prototype_loopfilter_block(vp8_loop_filter_bv_armv6);
 extern prototype_loopfilter_block(vp8_loop_filter_mbh_armv6);
 extern prototype_loopfilter_block(vp8_loop_filter_bh_armv6);
-extern prototype_simple_loopfilter(vp8_loop_filter_bvs_armv6);
-extern prototype_simple_loopfilter(vp8_loop_filter_bhs_armv6);
-extern prototype_simple_loopfilter(vp8_loop_filter_simple_horizontal_edge_armv6);
-extern prototype_simple_loopfilter(vp8_loop_filter_simple_vertical_edge_armv6);
+extern prototype_loopfilter_block(vp8_loop_filter_mbvs_armv6);
+extern prototype_loopfilter_block(vp8_loop_filter_bvs_armv6);
+extern prototype_loopfilter_block(vp8_loop_filter_mbhs_armv6);
+extern prototype_loopfilter_block(vp8_loop_filter_bhs_armv6);

 #if !CONFIG_RUNTIME_CPU_DETECT
 #undef  vp8_lf_normal_mb_v
@@ -38,29 +36,28 @@ extern prototype_simple_loopfilter(vp8_loop_filter_simple_vertical_edge_armv6);
 #define vp8_lf_normal_b_h vp8_loop_filter_bh_armv6

 #undef  vp8_lf_simple_mb_v
-#define vp8_lf_simple_mb_v vp8_loop_filter_simple_vertical_edge_armv6
+#define vp8_lf_simple_mb_v vp8_loop_filter_mbvs_armv6

 #undef  vp8_lf_simple_b_v
 #define vp8_lf_simple_b_v vp8_loop_filter_bvs_armv6

 #undef  vp8_lf_simple_mb_h
-#define vp8_lf_simple_mb_h vp8_loop_filter_simple_horizontal_edge_armv6
+#define vp8_lf_simple_mb_h vp8_loop_filter_mbhs_armv6

 #undef  vp8_lf_simple_b_h
 #define vp8_lf_simple_b_h vp8_loop_filter_bhs_armv6
-#endif /* !CONFIG_RUNTIME_CPU_DETECT */
-
-#endif /* HAVE_ARMV6 */
+#endif
+#endif

 #if HAVE_ARMV7
 extern prototype_loopfilter_block(vp8_loop_filter_mbv_neon);
 extern prototype_loopfilter_block(vp8_loop_filter_bv_neon);
 extern prototype_loopfilter_block(vp8_loop_filter_mbh_neon);
 extern prototype_loopfilter_block(vp8_loop_filter_bh_neon);
-extern prototype_simple_loopfilter(vp8_loop_filter_mbvs_neon);
-extern prototype_simple_loopfilter(vp8_loop_filter_bvs_neon);
-extern prototype_simple_loopfilter(vp8_loop_filter_mbhs_neon);
-extern prototype_simple_loopfilter(vp8_loop_filter_bhs_neon);
+extern prototype_loopfilter_block(vp8_loop_filter_mbvs_neon);
+extern prototype_loopfilter_block(vp8_loop_filter_bvs_neon);
+extern prototype_loopfilter_block(vp8_loop_filter_mbhs_neon);
+extern prototype_loopfilter_block(vp8_loop_filter_bhs_neon);

 #if !CONFIG_RUNTIME_CPU_DETECT
 #undef  vp8_lf_normal_mb_v
@@ -86,8 +83,7 @@ extern prototype_simple_loopfilter(vp8_loop_filter_bhs_neon);

 #undef  vp8_lf_simple_b_h
 #define vp8_lf_simple_b_h vp8_loop_filter_bhs_neon
-#endif /* !CONFIG_RUNTIME_CPU_DETECT */
+#endif
+#endif

-#endif /* HAVE_ARMV7 */
-
-#endif /* LOOPFILTER_ARM_H */
+#endif
--- a/vp8/common/arm/neon/bilinearpredict16x16_neon.asm
+++ b/vp8/common/arm/neon/bilinearpredict16x16_neon.asm
@@ -25,7 +25,7 @@
 |vp8_bilinear_predict16x16_neon| PROC
    push            {r4-r5, lr}

-    adr             r12, bifilter16_coeff
+    ldr             r12, _bifilter16_coeff_
    ldr             r4, [sp, #12]           ;load parameters from stack
    ldr             r5, [sp, #16]           ;load parameters from stack

@@ -351,6 +351,8 @@ filt_blk2d_spo16x16_loop_neon

 ;-----------------

+_bifilter16_coeff_
+    DCD     bifilter16_coeff
 bifilter16_coeff
    DCD     128, 0, 112, 16, 96, 32, 80, 48, 64, 64, 48, 80, 32, 96, 16, 112

--- a/vp8/common/arm/neon/bilinearpredict4x4_neon.asm
+++ b/vp8/common/arm/neon/bilinearpredict4x4_neon.asm
@@ -25,7 +25,7 @@
 |vp8_bilinear_predict4x4_neon| PROC
    push            {r4, lr}

-    adr             r12, bifilter4_coeff
+    ldr             r12, _bifilter4_coeff_
    ldr             r4, [sp, #8]            ;load parameters from stack
    ldr             lr, [sp, #12]           ;load parameters from stack

@@ -124,6 +124,8 @@ skip_secondpass_filter

 ;-----------------

+_bifilter4_coeff_
+    DCD     bifilter4_coeff
 bifilter4_coeff
    DCD     128, 0, 112, 16, 96, 32, 80, 48, 64, 64, 48, 80, 32, 96, 16, 112

--- a/vp8/common/arm/neon/bilinearpredict8x4_neon.asm
+++ b/vp8/common/arm/neon/bilinearpredict8x4_neon.asm
@@ -25,7 +25,7 @@
 |vp8_bilinear_predict8x4_neon| PROC
    push            {r4, lr}

-    adr             r12, bifilter8x4_coeff
+    ldr             r12, _bifilter8x4_coeff_
    ldr             r4, [sp, #8]            ;load parameters from stack
    ldr             lr, [sp, #12]           ;load parameters from stack

@@ -129,6 +129,8 @@ skip_secondpass_filter

 ;-----------------

+_bifilter8x4_coeff_
+    DCD     bifilter8x4_coeff
 bifilter8x4_coeff
    DCD     128, 0, 112, 16, 96, 32, 80, 48, 64, 64, 48, 80, 32, 96, 16, 112

--- a/vp8/common/arm/neon/bilinearpredict8x8_neon.asm
+++ b/vp8/common/arm/neon/bilinearpredict8x8_neon.asm
@@ -25,7 +25,7 @@
 |vp8_bilinear_predict8x8_neon| PROC
    push            {r4, lr}

-    adr             r12, bifilter8_coeff
+    ldr             r12, _bifilter8_coeff_
    ldr             r4, [sp, #8]            ;load parameters from stack
    ldr             lr, [sp, #12]           ;load parameters from stack

@@ -177,6 +177,8 @@ skip_secondpass_filter

 ;-----------------

+_bifilter8_coeff_
+    DCD     bifilter8_coeff
 bifilter8_coeff
    DCD     128, 0, 112, 16, 96, 32, 80, 48, 64, 64, 48, 80, 32, 96, 16, 112

--- a/vp8/common/arm/neon/iwalsh_neon.asm
+++ b/vp8/common/arm/neon/iwalsh_neon.asm
@@ -20,16 +20,19 @@
 |vp8_short_inv_walsh4x4_neon| PROC

    ; read in all four lines of values: d0->d3
-    vld1.i16 {q0-q1}, [r0@128]
+    vldm.64 r0, {q0, q1}

    ; first for loop
-    vadd.s16 d4, d0, d3 ;a = [0] + [12]
-    vadd.s16 d6, d1, d2 ;b = [4] + [8]
-    vsub.s16 d5, d0, d3 ;d = [0] - [12]
-    vsub.s16 d7, d1, d2 ;c = [4] - [8]

-    vadd.s16 q0, q2, q3 ; a+b d+c
-    vsub.s16 q1, q2, q3 ; a-b d-c
+    vadd.s16 d4, d0, d3 ;a = [0] + [12]
+    vadd.s16 d5, d1, d2 ;b = [4] + [8]
+    vsub.s16 d6, d1, d2 ;c = [4] - [8]
+    vsub.s16 d7, d0, d3 ;d = [0] - [12]
+
+    vadd.s16 d0, d4, d5 ;a + b
+    vadd.s16 d1, d6, d7 ;c + d
+    vsub.s16 d2, d4, d5 ;a - b
+    vsub.s16 d3, d7, d6 ;d - c

    vtrn.32 d0, d2 ;d0:  0  1  8  9
                   ;d2:  2  3 10 11
@@ -44,22 +47,29 @@
    ; second for loop

    vadd.s16 d4, d0, d3 ;a = [0] + [3]
-    vadd.s16 d6, d1, d2 ;b = [1] + [2]
-    vsub.s16 d5, d0, d3 ;d = [0] - [3]
-    vsub.s16 d7, d1, d2 ;c = [1] - [2]
+    vadd.s16 d5, d1, d2 ;b = [1] + [2]
+    vsub.s16 d6, d1, d2 ;c = [1] - [2]
+    vsub.s16 d7, d0, d3 ;d = [0] - [3]

-    vmov.i16 q8, #3
+    vadd.s16 d0, d4, d5 ;e = a + b
+    vadd.s16 d1, d6, d7 ;f = c + d
+    vsub.s16 d2, d4, d5 ;g = a - b
+    vsub.s16 d3, d7, d6 ;h = d - c

-    vadd.s16 q0, q2, q3 ; a+b d+c
-    vsub.s16 q1, q2, q3 ; a-b d-c
-
-    vadd.i16 q0, q0, q8 ;e/f += 3
-    vadd.i16 q1, q1, q8 ;g/h += 3
+    vmov.i16 q2, #3
+    vadd.i16 q0, q0, q2 ;e/f += 3
+    vadd.i16 q1, q1, q2 ;g/h += 3

    vshr.s16 q0, q0, #3 ;e/f >> 3
    vshr.s16 q1, q1, #3 ;g/h >> 3

-    vst4.i16 {d0,d1,d2,d3}, [r1@128]
+    vtrn.32 d0, d2
+    vtrn.32 d1, d3
+    vtrn.16 d0, d1
+    vtrn.16 d2, d3
+
+    vstmia.16 r1!, {q0}
+    vstmia.16 r1!, {q1}

    bx lr
    ENDP    ; |vp8_short_inv_walsh4x4_neon|
@@ -67,13 +77,19 @@

 ;short vp8_short_inv_walsh4x4_1_neon(short *input, short *output)
 |vp8_short_inv_walsh4x4_1_neon| PROC
-    ldrsh r2, [r0]          ; load input[0]
-    add r3, r2, #3          ; add 3
-    add r2, r1, #16         ; base for last 8 output
-    asr r0, r3, #3          ; right shift 3
-    vdup.16 q0, r0          ; load and duplicate
-    vst1.16 {q0}, [r1@128]  ; write back 8
-    vst1.16 {q0}, [r2@128]  ; write back last 8
+    ; load a full line into a neon register
+    vld1.16  {q0}, [r0]
+    ; extract first element and replicate
+    vdup.16 q1, d0[0]
+    ; add 3 to all values
+    vmov.i16 q2, #3
+    vadd.i16 q3, q1, q2
+    ; right shift
+    vshr.s16 q3, q3, #3
+    ; write it back
+    vstmia.16 r1!, {q3}
+    vstmia.16 r1!, {q3}
+
    bx lr
    ENDP    ; |vp8_short_inv_walsh4x4_1_neon|

--- a/vp8/common/arm/neon/loopfilter_neon.asm
+++ b/vp8/common/arm/neon/loopfilter_neon.asm
@@ -14,97 +14,109 @@
    EXPORT  |vp8_loop_filter_vertical_edge_y_neon|
    EXPORT  |vp8_loop_filter_vertical_edge_uv_neon|
    ARM
+    REQUIRE8
+    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2

+; flimit, limit, and thresh should be positive numbers.
+; All 16 elements in these variables are equal.
+
+; void vp8_loop_filter_horizontal_edge_y_neon(unsigned char *src, int pitch,
+;                                             const signed char *flimit,
+;                                             const signed char *limit,
+;                                             const signed char *thresh,
+;                                             int count)
 ; r0    unsigned char *src
 ; r1    int pitch
-; r2    unsigned char blimit
-; r3    unsigned char limit
-; sp    unsigned char thresh,
+; r2    const signed char *flimit
+; r3    const signed char *limit
+; sp    const signed char *thresh,
+; sp+4  int count (unused)
 |vp8_loop_filter_horizontal_edge_y_neon| PROC
-    push        {lr}
-    vdup.u8     q0, r2                     ; duplicate blimit
-    vdup.u8     q1, r3                     ; duplicate limit
-    sub         r2, r0, r1, lsl #2         ; move src pointer down by 4 lines
-    ldr         r3, [sp, #4]               ; load thresh
-    add         r12, r2, r1
-    add         r1, r1, r1
+    stmdb       sp!, {lr}
+    vld1.s8     {d0[], d1[]}, [r2]          ; flimit
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit
+    sub         r2, r0, r1, lsl #2          ; move src pointer down by 4 lines
+    ldr         r12, [sp, #4]               ; load thresh pointer

-    vdup.u8     q2, r3                     ; duplicate thresh
-
-    vld1.u8     {q3}, [r2@128], r1              ; p3
-    vld1.u8     {q4}, [r12@128], r1             ; p2
-    vld1.u8     {q5}, [r2@128], r1              ; p1
-    vld1.u8     {q6}, [r12@128], r1             ; p0
-    vld1.u8     {q7}, [r2@128], r1              ; q0
-    vld1.u8     {q8}, [r12@128], r1             ; q1
-    vld1.u8     {q9}, [r2@128]                  ; q2
-    vld1.u8     {q10}, [r12@128]                ; q3
-
-    sub         r2, r2, r1, lsl #1
-    sub         r12, r12, r1, lsl #1
+    vld1.u8     {q3}, [r2], r1              ; p3
+    vld1.u8     {q4}, [r2], r1              ; p2
+    vld1.u8     {q5}, [r2], r1              ; p1
+    vld1.u8     {q6}, [r2], r1              ; p0
+    vld1.u8     {q7}, [r2], r1              ; q0
+    vld1.u8     {q8}, [r2], r1              ; q1
+    vld1.u8     {q9}, [r2], r1              ; q2
+    vld1.u8     {q10}, [r2]                 ; q3
+    vld1.s8     {d4[], d5[]}, [r12]         ; thresh
+    sub         r0, r0, r1, lsl #1

    bl          vp8_loop_filter_neon

-    vst1.u8     {q5}, [r2@128], r1              ; store op1
-    vst1.u8     {q6}, [r12@128], r1             ; store op0
-    vst1.u8     {q7}, [r2@128], r1              ; store oq0
-    vst1.u8     {q8}, [r12@128], r1             ; store oq1
+    vst1.u8     {q5}, [r0], r1              ; store op1
+    vst1.u8     {q6}, [r0], r1              ; store op0
+    vst1.u8     {q7}, [r0], r1              ; store oq0
+    vst1.u8     {q8}, [r0], r1              ; store oq1

-    pop         {pc}
+    ldmia       sp!, {pc}
    ENDP        ; |vp8_loop_filter_horizontal_edge_y_neon|

-
+; void vp8_loop_filter_horizontal_edge_uv_neon(unsigned char *u, int pitch
+;                                              const signed char *flimit,
+;                                              const signed char *limit,
+;                                              const signed char *thresh,
+;                                              unsigned char *v)
 ; r0    unsigned char *u,
 ; r1    int pitch,
-; r2    unsigned char blimit
-; r3    unsigned char limit
-; sp    unsigned char thresh,
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; sp    const signed char *thresh,
 ; sp+4  unsigned char *v
 |vp8_loop_filter_horizontal_edge_uv_neon| PROC
-    push        {lr}
-    vdup.u8     q0, r2                      ; duplicate blimit
-    vdup.u8     q1, r3                      ; duplicate limit
-    ldr         r12, [sp, #4]               ; load thresh
+    stmdb       sp!, {lr}
+    vld1.s8     {d0[], d1[]}, [r2]          ; flimit
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit
    ldr         r2, [sp, #8]                ; load v ptr
-    vdup.u8     q2, r12                     ; duplicate thresh

    sub         r3, r0, r1, lsl #2          ; move u pointer down by 4 lines
-    sub         r12, r2, r1, lsl #2         ; move v pointer down by 4 lines
+    vld1.u8     {d6}, [r3], r1              ; p3
+    vld1.u8     {d8}, [r3], r1              ; p2
+    vld1.u8     {d10}, [r3], r1             ; p1
+    vld1.u8     {d12}, [r3], r1             ; p0
+    vld1.u8     {d14}, [r3], r1             ; q0
+    vld1.u8     {d16}, [r3], r1             ; q1
+    vld1.u8     {d18}, [r3], r1             ; q2
+    vld1.u8     {d20}, [r3]                 ; q3

-    vld1.u8     {d6}, [r3@64], r1              ; p3
-    vld1.u8     {d7}, [r12@64], r1             ; p3
-    vld1.u8     {d8}, [r3@64], r1              ; p2
-    vld1.u8     {d9}, [r12@64], r1             ; p2
-    vld1.u8     {d10}, [r3@64], r1             ; p1
-    vld1.u8     {d11}, [r12@64], r1            ; p1
-    vld1.u8     {d12}, [r3@64], r1             ; p0
-    vld1.u8     {d13}, [r12@64], r1            ; p0
-    vld1.u8     {d14}, [r3@64], r1             ; q0
-    vld1.u8     {d15}, [r12@64], r1            ; q0
-    vld1.u8     {d16}, [r3@64], r1             ; q1
-    vld1.u8     {d17}, [r12@64], r1            ; q1
-    vld1.u8     {d18}, [r3@64], r1             ; q2
-    vld1.u8     {d19}, [r12@64], r1            ; q2
-    vld1.u8     {d20}, [r3@64]                 ; q3
-    vld1.u8     {d21}, [r12@64]                ; q3
+    ldr         r3, [sp, #4]                ; load thresh pointer
+
+    sub         r12, r2, r1, lsl #2         ; move v pointer down by 4 lines
+    vld1.u8     {d7}, [r12], r1             ; p3
+    vld1.u8     {d9}, [r12], r1             ; p2
+    vld1.u8     {d11}, [r12], r1            ; p1
+    vld1.u8     {d13}, [r12], r1            ; p0
+    vld1.u8     {d15}, [r12], r1            ; q0
+    vld1.u8     {d17}, [r12], r1            ; q1
+    vld1.u8     {d19}, [r12], r1            ; q2
+    vld1.u8     {d21}, [r12]                ; q3
+
+    vld1.s8     {d4[], d5[]}, [r3]          ; thresh

    bl          vp8_loop_filter_neon

    sub         r0, r0, r1, lsl #1
    sub         r2, r2, r1, lsl #1

-    vst1.u8     {d10}, [r0@64], r1             ; store u op1
-    vst1.u8     {d11}, [r2@64], r1             ; store v op1
-    vst1.u8     {d12}, [r0@64], r1             ; store u op0
-    vst1.u8     {d13}, [r2@64], r1             ; store v op0
-    vst1.u8     {d14}, [r0@64], r1             ; store u oq0
-    vst1.u8     {d15}, [r2@64], r1             ; store v oq0
-    vst1.u8     {d16}, [r0@64]                 ; store u oq1
-    vst1.u8     {d17}, [r2@64]                 ; store v oq1
+    vst1.u8     {d10}, [r0], r1             ; store u op1
+    vst1.u8     {d11}, [r2], r1             ; store v op1
+    vst1.u8     {d12}, [r0], r1             ; store u op0
+    vst1.u8     {d13}, [r2], r1             ; store v op0
+    vst1.u8     {d14}, [r0], r1             ; store u oq0
+    vst1.u8     {d15}, [r2], r1             ; store v oq0
+    vst1.u8     {d16}, [r0]                 ; store u oq1
+    vst1.u8     {d17}, [r2]                 ; store v oq1

-    pop         {pc}
+    ldmia       sp!, {pc}
    ENDP        ; |vp8_loop_filter_horizontal_edge_uv_neon|

 ; void vp8_loop_filter_vertical_edge_y_neon(unsigned char *src, int pitch,
@@ -112,38 +124,39 @@
 ;                                           const signed char *limit,
 ;                                           const signed char *thresh,
 ;                                           int count)
-; r0    unsigned char *src
-; r1    int pitch
-; r2    unsigned char blimit
-; r3    unsigned char limit
-; sp    unsigned char thresh,
-
+; r0    unsigned char *src,
+; r1    int pitch,
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; sp    const signed char *thresh,
+; sp+4  int count (unused)
 |vp8_loop_filter_vertical_edge_y_neon| PROC
-    push        {lr}
-    vdup.u8     q0, r2                     ; duplicate blimit
-    vdup.u8     q1, r3                     ; duplicate limit
-    sub         r2, r0, #4                 ; src ptr down by 4 columns
-    add         r1, r1, r1
-    ldr         r3, [sp, #4]               ; load thresh
-    add         r12, r2, r1, asr #1
+    stmdb       sp!, {lr}
+    vld1.s8     {d0[], d1[]}, [r2]          ; flimit
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit
+    sub         r2, r0, #4                  ; src ptr down by 4 columns
+    sub         r0, r0, #2                  ; dst ptr
+    ldr         r12, [sp, #4]               ; load thresh pointer

-    vld1.u8     {d6}, [r2], r1
-    vld1.u8     {d8}, [r12], r1
+    vld1.u8     {d6}, [r2], r1              ; load first 8-line src data
+    vld1.u8     {d8}, [r2], r1
    vld1.u8     {d10}, [r2], r1
-    vld1.u8     {d12}, [r12], r1
+    vld1.u8     {d12}, [r2], r1
    vld1.u8     {d14}, [r2], r1
-    vld1.u8     {d16}, [r12], r1
+    vld1.u8     {d16}, [r2], r1
    vld1.u8     {d18}, [r2], r1
-    vld1.u8     {d20}, [r12], r1
+    vld1.u8     {d20}, [r2], r1
+
+    vld1.s8     {d4[], d5[]}, [r12]         ; thresh

    vld1.u8     {d7}, [r2], r1              ; load second 8-line src data
-    vld1.u8     {d9}, [r12], r1
+    vld1.u8     {d9}, [r2], r1
    vld1.u8     {d11}, [r2], r1
-    vld1.u8     {d13}, [r12], r1
+    vld1.u8     {d13}, [r2], r1
    vld1.u8     {d15}, [r2], r1
-    vld1.u8     {d17}, [r12], r1
-    vld1.u8     {d19}, [r2]
-    vld1.u8     {d21}, [r12]
+    vld1.u8     {d17}, [r2], r1
+    vld1.u8     {d19}, [r2], r1
+    vld1.u8     {d21}, [r2]

    ;transpose to 8x16 matrix
    vtrn.32     q3, q7
@@ -151,8 +164,6 @@
    vtrn.32     q5, q9
    vtrn.32     q6, q10

-    vdup.u8     q2, r3                     ; duplicate thresh
-
    vtrn.16     q3, q5
    vtrn.16     q4, q6
    vtrn.16     q7, q9
@@ -167,34 +178,28 @@

    vswp        d12, d11
    vswp        d16, d13
-
-    sub         r0, r0, #2                 ; dst ptr
-
    vswp        d14, d12
    vswp        d16, d15

-    add         r12, r0, r1, asr #1
-
    ;store op1, op0, oq0, oq1
    vst4.8      {d10[0], d11[0], d12[0], d13[0]}, [r0], r1
-    vst4.8      {d10[1], d11[1], d12[1], d13[1]}, [r12], r1
+    vst4.8      {d10[1], d11[1], d12[1], d13[1]}, [r0], r1
    vst4.8      {d10[2], d11[2], d12[2], d13[2]}, [r0], r1
-    vst4.8      {d10[3], d11[3], d12[3], d13[3]}, [r12], r1
+    vst4.8      {d10[3], d11[3], d12[3], d13[3]}, [r0], r1
    vst4.8      {d10[4], d11[4], d12[4], d13[4]}, [r0], r1
-    vst4.8      {d10[5], d11[5], d12[5], d13[5]}, [r12], r1
+    vst4.8      {d10[5], d11[5], d12[5], d13[5]}, [r0], r1
    vst4.8      {d10[6], d11[6], d12[6], d13[6]}, [r0], r1
-    vst4.8      {d10[7], d11[7], d12[7], d13[7]}, [r12], r1
-
+    vst4.8      {d10[7], d11[7], d12[7], d13[7]}, [r0], r1
    vst4.8      {d14[0], d15[0], d16[0], d17[0]}, [r0], r1
-    vst4.8      {d14[1], d15[1], d16[1], d17[1]}, [r12], r1
+    vst4.8      {d14[1], d15[1], d16[1], d17[1]}, [r0], r1
    vst4.8      {d14[2], d15[2], d16[2], d17[2]}, [r0], r1
-    vst4.8      {d14[3], d15[3], d16[3], d17[3]}, [r12], r1
+    vst4.8      {d14[3], d15[3], d16[3], d17[3]}, [r0], r1
    vst4.8      {d14[4], d15[4], d16[4], d17[4]}, [r0], r1
-    vst4.8      {d14[5], d15[5], d16[5], d17[5]}, [r12], r1
-    vst4.8      {d14[6], d15[6], d16[6], d17[6]}, [r0]
-    vst4.8      {d14[7], d15[7], d16[7], d17[7]}, [r12]
+    vst4.8      {d14[5], d15[5], d16[5], d17[5]}, [r0], r1
+    vst4.8      {d14[6], d15[6], d16[6], d17[6]}, [r0], r1
+    vst4.8      {d14[7], d15[7], d16[7], d17[7]}, [r0]

-    pop         {pc}
+    ldmia       sp!, {pc}
    ENDP        ; |vp8_loop_filter_vertical_edge_y_neon|

 ; void vp8_loop_filter_vertical_edge_uv_neon(unsigned char *u, int pitch
@@ -204,36 +209,38 @@
 ;                                            unsigned char *v)
 ; r0    unsigned char *u,
 ; r1    int pitch,
-; r2    unsigned char blimit
-; r3    unsigned char limit
-; sp    unsigned char thresh,
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; sp    const signed char *thresh,
 ; sp+4  unsigned char *v
 |vp8_loop_filter_vertical_edge_uv_neon| PROC
-    push        {lr}
-    vdup.u8     q0, r2                      ; duplicate blimit
-    sub         r12, r0, #4                 ; move u pointer down by 4 columns
-    ldr         r2, [sp, #8]                ; load v ptr
-    vdup.u8     q1, r3                      ; duplicate limit
-    sub         r3, r2, #4                  ; move v pointer down by 4 columns
+    stmdb       sp!, {lr}
+    sub         r12, r0, #4                  ; move u pointer down by 4 columns
+    vld1.s8     {d0[], d1[]}, [r2]          ; flimit
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit

-    vld1.u8     {d6}, [r12], r1             ;load u data
-    vld1.u8     {d7}, [r3], r1              ;load v data
+    ldr         r2, [sp, #8]                ; load v ptr
+
+    vld1.u8     {d6}, [r12], r1              ;load u data
    vld1.u8     {d8}, [r12], r1
-    vld1.u8     {d9}, [r3], r1
    vld1.u8     {d10}, [r12], r1
-    vld1.u8     {d11}, [r3], r1
    vld1.u8     {d12}, [r12], r1
-    vld1.u8     {d13}, [r3], r1
    vld1.u8     {d14}, [r12], r1
-    vld1.u8     {d15}, [r3], r1
    vld1.u8     {d16}, [r12], r1
-    vld1.u8     {d17}, [r3], r1
    vld1.u8     {d18}, [r12], r1
-    vld1.u8     {d19}, [r3], r1
    vld1.u8     {d20}, [r12]
+
+    sub         r3, r2, #4                  ; move v pointer down by 4 columns
+    vld1.u8     {d7}, [r3], r1              ;load v data
+    vld1.u8     {d9}, [r3], r1
+    vld1.u8     {d11}, [r3], r1
+    vld1.u8     {d13}, [r3], r1
+    vld1.u8     {d15}, [r3], r1
+    vld1.u8     {d17}, [r3], r1
+    vld1.u8     {d19}, [r3], r1
    vld1.u8     {d21}, [r3]

-    ldr        r12, [sp, #4]               ; load thresh
+    ldr         r12, [sp, #4]               ; load thresh pointer

    ;transpose to 8x16 matrix
    vtrn.32     q3, q7
@@ -241,8 +248,6 @@
    vtrn.32     q5, q9
    vtrn.32     q6, q10

-    vdup.u8     q2, r12                     ; duplicate thresh
-
    vtrn.16     q3, q5
    vtrn.16     q4, q6
    vtrn.16     q7, q9
@@ -253,16 +258,18 @@
    vtrn.8      q7, q8
    vtrn.8      q9, q10

+    vld1.s8     {d4[], d5[]}, [r12]         ; thresh
+
    bl          vp8_loop_filter_neon

+    sub         r0, r0, #2
+    sub         r2, r2, #2
+
    vswp        d12, d11
    vswp        d16, d13
    vswp        d14, d12
    vswp        d16, d15

-    sub         r0, r0, #2
-    sub         r2, r2, #2
-
    ;store op1, op0, oq0, oq1
    vst4.8      {d10[0], d11[0], d12[0], d13[0]}, [r0], r1
    vst4.8      {d14[0], d15[0], d16[0], d17[0]}, [r2], r1
@@ -281,7 +288,7 @@
    vst4.8      {d10[7], d11[7], d12[7], d13[7]}, [r0]
    vst4.8      {d14[7], d15[7], d16[7], d17[7]}, [r2]

-    pop         {pc}
+    ldmia       sp!, {pc}
    ENDP        ; |vp8_loop_filter_vertical_edge_uv_neon|

 ; void vp8_loop_filter_neon();
@@ -301,6 +308,7 @@
 ; q9    q2
 ; q10   q3
 |vp8_loop_filter_neon| PROC
+    ldr         r12, _lf_coeff_

    ; vp8_filter_mask
    vabd.u8     q11, q3, q4                 ; abs(p3 - p2)
@@ -309,44 +317,42 @@
    vabd.u8     q14, q8, q7                 ; abs(q1 - q0)
    vabd.u8     q3, q9, q8                  ; abs(q2 - q1)
    vabd.u8     q4, q10, q9                 ; abs(q3 - q2)
+    vabd.u8     q9, q6, q7                  ; abs(p0 - q0)

    vmax.u8     q11, q11, q12
    vmax.u8     q12, q13, q14
    vmax.u8     q3, q3, q4
    vmax.u8     q15, q11, q12

-    vabd.u8     q9, q6, q7                  ; abs(p0 - q0)
-
    ; vp8_hevmask
    vcgt.u8     q13, q13, q2                ; (abs(p1 - p0) > thresh)*-1
    vcgt.u8     q14, q14, q2                ; (abs(q1 - q0) > thresh)*-1
    vmax.u8     q15, q15, q3

-    vmov.u8     q10, #0x80                   ; 0x80
+    vadd.u8     q0, q0, q0                  ; flimit * 2
+    vadd.u8     q0, q0, q1                  ; flimit * 2 + limit
+    vcge.u8     q15, q1, q15

    vabd.u8     q2, q5, q8                  ; a = abs(p1 - q1)
    vqadd.u8    q9, q9, q9                  ; b = abs(p0 - q0) * 2
+    vshr.u8     q2, q2, #1                  ; a = a / 2
+    vqadd.u8    q9, q9, q2                  ; a = b + a
+    vcge.u8     q9, q0, q9                  ; (a > flimit * 2 + limit) * -1

-    vcge.u8     q15, q1, q15
+    vld1.u8     {q0}, [r12]!

    ; vp8_filter() function
    ; convert to signed
-    veor        q7, q7, q10                 ; qs0
-    vshr.u8     q2, q2, #1                  ; a = a / 2
-    veor        q6, q6, q10                 ; ps0
+    veor        q7, q7, q0                  ; qs0
+    veor        q6, q6, q0                  ; ps0
+    veor        q5, q5, q0                  ; ps1
+    veor        q8, q8, q0                  ; qs1

-    veor        q5, q5, q10                 ; ps1
-    vqadd.u8    q9, q9, q2                  ; a = b + a
-
-    veor        q8, q8, q10                 ; qs1
-
-    vmov.u8     q10, #3                     ; #3
+    vld1.u8     {q10}, [r12]!

    vsubl.s8    q2, d14, d12                ; ( qs0 - ps0)
    vsubl.s8    q11, d15, d13

-    vcge.u8     q9, q0, q9                  ; (a > flimit * 2 + limit) * -1
-
    vmovl.u8    q4, d20

    vqsub.s8    q1, q5, q8                  ; vp8_filter = clamp(ps1-qs1)
@@ -361,7 +367,7 @@
    vaddw.s8    q2, q2, d2
    vaddw.s8    q11, q11, d3

-    vmov.u8     q9, #4                      ; #4
+    vld1.u8     {q9}, [r12]!

    ; vp8_filter = clamp(vp8_filter + 3 * ( qs0 - ps0))
    vqmovn.s16  d2, q2
@@ -373,20 +379,19 @@
    vshr.s8     q2, q2, #3                  ; Filter2 >>= 3
    vshr.s8     q1, q1, #3                  ; Filter1 >>= 3

-
    vqadd.s8    q11, q6, q2                 ; u = clamp(ps0 + Filter2)
    vqsub.s8    q10, q7, q1                 ; u = clamp(qs0 - Filter1)

    ; outer tap adjustments: ++vp8_filter >> 1
    vrshr.s8    q1, q1, #1
    vbic        q1, q1, q14                 ; vp8_filter &= ~hev
-    vmov.u8     q0, #0x80                   ; 0x80
+
    vqadd.s8    q13, q5, q1                 ; u = clamp(ps1 + vp8_filter)
    vqsub.s8    q12, q8, q1                 ; u = clamp(qs1 - vp8_filter)

+    veor        q5, q13, q0                 ; *op1 = u^0x80
    veor        q6, q11, q0                 ; *op0 = u^0x80
    veor        q7, q10, q0                 ; *oq0 = u^0x80
-    veor        q5, q13, q0                 ; *op1 = u^0x80
    veor        q8, q12, q0                 ; *oq1 = u^0x80

    bx          lr
@@ -394,4 +399,12 @@

 ;-----------------

+_lf_coeff_
+    DCD     lf_coeff
+lf_coeff
+    DCD     0x80808080, 0x80808080, 0x80808080, 0x80808080
+    DCD     0x03030303, 0x03030303, 0x03030303, 0x03030303
+    DCD     0x04040404, 0x04040404, 0x04040404, 0x04040404
+    DCD     0x01010101, 0x01010101, 0x01010101, 0x01010101
+
    END
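
For readers following the vp8_filter_mask and vp8_hevmask comments in the hunk above, here is a rough scalar sketch in C of what one lane appears to compute. It is an illustration only, not code from libvpx: the names sat_add_u8, abs_diff_u8, filter_mask and hev_mask are hypothetical, the abs(p2 - p1) term and the AND/OR steps that combine the masks are assumed because those instructions fall outside the lines shown, and the comparison uses >= because the instruction is vcge even though the comment writes it as ">".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating unsigned add, the scalar analogue of vqadd.u8. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Absolute difference, the scalar analogue of vabd.u8. */
static uint8_t abs_diff_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a > b ? a - b : b - a);
}

/* px holds p3, p2, p1, p0, q0, q1, q2, q3 across the edge.
 * Returns 0xFF when the pixel should be filtered, 0x00 otherwise,
 * mirroring the all-ones / all-zeros lanes that NEON compares produce. */
static uint8_t filter_mask(uint8_t flimit, uint8_t limit, const uint8_t px[8]) {
    assert(px != NULL);

    /* Largest neighbour difference on either side of the edge (vmax.u8 chain). */
    uint8_t worst = 0;
    for (int i = 0; i < 7; i++) {
        if (i == 3) continue; /* the p0/q0 pair is covered by the edge term */
        uint8_t d = abs_diff_u8(px[i], px[i + 1]);
        if (d > worst) worst = d;
    }

    /* abs(p0 - q0) * 2 + abs(p1 - q1) / 2, with saturating adds. */
    uint8_t b = abs_diff_u8(px[3], px[4]);
    uint8_t a = (uint8_t)(abs_diff_u8(px[2], px[5]) >> 1);
    uint8_t edge = sat_add_u8(sat_add_u8(b, b), a);

    /* flimit * 2 + limit uses plain vadd.u8, so it wraps modulo 256. */
    uint8_t bound = (uint8_t)(flimit * 2 + limit);

    /* Assumed combination: both tests must pass. */
    return (limit >= worst && bound >= edge) ? 0xFF : 0x00;
}

/* High edge variance: abs(p1 - p0) > thresh or abs(q1 - q0) > thresh.
 * The OR is an assumption; only the two vcgt.u8 compares are visible. */
static uint8_t hev_mask(uint8_t thresh, const uint8_t px[8]) {
    assert(px != NULL);
    return (abs_diff_u8(px[2], px[3]) > thresh ||
            abs_diff_u8(px[5], px[4]) > thresh) ? 0xFF : 0x00;
}
```

With flimit = 10 and limit = 5 the bound is 25, so a flat row passes the mask while a 20-level step at the edge (edge term 50) does not.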
--- a/vp8/common/arm/neon/loopfiltersimplehorizontaledge_neon.asm
+++ b/vp8/common/arm/neon/loopfiltersimplehorizontaledge_neon.asm
@@ -9,109 +9,107 @@
 ;


-    ;EXPORT  |vp8_loop_filter_simple_horizontal_edge_neon|
-    EXPORT  |vp8_loop_filter_bhs_neon|
-    EXPORT  |vp8_loop_filter_mbhs_neon|
+    EXPORT  |vp8_loop_filter_simple_horizontal_edge_neon|
    ARM
+    REQUIRE8
    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2
-
-; r0    unsigned char *s, PRESERVE
-; r1    int p, PRESERVE
-; q1    limit, PRESERVE
+;Note: flimit, limit, and thresh should be positive numbers. All 16 elements in flimit
+;are equal. So, in the code, only one load is needed
+;for flimit. Same way applies to limit and thresh.
+; r0    unsigned char *s,
+; r1    int p, //pitch
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; stack(r4) const signed char *thresh,
+; //stack(r5)   int count --unused

 |vp8_loop_filter_simple_horizontal_edge_neon| PROC
+    sub         r0, r0, r1, lsl #1          ; move src pointer down by 2 lines

-    sub         r3, r0, r1, lsl #1          ; move src pointer down by 2 lines
-
-    vld1.u8     {q7}, [r0@128], r1          ; q0
-    vld1.u8     {q5}, [r3@128], r1          ; p0
-    vld1.u8     {q8}, [r0@128]              ; q1
-    vld1.u8     {q6}, [r3@128]              ; p1
+    ldr         r12, _lfhy_coeff_
+    vld1.u8     {q5}, [r0], r1              ; p1
+    vld1.s8     {d2[], d3[]}, [r2]          ; flimit
+    vld1.s8     {d26[], d27[]}, [r3]        ; limit -> q13
+    vld1.u8     {q6}, [r0], r1              ; p0
+    vld1.u8     {q0}, [r12]!                ; 0x80
+    vld1.u8     {q7}, [r0], r1              ; q0
+    vld1.u8     {q10}, [r12]!               ; 0x03
+    vld1.u8     {q8}, [r0]                  ; q1

+    ;vp8_filter_mask() function
    vabd.u8     q15, q6, q7                 ; abs(p0 - q0)
    vabd.u8     q14, q5, q8                 ; abs(p1 - q1)
-
    vqadd.u8    q15, q15, q15               ; abs(p0 - q0) * 2
    vshr.u8     q14, q14, #1                ; abs(p1 - q1) / 2
-    vmov.u8     q0, #0x80                   ; 0x80
-    vmov.s16    q13, #3
    vqadd.u8    q15, q15, q14               ; abs(p0 - q0) * 2 + abs(p1 - q1) / 2

+    ;vp8_filter() function
    veor        q7, q7, q0                  ; qs0: q0 offset to convert to a signed value
    veor        q6, q6, q0                  ; ps0: p0 offset to convert to a signed value
    veor        q5, q5, q0                  ; ps1: p1 offset to convert to a signed value
    veor        q8, q8, q0                  ; qs1: q1 offset to convert to a signed value

-    vcge.u8     q15, q1, q15                ; (abs(p0 - q0)*2 + abs(p1-q1)/2 > limit)*-1
+    vadd.u8     q1, q1, q1                  ; flimit * 2
+    vadd.u8     q1, q1, q13                 ; flimit * 2 + limit
+    vcge.u8     q15, q1, q15                ; (abs(p0 - q0)*2 + abs(p1-q1)/2 > flimit*2 + limit)*-1

+;;;;;;;;;;
+    ;vqsub.s8   q2, q7, q6                  ; ( qs0 - ps0)
    vsubl.s8    q2, d14, d12                ; ( qs0 - ps0)
    vsubl.s8    q3, d15, d13

    vqsub.s8    q4, q5, q8                  ; q4: vp8_filter = vp8_signed_char_clamp(ps1-qs1)

-    vmul.s16    q2, q2, q13                 ;  3 * ( qs0 - ps0)
-    vmul.s16    q3, q3, q13
+    ;vmul.i8    q2, q2, q10                 ;  3 * ( qs0 - ps0)
+    vadd.s16    q11, q2, q2                 ;  3 * ( qs0 - ps0)
+    vadd.s16    q12, q3, q3

-    vmov.u8     q10, #0x03                  ; 0x03
-    vmov.u8     q9, #0x04                   ; 0x04
+    vld1.u8     {q9}, [r12]!                ; 0x04
+
+    vadd.s16    q2, q2, q11
+    vadd.s16    q3, q3, q12

    vaddw.s8    q2, q2, d8                  ; vp8_filter + 3 * ( qs0 - ps0)
    vaddw.s8    q3, q3, d9

+    ;vqadd.s8   q4, q4, q2                  ; vp8_filter = vp8_signed_char_clamp(vp8_filter + 3 * ( qs0 - ps0))
    vqmovn.s16  d8, q2                      ; vp8_filter = vp8_signed_char_clamp(vp8_filter + 3 * ( qs0 - ps0))
    vqmovn.s16  d9, q3
+;;;;;;;;;;;;;

-    vand        q14, q4, q15                ; vp8_filter &= mask
+    vand        q4, q4, q15                 ; vp8_filter &= mask

-    vqadd.s8    q2, q14, q10                ; Filter2 = vp8_signed_char_clamp(vp8_filter+3)
-    vqadd.s8    q3, q14, q9                 ; Filter1 = vp8_signed_char_clamp(vp8_filter+4)
+    vqadd.s8    q2, q4, q10                 ; Filter2 = vp8_signed_char_clamp(vp8_filter+3)
+    vqadd.s8    q4, q4, q9                  ; Filter1 = vp8_signed_char_clamp(vp8_filter+4)
    vshr.s8     q2, q2, #3                  ; Filter2 >>= 3
-    vshr.s8     q4, q3, #3                  ; Filter1 >>= 3
+    vshr.s8     q4, q4, #3                  ; Filter1 >>= 3

-    sub         r0, r0, r1
+    sub         r0, r0, r1, lsl #1

    ;calculate output
    vqadd.s8    q11, q6, q2                 ; u = vp8_signed_char_clamp(ps0 + Filter2)
    vqsub.s8    q10, q7, q4                 ; u = vp8_signed_char_clamp(qs0 - Filter1)

+    add         r3, r0, r1
+
    veor        q6, q11, q0                 ; *op0 = u^0x80
    veor        q7, q10, q0                 ; *oq0 = u^0x80

-    vst1.u8     {q6}, [r3@128]              ; store op0
-    vst1.u8     {q7}, [r0@128]              ; store oq0
+    vst1.u8     {q6}, [r0]                  ; store op0
+    vst1.u8     {q7}, [r3]                  ; store oq0

    bx          lr
    ENDP        ; |vp8_loop_filter_simple_horizontal_edge_neon|

-; r0    unsigned char *y
-; r1    int ystride
-; r2    const unsigned char *blimit
+;-----------------

-|vp8_loop_filter_bhs_neon| PROC
-    push        {r4, lr}
-    ldrb        r3, [r2]                    ; load blim from mem
-    vdup.s8     q1, r3                      ; duplicate blim
-
-    add         r0, r0, r1, lsl #2          ; src = y_ptr + 4 * y_stride
-    bl          vp8_loop_filter_simple_horizontal_edge_neon
-    ; vp8_loop_filter_simple_horizontal_edge_neon preserves r0, r1 and q1
-    add         r0, r0, r1, lsl #2          ; src = y_ptr + 8* y_stride
-    bl          vp8_loop_filter_simple_horizontal_edge_neon
-    add         r0, r0, r1, lsl #2          ; src = y_ptr + 12 * y_stride
-    pop         {r4, lr}
-    b           vp8_loop_filter_simple_horizontal_edge_neon
-    ENDP        ;|vp8_loop_filter_bhs_neon|
-
-; r0    unsigned char *y
-; r1    int ystride
-; r2    const unsigned char *blimit
-
-|vp8_loop_filter_mbhs_neon| PROC
-    ldrb        r3, [r2]                   ; load blim from mem
-    vdup.s8     q1, r3                     ; duplicate mblim
-    b           vp8_loop_filter_simple_horizontal_edge_neon
-    ENDP        ;|vp8_loop_filter_bhs_neon|
+_lfhy_coeff_
+    DCD     lfhy_coeff
+lfhy_coeff
+    DCD     0x80808080, 0x80808080, 0x80808080, 0x80808080
+    DCD     0x03030303, 0x03030303, 0x03030303, 0x03030303
+    DCD     0x04040404, 0x04040404, 0x04040404, 0x04040404

    END
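
The per-lane arithmetic of vp8_loop_filter_simple_horizontal_edge_neon, as described by the comments in the added lines above, can be sketched in scalar C roughly as follows. This is a reading of the comments, not code taken from libvpx: clamp_s8, floor_shr3 and simple_loop_filter_px are hypothetical names, and subtracting 128 stands in for the XOR with 0x80 followed by a signed reinterpretation (the two give the same value).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the signed 8-bit range (vp8_signed_char_clamp in the comments). */
static int clamp_s8(int v) {
    return v < -128 ? -128 : (v > 127 ? 127 : v);
}

/* Arithmetic shift right by 3 (vshr.s8 #3), written as floor division so the
 * result does not depend on how the compiler shifts negative values. */
static int floor_shr3(int v) {
    return v >= 0 ? v >> 3 : -((-v + 7) >> 3);
}

/* Filters one pixel column across a horizontal edge; only p0 and q0 change. */
static void simple_loop_filter_px(uint8_t flimit, uint8_t limit,
                                  uint8_t p1, uint8_t *p0,
                                  uint8_t *q0, uint8_t q1) {
    assert(p0 != NULL && q0 != NULL);

    /* Mask: abs(p0 - q0) * 2 + abs(p1 - q1) / 2, saturating like vqadd.u8. */
    int d0 = *p0 > *q0 ? *p0 - *q0 : *q0 - *p0;
    int d1 = p1 > q1 ? p1 - q1 : q1 - p1;
    int edge = d0 * 2 > 255 ? 255 : d0 * 2;
    edge = edge + d1 / 2 > 255 ? 255 : edge + d1 / 2;
    int bound = (flimit * 2 + limit) & 0xFF;    /* vadd.u8 wraps modulo 256 */
    int mask = bound >= edge;                   /* vcge.u8 */

    /* Convert to signed: same value as (u ^ 0x80) read as a signed char. */
    int ps1 = (int)p1 - 128, ps0 = (int)*p0 - 128;
    int qs0 = (int)*q0 - 128, qs1 = (int)q1 - 128;

    int filter = clamp_s8(ps1 - qs1);
    filter = clamp_s8(filter + 3 * (qs0 - ps0));
    if (!mask) filter = 0;                      /* vp8_filter &= mask */

    int filter2 = floor_shr3(clamp_s8(filter + 3));
    int filter1 = floor_shr3(clamp_s8(filter + 4));

    /* Convert back to unsigned: adding 128 undoes the earlier offset. */
    *p0 = (uint8_t)(clamp_s8(ps0 + filter2) + 128);
    *q0 = (uint8_t)(clamp_s8(qs0 - filter1) + 128);
}
```

For p1 = p0 = 100 and q0 = q1 = 110 with flimit = 20 and limit = 10, the sketch moves p0 to 102 and q0 to 107; with flimit = 0 and limit = 0 the mask fails and both pixels are left alone.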
--- a/vp8/common/arm/neon/loopfiltersimpleverticaledge_neon.asm
+++ b/vp8/common/arm/neon/loopfiltersimpleverticaledge_neon.asm
@@ -9,54 +9,60 @@
 ;


-    ;EXPORT  |vp8_loop_filter_simple_vertical_edge_neon|
-    EXPORT |vp8_loop_filter_bvs_neon|
-    EXPORT |vp8_loop_filter_mbvs_neon|
+    EXPORT  |vp8_loop_filter_simple_vertical_edge_neon|
    ARM
+    REQUIRE8
    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2
-
-; r0    unsigned char *s, PRESERVE
-; r1    int p, PRESERVE
-; q1    limit, PRESERVE
+;Note: flimit, limit, and thresh should be positive numbers. All 16 elements in flimit
+;are equal. So, in the code, only one load is needed
+;for flimit. Same way applies to limit and thresh.
+; r0    unsigned char *s,
+; r1    int p, //pitch
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; stack(r4) const signed char *thresh,
+; //stack(r5)   int count --unused

 |vp8_loop_filter_simple_vertical_edge_neon| PROC
    sub         r0, r0, #2                  ; move src pointer down by 2 columns
-    add         r12, r1, r1
-    add         r3, r0, r1

-    vld4.8      {d6[0], d7[0], d8[0], d9[0]}, [r0], r12
-    vld4.8      {d6[1], d7[1], d8[1], d9[1]}, [r3], r12
-    vld4.8      {d6[2], d7[2], d8[2], d9[2]}, [r0], r12
-    vld4.8      {d6[3], d7[3], d8[3], d9[3]}, [r3], r12
-    vld4.8      {d6[4], d7[4], d8[4], d9[4]}, [r0], r12
-    vld4.8      {d6[5], d7[5], d8[5], d9[5]}, [r3], r12
-    vld4.8      {d6[6], d7[6], d8[6], d9[6]}, [r0], r12
-    vld4.8      {d6[7], d7[7], d8[7], d9[7]}, [r3], r12
+    vld4.8      {d6[0], d7[0], d8[0], d9[0]}, [r0], r1
+    vld1.s8     {d2[], d3[]}, [r2]          ; flimit
+    vld1.s8     {d26[], d27[]}, [r3]        ; limit -> q13
+    vld4.8      {d6[1], d7[1], d8[1], d9[1]}, [r0], r1
+    ldr         r12, _vlfy_coeff_
+    vld4.8      {d6[2], d7[2], d8[2], d9[2]}, [r0], r1
+    vld4.8      {d6[3], d7[3], d8[3], d9[3]}, [r0], r1
+    vld4.8      {d6[4], d7[4], d8[4], d9[4]}, [r0], r1
+    vld4.8      {d6[5], d7[5], d8[5], d9[5]}, [r0], r1
+    vld4.8      {d6[6], d7[6], d8[6], d9[6]}, [r0], r1
+    vld4.8      {d6[7], d7[7], d8[7], d9[7]}, [r0], r1

-    vld4.8      {d10[0], d11[0], d12[0], d13[0]}, [r0], r12
-    vld4.8      {d10[1], d11[1], d12[1], d13[1]}, [r3], r12
-    vld4.8      {d10[2], d11[2], d12[2], d13[2]}, [r0], r12
-    vld4.8      {d10[3], d11[3], d12[3], d13[3]}, [r3], r12
-    vld4.8      {d10[4], d11[4], d12[4], d13[4]}, [r0], r12
-    vld4.8      {d10[5], d11[5], d12[5], d13[5]}, [r3], r12
-    vld4.8      {d10[6], d11[6], d12[6], d13[6]}, [r0], r12
-    vld4.8      {d10[7], d11[7], d12[7], d13[7]}, [r3]
+    vld4.8      {d10[0], d11[0], d12[0], d13[0]}, [r0], r1
+    vld1.u8     {q0}, [r12]!                ; 0x80
+    vld4.8      {d10[1], d11[1], d12[1], d13[1]}, [r0], r1
+    vld1.u8     {q11}, [r12]!               ; 0x03
+    vld4.8      {d10[2], d11[2], d12[2], d13[2]}, [r0], r1
+    vld1.u8     {q12}, [r12]!               ; 0x04
+    vld4.8      {d10[3], d11[3], d12[3], d13[3]}, [r0], r1
+    vld4.8      {d10[4], d11[4], d12[4], d13[4]}, [r0], r1
+    vld4.8      {d10[5], d11[5], d12[5], d13[5]}, [r0], r1
+    vld4.8      {d10[6], d11[6], d12[6], d13[6]}, [r0], r1
+    vld4.8      {d10[7], d11[7], d12[7], d13[7]}, [r0], r1

    vswp        d7, d10
    vswp        d12, d9
+    ;vswp       q4, q5                      ; p1:q3, p0:q5, q0:q4, q1:q6

    ;vp8_filter_mask() function
    ;vp8_hevmask() function
    sub         r0, r0, r1, lsl #4
    vabd.u8     q15, q5, q4                 ; abs(p0 - q0)
    vabd.u8     q14, q3, q6                 ; abs(p1 - q1)
-
    vqadd.u8    q15, q15, q15               ; abs(p0 - q0) * 2
    vshr.u8     q14, q14, #1                ; abs(p1 - q1) / 2
-    vmov.u8     q0, #0x80                   ; 0x80
-    vmov.s16    q11, #3
    vqadd.u8    q15, q15, q14               ; abs(p0 - q0) * 2 + abs(p1 - q1) / 2

    veor        q4, q4, q0                  ; qs0: q0 offset to convert to a signed value
@@ -64,91 +70,87 @@
    veor        q3, q3, q0                  ; ps1: p1 offset to convert to a signed value
    veor        q6, q6, q0                  ; qs1: q1 offset to convert to a signed value

+    vadd.u8     q1, q1, q1                  ; flimit * 2
+    vadd.u8     q1, q1, q13                 ; flimit * 2 + limit
    vcge.u8     q15, q1, q15                ; (abs(p0 - q0)*2 + abs(p1-q1)/2 > flimit*2 + limit)*-1

+    ;vp8_filter() function
+;;;;;;;;;;
+    ;vqsub.s8   q2, q5, q4                  ; ( qs0 - ps0)
    vsubl.s8    q2, d8, d10                 ; ( qs0 - ps0)
    vsubl.s8    q13, d9, d11

-    vqsub.s8    q14, q3, q6                  ; vp8_filter = vp8_signed_char_clamp(ps1-qs1)
+    vqsub.s8    q1, q3, q6                  ; vp8_filter = vp8_signed_char_clamp(ps1-qs1)

-    vmul.s16    q2, q2, q11                 ;  3 * ( qs0 - ps0)
-    vmul.s16    q13, q13, q11
+    ;vmul.i8    q2, q2, q11                 ; vp8_filter = vp8_signed_char_clamp(vp8_filter + 3 * ( qs0 - ps0))
+    vadd.s16    q10, q2, q2                 ;  3 * ( qs0 - ps0)
+    vadd.s16    q14, q13, q13
+    vadd.s16    q2, q2, q10
+    vadd.s16    q13, q13, q14

-    vmov.u8     q11, #0x03                  ; 0x03
-    vmov.u8     q12, #0x04                  ; 0x04
+    ;vqadd.s8   q1, q1, q2
+    vaddw.s8    q2, q2, d2                  ; vp8_filter + 3 * ( qs0 - ps0)
+    vaddw.s8    q13, q13, d3

-    vaddw.s8    q2, q2, d28                  ; vp8_filter + 3 * ( qs0 - ps0)
-    vaddw.s8    q13, q13, d29
-
-    vqmovn.s16  d28, q2                      ; vp8_filter = vp8_signed_char_clamp(vp8_filter + 3 * ( qs0 - ps0))
-    vqmovn.s16  d29, q13
+    vqmovn.s16  d2, q2                      ; vp8_filter = vp8_signed_char_clamp(vp8_filter + 3 * ( qs0 - ps0))
+    vqmovn.s16  d3, q13

    add         r0, r0, #1
-    add         r3, r0, r1
+    add         r2, r0, r1
+;;;;;;;;;;;

-    vand        q14, q14, q15                 ; vp8_filter &= mask
+    vand        q1, q1, q15                 ; vp8_filter &= mask

-    vqadd.s8    q2, q14, q11                 ; Filter2 = vp8_signed_char_clamp(vp8_filter+3)
-    vqadd.s8    q3, q14, q12                 ; Filter1 = vp8_signed_char_clamp(vp8_filter+4)
+    vqadd.s8    q2, q1, q11                 ; Filter2 = vp8_signed_char_clamp(vp8_filter+3)
+    vqadd.s8    q1, q1, q12                 ; Filter1 = vp8_signed_char_clamp(vp8_filter+4)
    vshr.s8     q2, q2, #3                  ; Filter2 >>= 3
-    vshr.s8     q14, q3, #3                  ; Filter1 >>= 3
+    vshr.s8     q1, q1, #3                  ; Filter1 >>= 3

    ;calculate output
+    vqsub.s8    q10, q4, q1                 ; u = vp8_signed_char_clamp(qs0 - Filter1)
    vqadd.s8    q11, q5, q2                 ; u = vp8_signed_char_clamp(ps0 + Filter2)
-    vqsub.s8    q10, q4, q14                 ; u = vp8_signed_char_clamp(qs0 - Filter1)

-    veor        q6, q11, q0                 ; *op0 = u^0x80
    veor        q7, q10, q0                 ; *oq0 = u^0x80
-    add         r12, r1, r1
+    veor        q6, q11, q0                 ; *op0 = u^0x80
+
+    add         r3, r2, r1
    vswp        d13, d14
+    add         r12, r3, r1

    ;store op1, op0, oq0, oq1
-    vst2.8      {d12[0], d13[0]}, [r0], r12
-    vst2.8      {d12[1], d13[1]}, [r3], r12
-    vst2.8      {d12[2], d13[2]}, [r0], r12
-    vst2.8      {d12[3], d13[3]}, [r3], r12
-    vst2.8      {d12[4], d13[4]}, [r0], r12
-    vst2.8      {d12[5], d13[5]}, [r3], r12
-    vst2.8      {d12[6], d13[6]}, [r0], r12
-    vst2.8      {d12[7], d13[7]}, [r3], r12
-    vst2.8      {d14[0], d15[0]}, [r0], r12
-    vst2.8      {d14[1], d15[1]}, [r3], r12
-    vst2.8      {d14[2], d15[2]}, [r0], r12
-    vst2.8      {d14[3], d15[3]}, [r3], r12
-    vst2.8      {d14[4], d15[4]}, [r0], r12
-    vst2.8      {d14[5], d15[5]}, [r3], r12
-    vst2.8      {d14[6], d15[6]}, [r0], r12
-    vst2.8      {d14[7], d15[7]}, [r3]
+    vst2.8      {d12[0], d13[0]}, [r0]
+    vst2.8      {d12[1], d13[1]}, [r2]
+    vst2.8      {d12[2], d13[2]}, [r3]
+    vst2.8      {d12[3], d13[3]}, [r12], r1
+    add         r0, r12, r1
+    vst2.8      {d12[4], d13[4]}, [r12]
+    vst2.8      {d12[5], d13[5]}, [r0], r1
+    add         r2, r0, r1
+    vst2.8      {d12[6], d13[6]}, [r0]
+    vst2.8      {d12[7], d13[7]}, [r2], r1
+    add         r3, r2, r1
+    vst2.8      {d14[0], d15[0]}, [r2]
+    vst2.8      {d14[1], d15[1]}, [r3], r1
+    add         r12, r3, r1
+    vst2.8      {d14[2], d15[2]}, [r3]
+    vst2.8      {d14[3], d15[3]}, [r12], r1
+    add         r0, r12, r1
+    vst2.8      {d14[4], d15[4]}, [r12]
+    vst2.8      {d14[5], d15[5]}, [r0], r1
+    add         r2, r0, r1
+    vst2.8      {d14[6], d15[6]}, [r0]
+    vst2.8      {d14[7], d15[7]}, [r2]

    bx          lr
    ENDP        ; |vp8_loop_filter_simple_vertical_edge_neon|

-; r0    unsigned char *y
-; r1    int ystride
-; r2    const unsigned char *blimit
+;-----------------

-|vp8_loop_filter_bvs_neon| PROC
-    push        {r4, lr}
-    ldrb        r3, [r2]                   ; load blim from mem
-    mov         r4, r0
-    add         r0, r0, #4
-    vdup.s8     q1, r3                     ; duplicate blim
-    bl          vp8_loop_filter_simple_vertical_edge_neon
-    ; vp8_loop_filter_simple_vertical_edge_neon preserves  r1 and q1
-    add         r0, r4, #8
-    bl          vp8_loop_filter_simple_vertical_edge_neon
-    add         r0, r4, #12
-    pop         {r4, lr}
-    b           vp8_loop_filter_simple_vertical_edge_neon
-    ENDP        ;|vp8_loop_filter_bvs_neon|
+_vlfy_coeff_
+    DCD     vlfy_coeff
+vlfy_coeff
+    DCD     0x80808080, 0x80808080, 0x80808080, 0x80808080
+    DCD     0x03030303, 0x03030303, 0x03030303, 0x03030303
+    DCD     0x04040404, 0x04040404, 0x04040404, 0x04040404

-; r0    unsigned char *y
-; r1    int ystride
-; r2    const unsigned char *blimit
-
-|vp8_loop_filter_mbvs_neon| PROC
-    ldrb        r3, [r2]                   ; load mblim from mem
-    vdup.s8     q1, r3                     ; duplicate mblim
-    b           vp8_loop_filter_simple_vertical_edge_neon
-    ENDP        ;|vp8_loop_filter_bvs_neon|
    END
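
Both simple-filter files keep a commented-out 8-bit sequence (vqsub.s8, vmul.i8, vqadd.s8) next to the widened one that is actually assembled (vsubl.s8, two vadd.s16, vaddw.s8, vqmovn.s16). The diff does not say why the 8-bit form was dropped; one plausible reading is that it loses precision on large steps. The sketch below compares the two paths in scalar C. The function names are hypothetical, and the 8-bit path is only a guess at what the commented-out instructions would have computed.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the signed 8-bit range, as vqmovn.s16 / vqadd.s8 / vqsub.s8 do. */
static int clamp_s8(int v) {
    return v < -128 ? -128 : (v > 127 ? 127 : v);
}

/* Wrap to signed 8 bits, as a non-saturating 8-bit multiply would. */
static int wrap_s8(int v) {
    int m = ((v % 256) + 256) % 256;
    return m >= 128 ? m - 256 : m;
}

/* Widened path: the difference and its tripling are exact in 16 bits,
 * and saturation happens once, at the final narrowing step. */
static int filter_wide(int filter, int ps0, int qs0) {
    assert(ps0 >= -128 && ps0 <= 127 && qs0 >= -128 && qs0 <= 127);
    int d = qs0 - ps0;              /* vsubl.s8 */
    int d2 = d + d;                 /* vadd.s16 */
    int d3 = d + d2;                /* vadd.s16: 3 * (qs0 - ps0) */
    return clamp_s8(d3 + filter);   /* vaddw.s8 then vqmovn.s16 */
}

/* Assumed 8-bit path from the commented-out instructions. */
static int filter_narrow(int filter, int ps0, int qs0) {
    assert(ps0 >= -128 && ps0 <= 127 && qs0 >= -128 && qs0 <= 127);
    int d = clamp_s8(qs0 - ps0);    /* vqsub.s8 */
    int d3 = wrap_s8(3 * d);        /* vmul.i8 wraps instead of saturating */
    return clamp_s8(filter + d3);   /* vqadd.s8 */
}
```

For small differences the two agree (ps0 = -28, qs0 = -18, filter = -10 gives 20 either way). For ps0 = -20, qs0 = 40, filter = 0, the widened path saturates to 127 while the 8-bit path wraps 180 around to -76.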
--- a/vp8/common/arm/neon/mbloopfilter_neon.asm
+++ b/vp8/common/arm/neon/mbloopfilter_neon.asm
@@ -14,143 +14,155 @@
    EXPORT  |vp8_mbloop_filter_vertical_edge_y_neon|
    EXPORT  |vp8_mbloop_filter_vertical_edge_uv_neon|
    ARM
+    REQUIRE8
+    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2

+; flimit, limit, and thresh should be positive numbers.
+; All 16 elements in these variables are equal.
+
 ; void vp8_mbloop_filter_horizontal_edge_y_neon(unsigned char *src, int pitch,
-;                                               const unsigned char *blimit,
-;                                               const unsigned char *limit,
-;                                               const unsigned char *thresh)
+;                                               const signed char *flimit,
+;                                               const signed char *limit,
+;                                               const signed char *thresh,
+;                                               int count)
 ; r0    unsigned char *src,
 ; r1    int pitch,
-; r2    unsigned char blimit
-; r3    unsigned char limit
-; sp    unsigned char thresh,
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; sp    const signed char *thresh,
+; sp+4  int count (unused)
 |vp8_mbloop_filter_horizontal_edge_y_neon| PROC
-    push        {lr}
-    add         r1, r1, r1                  ; double stride
-    ldr         r12, [sp, #4]               ; load thresh
-    sub         r0, r0, r1, lsl #1          ; move src pointer down by 4 lines
-    vdup.u8     q2, r12                     ; thresh
-    add         r12, r0, r1,  lsr #1        ; move src pointer up by 1 line
+    stmdb       sp!, {lr}
+    sub         r0, r0, r1, lsl #2          ; move src pointer down by 4 lines
+    ldr         r12, [sp, #4]               ; load thresh pointer

-    vld1.u8     {q3}, [r0@128], r1              ; p3
-    vld1.u8     {q4}, [r12@128], r1             ; p2
-    vld1.u8     {q5}, [r0@128], r1              ; p1
-    vld1.u8     {q6}, [r12@128], r1             ; p0
-    vld1.u8     {q7}, [r0@128], r1              ; q0
-    vld1.u8     {q8}, [r12@128], r1             ; q1
-    vld1.u8     {q9}, [r0@128], r1              ; q2
-    vld1.u8     {q10}, [r12@128], r1            ; q3
-
-    bl          vp8_mbloop_filter_neon
-
-    sub         r12, r12, r1, lsl #2
-    add         r0, r12, r1, lsr #1
-
-    vst1.u8     {q4}, [r12@128],r1         ; store op2
-    vst1.u8     {q5}, [r0@128],r1          ; store op1
-    vst1.u8     {q6}, [r12@128], r1        ; store op0
-    vst1.u8     {q7}, [r0@128],r1          ; store oq0
-    vst1.u8     {q8}, [r12@128]            ; store oq1
-    vst1.u8     {q9}, [r0@128]             ; store oq2
-
-    pop         {pc}
-    ENDP        ; |vp8_mbloop_filter_horizontal_edge_y_neon|
-
-; void vp8_mbloop_filter_horizontal_edge_uv_neon(unsigned char *u, int pitch,
-;                                                const unsigned char *blimit,
-;                                                const unsigned char *limit,
-;                                                const unsigned char *thresh,
-;                                                unsigned char *v)
-; r0    unsigned char *u,
-; r1    int pitch,
-; r2    unsigned char blimit
-; r3    unsigned char limit
-; sp    unsigned char thresh,
-; sp+4  unsigned char *v
-
-|vp8_mbloop_filter_horizontal_edge_uv_neon| PROC
-    push        {lr}
-    ldr         r12, [sp, #4]                 ; load thresh
-    sub         r0, r0, r1, lsl #2            ; move u pointer down by 4 lines
-    vdup.u8     q2, r12                       ; thresh
-    ldr         r12, [sp, #8]                 ; load v ptr
-    sub         r12, r12, r1, lsl #2          ; move v pointer down by 4 lines
-
-    vld1.u8     {d6}, [r0@64], r1              ; p3
-    vld1.u8     {d7}, [r12@64], r1              ; p3
-    vld1.u8     {d8}, [r0@64], r1              ; p2
-    vld1.u8     {d9}, [r12@64], r1              ; p2
-    vld1.u8     {d10}, [r0@64], r1             ; p1
-    vld1.u8     {d11}, [r12@64], r1             ; p1
-    vld1.u8     {d12}, [r0@64], r1             ; p0
-    vld1.u8     {d13}, [r12@64], r1             ; p0
-    vld1.u8     {d14}, [r0@64], r1             ; q0
-    vld1.u8     {d15}, [r12@64], r1             ; q0
-    vld1.u8     {d16}, [r0@64], r1             ; q1
-    vld1.u8     {d17}, [r12@64], r1             ; q1
-    vld1.u8     {d18}, [r0@64], r1             ; q2
-    vld1.u8     {d19}, [r12@64], r1             ; q2
-    vld1.u8     {d20}, [r0@64], r1             ; q3
-    vld1.u8     {d21}, [r12@64], r1             ; q3
+    vld1.u8     {q3}, [r0], r1              ; p3
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit
+    vld1.u8     {q4}, [r0], r1              ; p2
+    vld1.s8     {d4[], d5[]}, [r12]         ; thresh
+    vld1.u8     {q5}, [r0], r1              ; p1
+    vld1.u8     {q6}, [r0], r1              ; p0
+    vld1.u8     {q7}, [r0], r1              ; q0
+    vld1.u8     {q8}, [r0], r1              ; q1
+    vld1.u8     {q9}, [r0], r1              ; q2
+    vld1.u8     {q10}, [r0], r1             ; q3

    bl          vp8_mbloop_filter_neon

    sub         r0, r0, r1, lsl #3
-    sub         r12, r12, r1, lsl #3
+    add         r0, r0, r1
+    add         r2, r0, r1
+    add         r3, r2, r1
+
+    vst1.u8     {q4}, [r0]                  ; store op2
+    vst1.u8     {q5}, [r2]                  ; store op1
+    vst1.u8     {q6}, [r3], r1              ; store op0
+    add         r12, r3, r1
+    vst1.u8     {q7}, [r3]                  ; store oq0
+    vst1.u8     {q8}, [r12], r1             ; store oq1
+    vst1.u8     {q9}, [r12]             ; store oq2
+
+    ldmia       sp!, {pc}
+    ENDP        ; |vp8_mbloop_filter_horizontal_edge_y_neon|
+
+; void vp8_mbloop_filter_horizontal_edge_uv_neon(unsigned char *u, int pitch,
+;                                                const signed char *flimit,
+;                                                const signed char *limit,
+;                                                const signed char *thresh,
+;                                                unsigned char *v)
+; r0    unsigned char *u,
+; r1    int pitch,
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; sp    const signed char *thresh,
+; sp+4  unsigned char *v
+|vp8_mbloop_filter_horizontal_edge_uv_neon| PROC
+    stmdb       sp!, {lr}
+    sub         r0, r0, r1, lsl #2          ; move u pointer down by 4 lines
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit
+    ldr         r3, [sp, #8]                ; load v ptr
+    ldr         r12, [sp, #4]               ; load thresh pointer
+    sub         r3, r3, r1, lsl #2          ; move v pointer down by 4 lines
+
+    vld1.u8     {d6}, [r0], r1              ; p3
+    vld1.u8     {d7}, [r3], r1              ; p3
+    vld1.u8     {d8}, [r0], r1              ; p2
+    vld1.u8     {d9}, [r3], r1              ; p2
+    vld1.u8     {d10}, [r0], r1             ; p1
+    vld1.u8     {d11}, [r3], r1             ; p1
+    vld1.u8     {d12}, [r0], r1             ; p0
+    vld1.u8     {d13}, [r3], r1             ; p0
+    vld1.u8     {d14}, [r0], r1             ; q0
+    vld1.u8     {d15}, [r3], r1             ; q0
+    vld1.u8     {d16}, [r0], r1             ; q1
+    vld1.u8     {d17}, [r3], r1             ; q1
+    vld1.u8     {d18}, [r0], r1             ; q2
+    vld1.u8     {d19}, [r3], r1             ; q2
+    vld1.u8     {d20}, [r0], r1             ; q3
+    vld1.u8     {d21}, [r3], r1             ; q3
+
+    vld1.s8     {d4[], d5[]}, [r12]         ; thresh
+
+    bl          vp8_mbloop_filter_neon
+
+    sub         r0, r0, r1, lsl #3
+    sub         r3, r3, r1, lsl #3

    add         r0, r0, r1
-    add         r12, r12, r1
+    add         r3, r3, r1

-    vst1.u8     {d8}, [r0@64], r1              ; store u op2
-    vst1.u8     {d9}, [r12@64], r1              ; store v op2
-    vst1.u8     {d10}, [r0@64], r1             ; store u op1
-    vst1.u8     {d11}, [r12@64], r1             ; store v op1
-    vst1.u8     {d12}, [r0@64], r1             ; store u op0
-    vst1.u8     {d13}, [r12@64], r1             ; store v op0
-    vst1.u8     {d14}, [r0@64], r1             ; store u oq0
-    vst1.u8     {d15}, [r12@64], r1             ; store v oq0
-    vst1.u8     {d16}, [r0@64], r1             ; store u oq1
-    vst1.u8     {d17}, [r12@64], r1             ; store v oq1
-    vst1.u8     {d18}, [r0@64], r1             ; store u oq2
-    vst1.u8     {d19}, [r12@64], r1             ; store v oq2
+    vst1.u8     {d8}, [r0], r1              ; store u op2
+    vst1.u8     {d9}, [r3], r1              ; store v op2
+    vst1.u8     {d10}, [r0], r1             ; store u op1
+    vst1.u8     {d11}, [r3], r1             ; store v op1
+    vst1.u8     {d12}, [r0], r1             ; store u op0
+    vst1.u8     {d13}, [r3], r1             ; store v op0
+    vst1.u8     {d14}, [r0], r1             ; store u oq0
+    vst1.u8     {d15}, [r3], r1             ; store v oq0
+    vst1.u8     {d16}, [r0], r1             ; store u oq1
+    vst1.u8     {d17}, [r3], r1             ; store v oq1
+    vst1.u8     {d18}, [r0], r1             ; store u oq2
+    vst1.u8     {d19}, [r3], r1             ; store v oq2

-    pop         {pc}
+    ldmia       sp!, {pc}
    ENDP        ; |vp8_mbloop_filter_horizontal_edge_uv_neon|

 ; void vp8_mbloop_filter_vertical_edge_y_neon(unsigned char *src, int pitch,
-;                                             const unsigned char *blimit,
-;                                             const unsigned char *limit,
-;                                             const unsigned char *thresh)
+;                                             const signed char *flimit,
+;                                             const signed char *limit,
+;                                             const signed char *thresh,
+;                                             int count)
 ; r0    unsigned char *src,
 ; r1    int pitch,
-; r2    unsigned char blimit
-; r3    unsigned char limit
-; sp    unsigned char thresh,
+; r2    const signed char *flimit,
+; r3    const signed char *limit,
+; sp    const signed char *thresh,
+; sp+4  int count (unused)
 |vp8_mbloop_filter_vertical_edge_y_neon| PROC
-    push        {lr}
-    ldr         r12, [sp, #4]               ; load thresh
+    stmdb       sp!, {lr}
    sub         r0, r0, #4                  ; move src pointer down by 4 columns
-    vdup.s8     q2, r12                     ; thresh
-    add         r12, r0, r1, lsl #3         ; move src pointer down by 8 lines

    vld1.u8     {d6}, [r0], r1              ; load first 8-line src data
-    vld1.u8     {d7}, [r12], r1             ; load second 8-line src data
+    ldr         r12, [sp, #4]               ; load thresh pointer
    vld1.u8     {d8}, [r0], r1
-    vld1.u8     {d9}, [r12], r1
+    sub         sp, sp, #32
    vld1.u8     {d10}, [r0], r1
-    vld1.u8     {d11}, [r12], r1
    vld1.u8     {d12}, [r0], r1
-    vld1.u8     {d13}, [r12], r1
    vld1.u8     {d14}, [r0], r1
-    vld1.u8     {d15}, [r12], r1
    vld1.u8     {d16}, [r0], r1
-    vld1.u8     {d17}, [r12], r1
    vld1.u8     {d18}, [r0], r1
-    vld1.u8     {d19}, [r12], r1
    vld1.u8     {d20}, [r0], r1
-    vld1.u8     {d21}, [r12], r1
+
+    vld1.u8     {d7}, [r0], r1              ; load second 8-line src data
+    vld1.u8     {d9}, [r0], r1
+    vld1.u8     {d11}, [r0], r1
+    vld1.u8     {d13}, [r0], r1
+    vld1.u8     {d15}, [r0], r1
+    vld1.u8     {d17}, [r0], r1
+    vld1.u8     {d19}, [r0], r1
+    vld1.u8     {d21}, [r0], r1

    ;transpose to 8x16 matrix
    vtrn.32     q3, q7
@@ -168,17 +180,29 @@
    vtrn.8      q7, q8
    vtrn.8      q9, q10

-    sub         r0, r0, r1, lsl #3
+    vld1.s8     {d4[], d5[]}, [r12]         ; thresh
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit
+    mov         r12, sp
+    vst1.u8     {q3}, [r12]!
+    vst1.u8     {q10}, [r12]!

    bl          vp8_mbloop_filter_neon

-    sub         r12, r12, r1, lsl #3
+    sub         r0, r0, r1, lsl #4
+
+    add         r2, r0, r1
+
+    add         r3, r2, r1
+
+    vld1.u8     {q3}, [sp]!
+    vld1.u8     {q10}, [sp]!

    ;transpose to 16x8 matrix
    vtrn.32     q3, q7
    vtrn.32     q4, q8
    vtrn.32     q5, q9
    vtrn.32     q6, q10
+    add         r12, r3, r1

    vtrn.16     q3, q5
    vtrn.16     q4, q6
@@ -191,30 +215,36 @@
    vtrn.8      q9, q10

    ;store op2, op1, op0, oq0, oq1, oq2
-    vst1.8      {d6}, [r0], r1
-    vst1.8      {d7}, [r12], r1
-    vst1.8      {d8}, [r0], r1
-    vst1.8      {d9}, [r12], r1
-    vst1.8      {d10}, [r0], r1
-    vst1.8      {d11}, [r12], r1
-    vst1.8      {d12}, [r0], r1
-    vst1.8      {d13}, [r12], r1
-    vst1.8      {d14}, [r0], r1
-    vst1.8      {d15}, [r12], r1
+    vst1.8      {d6}, [r0]
+    vst1.8      {d8}, [r2]
+    vst1.8      {d10}, [r3]
+    vst1.8      {d12}, [r12], r1
+    add         r0, r12, r1
+    vst1.8      {d14}, [r12]
    vst1.8      {d16}, [r0], r1
-    vst1.8      {d17}, [r12], r1
-    vst1.8      {d18}, [r0], r1
-    vst1.8      {d19}, [r12], r1
-    vst1.8      {d20}, [r0]
-    vst1.8      {d21}, [r12]
+    add         r2, r0, r1
+    vst1.8      {d18}, [r0]
+    vst1.8      {d20}, [r2], r1
+    add         r3, r2, r1
+    vst1.8      {d7}, [r2]
+    vst1.8      {d9}, [r3], r1
+    add         r12, r3, r1
+    vst1.8      {d11}, [r3]
+    vst1.8      {d13}, [r12], r1
+    add         r0, r12, r1
+    vst1.8      {d15}, [r12]
+    vst1.8      {d17}, [r0], r1
+    add         r2, r0, r1
+    vst1.8      {d19}, [r0]
+    vst1.8      {d21}, [r2]

-    pop         {pc}
+    ldmia       sp!, {pc}
    ENDP        ; |vp8_mbloop_filter_vertical_edge_y_neon|

 ; void vp8_mbloop_filter_vertical_edge_uv_neon(unsigned char *u, int pitch,
-;                                              const unsigned char *blimit,
-;                                              const unsigned char *limit,
-;                                              const unsigned char *thresh,
+;                                              const signed char *flimit,
+;                                              const signed char *limit,
+;                                              const signed char *thresh,
 ;                                              unsigned char *v)
 ; r0    unsigned char *u,
 ; r1    int pitch,
@@ -223,29 +253,30 @@
 ; sp    const signed char *thresh,
 ; sp+4  unsigned char *v
 |vp8_mbloop_filter_vertical_edge_uv_neon| PROC
-    push        {lr}
-    ldr         r12, [sp, #4]               ; load thresh
-    sub         r0, r0, #4                  ; move u pointer down by 4 columns
-    vdup.u8     q2, r12                     ; thresh
-    ldr         r12, [sp, #8]               ; load v ptr
-    sub         r12, r12, #4                ; move v pointer down by 4 columns
+    stmdb       sp!, {lr}
+    sub         r0, r0, #4                  ; move u pointer down by 4 columns
+    vld1.s8     {d2[], d3[]}, [r3]          ; limit
+    ldr         r3, [sp, #8]                ; load v ptr
+    ldr         r12, [sp, #4]               ; load thresh pointer
+
+    sub         r3, r3, #4                  ; move v pointer down by 4 columns

    vld1.u8     {d6}, [r0], r1              ;load u data
-    vld1.u8     {d7}, [r12], r1             ;load v data
+    vld1.u8     {d7}, [r3], r1              ;load v data
    vld1.u8     {d8}, [r0], r1
-    vld1.u8     {d9}, [r12], r1
+    vld1.u8     {d9}, [r3], r1
    vld1.u8     {d10}, [r0], r1
-    vld1.u8     {d11}, [r12], r1
+    vld1.u8     {d11}, [r3], r1
    vld1.u8     {d12}, [r0], r1
-    vld1.u8     {d13}, [r12], r1
+    vld1.u8     {d13}, [r3], r1
    vld1.u8     {d14}, [r0], r1
-    vld1.u8     {d15}, [r12], r1
+    vld1.u8     {d15}, [r3], r1
    vld1.u8     {d16}, [r0], r1
-    vld1.u8     {d17}, [r12], r1
+    vld1.u8     {d17}, [r3], r1
    vld1.u8     {d18}, [r0], r1
-    vld1.u8     {d19}, [r12], r1
+    vld1.u8     {d19}, [r3], r1
    vld1.u8     {d20}, [r0], r1
-    vld1.u8     {d21}, [r12], r1
+    vld1.u8     {d21}, [r3], r1

    ;transpose to 8x16 matrix
    vtrn.32     q3, q7
@@ -263,11 +294,19 @@
    vtrn.8      q7, q8
    vtrn.8      q9, q10

-    sub         r0, r0, r1, lsl #3
+    sub         sp, sp, #32
+    vld1.s8     {d4[], d5[]}, [r12]         ; thresh
+    mov         r12, sp
+    vst1.u8     {q3}, [r12]!
+    vst1.u8     {q10}, [r12]!

    bl          vp8_mbloop_filter_neon

-    sub         r12, r12, r1, lsl #3
+    sub         r0, r0, r1, lsl #3
+    sub         r3, r3, r1, lsl #3
+
+    vld1.u8     {q3}, [sp]!
+    vld1.u8     {q10}, [sp]!

    ;transpose to 16x8 matrix
    vtrn.32     q3, q7
@@ -287,23 +326,23 @@

    ;store op2, op1, op0, oq0, oq1, oq2
    vst1.8      {d6}, [r0], r1
-    vst1.8      {d7}, [r12], r1
+    vst1.8      {d7}, [r3], r1
    vst1.8      {d8}, [r0], r1
-    vst1.8      {d9}, [r12], r1
+    vst1.8      {d9}, [r3], r1
    vst1.8      {d10}, [r0], r1
-    vst1.8      {d11}, [r12], r1
+    vst1.8      {d11}, [r3], r1
    vst1.8      {d12}, [r0], r1
-    vst1.8      {d13}, [r12], r1
+    vst1.8      {d13}, [r3], r1
    vst1.8      {d14}, [r0], r1
-    vst1.8      {d15}, [r12], r1
+    vst1.8      {d15}, [r3], r1
    vst1.8      {d16}, [r0], r1
-    vst1.8      {d17}, [r12], r1
+    vst1.8      {d17}, [r3], r1
    vst1.8      {d18}, [r0], r1
-    vst1.8      {d19}, [r12], r1
-    vst1.8      {d20}, [r0]
-    vst1.8      {d21}, [r12]
+    vst1.8      {d19}, [r3], r1
+    vst1.8      {d20}, [r0], r1
+    vst1.8      {d21}, [r3], r1

-    pop         {pc}
+    ldmia       sp!, {pc}
    ENDP        ; |vp8_mbloop_filter_vertical_edge_uv_neon|

 ; void vp8_mbloop_filter_neon()
@@ -311,33 +350,41 @@
 ; functions do the necessary load, transpose (if necessary), preserve (if
 ; necessary) and store.

-; r0,r1 PRESERVE
-; r2    mblimit
-; r3    limit
+; TODO:
+; The vertical filter writes p3/q3 back out because two 4 element writes are
+; much simpler than ordering and writing two 3 element sets (or three 2 elements
+; sets, or whichever other combinations are possible).
+; If we can preserve q3 and q10, the vertical filter will be able to avoid
+; storing those values on the stack and reading them back after the filter.

+; r0,r1 PRESERVE
+; r2    flimit
+; r3    PRESERVE
+; q1    limit
 ; q2    thresh
-; q3    p3 PRESERVE
+; q3    p3
 ; q4    p2
 ; q5    p1
 ; q6    p0
 ; q7    q0
 ; q8    q1
 ; q9    q2
-; q10   q3 PRESERVE
+; q10   q3

 |vp8_mbloop_filter_neon| PROC
+    ldr         r12, _mblf_coeff_

    ; vp8_filter_mask
    vabd.u8     q11, q3, q4                 ; abs(p3 - p2)
    vabd.u8     q12, q4, q5                 ; abs(p2 - p1)
    vabd.u8     q13, q5, q6                 ; abs(p1 - p0)
    vabd.u8     q14, q8, q7                 ; abs(q1 - q0)
-    vabd.u8     q1, q9, q8                  ; abs(q2 - q1)
+    vabd.u8     q3, q9, q8                  ; abs(q2 - q1)
    vabd.u8     q0, q10, q9                 ; abs(q3 - q2)

    vmax.u8     q11, q11, q12
    vmax.u8     q12, q13, q14
-    vmax.u8     q1, q1, q0
+    vmax.u8     q3, q3, q0
    vmax.u8     q15, q11, q12

    vabd.u8     q12, q6, q7                 ; abs(p0 - q0)
@@ -345,53 +392,51 @@
    ; vp8_hevmask
    vcgt.u8     q13, q13, q2                ; (abs(p1 - p0) > thresh) * -1
    vcgt.u8     q14, q14, q2                ; (abs(q1 - q0) > thresh) * -1
-    vmax.u8     q15, q15, q1
+    vmax.u8     q15, q15, q3

-    vdup.u8     q1, r3                      ; limit
-    vdup.u8     q2, r2                      ; mblimit
+    vld1.s8     {d4[], d5[]}, [r2]          ; flimit

-    vmov.u8     q0, #0x80                   ; 0x80
+    vld1.u8     {q0}, [r12]!

+    vadd.u8     q2, q2, q2                  ; flimit * 2
+    vadd.u8     q2, q2, q1                  ; flimit * 2 +  limit
    vcge.u8     q15, q1, q15

    vabd.u8     q1, q5, q8                  ; a = abs(p1 - q1)
    vqadd.u8    q12, q12, q12               ; b = abs(p0 - q0) * 2
-    vmov.u16    q11, #3                     ; #3
+    vshr.u8     q1, q1, #1                  ; a = a / 2
+    vqadd.u8    q12, q12, q1                ; a = b + a
+    vcge.u8     q12, q2, q12                ; (a <= flimit * 2 + limit) * -1

    ; vp8_filter
    ; convert to signed
    veor        q7, q7, q0                  ; qs0
-    vshr.u8     q1, q1, #1                  ; a = a / 2
    veor        q6, q6, q0                  ; ps0
    veor        q5, q5, q0                  ; ps1
-
-    vqadd.u8    q12, q12, q1                ; a = b + a
-
    veor        q8, q8, q0                  ; qs1
    veor        q4, q4, q0                  ; ps2
    veor        q9, q9, q0                  ; qs2

    vorr        q14, q13, q14               ; vp8_hevmask

-    vcge.u8     q12, q2, q12                ; (a > flimit * 2 + limit) * -1
-
    vsubl.s8    q2, d14, d12                ; qs0 - ps0
    vsubl.s8    q13, d15, d13

    vqsub.s8    q1, q5, q8                  ; vp8_filter = clamp(ps1-qs1)

-    vmul.i16    q2, q2, q11                 ; 3 * ( qs0 - ps0)
-
+    vadd.s16    q10, q2, q2                 ; 2 * (qs0 - ps0), summed with q2 below to give 3 * (qs0 - ps0)
+    vadd.s16    q11, q13, q13
    vand        q15, q15, q12               ; vp8_filter_mask

-    vmul.i16    q13, q13, q11
+    vadd.s16    q2, q2, q10
+    vadd.s16    q13, q13, q11

-    vmov.u8     q12, #3                     ; #3
+    vld1.u8     {q12}, [r12]!               ; #3

    vaddw.s8    q2, q2, d2                  ; vp8_filter + 3 * ( qs0 - ps0)
    vaddw.s8    q13, q13, d3

-    vmov.u8     q11, #4                     ; #4
+    vld1.u8     {q11}, [r12]!               ; #4

    ; vp8_filter = clamp(vp8_filter + 3 * ( qs0 - ps0))
    vqmovn.s16  d2, q2
@@ -399,23 +444,27 @@

    vand        q1, q1, q15                 ; vp8_filter &= mask

-    vmov.u16    q15, #63                    ; #63
-
+    vld1.u8     {q15}, [r12]!               ; #63
+    ;
    vand        q13, q1, q14                ; Filter2 &= hev

+    vld1.u8     {d7}, [r12]!                ; #9
+
    vqadd.s8    q2, q13, q11                ; Filter1 = clamp(Filter2+4)
    vqadd.s8    q13, q13, q12               ; Filter2 = clamp(Filter2+3)

-    vmov        q0, q15
+    vld1.u8     {d6}, [r12]!                ; #18

    vshr.s8     q2, q2, #3                  ; Filter1 >>= 3
    vshr.s8     q13, q13, #3                ; Filter2 >>= 3

-    vmov        q11, q15
+    vmov        q10, q15
    vmov        q12, q15

    vqsub.s8    q7, q7, q2                  ; qs0 = clamp(qs0 - Filter1)

+    vld1.u8     {d5}, [r12]!                ; #27
+
    vqadd.s8    q6, q6, q13                 ; ps0 = clamp(ps0 + Filter2)

    vbic        q1, q1, q14                 ; vp8_filter &= ~hev
@@ -423,47 +472,49 @@
    ; roughly 1/7th difference across boundary
    ; roughly 2/7th difference across boundary
    ; roughly 3/7th difference across boundary
-
-    vmov.u8     d5, #9                      ; #9
-    vmov.u8     d4, #18                     ; #18
-
+    vmov        q11, q15
    vmov        q13, q15
    vmov        q14, q15

-    vmlal.s8    q0, d2, d5                  ; 63 + Filter2 * 9
-    vmlal.s8    q11, d3, d5
-    vmov.u8     d5, #27                     ; #27
-    vmlal.s8    q12, d2, d4                 ; 63 + Filter2 * 18
-    vmlal.s8    q13, d3, d4
-    vmlal.s8    q14, d2, d5                 ; 63 + Filter2 * 27
+    vmlal.s8    q10, d2, d7                 ; Filter2 * 9
+    vmlal.s8    q11, d3, d7
+    vmlal.s8    q12, d2, d6                 ; Filter2 * 18
+    vmlal.s8    q13, d3, d6
+    vmlal.s8    q14, d2, d5                 ; Filter2 * 27
    vmlal.s8    q15, d3, d5
-
-    vqshrn.s16  d0, q0, #7                  ; u = clamp((63 + Filter2 * 9)>>7)
-    vqshrn.s16  d1, q11, #7
+    vqshrn.s16  d20, q10, #7                ; u = clamp((63 + Filter2 * 9)>>7)
+    vqshrn.s16  d21, q11, #7
    vqshrn.s16  d24, q12, #7                ; u = clamp((63 + Filter2 * 18)>>7)
    vqshrn.s16  d25, q13, #7
    vqshrn.s16  d28, q14, #7                ; u = clamp((63 + Filter2 * 27)>>7)
    vqshrn.s16  d29, q15, #7

-    vmov.u8     q1, #0x80                   ; 0x80
-
-    vqsub.s8    q11, q9, q0                 ; s = clamp(qs2 - u)
-    vqadd.s8    q0, q4, q0                  ; s = clamp(ps2 + u)
+    vqsub.s8    q11, q9, q10                ; s = clamp(qs2 - u)
+    vqadd.s8    q10, q4, q10                ; s = clamp(ps2 + u)
    vqsub.s8    q13, q8, q12                ; s = clamp(qs1 - u)
    vqadd.s8    q12, q5, q12                ; s = clamp(ps1 + u)
    vqsub.s8    q15, q7, q14                ; s = clamp(qs0 - u)
    vqadd.s8    q14, q6, q14                ; s = clamp(ps0 + u)
-
-    veor        q9, q11, q1                 ; *oq2 = s^0x80
-    veor        q4, q0, q1                  ; *op2 = s^0x80
-    veor        q8, q13, q1                 ; *oq1 = s^0x80
-    veor        q5, q12, q1                 ; *op2 = s^0x80
-    veor        q7, q15, q1                 ; *oq0 = s^0x80
-    veor        q6, q14, q1                 ; *op0 = s^0x80
+    veor        q9, q11, q0                 ; *oq2 = s^0x80
+    veor        q4, q10, q0                 ; *op2 = s^0x80
+    veor        q8, q13, q0                 ; *oq1 = s^0x80
+    veor        q5, q12, q0                 ; *op1 = s^0x80
+    veor        q7, q15, q0                 ; *oq0 = s^0x80
+    veor        q6, q14, q0                 ; *op0 = s^0x80

    bx          lr
    ENDP        ; |vp8_mbloop_filter_neon|

 ;-----------------

+_mblf_coeff_
+    DCD     mblf_coeff
+mblf_coeff
+    DCD     0x80808080, 0x80808080, 0x80808080, 0x80808080
+    DCD     0x03030303, 0x03030303, 0x03030303, 0x03030303
+    DCD     0x04040404, 0x04040404, 0x04040404, 0x04040404
+    DCD     0x003f003f, 0x003f003f, 0x003f003f, 0x003f003f
+    DCD     0x09090909, 0x09090909, 0x12121212, 0x12121212
+    DCD     0x1b1b1b1b, 0x1b1b1b1b
+
    END
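
Reviewer note: the arithmetic documented in the comments of vp8_mbloop_filter_neon (filter mask, hev mask, then the 27/18/9 taps) can be sketched for a single pixel column in scalar C. This is an illustration only, not code from this patch: the function name mbloop_filter_scalar, the px[] layout, and the use of plain int arithmetic are assumptions. The NEON routine uses saturating 8-bit adds for the mask sums, which this sketch does not model, and it assumes that right-shifting a negative int is an arithmetic shift.

```c
#include <stdlib.h>

/* Saturate to the signed 8-bit range, as the NEON vq* instructions do. */
static int clamp_s8(int v)
{
    if (v < -128) return -128;
    if (v > 127) return 127;
    return v;
}

/* Hypothetical scalar model of one column.
 * px[0..7] holds p3, p2, p1, p0, q0, q1, q2, q3 across the edge. */
static void mbloop_filter_scalar(unsigned char px[8],
                                 int flimit, int limit, int thresh)
{
    int p3 = px[0], p2 = px[1], p1 = px[2], p0 = px[3];
    int q0 = px[4], q1 = px[5], q2 = px[6], q3 = px[7];
    int ps2, ps1, ps0, qs0, qs1, qs2;
    int mask, hev, filter, f1, f2, u;

    /* vp8_filter_mask: every neighbour step is small and the edge step is bounded. */
    mask = abs(p3 - p2) <= limit && abs(p2 - p1) <= limit &&
           abs(p1 - p0) <= limit && abs(q1 - q0) <= limit &&
           abs(q2 - q1) <= limit && abs(q3 - q2) <= limit &&
           abs(p0 - q0) * 2 + abs(p1 - q1) / 2 <= flimit * 2 + limit;

    /* vp8_hevmask: high edge variance on either side. */
    hev = abs(p1 - p0) > thresh || abs(q1 - q0) > thresh;

    /* Convert to signed; equivalent to XOR with 0x80 on 8-bit values. */
    ps2 = p2 - 128; ps1 = p1 - 128; ps0 = p0 - 128;
    qs0 = q0 - 128; qs1 = q1 - 128; qs2 = q2 - 128;

    filter = clamp_s8(ps1 - qs1);
    filter = clamp_s8(filter + 3 * (qs0 - ps0));
    if (!mask) filter = 0;

    /* Where hev is set, only p0/q0 are adjusted (Filter1/Filter2). */
    f2 = hev ? filter : 0;
    f1 = clamp_s8(f2 + 4) >> 3;
    f2 = clamp_s8(f2 + 3) >> 3;
    qs0 = clamp_s8(qs0 - f1);
    ps0 = clamp_s8(ps0 + f2);

    /* Where hev is clear, spread roughly 3/7, 2/7 and 1/7 of the step. */
    if (hev) filter = 0;
    u = clamp_s8((63 + filter * 27) >> 7);
    qs0 = clamp_s8(qs0 - u); ps0 = clamp_s8(ps0 + u);
    u = clamp_s8((63 + filter * 18) >> 7);
    qs1 = clamp_s8(qs1 - u); ps1 = clamp_s8(ps1 + u);
    u = clamp_s8((63 + filter * 9) >> 7);
    qs2 = clamp_s8(qs2 - u); ps2 = clamp_s8(ps2 + u);

    /* Convert back to unsigned; p3 and q3 are left untouched. */
    px[1] = (unsigned char)(ps2 + 128);
    px[2] = (unsigned char)(ps1 + 128);
    px[3] = (unsigned char)(ps0 + 128);
    px[4] = (unsigned char)(qs0 + 128);
    px[5] = (unsigned char)(qs1 + 128);
    px[6] = (unsigned char)(qs2 + 128);
}
```

With flimit=20, limit=10, thresh=5, a step from 100 to 110 across the edge is smoothed into a ramp, while a step from 10 to 200 fails the mask and is left unchanged.
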
--- a/vp8/common/arm/neon/recon_neon.c
+++ b/vp8/common/arm/neon/recon_neon.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "vp8/common/recon.h"
 #include "vp8/common/blockd.h"

--- a/vp8/common/arm/neon/shortidct4x4llm_neon.asm
+++ b/vp8/common/arm/neon/shortidct4x4llm_neon.asm
@@ -31,7 +31,7 @@
 ;result of the multiplication that is needed in IDCT.

 |vp8_short_idct4x4llm_neon| PROC
-    adr             r12, idct_coeff
+    ldr             r12, _idct_coeff_
    vld1.16         {q1, q2}, [r0]
    vld1.16         {d0}, [r12]

@@ -114,6 +114,8 @@

 ;-----------------

+_idct_coeff_
+    DCD     idct_coeff
 idct_coeff
    DCD     0x4e7b4e7b, 0x8a8c8a8c

--- a/vp8/common/arm/neon/sixtappredict16x16_neon.asm
+++ b/vp8/common/arm/neon/sixtappredict16x16_neon.asm
@@ -15,17 +15,6 @@
    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2
-
-filter16_coeff
-    DCD     0,  0,  128,    0,   0,  0,   0,  0
-    DCD     0, -6,  123,   12,  -1,  0,   0,  0
-    DCD     2, -11, 108,   36,  -8,  1,   0,  0
-    DCD     0, -9,   93,   50,  -6,  0,   0,  0
-    DCD     3, -16,  77,   77, -16,  3,   0,  0
-    DCD     0, -6,   50,   93,  -9,  0,   0,  0
-    DCD     1, -8,   36,  108, -11,  2,   0,  0
-    DCD     0, -1,   12,  123,  -6,   0,  0,  0
-
 ; r0    unsigned char  *src_ptr,
 ; r1    int  src_pixels_per_line,
 ; r2    int  xoffset,
@@ -44,7 +33,7 @@ filter16_coeff
 |vp8_sixtap_predict16x16_neon| PROC
    push            {r4-r5, lr}

-    adr             r12, filter16_coeff
+    ldr             r12, _filter16_coeff_
    ldr             r4, [sp, #12]           ;load parameters from stack
    ldr             r5, [sp, #16]           ;load parameters from stack

@@ -487,4 +476,17 @@ secondpass_only_inner_loop_neon
    ENDP

 ;-----------------
+
+_filter16_coeff_
+    DCD     filter16_coeff
+filter16_coeff
+    DCD     0,  0,  128,    0,   0,  0,   0,  0
+    DCD     0, -6,  123,   12,  -1,  0,   0,  0
+    DCD     2, -11, 108,   36,  -8,  1,   0,  0
+    DCD     0, -9,   93,   50,  -6,  0,   0,  0
+    DCD     3, -16,  77,   77, -16,  3,   0,  0
+    DCD     0, -6,   50,   93,  -9,  0,   0,  0
+    DCD     1, -8,   36,  108, -11,  2,   0,  0
+    DCD     0, -1,   12,  123,  -6,   0,  0,  0
+
    END
--- a/vp8/common/arm/neon/sixtappredict4x4_neon.asm
+++ b/vp8/common/arm/neon/sixtappredict4x4_neon.asm
@@ -15,17 +15,6 @@
    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2
-
-filter4_coeff
-    DCD     0,  0,  128,    0,   0,  0,   0,  0
-    DCD     0, -6,  123,   12,  -1,  0,   0,  0
-    DCD     2, -11, 108,   36,  -8,  1,   0,  0
-    DCD     0, -9,   93,   50,  -6,  0,   0,  0
-    DCD     3, -16,  77,   77, -16,  3,   0,  0
-    DCD     0, -6,   50,   93,  -9,  0,   0,  0
-    DCD     1, -8,   36,  108, -11,  2,   0,  0
-    DCD     0, -1,   12,  123,  -6,   0,  0,  0
-
 ; r0    unsigned char  *src_ptr,
 ; r1    int  src_pixels_per_line,
 ; r2    int  xoffset,
@@ -36,7 +25,7 @@ filter4_coeff
 |vp8_sixtap_predict_neon| PROC
    push            {r4, lr}

-    adr             r12, filter4_coeff
+    ldr             r12, _filter4_coeff_
    ldr             r4, [sp, #8]            ;load parameters from stack
    ldr             lr, [sp, #12]           ;load parameters from stack

@@ -419,4 +408,16 @@ secondpass_filter4x4_only

 ;-----------------

+_filter4_coeff_
+    DCD     filter4_coeff
+filter4_coeff
+    DCD     0,  0,  128,    0,   0,  0,   0,  0
+    DCD     0, -6,  123,   12,  -1,  0,   0,  0
+    DCD     2, -11, 108,   36,  -8,  1,   0,  0
+    DCD     0, -9,   93,   50,  -6,  0,   0,  0
+    DCD     3, -16,  77,   77, -16,  3,   0,  0
+    DCD     0, -6,   50,   93,  -9,  0,   0,  0
+    DCD     1, -8,   36,  108, -11,  2,   0,  0
+    DCD     0, -1,   12,  123,  -6,   0,  0,  0
+
    END
--- a/vp8/common/arm/neon/sixtappredict8x4_neon.asm
+++ b/vp8/common/arm/neon/sixtappredict8x4_neon.asm
@@ -15,17 +15,6 @@
    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2
-
-filter8_coeff
-    DCD     0,  0,  128,    0,   0,  0,   0,  0
-    DCD     0, -6,  123,   12,  -1,  0,   0,  0
-    DCD     2, -11, 108,   36,  -8,  1,   0,  0
-    DCD     0, -9,   93,   50,  -6,  0,   0,  0
-    DCD     3, -16,  77,   77, -16,  3,   0,  0
-    DCD     0, -6,   50,   93,  -9,  0,   0,  0
-    DCD     1, -8,   36,  108, -11,  2,   0,  0
-    DCD     0, -1,   12,  123,  -6,   0,  0,  0
-
 ; r0    unsigned char  *src_ptr,
 ; r1    int  src_pixels_per_line,
 ; r2    int  xoffset,
@@ -36,7 +25,7 @@ filter8_coeff
 |vp8_sixtap_predict8x4_neon| PROC
    push            {r4-r5, lr}

-    adr             r12, filter8_coeff
+    ldr             r12, _filter8_coeff_
    ldr             r4, [sp, #12]           ;load parameters from stack
    ldr             r5, [sp, #16]           ;load parameters from stack

@@ -470,4 +459,16 @@ secondpass_filter8x4_only

 ;-----------------

+_filter8_coeff_
+    DCD     filter8_coeff
+filter8_coeff
+    DCD     0,  0,  128,    0,   0,  0,   0,  0
+    DCD     0, -6,  123,   12,  -1,  0,   0,  0
+    DCD     2, -11, 108,   36,  -8,  1,   0,  0
+    DCD     0, -9,   93,   50,  -6,  0,   0,  0
+    DCD     3, -16,  77,   77, -16,  3,   0,  0
+    DCD     0, -6,   50,   93,  -9,  0,   0,  0
+    DCD     1, -8,   36,  108, -11,  2,   0,  0
+    DCD     0, -1,   12,  123,  -6,   0,  0,  0
+
    END
--- a/vp8/common/arm/neon/sixtappredict8x8_neon.asm
+++ b/vp8/common/arm/neon/sixtappredict8x8_neon.asm
@@ -15,17 +15,6 @@
    PRESERVE8

    AREA ||.text||, CODE, READONLY, ALIGN=2
-
-filter8_coeff
-    DCD     0,  0,  128,    0,   0,  0,   0,  0
-    DCD     0, -6,  123,   12,  -1,  0,   0,  0
-    DCD     2, -11, 108,   36,  -8,  1,   0,  0
-    DCD     0, -9,   93,   50,  -6,  0,   0,  0
-    DCD     3, -16,  77,   77, -16,  3,   0,  0
-    DCD     0, -6,   50,   93,  -9,  0,   0,  0
-    DCD     1, -8,   36,  108, -11,  2,   0,  0
-    DCD     0, -1,   12,  123,  -6,   0,  0,  0
-
 ; r0    unsigned char  *src_ptr,
 ; r1    int  src_pixels_per_line,
 ; r2    int  xoffset,
@@ -36,7 +25,7 @@ filter8_coeff
 |vp8_sixtap_predict8x8_neon| PROC
    push            {r4-r5, lr}

-    adr             r12, filter8_coeff
+    ldr             r12, _filter8_coeff_

    ldr             r4, [sp, #12]           ;load parameters from stack
    ldr             r5, [sp, #16]           ;load parameters from stack
@@ -521,4 +510,16 @@ filt_blk2d_spo8x8_loop_neon

 ;-----------------

+_filter8_coeff_
+    DCD     filter8_coeff
+filter8_coeff
+    DCD     0,  0,  128,    0,   0,  0,   0,  0
+    DCD     0, -6,  123,   12,  -1,  0,   0,  0
+    DCD     2, -11, 108,   36,  -8,  1,   0,  0
+    DCD     0, -9,   93,   50,  -6,  0,   0,  0
+    DCD     3, -16,  77,   77, -16,  3,   0,  0
+    DCD     0, -6,   50,   93,  -9,  0,   0,  0
+    DCD     1, -8,   36,  108, -11,  2,   0,  0
+    DCD     0, -1,   12,  123,  -6,   0,  0,  0
+
    END
--- a/vp8/common/arm/reconintra_arm.c
+++ b/vp8/common/arm/reconintra_arm.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "vp8/common/blockd.h"
 #include "vp8/common/reconintra.h"
 #include "vpx_mem/vpx_mem.h"
--- a/vp8/common/asm_com_offsets.c
+++ b/vp8/common/asm_com_offsets.c
@@ -9,14 +9,27 @@
 */


-#include "vpx_config.h"
-#include "vpx/vpx_codec.h"
-#include "vpx_ports/asm_offsets.h"
+#include "vpx_ports/config.h"
+#include <stddef.h>
+
 #include "vpx_scale/yv12config.h"

-BEGIN
+#define ct_assert(name,cond) \
+    static void assert_##name(void) UNUSED;\
+    static void assert_##name(void) {switch(0){case 0:case !!(cond):;}}

-/* vpx_scale */
+#define DEFINE(sym, val) int sym = val;
+
+/*
+#define BLANK() asm volatile("\n->" : : )
+*/
+
+/*
+ * int main(void)
+ * {
+ */
+
+//vpx_scale
 DEFINE(yv12_buffer_config_y_width,              offsetof(YV12_BUFFER_CONFIG, y_width));
 DEFINE(yv12_buffer_config_y_height,             offsetof(YV12_BUFFER_CONFIG, y_height));
 DEFINE(yv12_buffer_config_y_stride,             offsetof(YV12_BUFFER_CONFIG, y_stride));
@@ -27,14 +40,10 @@ DEFINE(yv12_buffer_config_y_buffer,             offsetof(YV12_BUFFER_CONFIG, y_b
 DEFINE(yv12_buffer_config_u_buffer,             offsetof(YV12_BUFFER_CONFIG, u_buffer));
 DEFINE(yv12_buffer_config_v_buffer,             offsetof(YV12_BUFFER_CONFIG, v_buffer));
 DEFINE(yv12_buffer_config_border,               offsetof(YV12_BUFFER_CONFIG, border));
-DEFINE(VP8BORDERINPIXELS_VAL,                   VP8BORDERINPIXELS);

-END
-
-/* add asserts for any offset that is not supported by assembly code */
-/* add asserts for any size that is not supported by assembly code */
-
-#if HAVE_ARMV7
-/* vp8_yv12_extend_frame_borders_neon makes several assumptions based on this */
-ct_assert(VP8BORDERINPIXELS_VAL, VP8BORDERINPIXELS == 32)
-#endif
+//add asserts for any offset that is not supported by assembly code
+//add asserts for any size that is not supported by assembly code
+/*
+ * return 0;
+ * }
+ */
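
Reviewer note: the ct_assert and DEFINE macros introduced in asm_com_offsets.c above can be exercised in isolation as follows. This is a sketch under assumptions: demo_buffer_config is a hypothetical stand-in for YV12_BUFFER_CONFIG, and UNUSED is given a fallback definition here because the patch expects it to come from the build configuration. The trailing semicolons after DEFINE(...) are dropped because the macro already ends in one.

```c
#include <stddef.h>

/* Assumption: UNUSED is normally supplied by the build configuration. */
#ifndef UNUSED
#if defined(__GNUC__)
#define UNUSED __attribute__((unused))
#else
#define UNUSED
#endif
#endif

/* Compile-time assert: a false cond produces two "case 0" labels,
 * which the compiler rejects as duplicates. */
#define ct_assert(name,cond) \
    static void assert_##name(void) UNUSED;\
    static void assert_##name(void) {switch(0){case 0:case !!(cond):;}}

/* Emit each offset as a global int so it can be read back from the object file. */
#define DEFINE(sym, val) int sym = val;

/* Hypothetical stand-in for YV12_BUFFER_CONFIG. */
typedef struct
{
    int y_width;
    int y_height;
    int y_stride;
} demo_buffer_config;

DEFINE(demo_y_width,  offsetof(demo_buffer_config, y_width))
DEFINE(demo_y_stride, offsetof(demo_buffer_config, y_stride))

/* Compiles only while y_width stays the first member. */
ct_assert(demo_first_field, offsetof(demo_buffer_config, y_width) == 0)
```

Changing the condition to something false, for example comparing the offset with 4, should turn the ct_assert line into a duplicate-case compile error rather than a runtime failure.
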
--- a/vp8/common/blockd.h
+++ b/vp8/common/blockd.h
@@ -14,7 +14,7 @@

 void vpx_log(const char *format, ...);

-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "vpx_scale/yv12config.h"
 #include "mv.h"
 #include "treecoder.h"
@@ -137,11 +137,16 @@ typedef enum
   modes for the Y blocks to the left and above us; for interframes, there
   is a single probability table. */

-union b_mode_info
+typedef struct
 {
-    B_PREDICTION_MODE as_mode;
-    int_mv mv;
-};
+    B_PREDICTION_MODE mode;
+    union
+    {
+        int as_int;
+        MV  as_mv;
+    } mv;
+} B_MODE_INFO;
+

 typedef enum
 {
@@ -156,26 +161,38 @@ typedef struct
 {
    MB_PREDICTION_MODE mode, uv_mode;
    MV_REFERENCE_FRAME ref_frame;
-    int_mv mv;
+    union
+    {
+        int as_int;
+        MV  as_mv;
+    } mv;

    unsigned char partitioning;
    unsigned char mb_skip_coeff;                                /* does this mb has coefficients at all, 1=no coefficients, 0=need decode tokens */
+    unsigned char dc_diff;
    unsigned char need_to_clamp_mvs;
+
    unsigned char segment_id;                  /* Which set of segmentation parameters should be used for this MB */
+
+    unsigned char force_no_skip; /* encoder only */
 } MB_MODE_INFO;

+
 typedef struct
 {
    MB_MODE_INFO mbmi;
-    union b_mode_info bmi[16];
+    B_MODE_INFO bmi[16];
 } MODE_INFO;

+
 typedef struct
 {
    short *qcoeff;
    short *dqcoeff;
    unsigned char  *predictor;
    short *diff;
+    short *reference;
+
    short *dequant;

    /* 16 Y blocks, 4 U blocks, 4 V blocks each with 16 entries */
@@ -189,20 +206,21 @@ typedef struct

    int eob;

-    union b_mode_info bmi;
+    B_MODE_INFO bmi;
+
 } BLOCKD;

-typedef struct MacroBlockD
+typedef struct
 {
    DECLARE_ALIGNED(16, short, diff[400]);      /* from idct diff */
    DECLARE_ALIGNED(16, unsigned char,  predictor[384]);
+/* not used    DECLARE_ALIGNED(16, short, reference[384]); */
    DECLARE_ALIGNED(16, short, qcoeff[400]);
    DECLARE_ALIGNED(16, short, dqcoeff[400]);
    DECLARE_ALIGNED(16, char,  eobs[25]);

    /* 16 Y blocks, 4 U, 4 V, 1 DC 2nd order block, each with 16 entries. */
    BLOCKD block[25];
-    int fullpixel_mask;

    YV12_BUFFER_CONFIG pre; /* Filtered copy of previous frame reconstruction */
    YV12_BUFFER_CONFIG dst;
@@ -253,9 +271,6 @@ typedef struct MacroBlockD
    int mb_to_top_edge;
    int mb_to_bottom_edge;

-    int ref_frame_cost[MAX_REF_FRAMES];
-
-
    unsigned int frames_since_golden;
    unsigned int frames_till_alt_ref_frame;
    vp8_subpix_fn_t  subpixel_predict;
@@ -267,14 +282,6 @@ typedef struct MacroBlockD

    int corrupted;

-#if ARCH_X86 || ARCH_X86_64
-    /* This is an intermediate buffer currently used in sub-pixel motion search
-     * to keep a copy of the reference area. This buffer can be used for other
-     * purpose.
-     */
-    DECLARE_ALIGNED(32, unsigned char, y_buf[22*32]);
-#endif
-
 #if CONFIG_RUNTIME_CPU_DETECT
    struct VP8_COMMON_RTCD  *rtcd;
 #endif
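
Reviewer note: the anonymous union { int as_int; MV as_mv; } that this patch places in B_MODE_INFO and MB_MODE_INFO lets a motion vector be compared or cleared as a single int. The sketch below shows that idea in isolation. The names demo_int_mv, make_mv and same_mv are hypothetical, and the MV layout (two shorts) is assumed rather than taken from this diff. It also assumes a 4-byte int and 2-byte short, so both union members cover the same bytes.

```c
/* Assumed layout of a motion vector: two 16-bit components. */
typedef struct
{
    short row;
    short col;
} MV;

/* Hypothetical named version of the union used in the patch. */
typedef union
{
    int as_int;
    MV  as_mv;
} demo_int_mv;

/* Build a vector; as_int is cleared first so every byte is defined. */
static demo_int_mv make_mv(short row, short col)
{
    demo_int_mv m;
    m.as_int = 0;
    m.as_mv.row = row;
    m.as_mv.col = col;
    return m;
}

/* One integer compare instead of comparing row and col separately. */
static int same_mv(const demo_int_mv *a, const demo_int_mv *b)
{
    return a->as_int == b->as_int;
}
```

The same trick gives a cheap zero test: a vector whose row and col are both 0 has as_int equal to 0.
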
--- a/vp8/common/coefupdateprobs.h
+++ b/vp8/common/coefupdateprobs.h
@@ -12,7 +12,7 @@
 /* Update probabilities for the nodes in the token entropy tree.
   Generated file included by entropy.c */

-const vp8_prob vp8_coef_update_probs [BLOCK_TYPES] [COEF_BANDS] [PREV_COEF_CONTEXTS] [ENTROPY_NODES] =
+const vp8_prob vp8_coef_update_probs [BLOCK_TYPES] [COEF_BANDS] [PREV_COEF_CONTEXTS] [vp8_coef_tokens-1] =
 {
    {
        {
--- a/vp8/common/debugmodes.c
+++ b/vp8/common/debugmodes.c
@@ -97,7 +97,7 @@ void vp8_print_modes_and_motion_vectors(MODE_INFO *mi, int rows, int cols, int f
                bindex = (b_row & 3) * 4 + (b_col & 3);

                if (mi[mb_index].mbmi.mode == B_PRED)
-                    fprintf(mvs, "%2d ", mi[mb_index].bmi[bindex].as_mode);
+                    fprintf(mvs, "%2d ", mi[mb_index].bmi[bindex].mode);
                else
                    fprintf(mvs, "xx ");

--- a/vp8/common/default_coef_probs.h
+++ b/vp8/common/default_coef_probs.h
@@ -1,188 +0,0 @@
-/*
- *  Copyright (c) 2010 The WebM project authors. All Rights Reserved.
- *
- *  Use of this source code is governed by a BSD-style license
- *  that can be found in the LICENSE file in the root of the source
- *  tree. An additional intellectual property rights grant can be found
- *  in the file PATENTS.  All contributing project authors may
- *  be found in the AUTHORS file in the root of the source tree.
-*/
-
-
-/*Generated file, included by entropy.c*/
-
-
-static const vp8_prob default_coef_probs [BLOCK_TYPES]
-                                         [COEF_BANDS]
-                                         [PREV_COEF_CONTEXTS]
-                                         [ENTROPY_NODES] =
-{
-    { /* Block Type ( 0 ) */
-        { /* Coeff Band ( 0 )*/
-            { 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 1 )*/
-            { 253, 136, 254, 255, 228, 219, 128, 128, 128, 128, 128 },
-            { 189, 129, 242, 255, 227, 213, 255, 219, 128, 128, 128 },
-            { 106, 126, 227, 252, 214, 209, 255, 255, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 2 )*/
-            {   1,  98, 248, 255, 236, 226, 255, 255, 128, 128, 128 },
-            { 181, 133, 238, 254, 221, 234, 255, 154, 128, 128, 128 },
-            {  78, 134, 202, 247, 198, 180, 255, 219, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 3 )*/
-            {   1, 185, 249, 255, 243, 255, 128, 128, 128, 128, 128 },
-            { 184, 150, 247, 255, 236, 224, 128, 128, 128, 128, 128 },
-            {  77, 110, 216, 255, 236, 230, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 4 )*/
-            {   1, 101, 251, 255, 241, 255, 128, 128, 128, 128, 128 },
-            { 170, 139, 241, 252, 236, 209, 255, 255, 128, 128, 128 },
-            {  37, 116, 196, 243, 228, 255, 255, 255, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 5 )*/
-            {   1, 204, 254, 255, 245, 255, 128, 128, 128, 128, 128 },
-            { 207, 160, 250, 255, 238, 128, 128, 128, 128, 128, 128 },
-            { 102, 103, 231, 255, 211, 171, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 6 )*/
-            {   1, 152, 252, 255, 240, 255, 128, 128, 128, 128, 128 },
-            { 177, 135, 243, 255, 234, 225, 128, 128, 128, 128, 128 },
-            {  80, 129, 211, 255, 194, 224, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 7 )*/
-            {   1,   1, 255, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 246,   1, 255, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 255, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }
-        }
-    },
-    { /* Block Type ( 1 ) */
-        { /* Coeff Band ( 0 )*/
-            { 198,  35, 237, 223, 193, 187, 162, 160, 145, 155,  62 },
-            { 131,  45, 198, 221, 172, 176, 220, 157, 252, 221,   1 },
-            {  68,  47, 146, 208, 149, 167, 221, 162, 255, 223, 128 }
-        },
-        { /* Coeff Band ( 1 )*/
-            {   1, 149, 241, 255, 221, 224, 255, 255, 128, 128, 128 },
-            { 184, 141, 234, 253, 222, 220, 255, 199, 128, 128, 128 },
-            {  81,  99, 181, 242, 176, 190, 249, 202, 255, 255, 128 }
-        },
-        { /* Coeff Band ( 2 )*/
-            {   1, 129, 232, 253, 214, 197, 242, 196, 255, 255, 128 },
-            {  99, 121, 210, 250, 201, 198, 255, 202, 128, 128, 128 },
-            {  23,  91, 163, 242, 170, 187, 247, 210, 255, 255, 128 }
-        },
-        { /* Coeff Band ( 3 )*/
-            {   1, 200, 246, 255, 234, 255, 128, 128, 128, 128, 128 },
-            { 109, 178, 241, 255, 231, 245, 255, 255, 128, 128, 128 },
-            {  44, 130, 201, 253, 205, 192, 255, 255, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 4 )*/
-            {   1, 132, 239, 251, 219, 209, 255, 165, 128, 128, 128 },
-            {  94, 136, 225, 251, 218, 190, 255, 255, 128, 128, 128 },
-            {  22, 100, 174, 245, 186, 161, 255, 199, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 5 )*/
-            {   1, 182, 249, 255, 232, 235, 128, 128, 128, 128, 128 },
-            { 124, 143, 241, 255, 227, 234, 128, 128, 128, 128, 128 },
-            {  35,  77, 181, 251, 193, 211, 255, 205, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 6 )*/
-            {   1, 157, 247, 255, 236, 231, 255, 255, 128, 128, 128 },
-            { 121, 141, 235, 255, 225, 227, 255, 255, 128, 128, 128 },
-            {  45,  99, 188, 251, 195, 217, 255, 224, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 7 )*/
-            {   1,   1, 251, 255, 213, 255, 128, 128, 128, 128, 128 },
-            { 203,   1, 248, 255, 255, 128, 128, 128, 128, 128, 128 },
-            { 137,   1, 177, 255, 224, 255, 128, 128, 128, 128, 128 }
-        }
-    },
-    { /* Block Type ( 2 ) */
-        { /* Coeff Band ( 0 )*/
-            { 253,   9, 248, 251, 207, 208, 255, 192, 128, 128, 128 },
-            { 175,  13, 224, 243, 193, 185, 249, 198, 255, 255, 128 },
-            {  73,  17, 171, 221, 161, 179, 236, 167, 255, 234, 128 }
-        },
-        { /* Coeff Band ( 1 )*/
-            {   1,  95, 247, 253, 212, 183, 255, 255, 128, 128, 128 },
-            { 239,  90, 244, 250, 211, 209, 255, 255, 128, 128, 128 },
-            { 155,  77, 195, 248, 188, 195, 255, 255, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 2 )*/
-            {   1,  24, 239, 251, 218, 219, 255, 205, 128, 128, 128 },
-            { 201,  51, 219, 255, 196, 186, 128, 128, 128, 128, 128 },
-            {  69,  46, 190, 239, 201, 218, 255, 228, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 3 )*/
-            {   1, 191, 251, 255, 255, 128, 128, 128, 128, 128, 128 },
-            { 223, 165, 249, 255, 213, 255, 128, 128, 128, 128, 128 },
-            { 141, 124, 248, 255, 255, 128, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 4 )*/
-            {   1,  16, 248, 255, 255, 128, 128, 128, 128, 128, 128 },
-            { 190,  36, 230, 255, 236, 255, 128, 128, 128, 128, 128 },
-            { 149,   1, 255, 128, 128, 128, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 5 )*/
-            {   1, 226, 255, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 247, 192, 255, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 240, 128, 255, 128, 128, 128, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 6 )*/
-            {   1, 134, 252, 255, 255, 128, 128, 128, 128, 128, 128 },
-            { 213,  62, 250, 255, 255, 128, 128, 128, 128, 128, 128 },
-            {  55,  93, 255, 128, 128, 128, 128, 128, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 7 )*/
-            { 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128 }
-        }
-    },
-    { /* Block Type ( 3 ) */
-        { /* Coeff Band ( 0 )*/
-            { 202,  24, 213, 235, 186, 191, 220, 160, 240, 175, 255 },
-            { 126,  38, 182, 232, 169, 184, 228, 174, 255, 187, 128 },
-            {  61,  46, 138, 219, 151, 178, 240, 170, 255, 216, 128 }
-        },
-        { /* Coeff Band ( 1 )*/
-            {   1, 112, 230, 250, 199, 191, 247, 159, 255, 255, 128 },
-            { 166, 109, 228, 252, 211, 215, 255, 174, 128, 128, 128 },
-            {  39,  77, 162, 232, 172, 180, 245, 178, 255, 255, 128 }
-        },
-        { /* Coeff Band ( 2 )*/
-            {   1,  52, 220, 246, 198, 199, 249, 220, 255, 255, 128 },
-            { 124,  74, 191, 243, 183, 193, 250, 221, 255, 255, 128 },
-            {  24,  71, 130, 219, 154, 170, 243, 182, 255, 255, 128 }
-        },
-        { /* Coeff Band ( 3 )*/
-            {   1, 182, 225, 249, 219, 240, 255, 224, 128, 128, 128 },
-            { 149, 150, 226, 252, 216, 205, 255, 171, 128, 128, 128 },
-            {  28, 108, 170, 242, 183, 194, 254, 223, 255, 255, 128 }
-        },
-        { /* Coeff Band ( 4 )*/
-            {   1,  81, 230, 252, 204, 203, 255, 192, 128, 128, 128 },
-            { 123, 102, 209, 247, 188, 196, 255, 233, 128, 128, 128 },
-            {  20,  95, 153, 243, 164, 173, 255, 203, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 5 )*/
-            {   1, 222, 248, 255, 216, 213, 128, 128, 128, 128, 128 },
-            { 168, 175, 246, 252, 235, 205, 255, 255, 128, 128, 128 },
-            {  47, 116, 215, 255, 211, 212, 255, 255, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 6 )*/
-            {   1, 121, 236, 253, 212, 214, 255, 255, 128, 128, 128 },
-            { 141,  84, 213, 252, 201, 202, 255, 219, 128, 128, 128 },
-            {  42,  80, 160, 240, 162, 185, 255, 205, 128, 128, 128 }
-        },
-        { /* Coeff Band ( 7 )*/
-            {   1,   1, 255, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 244,   1, 255, 128, 128, 128, 128, 128, 128, 128, 128 },
-            { 238,   1, 255, 128, 128, 128, 128, 128, 128, 128, 128 }
-        }
-    }
-};
--- a/vp8/encoder/defaultcoefcounts.h
+++ b/vp8/encoder/defaultcoefcounts.h
@@ -8,12 +8,10 @@
 *  be found in the AUTHORS file in the root of the source tree.
 */

+
 /* Generated file, included by entropy.c */

-static const unsigned int default_coef_counts[BLOCK_TYPES]
-                                             [COEF_BANDS]
-                                             [PREV_COEF_CONTEXTS]
-                                             [MAX_ENTROPY_TOKENS] =
+static const unsigned int default_coef_counts [BLOCK_TYPES] [COEF_BANDS] [PREV_COEF_CONTEXTS] [vp8_coef_tokens] =
 {

    {
--- a/vp8/common/entropy.c
+++ b/vp8/common/entropy.c
@@ -15,7 +15,6 @@
 #include "string.h"
 #include "blockd.h"
 #include "onyxc_int.h"
-#include "vpx_mem/vpx_mem.h"

 #define uchar unsigned char     /* typedefs can clash */
 #define uint  unsigned int
@@ -27,32 +26,8 @@ typedef vp8_prob Prob;

 #include "coefupdateprobs.h"

-DECLARE_ALIGNED(16, const unsigned char, vp8_norm[256]) =
-{
-    0, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
-    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-};
-
-DECLARE_ALIGNED(16, cuchar, vp8_coef_bands[16]) =
-{ 0, 1, 2, 3, 6, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7};
-
-DECLARE_ALIGNED(16, cuchar, vp8_prev_token_class[MAX_ENTROPY_TOKENS]) =
-{ 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0};
-
+DECLARE_ALIGNED(16, cuchar, vp8_coef_bands[16]) = { 0, 1, 2, 3, 6, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7};
+DECLARE_ALIGNED(16, cuchar, vp8_prev_token_class[MAX_ENTROPY_TOKENS]) = { 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0};
 DECLARE_ALIGNED(16, const int, vp8_default_zig_zag1d[16]) =
 {
    0,  1,  4,  8,
@@ -90,7 +65,7 @@ const vp8_tree_index vp8_coef_tree[ 22] =     /* corresponding _CONTEXT_NODEs */
    -DCT_VAL_CATEGORY5, -DCT_VAL_CATEGORY6   /* 10 = CAT_FIVE */
 };

-struct vp8_token_struct vp8_coef_encodings[MAX_ENTROPY_TOKENS];
+struct vp8_token_struct vp8_coef_encodings[vp8_coef_tokens];

 /* Trees for extra bits.  Probabilities are constant and
   do not depend on previously encoded bits */
@@ -154,15 +129,37 @@ vp8_extra_bit_struct vp8_extra_bits[12] =
    { cat6, Pcat6, 11, 67},
    { 0, 0, 0, 0}
 };
-
-#include "default_coef_probs.h"
+#include "defaultcoefcounts.h"

 void vp8_default_coef_probs(VP8_COMMON *pc)
 {
-    vpx_memcpy(pc->fc.coef_probs, default_coef_probs,
-                   sizeof(default_coef_probs));
+    int h = 0;
+
+    do
+    {
+        int i = 0;
+
+        do
+        {
+            int k = 0;
+
+            do
+            {
+                unsigned int branch_ct [vp8_coef_tokens-1] [2];
+                vp8_tree_probs_from_distribution(
+                    vp8_coef_tokens, vp8_coef_encodings, vp8_coef_tree,
+                    pc->fc.coef_probs [h][i][k], branch_ct, default_coef_counts [h][i][k],
+                    256, 1);
+
+            }
+            while (++k < PREV_COEF_CONTEXTS);
+        }
+        while (++i < COEF_BANDS);
+    }
+    while (++h < BLOCK_TYPES);
 }

+
 void vp8_coef_tree_initialize()
 {
    init_bit_trees();
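
Note on the entropy.c change above: vp8_default_coef_probs() now derives the default probabilities from default_coef_counts at run time instead of copying a precomputed table. The sketch below illustrates only the per-node step, turning one pair of branch counts into an 8-bit probability. The helper name prob_from_branch_counts, the prob_t typedef, and the exact rounding and clamping are assumptions made for illustration and are not taken from this diff. The real vp8_tree_probs_from_distribution() also walks the token tree to build the branch counts.

```c
#include <assert.h>

typedef unsigned char prob_t;   /* stand-in for vp8_prob (assumption) */

/* Hypothetical helper: map the counts of the "0" and "1" branches at one
 * tree node to the probability of taking the "0" branch, scaled by `scale`
 * (256 in the call shown in the diff) and optionally rounded to nearest. */
static prob_t prob_from_branch_counts(unsigned int c0, unsigned int c1,
                                      unsigned int scale, int round)
{
    const unsigned int total = c0 + c1;
    unsigned int p;

    assert(scale > 0);

    if (total == 0)
        return 128;             /* no data: fall back to an even split */

    p = (c0 * scale + (round ? total >> 1 : 0)) / total;

    /* Keep the result in 1..255 so neither branch becomes impossible. */
    if (p > 255) p = 255;
    if (p < 1)   p = 1;
    return (prob_t)p;
}
```

With scale 256 and rounding on, counts of (3, 1) give 192, and a node that only ever saw one branch is clamped to 255 or 1 rather than 256 or 0. Very large counts could overflow the multiplication in this sketch.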
--- a/vp8/common/entropy.h
+++ b/vp8/common/entropy.h
@@ -30,12 +30,13 @@
 #define DCT_VAL_CATEGORY6       10      /* 67+       Extra Bits 11+1 */
 #define DCT_EOB_TOKEN           11      /* EOB       Extra Bits 0+0 */

-#define MAX_ENTROPY_TOKENS 12
+#define vp8_coef_tokens 12
+#define MAX_ENTROPY_TOKENS vp8_coef_tokens
 #define ENTROPY_NODES 11

 extern const vp8_tree_index vp8_coef_tree[];

-extern struct vp8_token_struct vp8_coef_encodings[MAX_ENTROPY_TOKENS];
+extern struct vp8_token_struct vp8_coef_encodings[vp8_coef_tokens];

 typedef struct
 {
@@ -84,9 +85,9 @@ extern DECLARE_ALIGNED(16, const unsigned char, vp8_coef_bands[16]);
 /*# define DC_TOKEN_CONTEXTS        3*/ /* 00, 0!0, !0!0 */
 #   define PREV_COEF_CONTEXTS       3

-extern DECLARE_ALIGNED(16, const unsigned char, vp8_prev_token_class[MAX_ENTROPY_TOKENS]);
+extern DECLARE_ALIGNED(16, const unsigned char, vp8_prev_token_class[vp8_coef_tokens]);

-extern const vp8_prob vp8_coef_update_probs [BLOCK_TYPES] [COEF_BANDS] [PREV_COEF_CONTEXTS] [ENTROPY_NODES];
+extern const vp8_prob vp8_coef_update_probs [BLOCK_TYPES] [COEF_BANDS] [PREV_COEF_CONTEXTS] [vp8_coef_tokens-1];


 struct VP8Common;
--- a/vp8/common/entropymode.c
+++ b/vp8/common/entropymode.c
@@ -33,11 +33,11 @@ typedef enum
    SUBMVREF_LEFT_ABOVE_ZED
 } sumvfref_t;

-int vp8_mv_cont(const int_mv *l, const int_mv *a)
+int vp8_mv_cont(const MV *l, const MV *a)
 {
-    int lez = (l->as_int == 0);
-    int aez = (a->as_int == 0);
-    int lea = (l->as_int == a->as_int);
+    int lez = (l->row == 0 && l->col == 0);
+    int aez = (a->row == 0 && a->col == 0);
+    int lea = (l->row == a->row && l->col == a->col);

    if (lea && lez)
        return SUBMVREF_LEFT_ABOVE_ZED;
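
Note on the vp8_mv_cont() change above: the int_mv version compares both components with a single as_int test, while the MV version compares row and col separately. The sketch below shows that the two forms agree when the union covers the whole struct. The MV layout (two shorts) and the helper names are assumptions for illustration; only the int_mv union shape is taken from findnearmv.h in this diff.

```c
#include <assert.h>

typedef struct { short row; short col; } MV;   /* assumed layout */

typedef union
{
    unsigned int as_int;
    MV           as_mv;
} int_mv;

/* Field-wise tests, as in the MV version of vp8_mv_cont(). */
static int mv_is_zero(const MV *m)
{
    return m->row == 0 && m->col == 0;
}

static int mv_equal(const MV *a, const MV *b)
{
    return a->row == b->row && a->col == b->col;
}

/* Hypothetical constructor: zero the integer view first so any bytes not
 * covered by the MV member compare equal as well. */
static int_mv make_int_mv(short row, short col)
{
    int_mv v;
    assert(sizeof(MV) <= sizeof(unsigned int));
    v.as_int = 0;
    v.as_mv.row = row;
    v.as_mv.col = col;
    return v;
}

/* Packed test, as in the int_mv version of vp8_mv_cont(). */
static int int_mv_equal(const int_mv *a, const int_mv *b)
{
    return a->as_int == b->as_int;
}
```

The packed form does one comparison instead of two, which is the "rapid equality tests" rationale given in the union's own comment.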
--- a/vp8/common/entropymode.h
+++ b/vp8/common/entropymode.h
@@ -25,7 +25,7 @@ extern const int vp8_mbsplit_count [VP8_NUMMBSPLITS];    /* # of subsets */

 extern const vp8_prob vp8_mbsplit_probs [VP8_NUMMBSPLITS-1];

-extern int vp8_mv_cont(const int_mv *l, const int_mv *a);
+extern int vp8_mv_cont(const MV *l, const MV *a);
 #define SUBMVREF_COUNT 5
 extern const vp8_prob vp8_sub_mv_ref_prob2 [SUBMVREF_COUNT][VP8_SUBMVREFS-1];

--- a/vp8/common/entropymv.h
+++ b/vp8/common/entropymv.h
@@ -18,8 +18,6 @@ enum
 {
    mv_max  = 1023,              /* max absolute value of a MV component */
    MVvals = (2 * mv_max) + 1,   /* # possible values "" */
-    mvfp_max  = 255,              /* max absolute value of a full pixel MV component */
-    MVfpvals = (2 * mvfp_max) +1, /* # possible full pixel MV values */

    mvlong_width = 10,       /* Large MVs have 9 bit magnitudes */
    mvnum_short = 8,         /* magnitudes 0 through 7 */
--- a/vp8/common/extend.c
+++ b/vp8/common/extend.c
@@ -13,12 +13,10 @@
 #include "vpx_mem/vpx_mem.h"


-static void copy_and_extend_plane
+static void extend_plane_borders
 (
    unsigned char *s, /* source */
-    int sp,           /* source pitch */
-    unsigned char *d, /* destination */
-    int dp,           /* destination pitch */
+    int sp,           /* pitch */
    int h,            /* height */
    int w,            /* width */
    int et,           /* extend top border */
@@ -27,6 +25,7 @@ static void copy_and_extend_plane
    int er            /* extend right border */
 )
 {
+
    int i;
    unsigned char *src_ptr1, *src_ptr2;
    unsigned char *dest_ptr1, *dest_ptr2;
@@ -35,127 +34,68 @@ static void copy_and_extend_plane
    /* copy the left and right most columns out */
    src_ptr1 = s;
    src_ptr2 = s + w - 1;
-    dest_ptr1 = d - el;
-    dest_ptr2 = d + w;
+    dest_ptr1 = s - el;
+    dest_ptr2 = s + w;

-    for (i = 0; i < h; i++)
+    for (i = 0; i < h - 0 + 1; i++)
    {
-        vpx_memset(dest_ptr1, src_ptr1[0], el);
-        vpx_memcpy(dest_ptr1 + el, src_ptr1, w);
+        /* Some linkers will complain if we call vpx_memset with el set to a
+         * constant 0.
+         */
+        if (el)
+            vpx_memset(dest_ptr1, src_ptr1[0], el);
        vpx_memset(dest_ptr2, src_ptr2[0], er);
        src_ptr1  += sp;
        src_ptr2  += sp;
-        dest_ptr1 += dp;
-        dest_ptr2 += dp;
+        dest_ptr1 += sp;
+        dest_ptr2 += sp;
    }

-    /* Now copy the top and bottom lines into each line of the respective
-     * borders
-     */
-    src_ptr1 = d - el;
-    src_ptr2 = d + dp * (h - 1) - el;
-    dest_ptr1 = d + dp * (-et) - el;
-    dest_ptr2 = d + dp * (h) - el;
-    linesize = el + er + w;
+    /* Now copy the top and bottom source lines into each line of the respective borders */
+    src_ptr1 = s - el;
+    src_ptr2 = s + sp * (h - 1) - el;
+    dest_ptr1 = s + sp * (-et) - el;
+    dest_ptr2 = s + sp * (h) - el;
+    linesize = el + er + w + 1;

-    for (i = 0; i < et; i++)
+    for (i = 0; i < (int)et; i++)
    {
        vpx_memcpy(dest_ptr1, src_ptr1, linesize);
-        dest_ptr1 += dp;
+        dest_ptr1 += sp;
    }

-    for (i = 0; i < eb; i++)
+    for (i = 0; i < (int)eb; i++)
    {
        vpx_memcpy(dest_ptr2, src_ptr2, linesize);
-        dest_ptr2 += dp;
+        dest_ptr2 += sp;
    }
 }


-void vp8_copy_and_extend_frame(YV12_BUFFER_CONFIG *src,
-                               YV12_BUFFER_CONFIG *dst)
+void vp8_extend_to_multiple_of16(YV12_BUFFER_CONFIG *ybf, int width, int height)
 {
-    int et = dst->border;
-    int el = dst->border;
-    int eb = dst->border + dst->y_height - src->y_height;
-    int er = dst->border + dst->y_width - src->y_width;
+    int er = 0xf & (16 - (width & 0xf));
+    int eb = 0xf & (16 - (height & 0xf));

-    copy_and_extend_plane(src->y_buffer, src->y_stride,
-                          dst->y_buffer, dst->y_stride,
-                          src->y_height, src->y_width,
-                          et, el, eb, er);
+    /* check for non multiples of 16 */
+    if (er != 0 || eb != 0)
+    {
+        extend_plane_borders(ybf->y_buffer, ybf->y_stride, height, width, 0, 0, eb, er);

-    et = dst->border >> 1;
-    el = dst->border >> 1;
-    eb = (dst->border >> 1) + dst->uv_height - src->uv_height;
-    er = (dst->border >> 1) + dst->uv_width - src->uv_width;
+        /* adjust for uv */
+        height = (height + 1) >> 1;
+        width  = (width  + 1) >> 1;
+        er = 0x7 & (8 - (width  & 0x7));
+        eb = 0x7 & (8 - (height & 0x7));

-    copy_and_extend_plane(src->u_buffer, src->uv_stride,
-                          dst->u_buffer, dst->uv_stride,
-                          src->uv_height, src->uv_width,
-                          et, el, eb, er);
-
-    copy_and_extend_plane(src->v_buffer, src->uv_stride,
-                          dst->v_buffer, dst->uv_stride,
-                          src->uv_height, src->uv_width,
-                          et, el, eb, er);
+        if (er || eb)
+        {
+            extend_plane_borders(ybf->u_buffer, ybf->uv_stride, height, width, 0, 0, eb, er);
+            extend_plane_borders(ybf->v_buffer, ybf->uv_stride, height, width, 0, 0, eb, er);
+        }
+    }
 }

-
-void vp8_copy_and_extend_frame_with_rect(YV12_BUFFER_CONFIG *src,
-                                         YV12_BUFFER_CONFIG *dst,
-                                         int srcy, int srcx,
-                                         int srch, int srcw)
-{
-    int et = dst->border;
-    int el = dst->border;
-    int eb = dst->border + dst->y_height - src->y_height;
-    int er = dst->border + dst->y_width - src->y_width;
-    int src_y_offset = srcy * src->y_stride + srcx;
-    int dst_y_offset = srcy * dst->y_stride + srcx;
-    int src_uv_offset = ((srcy * src->uv_stride) >> 1) + (srcx >> 1);
-    int dst_uv_offset = ((srcy * dst->uv_stride) >> 1) + (srcx >> 1);
-
-    // If the side is not touching the bounder then don't extend.
-    if (srcy)
-      et = 0;
-    if (srcx)
-      el = 0;
-    if (srcy + srch != src->y_height)
-      eb = 0;
-    if (srcx + srcw != src->y_width)
-      er = 0;
-
-    copy_and_extend_plane(src->y_buffer + src_y_offset,
-                          src->y_stride,
-                          dst->y_buffer + dst_y_offset,
-                          dst->y_stride,
-                          srch, srcw,
-                          et, el, eb, er);
-
-    et = (et + 1) >> 1;
-    el = (el + 1) >> 1;
-    eb = (eb + 1) >> 1;
-    er = (er + 1) >> 1;
-    srch = (srch + 1) >> 1;
-    srcw = (srcw + 1) >> 1;
-
-    copy_and_extend_plane(src->u_buffer + src_uv_offset,
-                          src->uv_stride,
-                          dst->u_buffer + dst_uv_offset,
-                          dst->uv_stride,
-                          srch, srcw,
-                          et, el, eb, er);
-
-    copy_and_extend_plane(src->v_buffer + src_uv_offset,
-                          src->uv_stride,
-                          dst->v_buffer + dst_uv_offset,
-                          dst->uv_stride,
-                          srch, srcw,
-                          et, el, eb, er);
-}
-
-
 /* note the extension is only for the last row, for intra prediction purpose */
 void vp8_extend_mb_row(YV12_BUFFER_CONFIG *ybf, unsigned char *YPtr, unsigned char *UPtr, unsigned char *VPtr)
 {
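
Note on vp8_extend_to_multiple_of16() above: the expressions 0xf & (16 - (width & 0xf)) and 0x7 & (8 - (width & 0x7)) compute how many samples are needed to reach the next multiple of 16 (luma) or 8 (chroma), yielding 0 when the size is already aligned. The sketch below generalises that arithmetic. The helper names pad_to_multiple and chroma_dim are hypothetical and do not appear in the diff.

```c
#include <assert.h>

/* Hypothetical helper: number of samples needed to round `size` up to the
 * next multiple of `align`, where `align` must be a power of two.
 * Returns 0 when `size` is already a multiple of `align`. */
static int pad_to_multiple(int size, int align)
{
    const int mask = align - 1;

    assert(align > 0 && (align & mask) == 0);
    assert(size >= 0);

    return mask & (align - (size & mask));
}

/* Chroma planes are half the luma size, rounded up, as in the diff. */
static int chroma_dim(int luma_dim)
{
    return (luma_dim + 1) >> 1;
}
```

For example, a width of 17 needs 15 extra columns to reach 32, while a width of 176 needs none. The outer mask is what turns the "already aligned" case from 16 into 0.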
--- a/vp8/common/extend.h
+++ b/vp8/common/extend.h
@@ -14,12 +14,8 @@

 #include "vpx_scale/yv12config.h"

+void Extend(YV12_BUFFER_CONFIG *ybf);
 void vp8_extend_mb_row(YV12_BUFFER_CONFIG *ybf, unsigned char *YPtr, unsigned char *UPtr, unsigned char *VPtr);
-void vp8_copy_and_extend_frame(YV12_BUFFER_CONFIG *src,
-                               YV12_BUFFER_CONFIG *dst);
-void vp8_copy_and_extend_frame_with_rect(YV12_BUFFER_CONFIG *src,
-                                         YV12_BUFFER_CONFIG *dst,
-                                         int srcy, int srcx,
-                                         int srch, int srcw);
+void vp8_extend_to_multiple_of16(YV12_BUFFER_CONFIG *ybf, int width, int height);

 #endif
--- a/vp8/common/findnearmv.c
+++ b/vp8/common/findnearmv.c
@@ -25,9 +25,9 @@ void vp8_find_near_mvs
 (
    MACROBLOCKD *xd,
    const MODE_INFO *here,
-    int_mv *nearest,
-    int_mv *nearby,
-    int_mv *best_mv,
+    MV *nearest,
+    MV *nearby,
+    MV *best_mv,
    int cnt[4],
    int refframe,
    int *ref_frame_sign_bias
@@ -131,14 +131,13 @@ void vp8_find_near_mvs
        near_mvs[CNT_INTRA] = near_mvs[CNT_NEAREST];

    /* Set up return values */
-    best_mv->as_int = near_mvs[0].as_int;
-    nearest->as_int = near_mvs[CNT_NEAREST].as_int;
-    nearby->as_int = near_mvs[CNT_NEAR].as_int;
+    *best_mv = near_mvs[0].as_mv;
+    *nearest = near_mvs[CNT_NEAREST].as_mv;
+    *nearby = near_mvs[CNT_NEAR].as_mv;

-    //TODO: move clamp outside findnearmv
-    vp8_clamp_mv2(nearest, xd);
-    vp8_clamp_mv2(nearby, xd);
-    vp8_clamp_mv2(best_mv, xd);
+    vp8_clamp_mv(nearest, xd);
+    vp8_clamp_mv(nearby, xd);
+    vp8_clamp_mv(best_mv, xd); /*TODO: move this up before the copy*/
 }

 vp8_prob *vp8_mv_ref_probs(
@@ -153,3 +152,26 @@ vp8_prob *vp8_mv_ref_probs(
    return p;
 }

+const B_MODE_INFO *vp8_left_bmi(const MODE_INFO *cur_mb, int b)
+{
+    if (!(b & 3))
+    {
+        /* On L edge, get from MB to left of us */
+        --cur_mb;
+        b += 4;
+    }
+
+    return cur_mb->bmi + b - 1;
+}
+
+const B_MODE_INFO *vp8_above_bmi(const MODE_INFO *cur_mb, int b, int mi_stride)
+{
+    if (!(b >> 2))
+    {
+        /* On top edge, get from MB above us */
+        cur_mb -= mi_stride;
+        b += 16;
+    }
+
+    return cur_mb->bmi + b - 4;
+}
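
Note on vp8_left_bmi() and vp8_above_bmi() above: both treat b as an index into a 4x4 raster of sub-blocks (0..15) and step into the neighbouring macroblock when b lies on the left column or top row. The sketch below reproduces only the index arithmetic, without the MODE_INFO pointers. The neighbour_t type and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical result type: mb_step is 0 when the neighbour lies in the
 * current macroblock and -1 when it lies in the macroblock to the left
 * (or above); index is the sub-block index inside that macroblock. */
typedef struct { int mb_step; int index; } neighbour_t;

static neighbour_t left_neighbour(int b)
{
    neighbour_t n = { 0, 0 };

    assert(b >= 0 && b < 16);

    if (!(b & 3))          /* left column: wrap to the previous MB */
    {
        n.mb_step = -1;
        b += 4;
    }

    n.index = b - 1;
    return n;
}

static neighbour_t above_neighbour(int b)
{
    neighbour_t n = { 0, 0 };

    assert(b >= 0 && b < 16);

    if (!(b >> 2))         /* top row: wrap to the MB above */
    {
        n.mb_step = -1;
        b += 16;
    }

    n.index = b - 4;
    return n;
}
```

Sub-block 0 therefore maps to index 3 of the left macroblock and index 12 of the macroblock above, that is, the rightmost column and bottom row of the respective neighbour.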
--- a/vp8/common/findnearmv.h
+++ b/vp8/common/findnearmv.h
@@ -17,6 +17,11 @@
 #include "modecont.h"
 #include "treecoder.h"

+typedef union
+{
+    unsigned int as_int;
+    MV           as_mv;
+} int_mv;        /* facilitates rapid equality tests */

 static void mv_bias(int refmb_ref_frame_sign_bias, int refframe, int_mv *mvp, const int *ref_frame_sign_bias)
 {
@@ -34,48 +39,24 @@ static void mv_bias(int refmb_ref_frame_sign_bias, int refframe, int_mv *mvp, co

 #define LEFT_TOP_MARGIN (16 << 3)
 #define RIGHT_BOTTOM_MARGIN (16 << 3)
-static void vp8_clamp_mv2(int_mv *mv, const MACROBLOCKD *xd)
+static void vp8_clamp_mv(MV *mv, const MACROBLOCKD *xd)
 {
-    if (mv->as_mv.col < (xd->mb_to_left_edge - LEFT_TOP_MARGIN))
-        mv->as_mv.col = xd->mb_to_left_edge - LEFT_TOP_MARGIN;
-    else if (mv->as_mv.col > xd->mb_to_right_edge + RIGHT_BOTTOM_MARGIN)
-        mv->as_mv.col = xd->mb_to_right_edge + RIGHT_BOTTOM_MARGIN;
+    if (mv->col < (xd->mb_to_left_edge - LEFT_TOP_MARGIN))
+        mv->col = xd->mb_to_left_edge - LEFT_TOP_MARGIN;
+    else if (mv->col > xd->mb_to_right_edge + RIGHT_BOTTOM_MARGIN)
+        mv->col = xd->mb_to_right_edge + RIGHT_BOTTOM_MARGIN;

-    if (mv->as_mv.row < (xd->mb_to_top_edge - LEFT_TOP_MARGIN))
-        mv->as_mv.row = xd->mb_to_top_edge - LEFT_TOP_MARGIN;
-    else if (mv->as_mv.row > xd->mb_to_bottom_edge + RIGHT_BOTTOM_MARGIN)
-        mv->as_mv.row = xd->mb_to_bottom_edge + RIGHT_BOTTOM_MARGIN;
-}
-
-static void vp8_clamp_mv(int_mv *mv, int mb_to_left_edge, int mb_to_right_edge,
-                         int mb_to_top_edge, int mb_to_bottom_edge)
-{
-    mv->as_mv.col = (mv->as_mv.col < mb_to_left_edge) ?
-        mb_to_left_edge : mv->as_mv.col;
-    mv->as_mv.col = (mv->as_mv.col > mb_to_right_edge) ?
-        mb_to_right_edge : mv->as_mv.col;
-    mv->as_mv.row = (mv->as_mv.row < mb_to_top_edge) ?
-        mb_to_top_edge : mv->as_mv.row;
-    mv->as_mv.row = (mv->as_mv.row > mb_to_bottom_edge) ?
-        mb_to_bottom_edge : mv->as_mv.row;
-}
-static unsigned int vp8_check_mv_bounds(int_mv *mv, int mb_to_left_edge,
-                                int mb_to_right_edge, int mb_to_top_edge,
-                                int mb_to_bottom_edge)
-{
-    unsigned int need_to_clamp;
-    need_to_clamp = (mv->as_mv.col < mb_to_left_edge) ? 1 : 0;
-    need_to_clamp |= (mv->as_mv.col > mb_to_right_edge) ? 1 : 0;
-    need_to_clamp |= (mv->as_mv.row < mb_to_top_edge) ? 1 : 0;
-    need_to_clamp |= (mv->as_mv.row > mb_to_bottom_edge) ? 1 : 0;
-    return need_to_clamp;
+    if (mv->row < (xd->mb_to_top_edge - LEFT_TOP_MARGIN))
+        mv->row = xd->mb_to_top_edge - LEFT_TOP_MARGIN;
+    else if (mv->row > xd->mb_to_bottom_edge + RIGHT_BOTTOM_MARGIN)
+        mv->row = xd->mb_to_bottom_edge + RIGHT_BOTTOM_MARGIN;
 }

 void vp8_find_near_mvs
 (
    MACROBLOCKD *xd,
    const MODE_INFO *here,
-    int_mv *nearest, int_mv *nearby, int_mv *best,
+    MV *nearest, MV *nearby, MV *best,
    int near_mv_ref_cts[4],
    int refframe,
    int *ref_frame_sign_bias
@@ -85,89 +66,10 @@ vp8_prob *vp8_mv_ref_probs(
    vp8_prob p[VP8_MVREFS-1], const int near_mv_ref_ct[4]
 );

+const B_MODE_INFO *vp8_left_bmi(const MODE_INFO *cur_mb, int b);
+
+const B_MODE_INFO *vp8_above_bmi(const MODE_INFO *cur_mb, int b, int mi_stride);
+
 extern const unsigned char vp8_mbsplit_offset[4][16];

-
-static int left_block_mv(const MODE_INFO *cur_mb, int b)
-{
-    if (!(b & 3))
-    {
-        /* On L edge, get from MB to left of us */
-        --cur_mb;
-
-        if(cur_mb->mbmi.mode != SPLITMV)
-            return cur_mb->mbmi.mv.as_int;
-        b += 4;
-    }
-
-    return (cur_mb->bmi + b - 1)->mv.as_int;
-}
-
-static int above_block_mv(const MODE_INFO *cur_mb, int b, int mi_stride)
-{
-    if (!(b >> 2))
-    {
-        /* On top edge, get from MB above us */
-        cur_mb -= mi_stride;
-
-        if(cur_mb->mbmi.mode != SPLITMV)
-            return cur_mb->mbmi.mv.as_int;
-        b += 16;
-    }
-
-    return (cur_mb->bmi + b - 4)->mv.as_int;
-}
-static B_PREDICTION_MODE left_block_mode(const MODE_INFO *cur_mb, int b)
-{
-    if (!(b & 3))
-    {
-        /* On L edge, get from MB to left of us */
-        --cur_mb;
-        switch (cur_mb->mbmi.mode)
-        {
-            case B_PRED:
-              return (cur_mb->bmi + b + 3)->as_mode;
-            case DC_PRED:
-                return B_DC_PRED;
-            case V_PRED:
-                return B_VE_PRED;
-            case H_PRED:
-                return B_HE_PRED;
-            case TM_PRED:
-                return B_TM_PRED;
-            default:
-                return B_DC_PRED;
-        }
-    }
-
-    return (cur_mb->bmi + b - 1)->as_mode;
-}
-
-static B_PREDICTION_MODE above_block_mode(const MODE_INFO *cur_mb, int b, int mi_stride)
-{
-    if (!(b >> 2))
-    {
-        /* On top edge, get from MB above us */
-        cur_mb -= mi_stride;
-
-        switch (cur_mb->mbmi.mode)
-        {
-            case B_PRED:
-              return (cur_mb->bmi + b + 12)->as_mode;
-            case DC_PRED:
-                return B_DC_PRED;
-            case V_PRED:
-                return B_VE_PRED;
-            case H_PRED:
-                return B_HE_PRED;
-            case TM_PRED:
-                return B_TM_PRED;
-            default:
-                return B_DC_PRED;
-        }
-    }
-
-    return (cur_mb->bmi + b - 4)->as_mode;
-}
-
 #endif
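
Note on vp8_clamp_mv() above: each component is limited to the distance to the frame edge plus a fixed margin of 16 << 3. The sketch below restates that rule with a small clamp helper. The edges_t struct is a hypothetical stand-in for the four mb_to_*_edge fields of MACROBLOCKD, the MV layout is assumed, and reading the << 3 as 1/8-pel units is an inference rather than something stated in the diff.

```c
#include <assert.h>

#define MV_MARGIN (16 << 3)   /* same value as LEFT_TOP_MARGIN in the diff */

typedef struct { short row; short col; } MV;              /* assumed layout */
typedef struct { int left, right, top, bottom; } edges_t; /* hypothetical   */

static int clamp_int(int v, int lo, int hi)
{
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Limit mv so it points at most MV_MARGIN beyond any frame edge. */
static void clamp_mv(MV *mv, const edges_t *e)
{
    mv->col = (short)clamp_int(mv->col, e->left - MV_MARGIN,
                               e->right + MV_MARGIN);
    mv->row = (short)clamp_int(mv->row, e->top - MV_MARGIN,
                               e->bottom + MV_MARGIN);
}
```

A vector already inside the allowed window is left unchanged, which matches the if / else if structure in the header.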
--- a/vp8/common/generic/systemdependent.c
+++ b/vp8/common/generic/systemdependent.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "vp8/common/g_common.h"
 #include "vp8/common/subpixel.h"
 #include "vp8/common/loopfilter.h"
@@ -17,54 +17,9 @@
 #include "vp8/common/idct.h"
 #include "vp8/common/onyxc_int.h"

-#if CONFIG_MULTITHREAD
-#if HAVE_UNISTD_H
-#include <unistd.h>
-#elif defined(_WIN32)
-#include <windows.h>
-typedef void (WINAPI *PGNSI)(LPSYSTEM_INFO);
-#endif
-#endif
-
 extern void vp8_arch_x86_common_init(VP8_COMMON *ctx);
 extern void vp8_arch_arm_common_init(VP8_COMMON *ctx);

-#if CONFIG_MULTITHREAD
-static int get_cpu_count()
-{
-    int core_count = 16;
-
-#if HAVE_UNISTD_H
-#if defined(_SC_NPROCESSORS_ONLN)
-    core_count = sysconf(_SC_NPROCESSORS_ONLN);
-#elif defined(_SC_NPROC_ONLN)
-    core_count = sysconf(_SC_NPROC_ONLN);
-#endif
-#elif defined(_WIN32)
-    {
-        PGNSI pGNSI;
-        SYSTEM_INFO sysinfo;
-
-        /* Call GetNativeSystemInfo if supported or
-         * GetSystemInfo otherwise. */
-
-        pGNSI = (PGNSI) GetProcAddress(
-                GetModuleHandle(TEXT("kernel32.dll")), "GetNativeSystemInfo");
-        if (pGNSI != NULL)
-            pGNSI(&sysinfo);
-        else
-            GetSystemInfo(&sysinfo);
-
-        core_count = sysinfo.dwNumberOfProcessors;
-    }
-#else
-    /* other platforms */
-#endif
-
-    return core_count > 0 ? core_count : 1;
-}
-#endif
-
 void vp8_machine_specific_config(VP8_COMMON *ctx)
 {
 #if CONFIG_RUNTIME_CPU_DETECT
@@ -88,12 +43,6 @@ void vp8_machine_specific_config(VP8_COMMON *ctx)
        vp8_build_intra_predictors_mby;
    rtcd->recon.build_intra_predictors_mby_s =
        vp8_build_intra_predictors_mby_s;
-    rtcd->recon.build_intra_predictors_mbuv =
-        vp8_build_intra_predictors_mbuv;
-    rtcd->recon.build_intra_predictors_mbuv_s =
-        vp8_build_intra_predictors_mbuv_s;
-    rtcd->recon.intra4x4_predict =
-        vp8_intra4x4_predict;

    rtcd->subpix.sixtap16x16   = vp8_sixtap_predict16x16_c;
    rtcd->subpix.sixtap8x8     = vp8_sixtap_predict8x8_c;
@@ -108,12 +57,12 @@ void vp8_machine_specific_config(VP8_COMMON *ctx)
    rtcd->loopfilter.normal_b_v  = vp8_loop_filter_bv_c;
    rtcd->loopfilter.normal_mb_h = vp8_loop_filter_mbh_c;
    rtcd->loopfilter.normal_b_h  = vp8_loop_filter_bh_c;
-    rtcd->loopfilter.simple_mb_v = vp8_loop_filter_simple_vertical_edge_c;
+    rtcd->loopfilter.simple_mb_v = vp8_loop_filter_mbvs_c;
    rtcd->loopfilter.simple_b_v  = vp8_loop_filter_bvs_c;
-    rtcd->loopfilter.simple_mb_h = vp8_loop_filter_simple_horizontal_edge_c;
+    rtcd->loopfilter.simple_mb_h = vp8_loop_filter_mbhs_c;
    rtcd->loopfilter.simple_b_h  = vp8_loop_filter_bhs_c;

-#if CONFIG_POSTPROC || (CONFIG_VP8_ENCODER && CONFIG_INTERNAL_STATS)
+#if CONFIG_POSTPROC || (CONFIG_VP8_ENCODER && CONFIG_PSNR)
    rtcd->postproc.down             = vp8_mbpost_proc_down_c;
    rtcd->postproc.across           = vp8_mbpost_proc_across_ip_c;
    rtcd->postproc.downacross       = vp8_post_proc_down_and_across_c;
@@ -133,7 +82,4 @@ void vp8_machine_specific_config(VP8_COMMON *ctx)
    vp8_arch_arm_common_init(ctx);
 #endif

-#if CONFIG_MULTITHREAD
-    ctx->processor_core_count = get_cpu_count();
-#endif /* CONFIG_MULTITHREAD */
 }
--- a/vp8/common/invtrans.h
+++ b/vp8/common/invtrans.h
@@ -12,7 +12,7 @@
 #ifndef __INC_INVTRANS_H
 #define __INC_INVTRANS_H

-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "idct.h"
 #include "blockd.h"
 extern void vp8_inverse_transform_b(const vp8_idct_rtcd_vtable_t *rtcd, BLOCKD *b, int pitch);
--- a/vp8/common/loopfilter.c
+++ b/vp8/common/loopfilter.c
@@ -9,149 +9,160 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "loopfilter.h"
 #include "onyxc_int.h"
-#include "vpx_mem/vpx_mem.h"

 typedef unsigned char uc;

+
 prototype_loopfilter(vp8_loop_filter_horizontal_edge_c);
 prototype_loopfilter(vp8_loop_filter_vertical_edge_c);
 prototype_loopfilter(vp8_mbloop_filter_horizontal_edge_c);
 prototype_loopfilter(vp8_mbloop_filter_vertical_edge_c);
-
-prototype_simple_loopfilter(vp8_loop_filter_simple_horizontal_edge_c);
-prototype_simple_loopfilter(vp8_loop_filter_simple_vertical_edge_c);
+prototype_loopfilter(vp8_loop_filter_simple_horizontal_edge_c);
+prototype_loopfilter(vp8_loop_filter_simple_vertical_edge_c);

 /* Horizontal MB filtering */
-void vp8_loop_filter_mbh_c(unsigned char *y_ptr, unsigned char *u_ptr,
-                           unsigned char *v_ptr, int y_stride, int uv_stride,
-                           loop_filter_info *lfi)
+void vp8_loop_filter_mbh_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                           int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_horizontal_edge_c(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_horizontal_edge_c(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_horizontal_edge_c(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_horizontal_edge_c(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_mbloop_filter_horizontal_edge_c(v_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_horizontal_edge_c(v_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);
+}
+
+void vp8_loop_filter_mbhs_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                            int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_c(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }

 /* Vertical MB Filtering */
-void vp8_loop_filter_mbv_c(unsigned char *y_ptr, unsigned char *u_ptr,
-                           unsigned char *v_ptr, int y_stride, int uv_stride,
-                           loop_filter_info *lfi)
+void vp8_loop_filter_mbv_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                           int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_vertical_edge_c(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_vertical_edge_c(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_vertical_edge_c(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_vertical_edge_c(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_mbloop_filter_vertical_edge_c(v_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_vertical_edge_c(v_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);
+}
+
+void vp8_loop_filter_mbvs_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                            int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_c(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }

 /* Horizontal B Filtering */
-void vp8_loop_filter_bh_c(unsigned char *y_ptr, unsigned char *u_ptr,
-                          unsigned char *v_ptr, int y_stride, int uv_stride,
-                          loop_filter_info *lfi)
+void vp8_loop_filter_bh_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                          int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_horizontal_edge_c(y_ptr + 4 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_c(y_ptr + 8 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_c(y_ptr + 12 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_horizontal_edge_c(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_c(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_c(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_horizontal_edge_c(u_ptr + 4 * uv_stride, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_horizontal_edge_c(u_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_loop_filter_horizontal_edge_c(v_ptr + 4 * uv_stride, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_horizontal_edge_c(v_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);
 }

-void vp8_loop_filter_bhs_c(unsigned char *y_ptr, int y_stride,
-                           const unsigned char *blimit)
+void vp8_loop_filter_bhs_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                           int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_horizontal_edge_c(y_ptr + 4 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_c(y_ptr + 8 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_c(y_ptr + 12 * y_stride, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_c(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_c(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_c(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }

 /* Vertical B Filtering */
-void vp8_loop_filter_bv_c(unsigned char *y_ptr, unsigned char *u_ptr,
-                          unsigned char *v_ptr, int y_stride, int uv_stride,
-                          loop_filter_info *lfi)
+void vp8_loop_filter_bv_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                          int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_vertical_edge_c(y_ptr + 4, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_c(y_ptr + 8, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_c(y_ptr + 12, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_vertical_edge_c(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_c(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_c(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_vertical_edge_c(u_ptr + 4, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_vertical_edge_c(u_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_loop_filter_vertical_edge_c(v_ptr + 4, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_vertical_edge_c(v_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);
 }

-void vp8_loop_filter_bvs_c(unsigned char *y_ptr, int y_stride,
-                           const unsigned char *blimit)
+void vp8_loop_filter_bvs_c(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                           int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_vertical_edge_c(y_ptr + 4, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_c(y_ptr + 8, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_c(y_ptr + 12, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_c(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_c(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_c(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }

-static void lf_init_lut(loop_filter_info_n *lfi)
+void vp8_init_loop_filter(VP8_COMMON *cm)
 {
-    int filt_lvl;
+    loop_filter_info *lfi = cm->lf_info;
+    LOOPFILTERTYPE lft = cm->filter_type;
+    int sharpness_lvl = cm->sharpness_level;
+    int frame_type = cm->frame_type;
+    int i, j;

-    for (filt_lvl = 0; filt_lvl <= MAX_LOOP_FILTER; filt_lvl++)
-    {
-        if (filt_lvl >= 40)
-        {
-            lfi->hev_thr_lut[KEY_FRAME][filt_lvl] = 2;
-            lfi->hev_thr_lut[INTER_FRAME][filt_lvl] = 3;
-        }
-        else if (filt_lvl >= 20)
-        {
-            lfi->hev_thr_lut[KEY_FRAME][filt_lvl] = 1;
-            lfi->hev_thr_lut[INTER_FRAME][filt_lvl] = 2;
-        }
-        else if (filt_lvl >= 15)
-        {
-            lfi->hev_thr_lut[KEY_FRAME][filt_lvl] = 1;
-            lfi->hev_thr_lut[INTER_FRAME][filt_lvl] = 1;
-        }
-        else
-        {
-            lfi->hev_thr_lut[KEY_FRAME][filt_lvl] = 0;
-            lfi->hev_thr_lut[INTER_FRAME][filt_lvl] = 0;
-        }
-    }
+    int block_inside_limit = 0;
+    int HEVThresh;

-    lfi->mode_lf_lut[DC_PRED] = 1;
-    lfi->mode_lf_lut[V_PRED] = 1;
-    lfi->mode_lf_lut[H_PRED] = 1;
-    lfi->mode_lf_lut[TM_PRED] = 1;
-    lfi->mode_lf_lut[B_PRED]  = 0;
-
-    lfi->mode_lf_lut[ZEROMV]  = 1;
-    lfi->mode_lf_lut[NEARESTMV] = 2;
-    lfi->mode_lf_lut[NEARMV] = 2;
-    lfi->mode_lf_lut[NEWMV] = 2;
-    lfi->mode_lf_lut[SPLITMV] = 3;
-
-}
-
-void vp8_loop_filter_update_sharpness(loop_filter_info_n *lfi,
-                                      int sharpness_lvl)
-{
-    int i;
-
-    /* For each possible value for the loop filter fill out limits */
+    /* For each possible value for the loop filter fill out a "loop_filter_info" entry. */
    for (i = 0; i <= MAX_LOOP_FILTER; i++)
    {
        int filt_lvl = i;
-        int block_inside_limit = 0;
+
+        if (frame_type == KEY_FRAME)
+        {
+            if (filt_lvl >= 40)
+                HEVThresh = 2;
+            else if (filt_lvl >= 15)
+                HEVThresh = 1;
+            else
+                HEVThresh = 0;
+        }
+        else
+        {
+            if (filt_lvl >= 40)
+                HEVThresh = 3;
+            else if (filt_lvl >= 20)
+                HEVThresh = 2;
+            else if (filt_lvl >= 15)
+                HEVThresh = 1;
+            else
+                HEVThresh = 0;
+        }

        /* Set loop filter paramaeters that control sharpness. */
        block_inside_limit = filt_lvl >> (sharpness_lvl > 0);
@@ -166,143 +177,170 @@ void vp8_loop_filter_update_sharpness(loop_filter_info_n *lfi,
        if (block_inside_limit < 1)
            block_inside_limit = 1;

-        vpx_memset(lfi->lim[i], block_inside_limit, SIMD_WIDTH);
-        vpx_memset(lfi->blim[i], (2 * filt_lvl + block_inside_limit),
-                SIMD_WIDTH);
-        vpx_memset(lfi->mblim[i], (2 * (filt_lvl + 2) + block_inside_limit),
-                SIMD_WIDTH);
+        for (j = 0; j < 16; j++)
+        {
+            lfi[i].lim[j] = block_inside_limit;
+            lfi[i].mbflim[j] = filt_lvl + 2;
+            lfi[i].flim[j] = filt_lvl;
+            lfi[i].thr[j] = HEVThresh;
+        }
+
+    }
+
+    /* Set up the function pointers depending on the type of loop filtering selected */
+    if (lft == NORMAL_LOOPFILTER)
+    {
+        cm->lf_mbv = LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_v);
+        cm->lf_bv  = LF_INVOKE(&cm->rtcd.loopfilter, normal_b_v);
+        cm->lf_mbh = LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_h);
+        cm->lf_bh  = LF_INVOKE(&cm->rtcd.loopfilter, normal_b_h);
+    }
+    else
+    {
+        cm->lf_mbv = LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_v);
+        cm->lf_bv  = LF_INVOKE(&cm->rtcd.loopfilter, simple_b_v);
+        cm->lf_mbh = LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_h);
+        cm->lf_bh  = LF_INVOKE(&cm->rtcd.loopfilter, simple_b_h);
    }
 }

-void vp8_loop_filter_init(VP8_COMMON *cm)
+/* Call vp8_init_loop_filter() from vp8dx_create_decompressor(). While decoding, call only
+ * vp8_frame_init_loop_filter() for each frame; checking last_frame_type lets most frames skip it.
+ */
+void vp8_frame_init_loop_filter(loop_filter_info *lfi, int frame_type)
 {
-    loop_filter_info_n *lfi = &cm->lf_info;
-    int i;
+    int HEVThresh;
+    int i, j;

-    /* init limits for given sharpness*/
-    vp8_loop_filter_update_sharpness(lfi, cm->sharpness_level);
-    cm->last_sharpness_level = cm->sharpness_level;
-
-    /* init LUT for lvl  and hev thr picking */
-    lf_init_lut(lfi);
-
-    /* init hev threshold const vectors */
-    for(i = 0; i < 4 ; i++)
+    /* For each possible value for the loop filter fill out a "loop_filter_info" entry. */
+    for (i = 0; i <= MAX_LOOP_FILTER; i++)
    {
-        vpx_memset(lfi->hev_thr[i], i, SIMD_WIDTH);
+        int filt_lvl = i;
+
+        if (frame_type == KEY_FRAME)
+        {
+            if (filt_lvl >= 40)
+                HEVThresh = 2;
+            else if (filt_lvl >= 15)
+                HEVThresh = 1;
+            else
+                HEVThresh = 0;
+        }
+        else
+        {
+            if (filt_lvl >= 40)
+                HEVThresh = 3;
+            else if (filt_lvl >= 20)
+                HEVThresh = 2;
+            else if (filt_lvl >= 15)
+                HEVThresh = 1;
+            else
+                HEVThresh = 0;
+        }
+
+        for (j = 0; j < 16; j++)
+        {
+            /*lfi[i].lim[j] = block_inside_limit;
+            lfi[i].mbflim[j] = filt_lvl+2;*/
+            /*lfi[i].flim[j] = filt_lvl;*/
+            lfi[i].thr[j] = HEVThresh;
+        }
    }
 }

-void vp8_loop_filter_frame_init(VP8_COMMON *cm,
-                                MACROBLOCKD *mbd,
-                                int default_filt_lvl)
+
+int vp8_adjust_mb_lf_value(MACROBLOCKD *mbd, int filter_level)
 {
-    int seg,  /* segment number */
-        ref,  /* index in ref_lf_deltas */
-        mode; /* index in mode_lf_deltas */
+    MB_MODE_INFO *mbmi = &mbd->mode_info_context->mbmi;

-    loop_filter_info_n *lfi = &cm->lf_info;
-
-    /* update limits if sharpness has changed */
-    if(cm->last_sharpness_level != cm->sharpness_level)
+    if (mbd->mode_ref_lf_delta_enabled)
    {
-        vp8_loop_filter_update_sharpness(lfi, cm->sharpness_level);
-        cm->last_sharpness_level = cm->sharpness_level;
-    }
-
-    for(seg = 0; seg < MAX_MB_SEGMENTS; seg++)
-    {
-        int lvl_seg = default_filt_lvl;
-        int lvl_ref, lvl_mode;
-
-        /* Note the baseline filter values for each segment */
-        if (mbd->segmentation_enabled)
-        {
-            /* Abs value */
-            if (mbd->mb_segement_abs_delta == SEGMENT_ABSDATA)
-            {
-                lvl_seg = mbd->segment_feature_data[MB_LVL_ALT_LF][seg];
-            }
-            else  /* Delta Value */
-            {
-                lvl_seg += mbd->segment_feature_data[MB_LVL_ALT_LF][seg];
-                lvl_seg = (lvl_seg > 0) ? ((lvl_seg > 63) ? 63: lvl_seg) : 0;
-            }
-        }
-
-        if (!mbd->mode_ref_lf_delta_enabled)
-        {
-            /* we could get rid of this if we assume that deltas are set to
-             * zero when not in use; encoder always uses deltas
-             */
-            vpx_memset(lfi->lvl[seg][0], lvl_seg, 4 * 4 );
-            continue;
-        }
-
-        lvl_ref = lvl_seg;
-
-        /* INTRA_FRAME */
-        ref = INTRA_FRAME;
-
        /* Apply delta for reference frame */
-        lvl_ref += mbd->ref_lf_deltas[ref];
+        filter_level += mbd->ref_lf_deltas[mbmi->ref_frame];

-        /* Apply delta for Intra modes */
-        mode = 0; /* B_PRED */
-        /* Only the split mode BPRED has a further special case */
-        lvl_mode = lvl_ref +  mbd->mode_lf_deltas[mode];
-        lvl_mode = (lvl_mode > 0) ? (lvl_mode > 63 ? 63 : lvl_mode) : 0; /* clamp */
-
-        lfi->lvl[seg][ref][mode] = lvl_mode;
-
-        mode = 1; /* all the rest of Intra modes */
-        lvl_mode = (lvl_ref > 0) ? (lvl_ref > 63 ? 63 : lvl_ref)  : 0; /* clamp */
-        lfi->lvl[seg][ref][mode] = lvl_mode;
-
-        /* LAST, GOLDEN, ALT */
-        for(ref = 1; ref < MAX_REF_FRAMES; ref++)
+        /* Apply delta for mode */
+        if (mbmi->ref_frame == INTRA_FRAME)
        {
-            int lvl_ref = lvl_seg;
-
-            /* Apply delta for reference frame */
-            lvl_ref += mbd->ref_lf_deltas[ref];
-
-            /* Apply delta for Inter modes */
-            for (mode = 1; mode < 4; mode++)
-            {
-                lvl_mode = lvl_ref + mbd->mode_lf_deltas[mode];
-                lvl_mode = (lvl_mode > 0) ? (lvl_mode > 63 ? 63 : lvl_mode) : 0; /* clamp */
-
-                lfi->lvl[seg][ref][mode] = lvl_mode;
-            }
+            /* Only the split mode BPRED has a further special case */
+            if (mbmi->mode == B_PRED)
+                filter_level +=  mbd->mode_lf_deltas[0];
        }
+        else
+        {
+            /* Zero motion mode */
+            if (mbmi->mode == ZEROMV)
+                filter_level +=  mbd->mode_lf_deltas[1];
+
+            /* Split MB motion mode */
+            else if (mbmi->mode == SPLITMV)
+                filter_level +=  mbd->mode_lf_deltas[3];
+
+            /* All other inter motion modes (Nearest, Near, New) */
+            else
+                filter_level +=  mbd->mode_lf_deltas[2];
+        }
+
+        /* Range check */
+        if (filter_level > MAX_LOOP_FILTER)
+            filter_level = MAX_LOOP_FILTER;
+        else if (filter_level < 0)
+            filter_level = 0;
    }
+    return filter_level;
 }

+
 void vp8_loop_filter_frame
 (
    VP8_COMMON *cm,
-    MACROBLOCKD *mbd
+    MACROBLOCKD *mbd,
+    int default_filt_lvl
 )
 {
    YV12_BUFFER_CONFIG *post = cm->frame_to_show;
-    loop_filter_info_n *lfi_n = &cm->lf_info;
-    loop_filter_info lfi;
-
+    loop_filter_info *lfi = cm->lf_info;
    FRAME_TYPE frame_type = cm->frame_type;

    int mb_row;
    int mb_col;

-    int filter_level;

+    int baseline_filter_level[MAX_MB_SEGMENTS];
+    int filter_level;
+    int alt_flt_enabled = mbd->segmentation_enabled;
+
+    int i;
    unsigned char *y_ptr, *u_ptr, *v_ptr;

-    /* Point at base of Mb MODE_INFO list */
-    const MODE_INFO *mode_info_context = cm->mi;
+    mbd->mode_info_context = cm->mi;          /* Point at base of Mb MODE_INFO list */
+
+    /* Note the baseline filter values for each segment */
+    if (alt_flt_enabled)
+    {
+        for (i = 0; i < MAX_MB_SEGMENTS; i++)
+        {
+            /* Abs value */
+            if (mbd->mb_segement_abs_delta == SEGMENT_ABSDATA)
+                baseline_filter_level[i] = mbd->segment_feature_data[MB_LVL_ALT_LF][i];
+            /* Delta Value */
+            else
+            {
+                baseline_filter_level[i] = default_filt_lvl + mbd->segment_feature_data[MB_LVL_ALT_LF][i];
+                baseline_filter_level[i] = (baseline_filter_level[i] >= 0) ? ((baseline_filter_level[i] <= MAX_LOOP_FILTER) ? baseline_filter_level[i] : MAX_LOOP_FILTER) : 0;  /* Clamp to valid range */
+            }
+        }
+    }
+    else
+    {
+        for (i = 0; i < MAX_MB_SEGMENTS; i++)
+            baseline_filter_level[i] = default_filt_lvl;
+    }

    /* Initialize the loop filter for this frame. */
-    vp8_loop_filter_frame_init(cm, mbd, cm->filter_level);
+    if ((cm->last_filter_type != cm->filter_type) || (cm->last_sharpness_level != cm->sharpness_level))
+        vp8_init_loop_filter(cm);
+    else if (frame_type != cm->last_frame_type)
+        vp8_frame_init_loop_filter(lfi, frame_type);

    /* Set up the buffer pointers */
    y_ptr = post->y_buffer;
@@ -314,108 +352,101 @@ void vp8_loop_filter_frame
    {
        for (mb_col = 0; mb_col < cm->mb_cols; mb_col++)
        {
-            int skip_lf = (mode_info_context->mbmi.mode != B_PRED &&
-                            mode_info_context->mbmi.mode != SPLITMV &&
-                            mode_info_context->mbmi.mb_skip_coeff);
+            int Segment = (alt_flt_enabled) ? mbd->mode_info_context->mbmi.segment_id : 0;

-            const int mode_index = lfi_n->mode_lf_lut[mode_info_context->mbmi.mode];
-            const int seg = mode_info_context->mbmi.segment_id;
-            const int ref_frame = mode_info_context->mbmi.ref_frame;
+            filter_level = baseline_filter_level[Segment];

-            filter_level = lfi_n->lvl[seg][ref_frame][mode_index];
+            /* Distance of MB to the various image edges.
+             * These are specified to 1/8th pel, as they are always compared to values in 1/8th pel units.
+             * Apply any context-driven MB-level adjustment.
+             */
+            filter_level = vp8_adjust_mb_lf_value(mbd, filter_level);

            if (filter_level)
            {
-                if (cm->filter_type == NORMAL_LOOPFILTER)
-                {
-                    const int hev_index = lfi_n->hev_thr_lut[frame_type][filter_level];
-                    lfi.mblim = lfi_n->mblim[filter_level];
-                    lfi.blim = lfi_n->blim[filter_level];
-                    lfi.lim = lfi_n->lim[filter_level];
-                    lfi.hev_thr = lfi_n->hev_thr[hev_index];
+                if (mb_col > 0)
+                    cm->lf_mbv(y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi[filter_level], cm->simpler_lpf);

-                    if (mb_col > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_v)
-                        (y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi);
+                if (mbd->mode_info_context->mbmi.dc_diff > 0)
+                    cm->lf_bv(y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi[filter_level], cm->simpler_lpf);

-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_b_v)
-                        (y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi);
+                /* don't apply across umv border */
+                if (mb_row > 0)
+                    cm->lf_mbh(y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi[filter_level], cm->simpler_lpf);

-                    /* don't apply across umv border */
-                    if (mb_row > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_h)
-                        (y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_b_h)
-                        (y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi);
-                }
-                else
-                {
-                    if (mb_col > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_v)
-                        (y_ptr, post->y_stride, lfi_n->mblim[filter_level]);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_b_v)
-                        (y_ptr, post->y_stride, lfi_n->blim[filter_level]);
-
-                    /* don't apply across umv border */
-                    if (mb_row > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_h)
-                        (y_ptr, post->y_stride, lfi_n->mblim[filter_level]);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_b_h)
-                        (y_ptr, post->y_stride, lfi_n->blim[filter_level]);
-                }
+                if (mbd->mode_info_context->mbmi.dc_diff > 0)
+                    cm->lf_bh(y_ptr, u_ptr, v_ptr, post->y_stride, post->uv_stride, &lfi[filter_level], cm->simpler_lpf);
            }

            y_ptr += 16;
            u_ptr += 8;
            v_ptr += 8;

-            mode_info_context++;     /* step to next MB */
+            mbd->mode_info_context++;     /* step to next MB */
        }

        y_ptr += post->y_stride  * 16 - post->y_width;
        u_ptr += post->uv_stride *  8 - post->uv_width;
        v_ptr += post->uv_stride *  8 - post->uv_width;

-        mode_info_context++;         /* Skip border mb */
+        mbd->mode_info_context++;         /* Skip border mb */
    }
 }

+
 void vp8_loop_filter_frame_yonly
 (
    VP8_COMMON *cm,
    MACROBLOCKD *mbd,
-    int default_filt_lvl
+    int default_filt_lvl,
+    int sharpness_lvl
 )
 {
    YV12_BUFFER_CONFIG *post = cm->frame_to_show;

+    int i;
    unsigned char *y_ptr;
    int mb_row;
    int mb_col;

-    loop_filter_info_n *lfi_n = &cm->lf_info;
-    loop_filter_info lfi;
-
+    loop_filter_info *lfi = cm->lf_info;
+    int baseline_filter_level[MAX_MB_SEGMENTS];
    int filter_level;
+    int alt_flt_enabled = mbd->segmentation_enabled;
    FRAME_TYPE frame_type = cm->frame_type;

-    /* Point at base of Mb MODE_INFO list */
-    const MODE_INFO *mode_info_context = cm->mi;
+    (void) sharpness_lvl;

-#if 0
-    if(default_filt_lvl == 0) /* no filter applied */
-        return;
-#endif
+    /*MODE_INFO * this_mb_mode_info = cm->mi;*/ /* Point at base of Mb MODE_INFO list */
+    mbd->mode_info_context = cm->mi;          /* Point at base of Mb MODE_INFO list */
+
+    /* Note the baseline filter values for each segment */
+    if (alt_flt_enabled)
+    {
+        for (i = 0; i < MAX_MB_SEGMENTS; i++)
+        {
+            /* Abs value */
+            if (mbd->mb_segement_abs_delta == SEGMENT_ABSDATA)
+                baseline_filter_level[i] = mbd->segment_feature_data[MB_LVL_ALT_LF][i];
+            /* Delta Value */
+            else
+            {
+                baseline_filter_level[i] = default_filt_lvl + mbd->segment_feature_data[MB_LVL_ALT_LF][i];
+                baseline_filter_level[i] = (baseline_filter_level[i] >= 0) ? ((baseline_filter_level[i] <= MAX_LOOP_FILTER) ? baseline_filter_level[i] : MAX_LOOP_FILTER) : 0;  /* Clamp to valid range */
+            }
+        }
+    }
+    else
+    {
+        for (i = 0; i < MAX_MB_SEGMENTS; i++)
+            baseline_filter_level[i] = default_filt_lvl;
+    }

    /* Initialize the loop filter for this frame. */
-    vp8_loop_filter_frame_init( cm, mbd, default_filt_lvl);
+    if ((cm->last_filter_type != cm->filter_type) || (cm->last_sharpness_level != cm->sharpness_level))
+        vp8_init_loop_filter(cm);
+    else if (frame_type != cm->last_frame_type)
+        vp8_frame_init_loop_filter(lfi, frame_type);

    /* Set up the buffer pointers */
    y_ptr = post->y_buffer;
@@ -425,106 +456,72 @@ void vp8_loop_filter_frame_yonly
    {
        for (mb_col = 0; mb_col < cm->mb_cols; mb_col++)
        {
-            int skip_lf = (mode_info_context->mbmi.mode != B_PRED &&
-                            mode_info_context->mbmi.mode != SPLITMV &&
-                            mode_info_context->mbmi.mb_skip_coeff);
+            int Segment = (alt_flt_enabled) ? mbd->mode_info_context->mbmi.segment_id : 0;
+            filter_level = baseline_filter_level[Segment];

-            const int mode_index = lfi_n->mode_lf_lut[mode_info_context->mbmi.mode];
-            const int seg = mode_info_context->mbmi.segment_id;
-            const int ref_frame = mode_info_context->mbmi.ref_frame;
-
-            filter_level = lfi_n->lvl[seg][ref_frame][mode_index];
+            /* Apply any context driven MB level adjustment */
+            filter_level = vp8_adjust_mb_lf_value(mbd, filter_level);

            if (filter_level)
            {
-                if (cm->filter_type == NORMAL_LOOPFILTER)
-                {
-                    const int hev_index = lfi_n->hev_thr_lut[frame_type][filter_level];
-                    lfi.mblim = lfi_n->mblim[filter_level];
-                    lfi.blim = lfi_n->blim[filter_level];
-                    lfi.lim = lfi_n->lim[filter_level];
-                    lfi.hev_thr = lfi_n->hev_thr[hev_index];
+                if (mb_col > 0)
+                    cm->lf_mbv(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);

-                    if (mb_col > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_v)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
+                if (mbd->mode_info_context->mbmi.dc_diff > 0)
+                    cm->lf_bv(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);

-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_b_v)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
+                /* don't apply across umv border */
+                if (mb_row > 0)
+                    cm->lf_mbh(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);

-                    /* don't apply across umv border */
-                    if (mb_row > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_h)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_b_h)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
-                }
-                else
-                {
-                    if (mb_col > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_v)
-                        (y_ptr, post->y_stride, lfi_n->mblim[filter_level]);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_b_v)
-                        (y_ptr, post->y_stride, lfi_n->blim[filter_level]);
-
-                    /* don't apply across umv border */
-                    if (mb_row > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_h)
-                        (y_ptr, post->y_stride, lfi_n->mblim[filter_level]);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_b_h)
-                        (y_ptr, post->y_stride, lfi_n->blim[filter_level]);
-                }
+                if (mbd->mode_info_context->mbmi.dc_diff > 0)
+                    cm->lf_bh(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);
            }

            y_ptr += 16;
-            mode_info_context ++;        /* step to next MB */
+            mbd->mode_info_context ++;        /* step to next MB */

        }

        y_ptr += post->y_stride  * 16 - post->y_width;
-        mode_info_context ++;            /* Skip border mb */
+        mbd->mode_info_context ++;            /* Skip border mb */
    }

 }

+
 void vp8_loop_filter_partial_frame
 (
    VP8_COMMON *cm,
    MACROBLOCKD *mbd,
-    int default_filt_lvl
+    int default_filt_lvl,
+    int sharpness_lvl,
+    int Fraction
 )
 {
    YV12_BUFFER_CONFIG *post = cm->frame_to_show;

+    int i;
    unsigned char *y_ptr;
    int mb_row;
    int mb_col;
+    /*int mb_rows = post->y_height >> 4;*/
    int mb_cols = post->y_width  >> 4;

-    int linestocopy, i;
-
-    loop_filter_info_n *lfi_n = &cm->lf_info;
-    loop_filter_info lfi;
+    int linestocopy;

+    loop_filter_info *lfi = cm->lf_info;
+    int baseline_filter_level[MAX_MB_SEGMENTS];
    int filter_level;
    int alt_flt_enabled = mbd->segmentation_enabled;
    FRAME_TYPE frame_type = cm->frame_type;

-    const MODE_INFO *mode_info_context;
+    (void) sharpness_lvl;

-    int lvl_seg[MAX_MB_SEGMENTS];
+    /*MODE_INFO * this_mb_mode_info = cm->mi + (post->y_height>>5) * (mb_cols + 1);*/ /* Point at base of Mb MODE_INFO list */
+    mbd->mode_info_context = cm->mi + (post->y_height >> 5) * (mb_cols + 1);        /* Point at base of Mb MODE_INFO list */

-    mode_info_context = cm->mi + (post->y_height >> 5) * (mb_cols + 1);
-
-    /* 3 is a magic number. 4 is probably magic too */
-    linestocopy = (post->y_height >> (4 + 3));
+    linestocopy = (post->y_height >> (4 + Fraction));

    if (linestocopy < 1)
        linestocopy = 1;
@@ -532,27 +529,32 @@ void vp8_loop_filter_partial_frame
    linestocopy <<= 4;

    /* Note the baseline filter values for each segment */
-    /* See vp8_loop_filter_frame_init. Rather than call that for each change
-     * to default_filt_lvl, copy the relevant calculation here.
-     */
    if (alt_flt_enabled)
    {
        for (i = 0; i < MAX_MB_SEGMENTS; i++)
-        {    /* Abs value */
+        {
+            /* Abs value */
            if (mbd->mb_segement_abs_delta == SEGMENT_ABSDATA)
-            {
-                lvl_seg[i] = mbd->segment_feature_data[MB_LVL_ALT_LF][i];
-            }
+                baseline_filter_level[i] = mbd->segment_feature_data[MB_LVL_ALT_LF][i];
            /* Delta Value */
            else
            {
-                lvl_seg[i] = default_filt_lvl
-                        + mbd->segment_feature_data[MB_LVL_ALT_LF][i];
-                lvl_seg[i] = (lvl_seg[i] > 0) ?
-                        ((lvl_seg[i] > 63) ? 63: lvl_seg[i]) : 0;
+                baseline_filter_level[i] = default_filt_lvl + mbd->segment_feature_data[MB_LVL_ALT_LF][i];
+                baseline_filter_level[i] = (baseline_filter_level[i] >= 0) ? ((baseline_filter_level[i] <= MAX_LOOP_FILTER) ? baseline_filter_level[i] : MAX_LOOP_FILTER) : 0;  /* Clamp to valid range */
            }
        }
    }
+    else
+    {
+        for (i = 0; i < MAX_MB_SEGMENTS; i++)
+            baseline_filter_level[i] = default_filt_lvl;
+    }
+
+    /* Initialize the loop filter for this frame. */
+    if ((cm->last_filter_type != cm->filter_type) || (cm->last_sharpness_level != cm->sharpness_level))
+        vp8_init_loop_filter(cm);
+    else if (frame_type != cm->last_frame_type)
+        vp8_frame_init_loop_filter(lfi, frame_type);

    /* Set up the buffer pointers */
    y_ptr = post->y_buffer + (post->y_height >> 5) * 16 * post->y_stride;
@@ -562,64 +564,28 @@ void vp8_loop_filter_partial_frame
    {
        for (mb_col = 0; mb_col < mb_cols; mb_col++)
        {
-            int skip_lf = (mode_info_context->mbmi.mode != B_PRED &&
-                           mode_info_context->mbmi.mode != SPLITMV &&
-                           mode_info_context->mbmi.mb_skip_coeff);
-
-            if (alt_flt_enabled)
-                filter_level = lvl_seg[mode_info_context->mbmi.segment_id];
-            else
-                filter_level = default_filt_lvl;
+            int Segment = (alt_flt_enabled) ? mbd->mode_info_context->mbmi.segment_id : 0;
+            filter_level = baseline_filter_level[Segment];

            if (filter_level)
            {
-                if (cm->filter_type == NORMAL_LOOPFILTER)
-                {
-                    const int hev_index = lfi_n->hev_thr_lut[frame_type][filter_level];
-                    lfi.mblim = lfi_n->mblim[filter_level];
-                    lfi.blim = lfi_n->blim[filter_level];
-                    lfi.lim = lfi_n->lim[filter_level];
-                    lfi.hev_thr = lfi_n->hev_thr[hev_index];
+                if (mb_col > 0)
+                    cm->lf_mbv(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);

-                    if (mb_col > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_v)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
+                if (mbd->mode_info_context->mbmi.dc_diff > 0)
+                    cm->lf_bv(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);

-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_b_v)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
+                cm->lf_mbh(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);

-                    LF_INVOKE(&cm->rtcd.loopfilter, normal_mb_h)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, normal_b_h)
-                        (y_ptr, 0, 0, post->y_stride, 0, &lfi);
-                }
-                else
-                {
-                    if (mb_col > 0)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_v)
-                        (y_ptr, post->y_stride, lfi_n->mblim[filter_level]);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_b_v)
-                        (y_ptr, post->y_stride, lfi_n->blim[filter_level]);
-
-                    LF_INVOKE(&cm->rtcd.loopfilter, simple_mb_h)
-                        (y_ptr, post->y_stride, lfi_n->mblim[filter_level]);
-
-                    if (!skip_lf)
-                        LF_INVOKE(&cm->rtcd.loopfilter, simple_b_h)
-                        (y_ptr, post->y_stride, lfi_n->blim[filter_level]);
-                }
+                if (mbd->mode_info_context->mbmi.dc_diff > 0)
+                    cm->lf_bh(y_ptr, 0, 0, post->y_stride, 0, &lfi[filter_level], 0);
            }

            y_ptr += 16;
-            mode_info_context += 1;      /* step to next MB */
+            mbd->mode_info_context += 1;      /* step to next MB */
        }

        y_ptr += post->y_stride  * 16 - post->y_width;
-        mode_info_context += 1;          /* Skip border mb */
+        mbd->mode_info_context += 1;          /* Skip border mb */
    }
 }
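
Note on the hunk above: the `+` side of `vp8_loop_filter_partial_frame` fills a per-segment `baseline_filter_level[]` table once, then indexes it by segment id inside the macroblock loop. The following is a minimal standalone sketch of that table-building step, not the libvpx code itself. The helper names `clamp_filter_level` and `compute_baseline_levels` and the flat parameter list are hypothetical; `MAX_MB_SEGMENTS` is assumed to be 4 (the diff shows only the name), and `MAX_LOOP_FILTER` is 63 as defined in loopfilter.h.

```c
#define MAX_MB_SEGMENTS 4   /* assumed value; the diff shows only the name */
#define MAX_LOOP_FILTER 63  /* matches the #define in loopfilter.h */

/* Clamp a filter level to the valid range [0, MAX_LOOP_FILTER]. */
static int clamp_filter_level(int level)
{
    if (level < 0)
        return 0;
    if (level > MAX_LOOP_FILTER)
        return MAX_LOOP_FILTER;
    return level;
}

/* Hypothetical helper mirroring the three cases in the hunk:
 *   - segmentation off: every segment uses default_filt_lvl
 *   - absolute mode:    the per-segment value is used as-is (not clamped,
 *                       as in the diff)
 *   - delta mode:       default_filt_lvl + per-segment delta, clamped
 */
static void compute_baseline_levels(int out[MAX_MB_SEGMENTS],
                                    int default_filt_lvl,
                                    int segmentation_enabled,
                                    int abs_delta,
                                    const signed char seg_lf[MAX_MB_SEGMENTS])
{
    int i;

    for (i = 0; i < MAX_MB_SEGMENTS; i++)
    {
        if (!segmentation_enabled)
            out[i] = default_filt_lvl;
        else if (abs_delta)
            out[i] = seg_lf[i];
        else
            out[i] = clamp_filter_level(default_filt_lvl + seg_lf[i]);
    }
}
```

Filling all entries in the segmentation-off case is what lets the inner loop use a single lookup (`baseline_filter_level[Segment]` with `Segment` forced to 0) instead of branching on `alt_flt_enabled` for every macroblock.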
--- a/vp8/common/loopfilter.h
+++ b/vp8/common/loopfilter.h
@@ -13,7 +13,6 @@
 #define loopfilter_h

 #include "vpx_ports/mem.h"
-#include "vpx_config.h"

 #define MAX_LOOP_FILTER 63

@@ -23,45 +22,26 @@ typedef enum
    SIMPLE_LOOPFILTER = 1
 } LOOPFILTERTYPE;

-#if ARCH_ARM
-#define SIMD_WIDTH 1
-#else
-#define SIMD_WIDTH 16
-#endif
-
-/* Need to align this structure so when it is declared and
+/* FRK
+ * Need to align this structure so when it is declared and
 * passed it can be loaded into vector registers.
 */
 typedef struct
 {
-    DECLARE_ALIGNED(SIMD_WIDTH, unsigned char, mblim[MAX_LOOP_FILTER + 1][SIMD_WIDTH]);
-    DECLARE_ALIGNED(SIMD_WIDTH, unsigned char, blim[MAX_LOOP_FILTER + 1][SIMD_WIDTH]);
-    DECLARE_ALIGNED(SIMD_WIDTH, unsigned char, lim[MAX_LOOP_FILTER + 1][SIMD_WIDTH]);
-    DECLARE_ALIGNED(SIMD_WIDTH, unsigned char, hev_thr[4][SIMD_WIDTH]);
-    unsigned char lvl[4][4][4];
-    unsigned char hev_thr_lut[2][MAX_LOOP_FILTER + 1];
-    unsigned char mode_lf_lut[10];
-} loop_filter_info_n;
-
-typedef struct
-{
-    const unsigned char * mblim;
-    const unsigned char * blim;
-    const unsigned char * lim;
-    const unsigned char * hev_thr;
+    DECLARE_ALIGNED(16, signed char, lim[16]);
+    DECLARE_ALIGNED(16, signed char, flim[16]);
+    DECLARE_ALIGNED(16, signed char, thr[16]);
+    DECLARE_ALIGNED(16, signed char, mbflim[16]);
 } loop_filter_info;


 #define prototype_loopfilter(sym) \
-    void sym(unsigned char *src, int pitch, const unsigned char *blimit,\
-             const unsigned char *limit, const unsigned char *thresh, int count)
+    void sym(unsigned char *src, int pitch, const signed char *flimit,\
+             const signed char *limit, const signed char *thresh, int count)

 #define prototype_loopfilter_block(sym) \
-    void sym(unsigned char *y, unsigned char *u, unsigned char *v, \
-             int ystride, int uv_stride, loop_filter_info *lfi)
-
-#define prototype_simple_loopfilter(sym) \
-    void sym(unsigned char *y, int ystride, const unsigned char *blimit)
+    void sym(unsigned char *y, unsigned char *u, unsigned char *v,\
+             int ystride, int uv_stride, loop_filter_info *lfi, int simpler)

 #if ARCH_X86 || ARCH_X86_64
 #include "x86/loopfilter_x86.h"
@@ -91,39 +71,38 @@ extern prototype_loopfilter_block(vp8_lf_normal_mb_h);
 #endif
 extern prototype_loopfilter_block(vp8_lf_normal_b_h);

+
 #ifndef vp8_lf_simple_mb_v
-#define vp8_lf_simple_mb_v vp8_loop_filter_simple_vertical_edge_c
+#define vp8_lf_simple_mb_v vp8_loop_filter_mbvs_c
 #endif
-extern prototype_simple_loopfilter(vp8_lf_simple_mb_v);
+extern prototype_loopfilter_block(vp8_lf_simple_mb_v);

 #ifndef vp8_lf_simple_b_v
 #define vp8_lf_simple_b_v vp8_loop_filter_bvs_c
 #endif
-extern prototype_simple_loopfilter(vp8_lf_simple_b_v);
+extern prototype_loopfilter_block(vp8_lf_simple_b_v);

 #ifndef vp8_lf_simple_mb_h
-#define vp8_lf_simple_mb_h vp8_loop_filter_simple_horizontal_edge_c
+#define vp8_lf_simple_mb_h vp8_loop_filter_mbhs_c
 #endif
-extern prototype_simple_loopfilter(vp8_lf_simple_mb_h);
+extern prototype_loopfilter_block(vp8_lf_simple_mb_h);

 #ifndef vp8_lf_simple_b_h
 #define vp8_lf_simple_b_h vp8_loop_filter_bhs_c
 #endif
-extern prototype_simple_loopfilter(vp8_lf_simple_b_h);
+extern prototype_loopfilter_block(vp8_lf_simple_b_h);

 typedef prototype_loopfilter_block((*vp8_lf_block_fn_t));
-typedef prototype_simple_loopfilter((*vp8_slf_block_fn_t));
-
 typedef struct
 {
    vp8_lf_block_fn_t  normal_mb_v;
    vp8_lf_block_fn_t  normal_b_v;
    vp8_lf_block_fn_t  normal_mb_h;
    vp8_lf_block_fn_t  normal_b_h;
-    vp8_slf_block_fn_t  simple_mb_v;
-    vp8_slf_block_fn_t  simple_b_v;
-    vp8_slf_block_fn_t  simple_mb_h;
-    vp8_slf_block_fn_t  simple_b_h;
+    vp8_lf_block_fn_t  simple_mb_v;
+    vp8_lf_block_fn_t  simple_b_v;
+    vp8_lf_block_fn_t  simple_mb_h;
+    vp8_lf_block_fn_t  simple_b_h;
 } vp8_loopfilter_rtcd_vtable_t;

 #if CONFIG_RUNTIME_CPU_DETECT
@@ -136,33 +115,10 @@ typedef void loop_filter_uvfunction
 (
    unsigned char *u,   /* source pointer */
    int p,              /* pitch */
-    const unsigned char *blimit,
-    const unsigned char *limit,
-    const unsigned char *thresh,
+    const signed char *flimit,
+    const signed char *limit,
+    const signed char *thresh,
    unsigned char *v
 );

-/* assorted loopfilter functions which get used elsewhere */
-struct VP8Common;
-struct MacroBlockD;
-
-void vp8_loop_filter_init(struct VP8Common *cm);
-
-void vp8_loop_filter_frame_init(struct VP8Common *cm,
-                                struct MacroBlockD *mbd,
-                                int default_filt_lvl);
-
-void vp8_loop_filter_frame(struct VP8Common *cm, struct MacroBlockD *mbd);
-
-void vp8_loop_filter_partial_frame(struct VP8Common *cm,
-                                   struct MacroBlockD *mbd,
-                                   int default_filt_lvl);
-
-void vp8_loop_filter_frame_yonly(struct VP8Common *cm,
-                                 struct MacroBlockD *mbd,
-                                 int default_filt_lvl);
-
-void vp8_loop_filter_update_sharpness(loop_filter_info_n *lfi,
-                                      int sharpness_lvl);
-
 #endif
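
Note on the header change above: on the `+` side, `loop_filter_info` holds four 16-entry `signed char` arrays, and the C filters in loopfilter_filters.c read `limit[i]`, `flimit[i]` and `thresh[i]` with `i` running up to `count * 8`, so a 16-entry array covers `count` up to 2. A rough sketch of what one such entry could look like when every lane carries the same value is given below. The struct name `lf_info_sketch`, the helper `fill_lf_info`, and the idea of passing the four scalars in directly are assumptions for illustration; the diff does not show how `vp8_init_loop_filter` derives the values, and the alignment macro is omitted here.

```c
#include <string.h>

#define LF_LANES 16  /* array length used by the '+' side of the struct */

/* Plain-C stand-in for the '+' side loop_filter_info (no alignment macro). */
typedef struct
{
    signed char lim[LF_LANES];
    signed char flim[LF_LANES];
    signed char thr[LF_LANES];
    signed char mbflim[LF_LANES];
} lf_info_sketch;

/* Hypothetical helper: replicate one scalar per field across all lanes so
 * that indexing with any i in [0, LF_LANES) yields the same threshold.
 */
static void fill_lf_info(lf_info_sketch *lfi,
                         signed char lim, signed char flim,
                         signed char thr, signed char mbflim)
{
    memset(lfi->lim,    lim,    LF_LANES);
    memset(lfi->flim,   flim,   LF_LANES);
    memset(lfi->thr,    thr,    LF_LANES);
    memset(lfi->mbflim, mbflim, LF_LANES);
}
```

Replicating the value across the array trades memory for the ability to load a whole threshold vector in one aligned read, which is the motivation the struct's own comment gives ("so when it is declared and passed it can be loaded into vector registers").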
--- a/vp8/common/loopfilter_filters.c
+++ b/vp8/common/loopfilter_filters.c
@@ -24,9 +24,8 @@ static __inline signed char vp8_signed_char_clamp(int t)


 /* should we apply any filter at all ( 11111111 yes, 00000000 no) */
-static __inline signed char vp8_filter_mask(uc limit, uc blimit,
-                                     uc p3, uc p2, uc p1, uc p0,
-                                     uc q0, uc q1, uc q2, uc q3)
+static __inline signed char vp8_filter_mask(signed char limit, signed char flimit,
+                                     uc p3, uc p2, uc p1, uc p0, uc q0, uc q1, uc q2, uc q3)
 {
    signed char mask = 0;
    mask |= (abs(p3 - p2) > limit) * -1;
@@ -35,13 +34,13 @@ static __inline signed char vp8_filter_mask(uc limit, uc blimit,
    mask |= (abs(q1 - q0) > limit) * -1;
    mask |= (abs(q2 - q1) > limit) * -1;
    mask |= (abs(q3 - q2) > limit) * -1;
-    mask |= (abs(p0 - q0) * 2 + abs(p1 - q1) / 2  > blimit) * -1;
+    mask |= (abs(p0 - q0) * 2 + abs(p1 - q1) / 2  > flimit * 2 + limit) * -1;
    mask = ~mask;
    return mask;
 }

 /* is there high variance internal edge ( 11111111 yes, 00000000 no) */
-static __inline signed char vp8_hevmask(uc thresh, uc p1, uc p0, uc q0, uc q1)
+static __inline signed char vp8_hevmask(signed char thresh, uc p1, uc p0, uc q0, uc q1)
 {
    signed char hev = 0;
    hev  |= (abs(p1 - p0) > thresh) * -1;
@@ -49,8 +48,7 @@ static __inline signed char vp8_hevmask(uc thresh, uc p1, uc p0, uc q0, uc q1)
    return hev;
 }

-static __inline void vp8_filter(signed char mask, uc hev, uc *op1,
-        uc *op0, uc *oq0, uc *oq1)
+static __inline void vp8_filter(signed char mask, signed char hev, uc *op1, uc *op0, uc *oq0, uc *oq1)

 {
    signed char ps0, qs0;
@@ -100,9 +98,9 @@ void vp8_loop_filter_horizontal_edge_c
 (
    unsigned char *s,
    int p, /* pitch */
-    const unsigned char *blimit,
-    const unsigned char *limit,
-    const unsigned char *thresh,
+    const signed char *flimit,
+    const signed char *limit,
+    const signed char *thresh,
    int count
 )
 {
@@ -115,11 +113,11 @@ void vp8_loop_filter_horizontal_edge_c
     */
    do
    {
-        mask = vp8_filter_mask(limit[0], blimit[0],
+        mask = vp8_filter_mask(limit[i], flimit[i],
                               s[-4*p], s[-3*p], s[-2*p], s[-1*p],
                               s[0*p], s[1*p], s[2*p], s[3*p]);

-        hev = vp8_hevmask(thresh[0], s[-2*p], s[-1*p], s[0*p], s[1*p]);
+        hev = vp8_hevmask(thresh[i], s[-2*p], s[-1*p], s[0*p], s[1*p]);

        vp8_filter(mask, hev, s - 2 * p, s - 1 * p, s, s + 1 * p);

@@ -132,9 +130,9 @@ void vp8_loop_filter_vertical_edge_c
 (
    unsigned char *s,
    int p,
-    const unsigned char *blimit,
-    const unsigned char *limit,
-    const unsigned char *thresh,
+    const signed char *flimit,
+    const signed char *limit,
+    const signed char *thresh,
    int count
 )
 {
@@ -147,10 +145,10 @@ void vp8_loop_filter_vertical_edge_c
     */
    do
    {
-        mask = vp8_filter_mask(limit[0], blimit[0],
+        mask = vp8_filter_mask(limit[i], flimit[i],
                               s[-4], s[-3], s[-2], s[-1], s[0], s[1], s[2], s[3]);

-        hev = vp8_hevmask(thresh[0], s[-2], s[-1], s[0], s[1]);
+        hev = vp8_hevmask(thresh[i], s[-2], s[-1], s[0], s[1]);

        vp8_filter(mask, hev, s - 2, s - 1, s, s + 1);

@@ -159,7 +157,7 @@ void vp8_loop_filter_vertical_edge_c
    while (++i < count * 8);
 }

-static __inline void vp8_mbfilter(signed char mask, uc hev,
+static __inline void vp8_mbfilter(signed char mask, signed char hev,
                           uc *op2, uc *op1, uc *op0, uc *oq0, uc *oq1, uc *oq2)
 {
    signed char s, u;
@@ -218,9 +216,9 @@ void vp8_mbloop_filter_horizontal_edge_c
 (
    unsigned char *s,
    int p,
-    const unsigned char *blimit,
-    const unsigned char *limit,
-    const unsigned char *thresh,
+    const signed char *flimit,
+    const signed char *limit,
+    const signed char *thresh,
    int count
 )
 {
@@ -234,11 +232,11 @@ void vp8_mbloop_filter_horizontal_edge_c
    do
    {

-        mask = vp8_filter_mask(limit[0], blimit[0],
+        mask = vp8_filter_mask(limit[i], flimit[i],
                               s[-4*p], s[-3*p], s[-2*p], s[-1*p],
                               s[0*p], s[1*p], s[2*p], s[3*p]);

-        hev = vp8_hevmask(thresh[0], s[-2*p], s[-1*p], s[0*p], s[1*p]);
+        hev = vp8_hevmask(thresh[i], s[-2*p], s[-1*p], s[0*p], s[1*p]);

        vp8_mbfilter(mask, hev, s - 3 * p, s - 2 * p, s - 1 * p, s, s + 1 * p, s + 2 * p);

@@ -253,9 +251,9 @@ void vp8_mbloop_filter_vertical_edge_c
 (
    unsigned char *s,
    int p,
-    const unsigned char *blimit,
-    const unsigned char *limit,
-    const unsigned char *thresh,
+    const signed char *flimit,
+    const signed char *limit,
+    const signed char *thresh,
    int count
 )
 {
@@ -266,10 +264,10 @@ void vp8_mbloop_filter_vertical_edge_c
    do
    {

-        mask = vp8_filter_mask(limit[0], blimit[0],
+        mask = vp8_filter_mask(limit[i], flimit[i],
                               s[-4], s[-3], s[-2], s[-1], s[0], s[1], s[2], s[3]);

-        hev = vp8_hevmask(thresh[0], s[-2], s[-1], s[0], s[1]);
+        hev = vp8_hevmask(thresh[i], s[-2], s[-1], s[0], s[1]);

        vp8_mbfilter(mask, hev, s - 3, s - 2, s - 1, s, s + 1, s + 2);

@@ -280,13 +278,13 @@ void vp8_mbloop_filter_vertical_edge_c
 }

 /* should we apply any filter at all ( 11111111 yes, 00000000 no) */
-static __inline signed char vp8_simple_filter_mask(uc blimit, uc p1, uc p0, uc q0, uc q1)
+static __inline signed char vp8_simple_filter_mask(signed char limit, signed char flimit, uc p1, uc p0, uc q0, uc q1)
 {
 /* Why does this cause problems for win32?
 * error C2143: syntax error : missing ';' before 'type'
 *  (void) limit;
 */
-    signed char mask = (abs(p0 - q0) * 2 + abs(p1 - q1) / 2  <= blimit) * -1;
+    signed char mask = (abs(p0 - q0) * 2 + abs(p1 - q1) / 2  <= flimit * 2 + limit) * -1;
    return mask;
 }

@@ -319,37 +317,47 @@ void vp8_loop_filter_simple_horizontal_edge_c
 (
    unsigned char *s,
    int p,
-    const unsigned char *blimit
+    const signed char *flimit,
+    const signed char *limit,
+    const signed char *thresh,
+    int count
 )
 {
    signed char mask = 0;
    int i = 0;
+    (void) thresh;

    do
    {
-        mask = vp8_simple_filter_mask(blimit[0], s[-2*p], s[-1*p], s[0*p], s[1*p]);
+        /*mask = vp8_simple_filter_mask( limit[i], flimit[i],s[-1*p],s[0*p]);*/
+        mask = vp8_simple_filter_mask(limit[i], flimit[i], s[-2*p], s[-1*p], s[0*p], s[1*p]);
        vp8_simple_filter(mask, s - 2 * p, s - 1 * p, s, s + 1 * p);
        ++s;
    }
-    while (++i < 16);
+    while (++i < count * 8);
 }

 void vp8_loop_filter_simple_vertical_edge_c
 (
    unsigned char *s,
    int p,
-    const unsigned char *blimit
+    const signed char *flimit,
+    const signed char *limit,
+    const signed char *thresh,
+    int count
 )
 {
    signed char mask = 0;
    int i = 0;
+    (void) thresh;

    do
    {
-        mask = vp8_simple_filter_mask(blimit[0], s[-2], s[-1], s[0], s[1]);
+        /*mask = vp8_simple_filter_mask( limit[i], flimit[i],s[-1],s[0]);*/
+        mask = vp8_simple_filter_mask(limit[i], flimit[i], s[-2], s[-1], s[0], s[1]);
        vp8_simple_filter(mask, s - 2, s - 1, s, s + 1);
        s += p;
    }
-    while (++i < 16);
+    while (++i < count * 8);

 }
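
Note on the mask changes above: the `-` side compares the edge activity against a single `blimit`, while the `+` side compares it against `flimit * 2 + limit`. The sketch below places the two forms of the simple filter mask side by side. The function names `simple_mask_split` and `simple_mask_combined` are hypothetical, and the statement that the two agree holds only under the assumption that the caller passes `blimit` equal to `flimit * 2 + limit`; the diff does not show how either side's thresholds are initialised.

```c
#include <stdlib.h>  /* abs */

typedef unsigned char uc;

/* '+' side form: threshold is built from two values at filter time.
 * Returns -1 (all bits set) when the edge should be filtered, else 0.
 */
static signed char simple_mask_split(signed char limit, signed char flimit,
                                     uc p1, uc p0, uc q0, uc q1)
{
    return (signed char)((abs(p0 - q0) * 2 + abs(p1 - q1) / 2
                          <= flimit * 2 + limit) * -1);
}

/* '-' side form: threshold arrives already combined as blimit. */
static signed char simple_mask_combined(uc blimit,
                                        uc p1, uc p0, uc q0, uc q1)
{
    return (signed char)((abs(p0 - q0) * 2 + abs(p1 - q1) / 2
                          <= blimit) * -1);
}
```

For example, with `limit = 10` and `flimit = 5` the split form uses a threshold of 20; pixels `p1 = p0 = 100`, `q0 = q1 = 105` give an activity of `5 * 2 + 5 / 2 = 12`, so the edge is filtered, while `q0 = q1 = 120` gives `40 + 10 = 50` and the edge is left alone.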
--- a/vp8/common/mv.h
+++ b/vp8/common/mv.h
@@ -11,7 +11,6 @@

 #ifndef __INC_MV_H
 #define __INC_MV_H
-#include "vpx/vpx_integer.h"

 typedef struct
 {
@@ -19,10 +18,4 @@ typedef struct
    short col;
 } MV;

-typedef union
-{
-    uint32_t  as_int;
-    MV        as_mv;
-} int_mv;        /* facilitates faster equality tests and copies */
-
 #endif
--- a/vp8/common/onyx.h
+++ b/vp8/common/onyx.h
@@ -109,7 +109,6 @@ extern "C"
        int noise_sensitivity;   // parameter used for applying pre processing blur: recommendation 0
        int Sharpness;          // parameter used for sharpening output: recommendation 0:
        int cpu_used;
-        unsigned int rc_max_intra_bitrate_pct;

        // mode ->
        //(0)=Realtime/Live Encoding. This mode is optimized for realtim encoding (for example, capturing
@@ -140,9 +139,8 @@ extern "C"

        int end_usage; // vbr or cbr

-        // buffer targeting aggressiveness
+        // shoot to keep buffer full at all times by undershooting a bit 95 recommended
        int under_shoot_pct;
-        int over_shoot_pct;

        // buffering parameters
        int starting_buffer_level;  // in seconds
@@ -184,11 +182,8 @@ extern "C"
        int token_partitions; // how many token partitions to create for multi core decoding
        int encode_breakout;  // early breakout encode threshold : for video conf recommend 800

-        unsigned int error_resilient_mode; // Bitfield defining the error
-                                   // resiliency features to enable. Can provide
-                                   // decodable frames after losses in previous
-                                   // frames and decodable partitions after
-                                   // losses in the same frame.
+        int error_resilient_mode;  // if running over udp networks provides decodable frames after a
+        // dropped packet

        int arnr_max_frames;
        int arnr_strength ;
@@ -211,8 +206,8 @@ extern "C"

 // receive a frames worth of data caller can assume that a copy of this frame is made
 // and not just a copy of the pointer..
-    int vp8_receive_raw_frame(VP8_PTR comp, unsigned int frame_flags, YV12_BUFFER_CONFIG *sd, int64_t time_stamp, int64_t end_time_stamp);
-    int vp8_get_compressed_data(VP8_PTR comp, unsigned int *frame_flags, unsigned long *size, unsigned char *dest, int64_t *time_stamp, int64_t *time_end, int flush);
+    int vp8_receive_raw_frame(VP8_PTR comp, unsigned int frame_flags, YV12_BUFFER_CONFIG *sd, INT64 time_stamp, INT64 end_time_stamp);
+    int vp8_get_compressed_data(VP8_PTR comp, unsigned int *frame_flags, unsigned long *size, unsigned char *dest, INT64 *time_stamp, INT64 *time_end, int flush);
    int vp8_get_preview_raw_frame(VP8_PTR comp, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t *flags);

    int vp8_use_as_reference(VP8_PTR comp, int ref_frame_flags);
--- a/vp8/common/onyxc_int.h
+++ b/vp8/common/onyxc_int.h
@@ -19,9 +19,7 @@
 #include "entropy.h"
 #include "idct.h"
 #include "recon.h"
-#if CONFIG_POSTPROC
 #include "postproc.h"
-#endif

 /*#ifdef PACKET_TESTING*/
 #include "header.h"
@@ -37,15 +35,13 @@ void vp8_initialize_common(void);

 #define NUM_YV12_BUFFERS 4

-#define MAX_PARTITIONS 9
-
 typedef struct frame_contexts
 {
    vp8_prob bmode_prob [VP8_BINTRAMODES-1];
    vp8_prob ymode_prob [VP8_YMODES-1];   /* interframe intra mode probs */
    vp8_prob uv_mode_prob [VP8_UV_MODES-1];
    vp8_prob sub_mv_ref_prob [VP8_SUBMVREFS-1];
-    vp8_prob coef_probs [BLOCK_TYPES] [COEF_BANDS] [PREV_COEF_CONTEXTS] [ENTROPY_NODES];
+    vp8_prob coef_probs [BLOCK_TYPES] [COEF_BANDS] [PREV_COEF_CONTEXTS] [vp8_coef_tokens-1];
    MV_CONTEXT mvc[2];
    MV_CONTEXT pre_mvc[2];  /* not to caculate the mvcost for the frame if mvc doesn't change. */
 } FRAME_CONTEXT;
@@ -77,9 +73,7 @@ typedef struct VP8_COMMON_RTCD
    vp8_recon_rtcd_vtable_t       recon;
    vp8_subpix_rtcd_vtable_t      subpix;
    vp8_loopfilter_rtcd_vtable_t  loopfilter;
-#if CONFIG_POSTPROC
    vp8_postproc_rtcd_vtable_t    postproc;
-#endif
    int                           flags;
 #else
    int unused;
@@ -87,7 +81,6 @@ typedef struct VP8_COMMON_RTCD
 } VP8_COMMON_RTCD;

 typedef struct VP8Common
-
 {
    struct vpx_internal_error_info  error;

@@ -112,8 +105,7 @@ typedef struct VP8Common
    YV12_BUFFER_CONFIG post_proc_buffer;
    YV12_BUFFER_CONFIG temp_scale_frame;

-
-    FRAME_TYPE last_frame_type;  /* Save last frame's frame type for motion search. */
+    FRAME_TYPE last_frame_type;  /* Save last frame's frame type for loopfilter init checking and motion search. */
    FRAME_TYPE frame_type;

    int show_frame;
@@ -127,6 +119,7 @@ typedef struct VP8Common
    /* profile settings */
    int mb_no_coeff_skip;
    int no_lpf;
+    int simpler_lpf;
    int use_bilinear_mc_filter;
    int full_pixel;

@@ -147,15 +140,16 @@ typedef struct VP8Common

    MODE_INFO *mip; /* Base of allocated array */
    MODE_INFO *mi;  /* Corresponds to upper left visible macroblock */
-    MODE_INFO *prev_mip; /* MODE_INFO array 'mip' from last decoded frame */
-    MODE_INFO *prev_mi;  /* 'mi' from last frame (points into prev_mip) */


    INTERPOLATIONFILTERTYPE mcomp_filter_type;
+    LOOPFILTERTYPE last_filter_type;
    LOOPFILTERTYPE filter_type;
-
-    loop_filter_info_n lf_info;
-
+    loop_filter_info lf_info[MAX_LOOP_FILTER+1];
+    prototype_loopfilter_block((*lf_mbv));
+    prototype_loopfilter_block((*lf_mbh));
+    prototype_loopfilter_block((*lf_bv));
+    prototype_loopfilter_block((*lf_bh));
    int filter_level;
    int last_sharpness_level;
    int sharpness_level;
@@ -202,12 +196,13 @@ typedef struct VP8Common
 #if CONFIG_RUNTIME_CPU_DETECT
    VP8_COMMON_RTCD rtcd;
 #endif
-#if CONFIG_MULTITHREAD
-    int processor_core_count;
-#endif
-#if CONFIG_POSTPROC
    struct postproc_state  postproc_state;
-#endif
 } VP8_COMMON;

+
+int vp8_adjust_mb_lf_value(MACROBLOCKD *mbd, int filter_level);
+void vp8_init_loop_filter(VP8_COMMON *cm);
+void vp8_frame_init_loop_filter(loop_filter_info *lfi, int frame_type);
+extern void vp8_loop_filter_frame(VP8_COMMON *cm,    MACROBLOCKD *mbd,  int filt_val);
+
 #endif
--- a/vp8/common/onyxd.h
+++ b/vp8/common/onyxd.h
@@ -18,12 +18,10 @@
 extern "C"
 {
 #endif
-#include "vpx/vpx_codec.h"
 #include "type_aliases.h"
 #include "vpx_scale/yv12config.h"
 #include "ppflags.h"
 #include "vpx_ports/mem.h"
-#include "vpx/vpx_codec.h"

    typedef void   *VP8D_PTR;
    typedef struct
@@ -33,8 +31,6 @@ extern "C"
        int     Version;
        int     postprocess;
        int     max_threads;
-        int     error_concealment;
-        int     input_partition;
    } VP8D_CONFIG;
    typedef enum
    {
@@ -54,11 +50,11 @@ extern "C"

    int vp8dx_get_setting(VP8D_PTR comp, VP8D_SETTING oxst);

-    int vp8dx_receive_compressed_data(VP8D_PTR comp, unsigned long size, const unsigned char *dest, int64_t time_stamp);
-    int vp8dx_get_raw_frame(VP8D_PTR comp, YV12_BUFFER_CONFIG *sd, int64_t *time_stamp, int64_t *time_end_stamp, vp8_ppflags_t *flags);
+    int vp8dx_receive_compressed_data(VP8D_PTR comp, unsigned long size, const unsigned char *dest, INT64 time_stamp);
+    int vp8dx_get_raw_frame(VP8D_PTR comp, YV12_BUFFER_CONFIG *sd, INT64 *time_stamp, INT64 *time_end_stamp, vp8_ppflags_t *flags);

-    vpx_codec_err_t vp8dx_get_reference(VP8D_PTR comp, VP8_REFFRAME ref_frame_flag, YV12_BUFFER_CONFIG *sd);
-    vpx_codec_err_t vp8dx_set_reference(VP8D_PTR comp, VP8_REFFRAME ref_frame_flag, YV12_BUFFER_CONFIG *sd);
+    int vp8dx_get_reference(VP8D_PTR comp, VP8_REFFRAME ref_frame_flag, YV12_BUFFER_CONFIG *sd);
+    int vp8dx_set_reference(VP8D_PTR comp, VP8_REFFRAME ref_frame_flag, YV12_BUFFER_CONFIG *sd);

    VP8D_PTR vp8dx_create_decompressor(VP8D_CONFIG *oxcf);

--- a/vp8/common/postproc.c
+++ b/vp8/common/postproc.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "vpx_scale/yv12config.h"
 #include "postproc.h"
 #include "vpx_scale/yv12extend.h"
@@ -804,14 +804,11 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
            for (j = 0; j < mb_cols; j++)
            {
                char zz[4];
-                int dc_diff = !(mi[mb_index].mbmi.mode != B_PRED &&
-                              mi[mb_index].mbmi.mode != SPLITMV &&
-                              mi[mb_index].mbmi.mb_skip_coeff);

                if (oci->frame_type == KEY_FRAME)
                    sprintf(zz, "a");
                else
-                    sprintf(zz, "%c", dc_diff + '0');
+                    sprintf(zz, "%c", mi[mb_index].mbmi.dc_diff + '0');

                vp8_blit_text(zz, y_ptr, post->y_stride);
                mb_index ++;
@@ -837,6 +834,7 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
        YV12_BUFFER_CONFIG *post = &oci->post_proc_buffer;
        int width  = post->y_width;
        int height = post->y_height;
+        int mb_cols = width  >> 4;
        unsigned char *y_buffer = oci->post_proc_buffer.y_buffer;
        int y_stride = oci->post_proc_buffer.y_stride;
        MODE_INFO *mi = oci->mi;
@@ -860,7 +858,7 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
                    {
                        case 0 :    /* mv_top_bottom */
                        {
-                            union b_mode_info *bmi = &mi->bmi[0];
+                            B_MODE_INFO *bmi = &mi->bmi[0];
                            MV *mv = &bmi->mv.as_mv;

                            x1 = x0 + 8 + (mv->col >> 3);
@@ -881,7 +879,7 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
                        }
                        case 1 :    /* mv_left_right */
                        {
-                            union b_mode_info *bmi = &mi->bmi[0];
+                            B_MODE_INFO *bmi = &mi->bmi[0];
                            MV *mv = &bmi->mv.as_mv;

                            x1 = x0 + 4 + (mv->col >> 3);
@@ -902,7 +900,7 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
                        }
                        case 2 :    /* mv_quarters   */
                        {
-                            union b_mode_info *bmi = &mi->bmi[0];
+                            B_MODE_INFO *bmi = &mi->bmi[0];
                            MV *mv = &bmi->mv.as_mv;

                            x1 = x0 + 4 + (mv->col >> 3);
@@ -938,7 +936,7 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
                        }
                        default :
                        {
-                            union b_mode_info *bmi = mi->bmi;
+                            B_MODE_INFO *bmi = mi->bmi;
                            int bx0, by0;

                            for (by0 = y0; by0 < (y0+16); by0 += 4)
@@ -1011,7 +1009,7 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
                {
                    int by, bx;
                    unsigned char *yl, *ul, *vl;
-                    union b_mode_info *bmi = mi->bmi;
+                    B_MODE_INFO *bmi = mi->bmi;

                    yl = y_ptr + x;
                    ul = u_ptr + (x>>1);
@@ -1024,9 +1022,9 @@ int vp8_post_proc_frame(VP8_COMMON *oci, YV12_BUFFER_CONFIG *dest, vp8_ppflags_t
                            if ((ppflags->display_b_modes_flag & (1<<mi->mbmi.mode))
                                || (ppflags->display_mb_modes_flag & B_PRED))
                            {
-                                Y = B_PREDICTION_MODE_colors[bmi->as_mode][0];
-                                U = B_PREDICTION_MODE_colors[bmi->as_mode][1];
-                                V = B_PREDICTION_MODE_colors[bmi->as_mode][2];
+                                Y = B_PREDICTION_MODE_colors[bmi->mode][0];
+                                U = B_PREDICTION_MODE_colors[bmi->mode][1];
+                                V = B_PREDICTION_MODE_colors[bmi->mode][2];

                                POSTPROC_INVOKE(RTCD_VTABLE(oci), blend_b)
                                    (yl+bx, ul+(bx>>1), vl+(bx>>1), Y, U, V, 0xc000, y_stride);
--- a/vp8/common/ppc/loopfilter_altivec.c
+++ b/vp8/common/ppc/loopfilter_altivec.c
@@ -53,8 +53,9 @@ loop_filter_function_s_ppc loop_filter_simple_vertical_edge_ppc;

 // Horizontal MB filtering
 void loop_filter_mbh_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                         int y_stride, int uv_stride, loop_filter_info *lfi)
+                         int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    mbloop_filter_horizontal_edge_y_ppc(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr);

    if (u_ptr)
@@ -62,8 +63,9 @@ void loop_filter_mbh_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned ch
 }

 void loop_filter_mbhs_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                          int y_stride, int uv_stride, loop_filter_info *lfi)
+                          int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    (void)u_ptr;
    (void)v_ptr;
    (void)uv_stride;
@@ -72,8 +74,9 @@ void loop_filter_mbhs_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned c

 // Vertical MB Filtering
 void loop_filter_mbv_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                         int y_stride, int uv_stride, loop_filter_info *lfi)
+                         int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    mbloop_filter_vertical_edge_y_ppc(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr);

    if (u_ptr)
@@ -81,8 +84,9 @@ void loop_filter_mbv_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned ch
 }

 void loop_filter_mbvs_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                          int y_stride, int uv_stride, loop_filter_info *lfi)
+                          int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    (void)u_ptr;
    (void)v_ptr;
    (void)uv_stride;
@@ -91,8 +95,9 @@ void loop_filter_mbvs_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned c

 // Horizontal B Filtering
 void loop_filter_bh_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                        int y_stride, int uv_stride, loop_filter_info *lfi)
+                        int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    // These should all be done at once with one call, instead of 3
    loop_filter_horizontal_edge_y_ppc(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr);
    loop_filter_horizontal_edge_y_ppc(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr);
@@ -103,8 +108,9 @@ void loop_filter_bh_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned cha
 }

 void loop_filter_bhs_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                         int y_stride, int uv_stride, loop_filter_info *lfi)
+                         int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    (void)u_ptr;
    (void)v_ptr;
    (void)uv_stride;
@@ -115,8 +121,9 @@ void loop_filter_bhs_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned ch

 // Vertical B Filtering
 void loop_filter_bv_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                        int y_stride, int uv_stride, loop_filter_info *lfi)
+                        int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    loop_filter_vertical_edge_y_ppc(y_ptr, y_stride, lfi->flim, lfi->lim, lfi->thr);

    if (u_ptr)
@@ -124,8 +131,9 @@ void loop_filter_bv_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned cha
 }

 void loop_filter_bvs_ppc(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                         int y_stride, int uv_stride, loop_filter_info *lfi)
+                         int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
+    (void)simpler_lpf;
    (void)u_ptr;
    (void)v_ptr;
    (void)uv_stride;
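
The hunks above add an `int simpler_lpf` argument to every PPC loop-filter wrapper and immediately discard it with `(void)simpler_lpf;`. The sketch below illustrates that pattern in isolation, assuming the argument was added so that all implementations keep one common function-pointer signature. Every name in it (`demo_filter_fn`, `demo_filter_c`, `demo_filter_simd`, `demo_apply`) is hypothetical and does not exist in libvpx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared signature: every implementation receives the flag,
 * whether or not it uses it. */
typedef void (*demo_filter_fn)(unsigned char *row, size_t n, int simpler_lpf);

/* An implementation that honours the flag. */
void demo_filter_c(unsigned char *row, size_t n, int simpler_lpf)
{
    size_t i;

    assert(row != NULL);

    for (i = 0; i < n; i++)
        row[i] = (unsigned char)(simpler_lpf ? row[i] >> 1 : row[i] >> 2);
}

/* An implementation that ignores the flag. The cast to void documents
 * that this is intentional and silences unused-parameter warnings. */
void demo_filter_simd(unsigned char *row, size_t n, int simpler_lpf)
{
    size_t i;

    (void)simpler_lpf;
    assert(row != NULL);

    for (i = 0; i < n; i++)
        row[i] = (unsigned char)(row[i] >> 2);
}

/* The caller dispatches through the single pointer type. */
void demo_apply(demo_filter_fn fn, unsigned char *row, size_t n, int simpler_lpf)
{
    fn(row, n, simpler_lpf);
}
```

Because both functions match `demo_filter_fn`, a caller can swap one for the other without changing the call site, which is presumably why the PPC wrappers accept a parameter they never read.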
--- a/vp8/common/recon.c
+++ b/vp8/common/recon.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "recon.h"
 #include "blockd.h"

--- a/vp8/common/recon.h
+++ b/vp8/common/recon.h
@@ -26,9 +26,6 @@
 #define prototype_build_intra_predictors(sym) \
    void sym(MACROBLOCKD *x)

-#define prototype_intra4x4_predict(sym) \
-    void sym(BLOCKD *x, int b_mode, unsigned char *predictor)
-
 struct vp8_recon_rtcd_vtable;

 #if ARCH_X86 || ARCH_X86_64
@@ -91,30 +88,11 @@ extern prototype_build_intra_predictors\
 extern prototype_build_intra_predictors\
    (vp8_recon_build_intra_predictors_mby_s);

-#ifndef vp8_recon_build_intra_predictors_mbuv
-#define vp8_recon_build_intra_predictors_mbuv vp8_build_intra_predictors_mbuv
-#endif
-extern prototype_build_intra_predictors\
-    (vp8_recon_build_intra_predictors_mbuv);
-
-#ifndef vp8_recon_build_intra_predictors_mbuv_s
-#define vp8_recon_build_intra_predictors_mbuv_s vp8_build_intra_predictors_mbuv_s
-#endif
-extern prototype_build_intra_predictors\
-    (vp8_recon_build_intra_predictors_mbuv_s);
-
-#ifndef vp8_recon_intra4x4_predict
-#define vp8_recon_intra4x4_predict vp8_intra4x4_predict
-#endif
-extern prototype_intra4x4_predict\
-    (vp8_recon_intra4x4_predict);
-

 typedef prototype_copy_block((*vp8_copy_block_fn_t));
 typedef prototype_recon_block((*vp8_recon_fn_t));
 typedef prototype_recon_macroblock((*vp8_recon_mb_fn_t));
 typedef prototype_build_intra_predictors((*vp8_build_intra_pred_fn_t));
-typedef prototype_intra4x4_predict((*vp8_intra4x4_pred_fn_t));
 typedef struct vp8_recon_rtcd_vtable
 {
    vp8_copy_block_fn_t  copy16x16;
@@ -127,9 +105,6 @@ typedef struct vp8_recon_rtcd_vtable
    vp8_recon_mb_fn_t    recon_mby;
    vp8_build_intra_pred_fn_t  build_intra_predictors_mby_s;
    vp8_build_intra_pred_fn_t  build_intra_predictors_mby;
-    vp8_build_intra_pred_fn_t  build_intra_predictors_mbuv_s;
-    vp8_build_intra_pred_fn_t  build_intra_predictors_mbuv;
-    vp8_intra4x4_pred_fn_t intra4x4_predict;
 } vp8_recon_rtcd_vtable_t;

 #if CONFIG_RUNTIME_CPU_DETECT
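
The recon.h hunks above remove three entries from `vp8_recon_rtcd_vtable` along with their `#ifndef ... #define ... #endif` defaults. As a rough illustration of how such a table and its compile-time defaults fit together, here is a minimal sketch. It assumes an invoke macro that either reads the function pointer from the table or resolves directly to the default symbol. All `demo_`/`DEMO_` names are hypothetical, and the macro is only modelled on the `RECON_INVOKE` calls visible elsewhere in this diff, not copied from libvpx.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_scale_fn_t)(int x);

/* Plain C fallback implementation. */
int demo_scale_c(int x)
{
    return x * 2;
}

/* A platform header may define demo_rtcd_scale first; otherwise the
 * generic C version becomes the default. */
#ifndef demo_rtcd_scale
#define demo_rtcd_scale demo_scale_c
#endif

typedef struct demo_rtcd_vtable
{
    demo_scale_fn_t scale;
} demo_rtcd_vtable_t;

#ifndef DEMO_RUNTIME_CPU_DETECT
#define DEMO_RUNTIME_CPU_DETECT 1
#endif

#if DEMO_RUNTIME_CPU_DETECT
/* Runtime dispatch: call through the table entry. */
#define DEMO_INVOKE(ctx, fn) (ctx)->fn
#else
/* Static dispatch: paste the name and let the default macro resolve it. */
#define DEMO_INVOKE(ctx, fn) demo_rtcd_##fn
#endif

/* Fill the table with the compile-time defaults. */
void demo_rtcd_init(demo_rtcd_vtable_t *t)
{
    assert(t != NULL);
    t->scale = demo_rtcd_scale;
}
```

Under this scheme, dropping a table member (as the hunk does for `build_intra_predictors_mbuv`, `build_intra_predictors_mbuv_s` and `intra4x4_predict`) means callers can no longer reach that function through the invoke macro and have to call it directly.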
--- a/vp8/common/reconinter.c
+++ b/vp8/common/reconinter.c
@@ -9,8 +9,7 @@
 */


-#include "vpx_config.h"
-#include "vpx/vpx_integer.h"
+#include "vpx_ports/config.h"
 #include "recon.h"
 #include "subpixel.h"
 #include "blockd.h"
@@ -19,6 +18,16 @@
 #include "onyxc_int.h"
 #endif

+/* Use this define on systems where unaligned int reads and writes are
+ * not allowed, e.g. ARM architectures.
+ */
+/*#define MUST_BE_ALIGNED*/
+
+
+static const int bbb[4] = {0, 2, 8, 10};
+
+
+
 void vp8_copy_mem16x16_c(
    unsigned char *src,
    int src_stride,
@@ -30,7 +39,7 @@ void vp8_copy_mem16x16_c(

    for (r = 0; r < 16; r++)
    {
-#if !(CONFIG_FAST_UNALIGNED)
+#ifdef MUST_BE_ALIGNED
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
@@ -49,10 +58,10 @@ void vp8_copy_mem16x16_c(
        dst[15] = src[15];

 #else
-        ((uint32_t *)dst)[0] = ((uint32_t *)src)[0] ;
-        ((uint32_t *)dst)[1] = ((uint32_t *)src)[1] ;
-        ((uint32_t *)dst)[2] = ((uint32_t *)src)[2] ;
-        ((uint32_t *)dst)[3] = ((uint32_t *)src)[3] ;
+        ((int *)dst)[0] = ((int *)src)[0] ;
+        ((int *)dst)[1] = ((int *)src)[1] ;
+        ((int *)dst)[2] = ((int *)src)[2] ;
+        ((int *)dst)[3] = ((int *)src)[3] ;

 #endif
        src += src_stride;
@@ -72,7 +81,7 @@ void vp8_copy_mem8x8_c(

    for (r = 0; r < 8; r++)
    {
-#if !(CONFIG_FAST_UNALIGNED)
+#ifdef MUST_BE_ALIGNED
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
@@ -82,8 +91,8 @@ void vp8_copy_mem8x8_c(
        dst[6] = src[6];
        dst[7] = src[7];
 #else
-        ((uint32_t *)dst)[0] = ((uint32_t *)src)[0] ;
-        ((uint32_t *)dst)[1] = ((uint32_t *)src)[1] ;
+        ((int *)dst)[0] = ((int *)src)[0] ;
+        ((int *)dst)[1] = ((int *)src)[1] ;
 #endif
        src += src_stride;
        dst += dst_stride;
@@ -102,7 +111,7 @@ void vp8_copy_mem8x4_c(

    for (r = 0; r < 4; r++)
    {
-#if !(CONFIG_FAST_UNALIGNED)
+#ifdef MUST_BE_ALIGNED
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
@@ -112,8 +121,8 @@ void vp8_copy_mem8x4_c(
        dst[6] = src[6];
        dst[7] = src[7];
 #else
-        ((uint32_t *)dst)[0] = ((uint32_t *)src)[0] ;
-        ((uint32_t *)dst)[1] = ((uint32_t *)src)[1] ;
+        ((int *)dst)[0] = ((int *)src)[0] ;
+        ((int *)dst)[1] = ((int *)src)[1] ;
 #endif
        src += src_stride;
        dst += dst_stride;
@@ -145,13 +154,13 @@ void vp8_build_inter_predictors_b(BLOCKD *d, int pitch, vp8_subpix_fn_t sppf)

        for (r = 0; r < 4; r++)
        {
-#if !(CONFIG_FAST_UNALIGNED)
+#ifdef MUST_BE_ALIGNED
            pred_ptr[0]  = ptr[0];
            pred_ptr[1]  = ptr[1];
            pred_ptr[2]  = ptr[2];
            pred_ptr[3]  = ptr[3];
 #else
-            *(uint32_t *)pred_ptr = *(uint32_t *)ptr ;
+            *(int *)pred_ptr = *(int *)ptr ;
 #endif
            pred_ptr     += pitch;
            ptr         += d->pre_stride;
@@ -198,361 +207,487 @@ static void build_inter_predictors2b(MACROBLOCKD *x, BLOCKD *d, int pitch)
 }


-/*encoder only*/
-void vp8_build_inter16x16_predictors_mbuv(MACROBLOCKD *x)
-{
-    unsigned char *uptr, *vptr;
-    unsigned char *upred_ptr = &x->predictor[256];
-    unsigned char *vpred_ptr = &x->predictor[320];
-
-    int mv_row = x->mode_info_context->mbmi.mv.as_mv.row;
-    int mv_col = x->mode_info_context->mbmi.mv.as_mv.col;
-    int offset;
-    int pre_stride = x->block[16].pre_stride;
-
-    /* calc uv motion vectors */
-    if (mv_row < 0)
-        mv_row -= 1;
-    else
-        mv_row += 1;
-
-    if (mv_col < 0)
-        mv_col -= 1;
-    else
-        mv_col += 1;
-
-    mv_row /= 2;
-    mv_col /= 2;
-
-    mv_row &= x->fullpixel_mask;
-    mv_col &= x->fullpixel_mask;
-
-    offset = (mv_row >> 3) * pre_stride + (mv_col >> 3);
-    uptr = x->pre.u_buffer + offset;
-    vptr = x->pre.v_buffer + offset;
-
-    if ((mv_row | mv_col) & 7)
-    {
-        x->subpixel_predict8x8(uptr, pre_stride, mv_col & 7, mv_row & 7, upred_ptr, 8);
-        x->subpixel_predict8x8(vptr, pre_stride, mv_col & 7, mv_row & 7, vpred_ptr, 8);
-    }
-    else
-    {
-        RECON_INVOKE(&x->rtcd->recon, copy8x8)(uptr, pre_stride, upred_ptr, 8);
-        RECON_INVOKE(&x->rtcd->recon, copy8x8)(vptr, pre_stride, vpred_ptr, 8);
-    }
-}
-
-/*encoder only*/
-void vp8_build_inter4x4_predictors_mbuv(MACROBLOCKD *x)
-{
-    int i, j;
-
-    /* build uv mvs */
-    for (i = 0; i < 2; i++)
-    {
-        for (j = 0; j < 2; j++)
-        {
-            int yoffset = i * 8 + j * 2;
-            int uoffset = 16 + i * 2 + j;
-            int voffset = 20 + i * 2 + j;
-
-            int temp;
-
-            temp = x->block[yoffset  ].bmi.mv.as_mv.row
-                   + x->block[yoffset+1].bmi.mv.as_mv.row
-                   + x->block[yoffset+4].bmi.mv.as_mv.row
-                   + x->block[yoffset+5].bmi.mv.as_mv.row;
-
-            if (temp < 0) temp -= 4;
-            else temp += 4;
-
-            x->block[uoffset].bmi.mv.as_mv.row = (temp / 8) & x->fullpixel_mask;
-
-            temp = x->block[yoffset  ].bmi.mv.as_mv.col
-                   + x->block[yoffset+1].bmi.mv.as_mv.col
-                   + x->block[yoffset+4].bmi.mv.as_mv.col
-                   + x->block[yoffset+5].bmi.mv.as_mv.col;
-
-            if (temp < 0) temp -= 4;
-            else temp += 4;
-
-            x->block[uoffset].bmi.mv.as_mv.col = (temp / 8) & x->fullpixel_mask;
-
-            x->block[voffset].bmi.mv.as_mv.row =
-                x->block[uoffset].bmi.mv.as_mv.row ;
-            x->block[voffset].bmi.mv.as_mv.col =
-                x->block[uoffset].bmi.mv.as_mv.col ;
-        }
-    }
-
-    for (i = 16; i < 24; i += 2)
-    {
-        BLOCKD *d0 = &x->block[i];
-        BLOCKD *d1 = &x->block[i+1];
-
-        if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
-            build_inter_predictors2b(x, d0, 8);
-        else
-        {
-            vp8_build_inter_predictors_b(d0, 8, x->subpixel_predict);
-            vp8_build_inter_predictors_b(d1, 8, x->subpixel_predict);
-        }
-    }
-}
-
-
-/*encoder only*/
-void vp8_build_inter16x16_predictors_mby(MACROBLOCKD *x)
-{
-    unsigned char *ptr_base;
-    unsigned char *ptr;
-    unsigned char *pred_ptr = x->predictor;
-    int mv_row = x->mode_info_context->mbmi.mv.as_mv.row;
-    int mv_col = x->mode_info_context->mbmi.mv.as_mv.col;
-    int pre_stride = x->block[0].pre_stride;
-
-    ptr_base = x->pre.y_buffer;
-    ptr = ptr_base + (mv_row >> 3) * pre_stride + (mv_col >> 3);
-
-    if ((mv_row | mv_col) & 7)
-    {
-        x->subpixel_predict16x16(ptr, pre_stride, mv_col & 7, mv_row & 7, pred_ptr, 16);
-    }
-    else
-    {
-        RECON_INVOKE(&x->rtcd->recon, copy16x16)(ptr, pre_stride, pred_ptr, 16);
-    }
-}
-
-static void clamp_mv_to_umv_border(MV *mv, const MACROBLOCKD *xd)
-{
-    /* If the MV points so far into the UMV border that no visible pixels
-     * are used for reconstruction, the subpel part of the MV can be
-     * discarded and the MV limited to 16 pixels with equivalent results.
-     *
-     * This limit kicks in at 19 pixels for the top and left edges, for
-     * the 16 pixels plus 3 taps right of the central pixel when subpel
-     * filtering. The bottom and right edges use 16 pixels plus 2 pixels
-     * left of the central pixel when filtering.
-     */
-    if (mv->col < (xd->mb_to_left_edge - (19 << 3)))
-        mv->col = xd->mb_to_left_edge - (16 << 3);
-    else if (mv->col > xd->mb_to_right_edge + (18 << 3))
-        mv->col = xd->mb_to_right_edge + (16 << 3);
-
-    if (mv->row < (xd->mb_to_top_edge - (19 << 3)))
-        mv->row = xd->mb_to_top_edge - (16 << 3);
-    else if (mv->row > xd->mb_to_bottom_edge + (18 << 3))
-        mv->row = xd->mb_to_bottom_edge + (16 << 3);
-}
-
-/* A version of the above function for chroma block MVs.*/
-static void clamp_uvmv_to_umv_border(MV *mv, const MACROBLOCKD *xd)
-{
-    mv->col = (2*mv->col < (xd->mb_to_left_edge - (19 << 3))) ?
-        (xd->mb_to_left_edge - (16 << 3)) >> 1 : mv->col;
-    mv->col = (2*mv->col > xd->mb_to_right_edge + (18 << 3)) ?
-        (xd->mb_to_right_edge + (16 << 3)) >> 1 : mv->col;
-
-    mv->row = (2*mv->row < (xd->mb_to_top_edge - (19 << 3))) ?
-        (xd->mb_to_top_edge - (16 << 3)) >> 1 : mv->row;
-    mv->row = (2*mv->row > xd->mb_to_bottom_edge + (18 << 3)) ?
-        (xd->mb_to_bottom_edge + (16 << 3)) >> 1 : mv->row;
-}
-
-void vp8_build_inter16x16_predictors_mb(MACROBLOCKD *x,
-                                        unsigned char *dst_y,
-                                        unsigned char *dst_u,
-                                        unsigned char *dst_v,
-                                        int dst_ystride,
-                                        int dst_uvstride)
-{
-    int offset;
-    unsigned char *ptr;
-    unsigned char *uptr, *vptr;
-
-    int_mv _16x16mv;
-
-    unsigned char *ptr_base = x->pre.y_buffer;
-    int pre_stride = x->block[0].pre_stride;
-
-    _16x16mv.as_int = x->mode_info_context->mbmi.mv.as_int;
-
-    if (x->mode_info_context->mbmi.need_to_clamp_mvs)
-    {
-        clamp_mv_to_umv_border(&_16x16mv.as_mv, x);
-    }
-
-    ptr = ptr_base + ( _16x16mv.as_mv.row >> 3) * pre_stride + (_16x16mv.as_mv.col >> 3);
-
-    if ( _16x16mv.as_int & 0x00070007)
-    {
-        x->subpixel_predict16x16(ptr, pre_stride, _16x16mv.as_mv.col & 7,  _16x16mv.as_mv.row & 7, dst_y, dst_ystride);
-    }
-    else
-    {
-        RECON_INVOKE(&x->rtcd->recon, copy16x16)(ptr, pre_stride, dst_y, dst_ystride);
-    }
-
-    /* calc uv motion vectors */
-    if ( _16x16mv.as_mv.row < 0)
-      _16x16mv.as_mv.row -= 1;
-    else
-      _16x16mv.as_mv.row += 1;
-
-    if (_16x16mv.as_mv.col < 0)
-        _16x16mv.as_mv.col -= 1;
-    else
-        _16x16mv.as_mv.col += 1;
-
-    _16x16mv.as_mv.row /= 2;
-    _16x16mv.as_mv.col /= 2;
-
-    _16x16mv.as_mv.row &= x->fullpixel_mask;
-    _16x16mv.as_mv.col &= x->fullpixel_mask;
-
-    pre_stride >>= 1;
-    offset = ( _16x16mv.as_mv.row >> 3) * pre_stride + (_16x16mv.as_mv.col >> 3);
-    uptr = x->pre.u_buffer + offset;
-    vptr = x->pre.v_buffer + offset;
-
-    if ( _16x16mv.as_int & 0x00070007)
-    {
-        x->subpixel_predict8x8(uptr, pre_stride, _16x16mv.as_mv.col & 7,  _16x16mv.as_mv.row & 7, dst_u, dst_uvstride);
-        x->subpixel_predict8x8(vptr, pre_stride, _16x16mv.as_mv.col & 7,  _16x16mv.as_mv.row & 7, dst_v, dst_uvstride);
-    }
-    else
-    {
-        RECON_INVOKE(&x->rtcd->recon, copy8x8)(uptr, pre_stride, dst_u, dst_uvstride);
-        RECON_INVOKE(&x->rtcd->recon, copy8x8)(vptr, pre_stride, dst_v, dst_uvstride);
-    }
-}
-
-static void build_inter4x4_predictors_mb(MACROBLOCKD *x)
+void vp8_build_inter_predictors_mbuv(MACROBLOCKD *x)
 {
    int i;

-    if (x->mode_info_context->mbmi.partitioning < 3)
+    if (x->mode_info_context->mbmi.ref_frame != INTRA_FRAME &&
+        x->mode_info_context->mbmi.mode != SPLITMV)
    {
-        x->block[ 0].bmi = x->mode_info_context->bmi[ 0];
-        x->block[ 2].bmi = x->mode_info_context->bmi[ 2];
-        x->block[ 8].bmi = x->mode_info_context->bmi[ 8];
-        x->block[10].bmi = x->mode_info_context->bmi[10];
-        if (x->mode_info_context->mbmi.need_to_clamp_mvs)
-        {
-            clamp_mv_to_umv_border(&x->block[ 0].bmi.mv.as_mv, x);
-            clamp_mv_to_umv_border(&x->block[ 2].bmi.mv.as_mv, x);
-            clamp_mv_to_umv_border(&x->block[ 8].bmi.mv.as_mv, x);
-            clamp_mv_to_umv_border(&x->block[10].bmi.mv.as_mv, x);
-        }
+        unsigned char *uptr, *vptr;
+        unsigned char *upred_ptr = &x->predictor[256];
+        unsigned char *vpred_ptr = &x->predictor[320];

-        build_inter_predictors4b(x, &x->block[ 0], 16);
-        build_inter_predictors4b(x, &x->block[ 2], 16);
-        build_inter_predictors4b(x, &x->block[ 8], 16);
-        build_inter_predictors4b(x, &x->block[10], 16);
+        int mv_row = x->block[16].bmi.mv.as_mv.row;
+        int mv_col = x->block[16].bmi.mv.as_mv.col;
+        int offset;
+        int pre_stride = x->block[16].pre_stride;
+
+        offset = (mv_row >> 3) * pre_stride + (mv_col >> 3);
+        uptr = x->pre.u_buffer + offset;
+        vptr = x->pre.v_buffer + offset;
+
+        if ((mv_row | mv_col) & 7)
+        {
+            x->subpixel_predict8x8(uptr, pre_stride, mv_col & 7, mv_row & 7, upred_ptr, 8);
+            x->subpixel_predict8x8(vptr, pre_stride, mv_col & 7, mv_row & 7, vpred_ptr, 8);
+        }
+        else
+        {
+            RECON_INVOKE(&x->rtcd->recon, copy8x8)(uptr, pre_stride, upred_ptr, 8);
+            RECON_INVOKE(&x->rtcd->recon, copy8x8)(vptr, pre_stride, vpred_ptr, 8);
+        }
    }
    else
    {
-        for (i = 0; i < 16; i += 2)
+        for (i = 16; i < 24; i += 2)
        {
            BLOCKD *d0 = &x->block[i];
            BLOCKD *d1 = &x->block[i+1];

-            x->block[i+0].bmi = x->mode_info_context->bmi[i+0];
-            x->block[i+1].bmi = x->mode_info_context->bmi[i+1];
-            if (x->mode_info_context->mbmi.need_to_clamp_mvs)
-            {
-                clamp_mv_to_umv_border(&x->block[i+0].bmi.mv.as_mv, x);
-                clamp_mv_to_umv_border(&x->block[i+1].bmi.mv.as_mv, x);
-            }
-
            if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
-                build_inter_predictors2b(x, d0, 16);
+                build_inter_predictors2b(x, d0, 8);
            else
            {
-                vp8_build_inter_predictors_b(d0, 16, x->subpixel_predict);
-                vp8_build_inter_predictors_b(d1, 16, x->subpixel_predict);
+                vp8_build_inter_predictors_b(d0, 8, x->subpixel_predict);
+                vp8_build_inter_predictors_b(d1, 8, x->subpixel_predict);
            }
-
-        }
-
-    }
-
-    for (i = 16; i < 24; i += 2)
-    {
-        BLOCKD *d0 = &x->block[i];
-        BLOCKD *d1 = &x->block[i+1];
-
-        /* Note: uv mvs already clamped in build_4x4uvmvs() */
-
-        if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
-            build_inter_predictors2b(x, d0, 8);
-        else
-        {
-            vp8_build_inter_predictors_b(d0, 8, x->subpixel_predict);
-            vp8_build_inter_predictors_b(d1, 8, x->subpixel_predict);
        }
    }
 }

-static
-void build_4x4uvmvs(MACROBLOCKD *x)
+/*encoder only*/
+void vp8_build_inter_predictors_mby(MACROBLOCKD *x)
 {
-    int i, j;

-    for (i = 0; i < 2; i++)
+  if (x->mode_info_context->mbmi.ref_frame != INTRA_FRAME &&
+      x->mode_info_context->mbmi.mode != SPLITMV)
    {
-        for (j = 0; j < 2; j++)
+        unsigned char *ptr_base;
+        unsigned char *ptr;
+        unsigned char *pred_ptr = x->predictor;
+        int mv_row = x->mode_info_context->mbmi.mv.as_mv.row;
+        int mv_col = x->mode_info_context->mbmi.mv.as_mv.col;
+        int pre_stride = x->block[0].pre_stride;
+
+        ptr_base = x->pre.y_buffer;
+        ptr = ptr_base + (mv_row >> 3) * pre_stride + (mv_col >> 3);
+
+        if ((mv_row | mv_col) & 7)
        {
-            int yoffset = i * 8 + j * 2;
-            int uoffset = 16 + i * 2 + j;
-            int voffset = 20 + i * 2 + j;
+            x->subpixel_predict16x16(ptr, pre_stride, mv_col & 7, mv_row & 7, pred_ptr, 16);
+        }
+        else
+        {
+            RECON_INVOKE(&x->rtcd->recon, copy16x16)(ptr, pre_stride, pred_ptr, 16);
+        }
+    }
+    else
+    {
+        int i;

-            int temp;
+        if (x->mode_info_context->mbmi.partitioning < 3)
+        {
+            for (i = 0; i < 4; i++)
+            {
+                BLOCKD *d = &x->block[bbb[i]];
+                build_inter_predictors4b(x, d, 16);
+            }

-            temp = x->mode_info_context->bmi[yoffset + 0].mv.as_mv.row
-                 + x->mode_info_context->bmi[yoffset + 1].mv.as_mv.row
-                 + x->mode_info_context->bmi[yoffset + 4].mv.as_mv.row
-                 + x->mode_info_context->bmi[yoffset + 5].mv.as_mv.row;
+        }
+        else
+        {
+            for (i = 0; i < 16; i += 2)
+            {
+                BLOCKD *d0 = &x->block[i];
+                BLOCKD *d1 = &x->block[i+1];

-            if (temp < 0) temp -= 4;
-            else temp += 4;
+                if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
+                    build_inter_predictors2b(x, d0, 16);
+                else
+                {
+                    vp8_build_inter_predictors_b(d0, 16, x->subpixel_predict);
+                    vp8_build_inter_predictors_b(d1, 16, x->subpixel_predict);
+                }

-            x->block[uoffset].bmi.mv.as_mv.row = (temp / 8) & x->fullpixel_mask;
-
-            temp = x->mode_info_context->bmi[yoffset + 0].mv.as_mv.col
-                 + x->mode_info_context->bmi[yoffset + 1].mv.as_mv.col
-                 + x->mode_info_context->bmi[yoffset + 4].mv.as_mv.col
-                 + x->mode_info_context->bmi[yoffset + 5].mv.as_mv.col;
-
-            if (temp < 0) temp -= 4;
-            else temp += 4;
-
-            x->block[uoffset].bmi.mv.as_mv.col = (temp / 8) & x->fullpixel_mask;
-
-            if (x->mode_info_context->mbmi.need_to_clamp_mvs)
-                clamp_uvmv_to_umv_border(&x->block[uoffset].bmi.mv.as_mv, x);
-
-            x->block[voffset].bmi.mv.as_mv.row =
-                x->block[uoffset].bmi.mv.as_mv.row ;
-            x->block[voffset].bmi.mv.as_mv.col =
-                x->block[uoffset].bmi.mv.as_mv.col ;
+            }
        }
    }
 }

 void vp8_build_inter_predictors_mb(MACROBLOCKD *x)
 {
-    if (x->mode_info_context->mbmi.mode != SPLITMV)
+
+    if (x->mode_info_context->mbmi.ref_frame != INTRA_FRAME &&
+        x->mode_info_context->mbmi.mode != SPLITMV)
    {
-        vp8_build_inter16x16_predictors_mb(x, x->predictor, &x->predictor[256],
-                                           &x->predictor[320], 16, 8);
+        int offset;
+        unsigned char *ptr_base;
+        unsigned char *ptr;
+        unsigned char *uptr, *vptr;
+        unsigned char *pred_ptr = x->predictor;
+        unsigned char *upred_ptr = &x->predictor[256];
+        unsigned char *vpred_ptr = &x->predictor[320];
+
+        int mv_row = x->mode_info_context->mbmi.mv.as_mv.row;
+        int mv_col = x->mode_info_context->mbmi.mv.as_mv.col;
+        int pre_stride = x->block[0].pre_stride;
+
+        ptr_base = x->pre.y_buffer;
+        ptr = ptr_base + (mv_row >> 3) * pre_stride + (mv_col >> 3);
+
+        if ((mv_row | mv_col) & 7)
+        {
+            x->subpixel_predict16x16(ptr, pre_stride, mv_col & 7, mv_row & 7, pred_ptr, 16);
+        }
+        else
+        {
+            RECON_INVOKE(&x->rtcd->recon, copy16x16)(ptr, pre_stride, pred_ptr, 16);
+        }
+
+        mv_row = x->block[16].bmi.mv.as_mv.row;
+        mv_col = x->block[16].bmi.mv.as_mv.col;
+        pre_stride >>= 1;
+        offset = (mv_row >> 3) * pre_stride + (mv_col >> 3);
+        uptr = x->pre.u_buffer + offset;
+        vptr = x->pre.v_buffer + offset;
+
+        if ((mv_row | mv_col) & 7)
+        {
+            x->subpixel_predict8x8(uptr, pre_stride, mv_col & 7, mv_row & 7, upred_ptr, 8);
+            x->subpixel_predict8x8(vptr, pre_stride, mv_col & 7, mv_row & 7, vpred_ptr, 8);
+        }
+        else
+        {
+            RECON_INVOKE(&x->rtcd->recon, copy8x8)(uptr, pre_stride, upred_ptr, 8);
+            RECON_INVOKE(&x->rtcd->recon, copy8x8)(vptr, pre_stride, vpred_ptr, 8);
+        }
    }
    else
    {
-        build_4x4uvmvs(x);
-        build_inter4x4_predictors_mb(x);
+        int i;
+
+        if (x->mode_info_context->mbmi.partitioning < 3)
+        {
+            for (i = 0; i < 4; i++)
+            {
+                BLOCKD *d = &x->block[bbb[i]];
+                build_inter_predictors4b(x, d, 16);
+            }
+        }
+        else
+        {
+            for (i = 0; i < 16; i += 2)
+            {
+                BLOCKD *d0 = &x->block[i];
+                BLOCKD *d1 = &x->block[i+1];
+
+                if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
+                    build_inter_predictors2b(x, d0, 16);
+                else
+                {
+                    vp8_build_inter_predictors_b(d0, 16, x->subpixel_predict);
+                    vp8_build_inter_predictors_b(d1, 16, x->subpixel_predict);
+                }
+
+            }
+
+        }
+
+        for (i = 16; i < 24; i += 2)
+        {
+            BLOCKD *d0 = &x->block[i];
+            BLOCKD *d1 = &x->block[i+1];
+
+            if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
+                build_inter_predictors2b(x, d0, 8);
+            else
+            {
+                vp8_build_inter_predictors_b(d0, 8, x->subpixel_predict);
+                vp8_build_inter_predictors_b(d1, 8, x->subpixel_predict);
+            }
+
+        }
+
    }
 }

+void vp8_build_uvmvs(MACROBLOCKD *x, int fullpixel)
+{
+    int i, j;
+
+    if (x->mode_info_context->mbmi.mode == SPLITMV)
+    {
+        for (i = 0; i < 2; i++)
+        {
+            for (j = 0; j < 2; j++)
+            {
+                int yoffset = i * 8 + j * 2;
+                int uoffset = 16 + i * 2 + j;
+                int voffset = 20 + i * 2 + j;
+
+                int temp;
+
+                temp = x->block[yoffset  ].bmi.mv.as_mv.row
+                       + x->block[yoffset+1].bmi.mv.as_mv.row
+                       + x->block[yoffset+4].bmi.mv.as_mv.row
+                       + x->block[yoffset+5].bmi.mv.as_mv.row;
+
+                if (temp < 0) temp -= 4;
+                else temp += 4;
+
+                x->block[uoffset].bmi.mv.as_mv.row = temp / 8;
+
+                if (fullpixel)
+                    x->block[uoffset].bmi.mv.as_mv.row = (temp / 8) & 0xfffffff8;
+
+                temp = x->block[yoffset  ].bmi.mv.as_mv.col
+                       + x->block[yoffset+1].bmi.mv.as_mv.col
+                       + x->block[yoffset+4].bmi.mv.as_mv.col
+                       + x->block[yoffset+5].bmi.mv.as_mv.col;
+
+                if (temp < 0) temp -= 4;
+                else temp += 4;
+
+                x->block[uoffset].bmi.mv.as_mv.col = temp / 8;
+
+                if (fullpixel)
+                    x->block[uoffset].bmi.mv.as_mv.col = (temp / 8) & 0xfffffff8;
+
+                x->block[voffset].bmi.mv.as_mv.row = x->block[uoffset].bmi.mv.as_mv.row ;
+                x->block[voffset].bmi.mv.as_mv.col = x->block[uoffset].bmi.mv.as_mv.col ;
+            }
+        }
+    }
+    else
+    {
+        int mvrow = x->mode_info_context->mbmi.mv.as_mv.row;
+        int mvcol = x->mode_info_context->mbmi.mv.as_mv.col;
+
+        if (mvrow < 0)
+            mvrow -= 1;
+        else
+            mvrow += 1;
+
+        if (mvcol < 0)
+            mvcol -= 1;
+        else
+            mvcol += 1;
+
+        mvrow /= 2;
+        mvcol /= 2;
+
+        for (i = 0; i < 8; i++)
+        {
+            x->block[ 16 + i].bmi.mv.as_mv.row = mvrow;
+            x->block[ 16 + i].bmi.mv.as_mv.col = mvcol;
+
+            if (fullpixel)
+            {
+                x->block[ 16 + i].bmi.mv.as_mv.row = mvrow & 0xfffffff8;
+                x->block[ 16 + i].bmi.mv.as_mv.col = mvcol & 0xfffffff8;
+            }
+        }
+    }
+}
+
+
+/* The following functions are written for skip_recon_mb() to call. Since there is no recon in this
+ * situation, we can write the result directly to dst buffer instead of writing it to predictor
+ * buffer and then copying it to dst buffer.
+ */
+static void vp8_build_inter_predictors_b_s(BLOCKD *d, unsigned char *dst_ptr, vp8_subpix_fn_t sppf)
+{
+    int r;
+    unsigned char *ptr_base;
+    unsigned char *ptr;
+    /*unsigned char *pred_ptr = d->predictor;*/
+    int dst_stride = d->dst_stride;
+    int pre_stride = d->pre_stride;
+
+    ptr_base = *(d->base_pre);
+
+    if (d->bmi.mv.as_mv.row & 7 || d->bmi.mv.as_mv.col & 7)
+    {
+        ptr = ptr_base + d->pre + (d->bmi.mv.as_mv.row >> 3) * d->pre_stride + (d->bmi.mv.as_mv.col >> 3);
+        sppf(ptr, pre_stride, d->bmi.mv.as_mv.col & 7, d->bmi.mv.as_mv.row & 7, dst_ptr, dst_stride);
+    }
+    else
+    {
+        ptr_base += d->pre + (d->bmi.mv.as_mv.row >> 3) * d->pre_stride + (d->bmi.mv.as_mv.col >> 3);
+        ptr = ptr_base;
+
+        for (r = 0; r < 4; r++)
+        {
+#ifdef MUST_BE_ALIGNED
+            dst_ptr[0]   = ptr[0];
+            dst_ptr[1]   = ptr[1];
+            dst_ptr[2]   = ptr[2];
+            dst_ptr[3]   = ptr[3];
+#else
+            *(int *)dst_ptr = *(int *)ptr ;
+#endif
+            dst_ptr      += dst_stride;
+            ptr         += pre_stride;
+        }
+    }
+}
+
+
+
+void vp8_build_inter_predictors_mb_s(MACROBLOCKD *x)
+{
+    /*unsigned char *pred_ptr = x->block[0].predictor;
+    unsigned char *dst_ptr = *(x->block[0].base_dst) + x->block[0].dst;*/
+    unsigned char *pred_ptr = x->predictor;
+    unsigned char *dst_ptr = x->dst.y_buffer;
+
+    if (x->mode_info_context->mbmi.mode != SPLITMV)
+    {
+        int offset;
+        unsigned char *ptr_base;
+        unsigned char *ptr;
+        unsigned char *uptr, *vptr;
+        /*unsigned char *pred_ptr = x->predictor;
+        unsigned char *upred_ptr = &x->predictor[256];
+        unsigned char *vpred_ptr = &x->predictor[320];*/
+        unsigned char *udst_ptr = x->dst.u_buffer;
+        unsigned char *vdst_ptr = x->dst.v_buffer;
+
+        int mv_row = x->mode_info_context->mbmi.mv.as_mv.row;
+        int mv_col = x->mode_info_context->mbmi.mv.as_mv.col;
+        int pre_stride = x->dst.y_stride; /*x->block[0].pre_stride;*/
+
+        ptr_base = x->pre.y_buffer;
+        ptr = ptr_base + (mv_row >> 3) * pre_stride + (mv_col >> 3);
+
+        if ((mv_row | mv_col) & 7)
+        {
+            x->subpixel_predict16x16(ptr, pre_stride, mv_col & 7, mv_row & 7, dst_ptr, x->dst.y_stride); /*x->block[0].dst_stride);*/
+        }
+        else
+        {
+            RECON_INVOKE(&x->rtcd->recon, copy16x16)(ptr, pre_stride, dst_ptr, x->dst.y_stride); /*x->block[0].dst_stride);*/
+        }
+
+        mv_row = x->block[16].bmi.mv.as_mv.row;
+        mv_col = x->block[16].bmi.mv.as_mv.col;
+        pre_stride >>= 1;
+        offset = (mv_row >> 3) * pre_stride + (mv_col >> 3);
+        uptr = x->pre.u_buffer + offset;
+        vptr = x->pre.v_buffer + offset;
+
+        if ((mv_row | mv_col) & 7)
+        {
+            x->subpixel_predict8x8(uptr, pre_stride, mv_col & 7, mv_row & 7, udst_ptr, x->dst.uv_stride);
+            x->subpixel_predict8x8(vptr, pre_stride, mv_col & 7, mv_row & 7, vdst_ptr, x->dst.uv_stride);
+        }
+        else
+        {
+            RECON_INVOKE(&x->rtcd->recon, copy8x8)(uptr, pre_stride, udst_ptr, x->dst.uv_stride);
+            RECON_INVOKE(&x->rtcd->recon, copy8x8)(vptr, pre_stride, vdst_ptr, x->dst.uv_stride);
+        }
+    }
+    else
+    {
+        /* Note: this whole else branch is never executed, so there is no way to test the correctness of
+         * my modification. If something turns out to be wrong, go back to what is in build_inter_predictors_mb.
+         */
+        int i;
+
+        if (x->mode_info_context->mbmi.partitioning < 3)
+        {
+            for (i = 0; i < 4; i++)
+            {
+                BLOCKD *d = &x->block[bbb[i]];
+                /*build_inter_predictors4b(x, d, 16);*/
+
+                {
+                    unsigned char *ptr_base;
+                    unsigned char *ptr;
+                    unsigned char *pred_ptr = d->predictor;
+
+                    ptr_base = *(d->base_pre);
+                    ptr = ptr_base + d->pre + (d->bmi.mv.as_mv.row >> 3) * d->pre_stride + (d->bmi.mv.as_mv.col >> 3);
+
+                    if (d->bmi.mv.as_mv.row & 7 || d->bmi.mv.as_mv.col & 7)
+                    {
+                        x->subpixel_predict8x8(ptr, d->pre_stride, d->bmi.mv.as_mv.col & 7, d->bmi.mv.as_mv.row & 7, dst_ptr, x->dst.y_stride); /*x->block[0].dst_stride);*/
+                    }
+                    else
+                    {
+                        RECON_INVOKE(&x->rtcd->recon, copy8x8)(ptr, d->pre_stride, dst_ptr, x->dst.y_stride); /*x->block[0].dst_stride);*/
+                    }
+                }
+            }
+        }
+        else
+        {
+            for (i = 0; i < 16; i += 2)
+            {
+                BLOCKD *d0 = &x->block[i];
+                BLOCKD *d1 = &x->block[i+1];
+
+                if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
+                {
+                    /*build_inter_predictors2b(x, d0, 16);*/
+                    unsigned char *ptr_base;
+                    unsigned char *ptr;
+                    unsigned char *pred_ptr = d0->predictor;
+
+                    ptr_base = *(d0->base_pre);
+                    ptr = ptr_base + d0->pre + (d0->bmi.mv.as_mv.row >> 3) * d0->pre_stride + (d0->bmi.mv.as_mv.col >> 3);
+
+                    if (d0->bmi.mv.as_mv.row & 7 || d0->bmi.mv.as_mv.col & 7)
+                    {
+                        x->subpixel_predict8x4(ptr, d0->pre_stride, d0->bmi.mv.as_mv.col & 7, d0->bmi.mv.as_mv.row & 7, dst_ptr, x->dst.y_stride);
+                    }
+                    else
+                    {
+                        RECON_INVOKE(&x->rtcd->recon, copy8x4)(ptr, d0->pre_stride, dst_ptr, x->dst.y_stride);
+                    }
+                }
+                else
+                {
+                    vp8_build_inter_predictors_b_s(d0, dst_ptr, x->subpixel_predict);
+                    vp8_build_inter_predictors_b_s(d1, dst_ptr, x->subpixel_predict);
+                }
+            }
+        }
+
+        for (i = 16; i < 24; i += 2)
+        {
+            BLOCKD *d0 = &x->block[i];
+            BLOCKD *d1 = &x->block[i+1];
+
+            if (d0->bmi.mv.as_int == d1->bmi.mv.as_int)
+            {
+                /*build_inter_predictors2b(x, d0, 8);*/
+                unsigned char *ptr_base;
+                unsigned char *ptr;
+                unsigned char *pred_ptr = d0->predictor;
+
+                ptr_base = *(d0->base_pre);
+                ptr = ptr_base + d0->pre + (d0->bmi.mv.as_mv.row >> 3) * d0->pre_stride + (d0->bmi.mv.as_mv.col >> 3);
+
+                if (d0->bmi.mv.as_mv.row & 7 || d0->bmi.mv.as_mv.col & 7)
+                {
+                    x->subpixel_predict8x4(ptr, d0->pre_stride,
+                        d0->bmi.mv.as_mv.col & 7,
+                        d0->bmi.mv.as_mv.row & 7,
+                        dst_ptr, x->dst.uv_stride);
+                }
+                else
+                {
+                    RECON_INVOKE(&x->rtcd->recon, copy8x4)(ptr,
+                        d0->pre_stride, dst_ptr, x->dst.uv_stride);
+                }
+            }
+            else
+            {
+                vp8_build_inter_predictors_b_s(d0, dst_ptr, x->subpixel_predict);
+                vp8_build_inter_predictors_b_s(d1, dst_ptr, x->subpixel_predict);
+            }
+        }
+    }
+}
--- a/vp8/common/reconinter.h
+++ b/vp8/common/reconinter.h
@@ -13,18 +13,11 @@
 #define __INC_RECONINTER_H

 extern void vp8_build_inter_predictors_mb(MACROBLOCKD *x);
-extern void vp8_build_inter16x16_predictors_mb(MACROBLOCKD *x,
-                                               unsigned char *dst_y,
-                                               unsigned char *dst_u,
-                                               unsigned char *dst_v,
-                                               int dst_ystride,
-                                               int dst_uvstride);
+extern void vp8_build_inter_predictors_mb_s(MACROBLOCKD *x);

-
-extern void vp8_build_inter16x16_predictors_mby(MACROBLOCKD *x);
+extern void vp8_build_inter_predictors_mby(MACROBLOCKD *x);
+extern void vp8_build_uvmvs(MACROBLOCKD *x, int fullpixel);
 extern void vp8_build_inter_predictors_b(BLOCKD *d, int pitch, vp8_subpix_fn_t sppf);
-
-extern void vp8_build_inter16x16_predictors_mbuv(MACROBLOCKD *x);
-extern void vp8_build_inter4x4_predictors_mbuv(MACROBLOCKD *x);
+extern void vp8_build_inter_predictors_mbuv(MACROBLOCKD *x);

 #endif
--- a/vp8/common/reconintra.c
+++ b/vp8/common/reconintra.c
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "recon.h"
 #include "reconintra.h"
 #include "vpx_mem/vpx_mem.h"
--- a/vp8/common/reconintra.h
+++ b/vp8/common/reconintra.h
@@ -14,4 +14,9 @@

 extern void init_intra_left_above_pixels(MACROBLOCKD *x);

+extern void vp8_build_intra_predictors_mbuv(MACROBLOCKD *x);
+extern void vp8_build_intra_predictors_mbuv_s(MACROBLOCKD *x);
+
+extern void vp8_predict_intra4x4(BLOCKD *x, int b_mode, unsigned char *Predictor);
+
 #endif
--- a/vp8/common/reconintra4x4.c
+++ b/vp8/common/reconintra4x4.c
@@ -9,12 +9,12 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "recon.h"
 #include "vpx_mem/vpx_mem.h"
 #include "reconintra.h"

-void vp8_intra4x4_predict(BLOCKD *x,
+void vp8_predict_intra4x4(BLOCKD *x,
                          int b_mode,
                          unsigned char *predictor)
 {
--- a/vp8/common/systemdependent.h
+++ b/vp8/common/systemdependent.h
@@ -9,7 +9,7 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #if ARCH_X86 || ARCH_X86_64
 void vpx_reset_mmx_state(void);
 #define vp8_clear_system_state() vpx_reset_mmx_state()
--- a/vp8/common/threading.h
+++ b/vp8/common/threading.h
@@ -12,6 +12,8 @@
 #ifndef _PTHREAD_EMULATION
 #define _PTHREAD_EMULATION

+#define VPXINFINITE 10000       /* 10 seconds, in milliseconds */
+
 #if CONFIG_OS_SUPPORT && CONFIG_MULTITHREAD

 /* Thread management macros */
@@ -26,7 +28,7 @@
 #define pthread_t HANDLE
 #define pthread_attr_t DWORD
 #define pthread_create(thhandle,attr,thfunc,tharg) (int)((*thhandle=(HANDLE)_beginthreadex(NULL,0,(unsigned int (__stdcall *)(void *))thfunc,tharg,0,NULL))==NULL)
-#define pthread_join(thread, result) ((WaitForSingleObject((thread),INFINITE)!=WAIT_OBJECT_0) || !CloseHandle(thread))
+#define pthread_join(thread, result) ((WaitForSingleObject((thread),VPXINFINITE)!=WAIT_OBJECT_0) || !CloseHandle(thread))
 #define pthread_detach(thread) if(thread!=NULL)CloseHandle(thread)
 #define thread_sleep(nms) Sleep(nms)
 #define pthread_cancel(thread) terminate_thread(thread,0)
@@ -59,9 +61,9 @@
 #ifdef _WIN32
 #define sem_t HANDLE
 #define pause(voidpara) __asm PAUSE
-#define sem_init(sem, sem_attr1, sem_init_value) (int)((*sem = CreateSemaphore(NULL,0,32768,NULL))==NULL)
-#define sem_wait(sem) (int)(WAIT_OBJECT_0 != WaitForSingleObject(*sem,INFINITE))
-#define sem_post(sem) ReleaseSemaphore(*sem,1,NULL)
+#define sem_init(sem, sem_attr1, sem_init_value) (int)((*sem = CreateEvent(NULL,FALSE,FALSE,NULL))==NULL)
+#define sem_wait(sem) (int)(WAIT_OBJECT_0 != WaitForSingleObject(*sem,VPXINFINITE))
+#define sem_post(sem) SetEvent(*sem)
 #define sem_destroy(sem) if(*sem)((int)(CloseHandle(*sem))==TRUE)
 #define thread_sleep(nms) Sleep(nms)

--- a/vp8/common/x86/boolcoder.cxx
+++ b/vp8/common/x86/boolcoder.cxx
@@ -0,0 +1,494 @@
+/*
+ *  Copyright (c) 2010 The WebM project authors. All Rights Reserved.
+ *
+ *  Use of this source code is governed by a BSD-style license
+ *  that can be found in the LICENSE file in the root of the source
+ *  tree. An additional intellectual property rights grant can be found
+ *  in the file PATENTS.  All contributing project authors may
+ *  be found in the AUTHORS file in the root of the source tree.
+ */
+
+
+
+/* Arithmetic bool coder with largish probability range.
+   Timothy S Murphy  6 August 2004 */
+
+#include <assert.h>
+#include <math.h>
+
+#include "bool_coder.h"
+
+#if tim_vp8
+    extern "C" {
+#       include "VP8cx/treewriter.h"
+    }
+#endif
+
+int_types::~int_types() {}
+
+void bool_coder_spec::check_prec() const {
+    assert( w  &&  (r==Up || w > 1)  &&  w < 24  &&  (ebias || w < 17));
+}
+
+bool bool_coder_spec::float_init( uint Ebits, uint Mbits) {
+    uint b = (ebits = Ebits) + (mbits = Mbits);
+    if( b) {
+        assert( ebits < 6  &&  w + mbits < 31);
+        assert( ebits + mbits  <  sizeof(Index) * 8);
+        ebias = (1 << ebits) + 1 + mbits;
+        mmask = (1 << mbits) - 1;
+        max_index = ( ( half_index = 1 << b ) << 1) - 1;
+    } else {
+        ebias = 0;
+        max_index = 255;
+        half_index = 128;
+    }
+    check_prec();
+    return b? 1:0;
+}
+
+void bool_coder_spec::cost_init()
+{
+    static cdouble c = -(1 << 20)/log( 2.);
+
+    FILE *f = fopen( "costs.txt", "w");
+    assert( f);
+
+    assert( sizeof(int) >= 4);  /* for C interface */
+    assert( max_index <= 255);   /* size of Ctbl */
+    uint i = 0;  do {
+        cdouble p = ( *this)( (Index) i);
+        Ctbl[i] = (uint32) ( log( p) * c);
+        fprintf(
+            f, "cost( %d -> %10.7f) = %10d = %12.5f bits\n",
+            i, p, Ctbl[i], (double) Ctbl[i] / (1<<20)
+        );
+    } while( ++i <= max_index);
+    fclose( f);
+}
+
+bool_coder_spec_explicit_table::bool_coder_spec_explicit_table(
+    cuint16 tbl[256], Rounding rr, uint prec
+)
+  : bool_coder_spec( prec, rr)
+{
+    check_prec();
+    uint i = 0;
+    if( tbl)
+        do { Ptbl[i] = tbl[i];}  while( ++i < 256);
+    else
+        do { Ptbl[i] = i << 8;}  while( ++i < 256);
+    cost_init();
+}
+
+
+bool_coder_spec_exponential_table::bool_coder_spec_exponential_table(
+    uint x, Rounding rr, uint prec
+)
+  : bool_coder_spec( prec, rr)
+{
+    assert( x > 1  &&  x <= 16);
+    check_prec();
+    Ptbl[128] = 32768u;
+    Ptbl[0] = (uint16) pow( 2., 16. - x);
+    --x;
+    int i=1;  do {
+        cdouble d = pow( .5, 1. + (1. - i/128.)*x) * 65536.;
+        uint16 v = (uint16) d;
+        if( v < i)
+            v = i;
+        Ptbl[256-i] = (uint16) ( 65536U - (Ptbl[i] = v));
+    } while( ++i < 128);
+    cost_init();
+}
+
+bool_coder_spec::bool_coder_spec( FILE *fp) {
+    fscanf( fp, "%d", &w);
+    int v;
+    fscanf( fp, "%d", &v);
+    assert( 0 <= v  &&  v <= 2);
+    r = (Rounding) v;
+    fscanf( fp, "%d", &ebits);
+    fscanf( fp, "%d", &mbits);
+    if( float_init( ebits, mbits))
+        return;
+    int i=0;  do {
+        uint v;
+        fscanf( fp, "%d", &v);
+        assert( 0 <=v  &&  v <= 65535U);
+        Ptbl[i] = v;
+    } while( ++i < 256);
+    cost_init();
+}
+
+void bool_coder_spec::dump( FILE *fp) const {
+    fprintf( fp, "%d %d %d %d\n", w, (int) r, ebits, mbits);
+    if( ebits  ||  mbits)
+        return;
+    int i=0;  do { fprintf( fp, "%d\n", Ptbl[i]);}  while( ++i < 256);
+}
+
+vp8bc_index_t bool_coder_spec::operator()( double p) const
+{
+    if( p <= 0.)
+        return 0;
+    if( p >= 1.)
+        return max_index;
+    if( ebias) {
+        if( p > .5)
+            return max_index - ( *this)( 1. - p);
+        int e;
+        uint m = (uint) ldexp( frexp( p, &e), mbits + 2);
+        uint x = 1 << (mbits + 1);
+        assert( x <= m  &&  m < x<<1);
+        if( (m = (m >> 1) + (m & 1)) >= x) {
+            m = x >> 1;
+            ++e;
+        }
+        int y = 1 << ebits;
+        if( (e += y) >= y)
+            return half_index - 1;
+        if( e < 0)
+            return 0;
+        return (Index) ( (e << mbits) + (m & mmask));
+    }
+
+    cuint16 v = (uint16) (p * 65536.);
+    int i = 128;
+    int j = 128;
+    uint16 w;
+    while( w = Ptbl[i], j >>= 1) {
+        if( w < v)
+            i += j;
+        else if( w == v)
+            return (uchar) i;
+        else
+            i -= j;
+    }
+    if( w > v) {
+        cuint16 x = Ptbl[i-1];
+        if( v <= x  ||  w - v > v - x)
+            --i;
+    } else if( w < v  &&  i < 255) {
+        cuint16 x = Ptbl[i+1];
+        if( x <= v  ||  x - v < v - w)
+            ++i;
+    }
+    return (Index) i;
+}
+
+double bool_coder_spec::operator()( Index i) const {
+    if( !ebias)
+        return Ptbl[i]/65536.;
+    if( i >= half_index)
+        return 1. - ( *this)( (Index) (max_index - i));
+    return ldexp( (double)mantissa( i), - (int) exponent( i));
+}
+
+
+
+void bool_writer::carry() {
+    uchar *p = B;
+    assert( p > Bstart);
+    while( *--p == 255) { assert( p > Bstart);  *p = 0;}
+    ++*p;
+}
+
+
+bool_writer::bool_writer( c_spec& s, uchar *Dest, size_t Len)
+  : bool_coder( s),
+    Bstart( Dest),
+    Bend( Len? Dest+Len : 0),
+    B( Dest)
+{
+    assert( Dest);
+    reset();
+}
+
+bool_writer::~bool_writer() { flush();}
+
+#if 1
+    extern "C" { int bc_v = 0;}
+#else
+#   define bc_v 0
+#endif
+
+
+void bool_writer::raw( bool value, uint32 s) {
+    uint32 L = Low;
+
+    assert( Range >= min_range  &&  Range <= spec.max_range());
+    assert( !is_toast  &&  s  &&  s < Range);
+
+    if( bc_v) printf(
+        "Writing a %d, B %x  Low %x  Range %x  s %x   blag %d ...\n",
+        value? 1:0, B-Bstart, Low, Range, s, bit_lag
+    );
+    if( value) {
+        L += s;
+        s = Range - s;
+    } else
+        s -= rinc;
+    if( s < min_range) {
+        int ct = bit_lag;  do {
+            if( !--ct) {
+                ct = 8;
+                if( L & (1 << 31))
+                    carry();
+                assert( !Bend  ||  B < Bend);
+                *B++ = (uchar) (L >> 23);
+                L &= (1<<23) - 1;
+            }
+        } while( L += L, (s += s + rinc) < min_range);
+        bit_lag = ct;
+    }
+    Low = L;
+    Range = s;
+    if( bc_v)
+        printf(
+            "...done, B %x  Low %x  Range %x  blag %d \n",
+                B-Bstart, Low, Range, bit_lag
+        );
+}
+
+bool_writer& bool_writer::flush() {
+    if( is_toast)
+        return *this;
+    int b = bit_lag;
+    uint32 L = Low;
+    assert( b);
+    if( L & (1 << (32 - b)))
+        carry();
+    L <<= b & 7;
+    b >>= 3;
+    while( --b >= 0)
+        L <<= 8;
+    b = 4;
+    assert( !Bend  ||  B + 4 <= Bend);
+    do {
+        *B++ = (uchar) (L >> 24);
+        L <<= 8;
+    } while( --b);
+    is_toast = 1;
+    return *this;
+}
+
+
+bool_reader::bool_reader( c_spec& s, cuchar *src, size_t Len)
+  : bool_coder( s),
+    Bstart( src),
+    B( src),
+    Bend( Len? src+Len : 0),
+    shf( 32 - s.w),
+    bct( 8)
+{
+    int i = 4;  do { Low <<= 8;  Low |= *B++;}  while( --i);
+}
+
+
+bool bool_reader::raw( uint32 s) {
+
+    bool val = 0;
+    uint32 L = Low;
+    cuint32 S = s << shf;
+
+    assert( Range >= min_range  &&  Range <= spec.max_range());
+    assert( s  &&  s < Range  &&  (L >> shf) < Range);
+
+    if( bc_v)
+        printf(
+            "Reading, B %x  Low %x  Range %x  s %x  bct %d ...\n",
+            B-Bstart, Low, Range, s, bct
+        );
+
+    if( L >= S) {
+        L -= S;
+        s = Range - s;
+        assert( L < (s << shf));
+        val = 1;
+    } else
+        s -= rinc;
+    if( s < min_range) {
+        int ct = bct;
+        do {
+            assert( ~L & (1 << 31));
+            L += L;
+            if( !--ct) {
+                ct = 8;
+                if( !Bend  ||  B < Bend)
+                    L |= *B++;
+            }
+        } while( (s += s + rinc) < min_range);
+        bct = ct;
+    }
+    Low = L;
+    Range = s;
+    if( bc_v)
+        printf(
+            "...done, val %d  B %x  Low %x  Range %x  bct %d\n",
+            val? 1:0, B-Bstart, Low, Range, bct
+        );
+    return val;
+}
+
+
+/* C interfaces */
+
+// spec interface
+
+struct NS : bool_coder_namespace {
+    static Rounding r( vp8bc_c_prec *p, Rounding rr =down_full) {
+        return p? (Rounding) p->r : rr;
+    }
+};
+
+bool_coder_spec *vp8bc_vp6spec() {
+    return new bool_coder_spec_explicit_table( 0, bool_coder_namespace::Down, 8);
+}
+bool_coder_spec *vp8bc_float_spec(
+    unsigned int Ebits, unsigned int Mbits, vp8bc_c_prec *p
+) {
+    return new bool_coder_spec_float( Ebits, Mbits, NS::r( p), p? p->prec : 12);
+}
+bool_coder_spec *vp8bc_literal_spec(
+    const unsigned short m[256], vp8bc_c_prec *p
+) {
+    return new bool_coder_spec_explicit_table( m, NS::r( p), p? p->prec : 16);
+}
+bool_coder_spec *vp8bc_exponential_spec( unsigned int x, vp8bc_c_prec *p)
+{
+    return new bool_coder_spec_exponential_table( x, NS::r( p), p? p->prec : 16);
+}
+bool_coder_spec *vp8bc_spec_from_file( FILE *fp) {
+    return new bool_coder_spec( fp);
+}
+void vp8bc_destroy_spec( c_bool_coder_spec *p) { delete p;}
+
+void vp8bc_spec_to_file( c_bool_coder_spec *p, FILE *fp) { p->dump( fp);}
+
+vp8bc_index_t vp8bc_index( c_bool_coder_spec *p, double x) {
+    return ( *p)( x);
+}
+
+vp8bc_index_t vp8bc_index_from_counts(
+    c_bool_coder_spec *p, unsigned int L, unsigned int R
+) {
+    return ( *p)( (R += L)? (double) L/R : .5);
+}
+
+double vp8bc_probability( c_bool_coder_spec *p, vp8bc_index_t i) {
+    return ( *p)( i);
+}
+
+vp8bc_index_t vp8bc_complement( c_bool_coder_spec *p, vp8bc_index_t i) {
+    return p->complement( i);
+}
+unsigned int vp8bc_cost_zero( c_bool_coder_spec *p, vp8bc_index_t i) {
+    return p->cost_zero( i);
+}
+unsigned int vp8bc_cost_one( c_bool_coder_spec *p, vp8bc_index_t i) {
+    return p->cost_one( i);
+}
+unsigned int vp8bc_cost_bit( c_bool_coder_spec *p, vp8bc_index_t i, int v) {
+    return p->cost_bit( i, v);
+}
+
+#if tim_vp8
+    extern "C" int tok_verbose;
+
+#   define dbg_l 1000000
+
+    static vp8bc_index_t dbg_i [dbg_l];
+    static char dbg_v [dbg_l];
+    static size_t dbg_w = 0, dbg_r = 0;
+#endif
+
+// writer interface
+
+bool_writer *vp8bc_create_writer(
+    c_bool_coder_spec *p, unsigned char *D, size_t L
+) {
+    return new bool_writer( *p, D, L);
+}
+
+size_t vp8bc_destroy_writer( bool_writer *p) {
+    const size_t s = p->flush().bytes_written();
+    delete p;
+    return s;
+}
+
+void vp8bc_write_bool( bool_writer *p, int v, vp8bc_index_t i)
+{
+#   if tim_vp8
+        // bc_v = dbg_w < 10;
+        if( bc_v = tok_verbose)
+            printf( " writing %d at prob %d\n", v? 1:0, i);
+        accum_entropy_bc( &p->Spec(), i, v);
+
+        ( *p)( i, (bool) v);
+
+        if( dbg_w < dbg_l) {
+            dbg_i [dbg_w] = i;
+            dbg_v [dbg_w++] = v? 1:0;
+        }
+#   else
+        ( *p)( i, (bool) v);
+#   endif
+}
+
+void vp8bc_write_bits( bool_writer *p, unsigned int v, int n)
+{
+#   if tim_vp8
+        {
+            c_bool_coder_spec * const s = & p->Spec();
+            const vp8bc_index_t i = s->half_index();
+            int m = n;
+            while( --m >= 0)
+                accum_entropy_bc( s, i, (v>>m) & 1);
+        }
+#   endif
+
+    p->write_bits( n, v);
+}
+
+c_bool_coder_spec *vp8bc_writer_spec( c_bool_writer *w) { return & w->Spec();}
+
+// reader interface
+
+bool_reader *vp8bc_create_reader(
+    c_bool_coder_spec *p, const unsigned char *S, size_t L
+) {
+    return new bool_reader( *p, S, L);
+}
+
+void vp8bc_destroy_reader( bool_reader * p) { delete p;}
+
+int vp8bc_read_bool( bool_reader *p, vp8bc_index_t i)
+{
+#   if tim_vp8
+        // bc_v = dbg_r < 10;
+        bc_v = tok_verbose;
+        const int v = ( *p)( i)? 1:0;
+        if( tok_verbose)
+            printf( " reading %d at prob %d\n", v, i);
+        if( dbg_r < dbg_l) {
+            assert( dbg_r <= dbg_w);
+            if( i != dbg_i[dbg_r]  ||  v != dbg_v[dbg_r]) {
+                printf(
+        "Position %d: INCORRECTLY READING %d  prob %d, wrote %d  prob %d\n",
+                    dbg_r, v, i, dbg_v[dbg_r], dbg_i[dbg_r]
+                );
+            }
+            ++dbg_r;
+        }
+        return v;
+#   else
+        return ( *p)( i)? 1:0;
+#   endif
+}
+
+unsigned int vp8bc_read_bits( bool_reader *p, int n) { return p->read_bits( n);}
+
+c_bool_coder_spec *vp8bc_reader_spec( c_bool_reader *r) { return & r->Spec();}
+
+#undef bc_v
--- a/vp8/common/x86/idctllm_mmx.asm
+++ b/vp8/common/x86/idctllm_mmx.asm
@@ -14,18 +14,18 @@
 ; /****************************************************************************
 ; * Notes:
 ; *
-; * This implementation makes use of 16 bit fixed point version of two multiply
+; * This implementation makes use of 16 bit fixed point verio of two multiply
 ; * constants:
 ; *        1.   sqrt(2) * cos (pi/8)
-; *        2.   sqrt(2) * sin (pi/8)
-; * Because the first constant is bigger than 1, to maintain the same 16 bit
-; * fixed point precision as the second one, we use a trick of
+; *         2.   sqrt(2) * sin (pi/8)
+; * Becuase the first constant is bigger than 1, to maintain the same 16 bit
+; * fixed point prrcision as the second one, we use a trick of
 ; *        x * a = x + x*(a-1)
 ; * so
 ; *        x * sqrt(2) * cos (pi/8) = x + x * (sqrt(2) *cos(pi/8)-1).
 ; *
-; * For the second constant, because of the 16bit version is 35468, which
-; * is bigger than 32768, in signed 16 bit multiply, it becomes a negative
+; * For     the second constant, becuase of the 16bit version is 35468, which
+; * is bigger than 32768, in signed 16 bit multiply, it become a negative
 ; * number.
 ; *        (x * (unsigned)35468 >> 16) = x * (signed)35468 >> 16 + x
 ; *
--- a/vp8/common/x86/idctllm_sse2.asm
+++ b/vp8/common/x86/idctllm_sse2.asm
@@ -11,7 +11,7 @@

 %include "vpx_ports/x86_abi_support.asm"

-;void vp8_idct_dequant_0_2x_sse2
+;void idct_dequant_0_2x_sse2
 ; (
 ;   short *qcoeff       - 0
 ;   short *dequant      - 1
@@ -21,8 +21,8 @@
 ;   int blk_stride      - 5
 ; )

-global sym(vp8_idct_dequant_0_2x_sse2)
-sym(vp8_idct_dequant_0_2x_sse2):
+global sym(idct_dequant_0_2x_sse2)
+sym(idct_dequant_0_2x_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
@@ -32,6 +32,9 @@ sym(vp8_idct_dequant_0_2x_sse2):
        mov         rdx,            arg(1) ; dequant
        mov         rax,            arg(0) ; qcoeff

+    ; Zero out xmm7, for use unpacking
+        pxor        xmm7,           xmm7
+
        movd        xmm4,           [rax]
        movd        xmm5,           [rdx]

@@ -40,12 +43,9 @@ sym(vp8_idct_dequant_0_2x_sse2):

        pmullw      xmm4,           xmm5

-    ; Zero out xmm5, for use unpacking
-        pxor        xmm5,           xmm5
-
    ; clear coeffs
-        movd        [rax],          xmm5
-        movd        [rax+32],       xmm5
+        movd        [rax],          xmm7
+        movd        [rax+32],       xmm7
 ;pshufb
        pshuflw     xmm4,           xmm4,       00000000b
        pshufhw     xmm4,           xmm4,       00000000b
@@ -62,10 +62,10 @@ sym(vp8_idct_dequant_0_2x_sse2):
        lea         rcx,            [3*rcx]
        movq        xmm3,           [rax+rcx]

-        punpcklbw   xmm0,           xmm5
-        punpcklbw   xmm1,           xmm5
-        punpcklbw   xmm2,           xmm5
-        punpcklbw   xmm3,           xmm5
+        punpcklbw   xmm0,           xmm7
+        punpcklbw   xmm1,           xmm7
+        punpcklbw   xmm2,           xmm7
+        punpcklbw   xmm3,           xmm7

        mov         rax,            arg(3) ; dst
        movsxd      rdx,            dword ptr arg(4) ; dst_stride
@@ -77,10 +77,10 @@ sym(vp8_idct_dequant_0_2x_sse2):
        paddw       xmm3,           xmm4

    ; pack up before storing
-        packuswb    xmm0,           xmm5
-        packuswb    xmm1,           xmm5
-        packuswb    xmm2,           xmm5
-        packuswb    xmm3,           xmm5
+        packuswb    xmm0,           xmm7
+        packuswb    xmm1,           xmm7
+        packuswb    xmm2,           xmm7
+        packuswb    xmm3,           xmm7

    ; store blocks back out
        movq        [rax],          xmm0
@@ -97,12 +97,11 @@ sym(vp8_idct_dequant_0_2x_sse2):
    pop         rbp
    ret

-global sym(vp8_idct_dequant_full_2x_sse2)
-sym(vp8_idct_dequant_full_2x_sse2):
+global sym(idct_dequant_full_2x_sse2)
+sym(idct_dequant_full_2x_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 7
-    SAVE_XMM 7
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -348,12 +347,11 @@ sym(vp8_idct_dequant_full_2x_sse2):
    pop         rdi
    pop         rsi
    RESTORE_GOT
-    RESTORE_XMM
    UNSHADOW_ARGS
    pop         rbp
    ret

-;void vp8_idct_dequant_dc_0_2x_sse2
+;void idct_dequant_dc_0_2x_sse2
 ; (
 ;   short *qcoeff       - 0
 ;   short *dequant      - 1
@@ -362,8 +360,8 @@ sym(vp8_idct_dequant_full_2x_sse2):
 ;   int dst_stride      - 4
 ;   short *dc           - 5
 ; )
-global sym(vp8_idct_dequant_dc_0_2x_sse2)
-sym(vp8_idct_dequant_dc_0_2x_sse2):
+global sym(idct_dequant_dc_0_2x_sse2)
+sym(idct_dequant_dc_0_2x_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 7
@@ -379,8 +377,8 @@ sym(vp8_idct_dequant_dc_0_2x_sse2):
        mov         rdi,            arg(3) ; dst
        mov         rdx,            arg(5) ; dc

-    ; Zero out xmm5, for use unpacking
-        pxor        xmm5,           xmm5
+    ; Zero out xmm7, for use unpacking
+        pxor        xmm7,           xmm7

    ; load up 2 dc words here == 2*16 = doubleword
        movd        xmm4,           [rdx]
@@ -400,10 +398,10 @@ sym(vp8_idct_dequant_dc_0_2x_sse2):
        psraw       xmm4,           3

    ; Predict buffer needs to be expanded from bytes to words
-        punpcklbw   xmm0,           xmm5
-        punpcklbw   xmm1,           xmm5
-        punpcklbw   xmm2,           xmm5
-        punpcklbw   xmm3,           xmm5
+        punpcklbw   xmm0,           xmm7
+        punpcklbw   xmm1,           xmm7
+        punpcklbw   xmm2,           xmm7
+        punpcklbw   xmm3,           xmm7

    ; Add to predict buffer
        paddw       xmm0,           xmm4
@@ -412,10 +410,10 @@ sym(vp8_idct_dequant_dc_0_2x_sse2):
        paddw       xmm3,           xmm4

    ; pack up before storing
-        packuswb    xmm0,           xmm5
-        packuswb    xmm1,           xmm5
-        packuswb    xmm2,           xmm5
-        packuswb    xmm3,           xmm5
+        packuswb    xmm0,           xmm7
+        packuswb    xmm1,           xmm7
+        packuswb    xmm2,           xmm7
+        packuswb    xmm3,           xmm7

    ; Load destination stride before writing out,
    ;   doesn't need to persist
@@ -438,12 +436,11 @@ sym(vp8_idct_dequant_dc_0_2x_sse2):
    pop         rbp
    ret

-global sym(vp8_idct_dequant_dc_full_2x_sse2)
-sym(vp8_idct_dequant_dc_full_2x_sse2):
+global sym(idct_dequant_dc_full_2x_sse2)
+sym(idct_dequant_dc_full_2x_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 7
-    SAVE_XMM 7
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -695,7 +692,6 @@ sym(vp8_idct_dequant_dc_full_2x_sse2):
    pop         rdi
    pop         rsi
    RESTORE_GOT
-    RESTORE_XMM
    UNSHADOW_ARGS
    pop         rbp
    ret
--- a/vp8/common/x86/iwalsh_sse2.asm
+++ b/vp8/common/x86/iwalsh_sse2.asm
@@ -17,7 +17,7 @@ sym(vp8_short_inv_walsh4x4_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 2
-    SAVE_XMM 6
+    SAVE_XMM
    push        rsi
    push        rdi
    ; end prolog
@@ -41,7 +41,7 @@ sym(vp8_short_inv_walsh4x4_sse2):
    movdqa    xmm4, xmm0
    punpcklqdq  xmm0, xmm3          ;d1 a1
    punpckhqdq  xmm4, xmm3          ;c1 b1
-    movd    xmm6, eax
+    movd    xmm7, eax

    movdqa    xmm1, xmm4          ;c1 b1
    paddw   xmm4, xmm0          ;dl+cl a1+b1 aka op[4] op[0]
@@ -66,7 +66,7 @@ sym(vp8_short_inv_walsh4x4_sse2):
    pshufd    xmm2, xmm1, 4eh       ;ip[8] ip[12]
    movdqa    xmm3, xmm4          ;ip[4] ip[0]

-    pshufd    xmm6, xmm6, 0       ;03 03 03 03 03 03 03 03
+    pshufd    xmm7, xmm7, 0       ;03 03 03 03 03 03 03 03

    paddw   xmm4, xmm2          ;ip[4]+ip[8] ip[0]+ip[12] aka b1 a1
    psubw   xmm3, xmm2          ;ip[4]-ip[8] ip[0]-ip[12] aka c1 d1
@@ -90,8 +90,8 @@ sym(vp8_short_inv_walsh4x4_sse2):
    punpcklwd xmm5, xmm0          ; 31 21 11 01 30 20 10 00
    punpckhwd xmm1, xmm0          ; 33 23 13 03 32 22 12 02
 ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-    paddw   xmm5, xmm6
-    paddw   xmm1, xmm6
+    paddw   xmm5, xmm7
+    paddw   xmm1, xmm7

    psraw   xmm5, 3
    psraw   xmm1, 3
--- a/vp8/common/x86/loopfilter_mmx.asm
+++ b/vp8/common/x86/loopfilter_mmx.asm
@@ -16,7 +16,7 @@
 ;(
 ;    unsigned char *src_ptr,
 ;    int src_pixel_step,
-;    const char *blimit,
+;    const char *flimit,
 ;    const char *limit,
 ;    const char *thresh,
 ;    int  count
@@ -40,7 +40,7 @@ sym(vp8_loop_filter_horizontal_edge_mmx):
        movsxd      rax, dword ptr arg(1) ;src_pixel_step     ; destination pitch?

        movsxd      rcx, dword ptr arg(5) ;count
-.next8_h:
+next8_h:
        mov         rdx, arg(3) ;limit
        movq        mm7, [rdx]
        mov         rdi, rsi              ; rdi points to row +1 for indirect addressing
@@ -122,10 +122,12 @@ sym(vp8_loop_filter_horizontal_edge_mmx):
        paddusb     mm5, mm5              ; abs(p0-q0)*2
        paddusb     mm5, mm2              ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        mov         rdx, arg(2) ;blimit           ; get blimit
-        movq        mm7, [rdx]            ; blimit
+        mov         rdx, arg(2) ;flimit           ; get flimit
+        movq        mm2, [rdx]            ; flimit mm2
+        paddb       mm2, mm2              ; flimit*2 (less than 255)
+        paddb       mm7, mm2              ; flimit * 2 + limit (less than 255)

-        psubusb     mm5,    mm7           ; abs (p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     mm5,    mm7           ; abs (p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        por         mm1,    mm5
        pxor        mm5,    mm5
        pcmpeqb     mm1,    mm5           ; mask mm1
@@ -211,7 +213,7 @@ sym(vp8_loop_filter_horizontal_edge_mmx):
        add         rsi,8
        neg         rax
        dec         rcx
-        jnz         .next8_h
+        jnz         next8_h

    add rsp, 32
    pop rsp
@@ -228,7 +230,7 @@ sym(vp8_loop_filter_horizontal_edge_mmx):
 ;(
 ;    unsigned char *src_ptr,
 ;    int  src_pixel_step,
-;    const char *blimit,
+;    const char *flimit,
 ;    const char *limit,
 ;    const char *thresh,
 ;    int count
@@ -255,7 +257,7 @@ sym(vp8_loop_filter_vertical_edge_mmx):
        lea         rsi,        [rsi + rax*4 - 4]

        movsxd      rcx,        dword ptr arg(5) ;count
-.next8_v:
+next8_v:
        mov         rdi,        rsi           ; rdi points to row +1 for indirect addressing
        add         rdi,        rax

@@ -404,9 +406,9 @@ sym(vp8_loop_filter_vertical_edge_mmx):
        pand        mm5,        [GLOBAL(tfe)]               ; set lsb of each byte to zero
        psrlw       mm5,        1                           ; abs(p1-q1)/2

-        mov         rdx,        arg(2) ;blimit                      ;
+        mov         rdx,        arg(2) ;flimit                      ;

-        movq        mm4,        [rdx]                       ;blimit
+        movq        mm2,        [rdx]                       ;flimit  mm2
        movq        mm1,        mm3                         ; mm1=mm3=p0

        movq        mm7,        mm6                         ; mm7=mm6=q0
@@ -417,7 +419,10 @@ sym(vp8_loop_filter_vertical_edge_mmx):
        paddusb     mm1,        mm1                         ; abs(q0-p0)*2
        paddusb     mm1,        mm5                         ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        psubusb     mm1,        mm4                         ; abs (p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        paddb       mm2,        mm2                         ; flimit*2 (less than 255)
+        paddb       mm4,        mm2                         ; flimit * 2 + limit (less than 255)
+
+        psubusb     mm1,        mm4                         ; abs (p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        por         mm1,        mm0;                        ; mask

        pxor        mm0,        mm0
@@ -581,7 +586,7 @@ sym(vp8_loop_filter_vertical_edge_mmx):

        lea         rsi,        [rsi+rax*8]
        dec         rcx
-        jnz         .next8_v
+        jnz         next8_v

    add rsp, 64
    pop rsp
@@ -598,7 +603,7 @@ sym(vp8_loop_filter_vertical_edge_mmx):
 ;(
 ;    unsigned char *src_ptr,
 ;    int  src_pixel_step,
-;    const char *blimit,
+;    const char *flimit,
 ;    const char *limit,
 ;    const char *thresh,
 ;    int count
@@ -622,7 +627,7 @@ sym(vp8_mbloop_filter_horizontal_edge_mmx):
        movsxd      rax, dword ptr arg(1) ;src_pixel_step     ; destination pitch?

        movsxd      rcx, dword ptr arg(5) ;count
-.next8_mbh:
+next8_mbh:
        mov         rdx, arg(3) ;limit
        movq        mm7, [rdx]
        mov         rdi, rsi              ; rdi points to row +1 for indirect addressing
@@ -714,15 +719,17 @@ sym(vp8_mbloop_filter_horizontal_edge_mmx):
        paddusb     mm5, mm5              ; abs(p0-q0)*2
        paddusb     mm5, mm2              ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        mov         rdx, arg(2) ;blimit           ; get blimit
-        movq        mm7, [rdx]            ; blimit
+        mov         rdx, arg(2) ;flimit           ; get flimit
+        movq        mm2, [rdx]            ; flimit mm2
+        paddb       mm2, mm2              ; flimit*2 (less than 255)
+        paddb       mm7, mm2              ; flimit * 2 + limit (less than 255)

-        psubusb     mm5,    mm7           ; abs (p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     mm5,    mm7           ; abs (p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        por         mm1,    mm5
        pxor        mm5,    mm5
        pcmpeqb     mm1,    mm5           ; mask mm1

-        ; mm1 = mask, mm0=q0,  mm7 = blimit, t0 = abs(q0-q1) t1 = abs(p1-p0)
+        ; mm1 = mask, mm0=q0,  mm7 = flimit, t0 = abs(q0-q1) t1 = abs(p1-p0)
        ; mm6 = p0,

        ; calculate high edge variance
@@ -898,7 +905,7 @@ sym(vp8_mbloop_filter_horizontal_edge_mmx):
        neg         rax
        add         rsi,8
        dec         rcx
-        jnz         .next8_mbh
+        jnz         next8_mbh

    add rsp, 32
    pop rsp
@@ -915,7 +922,7 @@ sym(vp8_mbloop_filter_horizontal_edge_mmx):
 ;(
 ;    unsigned char *src_ptr,
 ;    int  src_pixel_step,
-;    const char *blimit,
+;    const char *flimit,
 ;    const char *limit,
 ;    const char *thresh,
 ;    int count
@@ -942,7 +949,7 @@ sym(vp8_mbloop_filter_vertical_edge_mmx):
        lea         rsi,        [rsi + rax*4 - 4]

        movsxd      rcx,        dword ptr arg(5) ;count
-.next8_mbv:
+next8_mbv:
        lea         rdi,        [rsi + rax]  ; rdi points to row +1 for indirect addressing

        ;transpose
@@ -1101,9 +1108,9 @@ sym(vp8_mbloop_filter_vertical_edge_mmx):
        pand        mm5,        [GLOBAL(tfe)]               ; set lsb of each byte to zero
        psrlw       mm5,        1                           ; abs(p1-q1)/2

-        mov         rdx,        arg(2) ;blimit                      ;
+        mov         rdx,        arg(2) ;flimit                      ;

-        movq        mm4,        [rdx]                       ;blimit
+        movq        mm2,        [rdx]                       ;flimit  mm2
        movq        mm1,        mm3                         ; mm1=mm3=p0

        movq        mm7,        mm6                         ; mm7=mm6=q0
@@ -1114,7 +1121,10 @@ sym(vp8_mbloop_filter_vertical_edge_mmx):
        paddusb     mm1,        mm1                         ; abs(q0-p0)*2
        paddusb     mm1,        mm5                         ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        psubusb     mm1,        mm4                         ; abs (p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        paddb       mm2,        mm2                         ; flimit*2 (less than 255)
+        paddb       mm4,        mm2                         ; flimit * 2 + limit (less than 255)
+
+        psubusb     mm1,        mm4                         ; abs (p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        por         mm1,        mm0;                        ; mask

        pxor        mm0,        mm0
@@ -1365,7 +1375,7 @@ sym(vp8_mbloop_filter_vertical_edge_mmx):
        lea         rsi,        [rsi+rax*8]
        dec         rcx

-        jnz         .next8_mbv
+        jnz         next8_mbv

    add rsp, 96
    pop rsp
@@ -1382,13 +1392,16 @@ sym(vp8_mbloop_filter_vertical_edge_mmx):
 ;(
 ;    unsigned char *src_ptr,
 ;    int  src_pixel_step,
-;    const char *blimit
+;    const char *flimit,
+;    const char *limit,
+;    const char *thresh,
+;    int count
 ;)
 global sym(vp8_loop_filter_simple_horizontal_edge_mmx)
 sym(vp8_loop_filter_simple_horizontal_edge_mmx):
    push        rbp
    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 3
+    SHADOW_ARGS_TO_STACK 6
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -1397,10 +1410,14 @@ sym(vp8_loop_filter_simple_horizontal_edge_mmx):
        mov         rsi, arg(0) ;src_ptr
        movsxd      rax, dword ptr arg(1) ;src_pixel_step     ; destination pitch?

-        mov         rcx, 2                ; count
-.nexts8_h:
-        mov         rdx, arg(2) ;blimit           ; get blimit
+        movsxd      rcx, dword ptr arg(5) ;count
+nexts8_h:
+        mov         rdx, arg(3) ;limit
+        movq        mm7, [rdx]
+        mov         rdx, arg(2) ;flimit           ; get flimit
        movq        mm3, [rdx]            ;
+        paddb       mm3, mm3              ; flimit*2 (less than 255)
+        paddb       mm3, mm7              ; flimit * 2 + limit (less than 255)

        mov         rdi, rsi              ; rdi points to row +1 for indirect addressing
        add         rdi, rax
@@ -1428,7 +1445,7 @@ sym(vp8_loop_filter_simple_horizontal_edge_mmx):
        paddusb     mm5, mm5              ; abs(p0-q0)*2
        paddusb     mm5, mm1              ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        psubusb     mm5, mm3              ; abs(p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     mm5, mm3              ; abs(p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        pxor        mm3, mm3
        pcmpeqb     mm5, mm3

@@ -1483,7 +1500,7 @@ sym(vp8_loop_filter_simple_horizontal_edge_mmx):
        add         rsi,8
        neg         rax
        dec         rcx
-        jnz         .nexts8_h
+        jnz         nexts8_h

    ; begin epilog
    pop rdi
@@ -1498,13 +1515,16 @@ sym(vp8_loop_filter_simple_horizontal_edge_mmx):
 ;(
 ;    unsigned char *src_ptr,
 ;    int  src_pixel_step,
-;    const char *blimit
+;    const char *flimit,
+;    const char *limit,
+;    const char *thresh,
+;    int count
 ;)
 global sym(vp8_loop_filter_simple_vertical_edge_mmx)
 sym(vp8_loop_filter_simple_vertical_edge_mmx):
    push        rbp
    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 3
+    SHADOW_ARGS_TO_STACK 6
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -1519,8 +1539,8 @@ sym(vp8_loop_filter_simple_vertical_edge_mmx):
        movsxd      rax, dword ptr arg(1) ;src_pixel_step     ; destination pitch?

        lea         rsi, [rsi + rax*4- 2];  ;
-        mov         rcx, 2                                      ; count
-.nexts8_v:
+        movsxd      rcx, dword ptr arg(5) ;count
+nexts8_v:

        lea         rdi,        [rsi + rax];
        movd        mm0,        [rdi + rax * 2]                 ; xx xx xx xx 73 72 71 70
@@ -1582,10 +1602,14 @@ sym(vp8_loop_filter_simple_vertical_edge_mmx):
        paddusb     mm5,        mm5                             ; abs(p0-q0)*2
        paddusb     mm5,        mm6                             ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        mov         rdx,        arg(2) ;blimit                          ; get blimit
+        mov         rdx,        arg(2) ;flimit                          ; get flimit
        movq        mm7,        [rdx]
+        mov         rdx,        arg(3)                          ; get limit
+        movq        mm6,        [rdx]
+        paddb       mm7,        mm7                             ; flimit*2 (less than 255)
+        paddb       mm7,        mm6                             ; flimit * 2 + limit (less than 255)

-        psubusb     mm5,        mm7                             ; abs(p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     mm5,        mm7                             ; abs(p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        pxor        mm7,        mm7
        pcmpeqb     mm5,        mm7                             ; mm5 = mask

@@ -1695,7 +1719,7 @@ sym(vp8_loop_filter_simple_vertical_edge_mmx):
        lea         rsi,        [rsi+rax*8]                 ; next 8

        dec         rcx
-        jnz         .nexts8_v
+        jnz         nexts8_v

    add rsp, 32
    pop rsp
--- a/vp8/common/x86/loopfilter_sse2.asm
+++ b/vp8/common/x86/loopfilter_sse2.asm
@@ -110,7 +110,7 @@
        psubusb     xmm6,                   xmm5              ; p1-=p0

        por         xmm6,                   xmm4              ; abs(p1 - p0)
-        mov         rdx,                    arg(2)            ; get blimit
+        mov         rdx,                    arg(2)            ; get flimit

        movdqa        t1,                   xmm6              ; save to t1

@@ -123,7 +123,7 @@
        psubusb     xmm1,                   xmm7
        por         xmm2,                   xmm3              ; abs(p1-q1)

-        movdqa      xmm7,                   XMMWORD PTR [rdx] ; blimit
+        movdqa      xmm4,                   XMMWORD PTR [rdx] ; flimit

        movdqa      xmm3,                   xmm0              ; q0
        pand        xmm2,                   [GLOBAL(tfe)]     ; set lsb of each byte to zero
@@ -134,11 +134,13 @@
        psrlw       xmm2,                   1                 ; abs(p1-q1)/2

        psubusb     xmm5,                   xmm3              ; p0-=q0
+        paddb       xmm4,                   xmm4              ; flimit*2 (less than 255)

        psubusb     xmm3,                   xmm6              ; q0-=p0
        por         xmm5,                   xmm3              ; abs(p0 - q0)

        paddusb     xmm5,                   xmm5              ; abs(p0-q0)*2
+        paddb       xmm7,                   xmm4              ; flimit * 2 + limit (less than 255)

        movdqa      xmm4,                   t0                ; hev get abs (q1 - q0)

@@ -148,7 +150,7 @@

        movdqa      xmm2,                   XMMWORD PTR [rdx] ; hev

-        psubusb     xmm5,                   xmm7              ; abs (p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     xmm5,                   xmm7              ; abs (p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        psubusb     xmm4,                   xmm2              ; hev

        psubusb     xmm3,                   xmm2              ; hev
@@ -276,7 +278,7 @@
 ;(
 ;    unsigned char *src_ptr,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    int            count
@@ -286,7 +288,7 @@ sym(vp8_loop_filter_horizontal_edge_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -326,7 +328,7 @@ sym(vp8_loop_filter_horizontal_edge_sse2):
 ;(
 ;    unsigned char *src_ptr,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    int            count
@@ -336,7 +338,7 @@ sym(vp8_loop_filter_horizontal_edge_uv_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -572,7 +574,7 @@ sym(vp8_loop_filter_horizontal_edge_uv_sse2):
 ;(
 ;    unsigned char *src_ptr,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    int            count
@@ -582,7 +584,7 @@ sym(vp8_mbloop_filter_horizontal_edge_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -622,7 +624,7 @@ sym(vp8_mbloop_filter_horizontal_edge_sse2):
 ;(
 ;    unsigned char *u,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    unsigned char *v
@@ -632,7 +634,7 @@ sym(vp8_mbloop_filter_horizontal_edge_uv_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -902,7 +904,7 @@ sym(vp8_mbloop_filter_horizontal_edge_uv_sse2):
        movdqa      xmm4,               XMMWORD PTR [rdx]; limit

        pmaxub      xmm0,               xmm7
-        mov         rdx,                arg(2)          ; blimit
+        mov         rdx,                arg(2)          ; flimit

        psubusb     xmm0,               xmm4
        movdqa      xmm5,               xmm2            ; q1
@@ -919,11 +921,12 @@ sym(vp8_mbloop_filter_horizontal_edge_uv_sse2):
        psrlw       xmm5,               1               ; abs(p1-q1)/2
        psubusb     xmm6,               xmm3            ; q0-p0

-        movdqa      xmm4,               XMMWORD PTR [rdx]; blimit
+        movdqa      xmm2,               XMMWORD PTR [rdx]; flimit

        mov         rdx,                arg(4)          ; get thresh

        por         xmm1,               xmm6            ; abs(q0-p0)
+        paddb       xmm2,               xmm2            ; flimit*2 (less than 255)

        movdqa      xmm6,               t0              ; get abs (q1 - q0)

@@ -936,9 +939,10 @@ sym(vp8_mbloop_filter_horizontal_edge_uv_sse2):
        paddusb     xmm1,               xmm5            ; abs (p0 - q0) *2 + abs(p1-q1)/2
        psubusb     xmm6,               xmm7            ; abs(q1 - q0) > thresh

+        paddb       xmm4,               xmm2            ; flimit * 2 + limit (less than 255)
        psubusb     xmm3,               xmm7            ; abs(p1 - p0)> thresh

-        psubusb     xmm1,               xmm4            ; abs (p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     xmm1,               xmm4            ; abs (p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        por         xmm6,               xmm3            ; abs(q1 - q0) > thresh || abs(p1 - p0) > thresh

        por         xmm1,               xmm0            ; mask
@@ -1010,7 +1014,7 @@ sym(vp8_mbloop_filter_horizontal_edge_uv_sse2):
 ;(
 ;    unsigned char *src_ptr,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    int            count
@@ -1020,7 +1024,7 @@ sym(vp8_loop_filter_vertical_edge_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -1077,7 +1081,7 @@ sym(vp8_loop_filter_vertical_edge_sse2):
 ;(
 ;    unsigned char *u,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    unsigned char *v
@@ -1087,7 +1091,7 @@ sym(vp8_loop_filter_vertical_edge_uv_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -1235,7 +1239,7 @@ sym(vp8_loop_filter_vertical_edge_uv_sse2):
 ;(
 ;    unsigned char *src_ptr,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    int            count
@@ -1245,7 +1249,7 @@ sym(vp8_mbloop_filter_vertical_edge_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -1304,7 +1308,7 @@ sym(vp8_mbloop_filter_vertical_edge_sse2):
 ;(
 ;    unsigned char *u,
 ;    int            src_pixel_step,
-;    const char    *blimit,
+;    const char    *flimit,
 ;    const char    *limit,
 ;    const char    *thresh,
 ;    unsigned char *v
@@ -1314,7 +1318,7 @@ sym(vp8_mbloop_filter_vertical_edge_uv_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 6
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -1372,14 +1376,17 @@ sym(vp8_mbloop_filter_vertical_edge_uv_sse2):
 ;(
 ;    unsigned char *src_ptr,
 ;    int  src_pixel_step,
-;    const char *blimit,
+;    const char *flimit,
+;    const char *limit,
+;    const char *thresh,
+;    int count
 ;)
 global sym(vp8_loop_filter_simple_horizontal_edge_sse2)
 sym(vp8_loop_filter_simple_horizontal_edge_sse2):
    push        rbp
    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 3
-    SAVE_XMM 7
+    SHADOW_ARGS_TO_STACK 6
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -1387,16 +1394,21 @@ sym(vp8_loop_filter_simple_horizontal_edge_sse2):

        mov         rsi, arg(0)             ;src_ptr
        movsxd      rax, dword ptr arg(1)   ;src_pixel_step     ; destination pitch?
-        mov         rdx, arg(2)             ;blimit
+        mov         rdx, arg(2) ;flimit     ; get flimit
        movdqa      xmm3, XMMWORD PTR [rdx]
+        mov         rdx, arg(3) ;limit
+        movdqa      xmm7, XMMWORD PTR [rdx]
+
+        paddb       xmm3, xmm3              ; flimit*2 (less than 255)
+        paddb       xmm3, xmm7              ; flimit * 2 + limit (less than 255)

        mov         rdi, rsi                ; rdi points to row +1 for indirect addressing
        add         rdi, rax
        neg         rax

        ; calculate mask
-        movdqa      xmm1, [rsi+2*rax]       ; p1
-        movdqa      xmm0, [rdi]             ; q1
+        movdqu      xmm1, [rsi+2*rax]       ; p1
+        movdqu      xmm0, [rdi]             ; q1
        movdqa      xmm2, xmm1
        movdqa      xmm7, xmm0
        movdqa      xmm4, xmm0
@@ -1406,8 +1418,8 @@ sym(vp8_loop_filter_simple_horizontal_edge_sse2):
        pand        xmm1, [GLOBAL(tfe)]     ; set lsb of each byte to zero
        psrlw       xmm1, 1                 ; abs(p1-q1)/2

-        movdqa      xmm5, [rsi+rax]         ; p0
-        movdqa      xmm4, [rsi]             ; q0
+        movdqu      xmm5, [rsi+rax]         ; p0
+        movdqu      xmm4, [rsi]             ; q0
        movdqa      xmm0, xmm4              ; q0
        movdqa      xmm6, xmm5              ; p0
        psubusb     xmm5, xmm4              ; p0-=q0
@@ -1416,7 +1428,7 @@ sym(vp8_loop_filter_simple_horizontal_edge_sse2):
        paddusb     xmm5, xmm5              ; abs(p0-q0)*2
        paddusb     xmm5, xmm1              ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        psubusb     xmm5, xmm3              ; abs(p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     xmm5, xmm3              ; abs(p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        pxor        xmm3, xmm3
        pcmpeqb     xmm5, xmm3

@@ -1449,7 +1461,7 @@ sym(vp8_loop_filter_simple_horizontal_edge_sse2):

        psubsb      xmm3, xmm0              ; q0-= q0 add
        pxor        xmm3, [GLOBAL(t80)]     ; unoffset
-        movdqa      [rsi], xmm3             ; write back
+        movdqu      [rsi], xmm3             ; write back

        ; now do +3 side
        psubsb      xmm5, [GLOBAL(t1s)]     ; +3 instead of +4
@@ -1465,7 +1477,7 @@ sym(vp8_loop_filter_simple_horizontal_edge_sse2):

        paddsb      xmm6, xmm0              ; p0+= p0 add
        pxor        xmm6, [GLOBAL(t80)]     ; unoffset
-        movdqa      [rsi+rax], xmm6         ; write back
+        movdqu      [rsi+rax], xmm6         ; write back

    ; begin epilog
    pop rdi
@@ -1481,14 +1493,17 @@ sym(vp8_loop_filter_simple_horizontal_edge_sse2):
 ;(
 ;    unsigned char *src_ptr,
 ;    int  src_pixel_step,
-;    const char *blimit,
+;    const char *flimit,
+;    const char *limit,
+;    const char *thresh,
+;    int count
 ;)
 global sym(vp8_loop_filter_simple_vertical_edge_sse2)
 sym(vp8_loop_filter_simple_vertical_edge_sse2):
    push        rbp         ; save old base pointer value.
    mov         rbp, rsp    ; set new base pointer value.
-    SHADOW_ARGS_TO_STACK 3
-    SAVE_XMM 7
+    SHADOW_ARGS_TO_STACK 6
+    SAVE_XMM
    GET_GOT     rbx         ; save callee-saved reg
    push        rsi
    push        rdi
@@ -1507,17 +1522,17 @@ sym(vp8_loop_filter_simple_vertical_edge_sse2):
        lea         rdx,        [rsi + rax*4]
        lea         rcx,        [rdx + rax]

-        movd        xmm0,       [rsi]                   ; (high 96 bits unused) 03 02 01 00
-        movd        xmm1,       [rdx]                   ; (high 96 bits unused) 43 42 41 40
-        movd        xmm2,       [rdi]                   ; 13 12 11 10
-        movd        xmm3,       [rcx]                   ; 53 52 51 50
+        movdqu      xmm0,       [rsi]                   ; (high 96 bits unused) 03 02 01 00
+        movdqu      xmm1,       [rdx]                   ; (high 96 bits unused) 43 42 41 40
+        movdqu      xmm2,       [rdi]                   ; 13 12 11 10
+        movdqu      xmm3,       [rcx]                   ; 53 52 51 50
        punpckldq   xmm0,       xmm1                    ; (high 64 bits unused) 43 42 41 40 03 02 01 00
        punpckldq   xmm2,       xmm3                    ; 53 52 51 50 13 12 11 10

-        movd        xmm4,       [rsi + rax*2]           ; 23 22 21 20
-        movd        xmm5,       [rdx + rax*2]           ; 63 62 61 60
-        movd        xmm6,       [rdi + rax*2]           ; 33 32 31 30
-        movd        xmm7,       [rcx + rax*2]           ; 73 72 71 70
+        movdqu      xmm4,       [rsi + rax*2]           ; 23 22 21 20
+        movdqu      xmm5,       [rdx + rax*2]           ; 63 62 61 60
+        movdqu      xmm6,       [rdi + rax*2]           ; 33 32 31 30
+        movdqu      xmm7,       [rcx + rax*2]           ; 73 72 71 70
        punpckldq   xmm4,       xmm5                    ; 63 62 61 60 23 22 21 20
        punpckldq   xmm6,       xmm7                    ; 73 72 71 70 33 32 31 30

@@ -1540,17 +1555,17 @@ sym(vp8_loop_filter_simple_vertical_edge_sse2):
        lea         rdx,        [rsi + rax*4]
        lea         rcx,        [rdx + rax]

-        movd        xmm4,       [rsi]                   ; 83 82 81 80
-        movd        xmm1,       [rdx]                   ; c3 c2 c1 c0
-        movd        xmm6,       [rdi]                   ; 93 92 91 90
-        movd        xmm3,       [rcx]                   ; d3 d2 d1 d0
+        movdqu      xmm4,       [rsi]                   ; 83 82 81 80
+        movdqu      xmm1,       [rdx]                   ; c3 c2 c1 c0
+        movdqu      xmm6,       [rdi]                   ; 93 92 91 90
+        movdqu      xmm3,       [rcx]                   ; d3 d2 d1 d0
        punpckldq   xmm4,       xmm1                    ; c3 c2 c1 c0 83 82 81 80
        punpckldq   xmm6,       xmm3                    ; d3 d2 d1 d0 93 92 91 90

-        movd        xmm0,       [rsi + rax*2]           ; a3 a2 a1 a0
-        movd        xmm5,       [rdx + rax*2]           ; e3 e2 e1 e0
-        movd        xmm2,       [rdi + rax*2]           ; b3 b2 b1 b0
-        movd        xmm7,       [rcx + rax*2]           ; f3 f2 f1 f0
+        movdqu      xmm0,       [rsi + rax*2]           ; a3 a2 a1 a0
+        movdqu      xmm5,       [rdx + rax*2]           ; e3 e2 e1 e0
+        movdqu      xmm2,       [rdi + rax*2]           ; b3 b2 b1 b0
+        movdqu      xmm7,       [rcx + rax*2]           ; f3 f2 f1 f0
        punpckldq   xmm0,       xmm5                    ; e3 e2 e1 e0 a3 a2 a1 a0
        punpckldq   xmm2,       xmm7                    ; f3 f2 f1 f0 b3 b2 b1 b0

@@ -1592,10 +1607,14 @@ sym(vp8_loop_filter_simple_vertical_edge_sse2):
        paddusb     xmm5,       xmm5                            ; abs(p0-q0)*2
        paddusb     xmm5,       xmm6                            ; abs (p0 - q0) *2 + abs(p1-q1)/2

-        mov         rdx,        arg(2)                          ;blimit
+        mov         rdx,        arg(2)                          ;flimit
        movdqa      xmm7, XMMWORD PTR [rdx]
+        mov         rdx,        arg(3)                          ; get limit
+        movdqa      xmm6, XMMWORD PTR [rdx]
+        paddb       xmm7,        xmm7                           ; flimit*2 (less than 255)
+        paddb       xmm7,        xmm6                           ; flimit * 2 + limit (less than 255)

-        psubusb     xmm5,        xmm7                           ; abs(p0 - q0) *2 + abs(p1-q1)/2  > blimit
+        psubusb     xmm5,        xmm7                           ; abs(p0 - q0) *2 + abs(p1-q1)/2  > flimit * 2 + limit
        pxor        xmm7,        xmm7
        pcmpeqb     xmm5,        xmm7                           ; mm5 = mask

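Reviewer note on the two assembly files above: the `+` lines build the edge threshold at run time as `flimit * 2 + limit`, while the `-` lines load a single precomputed `blimit` vector. Below is a minimal scalar sketch of the per-byte mask test that the simple filter performs on the `+` side. It assumes `paddusb`/`psubusb` saturate and `paddb` wraps, and the helper names `sat_add_u8` and `simple_filter_mask` are hypothetical, not libvpx functions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Unsigned saturating byte add, mirroring what paddusb does per lane. */
static uint8_t sat_add_u8(unsigned a, unsigned b)
{
    unsigned s = a + b;
    return (uint8_t)(s > 255 ? 255 : s);
}

/* Returns 0xFF when the edge should be filtered, 0x00 otherwise.
 * Hypothetical scalar model of one byte lane; p1, p0 | q0, q1 straddle the edge. */
static uint8_t simple_filter_mask(uint8_t flimit, uint8_t limit,
                                  uint8_t p1, uint8_t p0,
                                  uint8_t q0, uint8_t q1)
{
    unsigned d0 = (unsigned)abs(p0 - q0);        /* abs(p0 - q0) */
    unsigned d1 = (unsigned)abs(p1 - q1);        /* abs(p1 - q1) */

    /* abs(p0 - q0) * 2 + abs(p1 - q1) / 2, both additions saturating */
    uint8_t lhs = sat_add_u8(sat_add_u8(d0, d0), d1 >> 1);

    /* paddb wraps modulo 256; the asm comments assume the sum stays below 255 */
    uint8_t thresh = (uint8_t)(flimit * 2 + limit);

    /* psubusb leaves zero when lhs <= thresh; pcmpeqb against zero then sets 0xFF */
    return (lhs <= thresh) ? 0xFF : 0x00;
}
```

With a precomputed `blimit`, `thresh` would be passed in directly instead of being derived from two arguments. That the two values are numerically equal is an inference from the hunk comments, not something this diff states.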
--- a/vp8/common/x86/loopfilter_x86.c
+++ b/vp8/common/x86/loopfilter_x86.c
@@ -9,18 +9,30 @@
 */


-#include "vpx_config.h"
+#include "vpx_ports/config.h"
 #include "vp8/common/loopfilter.h"

+prototype_loopfilter(vp8_loop_filter_horizontal_edge_c);
+prototype_loopfilter(vp8_loop_filter_vertical_edge_c);
+prototype_loopfilter(vp8_mbloop_filter_horizontal_edge_c);
+prototype_loopfilter(vp8_mbloop_filter_vertical_edge_c);
+prototype_loopfilter(vp8_loop_filter_simple_horizontal_edge_c);
+prototype_loopfilter(vp8_loop_filter_simple_vertical_edge_c);
+
 prototype_loopfilter(vp8_mbloop_filter_vertical_edge_mmx);
 prototype_loopfilter(vp8_mbloop_filter_horizontal_edge_mmx);
 prototype_loopfilter(vp8_loop_filter_vertical_edge_mmx);
 prototype_loopfilter(vp8_loop_filter_horizontal_edge_mmx);
+prototype_loopfilter(vp8_loop_filter_simple_vertical_edge_mmx);
+prototype_loopfilter(vp8_loop_filter_simple_horizontal_edge_mmx);

 prototype_loopfilter(vp8_loop_filter_vertical_edge_sse2);
 prototype_loopfilter(vp8_loop_filter_horizontal_edge_sse2);
 prototype_loopfilter(vp8_mbloop_filter_vertical_edge_sse2);
 prototype_loopfilter(vp8_mbloop_filter_horizontal_edge_sse2);
+prototype_loopfilter(vp8_loop_filter_simple_vertical_edge_sse2);
+prototype_loopfilter(vp8_loop_filter_simple_horizontal_edge_sse2);
+prototype_loopfilter(vp8_fast_loop_filter_vertical_edges_sse2);

 extern loop_filter_uvfunction vp8_loop_filter_horizontal_edge_uv_sse2;
 extern loop_filter_uvfunction vp8_loop_filter_vertical_edge_uv_sse2;
@@ -30,77 +42,113 @@ extern loop_filter_uvfunction vp8_mbloop_filter_vertical_edge_uv_sse2;
 #if HAVE_MMX
 /* Horizontal MB filtering */
 void vp8_loop_filter_mbh_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                             int y_stride, int uv_stride, loop_filter_info *lfi)
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_horizontal_edge_mmx(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_horizontal_edge_mmx(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_horizontal_edge_mmx(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_horizontal_edge_mmx(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_mbloop_filter_horizontal_edge_mmx(v_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_horizontal_edge_mmx(v_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);
+}
+
+
+void vp8_loop_filter_mbhs_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_mmx(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }


 /* Vertical MB Filtering */
 void vp8_loop_filter_mbv_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                             int y_stride, int uv_stride, loop_filter_info *lfi)
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_vertical_edge_mmx(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_vertical_edge_mmx(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_vertical_edge_mmx(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_vertical_edge_mmx(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_mbloop_filter_vertical_edge_mmx(v_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 1);
+        vp8_mbloop_filter_vertical_edge_mmx(v_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, 1);
+}
+
+
+void vp8_loop_filter_mbvs_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_mmx(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }


 /* Horizontal B Filtering */
 void vp8_loop_filter_bh_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                            int y_stride, int uv_stride, loop_filter_info *lfi)
+                            int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_horizontal_edge_mmx(y_ptr + 4 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_mmx(y_ptr + 8 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_mmx(y_ptr + 12 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_horizontal_edge_mmx(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_mmx(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_mmx(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_horizontal_edge_mmx(u_ptr + 4 * uv_stride, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_horizontal_edge_mmx(u_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_loop_filter_horizontal_edge_mmx(v_ptr + 4 * uv_stride, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_horizontal_edge_mmx(v_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);
 }


-void vp8_loop_filter_bhs_mmx(unsigned char *y_ptr, int y_stride, const unsigned char *blimit)
+void vp8_loop_filter_bhs_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_horizontal_edge_mmx(y_ptr + 4 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_mmx(y_ptr + 8 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_mmx(y_ptr + 12 * y_stride, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_mmx(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_mmx(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_mmx(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }


 /* Vertical B Filtering */
 void vp8_loop_filter_bv_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                            int y_stride, int uv_stride, loop_filter_info *lfi)
+                            int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_vertical_edge_mmx(y_ptr + 4, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_mmx(y_ptr + 8, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_mmx(y_ptr + 12, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_vertical_edge_mmx(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_mmx(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_mmx(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_vertical_edge_mmx(u_ptr + 4, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_vertical_edge_mmx(u_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);

    if (v_ptr)
-        vp8_loop_filter_vertical_edge_mmx(v_ptr + 4, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, 1);
+        vp8_loop_filter_vertical_edge_mmx(v_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, 1);
 }


-void vp8_loop_filter_bvs_mmx(unsigned char *y_ptr, int y_stride, const unsigned char *blimit)
+void vp8_loop_filter_bvs_mmx(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_vertical_edge_mmx(y_ptr + 4, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_mmx(y_ptr + 8, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_mmx(y_ptr + 12, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_mmx(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_mmx(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_mmx(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }
 #endif

@@ -108,65 +156,113 @@ void vp8_loop_filter_bvs_mmx(unsigned char *y_ptr, int y_stride, const unsigned
 /* Horizontal MB filtering */
 #if HAVE_SSE2
 void vp8_loop_filter_mbh_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                              int y_stride, int uv_stride, loop_filter_info *lfi)
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_horizontal_edge_sse2(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_horizontal_edge_sse2(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_horizontal_edge_uv_sse2(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, v_ptr);
+        vp8_mbloop_filter_horizontal_edge_uv_sse2(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, v_ptr);
+}
+
+
+void vp8_loop_filter_mbhs_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_sse2(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }


 /* Vertical MB Filtering */
 void vp8_loop_filter_mbv_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                              int y_stride, int uv_stride, loop_filter_info *lfi)
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_mbloop_filter_vertical_edge_sse2(y_ptr, y_stride, lfi->mblim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_mbloop_filter_vertical_edge_sse2(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_mbloop_filter_vertical_edge_uv_sse2(u_ptr, uv_stride, lfi->mblim, lfi->lim, lfi->hev_thr, v_ptr);
+        vp8_mbloop_filter_vertical_edge_uv_sse2(u_ptr, uv_stride, lfi->mbflim, lfi->lim, lfi->thr, v_ptr);
+}
+
+
+void vp8_loop_filter_mbvs_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                               int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
+{
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr, y_stride, lfi->mbflim, lfi->lim, lfi->thr, 2);
 }


 /* Horizontal B Filtering */
 void vp8_loop_filter_bh_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                             int y_stride, int uv_stride, loop_filter_info *lfi)
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_horizontal_edge_sse2(y_ptr + 4 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_sse2(y_ptr + 8 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_horizontal_edge_sse2(y_ptr + 12 * y_stride, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_horizontal_edge_sse2(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_sse2(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_horizontal_edge_sse2(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_horizontal_edge_uv_sse2(u_ptr + 4 * uv_stride, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, v_ptr + 4 * uv_stride);
+        vp8_loop_filter_horizontal_edge_uv_sse2(u_ptr + 4 * uv_stride, uv_stride, lfi->flim, lfi->lim, lfi->thr, v_ptr + 4 * uv_stride);
 }


-void vp8_loop_filter_bhs_sse2(unsigned char *y_ptr, int y_stride, const unsigned char *blimit)
+void vp8_loop_filter_bhs_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_horizontal_edge_sse2(y_ptr + 4 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_sse2(y_ptr + 8 * y_stride, y_stride, blimit);
-    vp8_loop_filter_simple_horizontal_edge_sse2(y_ptr + 12 * y_stride, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_horizontal_edge_sse2(y_ptr + 4 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_sse2(y_ptr + 8 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_horizontal_edge_sse2(y_ptr + 12 * y_stride, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }


 /* Vertical B Filtering */
 void vp8_loop_filter_bv_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
-                             int y_stride, int uv_stride, loop_filter_info *lfi)
+                             int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_vertical_edge_sse2(y_ptr + 4, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_sse2(y_ptr + 8, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
-    vp8_loop_filter_vertical_edge_sse2(y_ptr + 12, y_stride, lfi->blim, lfi->lim, lfi->hev_thr, 2);
+    (void) simpler_lpf;
+    vp8_loop_filter_vertical_edge_sse2(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_sse2(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_vertical_edge_sse2(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);

    if (u_ptr)
-        vp8_loop_filter_vertical_edge_uv_sse2(u_ptr + 4, uv_stride, lfi->blim, lfi->lim, lfi->hev_thr, v_ptr + 4);
+        vp8_loop_filter_vertical_edge_uv_sse2(u_ptr + 4, uv_stride, lfi->flim, lfi->lim, lfi->thr, v_ptr + 4);
 }


-void vp8_loop_filter_bvs_sse2(unsigned char *y_ptr, int y_stride, const unsigned char *blimit)
+void vp8_loop_filter_bvs_sse2(unsigned char *y_ptr, unsigned char *u_ptr, unsigned char *v_ptr,
+                              int y_stride, int uv_stride, loop_filter_info *lfi, int simpler_lpf)
 {
-    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 4, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 8, y_stride, blimit);
-    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 12, y_stride, blimit);
+    (void) u_ptr;
+    (void) v_ptr;
+    (void) uv_stride;
+    (void) simpler_lpf;
+    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
 }

 #endif
+
+#if 0
+void vp8_fast_loop_filter_vertical_edges_sse(unsigned char *y_ptr,
+        int y_stride,
+        loop_filter_info *lfi)
+{
+
+    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 4, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 8, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+    vp8_loop_filter_simple_vertical_edge_sse2(y_ptr + 12, y_stride, lfi->flim, lfi->lim, lfi->thr, 2);
+}
+#endif
--- a/vp8/common/x86/loopfilter_x86.h
+++ b/vp8/common/x86/loopfilter_x86.h
@@ -24,10 +24,10 @@ extern prototype_loopfilter_block(vp8_loop_filter_mbv_mmx);
 extern prototype_loopfilter_block(vp8_loop_filter_bv_mmx);
 extern prototype_loopfilter_block(vp8_loop_filter_mbh_mmx);
 extern prototype_loopfilter_block(vp8_loop_filter_bh_mmx);
-extern prototype_simple_loopfilter(vp8_loop_filter_simple_vertical_edge_mmx);
-extern prototype_simple_loopfilter(vp8_loop_filter_bvs_mmx);
-extern prototype_simple_loopfilter(vp8_loop_filter_simple_horizontal_edge_mmx);
-extern prototype_simple_loopfilter(vp8_loop_filter_bhs_mmx);
+extern prototype_loopfilter_block(vp8_loop_filter_mbvs_mmx);
+extern prototype_loopfilter_block(vp8_loop_filter_bvs_mmx);
+extern prototype_loopfilter_block(vp8_loop_filter_mbhs_mmx);
+extern prototype_loopfilter_block(vp8_loop_filter_bhs_mmx);


 #if !CONFIG_RUNTIME_CPU_DETECT
@@ -44,13 +44,13 @@ extern prototype_simple_loopfilter(vp8_loop_filter_bhs_mmx);
 #define vp8_lf_normal_b_h vp8_loop_filter_bh_mmx

 #undef  vp8_lf_simple_mb_v
-#define vp8_lf_simple_mb_v vp8_loop_filter_simple_vertical_edge_mmx
+#define vp8_lf_simple_mb_v vp8_loop_filter_mbvs_mmx

 #undef  vp8_lf_simple_b_v
 #define vp8_lf_simple_b_v vp8_loop_filter_bvs_mmx

 #undef  vp8_lf_simple_mb_h
-#define vp8_lf_simple_mb_h vp8_loop_filter_simple_horizontal_edge_mmx
+#define vp8_lf_simple_mb_h vp8_loop_filter_mbhs_mmx

 #undef  vp8_lf_simple_b_h
 #define vp8_lf_simple_b_h vp8_loop_filter_bhs_mmx
@@ -63,10 +63,10 @@ extern prototype_loopfilter_block(vp8_loop_filter_mbv_sse2);
 extern prototype_loopfilter_block(vp8_loop_filter_bv_sse2);
 extern prototype_loopfilter_block(vp8_loop_filter_mbh_sse2);
 extern prototype_loopfilter_block(vp8_loop_filter_bh_sse2);
-extern prototype_simple_loopfilter(vp8_loop_filter_simple_vertical_edge_sse2);
-extern prototype_simple_loopfilter(vp8_loop_filter_bvs_sse2);
-extern prototype_simple_loopfilter(vp8_loop_filter_simple_horizontal_edge_sse2);
-extern prototype_simple_loopfilter(vp8_loop_filter_bhs_sse2);
+extern prototype_loopfilter_block(vp8_loop_filter_mbvs_sse2);
+extern prototype_loopfilter_block(vp8_loop_filter_bvs_sse2);
+extern prototype_loopfilter_block(vp8_loop_filter_mbhs_sse2);
+extern prototype_loopfilter_block(vp8_loop_filter_bhs_sse2);


 #if !CONFIG_RUNTIME_CPU_DETECT
@@ -83,13 +83,13 @@ extern prototype_simple_loopfilter(vp8_loop_filter_bhs_sse2);
 #define vp8_lf_normal_b_h vp8_loop_filter_bh_sse2

 #undef  vp8_lf_simple_mb_v
-#define vp8_lf_simple_mb_v vp8_loop_filter_simple_vertical_edge_sse2
+#define vp8_lf_simple_mb_v vp8_loop_filter_mbvs_sse2

 #undef  vp8_lf_simple_b_v
 #define vp8_lf_simple_b_v vp8_loop_filter_bvs_sse2

 #undef  vp8_lf_simple_mb_h
-#define vp8_lf_simple_mb_h vp8_loop_filter_simple_horizontal_edge_sse2
+#define vp8_lf_simple_mb_h vp8_loop_filter_mbhs_sse2

 #undef  vp8_lf_simple_b_h
 #define vp8_lf_simple_b_h vp8_loop_filter_bhs_sse2
--- a/vp8/common/x86/postproc_mmx.asm
+++ b/vp8/common/x86/postproc_mmx.asm
@@ -58,10 +58,10 @@ sym(vp8_post_proc_down_and_across_mmx):
        movsxd      rax, DWORD PTR arg(2) ;src_pixels_per_line ; destination pitch?
        pxor        mm0, mm0              ; mm0 = 00000000

-.nextrow:
+nextrow:

        xor         rdx,        rdx       ; clear out rdx for use as loop counter
-.nextcol:
+nextcol:

        pxor        mm7, mm7              ; mm7 = 00000000
        movq        mm6, [rbx + 32 ]      ; mm6 = kernel 2 taps
@@ -146,7 +146,7 @@ sym(vp8_post_proc_down_and_across_mmx):
        add         rdx, 4

        cmp         edx, dword ptr arg(5) ;cols
-        jl          .nextcol
+        jl          nextcol
        ; done with the all cols, start the across filtering in place
        sub         rsi, rdx
        sub         rdi, rdx
@@ -156,7 +156,7 @@ sym(vp8_post_proc_down_and_across_mmx):
        xor         rdx,    rdx
        mov         rax,    [rdi-4];

-.acrossnextcol:
+acrossnextcol:
        pxor        mm7, mm7              ; mm7 = 00000000
        movq        mm6, [rbx + 32 ]      ;
        movq        mm4, [rdi+rdx]        ; mm4 = p0..p7
@@ -237,7 +237,7 @@ sym(vp8_post_proc_down_and_across_mmx):

        add         rdx, 4
        cmp         edx, dword ptr arg(5) ;cols
-        jl          .acrossnextcol;
+        jl          acrossnextcol;

        mov         DWORD PTR [rdi+rdx-4],  eax
        pop         rax
@@ -249,7 +249,7 @@ sym(vp8_post_proc_down_and_across_mmx):
        movsxd      rax, dword ptr arg(2) ;src_pixels_per_line ; destination pitch?

        dec         rcx                   ; decrement count
-        jnz         .nextrow               ; next row
+        jnz         nextrow               ; next row
        pop         rbx

    ; begin epilog
@@ -293,7 +293,7 @@ sym(vp8_mbpost_proc_down_mmx):
    add         dword ptr arg(2), 8

    ;for(c=0; c<cols; c+=4)
-.loop_col:
+loop_col:
            mov         rsi,        arg(0)  ;s
            pxor        mm0,        mm0     ;

@@ -312,7 +312,7 @@ sym(vp8_mbpost_proc_down_mmx):

            mov         rcx,        15          ;

-.loop_initvar:
+loop_initvar:
            movd        mm1,        DWORD PTR [rdi];
            punpcklbw   mm1,        mm0     ;

@@ -329,10 +329,10 @@ sym(vp8_mbpost_proc_down_mmx):
            lea         rdi,        [rdi+rax]   ;

            dec         rcx
-            jne         .loop_initvar
+            jne         loop_initvar
            ;save the var and sum
            xor         rdx,        rdx
-.loop_row:
+loop_row:
            movd        mm1,        DWORD PTR [rsi]     ; [s-pitch*8]
            movd        mm2,        DWORD PTR [rdi]     ; [s+pitch*7]

@@ -438,13 +438,13 @@ sym(vp8_mbpost_proc_down_mmx):
            add         rdx,        1

            cmp         edx,        dword arg(2) ;rows
-            jl          .loop_row
+            jl          loop_row


        add         dword arg(0), 4 ; s += 4
        sub         dword arg(3), 4 ; cols -= 4
        cmp         dword arg(3), 0
-        jg          .loop_col
+        jg          loop_col

    add         rsp, 136
    pop         rsp
@@ -475,7 +475,7 @@ sym(vp8_plane_add_noise_mmx):
    push        rdi
    ; end prolog

-.addnoise_loop:
+addnoise_loop:
    call sym(rand) WRT_PLT
    mov     rcx, arg(1) ;noise
    and     rax, 0xff
@@ -492,7 +492,7 @@ sym(vp8_plane_add_noise_mmx):
            mov     rsi, arg(0) ;Pos
            xor         rax,rax

-.addnoise_nextset:
+addnoise_nextset:
            movq        mm1,[rsi+rax]         ; get the source

            psubusb     mm1, [rdx]    ;blackclamp        ; clamp both sides so we don't outrange adding noise
@@ -506,12 +506,12 @@ sym(vp8_plane_add_noise_mmx):
            add         rax,8                 ; move to the next line

            cmp         rax, rcx
-            jl          .addnoise_nextset
+            jl          addnoise_nextset

    movsxd  rax, dword arg(7) ; Pitch
    add     arg(0), rax ; Start += Pitch
    sub     dword arg(6), 1   ; Height -= 1
-    jg      .addnoise_loop
+    jg      addnoise_loop

    ; begin epilog
    pop rdi
--- a/vp8/common/x86/postproc_mmx.c
+++ b/vp8/common/x86/postproc_mmx.c
--- a/vp8/common/x86/postproc_sse2.asm
+++ b/vp8/common/x86/postproc_sse2.asm
@@ -26,7 +26,7 @@ sym(vp8_post_proc_down_and_across_xmm):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 7
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -57,10 +57,10 @@ sym(vp8_post_proc_down_and_across_xmm):
        movsxd      rax,        DWORD PTR arg(2) ;src_pixels_per_line ; destination pitch?
        pxor        xmm0,       xmm0              ; mm0 = 00000000

-.nextrow:
+nextrow:

        xor         rdx,        rdx       ; clear out rdx for use as loop counter
-.nextcol:
+nextcol:
        movq        xmm3,       QWORD PTR [rsi]         ; mm4 = r0 p0..p7
        punpcklbw   xmm3,       xmm0                    ; mm3 = p0..p3
        movdqa      xmm1,       xmm3                    ; mm1 = p0..p3
@@ -133,7 +133,7 @@ sym(vp8_post_proc_down_and_across_xmm):
        add         rdx,        8
        cmp         edx,        dword arg(5) ;cols

-        jl          .nextcol
+        jl          nextcol

        ; done with the all cols, start the across filtering in place
        sub         rsi,        rdx
@@ -142,7 +142,7 @@ sym(vp8_post_proc_down_and_across_xmm):
        xor         rdx,        rdx
        movq        mm0,        QWORD PTR [rdi-8];

-.acrossnextcol:
+acrossnextcol:
        movq        xmm7,       QWORD PTR [rdi +rdx -2]
        movd        xmm4,       DWORD PTR [rdi +rdx +6]

@@ -219,7 +219,7 @@ sym(vp8_post_proc_down_and_across_xmm):

        add         rdx,        8
        cmp         edx,        dword arg(5) ;cols
-        jl          .acrossnextcol;
+        jl          acrossnextcol;

        ; last 8 pixels
        movq        QWORD PTR [rdi+rdx-8],  mm0
@@ -231,7 +231,7 @@ sym(vp8_post_proc_down_and_across_xmm):
        mov         eax, dword arg(2) ;src_pixels_per_line ; destination pitch?

        dec         rcx                   ; decrement count
-        jnz         .nextrow              ; next row
+        jnz         nextrow               ; next row

 %if ABI_IS_32BIT=1 && CONFIG_PIC=1
    add rsp,16
@@ -256,7 +256,7 @@ sym(vp8_mbpost_proc_down_xmm):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 5
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -282,7 +282,7 @@ sym(vp8_mbpost_proc_down_xmm):
    add         dword arg(2), 8

    ;for(c=0; c<cols; c+=8)
-.loop_col:
+loop_col:
            mov         rsi,        arg(0) ; s
            pxor        xmm0,       xmm0        ;

@@ -301,7 +301,7 @@ sym(vp8_mbpost_proc_down_xmm):

            mov         rcx,        15          ;

-.loop_initvar:
+loop_initvar:
            movq        xmm1,       QWORD PTR [rdi];
            punpcklbw   xmm1,       xmm0        ;

@@ -318,10 +318,10 @@ sym(vp8_mbpost_proc_down_xmm):
            lea         rdi,        [rdi+rax]   ;

            dec         rcx
-            jne         .loop_initvar
+            jne         loop_initvar
            ;save the var and sum
            xor         rdx,        rdx
-.loop_row:
+loop_row:
            movq        xmm1,       QWORD PTR [rsi]     ; [s-pitch*8]
            movq        xmm2,       QWORD PTR [rdi]     ; [s+pitch*7]

@@ -428,12 +428,12 @@ sym(vp8_mbpost_proc_down_xmm):
            add         rdx,        1

            cmp         edx,        dword arg(2) ;rows
-            jl          .loop_row
+            jl          loop_row

        add         dword arg(0), 8 ; s += 8
        sub         dword arg(3), 8 ; cols -= 8
        cmp         dword arg(3), 0
-        jg          .loop_col
+        jg          loop_col

    add         rsp, 128+16
    pop         rsp
@@ -456,7 +456,7 @@ sym(vp8_mbpost_proc_across_ip_xmm):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 5
-    SAVE_XMM 7
+    SAVE_XMM
    GET_GOT     rbx
    push        rsi
    push        rdi
@@ -475,13 +475,13 @@ sym(vp8_mbpost_proc_across_ip_xmm):


    ;for(r=0;r<rows;r++)
-.ip_row_loop:
+ip_row_loop:

        xor         rdx,    rdx ;sumsq=0;
        xor         rcx,    rcx ;sum=0;
        mov         rsi,    arg(0); s
        mov         rdi,    -8
-.ip_var_loop:
+ip_var_loop:
        ;for(i=-8;i<=6;i++)
        ;{
        ;    sumsq += s[i]*s[i];
@@ -493,7 +493,7 @@ sym(vp8_mbpost_proc_across_ip_xmm):
        add         edx, eax
        add         rdi, 1
        cmp         rdi, 6
-        jle         .ip_var_loop
+        jle         ip_var_loop


            ;mov         rax,    sumsq
@@ -513,7 +513,7 @@ sym(vp8_mbpost_proc_across_ip_xmm):
            pxor        mm1,    mm1

            pxor        xmm0,   xmm0
-.nextcol4:
+nextcol4:

            movd        xmm1,   DWORD PTR [rsi+rcx-8]   ; -8 -7 -6 -5
            movd        xmm2,   DWORD PTR [rsi+rcx+7]   ; +7 +8 +9 +10
@@ -600,7 +600,7 @@ sym(vp8_mbpost_proc_across_ip_xmm):
            add         rcx,    4

            cmp         rcx,    rdx
-            jl          .nextcol4
+            jl          nextcol4

        ;s+=pitch;
        movsxd rax, dword arg(1)
@@ -608,7 +608,7 @@ sym(vp8_mbpost_proc_across_ip_xmm):

        sub dword arg(2), 1 ;rows-=1
        cmp dword arg(2), 0
-        jg .ip_row_loop
+        jg ip_row_loop

    add         rsp, 16
    pop         rsp
@@ -640,7 +640,7 @@ sym(vp8_plane_add_noise_wmt):
    push        rdi
    ; end prolog

-.addnoise_loop:
+addnoise_loop:
    call sym(rand) WRT_PLT
    mov     rcx, arg(1) ;noise
    and     rax, 0xff
@@ -657,7 +657,7 @@ sym(vp8_plane_add_noise_wmt):
            mov     rsi, arg(0) ;Pos
            xor         rax,rax

-.addnoise_nextset:
+addnoise_nextset:
            movdqu      xmm1,[rsi+rax]         ; get the source

            psubusb     xmm1, [rdx]    ;blackclamp        ; clamp both sides so we don't outrange adding noise
@@ -671,12 +671,12 @@ sym(vp8_plane_add_noise_wmt):
            add         rax,16                 ; move to the next line

            cmp         rax, rcx
-            jl          .addnoise_nextset
+            jl          addnoise_nextset

    movsxd  rax, dword arg(7) ; Pitch
    add     arg(0), rax ; Start += Pitch
    sub     dword arg(6), 1   ; Height -= 1
-    jg      .addnoise_loop
+    jg      addnoise_loop

    ; begin epilog
    pop rdi
--- a/vp8/common/x86/recon_sse2.asm
+++ b/vp8/common/x86/recon_sse2.asm
@@ -67,7 +67,7 @@ sym(vp8_recon4b_sse2):
    push        rbp
    mov         rbp, rsp
    SHADOW_ARGS_TO_STACK 4
-    SAVE_XMM 7
+    SAVE_XMM
    push        rsi
    push        rdi
    ; end prolog
@@ -229,460 +229,3 @@ sym(vp8_copy_mem16x16_sse2):
    UNSHADOW_ARGS
    pop         rbp
    ret
-
-
-;void vp8_intra_pred_uv_dc_mmx2(
-;    unsigned char *dst,
-;    int dst_stride
-;    unsigned char *src,
-;    int src_stride,
-;    )
-global sym(vp8_intra_pred_uv_dc_mmx2)
-sym(vp8_intra_pred_uv_dc_mmx2):
-    push        rbp
-    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 4
-    push        rsi
-    push        rdi
-    ; end prolog
-
-    ; from top
-    mov         rsi,        arg(2) ;src;
-    movsxd      rax,        dword ptr arg(3) ;src_stride;
-    sub         rsi,        rax
-    pxor        mm0,        mm0
-    movq        mm1,        [rsi]
-    psadbw      mm1,        mm0
-
-    ; from left
-    dec         rsi
-    lea         rdi,        [rax*3]
-    movzx       ecx,        byte [rsi+rax]
-    movzx       edx,        byte [rsi+rax*2]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rdi]
-    add         ecx,        edx
-    lea         rsi,        [rsi+rax*4]
-    movzx       edx,        byte [rsi]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rax]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rax*2]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rdi]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rax*4]
-    add         ecx,        edx
-
-    ; add up
-    pextrw      edx,        mm1, 0x0
-    lea         edx,        [edx+ecx+8]
-    sar         edx,        4
-    movd        mm1,        edx
-    pshufw      mm1,        mm1, 0x0
-    packuswb    mm1,        mm1
-
-    ; write out
-    mov         rdi,        arg(0) ;dst;
-    movsxd      rcx,        dword ptr arg(1) ;dst_stride
-    lea         rax,        [rcx*3]
-
-    movq [rdi      ],       mm1
-    movq [rdi+rcx  ],       mm1
-    movq [rdi+rcx*2],       mm1
-    movq [rdi+rax  ],       mm1
-    lea         rdi,        [rdi+rcx*4]
-    movq [rdi      ],       mm1
-    movq [rdi+rcx  ],       mm1
-    movq [rdi+rcx*2],       mm1
-    movq [rdi+rax  ],       mm1
-
-    ; begin epilog
-    pop         rdi
-    pop         rsi
-    UNSHADOW_ARGS
-    pop         rbp
-    ret
-
-;void vp8_intra_pred_uv_dctop_mmx2(
-;    unsigned char *dst,
-;    int dst_stride
-;    unsigned char *src,
-;    int src_stride,
-;    )
-global sym(vp8_intra_pred_uv_dctop_mmx2)
-sym(vp8_intra_pred_uv_dctop_mmx2):
-    push        rbp
-    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 4
-    GET_GOT     rbx
-    push        rsi
-    push        rdi
-    ; end prolog
-
-    ; from top
-    mov         rsi,        arg(2) ;src;
-    movsxd      rax,        dword ptr arg(3) ;src_stride;
-    sub         rsi,        rax
-    pxor        mm0,        mm0
-    movq        mm1,        [rsi]
-    psadbw      mm1,        mm0
-
-    ; add up
-    paddw       mm1,        [GLOBAL(dc_4)]
-    psraw       mm1,        3
-    pshufw      mm1,        mm1, 0x0
-    packuswb    mm1,        mm1
-
-    ; write out
-    mov         rdi,        arg(0) ;dst;
-    movsxd      rcx,        dword ptr arg(1) ;dst_stride
-    lea         rax,        [rcx*3]
-
-    movq [rdi      ],       mm1
-    movq [rdi+rcx  ],       mm1
-    movq [rdi+rcx*2],       mm1
-    movq [rdi+rax  ],       mm1
-    lea         rdi,        [rdi+rcx*4]
-    movq [rdi      ],       mm1
-    movq [rdi+rcx  ],       mm1
-    movq [rdi+rcx*2],       mm1
-    movq [rdi+rax  ],       mm1
-
-    ; begin epilog
-    pop         rdi
-    pop         rsi
-    RESTORE_GOT
-    UNSHADOW_ARGS
-    pop         rbp
-    ret
-
-;void vp8_intra_pred_uv_dcleft_mmx2(
-;    unsigned char *dst,
-;    int dst_stride
-;    unsigned char *src,
-;    int src_stride,
-;    )
-global sym(vp8_intra_pred_uv_dcleft_mmx2)
-sym(vp8_intra_pred_uv_dcleft_mmx2):
-    push        rbp
-    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 4
-    push        rsi
-    push        rdi
-    ; end prolog
-
-    ; from left
-    mov         rsi,        arg(2) ;src;
-    movsxd      rax,        dword ptr arg(3) ;src_stride;
-    dec         rsi
-    lea         rdi,        [rax*3]
-    movzx       ecx,        byte [rsi]
-    movzx       edx,        byte [rsi+rax]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rax*2]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rdi]
-    add         ecx,        edx
-    lea         rsi,        [rsi+rax*4]
-    movzx       edx,        byte [rsi]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rax]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rax*2]
-    add         ecx,        edx
-    movzx       edx,        byte [rsi+rdi]
-    lea         edx,        [ecx+edx+4]
-
-    ; add up
-    shr         edx,        3
-    movd        mm1,        edx
-    pshufw      mm1,        mm1, 0x0
-    packuswb    mm1,        mm1
-
-    ; write out
-    mov         rdi,        arg(0) ;dst;
-    movsxd      rcx,        dword ptr arg(1) ;dst_stride
-    lea         rax,        [rcx*3]
-
-    movq [rdi      ],       mm1
-    movq [rdi+rcx  ],       mm1
-    movq [rdi+rcx*2],       mm1
-    movq [rdi+rax  ],       mm1
-    lea         rdi,        [rdi+rcx*4]
-    movq [rdi      ],       mm1
-    movq [rdi+rcx  ],       mm1
-    movq [rdi+rcx*2],       mm1
-    movq [rdi+rax  ],       mm1
-
-    ; begin epilog
-    pop         rdi
-    pop         rsi
-    UNSHADOW_ARGS
-    pop         rbp
-    ret
-
-;void vp8_intra_pred_uv_dc128_mmx(
-;    unsigned char *dst,
-;    int dst_stride
-;    unsigned char *src,
-;    int src_stride,
-;    )
-global sym(vp8_intra_pred_uv_dc128_mmx)
-sym(vp8_intra_pred_uv_dc128_mmx):
-    push        rbp
-    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 4
-    GET_GOT     rbx
-    ; end prolog
-
-    ; write out
-    movq        mm1,        [GLOBAL(dc_128)]
-    mov         rax,        arg(0) ;dst;
-    movsxd      rdx,        dword ptr arg(1) ;dst_stride
-    lea         rcx,        [rdx*3]
-
-    movq [rax      ],       mm1
-    movq [rax+rdx  ],       mm1
-    movq [rax+rdx*2],       mm1
-    movq [rax+rcx  ],       mm1
-    lea         rax,        [rax+rdx*4]
-    movq [rax      ],       mm1
-    movq [rax+rdx  ],       mm1
-    movq [rax+rdx*2],       mm1
-    movq [rax+rcx  ],       mm1
-
-    ; begin epilog
-    RESTORE_GOT
-    UNSHADOW_ARGS
-    pop         rbp
-    ret
-
-;void vp8_intra_pred_uv_tm_sse2(
-;    unsigned char *dst,
-;    int dst_stride
-;    unsigned char *src,
-;    int src_stride,
-;    )
-%macro vp8_intra_pred_uv_tm 1
-global sym(vp8_intra_pred_uv_tm_%1)
-sym(vp8_intra_pred_uv_tm_%1):
-    push        rbp
-    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 4
-    GET_GOT     rbx
-    push        rsi
-    push        rdi
-    ; end prolog
-
-    ; read top row
-    mov         edx,        4
-    mov         rsi,        arg(2) ;src;
-    movsxd      rax,        dword ptr arg(3) ;src_stride;
-    sub         rsi,        rax
-    pxor        xmm0,       xmm0
-%ifidn %1, ssse3
-    movdqa      xmm2,       [GLOBAL(dc_1024)]
-%endif
-    movq        xmm1,       [rsi]
-    punpcklbw   xmm1,       xmm0
-
-    ; set up left ptrs ans subtract topleft
-    movd        xmm3,       [rsi-1]
-    lea         rsi,        [rsi+rax-1]
-%ifidn %1, sse2
-    punpcklbw   xmm3,       xmm0
-    pshuflw     xmm3,       xmm3, 0x0
-    punpcklqdq  xmm3,       xmm3
-%else
-    pshufb      xmm3,       xmm2
-%endif
-    psubw       xmm1,       xmm3
-
-    ; set up dest ptrs
-    mov         rdi,        arg(0) ;dst;
-    movsxd      rcx,        dword ptr arg(1) ;dst_stride
-
-.vp8_intra_pred_uv_tm_%1_loop:
-    movd        xmm3,       [rsi]
-    movd        xmm5,       [rsi+rax]
-%ifidn %1, sse2
-    punpcklbw   xmm3,       xmm0
-    punpcklbw   xmm5,       xmm0
-    pshuflw     xmm3,       xmm3, 0x0
-    pshuflw     xmm5,       xmm5, 0x0
-    punpcklqdq  xmm3,       xmm3
-    punpcklqdq  xmm5,       xmm5
-%else
-    pshufb      xmm3,       xmm2
-    pshufb      xmm5,       xmm2
-%endif
-    paddw       xmm3,       xmm1
-    paddw       xmm5,       xmm1
-    packuswb    xmm3,       xmm5
-    movq  [rdi    ],        xmm3
-    movhps[rdi+rcx],        xmm3
-    lea         rsi,        [rsi+rax*2]
-    lea         rdi,        [rdi+rcx*2]
-    dec         edx
-    jnz .vp8_intra_pred_uv_tm_%1_loop
-
-    ; begin epilog
-    pop         rdi
-    pop         rsi
-    RESTORE_GOT
-    UNSHADOW_ARGS
-    pop         rbp
-    ret
-%endmacro
-
-vp8_intra_pred_uv_tm sse2
-vp8_intra_pred_uv_tm ssse3
-
-;void vp8_intra_pred_uv_ve_mmx(
-;    unsigned char *dst,
-;    int dst_stride
-;    unsigned char *src,
-;    int src_stride,
-;    )
-global sym(vp8_intra_pred_uv_ve_mmx)
-sym(vp8_intra_pred_uv_ve_mmx):
-    push        rbp
-    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 4
-    ; end prolog
-
-    ; read from top
-    mov         rax,        arg(2) ;src;
-    movsxd      rdx,        dword ptr arg(3) ;src_stride;
-    sub         rax,        rdx
-    movq        mm1,        [rax]
-
-    ; write out
-    mov         rax,        arg(0) ;dst;
-    movsxd      rdx,        dword ptr arg(1) ;dst_stride
-    lea         rcx,        [rdx*3]
-
-    movq [rax      ],       mm1
-    movq [rax+rdx  ],       mm1
-    movq [rax+rdx*2],       mm1
-    movq [rax+rcx  ],       mm1
-    lea         rax,        [rax+rdx*4]
-    movq [rax      ],       mm1
-    movq [rax+rdx  ],       mm1
-    movq [rax+rdx*2],       mm1
-    movq [rax+rcx  ],       mm1
-
-    ; begin epilog
-    UNSHADOW_ARGS
-    pop         rbp
-    ret
-
-;void vp8_intra_pred_uv_ho_mmx2(
-;    unsigned char *dst,
-;    int dst_stride
-;    unsigned char *src,
-;    int src_stride,
-;    )
-%macro vp8_intra_pred_uv_ho 1
-global sym(vp8_intra_pred_uv_ho_%1)
-sym(vp8_intra_pred_uv_ho_%1):
-    push        rbp
-    mov         rbp, rsp
-    SHADOW_ARGS_TO_STACK 4
-    push        rsi
-    push        rdi
-%ifidn %1, ssse3
-%ifndef GET_GOT_SAVE_ARG
-    push        rbx
-%endif
-    GET_GOT     rbx
-%endif
-    ; end prolog
-
-    ; read from left and write out
-%ifidn %1, mmx2
-    mov         edx,        4
-%endif
-    mov         rsi,        arg(2) ;src;
-    movsxd      rax,        dword ptr arg(3) ;src_stride;
-    mov         rdi,        arg(0) ;dst;
-    movsxd      rcx,        dword ptr arg(1) ;dst_stride
-%ifidn %1, ssse3
-    lea         rdx,        [rcx*3]
-    movdqa      xmm2,       [GLOBAL(dc_00001111)]
-    lea         rbx,        [rax*3]
-%endif
-    dec         rsi
-%ifidn %1, mmx2
-.vp8_intra_pred_uv_ho_%1_loop:
-    movd        mm0,        [rsi]
-    movd        mm1,        [rsi+rax]
-    punpcklbw   mm0,        mm0
-    punpcklbw   mm1,        mm1
-    pshufw      mm0,        mm0, 0x0
-    pshufw      mm1,        mm1, 0x0
-    movq  [rdi    ],        mm0
-    movq  [rdi+rcx],        mm1
-    lea         rsi,        [rsi+rax*2]
-    lea         rdi,        [rdi+rcx*2]
-    dec         edx
-    jnz .vp8_intra_pred_uv_ho_%1_loop
-%else
-    movd        xmm0,       [rsi]
-    movd        xmm3,       [rsi+rax]
-    movd        xmm1,       [rsi+rax*2]
-    movd        xmm4,       [rsi+rbx]
-    punpcklbw   xmm0,       xmm3
-    punpcklbw   xmm1,       xmm4
-    pshufb      xmm0,       xmm2
-    pshufb      xmm1,       xmm2
-    movq   [rdi    ],       xmm0
-    movhps [rdi+rcx],       xmm0
-    movq [rdi+rcx*2],       xmm1
-    movhps [rdi+rdx],       xmm1
-    lea         rsi,        [rsi+rax*4]
-    lea         rdi,        [rdi+rcx*4]
-    movd        xmm0,       [rsi]
-    movd        xmm3,       [rsi+rax]
-    movd        xmm1,       [rsi+rax*2]
-    movd        xmm4,       [rsi+rbx]
-    punpcklbw   xmm0,       xmm3
-    punpcklbw   xmm1,       xmm4
-    pshufb      xmm0,       xmm2
-    pshufb      xmm1,       xmm2
-    movq   [rdi    ],       xmm0
-    movhps [rdi+rcx],       xmm0
-    movq [rdi+rcx*2],       xmm1
-    movhps [rdi+rdx],       xmm1
-%endif
-
-    ; begin epilog
-%ifidn %1, ssse3
-    RESTORE_GOT
-%ifndef GET_GOT_SAVE_ARG
-    pop         rbx
-%endif
-%endif
-    pop         rdi
-    pop         rsi
-    UNSHADOW_ARGS
-    pop         rbp
-    ret
-%endmacro
-
-vp8_intra_pred_uv_ho mmx2
-vp8_intra_pred_uv_ho ssse3
-
-SECTION_RODATA
-dc_128:
-    times 8 db 128
-dc_4:
-    times 4 dw 4
-align 16
-dc_1024:
-    times 8 dw 0x400
-align 16
-dc_00001111:
-    times 8 db 0
-    times 8 db 1
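Reviewer note: the removed `vp8_intra_pred_uv_tm_%1` macro above subtracts the top-left pixel from the row above, adds each left-column pixel, and saturates with `packuswb`. A minimal scalar sketch of that arithmetic follows, assuming an 8x8 chroma block and the same `(dst, dst_stride, src, src_stride)` argument order. The names `clamp_u8` and `intra_pred_uv_tm_ref` are hypothetical and are not part of libvpx.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit pixel
 * (the scalar counterpart of the packuswb saturation). */
static uint8_t clamp_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference for an 8x8 TM chroma predictor.
 * src points at the block's top-left pixel inside a frame buffer, so
 * src[-src_stride + c] is the row above, src[r * src_stride - 1] is the
 * left column, and src[-src_stride - 1] is the top-left corner. */
static void intra_pred_uv_tm_ref(uint8_t *dst, int dst_stride,
                                 const uint8_t *src, int src_stride) {
    const uint8_t *above = src - src_stride;
    int top_left = above[-1];
    for (int r = 0; r < 8; r++) {
        int left = src[r * src_stride - 1];
        for (int c = 0; c < 8; c++)
            dst[r * dst_stride + c] = clamp_u8(left + above[c] - top_left);
    }
}
```

The assembly handles two rows per loop iteration and precomputes `above - top_left` once; the sketch keeps the plain per-pixel form for readability.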
--- a/vp8/common/x86/recon_wrapper_sse2.c
+++ b/vp8/common/x86/recon_wrapper_sse2.c
@@ -1,96 +0,0 @@
-/*
- *  Copyright (c) 2010 The WebM project authors. All Rights Reserved.
- *
- *  Use of this source code is governed by a BSD-style license
- *  that can be found in the LICENSE file in the root of the source
- *  tree. An additional intellectual property rights grant can be found
- *  in the file PATENTS.  All contributing project authors may
- *  be found in the AUTHORS file in the root of the source tree.
- */
-
-#include "vpx_config.h"
-#include "vp8/common/recon.h"
-#include "recon_x86.h"
-#include "vpx_mem/vpx_mem.h"
-
-#define build_intra_predictors_mbuv_prototype(sym) \
-    void sym(unsigned char *dst, int dst_stride, \
-             const unsigned char *src, int src_stride)
-typedef build_intra_predictors_mbuv_prototype((*build_intra_predictors_mbuv_fn_t));
-
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_dc_mmx2);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_dctop_mmx2);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_dcleft_mmx2);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_dc128_mmx);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_ho_mmx2);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_ho_ssse3);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_ve_mmx);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_tm_sse2);
-extern build_intra_predictors_mbuv_prototype(vp8_intra_pred_uv_tm_ssse3);
-
-static void vp8_build_intra_predictors_mbuv_x86(MACROBLOCKD *x,
-                                                unsigned char *dst_u,
-                                                unsigned char *dst_v,
-                                                int dst_stride,
-                                                build_intra_predictors_mbuv_fn_t tm_func,
-                                                build_intra_predictors_mbuv_fn_t ho_func)
-{
-    int mode = x->mode_info_context->mbmi.uv_mode;
-    build_intra_predictors_mbuv_fn_t fn;
-    int src_stride = x->dst.uv_stride;
-
-    switch (mode) {
-        case  V_PRED: fn = vp8_intra_pred_uv_ve_mmx; break;
-        case  H_PRED: fn = ho_func; break;
-        case TM_PRED: fn = tm_func; break;
-        case DC_PRED:
-            if (x->up_available) {
-                if (x->left_available) {
-                    fn = vp8_intra_pred_uv_dc_mmx2; break;
-                } else {
-                    fn = vp8_intra_pred_uv_dctop_mmx2; break;
-                }
-            } else if (x->left_available) {
-                fn = vp8_intra_pred_uv_dcleft_mmx2; break;
-            } else {
-                fn = vp8_intra_pred_uv_dc128_mmx; break;
-            }
-            break;
-        default: return;
-    }
-
-    fn(dst_u, dst_stride, x->dst.u_buffer, src_stride);
-    fn(dst_v, dst_stride, x->dst.v_buffer, src_stride);
-}
-
-void vp8_build_intra_predictors_mbuv_sse2(MACROBLOCKD *x)
-{
-    vp8_build_intra_predictors_mbuv_x86(x, &x->predictor[256],
-                                        &x->predictor[320], 8,
-                                        vp8_intra_pred_uv_tm_sse2,
-                                        vp8_intra_pred_uv_ho_mmx2);
-}
-
-void vp8_build_intra_predictors_mbuv_ssse3(MACROBLOCKD *x)
-{
-    vp8_build_intra_predictors_mbuv_x86(x, &x->predictor[256],
-                                        &x->predictor[320], 8,
-                                        vp8_intra_pred_uv_tm_ssse3,
-                                        vp8_intra_pred_uv_ho_ssse3);
-}
-
-void vp8_build_intra_predictors_mbuv_s_sse2(MACROBLOCKD *x)
-{
-    vp8_build_intra_predictors_mbuv_x86(x, x->dst.u_buffer,
-                                        x->dst.v_buffer, x->dst.uv_stride,
-                                        vp8_intra_pred_uv_tm_sse2,
-                                        vp8_intra_pred_uv_ho_mmx2);
-}
-
-void vp8_build_intra_predictors_mbuv_s_ssse3(MACROBLOCKD *x)
-{
-    vp8_build_intra_predictors_mbuv_x86(x, x->dst.u_buffer,
-                                        x->dst.v_buffer, x->dst.uv_stride,
-                                        vp8_intra_pred_uv_tm_ssse3,
-                                        vp8_intra_pred_uv_ho_ssse3);
-}
--- a/vp8/common/x86/recon_x86.h
+++ b/vp8/common/x86/recon_x86.h
@@ -46,8 +46,6 @@ extern prototype_copy_block(vp8_copy_mem16x16_mmx);
 extern prototype_recon_block(vp8_recon2b_sse2);
 extern prototype_recon_block(vp8_recon4b_sse2);
 extern prototype_copy_block(vp8_copy_mem16x16_sse2);
-extern prototype_build_intra_predictors(vp8_build_intra_predictors_mbuv_sse2);
-extern prototype_build_intra_predictors(vp8_build_intra_predictors_mbuv_s_sse2);

 #if !CONFIG_RUNTIME_CPU_DETECT
 #undef  vp8_recon_recon2
@@ -59,26 +57,6 @@ extern prototype_build_intra_predictors(vp8_build_intra_predictors_mbuv_s_sse2);
 #undef  vp8_recon_copy16x16
 #define vp8_recon_copy16x16 vp8_copy_mem16x16_sse2

-#undef  vp8_recon_build_intra_predictors_mbuv
-#define vp8_recon_build_intra_predictors_mbuv vp8_build_intra_predictors_mbuv_sse2
-
-#undef  vp8_recon_build_intra_predictors_mbuv_s
-#define vp8_recon_build_intra_predictors_mbuv_s vp8_build_intra_predictors_mbuv_s_sse2
-
-#endif
-#endif
-
-#if HAVE_SSSE3
-extern prototype_build_intra_predictors(vp8_build_intra_predictors_mbuv_ssse3);
-extern prototype_build_intra_predictors(vp8_build_intra_predictors_mbuv_s_ssse3);
-
-#if !CONFIG_RUNTIME_CPU_DETECT
-#undef  vp8_recon_build_intra_predictors_mbuv
-#define vp8_recon_build_intra_predictors_mbuv vp8_build_intra_predictors_mbuv_ssse3
-
-#undef  vp8_recon_build_intra_predictors_mbuv_s
-#define vp8_recon_build_intra_predictors_mbuv_s vp8_build_intra_predictors_mbuv_s_ssse3
-
 #endif
 #endif
 #endif
--- a/vp8/common/x86/subpixel_mmx.asm
+++ b/vp8/common/x86/subpixel_mmx.asm
@@ -50,7 +50,7 @@ sym(vp8_filter_block1d_h6_mmx):
        movsxd      rax,    dword ptr arg(5) ;output_width      ; destination pitch?
        pxor        mm0,    mm0              ; mm0 = 00000000

-.nextrow:
+nextrow:
        movq        mm3,    [rsi-2]          ; mm3 = p-2..p5
        movq        mm4,    mm3              ; mm4 = p-2..p5
        psrlq       mm3,    8                ; mm3 = p-1..p5
@@ -102,7 +102,7 @@ sym(vp8_filter_block1d_h6_mmx):
 %endif

        dec         rcx                      ; decrement count
-        jnz         .nextrow                 ; next row
+        jnz         nextrow                  ; next row

    ; begin epilog
    pop rdi
@@ -152,7 +152,7 @@ sym(vp8_filter_block1dc_v6_mmx):
        pxor        mm0, mm0              ; mm0 = 00000000


-.nextrow_cv:
+nextrow_cv:
        movq        mm3, [rsi+rdx]        ; mm3 = p0..p8  = row -1
        pmullw      mm3, mm1              ; mm3 *= kernel 1 modifiers.

@@ -190,7 +190,7 @@ sym(vp8_filter_block1dc_v6_mmx):
        ; avoidable!!!.
        lea         rdi,  [rdi+rax] ;
        dec         rcx                   ; decrement count
-        jnz         .nextrow_cv           ; next row
+        jnz         nextrow_cv             ; next row

        pop         rbx

@@ -282,7 +282,7 @@ sym(vp8_bilinear_predict8x8_mmx):
        packuswb    mm7,        mm4                 ;

        add         rsi,        rdx                 ; next line
-.next_row_8x8:
+next_row_8x8:
        movq        mm3,        [rsi]               ; xx 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14
        movq        mm4,        mm3                 ; make a copy of current line

@@ -349,7 +349,7 @@ sym(vp8_bilinear_predict8x8_mmx):
        add         rdi,        r8                  ;dst_pitch
 %endif
        cmp         rdi,        rcx                 ;
-        jne         .next_row_8x8
+        jne         next_row_8x8

    ; begin epilog
    pop rdi
@@ -437,7 +437,7 @@ sym(vp8_bilinear_predict8x4_mmx):
        packuswb    mm7,        mm4                 ;

        add         rsi,        rdx                 ; next line
-.next_row_8x4:
+next_row_8x4:
        movq        mm3,        [rsi]               ; xx 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14
        movq        mm4,        mm3                 ; make a copy of current line

@@ -504,7 +504,7 @@ sym(vp8_bilinear_predict8x4_mmx):
        add         rdi,        r8
 %endif
        cmp         rdi,        rcx                 ;
-        jne         .next_row_8x4
+        jne         next_row_8x4

    ; begin epilog
    pop rdi
@@ -579,7 +579,7 @@ sym(vp8_bilinear_predict4x4_mmx):
        packuswb    mm7,        mm0                 ;

        add         rsi,        rdx                 ; next line
-.next_row_4x4:
+next_row_4x4:
        movd        mm3,        [rsi]               ; xx 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14
        punpcklbw   mm3,        mm0                 ; xx 00 01 02 03 04 05 06

@@ -622,7 +622,7 @@ sym(vp8_bilinear_predict4x4_mmx):
 %endif

        cmp         rdi,        rcx                 ;
-        jne         .next_row_4x4
+        jne         next_row_4x4

    ; begin epilog
    pop rdi
Author	SHA1	Message	Date
John Koleszar	744a58bc1c	vpx_codec_dec_init: check that the iface is a decoder Make sure the given interface is actually a decoder interface before initializing it. Change-Id: Ie48d737f2956cc2f0891666de5ea87251e96bc49	2011-03-24 15:05:10 +02:00
John Koleszar	86b5556f5a	Remove unused vp8_get4x4sse_cs_mmx declaration This declaration did not match the prototype_sad() prototype, but was unused in this translation unit, so it is removed instead. Fixes issue 290. Change-Id: I168854f88a85f73ca9aaf61d1e5dc0f43fc3fdb3	2011-03-24 15:05:10 +02:00
John Koleszar	4375b4ac39	Allow specifying --end-usage by enum name Map an enum to the --end-usage values, so you can specify --end-usage=cq instead of --end-usage=2. The numerical values still work for historical scripts, etc, but this is more user friendly. Change-Id: I445ecd9638f801f5924a71eabf449bee293cdd34	2011-03-24 15:05:10 +02:00
Tero Rintaluoma	71595edd47	ARMv6 optimized fdct4x4 Optimized fdct4x4 (8x4) for ARMv6 instruction set. - No interlocks in Cortex-A8 pipeline - One interlock cycle in ARM11 pipeline - About 2.16 times faster than current C-code compiled with -O3 Change-Id: I60484ecd144365da45bb68a960d30196b59952b8	2011-03-24 15:05:10 +02:00
Attila Nagy	848dddee15	Fix multithreaded encoding for 1 MB wide frame Thread synchronization was not correct when frame width was 1 MB. Number of allocated encoding threads is limited by the sync_range. There is no point having more because each thread lags sync_range MBs behind the thread processing the row above. http://code.google.com/p/webm/issues/detail?id=302 Change-Id: Icaf67a883beecc5ebf2f11e9be47b6997fdf6f26	2011-03-24 15:05:09 +02:00
John Koleszar	f1ba70e199	Increase static linkage, remove unused functions A large number of functions were defined with external linkage, even though they were only used from within one file. This patch changes their linkage to static and removes the vp8_ prefix from their names, which should make it more obvious to the reader that the function is contained within the current translation unit. Functions that were not referenced were removed. These symbols were identified by: $ nm -A libvpx.a | sort -k3 | uniq -c -f2 | grep ' [A-Z] ' | sort | grep '^ *1 ' Change-Id: I59609f58ab65312012c047036ae1e0634f795779	2011-03-24 15:05:09 +02:00
Attila Nagy	a22df2e29d	use semaphore for partition thread synch Change-Id: If368371097d93614ae497d99be2d39c7b0eb5f47	2011-03-18 13:25:51 +02:00
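
Reviewer note on commit 4375b4ac39 in the table above: it describes accepting `--end-usage=cq` as well as `--end-usage=2`. A minimal sketch of that name-or-number parsing follows. The table and `parse_enum_or_int` are hypothetical and are not the actual vpxenc code; only the `cq` = 2 pairing comes from the commit message, and the other two entries are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical name/value table. Only "cq" = 2 is stated in the commit
 * message; the other entries are assumed for illustration. */
struct named_value { const char *name; int value; };

static const struct named_value end_usage_names[] = {
    {"vbr", 0}, /* assumed */
    {"cbr", 1}, /* assumed */
    {"cq",  2},
};

/* Parse an option argument as either a known name or a decimal number.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
static int parse_enum_or_int(const char *arg, int *out) {
    size_t n = sizeof(end_usage_names) / sizeof(end_usage_names[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(arg, end_usage_names[i].name) == 0) {
            *out = end_usage_names[i].value;
            return 0;
        }
    }
    char *end;
    long v = strtol(arg, &end, 10);
    if (end == arg || *end != '\0')
        return -1; /* neither a known name nor a complete number */
    *out = (int)v;
    return 0;
}
```

Trying the names first and falling back to `strtol` keeps existing numeric scripts working, which is the compatibility point the commit message makes.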