Commit Graph

218 Commits

Author SHA1 Message Date
Jingning Han
0c4a4225ec Merge "Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs" 2014-06-03 16:51:39 -07:00
Jingning Han
2c1cdf69b6 Fix a potential overflow issue in inverse 16x16 full 2D-DCT
An overflow issue could potentially happen in the second round 1-D
transform of the SSSE3 full inverse 16x16 2D-DCT. This commit fixes
this issue.

Change-Id: Ia19e4888fda1cc929a28a5f89a5beec612d628dc
2014-05-29 11:46:32 -07:00
Jingning Han
6d21cbd20b Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs
This commit enables SSSE3 implementation of the inverse 2D-DCT
with only first 10 coefficients non-zero. It reduces the runtime
of SSE2 version from 745 cycles to 538 cycles, i.e., 27% speed-up.

Change-Id: I18ba4128859b09c704a6ee361d69a86c09fe8dfe
2014-05-28 10:53:33 -07:00
Jingning Han
239e68ddbf Fix compiling error in MSVS
Need to include math.h before tmmintrin.h in some versions of MSVS.

Change-Id: Ia6b83ae599316887ecf30c4e4b9e4355fb8a4219
2014-05-27 15:58:47 -07:00
Yunqing Wang
a591ac9e5a Merge "Fix decoder mismatch in sub-pixel AVX2 intrinsic filters" 2014-05-27 10:52:16 -07:00
levytamar82
773596050f Fix decoder mismatch in sub-pixel AVX2 intrinsic filters
The subpixel SSSE3 was fixed in this patch:
https://gerrit.chromium.org/gerrit/#/c/70283/
So the equivalent AVX2 is fixed accordingly.

Change-Id: Ieebbc1949c99d34b12b8b47692df71aca5001f3a
2014-05-23 16:48:40 -07:00
Jingning Han
59c3f446fe Merge "Inverse 16x16 2D-DCT SSSE3 implementation" 2014-05-23 16:01:22 -07:00
Jingning Han
48b0891370 Inverse 16x16 2D-DCT SSSE3 implementation
This commit enables the SSSE3 implementation of full inverse 16x16
2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles,
about 7% speed-up.

Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d
2014-05-23 15:09:35 -07:00
Yunqing Wang
c5443fc881 Fix decoder mismatch in sub-pixel SSSE3 intrinsic filters
In 8-tap filtering, to guarantee the intermediate results fit in
16 bits, the order of accumulating the products needs to be done
correctly, and the largest product should be added last. This
patch fixed the problem using the method in commit "Correct ssse3
8/16-pixel wide sub-pixel filter calculation".

Change-Id: I79d0ad60c057b15011ece84cda9648eee0809423
2014-05-23 11:52:20 -07:00
Yaowu Xu
9410330893 Merge "change to use assembly version of ssse3 filter code" 2014-05-23 08:02:28 -07:00
Yaowu Xu
7a0c9b82f2 change to use assembly version of ssse3 filter code
As mismatchs were found  between the intrinsic version and c only. The
commit temporarily revert to use the matching assembly version to
allow further investigation.

Change-Id: I08436c47d4888b562c0eac8e8856d90a831442df
2014-05-22 17:11:57 -07:00
Yunqing Wang
aaf204e550 Merge "Fix a decoding mismatch in sub-pixel filters" 2014-05-22 17:09:14 -07:00
Yunqing Wang
efcdf946ed Fix a decoding mismatch in sub-pixel filters
This did the same correction as the one in commit "Correct ssse3
8/16-pixel wide sub-pixel filter calculation" to avoid saturation
during filtering.

Change-Id: Ife9aa3f62daf9114eb24fe38f7baa3c3f361b2d6
2014-05-22 15:42:13 -07:00
Deb Mukherjee
e272273443 Renames x86_64 specific asm files
Renames all x86_64 specific assembly files to consistently
end in _x86_64.asm. This will be useful for build systems to
handle these files differently.
All new 64-bit specific assembly files should use the new
naming convention.

Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536
2014-05-21 13:55:56 -07:00
Jingning Han
41a350a83d Change eob threshold for partial inverse 8x8 2D-DCT to 12
The scanning order has the first 12 coefficients of the 8x8 2D-DCT
sitting in the top left 4x4 block. Hence the partial inverse 8x8
2D-DCT allows to handle cases with eob below 12.

The overall runtime of the inverse 8x8 2D-DCT unit is reduced from
166 cycles (using SSE2) to 150 cycles (using SSSE3).

Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
2014-05-08 09:48:58 -07:00
Jingning Han
9e7b09bc5d SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zero
This commit enables ssse3 assembly implementation of the 8x8
inverse 2D-DCT with only first 10 coefficients non-zero. The
average runtime for this unit goes down from 198 cycles to 129
cycles (34.8% faster).

Change-Id: Ie7fa4386f6d3a2fe0d47a2eb26fc2a6bbc592ac7
2014-05-07 17:40:02 -07:00
Jingning Han
52ae97b6aa SSSE3 implementation of full inverse 8x8 2D-DCT
This commit enables SSSE3 version full inverse 8x8 2D-DCT and
reconstruction. It makes the runtime of vp9_idct8x8_64_add down
from 256 cycles (SSE2) to 246 cycles.

Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
2014-05-05 10:49:27 -07:00
Yunqing Wang
23ccf71924 Merge "Fix encoder uninitialized read errors reported by drmemory" 2014-04-10 09:45:08 -07:00
Yunqing Wang
3a6670fcf8 Fix encoder uninitialized read errors reported by drmemory
This patch fixed the uninitialized read errors in Issue 748:
"dr memory VP9 encode errors". In vp9_convolve_avg_sse2,
when width is 4, pavgb reads 8 bytes from dst buffer that is
out of range. An error is reported although the data is not
actually used later. This issue was resolved by preventing
uninitialized reads.

Change-Id: I109a54910aa47139cb13119de86f2062cff207df
2014-04-09 09:59:15 -07:00
Tom Finegan
f600b50a6e Fix avx builds on macosx with clang 5.0.
The macosx release of clang v5.0 identifies itself as:
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)

This version of clang uses the older _mm_broadcastsi128_si256, like
v3.3, as given away in the LLVM svn version above.

Change-Id: I4d6d59d5454efd57d2ae9e75f5eb7486af7cbd0c
2014-04-08 18:56:03 -07:00
James Zern
caecedc92f vp9_subpixel_8t_intrin_avx2: fix build w/clang 3.4+
clang reports gcc-4.2.1 in e.g., 3.3, 3.4; add a specific clang version
check for _mm256_broadcastsi128_si256

fixes issue #720

Change-Id: I5c8e3c27fdea05d8a5b050e8cb74894b595f4709
2014-03-06 10:55:44 -08:00
James Zern
e2f614be53 Merge "vp9_subpixel_8t_intrin_ssse3.c: make some tables static" 2014-02-20 16:02:16 -08:00
James Zern
d73d621e5d vp9_subpixel_8t_intrin_ssse3.c: make some tables static
+ fix formatting

Change-Id: I344d4de089d03e403f0c7b3e64aeb7086cce86ac
2014-02-18 20:42:00 -08:00
James Zern
a96af49bab vp9_subpixel_8t_intrin_avx2.c: make some tables static
+ fix formatting

Change-Id: Ia62610bff3d63855104366d7860749b6a3cf4577
2014-02-18 20:40:40 -08:00
Yunqing Wang
0cc71c9c9f Merge "SSSE3 convolution optimization" 2014-02-18 12:55:34 -08:00
Yaowu Xu
ecf392a155 Merge "minor spelling cleanup in comments" 2014-02-14 14:29:35 -08:00
levytamar82
3068d7d944 SSSE3 convolution optimization
Optimizing all SSSE3 assembly for convolution:
1. vp9_filter_block1d4_h8_sse2
2. vp9_filter_block1d8_h8_sse2
3. vp9_filter_block1d16_h8_sse2
4. vp9_filter_block1d4_v8_sse2
5. vp9_filter_block1d8_v8_sse2
6. vp9_filter_block1d16_v8_sse2
my optimization include:
-processing 2x8 elements in one 128 bit register instead of processing
8 elements in one 128 bit register.
-removing unecessary loads.
This optimization gives between 2.4% user level gain for 480p input
and 1.6% user level gain for 720p.
This Optimization is done only for 64 bit

Change-Id: Ic07fce2f9360329b4f2d956efda1480ae958766b
2014-02-14 15:08:42 -07:00
levytamar82
876c72a093 AVX2 Convolve Optimization
Two convolve functions were optimized for AVX2:
1. vp9_filter_block1d16_h8
2. vp9_filter_block1d16_v8
vp9_filter_block1d16_v8 was optimized for AVX2 by reducing the number of
loop strides by half, two strides were processed in parallel.
vp9_filter_block1d16_v8 was also optimized in the same way also some of the
loads were being done outside of the loop and by that preventing redundant
loads.
This Optimization gives 43% function level gain and 1.3% user level gain.
Now can be compiled in Windows

Change-Id: I2714124cfb0c14a77d7a0ce126a20db92ffbf92c
2014-02-12 20:45:31 -07:00
Andrew Russell
549c31f8ae minor spelling cleanup in comments
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-12 16:32:51 -08:00
Tom Finegan
60e91a92c3 vp9/common/x86: Silence MSVC warnings in vp9_asm_stubs.c.
Update filter_1dfunction definition to match usage.

Change-Id: Ie3cae13dc1ec3f5838c5f29d1c76a1a98a9217fa
2014-02-10 15:08:42 -08:00
Yunqing Wang
d1961e6fbf Optimize bilinear sub-pixel filters in ssse3
This patch added ssse3 optimization of bilinear sub-pixel filters.
The real time encoder was speeded up by ~1%.

Change-Id: Ie82e98976f411183cb8c61ab8d2ba0276e55a338
2014-02-04 08:01:55 -08:00
Yunqing Wang
2488cb34bc Optimize bilinear sub-pixel filters in sse2
Using bilinear filters could speed up the codec in real-time mode.
This patch added sse2 optimizations of bilinear filters that
operate on different-sized blocks.

Tests showed that the real-time encoder was speeded up by 3%.

Change-Id: If99a7ee4385fcc225c3ee7445d962d5752e57c3f
2014-02-03 10:34:45 -08:00
Yunqing Wang
3c29cbffbf Add macros for convolve functions
Added macros to reduce the code duplication.

Change-Id: I1916aa5a386ea07d961d4ec439ab09bb8c45487d
2014-01-28 18:40:23 -08:00
Dmitry Kovalev
ff41764920 Removing _1d suffix from transform names.
It is enough to specify (e.g.) idct16, it is obviously different from
idct16x16.

Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-27 16:15:36 -08:00
James Zern
0940c9cfde vp9/common: add extern "C" to headers
Change-Id: Ic334da9aee968e33762c2b25d9fbad24c844b411
2014-01-23 16:21:24 -08:00
Yunqing Wang
d2bb0c51d3 Revert "Revert "Revert "SSSE3 convolution optimization"""
This reverts commit f9404f2406.

This patch caused some ASAN error.

Change-Id: If15b7e581310e19061d111c69f2931809662ed19
2014-01-16 16:11:46 -08:00
Yunqing Wang
f9404f2406 Revert "Revert "SSSE3 convolution optimization""
This reverts commit b645257121.

Change-Id: I60d1bf57ae8e9eb6127f42f2d5a780124ac51b45
2014-01-13 12:29:55 -08:00
Paul Wilkins
b645257121 Revert "SSSE3 convolution optimization"
This reverts commit 511d218c60.

In current form intrinsics break borg build.

Change-Id: Ied37936af841250ecff449802e69a3d3761c91b9
2014-01-10 13:38:26 +00:00
Jingning Han
a4c94a94cc Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P2" 2014-01-09 18:17:25 -08:00
Jingning Han
faa2ba86cc Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P1" 2014-01-09 18:17:12 -08:00
Jingning Han
af31b27aae Optimze inv 16x16 DCT with 10 non-zero coeffs - P2
This commit further optimizes SSE2 operations in the second 1-D
inverse 16x16 DCT, with (<10) non-zero coefficients. The average
runtime of this module goes down from 779 cycles -> 725 cycles.

Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
2014-01-09 12:46:09 -08:00
Yunqing Wang
f3b9b97c0e Merge "SSSE3 convolution optimization" 2014-01-09 12:39:47 -08:00
levytamar82
511d218c60 SSSE3 convolution optimization
Optimizing all SSSE3 assembly for convolution:
1. vp9_filter_block1d4_h8_sse2
2. vp9_filter_block1d8_h8_sse2
3. vp9_filter_block1d16_h8_sse2
4. vp9_filter_block1d4_v8_sse2
5. vp9_filter_block1d8_v8_sse2
6. vp9_filter_block1d16_v8_sse2
my optimization include:
-processing 2x8 elements in one 128 bit register instead of processing
8 elements in one 128 bit register.
-removing unecessary loads.
This optimization gives between 2.4% user level gain for 480p input
and 1.6% user level gain for 720p.
This Optimization done only for 64bit.

Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969
2014-01-09 12:27:51 -07:00
Jingning Han
ba6ab46cdc Optimze inv 16x16 DCT with 10 non-zero coeffs - P1
This commit is the first patch optimizing SSE2 implementation of inverse
16x16 DCT with <10 non-zero coefficients. It focused on the first 1-D (row)
transformation. It exploits the fact that only top-left 4x4 block contains
non-zero coefficients, in a 2-D inverse 16x16 DCT with <10 coeffients.

The average runtime of idct16x16_10 unit is reduced from
883 cycles -> 779 cycles (12% faster).

For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
down from 310651 ms  -> 305910 ms. The decoding speed goes up from
80.37 fps -> 80.87 fps.

Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
2014-01-08 15:36:45 -08:00
Jingning Han
3e0c62b53f Tune IDCT8_1D macro function interface
This commit adds input/output ports for IDCT8_1D macro function to
provide more flexibility in variable use. It allows to skip several
buffer swap operations.

Change-Id: I21f3450509537322293043b3281bfd3949868677
2014-01-03 15:23:47 -08:00
Jingning Han
0b1a27135a Reduce num of buffer swap calls in idct8_1d_sse2
This commit merges the initial buffer swap operations in idct8_1d_sse2
into the array transpose step, hence reducing number of instructions
therein.

Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
2014-01-03 12:12:03 -08:00
Jingning Han
1bb11781e2 Rework idct8x8_10 SSE2 implementation
This commit optimizes the SSE2 implmentation of idct8x8_10. It exploits
the fact that only top-left 4x4 block contains non-zero coefficients,
and hence reduces the instructions needed.

The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
frames coded at 4000kbps, the average decoding speed goes up from
79.3 fps to 79.7 fps.

Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
2014-01-03 12:04:09 -08:00
Yunqing Wang
b6a0ac11f0 Merge "Code clean up" 2013-12-20 08:46:11 -08:00
Yunqing Wang
09faf55916 Code clean up
Removed unused filter coefficients.

Change-Id: Ib395a51305e23ff41ab69c1808d56946d25961cd
2013-12-19 11:09:23 -08:00
Jim Bankoski
b720ba165f rename loop filter functions
This renames all the loop filter functions so that they no
longer refer to mb

Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b
2013-12-17 17:34:34 -08:00