325 Commits

Author SHA1 Message Date
hkuang
4082bf9d7c Add neon optimize vp9_short_idct10_16x16_add.
vp9_short_idct10_16x16_add handles blocks that have valid data only in the
top-left 4x4 block; all other coefficients are 0. This lets us cut many
unnecessary calculations and save instructions.

Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
2013-08-22 15:53:22 -07:00
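
The saving described in this commit can be illustrated with a scalar sketch. This is not the NEON code from the change: the basis function, the 8x8 block size, and the names `inverse_full` and `inverse_top_left_only` are all hypothetical, chosen only to show that when only the top-left 4x4 coefficients can be non-zero, both passes of a separable inverse transform can skip the zero terms and still produce the same output.

```c
#include <stdint.h>

#define N 8   /* illustrative block size (the commit targets 16x16) */
#define NZ 4  /* only the top-left NZ x NZ coefficients may be non-zero */

/* Hypothetical integer basis; a stand-in for real DCT constants. */
static int basis(int k, int n) { return ((k * n * 7 + k + 3) % 11) - 5; }

/* Full separable inverse transform: every coefficient is visited. */
void inverse_full(const int16_t *in, int32_t *out) {
  int32_t tmp[N * N];
  /* Row pass over all N coefficient rows. */
  for (int k = 0; k < N; ++k) {
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int l = 0; l < N; ++l) sum += in[k * N + l] * basis(l, c);
      tmp[k * N + c] = sum;
    }
  }
  /* Column pass over all N intermediate rows. */
  for (int r = 0; r < N; ++r) {
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int k = 0; k < N; ++k) sum += tmp[k * N + c] * basis(k, r);
      out[r * N + c] = sum;
    }
  }
}

/* Same transform, assuming everything outside the top-left NZ x NZ is 0. */
void inverse_top_left_only(const int16_t *in, int32_t *out) {
  int32_t tmp[NZ * N];
  /* Row pass: only the first NZ rows, and only NZ terms per output. */
  for (int k = 0; k < NZ; ++k) {
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int l = 0; l < NZ; ++l) sum += in[k * N + l] * basis(l, c);
      tmp[k * N + c] = sum;
    }
  }
  /* Column pass: intermediate rows NZ..N-1 are zero, so skip them. */
  for (int r = 0; r < N; ++r) {
    for (int c = 0; c < N; ++c) {
      int32_t sum = 0;
      for (int k = 0; k < NZ; ++k) sum += tmp[k * N + c] * basis(k, r);
      out[r * N + c] = sum;
    }
  }
}
```

The reduced version is only valid when the caller can guarantee the zero pattern, which is why a separate entry point exists for that case.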
James Zern
ac12f3926b Merge "vp9 rtcd: remove non-existent sad functions" 2013-08-21 13:55:59 -07:00
James Zern
ae455fabd8 vp9 rtcd: remove non-existent sad functions
vp9_sad32x3, vp9_sad3x32

+ remove unnecessary sad include from vp9_findnearmv.c

Change-Id: Idef2a89cadc3fec64eff82ba9be60ffff50b3468
2013-08-20 18:07:53 -07:00
hkuang
37cda6dc4c Add neon optimize vp9_short_idct10_8x8_add.
vp9_short_idct10_8x8_add handles blocks that have valid data only in the
top-left 4x4 block; all other coefficients are 0. This lets us cut several
unnecessary calculations and save instructions.

Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
2013-08-20 11:51:07 -07:00
Dmitry Kovalev
1462433370 Merge "Renaming d27 predictor to d207." 2013-08-16 12:07:24 -07:00
Johann
a9aa7d07d0 Merge "vp9: neon: add vp9_convolve_avg_neon" 2013-08-15 14:55:15 -07:00
Johann
63e140eaa7 Merge "vp9: neon: add vp9_convolve_copy_neon" 2013-08-15 14:55:08 -07:00
Dmitry Kovalev
81d7bd50f5 Renaming d27 predictor to d207.
The 27-degree intra predictor is actually 207 degrees, so rename it.

Change-Id: Ife96a910437eb80ccdc0b7a5b7a62c77542ae5be
2013-08-15 11:09:49 -07:00
hkuang
39f42c8713 Merge "Add neon optimize vp9_short_idct16x16_add." 2013-08-14 14:16:20 -07:00
hkuang
cf6beea661 Add neon optimize vp9_short_idct16x16_add.
Change-Id: I27134b9a5cace2bdad53534562c91d829b48838d
2013-08-14 13:52:16 -07:00
Dmitry Kovalev
f2c073efaa Adding const to arguments of intra prediction functions.
Adding const to above and left pointers. Cleanup.

Change-Id: I51e195fa2e2923048043fe68b4e38a47ee82cda1
2013-08-14 10:35:56 -07:00
Mans Rullgard
0f1deccf86 vp9: neon: add vp9_convolve_avg_neon
Change-Id: I33cff9ac4f2234558f6f87729f9b2e88a33fbf58
2013-08-14 16:27:55 +01:00
Mans Rullgard
635ba269be vp9: neon: add vp9_convolve_copy_neon
Change-Id: I15adbbda15d1842e9f15f21878a5ffbb75c3c0c9
2013-08-14 16:27:55 +01:00
Jingning Han
78136edcdc SSE2 high precision 32x32 forward DCT
Enable the SSE2 implementation of the high precision 32x32 forward DCT.
The intermediate stacks are 32 bits wide. The run-time goes down from
32126 cycles to 13442 cycles.

Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
2013-08-12 16:52:53 -07:00
Christian Duvivier
78182538d6 Neon version of vp9_short_idct4x4_add.
Change-Id: Idec4cae0cb9b3a29835fd2750d354c1393d47aa4
2013-08-06 18:41:27 -07:00
Jim Bankoski
5b307886fb variance x86inc guards
also fixed bug in sad calcs

Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d
2013-08-06 14:17:13 -07:00
Jim Bankoski
6eb1254b88 sse3 intrapred x86inc protected
Change-Id: I4a3c83119cdf8a205920034c8019d855d5504605
2013-08-06 14:17:13 -07:00
Jim Bankoski
c9126e0b30 sad + miscellaneous updates
Enable use_x86inc as a command-line option. Fix a bug with SSE2 when
x86inc is disabled. Add SAD asm protection to the x86inc protection.

Change-Id: Iee0f9dd235ea10e8ace512eb362ba9bebe8c9df6
2013-08-06 12:16:04 -07:00
Jim Bankoski
efc94102f0 Merge "intrapred x86inc guards" 2013-08-06 10:39:19 -07:00
Jim Bankoski
25ec1375c9 intrapred x86inc guards
Change-Id: If0399d8e11f4ebe75a5c91abb8d6a52a7709065b
2013-08-06 09:39:30 -07:00
Jim Bankoski
62c6aa884d block error / x86inc mods
Change-Id: Icb607745634e10b9bac5019d06661ece09fcdb40
2013-08-06 06:23:38 -07:00
Jim Bankoski
a93b115cd6 reworked config for use_x86_inc
Support enabling it or disabling it. Moved the read out to configure.sh
so that it's done once instead of in make and in config.

Change-Id: I73a9190cf31de9f03e8a577f478fa522f8c01c8b
2013-08-05 17:35:25 -07:00
Jim Bankoski
f4837579d1 fixed script problem with config_force_x86_inc
Change-Id: I226e5094d216b09dc47fa5511a66e2d314608000
2013-08-05 14:48:20 -07:00
Jim Bankoski
c3809f3de5 Begin to restrict x86inc.asm usage
Chromium does not support 32-bit builds for Mac which use x86inc.asm.
Make the files which include it work when the build is 64-bit or PIC is
not enabled, starting with vp9_copy_sse2.asm.

Consolidate these targets in vp9_rtcd_defs.sh

Change-Id: If18f0b957a611efd085a3ee7d245cf1eb91e8248
2013-08-05 12:07:30 -07:00
Dmitry Kovalev
5d86f3886d Moving struct loop_filter_info from *.h to *.c file.
Change-Id: I3fe90eb40088a5b07bdc7d66d93ffe6ef99943d5
2013-08-02 11:53:49 -07:00
Mans Rullgard
d85ae87183 vp9: neon: add vp9_mb_lpf_* functions
Change-Id: I13e0880df234f15abc4cc7c57fe84488d5d46a75
2013-08-02 08:10:50 -07:00
Jingning Han
67719abde1 Remove unused vp9_short_idct10_32x32_add
The inverse 32x32 transform detects all zero entries and skips the
computations accordingly per 8 rows in the first 1-D operation. The
function vp9_short_idct10_32x32_add performs differently and is not
used anywhere, hence removed.

Change-Id: Ic4fad422debbde7b6b6ffed47c69fbd4268a906c
2013-08-01 12:45:16 -07:00
Jingning Han
a7c4de22e1 16x16 inverse 2D-DCT with DC only
This commit adds special handling for the 16x16 inverse 2D-DCT when
only the DC coefficient is quantized to a non-zero value.

Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
2013-07-29 14:45:53 -07:00
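
A minimal sketch of the DC-only idea, under stated assumptions: with a single non-zero DC coefficient, the inverse 2D-DCT output is one constant, so the whole transform collapses to computing that constant and adding it to each pixel. The fixed-point constants (14-bit precision, cos(pi/4) scaled to 11585) and the final shift of 6 are assumptions made for illustration, and `idct16x16_dc_only_add` is a hypothetical name rather than the function added by this commit.

```c
#include <stdint.h>

/* Assumed fixed-point setup: 14 fractional bits, cos(pi/4) * 2^14. */
#define DCT_CONST_BITS 14
#define COSPI_16_64 11585

/* Rounding right shift; relies on arithmetic shift for negative values. */
static int round_shift(int64_t x, int bits) {
  return (int)((x + ((int64_t)1 << (bits - 1))) >> bits);
}

/* Clamp to the 8-bit pixel range. */
static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* DC-only 16x16 inverse transform: add one constant to the whole block. */
void idct16x16_dc_only_add(int16_t dc, uint8_t *dest, int stride) {
  /* One scaling per 1-D pass (rows, then columns). */
  int out = round_shift((int64_t)dc * COSPI_16_64, DCT_CONST_BITS);
  out = round_shift((int64_t)out * COSPI_16_64, DCT_CONST_BITS);
  /* Assumed final normalization for a 16x16 block. */
  const int a1 = round_shift(out, 6);

  for (int r = 0; r < 16; ++r) {
    for (int c = 0; c < 16; ++c) {
      dest[r * stride + c] = clip_pixel(dest[r * stride + c] + a1);
    }
  }
}
```

Compared with a full 16x16 inverse transform, this path performs two multiplies in total plus one add and clamp per pixel.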
Ronald S. Bultje
6f3054b65d Merge "d45 intra prediction SSSE3 optimizations." 2013-07-26 17:21:09 -07:00
Jingning Han
325e0aa650 Special handle on DC only inverse 8x8 2D-DCT
This commit enables special handling for the 8x8 inverse 2D-DCT when
only the DC coefficient is quantized to a non-zero value. For bus_cif
at 2000 kbps, it provides about 1% speed-up at speed 0.

Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
2013-07-26 14:16:51 -07:00
Ronald S. Bultje
94b0c6791d d45 intra prediction SSSE3 optimizations.
Change-Id: Ie48035ff4f93c41f8a9b3023e6444fd10432d8fb
2013-07-26 13:30:02 -07:00
Jingning Han
384e37e32b SSE2 inverse 4x4 2D-DCT with DC only
Add an SSE2 implementation to handle the special case of the inverse
2D-DCT where only the DC coefficient is non-zero.

Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
2013-07-24 23:19:56 -07:00
Jingning Han
d2de1ca37b Merge vp9_dc_only_idct_add and vp9_short_idct4x4_1
They share the same functionality, so merging together.

Change-Id: I98a0386fcee052cb854f9ff90c283c1b844bcb79
2013-07-24 16:51:15 -07:00
hkuang
d757de744c Add neon optimize vp9_short_idct8x8_add.
Change-Id: Ic32acf3e2939c6d12d9c2bf192a5f5da59705fda
2013-07-18 16:40:41 -07:00
Johann
9ca66ec050 Merge "vp9_convolve8_neon placeholder" 2013-07-17 10:09:00 -07:00
Johann
59dc4e9cdd vp9_convolve8_neon placeholder
Call the individually optimized horizontal and vertical functions. This
implementation abuses the temp buffer.

This will be replaced with a custom optimized function.

Over 2x speedup.

Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
2013-07-17 08:39:27 -07:00
Jingning Han
d05f66aa10 SSE2 16x16 inverse ADST/DCT hybrid transform
This commit enables SSE2 implementation of 16x16 inverse ADST/DCT
hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles.
This provides about 1% encoding speed-up at speed 0.

Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
2013-07-16 12:51:42 -07:00
Jingning Han
5851904744 Merge "SSE2 8x8 inverse ADST/DCT transform" 2013-07-16 11:00:11 -07:00
Jingning Han
91365addf8 SSE2 8x8 inverse ADST/DCT transform
This commit enables SSE2 implementation of 8x8 inverse ADST/DCT
transform. The runtime goes from 1216 cycles -> 266 cycles.
For bus_cif at 2000 kbps, the overall runtime reduces from
253707ms -> 248430ms, i.e., 2% speed-up at speed 0.

Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
2013-07-12 21:03:16 -07:00
Johann
a15bebfc0a vp9_convolve8_[horiz|vert]_avg
Super basic conversion from the other implementations. Any changes to
one should be trivial to copy over to keep them in sync.

Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
2013-07-12 16:21:33 -07:00
Jingning Han
dac5891a1a Merge "SSE2 4x4 inverse ADST/DCT transform" 2013-07-11 14:17:23 -07:00
Johann
158c80cbb0 convolve8 optimizations for neon
Independent horizontal and vertical implementations.

Requires that blocks be built from 4x4 and [xy]_step_q4 == 16

6-10% improvement. CIF improved the least.

Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
2013-07-11 11:08:19 -07:00
hkuang
c9b25dcae4 Add neon optimize vp9_dc_only_idct_add.
Change-Id: Iae84ab945cc9662a0ddd839aa2b9ca59f2ae5423
2013-07-11 10:30:47 -07:00
Jim Bankoski
5000cdf0ff Merge "Wide loopfilter 16 pix at a time" 2013-07-11 06:44:02 -07:00
Jingning Han
49b6302044 SSE2 4x4 inverse ADST/DCT transform
Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from
292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the
overall runtime of speed 0 goes from 301s to 295s (2% speed-up).

Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628
2013-07-10 20:16:02 -07:00
Ronald S. Bultje
decead7336 Replace copy_memNxM functions with a generic copy/avg function.
Change-Id: I3ce849452ed4f08527de9565a9914d5ee36170aa
2013-07-10 18:27:24 -07:00
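
As a rough illustration of what a generic copy/avg pair might look like, the sketch below takes the block width and height as parameters instead of providing one function per fixed NxM size. The names `block_copy` and `block_avg`, their signatures, and the round-half-up averaging are assumptions for illustration, not the interface introduced by this commit.

```c
#include <stdint.h>
#include <string.h>

/* Copy a w x h block; strides are in bytes and may exceed w. */
void block_copy(const uint8_t *src, int src_stride,
                uint8_t *dst, int dst_stride, int w, int h) {
  for (int r = 0; r < h; ++r) {
    memcpy(dst, src, (size_t)w);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Average a w x h source block into dst, rounding halves up. */
void block_avg(const uint8_t *src, int src_stride,
               uint8_t *dst, int dst_stride, int w, int h) {
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      dst[c] = (uint8_t)((dst[c] + src[c] + 1) >> 1);
    }
    src += src_stride;
    dst += dst_stride;
  }
}
```

Passing the dimensions at run time trades a little loop overhead for a single code path that optimized versions can replace wholesale.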
John Koleszar
64f7a4d8cb Wide loopfilter 16 pix at a time
Where possible, do the 16 pixel wide filter while doing the horizontal
filtering pass. The same approach can be taken for the mbloop_filter
when that's implemented. Doing so on the vertical pass is a little more
involved, but possible.

Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320
2013-07-10 16:32:44 -07:00
Ronald S. Bultje
e6f955251f Merge "SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction." 2013-07-10 14:52:23 -07:00
Ronald S. Bultje
6a60249071 Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction." 2013-07-10 14:52:19 -07:00
Jingning Han
114423538f SSE2 16x16 ADST/DCT hybrid transform
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.

Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
2013-07-10 12:14:53 -07:00