Linfeng Zhang
a76b6b232c
Update load_input_data() in x86
...
Split to load_input_data4() and load_input_data8().
Use pack with signed saturation instruction for high bitdepth.
Change-Id: Icda3e0129a6fdb4a51d1cafbdc652ae3a65f4e06
2017-06-26 13:38:33 -07:00
Linfeng Zhang
8253a27904
Add vpx_highbd_idct4x4_16_add_sse4_1()
...
BUG=webm:1412
Change-Id: Ie33482409351a01be4e89466b0441834eb1e905a
2017-06-23 14:30:12 -07:00
Linfeng Zhang
b8a4b5dd8d
Cosmetics, 8x8 idct SSE2 optimization
...
Change-Id: Id21fa94fd323e36cd19a2d890bf4a0cafb7d964d
2017-06-23 14:30:12 -07:00
James Zern
88a302e743
Merge changes from topic 'missing-proto'
...
* changes:
onyxd_int.h: add missing prototypes
onyxd.h: add vp8dx_references_buffer prototype
vp[89],vpx_dsp: add missing includes
vp8,encodeframe.h: correct prototypes
vp8: add temporal_filter.h
add picklpf.h
add ethreading.h
vp8,bitstream.h: add missing prototypes
vp8: remove vp8_fast_quantize_b_mmx
vp8,loopfilter_filters: make some functions static
vp9_ratectrl: make adjust_gf_boost_lag_one_pass_vbr static
vp9_encodeframe: make scale_part_thresh_sumdiff static
vp9_alt_ref_aq: correct vp9_alt_ref_aq_create proto
tiny_ssim: make some functions static
2017-06-23 05:44:24 +00:00
Johann Koenig
794a5ad713
Merge "fdct32x32 neon implementation"
2017-06-23 01:58:00 +00:00
Johann
e67660cf37
fdct32x32 neon implementation
...
Almost 3x faster in constrained loop testing. Over 10x faster in HBD
builds.
BUG=webm:1424
Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9
2017-06-22 06:40:17 -07:00
James Zern
44418c659f
vp[89],vpx_dsp: add missing includes
...
quiets -Wmissing-prototypes
Change-Id: I841cfc019d592f2bc6b3fec5818051a31f4c53b5
2017-06-21 19:00:15 -07:00
Linfeng Zhang
466b667ff3
Clean vpx_idct16x16_256_add_sse2()
...
Remove macro IDCT16 which is redundant with idct16_8col().
Change-Id: I783c5f4fda038a22d5ee5c2b22e8c2cdfb38432c
2017-06-21 13:47:15 -07:00
Linfeng Zhang
42522ce0b7
Update vpx_idct{8x8,16x16,32x32}_1_add_sse2()
...
Change-Id: I365f8e53d9ccd028cef0f561d4de9e5916278609
2017-06-21 13:47:05 -07:00
Linfeng Zhang
2b43a1ee18
Clean 32x32 full idct sse2 and ssse3 code
...
vpx_idct32x32_1024_add_ssse3() is actually a sse2 function and faster
than vpx_idct32x32_1024_add_sse2(). Replace the slow one. All are
code relocations, no new code.
Change-Id: I5dac0e98cc411a4ce05660406921118986638d19
2017-06-21 13:46:49 -07:00
Linfeng Zhang
c7e4917e97
Clean 8x8 idct x86 optimization
...
Create load_buffer_8x8() and write_buffer_8x8().
Change-Id: Ib26dd515d734a5402971c91de336ab481b213fdf
2017-06-15 14:30:00 -07:00
Linfeng Zhang
98967645a1
Remove vpx_idct8x8_64_add_ssse3()
...
It's almost identical with vpx_idct8x8_64_add_sse2(), except little
difference in instructions order.
Change-Id: Ie60dabc35eaa6ebae7c755e6cff00a710aad284f
2017-06-15 14:09:33 -07:00
Linfeng Zhang
6da6a23291
Update high bitdepth load_input_data() in x86
...
BUG=webm:1412
Change-Id: Ibf9d120b80c7d3a7637e79e123cf2f0aae6dd78c
2017-06-13 16:53:53 -07:00
Linfeng Zhang
d6eeef9ee6
Clean array_transpose_{4X8,16x16,16x16_2) in x86
...
Change-Id: I341399ecbde37065375ea7e63511a26bfc285ea0
2017-06-13 16:50:44 -07:00
Linfeng Zhang
9c72e85e4c
Remove array_transpose_8x8() in x86
...
Duplicate of transpose_16bit_8x8()
Change-Id: Iaa5dd63b5cccb044974a65af22c90e13418e311f
2017-06-13 16:50:44 -07:00
Linfeng Zhang
cbb991b6b8
Convert 8x8 idct x86 macros to inline functions
...
Change-Id: Id59865fd6c453a24121ce7160048d67875fc67ce
2017-06-13 16:50:43 -07:00
Jerome Jiang
943f9ee25c
Merge "Merge skin detection code in vp8/9."
2017-06-08 16:36:00 +00:00
Johann Koenig
903375a48a
Merge "fdct16x16 neon optimization"
2017-06-08 15:19:36 +00:00
Jerome Jiang
658e854252
Merge skin detection code in vp8/9.
...
BUG=webm:1438
Change-Id: Ie3dc034c7dbb498a0b088a767b1936ddeed4df14
2017-06-07 21:20:34 -07:00
Johann
eae7cf2368
fdct16x16 neon optimization
...
Roughly 2x speedup. Since the only change for HBD is to store(), the
improvement appears to hold there as well.
BUG=webm:1424
Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19
2017-06-07 14:59:55 -07:00
James Zern
ff42e04f9c
Merge "ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64}"
2017-06-06 23:52:39 +00:00
James Zern
4753c23983
Merge "ppc: Add vpx_sad64/32/16x64/32/16_avg_vsx"
2017-06-06 02:19:41 +00:00
Johann Koenig
755b3daf90
Merge "comp_avg_pred neon: used by sub pixel avg variance"
2017-05-31 18:17:28 +00:00
Linfeng Zhang
30ea3ef283
Merge "Update vpx_highbd_idct4x4_16_add_sse2()"
2017-05-31 15:56:20 +00:00
Johann
f695b30ac2
comp_avg_pred neon: used by sub pixel avg variance
...
BUG=webm:1423
Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804
2017-05-30 22:47:34 +00:00
Linfeng Zhang
45048dc9dc
Update vpx_highbd_idct4x4_16_add_sse2()
...
BUG=webm:1412
Change-Id: I26e4b34ae9bc1ae80c24f56d740d737a95f1ab84
2017-05-30 09:25:30 -07:00
Johann Koenig
b9649d2407
Merge "comp_avg_pred: alignment"
2017-05-30 16:21:05 +00:00
Johann
ea8b4a450d
comp_avg_pred: alignment
...
x86 requires 16 byte alignment for some vector loads/stores.
arm does not have the same requirement.
The asserts are still in avg_pred_sse2.c. This just removes them from
the common code.
Change-Id: Ic5175c607a94d2abf0b80d431c4e30c8a6f731b6
2017-05-30 07:46:43 -07:00
Johann
42ce25821d
remove DECLARE_ALIGNED from neon code
...
Unlike x86 neon only requires type alignment when loading into vectors.
Change-Id: I7bbbe4d51f78776e499ce137578d8c0effdbc02f
2017-05-26 10:41:57 -07:00
Johann
f3c97ed32e
subpel variance neon: reduce stack usage
...
Unlike x86, arm does not impose additional alignment restrictions on
vector loads. For incoming values to the first pass, it uses vld1_u32()
which typically does impose a 4 byte alignment. However, as the first
pass operates on user-supplied values we must prepare for unaligned
values anyway (and have, see mem_neon.h).
But for the local temporary values there is no stride and the load will
use vld1_u8 which does not require 4 byte alignment.
There are 3 temporary structures. In the C, one is uint16_t. The arm
saturates between passes but still passes tests. If this becomes an
issue new functions will be needed.
Change-Id: I3c9d4701bfeb14b77c783d0164608e621bfecfb1
2017-05-24 13:28:13 -07:00
Johann
d204c4bf01
Use vdup instead of vmov
...
Change-Id: Idb6248c1429b55176bb3e9f4e8365ea0ed2be62a
2017-05-24 11:38:15 -07:00
Johann Koenig
de1a9c77a7
Merge changes Iaab2b9a1,Idfb458d3
...
* changes:
sub pel avg variance neon: 4x block sizes
sub pel variance neon: 4x block sizes
2017-05-24 18:33:53 +00:00
Johann Koenig
b11a37f540
Merge changes I31fa6ef8,I228c6f29
...
* changes:
sub pel avg variance neon: add neon optimizations
sub pel variance neon: normalize variable names
2017-05-24 18:32:02 +00:00
Alexandra Hájková
8bf6eaf433
ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64}
...
Change-Id: I547d0099e15591655eae954e3ce65fdf3b003123
2017-05-24 13:27:09 +00:00
Linfeng Zhang
6444958f62
Update inv_txfm_sse2.h and inv_txfm_sse2.c
...
Extract shared code into inline functions.
Change-Id: Iee1e5a4bc6396aeed0d301163095c9b21aa66b2f
2017-05-23 14:54:46 -07:00
Johann
f6fcd3410d
sub pel avg variance neon: 4x block sizes
...
BUG=webm:1423
Change-Id: Iaab2b9a183fdb54aae5f717aba95d90dc36a9e3b
2017-05-22 14:40:05 -07:00
Johann
188d58eaa9
sub pel variance neon: 4x block sizes
...
Add optimizations for blocks of width 4
BUG=webm:1423
Change-Id: Idfb458d36db3014d48fbfbe7f5462aa6eb249938
2017-05-22 14:40:01 -07:00
Johann
9b0d306a2f
sub pel avg variance neon: add neon optimizations
...
These are missing an optimized version of vpx_comp_avg_pred
BUG=webm:1423
Change-Id: I31fa6ef842e98f7ff3ea079ffed51ae33178e2ed
2017-05-22 13:58:43 -07:00
Johann
e0d294c3af
sub pel variance neon: normalize variable names
...
match vpx_dsp/variance.c variable names
Change-Id: I228c6f296c183af147b079b7c8bcdf97bd09cf3a
2017-05-22 13:58:43 -07:00
Linfeng Zhang
27beada6d0
Merge "Add vpx_highbd_idct{4x4,8x8,16x16}_1_add_sse2"
2017-05-22 20:58:18 +00:00
Johann
67ac68e399
variance neon: assert overflow conditions
...
Change-Id: I12faca82d062eb33dc48dfeb39739b25112316cd
2017-05-22 11:25:06 -07:00
Linfeng Zhang
c167345ffb
Add vpx_highbd_idct{4x4,8x8,16x16}_1_add_sse2
...
BUG=webm:1412
Change-Id: Ia338a6057d36f9ed7eaa9cbd4dfbf0c3cbdc6468
2017-05-22 11:24:21 -07:00
Johann
d217c87139
neon variance: special case 4x
...
The sub pixel variance uses a temp buffer which guarantees width ==
stride. Take advantage of this with the 4x and avoid the very costly
lane loads.
Change-Id: Ia0c97eb8c29dc8dfa6e51a29dff9b75b3c6726f1
2017-05-22 10:51:31 -07:00
Johann Koenig
e7cac13016
Merge changes Ib8dd96f7,Ie9854b77
...
* changes:
neon variance: process 4x blocks
use memcpy for unaligned neon stores
2017-05-22 17:48:33 +00:00
Johann Koenig
b5055002d7
Merge "neon 4 byte helper functions"
2017-05-19 17:11:30 +00:00
Johann Koenig
3c603eadb4
Merge "neon fdct: 4x4 implementation"
2017-05-19 17:08:58 +00:00
Johann
7b742da63e
neon variance: process 4x blocks
...
Continue processing sets of 16 values. Plenty of improvement for 4x8
(doubles the speed) but only about 30% for 4x4.
BUG=webm:1422
Change-Id: Ib8dd96f75d474f0348800271d11e58356b620905
2017-05-17 17:35:01 -07:00
Johann
2057d3ef75
use memcpy for unaligned neon stores
...
Advise the compiler that the store is eventually going to a uint8_t
buffer. This helps avoid getting alignment hints which would cause the
memory access to fail.
Originally added as a workaround for clang:
https://bugs.llvm.org//show_bug.cgi?id=24421
Change-Id: Ie9854b777cfb2f4baaee66764f0e51dcb094d51e
2017-05-17 12:11:31 -07:00
Johann
105503b839
neon fdct: 4x4 implementation
...
Approximately twice as fast as C implementation.
BUG=webm:1424
Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f
2017-05-17 07:38:18 -07:00
Linfeng Zhang
18e8baa5c0
Add transpose_32bit_4x4() and rename transpose_4x4() for vpx_dsp/x86
...
Change-Id: Ib57377f6cf6573c04720d3cc5dea4285362b4220
2017-05-16 17:46:37 -07:00