generic-library/vpx

Author	SHA1	Message	Date
Jim Bankoski	c3809f3de5	Begin to restrict x86inc.asm usage Chromium does not support 32bit builds for Mac which use x86inc.asm. Make the files which include it work if 64bit or not PIC enabled starting with vp9_copy_sse2.asm Consolidate these targets in vp9_rtcd_defs.sh Change-Id: If18f0b957a611efd085a3ee7d245cf1eb91e8248	2013-08-05 12:07:30 -07:00
Dmitry Kovalev	5d86f3886d	Moving struct loop_filter_info from .h to .c file. Change-Id: I3fe90eb40088a5b07bdc7d66d93ffe6ef99943d5	2013-08-02 11:53:49 -07:00
Mans Rullgard	d85ae87183	vp9: neon: add vp9_mb_lpf_* functions Change-Id: I13e0880df234f15abc4cc7c57fe84488d5d46a75	2013-08-02 08:10:50 -07:00
Jingning Han	67719abde1	Remove unused vp9_short_idct10_32x32_add The inverse 32x32 transform detects all zero entries and skips the computations accordingly per 8 rows in the first 1-D operation. The function vp9_short_idct10_32x32_add performs differently and is not used anywhere, hence removed. Change-Id: Ic4fad422debbde7b6b6ffed47c69fbd4268a906c	2013-08-01 12:45:16 -07:00
Jingning Han	a7c4de22e1	16x16 inverse 2D-DCT with DC only This commit provides special handle on 16x16 inverse 2D-DCT, where only DC coefficient is quantized to be non-zero value. Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c	2013-07-29 14:45:53 -07:00
Ronald S. Bultje	6f3054b65d	Merge "d45 intra prediction SSSE3 optimizations."	2013-07-26 17:21:09 -07:00
Jingning Han	325e0aa650	Special handle on DC only inverse 8x8 2D-DCT This commit enables a special handle for the 8x8 inverse 2D-DCT, where only DC coefficient is quantized to be non-zero. For bus_cif at 2000 kbps, it provides about 1% speed-up at speed 0. Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011	2013-07-26 14:16:51 -07:00
Ronald S. Bultje	94b0c6791d	d45 intra prediction SSSE3 optimizations. Change-Id: Ie48035ff4f93c41f8a9b3023e6444fd10432d8fb	2013-07-26 13:30:02 -07:00
Jingning Han	384e37e32b	SSE2 inverse 4x4 2D-DCT with DC only Add SSE2 implementation to handle the special case of inverse 2D-DCT where only DC coefficient is non-zero. Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f	2013-07-24 23:19:56 -07:00
Jingning Han	d2de1ca37b	Merge vp9_dc_only_idct_add and vp9_short_idct4x4_1 They share the same functionality, so merging together. Change-Id: I98a0386fcee052cb854f9ff90c283c1b844bcb79	2013-07-24 16:51:15 -07:00
hkuang	d757de744c	Add neon optimize vp9_short_idct8x8_add. Change-Id: Ic32acf3e2939c6d12d9c2bf192a5f5da59705fda	2013-07-18 16:40:41 -07:00
Johann	9ca66ec050	Merge "vp9_convolve8_neon placeholder"	2013-07-17 10:09:00 -07:00
Johann	59dc4e9cdd	vp9_convolve8_neon placeholder Call the individually optimized horizontal and vertical functions. This implementation abuses the temp buffer. This will be replaced with a custom optimized function. Over 2x speedup. Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd	2013-07-17 08:39:27 -07:00
Jingning Han	d05f66aa10	SSE2 16x16 inverse ADST/DCT hybrid transform This commit enables SSE2 implementation of 16x16 inverse ADST/DCT hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles. This provides about 1% encoding speed-up at speed 0. Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b	2013-07-16 12:51:42 -07:00
Jingning Han	5851904744	Merge "SSE2 8x8 inverse ADST/DCT transform"	2013-07-16 11:00:11 -07:00
Jingning Han	91365addf8	SSE2 8x8 inverse ADST/DCT transform This commit enables SSE2 implementation of 8x8 inverse ADST/DCT transform. The runtime goes from 1216 cycles -> 266 cycles. For bus_cif at 2000 kbps, the overall runtime reduces from 253707ms -> 248430ms, i.e., 2% speed-up at speed 0. Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb	2013-07-12 21:03:16 -07:00
Johann	a15bebfc0a	vp9_convolve8_[horiz|vert]_avg Super basic conversion from the other implementations. Any changes to one should be trivial to copy over keep in sync. Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8	2013-07-12 16:21:33 -07:00
Jingning Han	dac5891a1a	Merge "SSE2 4x4 inverse ADST/DCT transform"	2013-07-11 14:17:23 -07:00
Johann	158c80cbb0	convolve8 optimizations for neon Independent horizontal and vertical implementations. Requires that blocks be built from 4x4 and [xy]_step_q4 == 16 6-10% improvement. CIF improved the least. Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda	2013-07-11 11:08:19 -07:00
hkuang	c9b25dcae4	Add neon optimize vp9_dc_only_idct_add. Change-Id: Iae84ab945cc9662a0ddd839aa2b9ca59f2ae5423	2013-07-11 10:30:47 -07:00
Jim Bankoski	5000cdf0ff	Merge "Wide loopfilter 16 pix at a time"	2013-07-11 06:44:02 -07:00
Jingning Han	49b6302044	SSE2 4x4 inverse ADST/DCT transform Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from 292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the overall runtime of speed 0 goes from 301s to 295s (2% speed-up). Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628	2013-07-10 20:16:02 -07:00
Ronald S. Bultje	decead7336	Replace copy_memNxM functions with a generic copy/avg function. Change-Id: I3ce849452ed4f08527de9565a9914d5ee36170aa	2013-07-10 18:27:24 -07:00
John Koleszar	64f7a4d8cb	Wide loopfilter 16 pix at a time Where possible, do the 16 pixel wide filter while doing the horizontal filtering pass. The same approach can be taken for the mbloop_filter when that's implemented. Doing so on the vertical pass is a little more involved, but possible. Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320	2013-07-10 16:32:44 -07:00
Ronald S. Bultje	e6f955251f	Merge "SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction."	2013-07-10 14:52:23 -07:00
Ronald S. Bultje	6a60249071	Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction."	2013-07-10 14:52:19 -07:00
Jingning Han	114423538f	SSE2 16x16 ADST/DCT hybrid transform This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2 operations. It reduces the runtime from 5433 cycles to 1621 cycles, at no compression performance loss. Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230	2013-07-10 12:14:53 -07:00
Ronald S. Bultje	7fd643264a	SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction. Change-Id: Iad70966b986f65259329070e258f76ef0af816b4	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	8dade638a1	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction. Change-Id: I3441c059214c2956e8261331bbf521525a617a86	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	75b33c68c7	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 V intra prediction. Change-Id: I55a6cfa2daba738cbc0c4a02f806893f7e556997	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	92c5d3665d	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 DC intra prediction. Change-Id: Ibe1690afc5459f3b3beca401e7734fcd03da6dd0	2013-07-10 09:28:03 -07:00
Frank Galligan	198fa6d0a0	Add Neon horizontal and vertical vp9_mbloop_filter - The vp9 mbfilter C code will branch on flat and mask. This CL will perform both branches and combine the data. A later CL will perform a check to see if all patch will take one branch. - These functions are about 1.75 times faster than the C code on Nexus 7. PS #3 - Changed all functions to dub limit, blimit, and thresh from vld {dx[]}, freeing up r4-r6. - Changed code to use vbif to reduce one instruction and free up a d register. Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777	2013-07-09 12:40:05 -07:00
Ronald S. Bultje	8350e7fe38	Make intra prediction pointers RTCD-based. This probably has a mildly negative impact on performance, but will (in future commits - or possibly merged with this one) allow SIMD implementations of individual intra prediction functions. We may perhaps want to consider having separate functions per txfm-size also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for each intra prediction mode), but I haven't played much with that yet. Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269	2013-07-08 17:25:51 -07:00
Ronald S. Bultje	c8defcfdee	Update quantize SSSE3 SIMD to cover 32x32 transform case also. Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87	2013-07-01 11:36:33 -07:00
Ronald S. Bultje	7353ceab9d	Quantize (64-bit only, for now) SSSE3 SIMD. Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only, it needs some minor modifications to be 32bit compatible, because it uses 15 xmm registers, whereas 32bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904	2013-07-01 11:36:07 -07:00
Jingning Han	993942ce0c	Merge "Enable SSE2 4x4 ADST/DCT transform"	2013-06-29 15:57:04 -07:00
Christian Duvivier	466e0cf303	SSE2 version of vp9_short_fdct32x32_rd. 43,000 -> 5,750 cycles, about 7.5x faster. Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0	2013-06-29 13:53:00 -07:00
chm	a83cfd4da1	add Neon optimized add constant residual functions - Add add_constant_residual_8x8 16x16 32x32 functions - Tested under RealView debugger environment Change-Id: I5c3a432f651b49bf375de6496353706a33e3e68e	2013-06-28 19:06:51 -07:00
Jingning Han	1109b6b888	Enable SSE2 4x4 ADST/DCT transform This commit enables SSE2 4x4 forward hybrid transform. The runtime goes from 249 cycles down to 74 cycles. Overall around 2% speed-up at no compression performance change. Change-Id: Iad4d526346e05c7be896466c05500711bb763660	2013-06-28 17:24:43 -07:00
Ronald S. Bultje	af660715c0	Make coefficient skip condition an explicit RD choice. This commit replaces zrun_zbin_boost, a method of biasing non-zero coefficients following runs of zero-coefficients to be rounded towards zero, with an explicit skip-block choice in the RD loop. The logic is basically that if individual coefficients should be rounded towards zero (from a RD point of view), the trellis/optimize loop should take care of it. If whole blocks should be zero (from a RD point of view), a single RD check is much more efficient than a complete serialization of the quantization loop. Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim. SIMD for quantize will follow in a separate patch. Results for other test sets pending. Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4	2013-06-28 10:28:49 -07:00
Frank Galligan	1d6dc1b702	Add Neon optimized loop filter functions. - Added vp9_loop_filter_horizontal_edge_neon and vp9_loop_filter_vertical_edge_neon. - The functions are based off the vp8 loopfilter functions. - Matches x86 md5 checksum. Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0	2013-06-27 16:14:45 -07:00
Jingning Han	3cc8c8c3a0	Merge "Refactor intra predictor block"	2013-06-25 19:46:55 -07:00
Jingning Han	d19ea3861d	Refactor intra predictor block Remove vp9_intra4x4_predict(). Use the common intra prediction function for all block sizes. Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560	2013-06-25 16:33:13 -07:00
Ronald S. Bultje	c24d922396	Add averaging-SAD functions for 8-point comp-inter motion search. Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc the variance of the averaging predictor. This is slightly suboptimal because the function is subpixel-position-aware, but it will (at least for the SSE2 version) not actually use a bilinear filter for a full-pixel position, thus leading to approximately the same performance compared to if we implemented an actual average-aware full-pixel variance function. That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus leading to a total gain of 2.7%. Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd	2013-06-25 12:57:28 -07:00
Yaowu Xu	b9c934df8e	Merge "Enable sse2 implementation of 8x8 ADST/DCT"	2013-06-25 09:13:22 -07:00
Jingning Han	a32a086d23	Enable sse2 implementation of 8x8 ADST/DCT This commit makes use of the butterfly structure to enable the sse2 version implementation of 8x8 ADST/DCT hybrid transform coding. The runtime of hybrid transform module goes down from 1170 cycles to 245 cycles. Overall speed-up around 1.5%. Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f	2013-06-24 18:41:33 -07:00
John Koleszar	ece724ae16	Merge "Remove unused vp9_build_intra_predictors_sb{y,uv}_s"	2013-06-24 15:08:58 -07:00
John Koleszar	9e7019f7df	Remove unused vp9_build_intra_predictors_sb{y,uv}_s The functions no longer referenced. Change-Id: If2705dfbc607f79ec8ec2242d5e03bec27a35aaf	2013-06-21 16:10:05 -07:00
Ronald S. Bultje	54b2a59623	Implement SSE2 block_error. Change vp9_block_error() to return a 64bit error variable, change all callers to expect a 64bit return value (this will prevent overflows, which we basically don't check for at all right now). Remove duplicate block_error() function, which fixed that through truncation. Remove old (incompatible) mmx/sse2 block_error SIMD versions and replace with a new one that returns a 64bit value. Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to 3min23, i.e. a 3% overall speedup. Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68	2013-06-21 12:54:52 -07:00
Ronald S. Bultje	25c588b1e4	Add subtract_block SSE2 version and unit test. 3% faster overall (3min35.0 to 3min28.5). Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e	2013-06-21 09:35:37 -07:00

202 Commits