generic-library/vpx

Author	SHA1	Message	Date
Jingning Han	dac5891a1a	Merge "SSE2 4x4 invserse ADST/DCT transform"	2013-07-11 14:17:23 -07:00
Johann	158c80cbb0	convolve8 optimizations for neon Independent horizontal and vertical implementations. Requires that blocks be built from 4x4 and [xy]_step_q4 == 16 6-10% improvement. CIF improved the least. Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda	2013-07-11 11:08:19 -07:00
hkuang	c9b25dcae4	Add neon optimize vp9_dc_only_idct_add. Change-Id: Iae84ab945cc9662a0ddd839aa2b9ca59f2ae5423	2013-07-11 10:30:47 -07:00
Jim Bankoski	5000cdf0ff	Merge "Wide loopfilter 16 pix at a time"	2013-07-11 06:44:02 -07:00
Jingning Han	49b6302044	SSE2 4x4 invserse ADST/DCT transform Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from 292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the overall runtime of speed 0 goes from 301s to 295s (2% speed-up). Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628	2013-07-10 20:16:02 -07:00
Ronald S. Bultje	decead7336	Replace copy_memNxM functions with a generic copy/avg function. Change-Id: I3ce849452ed4f08527de9565a9914d5ee36170aa	2013-07-10 18:27:24 -07:00
John Koleszar	64f7a4d8cb	Wide loopfilter 16 pix at a time Where possible, do the 16 pixel wide filter while doing the horizontal filtering pass. The same approach can be taken for the mbloop_filter when that's implemented. Doing so on the vertical pass is a little more involved, but possible. Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320	2013-07-10 16:32:44 -07:00
Ronald S. Bultje	e6f955251f	Merge "SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction."	2013-07-10 14:52:23 -07:00
Ronald S. Bultje	6a60249071	Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction."	2013-07-10 14:52:19 -07:00
Jingning Han	114423538f	SSE2 16x16 ADST/DCT hybrid transform This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2 operations. It reduces the runtime from 5433 cycles to 1621 cycles, at no compression performance loss. Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230	2013-07-10 12:14:53 -07:00
Ronald S. Bultje	7fd643264a	SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction. Change-Id: Iad70966b986f65259329070e258f76ef0af816b4	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	8dade638a1	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction. Change-Id: I3441c059214c2956e8261331bbf521525a617a86	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	75b33c68c7	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 V intra prediction. Change-Id: I55a6cfa2daba738cbc0c4a02f806893f7e556997	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	92c5d3665d	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 DC intra prediction. Change-Id: Ibe1690afc5459f3b3beca401e7734fcd03da6dd0	2013-07-10 09:28:03 -07:00
Frank Galligan	198fa6d0a0	Add Neon horizontal and vertical vp9_mbloop_filter - The vp9 mbfilter C code will branch on flat and mask. This CL will perform both branches and combine the data. A later CL will perform a check to see if all patch will take one branch. - These functions are about 1.75 times faster than the C code on Nexus 7. PS #3 - Changed all functions to dub limit, blimit, and thresh from vld {dx[]}, freeing up r4-r6. - Changed code to use vbif to reduce one instruction and free up a d register. Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777	2013-07-09 12:40:05 -07:00
Ronald S. Bultje	8350e7fe38	Make intra prediction pointers RTCD-based. This probably has a mildly negative impact on performance, but will (in future commits - or possibly merged with this one) allow SIMD implementations of individual intra prediction functions. We may perhaps want to consider having separate functions per txfm-size also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for each intra prediction mode), but I haven't played much with that yet. Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269	2013-07-08 17:25:51 -07:00
Ronald S. Bultje	c8defcfdee	Update quantize SSSE3 SIMD to cover 32x32 transform case also. Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87	2013-07-01 11:36:33 -07:00
Ronald S. Bultje	7353ceab9d	Quantize (64-bit only, for now) SSSE3 SIMD. Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only, it needs some minor modifications to be 32bit compatible, because it uses 15 xmm registers, whereas 32bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904	2013-07-01 11:36:07 -07:00
Jingning Han	993942ce0c	Merge "Enable SSE2 4x4 ADST/DCT transform"	2013-06-29 15:57:04 -07:00
Christian Duvivier	466e0cf303	SSE2 version of vp9_short_fdct32x32_rd. 43,000 -> 5,750 cycles, about 7.5x faster. Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0	2013-06-29 13:53:00 -07:00
chm	a83cfd4da1	add Neon optimized add constant residual functions - Add add_constant_residual_8x8 16x16 32x32 functions - Tested under RealView debugger environment Change-Id: I5c3a432f651b49bf375de6496353706a33e3e68e	2013-06-28 19:06:51 -07:00
Jingning Han	1109b6b888	Enable SSE2 4x4 ADST/DCT transform This commit enables SSE2 4x4 forward hybrid transform. The runtime goes from 249 cycles down to 74 cycles. Overall around 2% speed-up at no compression performance change. Change-Id: Iad4d526346e05c7be896466c05500711bb763660	2013-06-28 17:24:43 -07:00
Ronald S. Bultje	af660715c0	Make coefficient skip condition an explicit RD choice. This commit replaces zrun_zbin_boost, a method of biasing non-zero coefficients following runs of zero-coefficients to be rounded towards zero, with an explicit skip-block choice in the RD loop. The logic is basically that if individual coefficients should be rounded towards zero (from a RD point of view), the trellis/optimize loop should take care of it. If whole blocks should be zero (from a RD point of view), a single RD check is much more efficient than a complete serialization of the quantization loop. Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim. SIMD for quantize will follow in a separate patch. Results for other test sets pending. Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4	2013-06-28 10:28:49 -07:00
Frank Galligan	1d6dc1b702	Add Neon optimized loop filter functions. - Added vp9_loop_filter_horizontal_edge_neon and vp9_loop_filter_vertical_edge_neon. - The functions are based off the vp8 loopfilter functions. - Matches x86 md5 checksum. Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0	2013-06-27 16:14:45 -07:00
Jingning Han	3cc8c8c3a0	Merge "Refactor intra predictor block"	2013-06-25 19:46:55 -07:00
Jingning Han	d19ea3861d	Refactor intra predictor block Remove vp9_intra4x4_predict(). Use the common intra prediction function for all block sizes. Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560	2013-06-25 16:33:13 -07:00
Ronald S. Bultje	c24d922396	Add averaging-SAD functions for 8-point comp-inter motion search. Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2, i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc the variance of the averaging predictor. This is slightly suboptimal because the function is subpixel-position-aware, but it will (at least for the SSE2 version) not actually use a bilinear filter for a full-pixel position, thus leading to approximately the same performance compared to if we implemented an actual average-aware full-pixel variance function. That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus leading to a total gain of 2.7%. Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd	2013-06-25 12:57:28 -07:00
Yaowu Xu	b9c934df8e	Merge "Enable sse2 implmentation of 8x8 ADST/DCT"	2013-06-25 09:13:22 -07:00
Jingning Han	a32a086d23	Enable sse2 implmentation of 8x8 ADST/DCT This commit makes use of the butterfly structure to enable the sse2 version implementation of 8x8 ADST/DCT hybrid transform coding. The runtime of hybrid transform module goes down from 1170 cycles to 245 cycles. Overall speed-up around 1.5%. Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f	2013-06-24 18:41:33 -07:00
John Koleszar	ece724ae16	Merge "Remove unused vp9_build_intra_predictors_sb{y,uv}_s"	2013-06-24 15:08:58 -07:00
John Koleszar	9e7019f7df	Remove unused vp9_build_intra_predictors_sb{y,uv}_s The functions no longer referenced. Change-Id: If2705dfbc607f79ec8ec2242d5e03bec27a35aaf	2013-06-21 16:10:05 -07:00
Ronald S. Bultje	54b2a59623	Implement SSE2 block_error. Change vp9_block_error() to return a 64bit error variable, change all callers to expect a 64bit return value (this will prevent overflows, which we basically don't check for at all right now). Remove duplicate block_error() function, which fixed that through truncation. Remove old (incompatible) mmx/sse2 block_error SIMD versions and replace with a new one that returns a 64bit value. Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to 3min23, i.e. a 3% overall speedup. Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68	2013-06-21 12:54:52 -07:00
Ronald S. Bultje	25c588b1e4	Add subtract_block SSE2 version and unit test. 3% faster overall (3min35.0 to 3min28.5). Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e	2013-06-21 09:35:37 -07:00
Ronald S. Bultje	1e6a32f1af	SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance(). Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to 3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't perfectly interleaved, and can probably be improved further in the future. I've marked this with a few TODOs/FIXMEs in the code. Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9	2013-06-20 15:59:48 -07:00
Ronald S. Bultje	8fb6c58191	Implement sse2 and ssse3 versions for all sub_pixel_variance sizes. Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 -> 3min58). Specific changes to timings for each function compared to original assembly-optimized versions (or just new version timings if no previous assembly-optimized version was available): sse2 4x4: 99 -> 82 cycles; sse2 4x8: 128 cycles; sse2 8x4: 121 cycles; sse2 8x8: 149 -> 129 cycles; sse2 8x16: 235 -> 245 cycles (?); sse2 16x8: 269 -> 203 cycles; sse2 16x16: 441 -> 349 cycles; sse2 16x32: 641 cycles; sse2 32x16: 643 cycles; sse2 32x32: 1733 -> 1154 cycles; sse2 32x64: 2247 cycles; sse2 64x32: 2323 cycles; sse2 64x64: 6984 -> 4442 cycles; ssse3 4x4: 100 cycles (?); ssse3 4x8: 103 cycles; ssse3 8x4: 71 cycles; ssse3 8x8: 147 cycles; ssse3 8x16: 158 cycles; ssse3 16x8: 188 -> 162 cycles; ssse3 16x16: 316 -> 273 cycles; ssse3 16x32: 535 cycles; ssse3 32x16: 564 cycles; ssse3 32x32: 973 cycles; ssse3 32x64: 1930 cycles; ssse3 64x32: 1922 cycles; ssse3 64x64: 3760 cycles. Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d	2013-06-20 09:34:25 -07:00
Jingning Han	7088426976	Merge "Make fdct32 computation flow within 16bit range"	2013-06-18 11:40:14 -07:00
Jingning Han	a41a4860c0	Make fdct32 computation flow within 16bit range This commit makes use of dual fdct32x32 versions for rate-distortion optimization loop and encoding process, respectively. The one for rd loop requires only 16 bits precision for intermediate steps. The original fdct32x32 that allows higher intermediate precision (18 bits) was retained for the encoding process only. This allows speed-up for fdct32x32 in the rd loop. No performance loss observed. Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3	2013-06-18 09:46:24 -07:00
Jingning Han	0b7910b9ff	Merge "Enable sse2 version of sad8x4/4x8"	2013-06-14 13:15:49 -07:00
Jingning Han	c43af9a8a3	Enable sse2 version of sad8x4/4x8 The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1	2013-06-14 09:19:28 -07:00
Jingning Han	15f50e7b42	Enable sse2 version of sad8x4/4x8 The encoding time for bus at CIF goes from 661s to 625s. This commit also enabled unit test of sad8x4/4x8 in sad_test.cc. Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1	2013-06-13 16:18:18 -07:00
Scott LaVarnway	a81bd12a2e	Quick modifications to mb loopfilter intrinsic functions Modified to work with 8x8 blocks of memory. Will revisit later for further optimizations. For the HD clip used, the decoder improved by almost 20%. Change-Id: Iaa4785be293a32a42e8db07141bd699f504b8c67	2013-06-12 19:23:03 -04:00
Yaowu Xu	d682243012	Merge "Quick modifications to wide loopfilter intrinsic functions"	2013-06-12 15:16:11 -07:00
Ronald S. Bultje	fa96eeb835	Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d. Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56 to 4min42. Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f	2013-06-12 17:40:01 -04:00
Scott LaVarnway	26496c52bf	Quick modifications to wide loopfilter intrinsic functions Modified to work with 8x8 blocks of memory. Will revisit later for further optimizations. For the HD clip used, the decoder improved by 20%. Change-Id: Ia0057f55d66d1445882351ea6c43b595a5a980e5	2013-06-12 16:49:08 -04:00
John Koleszar	ceee4563d6	Remove unused vp9_idct_add_{y,uv}_block These functions are not used, and appear to have been superseded. Change-Id: I86fe51b088264f6b1b8d4d232bba97b371b98120	2013-06-12 12:24:22 -07:00
John Koleszar	0e1e16db90	Enable mmx loop filter routines The mmx routines work as expected for the loop filter, so enable them. Change-Id: I2bbd9b99a4445fcba17bb95002f1fb6e01fe8f85	2013-06-12 11:28:21 -07:00
John Koleszar	44db42c114	Merge the new loopfilter experiment Change-Id: I524ba98841f2e1850e3276ac365c501cea31546d	2013-06-10 12:30:12 -07:00
Ronald S. Bultje	073c7d5eec	Fix firstpass if framesize is not a multiple of 16. Change-Id: Iec41736c2b6140715f90f40de5ae6cf52497a9b8	2013-06-08 13:32:05 -07:00
John Koleszar	736c7b804a	Merge "Reimplementation of loop filter" into experimental	2013-06-06 17:34:26 -07:00
John Koleszar	043d348aae	Reimplementation of loop filter This version of the loop filter supports non-4:2:0 subsampling and a fourth plane, as well as changing the filtering order to be more friendly to hardware implementations. The filters are applied first to all vertical edges within the 64x64 SB, followed by the top horizontal edge and any internal horizontal edges. Since filtering is applied on each 4x4 edge serially, a dependency is created from filtering one block edge to the next. It would be possible to remove this dependency by building all filtering decisions from the unfiltered reconstruction data. Change-Id: I08f3e9683eb7bded8a76651cbc50fc0dfdd05fa7	2013-06-06 08:45:45 -07:00

185 Commits