ORC format improves performance when Hive is processing data. An ORC file contains groups of row data, called stripes, along with a file footer. So far, the ZLIB and Snappy compression techniques are supported; you choose one by setting tblproperties in the table creation statement, for example `"orc.compress"="SNAPPY"`. With Snappy compression, you can save a lot of disk space as well as gain performance in Hive. Partitioning combined with bucketing allows Hive to use local joins and improves the performance of a number of queries.

I have practically achieved the result and have seen the effective performance of a Hive ORC table. The steps:

Step 1: Create a temp table over the pipe-delimited source data:

```sql
CREATE EXTERNAL TABLE agent_information ( ... )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/DEV/IVANS/TODAYS_DATE'
```

Step 2: Create an ORC formatted table in Hive:

```sql
CREATE EXTERNAL TABLE agent_information_ORC ( ... )
STORED AS ORC tblproperties ("orc.compress"="ZLIB")
```

Step 3: Load data into the ORC table from the temp table:

```sql
INSERT OVERWRITE TABLE agent_information_ORC SELECT * FROM agent_information
```

Step 4: Query the ORC table.

Before doing Step 4, I validated the disk space used by both tables. Below is a comparison of the disk space usage of a Hive DB with regular versus ORC storage; we used an ORC file with Snappy compression. The agent_information table had 320 GB of data, whereas the agent_information_ORC table had 79.5 GB.

Also, while querying the ORC table, aggregations like count, max, min, and sum do not require running MR jobs, as the ORC table itself stores these aggregations at the column level.

From Spark, use DataFrameReader's orc() method to read an ORC file into a DataFrame. I am still investigating the best way to handle VARCHAR/CHAR types through a Spark DataFrame: I noticed some columns are defined as VARCHAR(35), and I think those columns may be the issue. After I changed VARCHAR to String and CHAR to String, it worked fine. Please let me know if you need more information.
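As a concrete illustration of the column-level aggregation point, a query like the following can be satisfied largely from ORC's stored statistics rather than a full table scan. This is a sketch: `agent_key` is a hypothetical column name, since the schema of agent_information is not shown here.

```sql
-- COUNT/MIN/MAX can be answered from ORC file and stripe statistics;
-- agent_key is a hypothetical column used only for illustration.
SELECT COUNT(*), MIN(agent_key), MAX(agent_key)
FROM agent_information_ORC;
```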
Here are the full details of the problem. If I store the ORC file with Snappy compression and use Hive to create a table with script 1, it works fine. If I run the first script through Spark SQL and store the file as ORC with Snappy compression, it also works. But when I take an existing table, alter the table to add a new column using the Spark Hive context, and save it as ORC with Snappy compression, I get the following error:

ORC does not support type conversion from STRING to VARCHAR

If I use the same ORC data but have Hive create the table with the second query, I still get the same error.

1. The following is the `show create table` result for the table created with Spark SQL:

```sql
CREATE TABLE `testtabletmp1`(
  `person_key` bigint,
  `pat_last` string,
  `pat_first` string,
  `pat_dob` timestamp,
  `pat_zip` string,
  `pat_gender` string,
  `pat_chksum1` bigint,
  `pat_chksum2` bigint,
  `dimcreatedgmt` timestamp,
  `pat_mi` string,
  `h_keychksum` string,
  `patmd5` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledimtmp1'
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'transient_lastDdlTime'='1469207216')
```

2. The original table, created when we sqooped the data from SQL Server using a Sqoop import:

```sql
CREATE TABLE `testtabledim`(
  `person_key` bigint,
  `pat_last` varchar(35),
  `pat_first` varchar(35),
  `pat_dob` timestamp,
  `pat_zip` char(5),
  `pat_gender` char(1),
  `pat_chksum1` bigint,
  `pat_chksum2` bigint,
  `dimcreatedgmt` timestamp,
  `pat_mi` char(1),
  `h_keychksum` string,
  `patmd5` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledim'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'last_modified_by'='hdfs',
  'last_modified_time'='1469026541',
  'numFiles'='1',
  'numRows'='-1',
  'orc.compress'='SNAPPY',
  'rawDataSize'='-1',
  'totalSize'='11144909',
  'transient_lastDdlTime'='1469026541')
```
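The VARCHAR-to-STRING workaround described above can be sketched as a repaired DDL. This is an assumed variant, not the exact statement used: the table name `testtabledim_str` is hypothetical, and it keeps the same columns as testtabledim but declares the VARCHAR/CHAR columns as STRING so Spark's ORC writer does not hit the type-conversion error.

```sql
-- Hypothetical repaired table: same layout as testtabledim, with
-- varchar(35)/char(n) columns declared as string to avoid
-- "ORC does not support type conversion from STRING to VARCHAR".
CREATE TABLE `testtabledim_str`(
  `person_key`    bigint,
  `pat_last`      string,    -- was varchar(35)
  `pat_first`     string,    -- was varchar(35)
  `pat_dob`       timestamp,
  `pat_zip`       string,    -- was char(5)
  `pat_gender`    string,    -- was char(1)
  `pat_chksum1`   bigint,
  `pat_chksum2`   bigint,
  `dimcreatedgmt` timestamp,
  `pat_mi`        string,    -- was char(1)
  `h_keychksum`   string,
  `patmd5`        string)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
```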