美文网首页
讲解:CS3114、java、Java、DNA sequence

讲解:CS3114、java、Java、DNA sequence

作者: borenwai | 来源:发表于2020-01-12 15:47 被阅读0次

    CS3114 (Spring 2019)PROGRAMMING ASSIGNMENT #4Due Tuesday, May 7th @ 11:00 PM for 100 pointsEarly bonus date: Sunday, May 5th @ 11:00 PM for a 10 point bonusIn this project you will implement a simple database system for DNA sequences.Your database system will include a disk-based hash table using a simple bucket hash, tosupport searches by sequence identifier. The bulk of the database will be stored in abinary file on disk, with a memory manager that stores both sequences and sequenceIDs.As with Project 2, define DNA sequences to be strings on the alphabet A, C, G, and T. Inthis project, you will store data records consisting of two parts. The first part will be theidentifier. The identifier is a relatively short string of characters from the A, C, G, Talphabet. The second part is the sequence. The sequence is a relatively long string (couldbe thousands of characters) from the A, C, G, T alphabet.Input and Output:The program will be invoked from the command-line as:java DNAdbase The name of the program is DNAdbase. Parameter command-file is the name ofthe input file that holds the commands to be processed by the program. Parameter hashfileis the name of the file that holds the hash table. Parameter hash-table-size defines thesize of the hash table. This number must be a multiple of 32. The hash table neverchanges in size once the program starts.Parameter memory-file is the name of the file used by the memory manager tostore strings. The input for this project will consist of a series of commands (some withassociated parameters, separated by spaces), with no more than one command on a line.A blank line may appear anywhere in the command file (except within the insertcommand, see below), and any number of spaces may separate parameters. You need notworry about checking for syntactic errors. That is, only the specified commands willappear in the file, and the specified parameters will always appear. However, you mustcheck for logical errors. The commands will be read from a file, and the output from thecommands will be written to standard output. The program should terminate after readingthe EOF mark.The commands will be as follows:The insert command consists of two lines (no blank lines will come betweenthese two lines). The first line will have the format:insert sequenceId lengthThe sequenceId is a string (from A, C, G, T) that is used as the sequence identifier.The length field indicates how long the sequence itself will be. This sequence appearson the second line. The sequence line will contain no spaces (that is, the sequence is notpreceeded or followed by spaces any space on that line, and no spaces appear within thesequence). The sequence consists only of the letters A, C, G, T. The sequence can (andoften will) be thousands of characters long.remove sequenceIDRemove the sequence associated with sequenceID from the hash table and fromthe memory manager, if it exists. Print a suitable message if sequenceID is not in thedatabase. If a sequence is removed, then print the complete sequence.printPrint out a list of all sequenceIDs in the database. For each one, indicate whichslot of the hash table it is stored in. Also, print out a list of the free blocks currently in thefile. For each such free block, indicate its starting byte position and its size. Such blocksshould be listed from lowest to highest in order of byte position (the same order that theyare stored on the freelist).search sequenceIDPrint out the sequence associated with sequenceID if it there is one stored in thedatabase.Implementation:Your database system will consist of three main parts: An indexing structure tosupport the searches, etc., the binary file that stores the large sequences, and a linked listused by the memory manager to track the free blocks within the disk file.For the indexing structure, you will use a hash table to store and retrieve sequencerecords. Each slot of the hash table will store two memory handles. One memory handlewill be for the sequenceID, the other memory handle will be for the sequence. In thisproject, sequenceIDs will be stored in the memory manager, so that the hash table canstore fixed-length records. You may choose not to use a hash table but the DNAtree inproject 2 with a 20-point penalty.One main implementation component for this project is support for storing thelarge sequences. You will store sequences on disk in a binary file, with the file namegiven by the . A major consideration is deciding where to store a givensequence in the binary file. This will be controlled by a memory manager implementingFirst Fit. See Section 11.04 of the textbook for a description of how this works. You willuse a linked list to keep track of the free sections of the binary file. Initially the binary filewill be empty (have no size). Whenever a new sequence is to be inserted, it should beinserted into a free section within the existing bounds of the file if possible. When this isnot possible, the size of the file should grow as necessary to accommodate the newsequence. Whenever a sequence is removed from the file, the memory manager shouldmerge together adjacent free sections into a single larger section if possible.Within the binary file you will not store the sequences as ASCII characters.Instead, you will store each letter of the sequence as two-bit code values, where A hascode 00, C has code 01, G has code 10, and T has code 11. Four letters of the sequencewill be packed into a single byte of the file. Space for sequences will always be allocatedas full bytes. If the length of a sequence is not a multiple of 4, then the last few bits of thelast byte storing the sequence will be unused.Your memory manager should operate by receiving a sequence to store, andreturning a “handle to its caller. The caller stores the handle so that it can later retrieve the sequence. Memory handles are defined to be two 4-byte integers. The first is the byteposition in the file for the associated string. The second is the length (in characters) of theassociated string. All strings will actually be converted to 2-bit codes when stored on disk,with 4 codes per byte.Be careful about allocating space for storing sequences. When reading in thesequence for the insert command, you are given the sequence length. You can use this toallocate a buffer into which the sequence can be read. Likewise, when requesting asequence from the memory manager, a buffer will be created in which to place thesequence.The hash tCS3114作业代写、代写java程序语言作业、代做Java实验作业、代写DNA sequences留学生作业 代做R语able will use a modified bucket hash compatible with disk-basedhashing. The first step will be to hash a sequenceID using a version of the sfold hashfunction that will be posted with this assignment. Each bucket will be 512 bytes, or 32table slots since each slot stores two memory handles, each of which is two 4-byteintegers in length. Collision resolution will use simple linear probing, with wrap-aroundat the bottom of the current bucket. For example, if a string hashes to slot 60 in the table,the probe sequence will be slots 61, then 62, then 63, which is the bottom slot of thatbucket. The next probe will wrap to the top of the bucket, or slot 32, then to slot 33, andso on. If the bucket is completely full, then the insert request will be rejected. Note that ifthe insert fails, the corresponding sequence and sequenceID strings must be removedfrom the memory managers memory pool as well.Source code:For this project you may not use any Java library classes for file-based memorymanagement, or hashing. You may use source code distributed with the textbook, or anycode that you wrote yourself for previous projects (so long as such code does not rely onlibrary classes for those forbidden source code).Programming Standards:You must conform to good programming/documentation standards. Somespecifics: You must include a header comment, preceding main(), specifying the compilerand operating system used and the date completed. Your header comment must describe what your program does; dont justplagiarize language from this spec. You must include a comment explaining the purpose of every variable or namedconstant you use in your program. You must use meaningful identi_er names that suggest the meaning or purpose ofthe constant, variable, function, etc. Always use named constants or enumerated types instead of literal constants inthe code. Precede every major block of your code with a comment explaining its purpose.You dont have to describe how it works unless you do something so sneaky itdeserves special recognition. You must use indentation and blank lines to make control structures morereadable. Precede each function and/or class method with a header comment describing what the function does, the logical significance of each parameter (if any), andpre- and post-conditions. Decompose your design logically, identifying which components should beobjects and what operations should be encapsulated for each.Neither the GTAs nor the instructors will help any student debug an implementationunless it is properly documented and exhibits good programming style. Be sure to beginyour internal documentation right from the start. You may only use code you have written, either specifically for this project or forearlier programs, or code taken from the textbook. Note that the textbook code is notdesigned for the specific purpose of this assignment, and is therefore likely to requiremodification. It may, however, provide a useful starting point.Testing:A sample data file will be posted to the website to help you test your program.This is not the data file that will be used in grading your program. The test data providedto you will attempt to exercise the various syntactic elements of the commandspecifications. It makes no effort to be comprehensive in terms of testing the datastructures required by the program. Thus, while the test data provided should be useful,you should also do testing on your own test data to ensure that your program workscorrectly.Deliverables:You will implement your project using Eclipse, and you will submit your projectusing the Eclipse plugin to Web-CAT. Links to the Web-CAT client are posted at theclass website. If you make multiple submissions, only your last submission will beevaluated. There is no limit to the number of submissions that you may make.You are required to submit your own test cases (should cover at least 70% of yourcode) with your program. Of course, your program must pass your own test cases. Yourgrade will be fully determined by test cases that are provided by the graders (TAs). WebCATwill report to you which test files have passed correctly, and which have not. Notethat you will not be given a copy of these test files, only a brief description of what eachaccomplished in order to guide your own testing process in case you did not pass one ofour tests.To be able to write tests that can be run by webcat, you should go to the followinglink:http://web-cat.org/junit-quickstart/and download the student.jar file. You then need to reference this file as anexternal library. The library extends the existing JUnit library, so you can write testsusing regular JUnit syntax. The webpage also includes some helpful hints to help you testbetter.When structuring the source files of your project, use a directory structure; that is,your source files will all be contained in the project “src directory. Any subdirectories inthe project will be ignored. You are permitted (and encouraged) to work with a partner on this project. Whenyou work with a partner, then only one member of the pair will make a submission. Bothnames and emails must be included in the documentation and selected on any Web-CATsubmission. The last submission from either of the pair members will be graded.Pledge:Your project submission must include a statement, pledging your conformance tothe Honor Code requirements for this course. Specifically, you must include thefollowing pledge statement near the beginning of the file containing the function main()in your program. The text of the pledge will also be posted online.// On my honor://// - I have not used source code obtained from another student,// or any other unauthorized source, either modified or// unmodified.//// - All source code and documentation used in my program is// either my original work, or was derived by me from the// source code published in the textbook for this course.//// - I have not discussed coding details about this project with// anyone other than my partner (in the case of a joint// submission), instructor, ACM/UPE tutors or the TAs assigned// to this course. I understand that I may discuss the concepts// of this program with other students, and that another student// may help me debug my program so long as neither of us writes// anything during the discussion or modifies any computer file// during the discussion. I have violated neither the spirit nor// letter of this restriction.Programs that do not contain this pledge will not be graded.转自:http://www.7daixie.com/2019050430752617.html

    相关文章

      网友评论

          本文标题:讲解:CS3114、java、Java、DNA sequence

          本文链接:https://www.haomeiwen.com/subject/dmziactx.html